
Introducing BigCodeBench by BigCode: A New Benchmark for Assessing Large Language Models on Practical Coding Tasks.

BigCode, the open scientific collaboration behind several large language models (LLMs) for code, has launched BigCodeBench, a new benchmark for comprehensively assessing the programming capabilities of LLMs. The benchmark addresses the limitations of existing benchmarks such as HumanEval, which has been criticized for its simplicity and limited real-world relevance. BigCodeBench comprises 1,140 function-level tasks that require LLMs to follow user-oriented instructions and compose multiple function calls drawn from 139 libraries. The tasks are designed to emulate real-world scenarios and demand compositional reasoning and problem-solving skills.

BigCodeBench consists of two primary components: BigCodeBench-Complete and BigCodeBench-Instruct. BigCodeBench-Complete targets code completion, where the model must implement a function body from a structured docstring, testing its ability to generate correct, working code from partial information. BigCodeBench-Instruct, in contrast, evaluates instruction-tuned LLMs on shorter, more conversational natural-language instructions, reflecting how users interact with these models in practical settings.
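To make the task format concrete, here is a minimal, hypothetical example in the spirit of a BigCodeBench-Complete prompt. It is not taken from the benchmark itself; the function name, docstring, and chosen libraries are illustrative. The model would see only the signature and docstring and would be expected to produce a body that composes calls from several libraries.

```python
# Illustrative only: a made-up task in the style of BigCodeBench-Complete.
# The model receives the signature and docstring and must fill in the body,
# combining calls from multiple standard libraries (here: os, json, collections).
import json
import os
from collections import Counter


def task_func(directory: str) -> dict:
    """
    Scan `directory` for .json files, load each one, and return a dict that
    maps every top-level key found across the files to the number of files
    in which that key appears.
    """
    key_counts = Counter()
    for name in os.listdir(directory):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(directory, name), "r", encoding="utf-8") as fh:
            data = json.load(fh)
        if isinstance(data, dict):
            key_counts.update(data.keys())
    return dict(key_counts)
```

In the Instruct variant, the same underlying task would instead be phrased as a short natural-language request, and the generated solution would be checked against the same hidden test cases.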

To streamline evaluation, BigCode provides an evaluation framework that is available on PyPI, with setup instructions and pre-configured Docker images for code generation and execution. Model performance on BigCodeBench is reported with the calibrated Pass@1 metric, which measures the percentage of tasks a model solves correctly on its first attempt, and is complemented by an Elo rating system, similar to the one used in chess, that ranks models based on task-level comparisons.
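For intuition, the sketch below shows the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), of which Pass@1 is the k = 1 special case. BigCodeBench's "calibrated" Pass@1 is computed by its own harness and may apply additional adjustments; this snippet only illustrates the underlying idea, and the sample outcomes are hypothetical.

```python
# Sketch of the unbiased pass@k estimator; not the official BigCodeBench harness.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n generations
    passes, given that c of the n generations are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With greedy decoding (one sample per task), Pass@1 reduces to the fraction
# of tasks whose single generated solution passes all test cases.
results = [True, False, True, True]      # hypothetical per-task outcomes
pass_at_1 = sum(results) / len(results)  # 0.75
```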

In the interest of community engagement, BigCode encourages the AI community to critique and contribute to the development of BigCodeBench. All of its components, including the tasks, test cases, and evaluation framework, are open source and available on GitHub and Hugging Face. Going forward, BigCode plans to continually improve the benchmark by adding multilingual support, increasing the complexity of the test cases, and keeping pace with advances in programming libraries and tools.

The launch of BigCodeBench marks a significant step forward in evaluating LLMs on programming tasks. With a comprehensive and challenging benchmark, BigCode aims to push the limits of these models and, in doing so, advance AI for software development. The project actively welcomes feedback and participation from the AI community, with all related components open-sourced across several platforms, and future enhancements will focus on multilingual support and more complex test cases.
