
CodeEditorBench: A Benchmark for Assessing the Performance of Large Language Models (LLMs) in Code Editing Tasks.

A group of researchers has created a novel assessment system, CodeEditorBench, designed to evaluate the effectiveness of Large Language Models (LLMs) in code editing tasks such as debugging, translating, polishing, and requirement switching. LLMs, which have advanced rapidly alongside the growth of coding-related work, are increasingly adopted as programming tools for activities such as code improvement and repair. Yet most existing evaluation methods focus on code generation, overlooking the crucial role of code editing in software development.
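To make the editing tasks concrete, here is a hypothetical example of the kind of debugging problem such a benchmark might pose; the function and the bug below are illustrative only and are not drawn from CodeEditorBench itself:

```python
# Hypothetical debugging-style task: the model receives buggy code
# and must return an edited version that passes the hidden tests.

# Buggy input given to the model: an off-by-one error skips the last element.
def sum_list_buggy(values):
    total = 0
    for i in range(len(values) - 1):  # bug: loop stops one element early
        total += values[i]
    return total

# Expected edited output: iterate over the full list.
def sum_list_fixed(values):
    total = 0
    for value in values:
        total += value
    return total

assert sum_list_buggy([1, 2, 3]) == 3  # wrong result caused by the bug
assert sum_list_fixed([1, 2, 3]) == 6  # corrected behavior
```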

Diverging from benchmarks focused on code generation, CodeEditorBench centers on real-world applications and the practical side of software development. The team drew coding scenarios and problems from five distinct sources, covering a wide array of programming languages, difficulty levels, and editing tasks. This approach lets the evaluation reflect the diversity and intricacy of challenges found in real-world coding environments.

The team’s evaluation of 19 different LLMs using the CodeEditorBench framework revealed that closed-source models, notably Gemini-Ultra and GPT-4, performed better than their open-source counterparts. These findings underscore how much a model’s architecture and training data matter for performance, particularly given how results shift with prompt sensitivity and problem category.

CodeEditorBench aims to provide a consistent method for evaluating LLMs and includes tools for further analysis, training, and visualization within the framework. The team also intends to make all evaluation-related data publicly available to encourage more in-depth investigation of LLM capabilities, and plans are already underway to introduce additional evaluation measures for a more comprehensive assessment in the future.
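As a rough illustration of how execution-based scoring of code edits typically works, here is a minimal Python sketch. The `score_edit` function, the `edit_code` callable, and the test-case format are all hypothetical; this is not the CodeEditorBench API, only the general pass/fail idea:

```python
# Minimal sketch of pass/fail scoring over test cases, assuming an
# `edit_code` callable that wraps the LLM under evaluation.
from typing import Callable, List, Tuple


def score_edit(edit_code: Callable[[str], str],
               buggy_source: str,
               tests: List[Tuple[tuple, object]],
               entry_point: str) -> float:
    """Ask the model to edit the code, execute the result, and return the
    fraction of test cases that pass."""
    edited_source = edit_code(buggy_source)
    namespace: dict = {}
    try:
        exec(edited_source, namespace)       # run the edited code
        candidate = namespace[entry_point]   # look up the target function
    except Exception:
        return 0.0                           # an unrunnable edit scores zero
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass                             # a runtime error counts as a failure
    return passed / len(tests)


if __name__ == "__main__":
    # Toy usage with a stand-in "model" that already knows the fix.
    buggy = "def add(a, b):\n    return a - b\n"
    fake_model = lambda src: src.replace("a - b", "a + b")
    print(score_edit(fake_model, buggy, [((1, 2), 3), ((0, 5), 5)], "add"))  # 1.0
```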

In mapping the current state of LLMs, the study found that the most capable openly available base model is OpenCI-DS-33B, followed by OpenCI-DS-6.7B and DS-33B-INST. Closed models such as Gemini, GPT, and GLM typically outperform the open ones, although instruction-tuned open models with more than 30 billion parameters, such as OpenCI-DS-33B and DS-33B-INST, narrow this performance gap.

CodeEditorBench also highlights the limitations of LLMs, particularly in rewriting and revising code. For instance, while GPT-4 performs well in three of the four categories, it falls short in code polishing, and Gemini Ultra struggles with the code requirement switching task. Surfacing these shortcomings reinforces the need for targeted improvements in these areas during LLM training and development.
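For contrast with debugging, code polishing asks a model to improve code that is already correct, typically for efficiency or readability, without changing its behavior. The sketch below is a hypothetical illustration of that kind of task, not an item taken from CodeEditorBench:

```python
# Hypothetical polishing-style task: both functions are correct; the model is
# asked to produce the more efficient version.

# Before: O(n*m) duplicate detection using repeated list membership checks.
def has_common_element_slow(a, b):
    for x in a:
        if x in b:          # list membership is O(m) per lookup
            return True
    return False

# After: O(n + m) using a set for constant-time lookups.
def has_common_element_fast(a, b):
    b_set = set(b)
    return any(x in b_set for x in a)

assert has_common_element_slow([1, 2, 3], [3, 4]) is True
assert has_common_element_fast([1, 2, 3], [3, 4]) is True
```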

In conclusion, CodeEditorBench was created to stimulate advancements in LLMs by providing a robust platform for thorough analysis of code editing aptitude. The valuable insights from this comprehensive assessment framework will guide future developments in large language models.
