Artificial Intelligence (AI) continues to evolve rapidly, with large language models (LLMs) demonstrating vast potential across diverse fields. However, realizing that potential in computer science has been a challenge because comprehensive assessment tools are lacking. Existing studies within computer science tend either to evaluate models broadly or to explore specific applications, neglecting a more holistic investigation of LLMs’ foundational knowledge and reasoning abilities in this field.
To address this gap, researchers at the Beijing University of Posts and Telecommunications created CS-Bench, the first benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench consists of about 5,000 carefully curated test items spanning 26 subfields within four main computer science domains. It uses several task formats, including multiple-choice, fill-in-the-blank, and open-ended questions, to simulate real-world scenarios and gauge how well LLMs adapt to diverse tasks. CS-Bench supports bilingual evaluation in Chinese and English and includes both knowledge-type and reasoning-type questions.
The four core domains covered in CS-Bench are Data Structure and Algorithm, Computer Organization, Computer Network, and Operating System, which are further divided into 26 fine-grained subfields. The data comes from a range of sources, and a team of computer science graduates parsed, labeled, and quality-checked the questions and answers. In total, the benchmark contains 4,838 samples.
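To make the structure described above concrete, here is a minimal Python sketch of how one benchmark item might be represented and tallied. It is not CS-Bench's actual data format: the class name `BenchItem`, every field name, and the helper `count_by` are hypothetical, chosen only to mirror the attributes the article mentions (domain, subfield, task format, language, and knowledge or reasoning type).

```python
from collections import Counter
from dataclasses import dataclass

# The four domains named in the article.
DOMAINS = (
    "Data Structure and Algorithm",
    "Computer Organization",
    "Computer Network",
    "Operating System",
)


@dataclass(frozen=True)
class BenchItem:
    # All field names are hypothetical, for illustration only.
    domain: str       # one of DOMAINS
    subfield: str     # one of the 26 fine-grained subfields
    task_format: str  # e.g. "multiple-choice", "fill-in-the-blank", "open-ended"
    language: str     # "en" or "zh"
    tag: str          # "knowledge" or "reasoning"
    question: str
    answer: str

    def __post_init__(self):
        # Reject items whose domain is not one of the four listed above.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


def count_by(items, field):
    """Count items per distinct value of the given attribute."""
    return Counter(getattr(item, field) for item in items)
```

With a list of such items, `count_by(items, "domain")` or `count_by(items, "task_format")` gives a quick view of how samples are distributed across domains and formats.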
The evaluation shows overall scores ranging from 39.86% to 72.29% across the models tested. The best-performing models are GPT-4 and GPT-4o, both of which exceed 70%. LLMs typically perform better in Data Structure and Algorithm and worse in Operating System. The results highlight that reasoning questions leave more room for improvement than knowledge questions and that computer science ability is closely linked to mathematics and coding abilities.
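Breakdowns like these (by domain, or by knowledge versus reasoning) come down to grouping graded answers and computing a percentage per group. The sketch below illustrates that idea under a simplifying assumption: answers are graded by case-insensitive exact match, which suits multiple-choice items but not open-ended ones. The article does not describe CS-Bench's actual scoring procedure, and the function name `accuracy_by` and the record keys are hypothetical.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return percentage accuracy per group.

    records: iterable of dicts holding the grouping key plus
             "prediction" and "answer" strings (hypothetical schema).
    key:     the field to group by, e.g. "domain" or "tag".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Assumption: case-insensitive exact match after trimming whitespace.
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[group] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}
```

Calling `accuracy_by(records, "tag")` on a model's graded outputs would separate knowledge accuracy from reasoning accuracy, while `accuracy_by(records, "domain")` would give the per-domain view.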
This research opens the door to improving the effectiveness and applicability of large language models by providing detailed insight into their performance. Even the highest-scoring models, such as GPT-4o, have substantial room for improvement. CS-Bench is significant because it paves the way for future advances in AI and computer science while spotlighting LLMs’ cross-domain abilities and potential uses across diverse fields.