The field of artificial intelligence has been profoundly shaped by the emergence of large language models (LLMs), whose potential is being explored across many domains. However, enabling these models to apply computer science knowledge effectively, and to do so in ways that benefit people, remains a challenge. Although LLMs have been studied across various disciplines, including computer science, there is still no comprehensive evaluation focused specifically on their performance in the field of computer science.
Most studies of LLMs in computer science fall into two main categories. The first relies on broad evaluation benchmarks in which computer science accounts for only a small fraction of what is measured. The second focuses on specific LLM applications within computer science. Neither approach thoroughly evaluates the foundational knowledge and reasoning abilities of LLMs in the field.
In response, researchers from the Beijing University of Posts and Telecommunications have proposed CS-Bench, the first benchmark dedicated to evaluating the performance of LLMs in computer science. It offers high-quality test items, multiple task forms that probe different capabilities, and bilingual evaluation.
The benchmark comprises approximately 5,000 carefully curated test items divided into 26 subfields that span four key areas of computer science. CS-Bench includes both knowledge-type and reasoning-type questions and supports bilingual evaluation in English and Chinese.
The four domains covered by CS-Bench are Data Structure and Algorithm (DSA), Computer Organization (CO), Computer Network (CN), and Operating System (OS). Its 26 fine-grained subfields and diverse task forms enrich the assessment dimensions and better reflect real-world scenarios.
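To make this structure concrete, the minimal sketch below shows how per-domain, per-question-type accuracy might be tallied on a CS-Bench-style item set. The field names (`domain`, `type`, `answer`) and the `model_answer` helper are illustrative assumptions for this example, not the benchmark's actual schema or evaluation code.

```python
from collections import defaultdict

# Hypothetical item layout; the real CS-Bench schema may differ.
items = [
    {"domain": "DSA", "type": "knowledge", "question": "...", "answer": "B"},
    {"domain": "OS",  "type": "reasoning", "question": "...", "answer": "D"},
    # ... roughly 5,000 items across DSA, CO, CN, and OS
]

def model_answer(question: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    return "B"

# Tally accuracy separately for each (domain, question type) pair.
correct = defaultdict(int)
total = defaultdict(int)
for item in items:
    key = (item["domain"], item["type"])
    total[key] += 1
    if model_answer(item["question"]) == item["answer"]:
        correct[key] += 1

# Report accuracy per domain and question type.
for key in sorted(total):
    print(f"{key[0]:>3} / {key[1]:<9}: {correct[key] / total[key]:.2%}")
```

Grouping scores this way is what allows the comparisons reported below, such as knowledge versus reasoning performance and per-domain strengths.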
The evaluation results vary widely, with models scoring between 39.86% and 72.29%. GPT-4 and GPT-4o set the highest standard on CS-Bench, as they are the only models that exceed 70%. Meanwhile, open-source models such as Qwen1.5-110B and Llama3-70B have surpassed some strongly performing closed-source models, and newer models show marked improvement over their predecessors.
All models score lower on reasoning than on knowledge, indicating that reasoning poses the greater challenge. LLMs generally perform best in Data Structure and Algorithm and worst in Operating System. Stronger models demonstrate a better ability to apply knowledge to reasoning and are more robust across different task formats.
The introduction of CS-Bench aims to provide crucial insights into LLMs' performance in computer science. The benchmark also highlights the close relationship between LLMs' mathematics and coding abilities and their computer science proficiency. These findings offer directions for improving LLMs' capabilities in the field and open up new possibilities for future advances at the intersection of AI and computer science.