Growing class sizes in computing education are making automation increasingly important for supporting student success. Automated feedback generation tools are popular for their ability to analyze and test student code rapidly. Among these, large language models (LLMs) such as GPT-3 show promise, though concerns remain about their accuracy, reliability, and ethical implications.
Historically, LLMs in computing education have focused primarily on identifying mistakes rather than providing constructive feedback. While some studies show that LLMs can identify issues in student code, their output has been found inconsistent and inaccurate, and current models struggle to provide feedback on par with human instructors on programming exercises. Against this backdrop, the idea of using one LLM to judge the output of another, known as LLMs-as-judges, has gained popularity and shown promising results.
A recent study by researchers from Aalto University, the University of Jyväskylä, and The University of Auckland evaluates how effectively LLMs, including open-source models, provide feedback on student-written programs. The study first establishes a baseline by comparing GPT-4's feedback with human expert ratings, then examines how open-source LLMs fare against proprietary models like GPT-4.
The study draws on data from an introductory programming course at Aalto University, consisting of student help requests and feedback generated by GPT-3.5. The feedback was assessed for qualities including completeness and perceptivity, both qualitatively by human annotators and automatically by GPT-4, which graded each feedback message against a rubric.
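The paper's exact rubric is not reproduced here, but an LLM-as-judge setup of this kind can be sketched as follows: a grading prompt asks the judge model (e.g., GPT-4) to answer yes or no for each rubric criterion, and the reply is parsed into structured labels. The criterion names, prompt wording, and function names below are illustrative assumptions, not the study's actual materials.

```python
# Sketch of an LLM-as-judge rubric for grading programming feedback.
# Criterion names and prompt wording are illustrative assumptions,
# not the study's actual rubric.

RUBRIC = {
    "complete": "Does the feedback mention every actual issue in the code?",
    "perceptive": "Does the feedback correctly identify at least one actual issue?",
    "selective": "Does the feedback avoid mentioning non-existent issues?",
}

def build_judge_prompt(student_code: str, feedback: str) -> str:
    """Assemble a grading prompt to send to a judge model."""
    criteria = "\n".join(f"- {name}: {q}" for name, q in RUBRIC.items())
    return (
        "You are grading automatically generated programming feedback.\n\n"
        f"Student code:\n{student_code}\n\n"
        f"Feedback to grade:\n{feedback}\n\n"
        "Answer yes or no for each criterion:\n"
        f"{criteria}\n"
    )

def parse_judgement(reply: str) -> dict:
    """Parse a 'criterion: yes/no' style reply into booleans."""
    result = {}
    for line in reply.splitlines():
        if ":" in line:
            name, verdict = line.split(":", 1)
            name = name.strip().lstrip("- ").lower()
            if name in RUBRIC:
                result[name] = verdict.strip().lower().startswith("yes")
    return result
```

The prompt-building and reply-parsing steps are deterministic, which makes the judging pipeline easy to audit even when the judge model itself is not.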
The study found that while most of the feedback was perceptive, only a little over half was complete, and much of it contained misleading content. Moreover, GPT-4 graded feedback more positively than human annotators, suggesting a possible positive bias. In terms of classification performance, GPT-4 did well on completeness, worse on selectivity, and its high perceptivity score appears inflated by skew in the data, since most feedback was perceptive to begin with.
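The point about data skew is easy to demonstrate: when one label dominates, plain accuracy rewards a judge that simply predicts the majority class, while a chance-corrected metric such as Cohen's kappa exposes the lack of real skill. The 90/10 label split below is an illustrative assumption, not the study's actual distribution.

```python
# Why accuracy is misleading on skewed labels: a judge that always
# predicts the majority class looks accurate but shows zero agreement
# beyond chance (Cohen's kappa = 0). The 90/10 split is illustrative.

def accuracy(truth, pred):
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def cohens_kappa(truth, pred):
    """Agreement corrected for chance, for binary 0/1 labels."""
    n = len(truth)
    observed = accuracy(truth, pred)
    # Chance agreement from the marginal label frequencies.
    p_truth = sum(truth) / n
    p_pred = sum(pred) / n
    expected = p_truth * p_pred + (1 - p_truth) * (1 - p_pred)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1 - expected)

# 90% of feedback messages are labeled perceptive (1) -- illustrative skew.
truth = [1] * 90 + [0] * 10
always_yes = [1] * 100  # a judge that answers "perceptive" every time

print(accuracy(truth, always_yes))      # 0.9 -- looks strong
print(cohens_kappa(truth, always_yes))  # 0.0 -- no skill beyond chance
```

This is why a high score on a heavily skewed dimension, like perceptivity here, should be read with caution.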
In summary, the study suggests that open-source LLMs have potential for generating programming feedback, and that GPT-4 shows promise as a tool for evaluating automatically generated feedback. LLM-generated feedback could therefore become a cost-effective and accessible resource for educators. That said, LLMs have limitations and may still require human oversight, especially in more complex cases.