
Is it Safe to Rely on Large Language Models for Evaluation? Introducing SCALEEVAL: An Agent-Debate-Assisted Meta-Evaluation Framework That Leverages the Capabilities of Multiple Communicative LLM Agents

Large language models (LLMs) have proven useful across a wide range of tasks and scenarios. Evaluating them, however, remains difficult, largely because suitable benchmarks are scarce and substantial human annotation is required. Researchers therefore need scalable methods to accurately assess LLM capabilities across diverse settings.

Many existing techniques rely on automated metrics, often using LLMs themselves as evaluators. Yet even with careful meta-evaluation and rigorous testing, certain tasks demand closer scrutiny, raising doubts about how reliable LLMs are as evaluators. Researchers from Shanghai Jiao Tong University and Carnegie Mellon University have introduced SCALEEVAL, a meta-evaluation framework that addresses these challenges. SCALEEVAL employs multiple communicative LLM agents in an agent-debate process that assists human annotators in identifying which LLMs are proficient evaluators, substantially reducing the amount of annotation traditionally required for meta-evaluation.

The framework uses multi-agent debate to make the meta-evaluation of LLMs both reliable and scalable. LLM agents discuss and assess candidate responses against user-defined criteria, minimizing dependence on extensive human annotation. The experimental design centers on pairwise response comparisons, with evaluators such as gpt-3.5-turbo. To validate the method, a human expert vetting step is run alongside the agent-debate-assisted protocol, confirming its reliability and striking a balance between efficiency and human judgment in producing accurate assessments.
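To make the debate mechanics concrete, here is a minimal sketch of one agent-debate round for a pairwise comparison. The `call_llm` helper, prompt wording, and round count are illustrative assumptions, not the paper's exact protocol; the only ideas taken from the description above are that multiple LLM agents argue over which response better meets user-defined criteria, and that their final verdicts are what a human would otherwise have had to annotate.

```python
# Minimal sketch of an agent-debate round for pairwise meta-evaluation.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the prompt text and two-round structure are illustrative, not SCALEEVAL's exact setup.

from typing import Callable, List


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion call (wire this to your provider)."""
    raise NotImplementedError


def agent_debate_verdicts(
    question: str,
    response_a: str,
    response_b: str,
    criteria: str,
    agents: List[str],
    rounds: int = 2,
    llm: Callable[[str, str], str] = call_llm,
) -> List[str]:
    """Each agent states which response better satisfies the criteria,
    then revises its verdict after reading the other agents' arguments."""
    transcripts = ["" for _ in agents]
    for _ in range(rounds):
        new_turns = []
        for i, model in enumerate(agents):
            others = "\n\n".join(t for j, t in enumerate(transcripts) if j != i and t)
            prompt = (
                f"Criteria: {criteria}\n\n"
                f"Question: {question}\n\n"
                f"Response A: {response_a}\n\n"
                f"Response B: {response_b}\n\n"
                f"Other agents' arguments so far:\n{others or '(none yet)'}\n\n"
                "State which response better satisfies the criteria (A, B, or tie) "
                "and briefly justify your choice."
            )
            new_turns.append(llm(model, prompt))
        transcripts = new_turns
    # Final-round verdicts; unresolved disagreements would be escalated to a human annotator.
    return transcripts
```

In this sketch, human effort is only needed when the agents fail to converge, which is the intuition behind how agent debate reduces the annotation burden of meta-evaluation.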

The performance of LLMs as evaluators degrades when specific letters in the criteria prompts are masked or when guiding phrases are removed. Models such as gpt-4-turbo and gpt-3.5-turbo maintain consistent agreement rates across these criteria formats, whereas Claude-2 shows confusion and reluctance to answer, revealing its inconsistency under adversarial conditions. These observations highlight room for improvement in how LLM evaluators are designed and applied, despite their advanced capabilities.
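For intuition, the following sketch shows the kind of perturbations described above: masking a fraction of letters inside a criteria prompt and dropping a guiding phrase. The masking character, masking fraction, and the specific guiding sentence are assumptions for illustration only.

```python
# Illustrative perturbations of an evaluation criteria prompt:
# (1) mask a fraction of alphabetic characters, (2) drop a guiding phrase.
# Parameters and the example phrase are assumptions, not the paper's exact settings.

import random


def mask_letters(criteria: str, fraction: float = 0.3, seed: int = 0) -> str:
    """Replace a random fraction of alphabetic characters with underscores."""
    rng = random.Random(seed)
    chars = list(criteria)
    letter_idx = [i for i, c in enumerate(chars) if c.isalpha()]
    for i in rng.sample(letter_idx, int(len(letter_idx) * fraction)):
        chars[i] = "_"
    return "".join(chars)


def drop_guiding_phrase(
    prompt: str,
    phrase: str = "Please judge which response better satisfies the criteria.",
) -> str:
    """Remove the instruction sentence that tells the evaluator what to do."""
    return prompt.replace(phrase, "").strip()


if __name__ == "__main__":
    criteria = "Helpfulness: the response should directly and accurately address the user's request."
    print(mask_letters(criteria))
```

A robust evaluator should reach the same verdicts on the perturbed prompts as on the originals; a drop in agreement under these conditions is what exposes an unreliable evaluator.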

In summary, SCALEEVAL offers a viable answer to the limitations of conventional meta-evaluation methods: a scalable framework that assesses LLMs as evaluators through agent debate. The study confirms SCALEEVAL's reliability while also exposing the strengths and limitations of LLMs as evaluators across diverse scenarios, contributing to scalable evaluation solutions that are vital for expanding LLM applications.
