This AI study by Cohere explores assessing language models with a Panel of LLM evaluators, known as PoLL.

In the field of artificial intelligence, evaluating Large Language Models (LLMs) poses significant challenges, particularly around the adequacy of evaluation data and the quality of a model’s free-text output. One common solution is to use a single large LLM, such as GPT-4, to judge the outputs of other LLMs. However, this methodology has drawbacks, including high cost, intra-model bias (a tendency to favor outputs from the judge’s own model family), and overreliance on a single large model.

An alternative strategy that addresses these issues is a Panel of LLM evaluators (PoLL). Instead of one large LLM, the PoLL method employs several smaller LLMs to appraise output quality. Because the panel draws on smaller models from multiple model families, it reduces the intra-model bias that arises when relying solely on one large judge.
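A panel of this kind typically pools its judges’ verdicts, for example by majority vote over binary correctness labels or by averaging numeric scores. A minimal sketch of that aggregation step, where the judge outputs are illustrative placeholders rather than the study’s actual models or prompts:

```python
from collections import Counter
from statistics import mean

def poll_max_vote(verdicts):
    """Majority (max) vote over categorical judge verdicts."""
    return Counter(verdicts).most_common(1)[0][0]

def poll_average(scores):
    """Average pooling over numeric judge scores."""
    return mean(scores)

# Hypothetical verdicts from three judges drawn from different model families,
# all rating the same candidate answer.
verdicts = ["correct", "correct", "incorrect"]
scores = [7, 8, 6]

print(poll_max_vote(verdicts))  # correct
print(poll_average(scores))     # 7
```

In practice each verdict would come from prompting one judge model; the aggregation itself is this simple, which is part of the approach’s appeal.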

The PoLL framework is also more than seven times cheaper than using a single large model as the evaluator. Across six different datasets and three distinct judge settings, the researchers showed that the PoLL approach outperforms a single large judge.

The test scenarios included single-hop and multi-hop question answering as well as the Chatbot Arena. The research showed that the PoLL approach not only correlates more closely with human judgments but also surfaces instances where GPT-4, acting as a sole judge, deviates significantly in its ratings.
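Agreement with human raters in evaluations like this is often quantified with a chance-corrected statistic such as Cohen’s kappa. A minimal sketch of that comparison, where the metric choice and the labels are illustrative assumptions rather than figures from the study:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters' labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently at random,
    # each according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. panel verdicts on four answers.
human = ["correct", "correct", "incorrect", "correct"]
panel = ["correct", "correct", "incorrect", "incorrect"]
print(cohens_kappa(human, panel))  # 0.5
```

A higher kappa for the panel than for a single large judge, against the same human labels, is what “correlates more closely with human judgments” means operationally.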

The key takeaways of this research are the proposal of the PoLL framework as a new evaluation method for LLMs, the demonstration of its cost-effectiveness, and the reduction of intra-model scoring bias through a diverse panel of evaluators.

This shift in evaluating large language models offers not only a new assessment approach but also a way to mitigate cost and bias within the field of AI. The PoLL framework underscores the emerging potential of cooperative evaluation by a heterogeneous group of smaller models, paving the way for more accurate and economical appraisals in the future.
