Natural Language Processing (NLP) centers on computers understanding and interacting with human language through language models (LMs). These models generate responses across a wide range of tasks, which makes assessing the quality of those responses challenging. Proprietary models such as GPT-4 are frequently used as judges, but they lack transparency, controllability, and affordability, prompting the need for reliable open-source alternatives.
Existing open-source evaluator models fall short on key functionalities such as direct assessment (scoring a single response against a rubric) and pairwise ranking (choosing the better of two responses), and their scores often diverge from human judgements. They also tend to evaluate only general attributes such as helpfulness and harmlessness, which makes their assessments inconsistent and unreliable. These limitations highlight the need for evaluators that replicate human judgements more faithfully.
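To make the two evaluation formats concrete, the sketch below shows hypothetical prompt templates. The wording, section headers, and 1-5 scale are illustrative assumptions, not the exact prompts used by Prometheus 2.

```python
# Hypothetical illustration of the two evaluation formats an evaluator LM supports.
# The exact Prometheus 2 prompt wording is defined in the paper and its repository.

def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Direct assessment: score a single response against a rubric (e.g. 1-5)."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Response to evaluate:\n{response}\n\n"
        f"### Scoring rubric:\n{rubric}\n\n"
        "Write feedback, then output an integer score from 1 to 5."
    )

def pairwise_ranking_prompt(instruction: str, response_a: str, response_b: str, rubric: str) -> str:
    """Pairwise ranking: decide which of two responses better satisfies the rubric."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Response A:\n{response_a}\n\n"
        f"### Response B:\n{response_b}\n\n"
        f"### Criterion:\n{rubric}\n\n"
        "Write feedback, then output the better response: A or B."
    )
```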
In response to these issues, a research team from KAIST AI, LG AI Research, Carnegie Mellon University, MIT, and others developed an open-source evaluator called Prometheus 2. This model aims to assess language models transparently, reliably, and cost-effectively.
Prometheus 2 combines two evaluator LMs: one trained for direct assessment and another for pairwise ranking. Merging their weights yields a single evaluator that performs well in both formats. Training draws on the Preference Collection, a dataset comprising 1,000 evaluation criteria.
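As a rough sketch of what linear weight merging looks like in practice, the snippet below averages the parameters of two fine-tuned checkpoints. The checkpoint names and the 50/50 merge ratio are placeholder assumptions for illustration, not the exact recipe used to build Prometheus 2.

```python
# Minimal sketch of linear weight merging with Hugging Face Transformers.
# Checkpoint paths and alpha=0.5 are illustrative assumptions; both models
# must share the same architecture for parameter-wise interpolation to work.
from transformers import AutoModelForCausalLM

def linear_merge(model_a, model_b, alpha: float = 0.5):
    """Return model_a with parameters set to alpha * A + (1 - alpha) * B."""
    state_a = model_a.state_dict()
    state_b = model_b.state_dict()
    merged_state = {
        name: alpha * param + (1.0 - alpha) * state_b[name]
        for name, param in state_a.items()
    }
    model_a.load_state_dict(merged_state)
    return model_a

# Placeholder checkpoint paths for the two separately trained evaluators.
direct_model = AutoModelForCausalLM.from_pretrained("direct_assessment_ckpt")
pairwise_model = AutoModelForCausalLM.from_pretrained("pairwise_ranking_ckpt")

evaluator = linear_merge(direct_model, pairwise_model, alpha=0.5)
evaluator.save_pretrained("merged_evaluator")
```

The appeal of this kind of merge is that the resulting single model handles both evaluation formats without having to train one model jointly on both tasks.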
Benchmark tests showed that Prometheus 2 outperforms its predecessors in correlation and evaluation accuracy, exceeding 85% on four pairwise ranking benchmarks. Notably, it markedly narrows the gap with proprietary evaluators such as GPT-4 and with human judgements.
In summary, Prometheus 2 targets the challenge of building a transparent, adaptable, and scalable language model evaluator that mirrors human judgement. It uses a linear merging approach, combining the weights of two separately trained models, one per evaluation format. On benchmark tests, the unified model demonstrated high accuracy and correlation and substantially reduced the performance gap with proprietary evaluators, making Prometheus 2 a major step forward in open-source evaluation.
All credit goes to the researchers of this project. More information about this research can be found in the paper and on GitHub.