Prometheus-Eval is a repository that provides tools for training, evaluating, and using language models whose job is to evaluate other language models. Developed by researchers from several institutions, including KAIST AI, MIT, and the University of Illinois Chicago, the toolkit specializes in LM-based assessment. Using the prometheus-eval Python package, users can evaluate instruction-response pairs through both absolute and relative grading, yielding comprehensive evaluations of language models.
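To make the grading flow concrete, here is a minimal sketch of absolute grading with the prometheus-eval package. It follows the usage pattern shown in the repository's README; the class and method names (`VLLM`, `PrometheusEval`, `single_absolute_grade`) reflect the package at the time of writing and may differ across versions, and the instruction, response, and rubric text are illustrative placeholders.

```python
# A minimal sketch of absolute (direct assessment) grading with prometheus-eval,
# based on the usage pattern in the repository's README; APIs may vary by version.
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE

# Load the evaluator model through the package's vLLM wrapper.
model = VLLM(model="prometheus-eval/prometheus-7b-v2.0")
judge = PrometheusEval(model=model, absolute_grade_template=ABSOLUTE_PROMPT)

# Illustrative placeholder data.
instruction = "Explain the difference between supervised and unsupervised learning."
response = "Supervised learning uses labeled data; unsupervised learning finds structure in unlabeled data."
reference_answer = "Supervised learning trains on labeled examples, while unsupervised learning discovers patterns without labels."

# A score rubric on a 1-5 Likert scale, filled into the package's rubric template.
rubric_data = {
    "criteria": "Is the explanation accurate and complete?",
    "score1_description": "The explanation is incorrect or missing.",
    "score2_description": "The explanation is mostly incorrect.",
    "score3_description": "The explanation is partially correct.",
    "score4_description": "The explanation is mostly correct and clear.",
    "score5_description": "The explanation is fully correct, clear, and complete.",
}
score_rubric = SCORE_RUBRIC_TEMPLATE.format(**rubric_data)

# Returns verbal feedback plus an integer score from 1 to 5.
feedback, score = judge.single_absolute_grade(
    instruction=instruction,
    response=response,
    rubric=score_rubric,
    reference_answer=reference_answer,
)
print(score, "-", feedback)
```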
Prometheus-Eval stands out for its ability to approximate human judgments using purely LM-based evaluation. It provides a fair and affordable evaluation framework that removes the dependence on closed-source models for assessment, lets users run internal evaluation pipelines unaffected by GPT version updates, and operates on consumer-grade GPUs.
Prometheus 2, an upgraded version of the original evaluator, supports both direct assessment (absolute grading) and pairwise ranking (relative grading), improving both the flexibility and the accuracy of evaluations. Scoring on a 5-point Likert scale, Prometheus 2 shows a Pearson correlation of 0.6 to 0.7 with GPT-4-1106 across multiple direct assessment benchmarks, and 72% to 85% agreement with human judgments across multiple pairwise ranking benchmarks.
Prometheus 2 comes in two sizes. The flagship Prometheus 2 (8x7B) delivers the strongest evaluation performance, while the lighter Prometheus 2 (7B) retains at least 80% of the 8x7B model's evaluation performance, surpassing Llama-2-70B and matching Mixtral-8x7B. The 7B model requires only 16 GB of VRAM, making it suitable for consumer GPUs and allowing more researchers to leverage advanced evaluation capabilities without expensive hardware.
The prometheus-eval package provides an easy-to-use interface for evaluating instruction-response pairs with Prometheus 2. It supports batch grading and lets users switch between absolute and relative grading simply by changing the input prompt format and system prompt, as sketched below. It also accommodates various datasets, making large-scale evaluations efficient.
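As a companion sketch, the snippet below shows pairwise (relative) grading over a small batch. It again follows the README's pattern; the `relative_grade` method, its keyword arguments, and the `RELATIVE_PROMPT` template are taken from the documented interface and may change between releases, and the example data is invented for illustration.

```python
# Sketch of pairwise (relative) grading in batch, following the README pattern;
# method names and signatures are assumptions and may differ across versions.
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import RELATIVE_PROMPT

model = VLLM(model="prometheus-eval/prometheus-7b-v2.0")
judge = PrometheusEval(model=model, relative_grade_template=RELATIVE_PROMPT)

# Batched inputs: each instruction is paired with two candidate responses.
instructions = [
    "Summarize the plot of Hamlet in one sentence.",
    "What is the capital of France?",
]
responses_A = [
    "Hamlet avenges his father's murder at great personal cost.",
    "The capital of France is Paris.",
]
responses_B = [
    "Hamlet is a play by Shakespeare.",
    "France is in Europe.",
]
# A simple comparison criterion; illustrative placeholder text.
rubric = "Which response better and more completely answers the instruction?"

# Returns per-pair feedback and a verdict ('A' or 'B') for each comparison.
feedbacks, scores = judge.relative_grade(
    instructions=instructions,
    responses_A=responses_A,
    responses_B=responses_B,
    rubric=rubric,
)
for feedback, verdict in zip(feedbacks, scores):
    print(verdict, "-", feedback)
```

Switching back to absolute grading only requires constructing the judge with the absolute template instead, as in the earlier sketch.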
In conclusion, Prometheus-Eval and Prometheus 2 address the need for reliable and transparent evaluation tools in NLP by providing a robust, fair, and accessible framework for evaluating language models. With Prometheus 2's stronger evaluation capabilities and performance, researchers can assess their models confidently without relying on proprietary judges.