In a joint effort, researchers from DeepMind and Stanford University have developed an AI agent that fact-checks large language models (LLMs), enabling their factuality to be benchmarked. These models sometimes fabricate false claims in their responses, and the likelihood of this grows as responses get longer. Prior to this work, there was no established means of gauging the factuality of LLMs’ long-form responses.
The researchers first used GPT-4, another AI model, to generate LongFact – a set of 2,280 prompts spanning 38 different topics. These prompts elicit long-form responses from the LLM under test.
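The exact procedure and prompts used to build LongFact are given in the paper; the snippet below is only an illustrative sketch of the idea, with a placeholder model name, topic list, and instruction text rather than the real ones.

```python
# Illustrative sketch: generating topic-specific, fact-seeking prompts with an
# LLM API. The topics and instruction wording are placeholders, not the ones
# actually used to construct LongFact.
from openai import OpenAI

client = OpenAI()

TOPICS = ["astronomy", "world history", "marine biology"]  # LongFact spans 38 topics

def generate_prompts(topic: str, n: int = 5) -> list[str]:
    """Ask the model for prompts that elicit long, fact-dense answers on a topic."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} questions about {topic} that require a detailed, "
                "multi-paragraph answer containing many specific facts. "
                "Return one question per line."
            ),
        }],
    )
    return response.choices[0].message.content.strip().splitlines()

prompts = [p for topic in TOPICS for p in generate_prompts(topic)]
```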
The team then built an AI agent on top of GPT-3.5-turbo that uses Google Search to verify the responses generated by the LLM. This method was named Search-Augmented Factuality Evaluator (SAFE). The LLM’s response is first broken down into individual facts; each fact is then checked via Google Search queries, and its truthfulness is judged based on the information in the returned results.
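The Python sketch below outlines this kind of pipeline under stated assumptions: call_llm and google_search are placeholder helpers standing in for an LLM API and a web-search API, and the prompts shown are illustrative rather than the actual SAFE prompts.

```python
# A minimal sketch of a SAFE-style pipeline: split a long-form response into
# facts, search for evidence on each fact, and rate it as supported or not.
from dataclasses import dataclass

@dataclass
class FactVerdict:
    fact: str
    supported: bool

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM (e.g. GPT-3.5-turbo) and return its reply."""
    raise NotImplementedError

def google_search(query: str) -> str:
    """Placeholder: run a Google search and return a text summary of the results."""
    raise NotImplementedError

def evaluate_response(question: str, response: str) -> list[FactVerdict]:
    # 1. Break the long-form response into self-contained individual facts.
    facts = call_llm(
        f"List each individual fact stated in the following text, one per line:\n{response}"
    ).splitlines()

    verdicts = []
    for fact in facts:
        # 2. Let the model propose a search query for this fact and fetch evidence.
        query = call_llm(f"Write a Google search query to verify: {fact}")
        evidence = google_search(query)

        # 3. Rate the fact as supported or not supported given the evidence.
        rating = call_llm(
            f"Question: {question}\nFact: {fact}\nSearch results: {evidence}\n"
            "Is the fact supported by the search results? Answer SUPPORTED or NOT_SUPPORTED."
        )
        verdicts.append(FactVerdict(fact=fact, supported="NOT" not in rating.upper()))
    return verdicts
```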
SAFE has proved surprisingly effective: its verdicts matched human annotations 72% of the time, and in cases where it disagreed with humans, it was judged correct 76% of the time. Moreover, SAFE was 20 times cheaper than human annotators, suggesting that LLM agents can serve as efficient, low-cost fact-checkers.
A model’s performance was scored by combining the factual precision of its response (the fraction of its facts that are supported) with the number of supported facts it provides. The resulting metric, F1@K, balances precision against recall measured relative to K, a hyperparameter representing the number of supported facts a user would ideally want in a response.
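Concretely, if a response contains S supported facts and N facts that are not supported, precision is S / (S + N), recall is min(S / K, 1), and F1@K is their harmonic mean (zero when S = 0). A minimal sketch of this aggregation:

```python
# Sketch of the F1@K aggregation: precision is the fraction of supported facts
# in the response, recall is measured against K, the assumed number of
# supported facts a user wants.
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: 40 supported facts, 10 unsupported, evaluated at K = 64.
print(f1_at_k(40, 10, 64))  # precision 0.8, recall 0.625 -> F1 ~ 0.70
```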
Using LongFact, the researchers prompted 13 LLMs across four model families (Gemini, GPT, Claude, and PaLM-2) and evaluated the factuality of their responses with SAFE. GPT-4-Turbo emerged as the most factual model at generating long-form responses.
SAFE offers a rapid, cost-effective way to measure the long-form factuality of LLMs, although its judgments still hinge on the correctness of the information Google returns in its search results. DeepMind has made SAFE publicly available, pointing to its potential use in improving LLM factuality through better pretraining and finetuning – essentially enabling an LLM to verify its information before presenting the output to the user.