Artificial Intelligence (AI) systems, such as Vision-Language Models (VLVMs), are becoming increasingly advanced, integrating text and visual inputs to generate responses. These models are being used in critical contexts, such as medical diagnostics and autonomous driving, where accuracy is paramount. However, researchers have identified a significant issue in these models, which they refer to as ‘hallucinations’.
Hallucinations in VLVMs are coherent but factually incorrect responses: plausible-sounding details that the model generates about an image but that are not actually there. These inaccuracies carry significant risks, most notably that they can misinform decisions in critical applications. The challenge for researchers is to detect these hallucinations and to develop effective methods for mitigating them, so that VLVM outputs remain reliable.
Hallucinations in VLVMs have primarily been evaluated through limited query formats, such as yes/no questions about specific objects or attributes within an image. This approach is not comprehensive: it fails to account for the more complex, open-ended hallucinations that arise in free-form responses across real-world applications, leaving a significant gap in understanding and mitigating the broad spectrum of hallucinations that VLVMs can generate.
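To make the contrast concrete, the snippet below sketches the difference between the two evaluation styles: a closed-ended benchmark only scores a binary answer to a pre-written object probe, while open-ended evaluation must check every object a free-form description mentions. The prompts and data here are purely illustrative and are not drawn from any specific benchmark.

```python
# Closed-ended probing: the benchmark supplies the question, and only the
# yes/no answer is scored, so hallucinations about unprobed objects are
# never observed.
closed_probe = {
    "question": "Is there a dog in the image?",
    "model_answer": "yes",
    "ground_truth": "no",  # scored as a single binary error
}

# Open-ended evaluation: the model describes the image freely, and every
# object it mentions must be checked against the image annotations.
free_form_response = (
    "A man walks a dog past a parked bicycle while a cat sits on the bench."
)
ground_truth_objects = {"man", "bicycle", "bench"}  # "dog" and "cat" would be hallucinations
```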
To close this gap, researchers from the University of Oxford and AWS AI Labs have introduced a new framework called THRONE (Text-from-image Hallucination Recognition with Object-probes for open-ended Evaluation). Unlike previous methods, THRONE leverages publicly available language models to evaluate hallucinations in the free-form responses generated by various VLVMs, providing a more rigorous and comprehensive assessment.
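The pipeline below is a minimal sketch of that general idea under stated assumptions: a VLVM produces a free-form description, an off-the-shelf language model is asked which object classes the description claims are present, and those claims are compared with the image's ground-truth annotations. The `query_language_model` helper and the prompt wording are hypothetical placeholders, not the authors' implementation.

```python
from typing import Callable, Set

def extract_claimed_objects(
    response: str,
    object_classes: Set[str],
    query_language_model: Callable[[str], str],
) -> Set[str]:
    """Ask a public language model which object classes the free-form
    response asserts are present. The prompt format is illustrative."""
    claimed = set()
    for obj in object_classes:
        prompt = (
            f'Passage: "{response}"\n'
            f"Does the passage state that a {obj} is present? Answer yes or no."
        )
        if query_language_model(prompt).strip().lower().startswith("yes"):
            claimed.add(obj)
    return claimed

def hallucinated_objects(claimed: Set[str], annotated: Set[str]) -> Set[str]:
    """Objects the VLVM mentioned that are not in the image annotations."""
    return claimed - annotated
```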
THRONE uses multiple metrics to quantitatively measure hallucinations across different VLVMs, including precision and recall along with a class-wise F0.5 score, which weights precision twice as heavily as recall. This score is particularly relevant in scenarios where false positives (plausible but incorrect mentions) are more harmful than false negatives.
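As a concrete illustration of how such a class-wise score can be computed, the sketch below applies the standard F-beta definition with beta = 0.5 to per-class counts. The counts are made up for illustration, and the exact aggregation used by THRONE may differ.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    With beta = 0.5, precision is weighted more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Per-class counts for one object class, aggregated over a set of images
# (numbers are purely illustrative):
true_positives = 40    # class mentioned and actually present
false_positives = 10   # class mentioned but absent -> hallucination
false_negatives = 30   # class present but never mentioned

precision = true_positives / (true_positives + false_positives)  # 0.8
recall = true_positives / (true_positives + false_negatives)     # ~0.57
print(f"F0.5 = {f_beta(precision, recall):.3f}")
```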
Evaluations conducted with THRONE reveal useful data about the prevalence and characteristics of hallucinations in current VLVMs. The results indicate that, even under this more rigorous evaluation, many VLVMs still hallucinate at a high rate: some models produce responses in which roughly 20% of the objects mentioned are hallucinations, showing that significant work remains to improve the reliability of VLVM outputs.
In conclusion, the THRONE framework is an important step towards better understanding and evaluating hallucinations in vision-language models, specifically Type I hallucinations in free-form responses. Where earlier benchmarks have struggled to capture these nuanced inaccuracies, THRONE combines publicly available language models with a robust metric suite. Notwithstanding these advances, the consistently high rate of detected hallucinations underlines the need for further research to improve the accuracy and trustworthiness of VLVMs in practical applications. The full research paper is available here.