Processing visual information effectively is a key step toward achieving Artificial General Intelligence (AGI). Despite substantial progress in artificial intelligence, conventional Visual Question Answering (VQA) systems remain limited to processing and reasoning about a single image at a time. The “Multi-Image Question Answering” (MIQA) task seeks to remove this restriction by enabling AI systems to interpret and reason over entire collections of visual data.
Google’s “Needle-In-A-Haystack” (NIAH) challenge has emerged as a popular tool for analyzing an AI model’s capacity to process “long contexts” or large sets of input data such as extended documents, videos, or hundreds of images. In NIAH, critical information (the “needle”) relevant to a specific question is hidden within a vast amount of data (the “haystack”).
To explore AI’s potential further in visual-centric long-context reasoning capabilities, the Visual Haystacks (VHs) benchmark was introduced. VHs, designed to test Large Multimodal Models (LMMs), focuses on visual retrieval and reasoning across large, uncorrelated image sets. This pioneering benchmark has two main challenges, Single-Needle and Multi-Needle, each of which tests a model’s ability to accurately locate and analyze relevant images before answering questions.
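To make the two settings concrete, the sketch below shows roughly what a Single-Needle versus Multi-Needle instance looks like. The `HaystackInstance` fields, file paths, and question phrasings are illustrative assumptions for this post, not the benchmark’s actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HaystackInstance:
    """One illustrative VHs-style example (hypothetical schema)."""
    question: str               # binary (yes/no) question about the needle image(s)
    image_paths: List[str]      # the haystack: needle(s) mixed with unrelated distractors
    needle_indices: List[int]   # which images are actually relevant
    answer: str                 # ground-truth "yes" or "no"

# Single-Needle: exactly one relevant image hidden among distractors.
single_needle = HaystackInstance(
    question="For the image with a dog, is there a frisbee?",
    image_paths=[f"images/{i:05d}.jpg" for i in range(100)],
    needle_indices=[42],
    answer="yes",
)

# Multi-Needle: several relevant images, and the question aggregates over them.
multi_needle = HaystackInstance(
    question="For all images with a dog, do any of them contain a frisbee?",
    image_paths=[f"images/{i:05d}.jpg" for i in range(100)],
    needle_indices=[7, 42, 93],
    answer="no",
)
```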
The VHs benchmark reveals that current LMMs struggle with ‘visual distractors’, have difficulty reasoning across multiple images, and are sensitive to where the ‘needle’ image appears in the input sequence. To address these issues, the open-source MIRAGE (Multi-Image Retrieval Augmented Generation) framework was introduced. MIRAGE leverages a query-aware compression model, uses a retriever trained to predict whether an image is relevant to the question, and augments existing single-image instruction fine-tuning data with multi-image reasoning data.
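A minimal sketch of this pipeline is given below, assuming the three stages just described: relevance retrieval, query-aware compression, then generation. The function and callable signatures are placeholders chosen for illustration, not MIRAGE’s actual API.

```python
from typing import Callable, List, Sequence

def answer_over_haystack(
    question: str,
    images: Sequence[object],                      # raw images or precomputed features
    retrieve: Callable[[str, object], float],      # predicts relevance in [0, 1]
    compress: Callable[[str, object], object],     # query-aware token compression
    generate: Callable[[str, List[object]], str],  # the multimodal generator
    threshold: float = 0.5,
) -> str:
    # 1. Score every image's relevance to the question.
    scores = [retrieve(question, img) for img in images]

    # 2. Drop likely visual distractors so the generator only sees
    #    images the retriever judges relevant.
    kept = [img for img, score in zip(images, scores) if score >= threshold]

    # 3. Compress each surviving image into a small, question-conditioned
    #    set of visual tokens.
    compressed = [compress(question, img) for img in kept]

    # 4. Generate the final answer from the question plus the compressed visuals.
    return generate(question, compressed)
```

The key design idea is that filtering and compression happen before generation, so adding more haystack images does not proportionally inflate the generator’s context.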
Tests with MIRAGE showed that it can handle larger image sets than existing LMMs while achieving state-of-the-art performance on most single-needle tasks. It also outperformed competing methods on multi-image tasks and delivered competitive single-image question-answering performance. MIRAGE’s co-trained retriever also performed significantly better than an off-the-shelf CLIP retriever without sacrificing efficiency.
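For context, the kind of off-the-shelf CLIP baseline that a co-trained retriever is typically compared against can be approximated with a few lines of Hugging Face Transformers code. This is an illustrative sketch of such a baseline, not the evaluation code behind the reported results, and the checkpoint name is just one common choice.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_relevance_scores(question: str, image_paths: list) -> list:
    """Score each image's similarity to the question with a frozen CLIP model."""
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(text=[question], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_texts); higher means more similar.
    return outputs.logits_per_image[:, 0].tolist()
```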
The Visual Haystacks (VHs) benchmark identifies three prevalent deficiencies in current Large Multimodal Models: susceptibility to visual distraction, difficulty reasoning across multiple images, and sensitivity to image position in the input sequence. MIRAGE, a visual retrieval-augmented generation framework, offers a way to address these problems. Developers and researchers are encouraged to use the Visual Haystacks framework to identify and rectify deficiencies in future LMM projects, advancing the frontiers of AGI.