Within visual question answering (VQA), Multi-Image Visual Question Answering (MIQA) remains a major hurdle: the task requires generating relevant, grounded answers to natural language questions based on a large collection of images. While large multimodal models (LMMs) have proven competent at single-image VQA, they falter when a query spans an extensive set of images. This shortcoming limits their usefulness in applications such as searching large photo albums, finding specific information across the web, or monitoring environmental changes in satellite imagery.
Current VQA techniques, built primarily around single-image analysis, are ill-suited to complex queries that span multiple images. Models such as Gemini 1.5 Pro and GPT-4V can accept multiple images but struggle to efficiently retrieve and integrate the relevant ones from large collections, leading to high computational cost and declining performance as the number and diversity of images grows. They also suffer from positional bias and have difficulty fusing visual information across unrelated images, which further degrades accuracy.
To address these limitations, researchers from the University of California developed MIRAGE (Multi-Image Retrieval Augmented Generation), a framework designed specifically for MIQA. By combining several innovations, namely a compressive image encoder, a query-aware relevance filter for retrieval, and training augmented with both synthetic and real MIQA data, MIRAGE handles large image contexts efficiently while improving accuracy on MIQA tasks. On the Visual Haystacks (VHs) benchmark, the model demonstrated an 11% accuracy improvement over proprietary models such as GPT-4o, and it was up to 3.4x more efficient than conventional text-focused multi-stage approaches.
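At a high level, this is a retrieve-then-answer pattern: cheaply score every image against the question, discard unlikely candidates, and let the LMM reason over the survivors. The sketch below illustrates that flow; the helper names (encode_image, score_relevance, answer_with_lmm) and the threshold are hypothetical placeholders, not the authors' API.

```python
# A minimal sketch of the retrieve-then-answer flow described above.
# All helper functions are hypothetical placeholders, not the authors' API.
from typing import Callable, List


def answer_miqa(question: str,
                images: List,                   # raw images, e.g. PIL images
                encode_image: Callable,         # image -> compact features
                score_relevance: Callable,      # (question, features) -> score in [0, 1]
                answer_with_lmm: Callable,      # (question, [features]) -> answer text
                threshold: float = 0.5) -> str:
    """Prune a large image pool with a cheap relevance score, then let the
    multimodal model reason over the surviving images only."""
    # 1. Encode every image into a compact token representation.
    encoded = [encode_image(img) for img in images]

    # 2. Keep only the images the filter judges relevant to the question.
    kept = [feat for feat in encoded if score_relevance(question, feat) > threshold]

    # 3. Generate a grounded answer from the reduced image context.
    return answer_with_lmm(question, kept)
```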
The MIRAGE framework compresses image encodings with a Q-Former, reducing the token count per image from 576 to 32 and thereby fitting many more images within the same context budget. The query-aware relevance filter, a single-layer MLP, predicts each image's relevance to the query and selects only the pertinent images for full analysis. Training combines existing MIQA datasets with synthetic data derived from single-image QA datasets, strengthening the model's performance and adaptability across varied MIQA scenarios.
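The PyTorch sketch below illustrates the two components just described: a Q-Former-style compressor that distills 576 patch tokens per image into 32 learned query tokens via cross-attention, and a single-layer MLP that scores image relevance to the query. The hidden size, number of attention heads, and pooling choice are illustrative assumptions, not the released architecture.

```python
# Rough sketch of the two modules described above. Layer sizes and the exact
# attention setup are assumptions for illustration, not MIRAGE's actual code.
import torch
import torch.nn as nn


class TokenCompressor(nn.Module):
    """Cross-attend 32 learned queries over 576 image patch tokens."""
    def __init__(self, dim: int = 768, n_queries: int = 32, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, 576, dim) -> compressed: (batch, 32, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        return compressed


class RelevanceFilter(nn.Module):
    """Single-layer MLP scoring how relevant an image is to the query."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Linear(2 * dim, 1)

    def forward(self, image_feat: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # Pool the 32 compressed tokens, concatenate with the query embedding,
        # and output a relevance probability per image.
        pooled = image_feat.mean(dim=1)                          # (batch, dim)
        logits = self.mlp(torch.cat([pooled, query_feat], dim=-1))
        return torch.sigmoid(logits)                             # (batch, 1)


# Example: 10 candidate images, each with 576 patch tokens of width 768.
patches = torch.randn(10, 576, 768)
query = torch.randn(10, 768)                    # question embedding per image
compressed = TokenCompressor()(patches)         # -> (10, 32, 768)
scores = RelevanceFilter()(compressed, query)   # -> (10, 1) relevance scores
```

At 32 tokens per image instead of 576, roughly 18x more images fit into the same context window, which is what makes filtering over a large candidate pool feasible in the first place.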
In evaluation, MIRAGE significantly outperformed existing models on the VHs benchmark, achieving an 11% accuracy improvement over proprietary models such as GPT-4o on single-needle queries. It also maintained more consistent performance as the image set grew, validating its ability to handle expansive visual contexts.
In conclusion, the MIRAGE framework marks a substantial step forward in MIQA. It efficiently retrieves and integrates relevant images from vast collections to answer complex visual queries, and with its gains in both accuracy and processing efficiency over existing models, it promises to enable more powerful AI applications capable of processing extensive visual data in real-world settings.