Researchers from the University of Pennsylvania, the University of Washington, the Allen Institute for AI, the University of California, and Columbia University have developed ‘Blink’, a new benchmark for evaluating core visual perception abilities in multimodal large language models (LLMs). The accompanying study argues that current methods of evaluating these models conflate perception with linguistic understanding and reasoning.
Blink emphasizes visual perception abilities that previous evaluations have overlooked. The benchmark comprises fourteen classic computer vision tasks, ranging from basic pattern matching through intermediate reasoning to advanced visual understanding. The tasks are designed to require a genuine understanding of image content rather than superficial labeling.
Each task is reformulated as a multiple-choice question whose answer options are given as either images or text. Blink consists of roughly 3,800 questions and 7,300 images, spanning indoor and outdoor scenes, cityscapes, and nature. The questions and images are either drawn from existing datasets or created by human annotators.
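As a rough illustration of this format (not the benchmark's actual data schema), a single Blink-style item could be represented as below; the class, field names, and helper function are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlinkItem:
    """Hypothetical representation of one multiple-choice Blink question."""
    task: str               # e.g. "relative_depth"
    question: str           # natural-language prompt shown to the model
    image_paths: List[str]  # one or more input images
    choices: List[str]      # answer options, given as text or image references
    answer: str             # label of the correct option, e.g. "(B)"

def is_correct(item: BlinkItem, model_output: str) -> bool:
    """Count a prediction as correct if it starts with the right option label."""
    return model_output.strip().startswith(item.answer)

# Example with made-up data:
example = BlinkItem(
    task="relative_depth",
    question="Which marked point is closer to the camera?",
    image_paths=["scene_001.jpg"],
    choices=["(A) point A", "(B) point B"],
    answer="(B)",
)
print(is_correct(example, "(B) point B"))  # True
```

In this sketch, accuracy on a task would simply be the fraction of items for which is_correct returns True.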
The researchers conducted a thorough assessment of seventeen multimodal LLMs on Blink. The results showed that while humans find most of the tasks relatively simple, achieving an average success rate of 95.70%, current models struggle with them: even the advanced GPT-4V managed an average accuracy of only 51.26%.
When Blink was used to compare multimodal LLMs with expert vision models, the latter performed significantly better. For example, a specialist outperformed GPT-4V on visual correspondence estimation by 62.8%, relative depth estimation by 38.7%, and multi-view reasoning by 34.6%.
The findings suggest that previous estimates of the perceptual abilities of multimodal LLMs may have been overstated. They also indicate that these models could be improved by incorporating insights from specialist models that excel in particular areas. The team sees Blink as a tool for exploring how multimodal LLMs can fuse traditional notions of perception with state-of-the-art generative capabilities, fostering new developments in the field.