
MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers, in collaboration with the MIT-IBM Watson AI Lab, have developed a new metric, the “minimum viewing time” (MVT), to measure the difficulty of recognizing an image. The researchers aimed to close the gap between the performance of deep learning-based AI models and humans in recognizing and understanding visual data. The metric was developed after observing that both humans and AI models struggle with complex images that don’t instantly reveal their content.

The researchers observed that while large datasets have been instrumental in advancing AI, there is no gauge of how difficult a given image or dataset actually is. This makes it hard to objectively evaluate progress toward human-level performance or to make a dataset more challenging. Object recognition models perform well on current datasets, even those designed to intentionally challenge machines with debiased images or distribution shifts, yet they still fall short of human abilities.

Using subsets of ImageNet and ObjectNet, two popular datasets in machine learning, the researchers showed images to participants for varying durations and asked them to identify the correct object from a set of 50 options; an image's minimum viewing time is the shortest exposure at which participants could reliably identify it. They found that existing test sets, including ObjectNet, are skewed toward easier images. Moreover, larger models showed more progress on simpler images but less progress on more challenging ones.
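As an illustration, the sketch below shows how a per-image MVT might be estimated from trial data of this kind. The data layout, the accuracy threshold, and the `minimum_viewing_time` helper are assumptions made for the example, not the team's actual pipeline.

```python
from collections import defaultdict

# Hypothetical trial records: (image_id, presentation_duration_ms, answered_correctly)
trials = [
    ("img_001", 17, False), ("img_001", 50, True), ("img_001", 150, True),
    ("img_002", 17, True),  ("img_002", 50, True), ("img_002", 150, True),
]

def minimum_viewing_time(trials, accuracy_threshold=0.5):
    """Estimate a per-image MVT: the shortest presentation duration at which
    participants identify the object at or above the accuracy threshold."""
    # Group responses by image and by presentation duration.
    by_image = defaultdict(lambda: defaultdict(list))
    for image_id, duration, correct in trials:
        by_image[image_id][duration].append(correct)

    mvt = {}
    for image_id, durations in by_image.items():
        for duration in sorted(durations):  # scan from shortest exposure upward
            responses = durations[duration]
            if sum(responses) / len(responses) >= accuracy_threshold:
                mvt[image_id] = duration
                break
        else:
            mvt[image_id] = None  # never reliably recognized at any tested duration
    return mvt

print(minimum_viewing_time(trials))
# {'img_001': 50, 'img_002': 17}
```

Under this framing, a larger MVT simply means the image needs more viewing time before people recognize it, which is the sense in which the metric scores difficulty.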

David Mayo, an MIT PhD student in electrical engineering and computer science and a CSAIL affiliate, said the findings revealed that some images inherently take longer to recognize, and suggested examining what the brain is doing during that time and how it relates to machine learning models. The researchers also found that complex images recruit additional brain areas not typically associated with visual processing. In practical settings such as healthcare, where visual complexity matters, AI models struggle to interpret medical images because of the diversity and difficulty distribution of those images.

With the minimum viewing time metric, the team introduced a tool for measuring image difficulty in existing benchmarks and across a variety of applications, including gauging the difficulty of a test set before deploying real-world systems, studying how the brain processes image difficulty, and closing the gap between benchmark and real-world performance.
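As a sketch of the first use case, per-image MVT scores could be summarized into a difficulty profile for a test set before deployment. The score values, the "hard" threshold, and the `difficulty_profile` helper below are hypothetical, chosen only to make the idea concrete.

```python
# Assumed input: per-image MVT scores in milliseconds, e.g. from the sketch above.
mvt_scores = {"img_001": 50, "img_002": 17, "img_003": 600, "img_004": 10000}

def difficulty_profile(mvt_scores, hard_threshold_ms=1000):
    """Summarize test-set difficulty: the median MVT and the share of images
    whose MVT exceeds a chosen 'hard' threshold."""
    values = sorted(v for v in mvt_scores.values() if v is not None)
    median = values[len(values) // 2]
    hard_fraction = sum(v > hard_threshold_ms for v in values) / len(values)
    return {"median_mvt_ms": median, "hard_fraction": hard_fraction}

print(difficulty_profile(mvt_scores))
# {'median_mvt_ms': 600, 'hard_fraction': 0.25}
```

A test set with a low median MVT and a small hard fraction would, in this framing, be the kind of "skewed toward easier images" benchmark the researchers describe.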

The researchers acknowledged that their approach focuses primarily on object recognition and leaves out the complexities introduced by cluttered images. Nonetheless, it is an important step toward a more robust benchmark for AI performance in image recognition. Simon Kornblith PhD ’17, a member of the technical staff at Anthropic, concurs, noting that the MIT team’s research focuses on images people can only recognize if given enough time, and confirming that these are generally harder for computer vision systems as well. Alan L. Yuille, Bloomberg Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University, adds that the work will help develop more realistic benchmarks.
