
IsoBench: A Benchmark Dataset for Artificial Intelligence Covering Four Broad Domains: Mathematics, Science, Algorithms, and Gaming

Large language models and multimodal foundation models such as GPT-4V, Claude, and Gemini, which combine visual encoders with language models, have made profound strides in Natural Language Processing (NLP) and Natural Language Generation (NLG). They show impressive performance on text-only inputs as well as combined image-and-text inputs. Nonetheless, questions remain about how their capabilities change depending on the form of input they receive.

Addressing this uncertainty, a group of researchers has introduced IsoBench, a benchmark dataset containing problems from four key areas: games, science, mathematics, and algorithms. Every problem in IsoBench has several isomorphic (i.e., structurally equivalent) representations, which may be textual, mathematical, or graphical. This diversity of representations allows a thorough examination of performance differences that arise purely from the form of the input.

Because it provides detailed, representation-level feedback, IsoBench can be used as a diagnostic tool for discrepancies in model performance caused by the input representation. The researchers observed that several foundation models display a consistent preference for textual representations over other input forms. For instance, Claude-3 Opus scores 28.7 points lower when given images instead of text, averaged across all IsoBench problems. Similarly, GPT-4 Turbo and Gemini Pro drop by 18.7 and 14.9 points, respectively, when presented with image inputs rather than text.

To counter these biases and improve performance, the team proposed two prompting strategies: IsoCombination and IsoScratchPad. IsoScratchPad uses translations between input forms during inference, while IsoCombination feeds the model combinations of different input representations. By exploiting the strengths of the various modalities, both strategies can narrow the performance gaps between representations, and the research indicates that each of them improves model performance, offering intriguing prospects for further progress in multimodal AI systems.

The team's main contributions are the introduction of IsoBench, an extensive test dataset of 1,630 samples spanning a variety of topics, including chess, physics, chemistry, and discrete and applied mathematics. Because each sample comes with multiple isomorphic input representations, including domain-specific text formats and visual formats, IsoBench enables comprehensive multimodal performance assessments. Using IsoBench, the team evaluated eight popular foundation models and found a recurring pattern: multimodal models perform better with text-based prompts than with image-based ones.
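To make this setup concrete, here is a minimal Python sketch of what an IsoBench-style sample with isomorphic representations, and a per-representation evaluation loop, might look like. The field names, the chess position, and the stub model are illustrative assumptions, not the dataset's actual schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class IsoSample:
    """One benchmark problem with several isomorphic representations.

    Field names are hypothetical; they only illustrate the idea that
    every sample pairs one ground-truth label with multiple input forms.
    """
    domain: str                 # e.g. "chess", "physics", "graph algorithms"
    label: str                  # ground-truth answer
    representations: dict = field(default_factory=dict)  # form name -> payload

sample = IsoSample(
    domain="chess",
    label="white_wins",
    representations={
        "image": "board.png",                     # rendered board diagram
        "fen": "8/8/8/8/8/5K2/6Q1/7k w - - 0 1",  # domain-specific text format
        "pgn": "1. Qg2#",                         # alternative text format
    },
)

def evaluate(model, samples):
    """Score a model separately on each representation form."""
    scores = {}
    for rep in sorted({r for s in samples for r in s.representations}):
        usable = [s for s in samples if rep in s.representations]
        correct = sum(model(rep, s.representations[rep]) == s.label
                      for s in usable)
        scores[rep] = correct / len(usable)
    return scores

# Stub model that always answers "white_wins", for illustration only.
scores = evaluate(lambda rep, payload: "white_wins", [sample])
print(scores)  # {'fen': 1.0, 'image': 1.0, 'pgn': 1.0}
```

Keeping the label fixed while varying only the representation is what lets a harness like this attribute any score gap directly to the input form.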

Two strategies, IsoScratchPad (IsoSP) and IsoCombination (IsoCB), have been proposed to bridge the performance gaps among input modalities. IsoScratchPad translates visual inputs into text during inference, while IsoCombination combines multiple input modalities in a single prompt. According to the research, in some cases IsoCB and IsoSP improve the performance of multimodal foundation models by nearly ten percentage points. These methods reduce the observed bias toward text inputs, yielding better model performance across input modalities.
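The two strategies can be sketched as simple prompting pipelines. The following Python sketch uses stand-in functions (`caption_image` and `answer` are placeholders for model calls, not the paper's actual code), so it only illustrates the control flow of each strategy.

```python
def caption_image(image: str) -> str:
    """Stand-in for a first model call that transcribes a visual input
    into a textual representation (e.g., a chess-board image -> FEN)."""
    return "8/8/8/8/8/5K2/6Q1/7k w - - 0 1"

def answer(prompt: str) -> str:
    """Stand-in for the foundation model's final answer call."""
    return "checkmate"

def iso_scratchpad(image: str, question: str) -> str:
    """IsoScratchPad-style flow: translate the image to text first,
    then ask the model to reason over the textual form."""
    text_repr = caption_image(image)
    return answer(f"{question}\n\nPosition (FEN): {text_repr}")

def iso_combination(image: str, text_repr: str, question: str) -> str:
    """IsoCombination-style flow: present both modalities together
    in a single prompt instead of choosing one."""
    return answer(f"{question}\n\n[image: {image}]\nPosition (FEN): {text_repr}")

result = iso_scratchpad("board.png", "What is the result of this position?")
print(result)  # checkmate
```

The key difference is that IsoScratchPad spends an extra inference step converting modalities, while IsoCombination relies on the model to reconcile both forms in one pass.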

The original research and the associated project page can be consulted for more in-depth information; credit for this work goes to the researchers behind the project. It stands as further testament to the continuous research, exploration, and innovation happening in the field of artificial intelligence.
