As AI models become increasingly integrated into various sectors, understanding how they function is crucial. By interpreting the mechanisms underlying these models, we can audit them for safety and biases, and potentially deepen our understanding of intelligence itself. Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have been working to automate this interpretation process, specifically for artificial vision models.
The team developed Multimodal Automated Interpretability Agent (MAIA), a system that uses a vision-language model backbone and a range of tools for experimenting on other AI systems. MAIA can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis. The system responds to user queries by running experiments on specific models until it can provide a comprehensive answer.
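To make that loop concrete, below is a minimal Python sketch of a hypothesize-experiment-refine cycle of the kind described. It is an illustration under assumptions, not MAIA’s actual code: the backbone call, the tool layer, and every function name here (backbone_propose, run_tool, summarize) are hypothetical stand-ins.

```python
# Minimal sketch (not MAIA's actual code) of an agentic interpretability loop:
# a vision-language backbone proposes experiments, a tool layer runs them on the
# target model, and the agent refines its hypothesis from the observations.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Experiment:
    hypothesis: str   # what the agent currently believes about the target
    tool: str         # which tool to run, e.g. "synthesize_image" or "edit_image"
    inputs: dict      # arguments for that tool


@dataclass
class AgentState:
    query: str                                      # the user's interpretability question
    history: list = field(default_factory=list)     # (experiment, observation) pairs so far


def backbone_propose(state: AgentState) -> Optional[Experiment]:
    """Stand-in for the vision-language backbone: given the query and the
    experiment history, propose the next experiment, or None once confident."""
    ...  # in practice, a call to the backbone model conditioned on `state`


def run_tool(experiment: Experiment, target_model) -> dict:
    """Stand-in for the tool layer: execute the experiment against the target
    model (e.g. record a unit's activation on a newly synthesized image)."""
    ...  # in practice, image synthesis/editing plus a forward pass


def summarize(state: AgentState) -> str:
    """Stand-in: have the backbone compose a final answer from the evidence."""
    ...


def interpret(query: str, target_model, max_steps: int = 10) -> str:
    """Iteratively design and run experiments until the agent can answer."""
    state = AgentState(query=query)
    for _ in range(max_steps):
        experiment = backbone_propose(state)
        if experiment is None:                          # backbone is confident enough
            break
        observation = run_tool(experiment, target_model)
        state.history.append((experiment, observation))  # refine on the new evidence
    return summarize(state)
```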
MAIA demonstrated its capabilities by labeling individual components inside vision models, cleaning up image classifiers, and uncovering possible biases in AI systems. Yet according to the researchers, MAIA’s real advantage is its flexibility: it can answer many types of interpretability queries and design experiments on the fly to investigate them.
To illustrate how MAIA works, the researchers presented a task where a user asks MAIA to describe the concepts a specific neuron inside a vision model detects. MAIA generates hypotheses and tests them by creating and editing synthetic images. The results are evaluated based on the accuracy of MAIA’s interpretations and how well they predict neuron behavior on unseen data.
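One plausible way to score that last criterion is sketched below: treat the description as a predictor of the neuron’s response and correlate predicted with actual activations on held-out images. The function names and the use of Pearson correlation are illustrative assumptions, not the evaluation protocol reported by the researchers.

```python
# Rough sketch of scoring a neuron description by how well it predicts the
# neuron's behavior on images it has not seen. All names are placeholders.

from typing import Callable, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0


def evaluate_description(
    description: str,
    neuron_activation: Callable[[object], float],          # actual activation on an image
    predicted_activation: Callable[[str, object], float],  # e.g. a judge model scoring how
                                                           # well the image matches the text
    held_out_images: Sequence[object],
) -> float:
    """Correlate predicted vs. actual activations on unseen images;
    higher correlation means the description is more predictive."""
    actual = [neuron_activation(img) for img in held_out_images]
    predicted = [predicted_activation(description, img) for img in held_out_images]
    return pearson(actual, predicted)
```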
In one experiment, MAIA helped uncover a bias in a model that was routinely misclassifying images of black Labradors while favoring yellow-furred retrievers. The system has limitations, including its dependence on the quality of the external tools it relies on, a tendency toward confirmation bias, and the risk of overfitting. However, the researchers believe the system could play a critical role in understanding and effectively overseeing AI systems in the future.