Interpreting the functions and behaviors of large-scale neural networks remains a significant open challenge in artificial intelligence. To tackle this problem, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a strategy that uses AI models to investigate the computations inside other AI systems.

Central to this approach is the “automated interpretability agent” (AIA), which emulates the scientific experimental process: it conducts tests on computational systems and, in turn, generates clear, descriptive explanations of those systems’ operations and failures. The AIA is designed to form hypotheses, carry out experiments, and learn iteratively, refining its understanding of other systems in real time. This stands in contrast to existing interpretability procedures, which are largely passive.
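As an illustration only (not the CSAIL team’s implementation), the sketch below shows how such a hypothesize–experiment–refine loop might look for an agent with black-box access to a target function. The helper names `propose_inputs` and `update_hypothesis` are hypothetical placeholders, standing in for whatever mechanism (for example, a language model) suggests probes and revises the description.

```python
# Minimal sketch of a hypothesize-experiment-refine loop for an interpretability
# agent with black-box access to a target function. All helper names here are
# hypothetical illustrations, not the CSAIL implementation.

from typing import Callable, List, Tuple

def interpret_black_box(
    target: Callable[[str], float],                                  # system under study, black-box access only
    propose_inputs: Callable[[str], List[str]],                      # suggests probe inputs given the current hypothesis
    update_hypothesis: Callable[[str, List[Tuple[str, float]]], str],  # revises the hypothesis from evidence
    n_rounds: int = 5,
) -> str:
    """Iteratively probe `target` and refine a natural-language hypothesis."""
    hypothesis = "no hypothesis yet"
    evidence: List[Tuple[str, float]] = []
    for _ in range(n_rounds):
        # 1. Design experiments: choose inputs expected to confirm or refute the hypothesis.
        probes = propose_inputs(hypothesis)
        # 2. Run them against the black-box system and record the observations.
        evidence.extend((p, target(p)) for p in probes)
        # 3. Revise the hypothesis in light of the accumulated evidence.
        hypothesis = update_hypothesis(hypothesis, evidence)
    return hypothesis  # final natural-language description of the target's behavior
```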

In conjunction with the AIA method, the researchers also developed a benchmark test suite called “function interpretation and description” (FIND). The benchmark provides a set of tasks and functions that resemble computations inside trained networks, alongside descriptions of their behavior, establishing a standard for evaluating interpretability procedures. Example tasks in the FIND test bed include synthetic neurons that mimic real neurons inside language models. With only black-box access, AIAs design inputs to test these synthetic neurons, enabling deeper understanding and purposeful testing of system behavior.
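To give a rough sense of the setup, here is a hedged, self-contained sketch of a FIND-style synthetic neuron and of black-box probing. The concept, activation values, and names are invented for illustration and are not taken from the actual benchmark.

```python
# Hypothetical stand-in for a FIND-style synthetic neuron: a black-box function
# that responds strongly to one concept (here, transportation words). The real
# benchmark functions are more varied; this only illustrates the setup.

TRANSPORT_WORDS = {"car", "bus", "train", "bicycle", "airplane", "ferry"}

def synthetic_neuron(token: str) -> float:
    """Return a high 'activation' for transport-related tokens, low otherwise."""
    return 1.0 if token.lower() in TRANSPORT_WORDS else 0.05

# An agent with only black-box access probes the neuron with chosen inputs
# and summarizes the pattern it observes.
probe_inputs = ["train", "banana", "ferry", "novel", "bus"]
observations = {tok: synthetic_neuron(tok) for tok in probe_inputs}
print(observations)
# {'train': 1.0, 'banana': 0.05, 'ferry': 1.0, 'novel': 0.05, 'bus': 1.0}
# A ground-truth description for scoring might read: "responds to words about transportation."
```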

Despite the promise shown by the AIA and FIND, interpreting networks’ behavior is still far from fully automated. While AIAs outperform existing methods, they fail to accurately describe nearly half of the functions in the benchmark. This limitation may stem from insufficient exploration of certain regions of the input space during the AIA’s initial exploratory phase. The researchers found that interpretation accuracy can be significantly improved by guiding the AIAs’ exploration with specific, relevant inputs, as the sketch below illustrates.
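A minimal, hypothetical illustration of that kind of guidance (not the paper’s procedure): seeding the probe set with inputs known to be relevant to the function’s domain yields informative contrasts that a purely agent-invented probe set may miss.

```python
# Hypothetical illustration of guided exploration: the probe set is seeded with
# inputs relevant to the target's domain instead of relying only on probes the
# agent invents from scratch. The target function and word lists are invented.

def target(token: str) -> float:
    """Stand-in black-box function: responds strongly to colour words."""
    return 1.0 if token.lower() in {"red", "green", "blue", "violet"} else 0.0

unguided_probes = ["table", "run", "seven", "cloud"]   # generic probes, off-domain
guided_probes = ["red", "blue", "cloud", "seven"]      # seeded with relevant exemplars

print([target(t) for t in unguided_probes])  # [0.0, 0.0, 0.0, 0.0] -> little signal to build on
print([target(t) for t in guided_probes])    # [1.0, 1.0, 0.0, 0.0] -> informative contrast
```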

The team is also working on a toolkit to aid AIAs in undertaking more precise experiments on neural networks. The ultimate goal is to develop fully automated interpretability procedures that can aid in auditing real-world systems, such as autonomous driving or face recognition, to identify potential failure modes or biases prior to deployment.

The research team predicts that advanced AIAs could generate new types of experiments and questions beyond what human scientists would initially consider. This work marks a potential leap for AI research, with the promise of making AI systems more understandable and more reliable. Commenting on the study, Martin Wattenberg, a Harvard University professor who was not involved in the work, praised the development of the benchmark and the AIA’s ability to simplify complex AI systems.

The research was presented at the NeurIPS 2023 conference in December and was sponsored in part by the MIT-IBM Watson AI Lab, Open Philanthropy, Amazon Research, Hyundai NGV, the U.S. Army Research Laboratory, and the National Science Foundation, among others.
