Artificial intelligence (AI) models, and particularly large language models (LLMs), are not as robust at performing tasks in unfamiliar scenarios as they are often portrayed to be, according to a study by researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

The researchers focused on the performance of models like GPT-4 and Claude when handling “default tasks,” normal scenarios a model is trained and tested on, and “counterfactual scenarios,” which deviate from the normal and are usually unfamiliar to the models. In an effort to take the study outside the models’ comfort zones, the researchers modified existing tasks, using a variety of datasets and benchmarks tailored specifically to the models’ capabilities.

A prime example is base-10 arithmetic. While models perform well in this familiar number base, performance typically drops substantially when they work in other bases, suggesting their arithmetic skills are less general than they may seem. Other tasks, such as chess problems with altered starting positions or musical chord fingering, also saw drops in performance, implying the models struggle to adapt to new situations when they cannot rely on rote memorization.
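To make the base-arithmetic contrast concrete, here is a minimal Python sketch of how a default (base-10) and a counterfactual (base-9) addition query could be constructed and scored. This is an illustration under assumptions, not the study’s actual evaluation code: the helper names `to_base` and `make_addition_prompt`, the prompt wording, and the choice of base 9 are all hypothetical.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (assumes 2 <= base <= 10)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))


def make_addition_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Build an addition prompt and its expected answer, with operands and
    answer all written in the target base."""
    prompt = (
        f"You are doing addition in base-{base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}? Answer with digits only."
    )
    expected = to_base(a + b, base)
    return prompt, expected


if __name__ == "__main__":
    a, b = 27, 58
    # base 10 = the familiar "default" task; base 9 = a counterfactual variant
    for base in (10, 9):
        prompt, expected = make_addition_prompt(a, b, base)
        print(f"[base {base}] {prompt}")
        print(f"[base {base}] expected answer: {expected}")
        # A model's reply would be scored by exact match against `expected`;
        # the study reports that accuracy typically drops on the non-default base.
```

The underlying arithmetic rule is identical in both cases; only the surface representation changes, which is what makes the comparison a test of generalization rather than of task difficulty.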

The study’s lead author, Zhaofeng Wu, describes this observation as a fascinating aspect of LLMs. He emphasizes the importance of understanding this limitation as we aim to expand the models’ application horizons.

However, the study has limitations. The tasks and settings it examines do not fully represent the range of challenges the models could encounter in real-world applications, suggesting the need for more diverse testing environments in the future. Despite these limitations, the study provides vital insights into the workings and limits of LLMs and has the potential to influence the design of future models for improved robustness.

Furthermore, the researchers aim to make LLMs more interpretable in order to better understand their decision-making processes. This would help discern whether these models are genuinely generalizing to unseen tasks or simply memorizing training data.

Assistant Professor Hao Peng of the University of Illinois at Urbana-Champaign praised the study for addressing an important open question about the capabilities of state-of-the-art LLMs. He believes the research could inspire further work on identifying LLMs’ failure modes and developing better models.

The research team presented their work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month. The study was supported in part by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation.
