A group of researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently conducted a series of tests to understand whether AI models like ChatGPT are actually capable of reasoning through problems, or if they are merely echoing correct answers from their training data.
The tests, which the researchers refer to as “counterfactual tasks”, were structured slightly differently from the standard tasks traditionally used to evaluate AI models. The aim was to assess whether the models could adapt both to new tasks and to unseen instances of familiar ones.
The group tested several AI models, including GPT-4, GPT-3.5 Turbo, Claude 1.3, and PaLM 2. The tests involved prompts such as arithmetic problems posed in both base-10 and base-9, aiming to determine whether the models could apply mathematical reasoning across different number bases.
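To make the distinction concrete, here is a minimal sketch (not taken from the study itself) of what such a counterfactual arithmetic check looks like: the same addition problem produces different correct answers depending on the base it is interpreted in, so a model that has only memorized base-10 arithmetic would fail the base-9 version.

```python
# Minimal sketch (not the researchers' code): the same addition prompt
# interpreted in base 10 versus base 9, illustrating why memorized base-10
# arithmetic is not enough to answer the counterfactual variant.

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers written in the given base and return the sum in that base."""
    total = int(a, base) + int(b, base)
    if total == 0:
        return "0"
    digits = []
    while total:
        digits.append(str(total % base))
        total //= base
    return "".join(reversed(digits))

# "27 + 62" as a standard (base-10) task and as a counterfactual (base-9) task.
print(add_in_base("27", "62", 10))  # 89  -- the familiar answer
print(add_in_base("27", "62", 9))   # 100 -- in base 9: 27 is 25, 62 is 56, 25 + 56 = 81 = 100 (base 9)
```

The correct base-9 answer (100) looks nothing like the memorized base-10 answer (89), which is exactly the kind of shift the counterfactual tasks use to separate genuine reasoning from recall.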
The results revealed that while the models performed well on the standard versions of the tasks, their accuracy dropped sharply on variants that deviated even slightly from the norm. Performance on the counterfactual tasks was still better than random guessing, however, suggesting the models did attempt to reason through them, albeit with limited success.
This suggests that the impressive performance of AI models on typical tests, akin to college exams, owes more to recall of training data than to actual reasoning. The models also struggled when faced with unfamiliar scenarios, underscoring their limited adaptability.
These findings emphasize the importance of distinguishing a model’s observed performance on a task from its abstract ability to perform that task. Just because a model does well on familiar benchmarks does not guarantee it will maintain the same proficiency when presented with a novel problem to solve.
To improve the models’ ability to generalize, the researchers propose exposing them to a broader range of real-world, contextualized data and richer semantic representations. They also argue against the prevailing “train-to-test” approach, contending that it does not truly measure how a model will perform when it must reason through new challenges.
In conclusion, the models tested, while able to generate accurate responses drawn from their training data, lacked the ability to reason through tasks that deviated from that training. Improvements to training data and methods could enhance their adaptability and their capacity to handle new, unseen tasks.