Despite impressive advances in AI, the cognitive reasoning abilities of large language models (LLMs) like GPT-4o still fall short on basic problems that most humans, even children, can figure out. Discussions about the intellectual capacity of AI are as varied as they are conflicted: some experts, like Geoffrey Hinton, often called the ‘godfather of AI’, assert that machines could eventually overtake us in intelligence, while others, such as Yann LeCun, Chief AI Scientist at Meta, believe AI is far from achieving even ‘dog-level’ intelligence.
Users testing these claims have demonstrated both the strengths and the weaknesses of these models. A good illustration of their limitations came from a series of experiments involving the classic river crossing puzzle. For those unfamiliar, the puzzle involves a farmer who needs to transport a wolf, a goat and a cabbage across a river, but his boat can carry only one item at a time, and leaving certain items unattended together (the wolf with the goat, or the goat with the cabbage) results in one being eaten. Although ChatGPT can solve the puzzle when prompted, this ability likely originates from the model’s training data, which already contains many known variations of the puzzle.
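To make concrete how little search the classic puzzle actually requires, here is a minimal sketch of a brute-force solver (not taken from the experiments described in this article). It encodes each bank assignment as a tuple and runs a breadth-first search; the state encoding and names are my own.

```python
from collections import deque

# Each state records which bank ("L" or "R") the farmer, wolf, goat and cabbage occupy.
ITEMS = ("farmer", "wolf", "goat", "cabbage")
START = ("L", "L", "L", "L")
GOAL = ("R", "R", "R", "R")

def is_safe(state):
    """Unsafe if the wolf is with the goat, or the goat with the cabbage,
    on a bank the farmer is not on."""
    farmer, wolf, goat, cabbage = state
    if goat == wolf and farmer != goat:
        return False
    if goat == cabbage and farmer != goat:
        return False
    return True

def neighbours(state):
    """The farmer crosses alone, or with one item that shares his bank."""
    farmer = state[0]
    other = "R" if farmer == "L" else "L"
    moves = [(0,)]  # crossing alone
    moves += [(0, i) for i in range(1, 4) if state[i] == farmer]
    for move in moves:
        new = list(state)
        for i in move:
            new[i] = other
        new = tuple(new)
        if is_safe(new):
            yield new

def solve():
    """Breadth-first search from START to GOAL; returns the sequence of states."""
    queue = deque([[START]])
    seen = {START}
    while queue:
        path = queue.popleft()
        if path[-1] == GOAL:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

if __name__ == "__main__":
    for step in solve():
        print(dict(zip(ITEMS, step)))
```

The search space has at most sixteen states, which is why the puzzle is often cited as something a child, or a few lines of code, can work through by exhaustive reasoning alone.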
This dependence on training data became even clearer when variations of the puzzle were introduced that required only simple logic but deviated from the formats in ChatGPT’s training data. One such test, conducted by the British mathematician Sir William Timothy Gowers, posed a simplified version of the puzzle that the model failed to solve correctly. This suggested that, rather than reasoning through the problem logically, the model was trying to recall a pre-existing answer.
Another prominent AI model, Claude 3.5 Sonnet, showed similar struggles when faced with the problem, and Meta AI, powered by Llama 3, also failed to solve the puzzle when it was posed. LeCun attributed these failures to the inherent limitations of such models, stating that LLMs lack common sense, an understanding of the world and the ability to reason.
However, this may not be the full picture. The wording of the prompt appears to play a large role in these interactions: when the vocabulary used to pose the puzzle diverged from that of the traditional versions, the LLMs were often able to solve it. This suggests that the models are heavily influenced by the wording used, recalling solutions to similar problems seen in training rather than reasoning about the problem as actually posed.
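One way to probe this sensitivity yourself is to send the same underlying puzzle to a model under several paraphrases and compare the answers. The sketch below assumes the official OpenAI Python client and the “gpt-4o” model name; the prompt variants are illustrative, not the ones used in the experiments described above.

```python
# Illustrative sketch: probe how prompt wording affects an LLM's answer.
# Assumes the OpenAI Python client (`pip install openai`) and an
# OPENAI_API_KEY in the environment; model name and prompts are examples only.
from openai import OpenAI

client = OpenAI()

VARIANTS = [
    # Canonical wording, close to what appears in training data.
    "A farmer must cross a river with a wolf, a goat and a cabbage. "
    "The boat holds the farmer and one item. The wolf eats the goat, and the goat "
    "eats the cabbage, if left alone together. How does he get all three across?",
    # Same logic, unfamiliar vocabulary.
    "A courier must ferry a robot, a battery and a charger over a canal. "
    "The raft holds the courier and one object. The robot drains the battery, and the "
    "battery shorts the charger, if left together unattended. How does she move all three?",
]

for prompt in VARIANTS:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation
    )
    print(prompt[:60], "...")
    print(response.choices[0].message.content, "\n")
```

Comparing the two answers, and doing the same with simplified variants whose correct solution differs from the classic one, gives a rough sense of how much the model is pattern-matching on familiar phrasing.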
This doesn’t definitively answer whether AI models possess true intelligence or whether they’re simply machines that predict the next token. These results do, however, underscore the influence of training data on the outputs of models like GPT-4o, raising questions about whether AI’s apparent problem-solving reflects genuine reasoning or memorized recall. How these virtually ‘black box’ models work internally will remain a matter of debate until we gain more clarity on their underlying processes.