
Research from Stanford and UC Berkeley highlights how ChatGPT’s behaviour evolves over time.

Large Language Models (LLMs) such as GPT-3.5 and GPT-4 have recently garnered substantial attention in the Artificial Intelligence (AI) community for their ability to process vast amounts of data, detect patterns, and generate human-like language in response to prompts. These LLMs are updated over time, drawing upon new data and user feedback to improve their performance and adaptability. Yet the opaque nature of these updates makes it difficult to predict how a given modification will affect a model’s output, which creates challenges for anyone building larger systems that depend on that output.

This variability also undermines the reproducibility of results, since a model’s performance can shift from one point in time to another. A recent study examined two versions each of GPT-3.5 and GPT-4, released in March and June 2023, and highlighted the behavioural shifts these models undergo after updates. The study evaluated the models across a wide array of tasks, including answering survey questions, solving mathematical problems, writing code, answering US medical license exam questions, and visual reasoning.

The study found significant variability in the performance and behaviour of these models. One striking example was the drop in GPT-4’s accuracy at identifying whether a number is prime or composite, from 84% to 51% between March and June 2023. The authors attributed this largely to a weakened response to chain-of-thought prompts, which ask the model to reason step by step before answering. In contrast, GPT-3.5 improved markedly on the same task by June.
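To make the kind of measurement behind that 84%-to-51% figure concrete, here is a minimal sketch of how one might track a model snapshot’s accuracy on the prime-vs-composite task with a chain-of-thought prompt. It is not the authors’ released code: `ask_model` is a hypothetical placeholder for whatever chat-completion client is in use, and the prompt wording and answer parsing are illustrative assumptions.

```python
# Minimal sketch: score one model snapshot on the prime-vs-composite task.
# `ask_model` is a hypothetical stand-in for a real chat-completion call.
import random
from sympy import isprime  # ground-truth primality check

COT_PROMPT = (
    "Is {n} a prime number? Think step by step, "
    "then answer with a single word: Yes or No."
)

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for a call to a specific model snapshot."""
    raise NotImplementedError

def prime_task_accuracy(num_samples: int = 200, seed: int = 0) -> float:
    """Fraction of randomly drawn integers the model classifies correctly."""
    random.seed(seed)
    correct = 0
    for _ in range(num_samples):
        n = random.randint(1_000, 20_000)
        reply = ask_model(COT_PROMPT.format(n=n)).strip().lower()
        predicted_prime = reply.endswith("yes")  # crude parse of the final word
        if predicted_prime == isprime(n):
            correct += 1
    return correct / num_samples
```

Running the same harness against the March and June snapshots is what makes an accuracy change over time directly comparable.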

Other shifts were also noted: GPT-4 became less willing to answer sensitive or opinion-based questions, but handled multi-hop, knowledge-intensive questions better in June than in March. GPT-3.5, conversely, became worse at multi-hop questions, and both models produced more formatting mistakes in generated code. Notably, the research also found that GPT-4 followed user instructions less faithfully over time, highlighting how fluid LLM behaviour can be even over short time spans.
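The code-formatting problem typically shows up as generated code being wrapped in markdown fences, which makes the raw output non-executable as-is. Below is a small illustrative sketch, not the study’s evaluation code, of the kind of post-processing and executability check such a finding implies; the regex and helper names are assumptions.

```python
# Sketch: strip markdown fences from model-generated code and check whether
# the remaining text compiles as Python. Names and regex are illustrative.
import re

FENCE_RE = re.compile(r"^```[a-zA-Z0-9]*\n(.*?)\n```\s*$", re.DOTALL)

def strip_markdown_fences(model_output: str) -> str:
    """Return the fenced code body if present, otherwise the raw output."""
    match = FENCE_RE.match(model_output.strip())
    return match.group(1) if match else model_output

def is_directly_executable(code: str) -> bool:
    """True if the text compiles as Python without any cleanup."""
    try:
        compile(code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False
```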

The study therefore underscored the need for continuous monitoring and evaluation of LLMs to ensure their reliability and effectiveness across diverse applications. To encourage further research in this space, the researchers publicly released their curated questions, the corresponding GPT-3.5 and GPT-4 responses, and their analysis and visualization code, so that future work on LLM applications can build on a dependable, verifiable baseline.
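In the spirit of that monitoring recommendation, a simple drift check might run the same curated prompts against two model snapshots and report how often the answers change. The sketch below is an assumption about how such a harness could look: `query_snapshot` is a hypothetical client wrapper, the snapshot names are placeholders, and exact-match comparison is a deliberately simple stand-in for task-specific scoring.

```python
# Sketch: measure answer drift between two model snapshots on a fixed prompt set.
from typing import Callable, Sequence

def drift_report(
    prompts: Sequence[str],
    query_snapshot: Callable[[str, str], str],  # (snapshot_name, prompt) -> answer
    old_snapshot: str = "model-march",
    new_snapshot: str = "model-june",
) -> float:
    """Return the fraction of prompts whose answers differ between snapshots."""
    changed = 0
    for prompt in prompts:
        old_answer = query_snapshot(old_snapshot, prompt).strip()
        new_answer = query_snapshot(new_snapshot, prompt).strip()
        if old_answer != new_answer:
            changed += 1
    return changed / len(prompts) if prompts else 0.0
```

A report like this does not say which snapshot is better, only how much behaviour has moved, which is exactly the signal that would prompt a deeper, task-specific evaluation.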
