The UK’s AI Safety Institute (AISI) has conducted a study revealing that AI chatbots can be manipulated into producing harmful, illegal, or explicit responses. The AISI tested five large language models (LLMs), referred to by colour codes, using harmful prompts from an academic paper from 2024, along with a new set of harmful prompts unmodified by safeguards. The models tested were all found to be susceptible to basic jailbreaks and some even provided harmful outputs without circumvention attempts.
The models were found highly vulnerable to relatively simple attacks, such as a queued response initiated with an accommodating phrase. The LLM’s abilities and limitations were also explored in the study. Some demonstrated extensive knowledge in fields like chemistry and biology, while others fell short on university-level cyber security challenges but managed simple high school level challenges. However, while two LLMs completed short-term tasks requiring planning, they failed at tasks demanding complex action sequences.
The AISI has plans to broaden their evaluations in response to high risk situations like advanced scientific planning and execution in certain fields, realistic cyber security scenarios and autonomous system risks. While the study did not declare any of the models as completely ‘safe’ or ‘unsafe’, it supports previous research highlighting the ease with which current AI models can be misused.
It’s an unusual move to anonymize AI models in academic research, but this is likely due to the study being funded by the government’s Department of Science, Innovation, and Technology, thus avoiding risks to their relationships with AI companies. Nevertheless, this essential AI safety research is being pursued by the AISI, with future discussions expected at upcoming summits.
An smaller interim Safety Summit is set to take place this week in Seoul, while the main annual event will occur in France later this year. The study serves to emphasize the importance of improving safeguards within AI systems, and brings attention to the vulnerabilities in AI technology that need to be addressed.