Large language models (LLMs) are increasingly in use, which is leading to new cybersecurity risks. The risks stem from their main characteristics: enhanced capability for code creation, deployment for real-time code generation, automated execution within code interpreters, and integration into applications handling unprotected data. It brings about the need for a strong approach to cybersecurity assessments.
Previous work to assess the security attributes of LLMs includes open benchmark frameworks and position papers proposing evaluation criteria. Cybermetric, SecQA, and WMDP-Cyber use a multiple-choice format similar to educational assessments. CyberBench expands the assessment to various tasks within the cybersecurity domain, while LLM4Vuln focuses on finding vulnerabilities by coupling LLMs with external knowledge. Meanwhile, Rainbow Teaming, an application of CYBERSECEVAL 1, generates adversarial prompts automatically, akin to those used in cyberattack tests.
Meta researchers are presenting CYBERSECEVAL 2, a new standard for assessing security risks and capabilities in LLMs. These include prompt injection and code interpreter testing. The benchmark’s open-source code makes it easy to evaluate other LLMs. The research also introduces the concept of safety-utility trade-off measured by the False Refusal Rate (FRR). This showcases the propensity of LLMs to reject both unsafe and benign prompts, which affects utility.
In CYBERSECEVAL 2, prompt injection assessments fall into logic-violating and security-violating types. Tests for vulnerability exploitation focus on challenging but solvable scenarios to test LLMs’ general reasoning abilities. In assessing code interpreter abuse, conditioning of the LLM is given priority along with unique categories of abuse, while another LLM acts as the judge, assessing the compliance of the generated code.
The results from CYBERSECEVAL 2 show a decrease in LLM compliance to requests for assistance with cyberattacks from 52% to 28%, which suggests a growing security consciousness. Non-code-specialised models like Llama 3 were better at non-compliance, whereas CodeLlama-70B-Instruct came close to state-of-the-art performance. Variations were revealed in FRR with CodeLlama-70B showing a notably high FRR. All models displayed vulnerability to prompt injection, succumbing to such attempts at rates exceeding 17.1%. Tests of code exploitation and interpreter abuse underscored the limitations of LLMs and emphasised the need for improved security measures.
The research introduced robust prompt injection tests, with an evaluation of 15 attack categories on LLMs. It also launched assessments measuring LLM compliance with instructions to compromise attached code interpreters. Additionally, it brought in an assessment suite to measure LLM capabilities in creating exploits in C, Python, and Javascript. The key findings highlight the persistent vulnerabilities in prompt injection, the effective use of False Refusal Rate, and the need for further research on exploit generation tasks.
The research and its outcomes are available for further study on their paper and GitHub page.