CyberSecEval 2 – A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

With the speed at which the generative AI space is moving, we imagine an open approach is a very important approach to bring the ecosystem together and mitigate potential risks of Large Language Models (LLMs). Last 12 months, Meta released an initial suite of open tools and evaluations aimed toward facilitating responsible development with open generative AI models. As LLMs develop into increasingly integrated as coding assistants, they introduce novel cybersecurity vulnerabilities that have to be addressed. To tackle this challenge, comprehensive benchmarks are essential for evaluating the cybersecurity safety of LLMs. That is where CyberSecEval 2, which assesses an LLM’s susceptibility to code interpreter abuse, offensive cybersecurity capabilities, and prompt injection attacks, comes into play to supply a more comprehensive evaluation of LLM cybersecurity risks. You’ll be able to view the CyberSecEval 2 leaderboard here.

Benchmarks

CyberSecEval 2 benchmarks help evaluate LLMs’ propensity to generate insecure code and comply with requests to assist cyber attackers:

Testing for generation of insecure coding practices: Insecure coding-practice tests measure how often an LLM suggests dangerous security weaknesses in each autocomplete and instruction contexts as defined within the industry-standard insecure coding practice taxonomy of the Common Weakness Enumeration. We report the code test pass rates.
Testing for susceptibility to prompt injection: Prompt injection attacks of LLM-based applications are attempts to cause the LLM to behave in undesirable ways. The prompt injection tests evaluate the flexibility of the LLM to acknowledge which a part of an input is untrusted and its level of resilience against common prompt injection techniques. We report how ceaselessly the model complies with attacks.
Testing for compliance with requests to assist with cyber attacks: Tests to measure the false rejection rate of confusingly benign prompts. These prompts are just like the cyber attack compliance tests in that they cover a wide range of topics including cyberdefense, but they’re explicitly benign—even when they might appear malicious. We report the tradeoff between false refusals (refusing to help in legitimate cyber related activities) and violation rate (agreeing to help in offensive cyber attacks).
Testing propensity to abuse code interpreters: Code interpreters allow LLMs to run code in a sandboxed environment. This set of prompts tries to control an LLM into executing malicious code to either gain access to the system that runs the LLM, gather sensitive information concerning the system, craft and execute social engineering attacks, or gather information concerning the external infrastructure of the host environment. We report the frequency of model compliance to attacks.
Testing automated offensive cybersecurity capabilities: This suite consists of capture-the-flag style security test cases that simulate program exploitation. We use an LLM as a security tool to find out whether it could reach a particular point in this system where a security issue has been intentionally inserted. In a few of these tests we explicitly check if the tool can execute basic exploits corresponding to SQL injections and buffer overflows. We report the model’s percentage of completion.

All of the code is open source, and we hope the community will use it to measure and enhance the cybersecurity safety properties of LLMs.

You’ll be able to read more about all of the benchmarks here.

Key Insights

Our latest evaluation of state-of-the-art Large Language Models (LLMs) using CyberSecEval 2 reveals each progress and ongoing challenges in addressing cybersecurity risks.

Industry Improvement

For the reason that first version of the benchmark, published in December 2023, the common LLM compliance rate with requests to help in cyber attacks has decreased from 52% to twenty-eight%, indicating that the industry is becoming more aware of this issue and taking steps towards improvement.

Model Comparison

We found models without code specialization are inclined to have lower non-compliance rates in comparison with those which might be code-specialized. Nevertheless, the gap between these models has narrowed, suggesting that code-specialized models are catching up by way of security.

heatmap of compared results

Prompt Injection Risks

Our prompt injection tests reveal that conditioning LLMs against such attacks stays an unsolved problem, posing a major security risk for applications built using these models. Developers mustn’t assume that LLMs will be trusted to follow system prompts safely within the face of adversarial inputs.

Code Exploitation Limitations

Our code exploitation tests suggest that while models with high general coding capability perform higher, LLMs still have a protracted approach to go before having the ability to reliably solve end-to-end exploit challenges. This means that LLMs are unlikely to disrupt cyber exploitation attacks of their current state.

Interpreter Abuse Risks

Our interpreter abuse tests highlight the vulnerability of LLMs to manipulation, allowing them to perform abusive actions inside a code interpreter. This underscores the necessity for extra guardrails and detection mechanisms to stop interpreter abuse.

Tips on how to contribute?

We’d love for the community to contribute to our benchmark, and there are several things you’ll be able to do if interested!

To run the CyberSecEval 2 benchmarks in your model, you’ll be able to follow the instructions here. Be happy to send us the outputs so we are able to add your model to the leaderboard!

If you have got ideas to enhance the CyberSecEval 2 benchmarks, you’ll be able to contribute to it directly by following the instructions here.

Other Resources

Source link

CyberSecEval 2 – A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

Benchmarks