CyberSecEval 4

CyberSecEval 4 is an evaluation suite covering cybersecurity-related capabilities and risks of large language models. The insecure-code-generation tracks measure whether a model produces vulnerable code: the Instruct track presents coding requests designed to elicit known insecure patterns, while the Autocomplete track prompts the model with code context leading up to a known insecure pattern, with vulnerabilities detected via static analysis.
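The static-analysis step described above can be illustrated with a minimal sketch. The rule names, regular expressions, sample completions, and the `pass_rate` helper are all hypothetical and chosen for illustration; they are not the benchmark's actual detector or rule set, and this page does not state how the leaderboard score is derived from such checks.

```python
import re

# Hypothetical rules for illustration only; not the benchmark's actual rule set.
INSECURE_PATTERNS = {
    "weak-hash-md5": re.compile(r"\bhashlib\.md5\s*\("),
    "shell-injection": re.compile(r"\bos\.system\s*\("),
    "unsafe-deserialization": re.compile(r"\bpickle\.loads\s*\("),
}


def find_insecure_patterns(code: str) -> list[str]:
    """Return the names of all rules that match the generated code."""
    return [name for name, pattern in INSECURE_PATTERNS.items() if pattern.search(code)]


def pass_rate(completions: list[str]) -> float:
    """Fraction of completions in which no rule fires."""
    if not completions:
        raise ValueError("need at least one completion")
    clean = sum(1 for code in completions if not find_insecure_patterns(code))
    return clean / len(completions)


# Made-up model outputs standing in for Instruct or Autocomplete completions.
completions = [
    "import hashlib\ndigest = hashlib.sha256(data).hexdigest()",
    "import hashlib\ndigest = hashlib.md5(data).hexdigest()",
    "import os\nos.system('ls ' + user_input)",
    "import json\nobj = json.loads(payload)",
]

print(find_insecure_patterns(completions[1]))  # ['weak-hash-md5']
print(pass_rate(completions))  # 0.5
```

Pattern matching of this kind is cheap and deterministic, but it can miss vulnerabilities that depend on data flow and can flag code that is safe in context, so a result from such a sketch is only a rough signal.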

MAI-Thinking-1 from Microsoft currently leads the CyberSecEval 4 leaderboard with a score of 0.630; it is the only model evaluated so far.

About this benchmark

What CyberSecEval 4 measures

CyberSecEval 4 is a text benchmark that evaluates large language models on safety and code tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the average score and the leading score are both 0.630.


Publication

Paper: CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
Authors: Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, and 9 others

Abstract

We are releasing a new suite of security benchmarks for LLMs, CYBERSECEVAL 3, to continue the conversation on empirically measuring LLM cybersecurity risks and capabilities. CYBERSECEVAL 3 assesses 8 different risks across two broad categories: risk to third parties, and risk to application developers and end users. Compared to previous work, we add new areas focused on offensive security capabilities: automated social engineering, scaling manual offensive cyber operations, and autonomous offensive cyber operations. In this paper we discuss applying these benchmarks to the Llama 3 models and a suite of contemporaneous state-of-the-art LLMs, enabling us to contextualize risks both with and without mitigations in place.

Microsoft MAI-Thinking-1 leads with 63.0%.

Progress Over Time

[Interactive timeline: model performance on CyberSecEval 4 over time. Legend: state-of-the-art frontier, open models, proprietary models.]

CyberSecEval 4 Leaderboard

1 model
Columns: Context, Cost, License
1
Microsoft
1.0T

FAQ

Common questions about CyberSecEval 4.

What is the CyberSecEval 4 benchmark?

CyberSecEval 4 is an evaluation suite covering cybersecurity-related capabilities and risks of large language models. The insecure-code-generation tracks measure whether a model produces vulnerable code: the Instruct track presents coding requests designed to elicit known insecure patterns, while the Autocomplete track prompts the model with code context leading up to a known insecure pattern, with vulnerabilities detected via static analysis.

What is the CyberSecEval 4 leaderboard?

The CyberSecEval 4 leaderboard ranks AI models by their performance on this benchmark; it currently lists 1 model. MAI-Thinking-1 by Microsoft leads with a score of 0.630. With a single entry, the average score across all models is also 0.630.

What is the highest CyberSecEval 4 score?

The highest CyberSecEval 4 score is 0.630, achieved by MAI-Thinking-1 from Microsoft.

How many models are evaluated on CyberSecEval 4?

1 model has been evaluated on the CyberSecEval 4 benchmark, with 0 verified results and 1 self-reported result.

Where can I find the CyberSecEval 4 paper?

The paper linked for CyberSecEval 4 is available at https://arxiv.org/abs/2408.01605. Note that the title listed under Publication refers to CYBERSECEVAL 3. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does CyberSecEval 4 cover?

CyberSecEval 4 is categorized under safety and code. The benchmark evaluates text models.

How recent are the CyberSecEval 4 leaderboard results?

The CyberSecEval 4 leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code
97 models
LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

code
73 models
HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.

code
66 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' tool use ability to operate a computer via terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

code
44 models
SWE-bench Multilingual

A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

code
30 models
SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

code
26 models