CyberSecEval 4

Name: CyberSecEval 4 Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

CyberSecEval 4 Leaderboard

1 model

				Context	Cost	License
1	MAI-Thinking-1 Microsoft		1.0T	—	—

About this benchmark

What is CyberSecEval 4?

CyberSecEval 4 is an evaluation suite covering cybersecurity-related capabilities and risks of large language models. The insecure-code-generation tracks measure whether a model produces vulnerable code: the Instruct track presents coding requests designed to elicit known insecure patterns, while the Autocomplete track prompts the model with code context leading up to a known insecure pattern, with vulnerabilities detected via static analysis.
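The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual tooling: the regex rules in `INSECURE_PATTERNS`, the `Sample` type, and the `pass_rate_by_track` helper are all hypothetical stand-ins for a real static analyzer, and the page does not say whether the leaderboard score is computed this way.

```python
import re
from dataclasses import dataclass

# Hypothetical rules standing in for a real static analyzer.
# These patterns are illustrative assumptions, not the benchmark's rule set.
INSECURE_PATTERNS = {
    "weak-hash": re.compile(r"hashlib\.(md5|sha1)\("),
    "shell-injection": re.compile(r"shell\s*=\s*True"),
    "unsafe-deserialization": re.compile(r"pickle\.loads?\("),
}


@dataclass
class Sample:
    track: str        # "instruct" or "autocomplete"
    completion: str   # code produced by the model


def find_issues(code: str) -> list[str]:
    """Return the names of all hypothetical rules that match the code."""
    return [name for name, pattern in INSECURE_PATTERNS.items()
            if pattern.search(code)]


def pass_rate_by_track(samples: list[Sample]) -> dict[str, float]:
    """Fraction of completions per track with no detected insecure pattern."""
    totals: dict[str, int] = {}
    passed: dict[str, int] = {}
    for sample in samples:
        totals[sample.track] = totals.get(sample.track, 0) + 1
        if not find_issues(sample.completion):
            passed[sample.track] = passed.get(sample.track, 0) + 1
    return {track: passed.get(track, 0) / count
            for track, count in totals.items()}


samples = [
    Sample("instruct", "digest = hashlib.md5(data).hexdigest()"),
    Sample("instruct", "digest = hashlib.sha256(data).hexdigest()"),
    Sample("autocomplete", "subprocess.run(cmd, shell=True)"),
]
print(pass_rate_by_track(samples))
```

With these three toy completions, one of the two Instruct samples and the single Autocomplete sample are flagged, so the sketch reports 0.5 for `instruct` and 0.0 for `autocomplete`.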

CyberSecEval 4 is a text benchmark evaluating models on safety and code tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the current average and the leading score are both 0.630.

Current leaders

MAI-Thinking-1 from Microsoft currently leads the CyberSecEval 4 leaderboard with a score of 0.630. It is the only model evaluated so far.

MAI-Thinking-1 (Microsoft): 63.0%
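The relationship between the 0–1 score, the percentage shown, and the leaderboard average can be sketched as below. The `Entry` type and `summarize` helper are hypothetical names introduced for illustration; the only data used is the single entry reported on this page.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    model: str
    organization: str
    score: float  # reported on a 0-1 scale


def summarize(entries: list[Entry]) -> dict:
    """Compute the leader, its score as a percentage, and the mean score."""
    if not entries:
        raise ValueError("leaderboard has no entries")
    leader = max(entries, key=lambda e: e.score)
    average = sum(e.score for e in entries) / len(entries)
    return {
        "leader": leader.model,
        "leader_percent": f"{leader.score:.1%}",
        "average": round(average, 3),
        "count": len(entries),
    }


# The single entry listed on the page.
board = [Entry("MAI-Thinking-1", "Microsoft", 0.630)]
print(summarize(board))
```

With one entry, the average necessarily equals the leading score, which is why both figures on this page are 0.630.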

Source paper

Title: CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
Authors: Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, and 9 others
Published: August 2, 2024
arXiv: 2408.01605

Abstract

We are releasing a new suite of security benchmarks for LLMs, CYBERSECEVAL 3, to continue the conversation on empirically measuring LLM cybersecurity risks and capabilities. CYBERSECEVAL 3 assesses 8 different risks across two broad categories: risk to third parties, and risk to application developers and end users. Compared to previous work, we add new areas focused on offensive security capabilities: automated social engineering, scaling manual offensive cyber operations, and autonomous offensive cyber operations. In this paper we discuss applying these benchmarks to the Llama 3 models and a suite of contemporaneous state-of-the-art LLMs, enabling us to contextualize risks both with and without mitigations in place.

FAQ

Common questions about the CyberSecEval 4 benchmark and leaderboard.

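What is the CyberSecEval 4 benchmark?

CyberSecEval 4 is an evaluation suite covering cybersecurity-related capabilities and risks of large language models.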

What is the CyberSecEval 4 leaderboard?

The CyberSecEval 4 leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: MAI-Thinking-1 by Microsoft, with a score of 0.630. Because there is only one entry, the average score is also 0.630.

What is the highest CyberSecEval 4 score?

The highest CyberSecEval 4 score is 0.630, achieved by MAI-Thinking-1 from Microsoft.

How many models are evaluated on CyberSecEval 4?

One model has been evaluated on the CyberSecEval 4 benchmark, with 0 verified results and 1 self-reported result.

Where can I find the CyberSecEval 4 paper?

The source paper listed for this benchmark is available at https://arxiv.org/abs/2408.01605; its title refers to CYBERSECEVAL 3. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does CyberSecEval 4 cover?

CyberSecEval 4 is categorized under safety and code. The benchmark evaluates text models.

How recent are the CyberSecEval 4 leaderboard results?

The CyberSecEval 4 leaderboard was last updated in July 2026 and currently includes 1 evaluated model.