CyBench

CyBench is a suite of Capture-the-Flag (CTF) challenges that measures agentic cyber attack capabilities. It evaluates dual-use cybersecurity knowledge and reports the "unguided success rate": how often agents complete tasks end-to-end without guidance on appropriate subtasks.
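
As a rough illustration of the metric, the sketch below assumes each task is scored pass/fail by whether the agent captured the flag without subtask hints. The function name, task names, and outcomes are hypothetical and are not CyBench data or its official scoring code.

```python
def unguided_success_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks solved end-to-end without subtask guidance."""
    if not results:
        raise ValueError("no task results provided")
    # True counts as 1 and False as 0, so the sum is the number of solved tasks.
    return sum(results.values()) / len(results)


# Hypothetical outcomes: True means the agent captured the flag unaided.
example_results = {
    "crypto_task": True,
    "web_task": False,
    "pwn_task": True,
    "forensics_task": False,
}

rate = unguided_success_rate(example_results)
print(f"{rate:.1%}")  # 50.0%
```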

Of the 2 AI models evaluated, Claude Mythos Preview from Anthropic currently leads the CyBench leaderboard with a score of 1.000.

Claude Mythos Preview leads with 100.0%, followed by Grok-4.1 Thinking at 39.0%.

Progress Over Time

[Interactive timeline showing model performance evolution on CyBench. Legend: state-of-the-art frontier, open models, proprietary models.]

CyBench Leaderboard

2 models

Rank	Model	Organization	Score	Context	Cost	License
1	Claude Mythos Preview	Anthropic	100.0%	—	—	—
2	Grok-4.1 Thinking	xAI	39.0%	—	—	—

FAQ

Common questions about CyBench.

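What is the CyBench benchmark?

CyBench is a suite of Capture-the-Flag (CTF) challenges that measures agentic cyber attack capabilities, reported as an unguided success rate.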

What is the CyBench leaderboard?

The CyBench leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 1.000. The average score across all models is 0.695.
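
The reported average is consistent with the two scores listed on this page (100.0% and 39.0%). A minimal check, assuming the average is a plain arithmetic mean over the listed models:

```python
from statistics import mean

# Scores as listed on this page, expressed as fractions of 1.
scores = {
    "Claude Mythos Preview": 1.000,
    "Grok-4.1 Thinking": 0.390,
}

average = mean(scores.values())
print(f"{average:.3f}")  # 0.695
```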

What is the highest CyBench score?

The highest CyBench score is 1.000, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on CyBench?

Two models have been evaluated on the CyBench benchmark. Both results are self-reported; none has been verified.

Where can I find the CyBench paper?

The CyBench paper is available at https://arxiv.org/abs/2408.08926. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does CyBench cover?

CyBench is categorized under code, safety, and agents. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code

91 models

LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

code

71 models

HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.

code

66 models

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers.

agents

46 models

Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' tool use ability to operate a computer via terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

code

41 models

SWE-bench Multilingual

A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

code

28 models