CyberGym Leaderboard

Progress Over Time

[Interactive timeline of model performance on CyberGym over time, with the state-of-the-art frontier marked and models grouped as open or proprietary.]

CyberGym Leaderboard

6 models

Rank	Model	Organization	Parameters	Context	Cost
1	Claude Mythos Preview	Anthropic	—	—	$25.00 / $125.00
2	GPT-5.5	OpenAI	—	1.0M	$5.00 / $30.00
3	Claude Opus 4.6	Anthropic	—	1.0M	$5.00 / $25.00
4	Claude Opus 4.7	Anthropic	—	1.0M	$5.00 / $25.00
5	GLM-5.1	Zhipu AI	754B	200K	$1.40 / $4.40
6	Kimi K2.5	Moonshot AI	1.0T	262K	$0.60 / $3.00

FAQ

Common questions about CyberGym

What is the CyberGym benchmark?

CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

What is the CyberGym leaderboard?

The CyberGym leaderboard ranks 6 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.831. The average score across all models is 0.703.

What is the highest CyberGym score?

The highest CyberGym score is 0.831, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on CyberGym?

6 models have been evaluated on the CyberGym benchmark, with 0 verified results and 6 self-reported results.

What categories does CyberGym cover?

CyberGym is categorized under agents, code, and safety. The benchmark evaluates text models.
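
The FAQ figures above (6 models, an average score of 0.703, a top score of 0.831) are enough to estimate how far the leader sits from the rest of the field. The sketch below is only a back-of-the-envelope calculation: the individual scores of the other five models are not listed on this page, the function name is hypothetical, and both inputs are rounded to three decimals, so the result is approximate.

```python
# Figures quoted in the FAQ; both scores are rounded to three decimals,
# so anything derived from them is approximate.
NUM_MODELS = 6
MEAN_SCORE = 0.703
TOP_SCORE = 0.831


def mean_of_rest(num_models, mean_score, top_score):
    """Mean score of all models except the top one (hypothetical helper)."""
    total = num_models * mean_score  # sum of all scores implied by the mean
    return (total - top_score) / (num_models - 1)


rest_mean = mean_of_rest(NUM_MODELS, MEAN_SCORE, TOP_SCORE)
gap = TOP_SCORE - rest_mean

print(f"mean of the other models: {rest_mean:.3f}")
print(f"leader's margin over it:  {gap:.3f}")
```

Under these assumptions the other five models average roughly 0.677, which puts the leader about 0.154 ahead of that mean.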