CyberGym

CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

Progress Over Time

[Interactive timeline showing model performance evolution on CyberGym, with a state-of-the-art frontier line and markers distinguishing open from proprietary models]

CyberGym Leaderboard

6 models evaluated. ("n/a" marks fields not shown on the page.)

Rank  Model                  Provider     Score  Params  Context  Cost (input / output, per 1M tokens)
1     Claude Mythos Preview  Anthropic    0.831  n/a     n/a      $25.00 / $125.00
2     n/a                    OpenAI       n/a    n/a     1.0M     $5.00 / $30.00
3     n/a                    n/a          n/a    n/a     1.0M     $5.00 / $25.00
4     n/a                    n/a          n/a    n/a     1.0M     $5.00 / $25.00
5     n/a                    Zhipu AI     n/a    754B    200K     $1.40 / $4.40
6     n/a                    Moonshot AI  n/a    1.0T    262K     $0.60 / $3.00
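The cost column appears to list input and output prices per million tokens, a common leaderboard convention assumed here rather than stated explicitly on the page. Under that assumption, a minimal sketch of estimating what a single evaluation run would cost:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Estimate run cost in USD, with prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Rank-1 pricing from the table: $25.00 input / $125.00 output per 1M tokens.
# A hypothetical run consuming 100K input tokens and producing 10K output tokens:
cost = run_cost(100_000, 10_000, 25.00, 125.00)
print(f"${cost:.2f}")  # → $3.75
```

The token counts are illustrative only; actual CyberGym runs will vary per task and per agent.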

FAQ

Common questions about CyberGym

What is CyberGym?
CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

Which model currently leads the leaderboard?
The CyberGym leaderboard ranks 6 AI models by their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.831; the average score across all models is 0.703.

What is the highest CyberGym score?
The highest CyberGym score is 0.831, achieved by Claude Mythos Preview from Anthropic.

How many models have been evaluated?
6 models have been evaluated on the CyberGym benchmark, with 0 verified results and 6 self-reported results.

Which categories does CyberGym fall under?
CyberGym is categorized under agents, code, and safety. The benchmark evaluates text models.