CyberGym
CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.
Progress Over Time
[Interactive timeline showing model performance evolution on CyberGym; the chart highlights the state-of-the-art frontier and marks models as open or proprietary]
CyberGym Leaderboard
6 models
| Rank | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | — | — | $25.00 / $125.00 | — |
| 2 | GPT-5.5 | OpenAI | — | 1.0M | $5.00 / $30.00 | — |
| 3 | — | Anthropic | — | 1.0M | $5.00 / $25.00 | — |
| 4 | — | Anthropic | — | 1.0M | $5.00 / $25.00 | — |
| 5 | — | Zhipu AI | 754B | 200K | $1.40 / $4.40 | — |
| 6 | — | Moonshot AI | 1.0T | 262K | $0.60 / $3.00 | — |
FAQ
Common questions about CyberGym
The CyberGym leaderboard ranks 6 AI models by their scores on this benchmark. Claude Mythos Preview by Anthropic currently leads with a score of 0.831; the average score across all ranked models is 0.703.
The highest CyberGym score is 0.831, achieved by Claude Mythos Preview from Anthropic.
6 models have been evaluated on the CyberGym benchmark. All 6 results are self-reported; none have been independently verified.
CyberGym is categorized under agents, code, and safety, and evaluates text-based models.