CyberGym

CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

Claude Mythos Preview from Anthropic currently leads the CyberGym leaderboard with a score of 0.831, ranking first among the 7 AI models evaluated.