
Cybersecurity CTFs

Cybersecurity CTFs is a Capture the Flag (CTF) benchmark for evaluating LLMs on offensive security challenges. It contains a diverse set of tasks, spanning cryptography, web exploitation, binary analysis, and forensics, to assess AI problem-solving capability in cybersecurity.
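To make the evaluation setup concrete, here is a minimal sketch of how a CTF-style harness might score a model: a challenge counts as solved when the known flag string appears in the model's transcript. All names here (the `flag{...}` format, `score`, the sample challenge IDs) are illustrative assumptions, not the benchmark's actual implementation.

```python
import re

# Assumed flag format: the conventional flag{...} wrapper used by many CTFs.
FLAG_PATTERN = re.compile(r"flag\{[^}]+\}")

def extract_flags(transcript: str) -> set[str]:
    """Collect every flag-shaped string the model emitted."""
    return set(FLAG_PATTERN.findall(transcript))

def score(transcripts: dict[str, str], answers: dict[str, str]) -> float:
    """Fraction of challenges whose known flag appears in the model's transcript."""
    solved = sum(
        1
        for challenge, flag in answers.items()
        if flag in extract_flags(transcripts.get(challenge, ""))
    )
    return solved / len(answers) if answers else 0.0

# Hypothetical run over two challenges, both solved.
transcripts = {
    "crypto-01": "...the decrypted text is flag{rot13_is_not_crypto}",
    "web-02": "payload worked: flag{sqli_basics}",
}
answers = {
    "crypto-01": "flag{rot13_is_not_crypto}",
    "web-02": "flag{sqli_basics}",
}
print(score(transcripts, answers))  # prints 1.0
```

Exact-match flag checking is what makes CTF tasks attractive as a benchmark: success is binary and automatically verifiable, with no judge model needed.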

Paper: https://arxiv.org/abs/2406.05590

Progress Over Time

[Interactive timeline: evolution of model performance on Cybersecurity CTFs over time, showing the state-of-the-art frontier for open and proprietary models]

Cybersecurity CTFs Leaderboard

3 models • 0 verified

Rank  Model                    Context  Cost (per 1M tokens)  License
1     GPT-5.3 Codex (OpenAI)   400K     $1.75 / $14.00        –
2     (unlisted)               200K     $1.00 / $5.00         –
3     (unlisted, OpenAI)       128K     $3.00 / $12.00        –

FAQ

Common questions about Cybersecurity CTFs

Q: What is Cybersecurity CTFs?
A: A Capture the Flag (CTF) benchmark that evaluates LLMs on offensive security challenges spanning cryptography, web exploitation, binary analysis, and forensics.

Q: Where can I read more about the benchmark?
A: The Cybersecurity CTFs paper is available at https://arxiv.org/abs/2406.05590. It details the benchmark methodology, dataset creation, and evaluation criteria.

Q: How are models ranked on the leaderboard?
A: The leaderboard ranks 3 AI models by their performance on this benchmark. GPT-5.3 Codex by OpenAI currently leads with a score of 0.776; the average score across all models is 0.511.

Q: What is the highest score?
A: The highest Cybersecurity CTFs score is 0.776, achieved by GPT-5.3 Codex from OpenAI.

Q: How many models have been evaluated?
A: 3 models have been evaluated on the benchmark, with 0 verified results and 3 self-reported results.

Q: How is the benchmark categorized?
A: Cybersecurity CTFs falls under the safety category and evaluates text models.