Cybersecurity CTFs

Progress Over Time

[Interactive timeline: model performance over time on Cybersecurity CTFs, with the state-of-the-art frontier marked for open and proprietary models.]

Cybersecurity CTFs Leaderboard

3 models
Rank  Context  Cost             License
1     400K     $1.75 / $14.00
2     200K     $1.00 / $5.00
3
About this benchmark

What is Cybersecurity CTFs?

Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.

Cybersecurity CTFs is a text benchmark categorized under safety. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is 0.511, with the leader at 0.776.

Current leaders

GPT-5.3 Codex from OpenAI currently leads the Cybersecurity CTFs leaderboard with a score of 0.776, the highest of the 3 evaluated AI models.

1. GPT-5.3 Codex (OpenAI): 77.6%
2. Claude Haiku 4.5 (Anthropic): 46.9%
3. o1-mini (OpenAI): 28.7%

Source paper

Title
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
Authors
Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, and 9 others
Published
Abstract

Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our benchmark dataset open source to public https://github.com/NYU-LLM-CTF/NYU_CTF_Bench along with our playground automated framework https://github.com/NYU-LLM-CTF/llm_ctf_automation.

FAQ

Common questions about the Cybersecurity CTFs benchmark and leaderboard.

What is the Cybersecurity CTFs benchmark?

Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.

What is the Cybersecurity CTFs leaderboard?

The Cybersecurity CTFs leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, GPT-5.3 Codex by OpenAI leads with a score of 0.776. The average score across all models is 0.511.
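
The leader and average quoted above can be checked from the three listed scores. The snippet below is a minimal sketch, assuming the average is a plain arithmetic mean rounded to three decimals; the page does not state how it is computed.

```python
# Scores as listed on the leaderboard (0-1 scale).
scores = {
    "GPT-5.3 Codex": 0.776,
    "Claude Haiku 4.5": 0.469,
    "o1-mini": 0.287,
}

# Leader: the model with the highest score.
leader = max(scores, key=scores.get)

# Average: assumed to be the unweighted mean of all listed scores.
average = sum(scores.values()) / len(scores)

print(f"Leader: {leader} ({scores[leader]:.3f})")
print(f"Average: {average:.3f}")
```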

What is the highest Cybersecurity CTFs score?

The highest Cybersecurity CTFs score is 0.776, achieved by GPT-5.3 Codex from OpenAI.

How many models are evaluated on Cybersecurity CTFs?

Three models have been evaluated on the Cybersecurity CTFs benchmark. All 3 results are self-reported; none are verified.

Where can I find the Cybersecurity CTFs paper?

The Cybersecurity CTFs paper is available at https://arxiv.org/abs/2406.05590. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Cybersecurity CTFs cover?

Cybersecurity CTFs is categorized under safety. The benchmark evaluates text models.

Which model offers the best value on Cybersecurity CTFs?

GPT-5.3 Codex from OpenAI, the leader itself, is the only model scoring within 10% of the top score, so it is also the cheapest in that range, at $1.75 per million input tokens with a score of 0.776.
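
The selection rule described here can be sketched as a filter followed by a minimum. This is an illustration under stated assumptions, not the site's actual logic: "within 10%" is read as a score of at least 90% of the leader's, the $1.00 input price is assumed to belong to the second-ranked model, and the `Entry` type and `best_value` function are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    score: float
    # USD per million input tokens; None when no price is listed.
    input_price: Optional[float]


# Assumption: prices are matched to models by leaderboard rank.
entries = [
    Entry("GPT-5.3 Codex", 0.776, 1.75),
    Entry("Claude Haiku 4.5", 0.469, 1.00),
    Entry("o1-mini", 0.287, None),
]


def best_value(models: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced model scoring within `tolerance` of the top score."""
    top = max(m.score for m in models)
    threshold = top * (1 - tolerance)  # assumed reading of "within 10%"
    candidates = [
        m for m in models
        if m.score >= threshold and m.input_price is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.input_price)


winner = best_value(entries)
print(winner.name if winner else "no priced model in range")
```

With the three listed scores, only the leader clears the threshold, so a cheaper but much lower-scoring model is never considered.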

How recent are the Cybersecurity CTFs leaderboard results?

The Cybersecurity CTFs leaderboard was last updated in June 2026 and currently includes 3 evaluated models.