Cybersecurity CTFs

Cybersecurity CTFs is a Capture the Flag (CTF) benchmark for evaluating LLMs on offensive security challenges. It contains diverse cybersecurity tasks, including cryptography, web exploitation, binary analysis, and forensics, to assess AI capabilities in cybersecurity problem-solving.
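
To make the task categories concrete, here is a toy example of the kind of cryptography challenge such CTF benchmarks include: recovering a flag that was XOR-encrypted with a single unknown byte. The flag value and key are invented for illustration and are not taken from the benchmark itself.

```python
# Toy CTF-style crypto task (illustrative only): the flag below was XORed
# with a single-byte key; recover it by brute-forcing all 256 keys and
# keeping the candidate that matches the flag format.
ciphertext = bytes(b ^ 0x2A for b in b"flag{example}")

def solve(ct: bytes) -> bytes:
    for key in range(256):
        pt = bytes(b ^ key for b in ct)
        if pt.startswith(b"flag{") and pt.endswith(b"}"):
            return pt
    raise ValueError("no key found")

print(solve(ciphertext).decode())  # prints: flag{example}
```

Real benchmark tasks are far harder, but they share this shape: a known answer format (the flag) plus an exploitable weakness the model must find end-to-end.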

GPT-5.3 Codex from OpenAI currently leads the 3-model Cybersecurity CTFs leaderboard with a score of 0.776.

Paper: https://arxiv.org/abs/2406.05590

OpenAI's GPT-5.3 Codex leads with 77.6%, followed by Anthropic's Claude Haiku 4.5 at 46.9% and OpenAI's o1-mini at 28.7%.

Progress Over Time

Interactive timeline showing model performance evolution on Cybersecurity CTFs


Cybersecurity CTFs Leaderboard

3 models

Rank  Model             Organization  Score  Context  Cost (input / output)
1     GPT-5.3 Codex     OpenAI        77.6%  400K     $1.75 / $14.00
2     Claude Haiku 4.5  Anthropic     46.9%  200K     $1.00 / $5.00
3     o1-mini           OpenAI        28.7%  –        –

FAQ

Common questions about Cybersecurity CTFs.

What is the Cybersecurity CTFs benchmark?

Cybersecurity CTFs is a Capture the Flag (CTF) benchmark for evaluating LLMs on offensive security challenges. It contains diverse cybersecurity tasks, including cryptography, web exploitation, binary analysis, and forensics, to assess AI capabilities in cybersecurity problem-solving.

What is the Cybersecurity CTFs leaderboard?

The Cybersecurity CTFs leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, GPT-5.3 Codex by OpenAI leads with a score of 0.776. The average score across all models is 0.511.
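
As an arithmetic check, the reported average follows directly from the three scores given elsewhere on this page:

```python
# Mean of the three reported Cybersecurity CTFs scores.
scores = {"GPT-5.3 Codex": 0.776, "Claude Haiku 4.5": 0.469, "o1-mini": 0.287}
average = sum(scores.values()) / len(scores)
print(round(average, 3))  # prints: 0.511
```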

What is the highest Cybersecurity CTFs score?

The highest Cybersecurity CTFs score is 0.776, achieved by GPT-5.3 Codex from OpenAI.

How many models are evaluated on Cybersecurity CTFs?

3 models have been evaluated on the Cybersecurity CTFs benchmark, with 0 verified results and 3 self-reported results.

Where can I find the Cybersecurity CTFs paper?

The Cybersecurity CTFs paper is available at https://arxiv.org/abs/2406.05590. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Cybersecurity CTFs cover?

Cybersecurity CTFs is categorized under safety. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

View all safety benchmarks
CyberGym

CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

safety
6 models
AttaQ

AttaQ is a unique dataset containing adversarial examples in the form of questions designed to provoke harmful or inappropriate responses from large language models. The benchmark evaluates safety vulnerabilities by using specialized clustering techniques that analyze both the semantic similarity of input attacks and the harmfulness of model responses, facilitating targeted improvements to model safety mechanisms.

safety
3 models
FigQA

FigQA is a multiple-choice benchmark on interpreting scientific figures from biology papers. It evaluates dual-use biological knowledge and multimodal reasoning relevant to bioweapons development.

safety · multimodal
3 models
XSTest

XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models should refuse. The benchmark systematically evaluates whether models refuse to respond to clearly safe prompts due to overly cautious safety mechanisms.

safety
3 models
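
XSTest's design reduces to measuring refusal rates separately on the safe and unsafe prompt sets. The sketch below is a hypothetical illustration of that scoring idea, not XSTest's actual judge; the keyword-based refusal check and the sample responses are assumptions made for the example.

```python
# Hypothetical sketch of XSTest-style scoring: a well-calibrated model should
# comply with safe prompts and refuse unsafe ones, so we count refusals per set.
# The keyword check below is a stand-in for a real refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def is_refusal(response: str) -> bool:
    return response.lower().startswith(REFUSAL_MARKERS)

# (prompt_type, model_response) pairs; values are invented for illustration.
responses = [
    ("safe", "Sure, here is how to kill a Python process."),
    ("safe", "I can't help with that."),          # exaggerated safety
    ("unsafe", "I cannot assist with that request."),
]
safe_responses = [r for t, r in responses if t == "safe"]
over_refusal_rate = sum(is_refusal(r) for r in safe_responses) / len(safe_responses)
print(over_refusal_rate)  # prints: 0.5
```

A high refusal rate on the 250 safe prompts signals exactly the exaggerated safety behaviour the benchmark is designed to surface.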
CyBench

CyBench is a suite of Capture-the-Flag (CTF) challenges measuring agentic cyber attack capabilities. It evaluates dual-use cybersecurity knowledge and measures the 'unguided success rate', where agents complete tasks end-to-end without guidance on appropriate subtasks.

safety
2 models
POPE

Polling-based Object Probing Evaluation (POPE) is a benchmark for evaluating object hallucination in Large Vision-Language Models (LVLMs). POPE addresses the problem where LVLMs generate objects inconsistent with target images by using a polling-based query method that asks yes/no questions about object presence in images, providing more stable and flexible evaluation of object hallucination.

safety · multimodal
2 models