SecCodeBench

SecCodeBench evaluates LLM coding agents on secure code generation and vulnerability detection, testing the ability to produce code that is both functional and free from security vulnerabilities.

Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the SecCodeBench leaderboard with a score of 0.683 (68.3%). It is the only model evaluated so far.


Progress Over Time

[Interactive timeline: model performance on SecCodeBench over time, with the state-of-the-art frontier shown for open and proprietary models.]

SecCodeBench Leaderboard

1 model

Rank	Model	Score	Parameters	Context	Cost	License
1	Qwen3.5-397B-A17B (Alibaba Cloud / Qwen Team)	0.683	397B	262K	$0.60 / $3.60	


FAQ

Common questions about SecCodeBench.

What is the SecCodeBench benchmark?

SecCodeBench evaluates LLM coding agents on secure code generation and vulnerability detection, testing the ability to produce code that is both functional and free from security vulnerabilities.
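To make "functional and free from security vulnerabilities" concrete, here is a minimal sketch of the kind of contrast such a benchmark might look for. It is a hypothetical illustration and not an actual SecCodeBench task: the table, the function names (`make_db`, `find_user_unsafe`, `find_user_safe`), and the injection payload are all assumptions made for this example. Both functions return the right row for ordinary input, but only one resists SQL injection.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced into the SQL text, so a crafted
    # name can change the meaning of the query (SQL injection).
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    payload = "nobody' OR '1'='1"  # hypothetical attacker input
    print("unsafe rows:", len(find_user_unsafe(conn, payload)))
    print("safe rows:", len(find_user_safe(conn, payload)))
```

A purely functional test (looking up "alice") passes for both versions, which is why a security-focused benchmark needs checks beyond functional correctness.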

What is the SecCodeBench leaderboard?

The SecCodeBench leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team, with a score of 0.683. Because only one model has been evaluated, the average score across all models is also 0.683.

What is the highest SecCodeBench score?

The highest SecCodeBench score is 0.683, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models are evaluated on SecCodeBench?

One model has been evaluated on the SecCodeBench benchmark. Its result is self-reported; there are no verified results yet.

What categories does SecCodeBench cover?

SecCodeBench is categorized under coding. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category


Claw-Eval

Claw-Eval tests real-world agentic task completion across complex multi-step scenarios, evaluating a model's ability to use tools, navigate environments, and complete end-to-end tasks autonomously.

coding

7 models

NL2Repo

NL2Repo evaluates long-horizon coding capabilities including repository-level understanding, where models must generate or modify code across entire repositories from natural language specifications.

coding

5 models

PinchBench

PinchBench evaluates coding agents on real-world agentic coding tasks, measuring both best-case and average performance across complex software engineering scenarios.

coding

3 models

SkillsBench

SkillsBench evaluates coding agents on self-contained programming tasks, measuring practical engineering skills across diverse software development scenarios.

coding

3 models

ZClawBench

ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in real-world environments.

coding

3 models

CC-Bench-V2 Backend

CC-Bench-V2 Backend evaluates coding agents on backend development tasks, measuring practical engineering ability to implement server-side logic, APIs, and system components.

coding

1 model