CC-Bench-V2 Backend
CC-Bench-V2 Backend evaluates coding agents on backend development tasks, measuring practical engineering ability to implement server-side logic, APIs, and system components.
GLM-5V-Turbo from Zhipu AI currently leads the CC-Bench-V2 Backend leaderboard with a score of 0.228; it is the only model evaluated so far.
What CC-Bench-V2 Backend measures
CC-Bench-V2 Backend is a text benchmark that evaluates large language models on coding tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model has been reported, the current average and the leading score are the same: 0.228.
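The relationship between the raw score (on a 0 to 1 scale) and the percentage quoted on this page can be sketched in a few lines. This is only an illustrative sketch: the `scores` dictionary and the `summarize` helper are hypothetical names, not part of any LLM Stats API, and the only data point used is the GLM-5V-Turbo score reported on this page.

```python
# Hypothetical score table; the single value comes from this page.
scores = {"GLM-5V-Turbo": 0.228}
MAX_SCORE = 1.0  # maximum possible score stated for this benchmark


def summarize(scores, max_score=MAX_SCORE):
    """Return the leader, its score as a percentage, and the mean score."""
    leader = max(scores, key=scores.get)
    average = sum(scores.values()) / len(scores)
    return {
        "leader": leader,
        "leader_pct": round(100 * scores[leader] / max_score, 1),
        "average": round(average, 3),
    }


summary = summarize(scores)
print(summary)
```

With a single reported model, the average necessarily equals the leader's score, so both figures will diverge only once more models are added.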
GLM-5V-Turbo leads with 22.8%.
CC-Bench-V2 Backend Leaderboard
| Rank | Model | Organization | Score | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | GLM-5V-Turbo | Zhipu AI | 0.228 | — | — | — |
More evaluations to explore
Related benchmarks in the same category
Claw-Eval tests real-world agentic task completion across complex multi-step scenarios, evaluating a model's ability to use tools, navigate environments, and complete end-to-end tasks autonomously.
NL2Repo evaluates long-horizon coding capabilities including repository-level understanding, where models must generate or modify code across entire repositories from natural language specifications.
SkillsBench evaluates coding agents on self-contained programming tasks, measuring practical engineering skills across diverse software development scenarios.
ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in real-world environments.
PinchBench evaluates coding agents on real-world agentic coding tasks, measuring both best-case and average performance across complex software engineering scenarios.
LongCodeBench evaluates the code understanding and comprehension abilities of large language models at very long context windows, scaling up to 1M tokens. It tests whether models can reason about extensive codebases provided in a single prompt by answering multiple-choice questions about the code.