CC-Bench-V2 Frontend
CC-Bench-V2 Frontend evaluates coding agents on frontend development tasks, measuring their ability to build UI components, handle styling, and implement client-side logic.
GLM-5V-Turbo from Zhipu AI currently leads the CC-Bench-V2 Frontend leaderboard with a score of 0.684. It is the only model evaluated so far.
What CC-Bench-V2 Frontend measures
CC-Bench-V2 Frontend is a text benchmark that evaluates large language models on coding tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the current average and the leading score are the same: 0.684 (about 0.7 when rounded).
GLM-5V-Turbo leads with 68.4%.
CC-Bench-V2 Frontend Leaderboard
| Rank | Model | Organization | Score | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | GLM-5V-Turbo | Zhipu AI | 0.684 | — | — | — |
More evaluations to explore
Related benchmarks in the same category
Claw-Eval tests real-world agentic task completion across complex multi-step scenarios, evaluating a model's ability to use tools, navigate environments, and complete end-to-end tasks autonomously.
NL2Repo evaluates long-horizon coding capabilities including repository-level understanding, where models must generate or modify code across entire repositories from natural language specifications.
SkillsBench evaluates coding agents on self-contained programming tasks, measuring practical engineering skills across diverse software development scenarios.
ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in real-world environments.
PinchBench evaluates coding agents on real-world agentic coding tasks, measuring both best-case and average performance across complex software engineering scenarios.
LongCodeBench evaluates the code understanding and comprehension abilities of large language models at very long context windows, scaling up to 1M tokens. It tests whether models can reason about extensive codebases provided in a single prompt by answering multiple-choice questions about the code.