CC-Bench-V2 Repo Exploration

Name: CC-Bench-V2 Repo Exploration Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

CC-Bench-V2 Repo Exploration evaluates coding agents on repository-level understanding and navigation, measuring ability to explore, comprehend, and work across entire codebases.

GLM-5V-Turbo from Zhipu AI currently leads the CC-Bench-V2 Repo Exploration leaderboard with a score of 0.722. It is the only model evaluated so far.

About this benchmark

What CC-Bench-V2 Repo Exploration measures

CC-Bench-V2 Repo Exploration is a text benchmark that evaluates large language models on agentic and coding tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.722, which equals the leader's score because only one model is reported.

Compare leaders on the best AI for agents and best AI for coding leaderboards.

GLM-5V-Turbo leads with 72.2%.
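
The normalized score (0.722 out of a maximum of 1) and the percentage (72.2%) are the same figure in two forms. The following is a minimal sketch of that conversion and of the single-model average; the `scores` dictionary and the `as_percent` helper are illustrative names, not part of any LLM Stats API.

```python
from statistics import mean

# Scores as listed on the leaderboard, each a fraction of a maximum of 1.
# The dictionary layout is an assumption made for this example.
scores = {"GLM-5V-Turbo": 0.722}


def as_percent(score: float, max_score: float = 1.0) -> str:
    """Format a score as a percentage of the maximum, to one decimal place."""
    return f"{100 * score / max_score:.1f}%"


# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# With a single entry, the average equals the leader's score.
average = mean(list(scores.values()))

print(leader, as_percent(scores[leader]))
print("average:", round(average, 3))
```

With only one reported model, the leader's score, the highest score, and the average all coincide.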

Progress Over Time

Timeline chart of model performance on CC-Bench-V2 Repo Exploration over time (legend: state-of-the-art frontier, open models, proprietary models).

CC-Bench-V2 Repo Exploration Leaderboard

1 model

Rank	Model	Organization	Score	Context	Cost	License
1	GLM-5V-Turbo	Zhipu AI	0.722	—	—	—

FAQ

Common questions about CC-Bench-V2 Repo Exploration.

What is the CC-Bench-V2 Repo Exploration benchmark?

CC-Bench-V2 Repo Exploration evaluates coding agents on repository-level understanding and navigation, measuring ability to explore, comprehend, and work across entire codebases.

What is the CC-Bench-V2 Repo Exploration leaderboard?

The CC-Bench-V2 Repo Exploration leaderboard ranks AI models by their performance on this benchmark and currently lists 1 model. GLM-5V-Turbo by Zhipu AI leads with a score of 0.722. With only one model listed, the average score is also 0.722.

What is the highest CC-Bench-V2 Repo Exploration score?

The highest CC-Bench-V2 Repo Exploration score is 0.722, achieved by GLM-5V-Turbo from Zhipu AI.

How many models are evaluated on CC-Bench-V2 Repo Exploration?

One model has been evaluated on the CC-Bench-V2 Repo Exploration benchmark, with 0 verified results and 1 self-reported result.

What categories does CC-Bench-V2 Repo Exploration cover?

CC-Bench-V2 Repo Exploration is categorized under agents and coding. The benchmark evaluates text models.

How recent are the CC-Bench-V2 Repo Exploration leaderboard results?

The CC-Bench-V2 Repo Exploration leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to persist in information gathering, navigate the web creatively, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easily checked against reference answers.

agents

48 models

Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via a terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents

47 models

SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

agents

29 models

Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

agents

25 models

MCP Atlas

MCP Atlas is a benchmark for evaluating AI models on scaled tool use capabilities, measuring how well models can coordinate and utilize multiple tools across complex multi-step tasks.

agents

23 models

t2-bench

t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

agents

23 models