ZClawBench

ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in real-world environments.

GLM-5V-Turbo from Zhipu AI currently leads the ZClawBench leaderboard of 3 evaluated AI models with a score of 0.576.

Zhipu AI's GLM-5V-Turbo leads with 57.6%, followed by the Alibaba Cloud / Qwen Team's Qwen3.6-27B at 53.4% and Qwen3.6-35B-A3B at 52.6%.

Progress Over Time

[Interactive timeline showing model performance evolution on ZClawBench, with a state-of-the-art frontier line and an open vs. proprietary split.]

ZClawBench Leaderboard

3 models

| Rank | Organization | Model | Score | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|---|
| 1 | Zhipu AI | GLM-5V-Turbo | 57.6% | | | | |
| 2 | Alibaba Cloud / Qwen Team | Qwen3.6-27B | 53.4% | 28B | 262K | $0.60 / $3.60 | |
| 3 | Alibaba Cloud / Qwen Team | Qwen3.6-35B-A3B | 52.6% | 35B | | | |

FAQ

Common questions about ZClawBench.

What is the ZClawBench benchmark?

ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in real-world environments.

What is the ZClawBench leaderboard?

The ZClawBench leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, GLM-5V-Turbo by Zhipu AI leads with a score of 0.576. The average score across all models is 0.545.
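
The average is simply the arithmetic mean of the three self-reported scores listed above; a quick check in Python:

```python
# Self-reported ZClawBench scores from the leaderboard above.
scores = {
    "GLM-5V-Turbo": 0.576,
    "Qwen3.6-27B": 0.534,
    "Qwen3.6-35B-A3B": 0.526,
}

average = sum(scores.values()) / len(scores)
print(f"average = {average:.3f}")  # average = 0.545
print(f"best    = {max(scores.values()):.3f}")  # best    = 0.576
```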

What is the highest ZClawBench score?

The highest ZClawBench score is 0.576, achieved by GLM-5V-Turbo from Zhipu AI.

How many models are evaluated on ZClawBench?

3 models have been evaluated on the ZClawBench benchmark, with 0 verified results and 3 self-reported results.

What categories does ZClawBench cover?

ZClawBench is categorized under agents and coding. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers (a rough sketch of such a check follows below).

agents
45 models
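
To illustrate what "easily verifiable" can mean for BrowseComp-style short answers, here is a minimal normalize-and-compare sketch. The normalization rules are an assumption for illustration, not BrowseComp's actual grading code.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed rules)."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()

def is_correct(predicted: str, reference: str) -> bool:
    # Short, factual answers allow a normalized exact-match comparison.
    return normalize(predicted) == normalize(reference)

print(is_correct("Eiffel Tower.", "eiffel  tower"))  # True
```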
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents
39 models
Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox (a minimal sketch of such a loop follows below).

agents
23 models
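
A rough sketch of the harness idea: the model proposes shell commands, a sandbox runs them, and the output is fed back. Here `propose_command` stands in for a hypothetical model call, and a real harness would use an isolated container rather than the host shell.

```python
import subprocess

def run_in_sandbox(command: str, timeout: float = 60.0) -> str:
    """Run one shell command and capture its output.
    NOTE: a real harness isolates this in a container, not the host shell."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def run_episode(propose_command, task: str, max_steps: int = 20) -> bool:
    """Agent loop: feed terminal output back until the model signals completion."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        command = propose_command(transcript)  # hypothetical model call
        if command.strip() == "DONE":          # hypothetical stop signal
            return True
        transcript += f"$ {command}\n{run_in_sandbox(command)}\n"
    return False
```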
t2-bench

t2-bench is a benchmark for evaluating agentic tool-use capabilities, measuring how well models can select, sequence, and use tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios (a minimal dispatch sketch follows below).

agents
22 models
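
The select-and-sequence idea can be sketched as a registry of callable tools plus a dispatcher. The tools and task below are invented for illustration and are not part of t2-bench itself.

```python
from typing import Any, Callable

# Hypothetical tool registry; the benchmark supplies its own tools and tasks.
TOOLS: dict[str, Callable[..., Any]] = {
    "search_flights": lambda origin, dest: [f"{origin}->{dest} 09:15",
                                            f"{origin}->{dest} 18:40"],
    "book_flight": lambda flight: f"booked: {flight}",
}

def execute_tool_call(name: str, **kwargs: Any) -> Any:
    """Dispatch one model-selected tool call by name."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")  # the agent must recover from this
    return TOOLS[name](**kwargs)

# A two-step sequence: search first, then book the earliest option.
options = execute_tool_call("search_flights", origin="SFO", dest="JFK")
print(execute_tool_call("book_flight", flight=options[0]))
```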
SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

agents
20 models
BFCL-v3

Berkeley Function Calling Leaderboard v3 (BFCL-v3) is an advanced benchmark that evaluates large language models' function calling capabilities through multi-turn and multi-step interactions. It introduces extended conversational exchanges where models must retain contextual information across turns and execute multiple internal function calls for complex user requests. The benchmark includes 1,000 test cases across domains like vehicle control, trading bots, travel booking, and file system management, using state-based evaluation to verify both system state changes and execution path correctness (a minimal sketch follows below).

agents
18 models
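
State-based evaluation, as described above, checks the environment's final state and the execution path rather than string-matching the model's text output. A minimal sketch, with invented states and paths:

```python
def state_based_eval(
    final_state: dict,
    expected_state: dict,
    call_log: list[str],
    accepted_paths: list[list[str]],
) -> bool:
    """Pass only if the environment reached the expected state AND the
    executed call sequence matches one accepted execution path."""
    state_ok = all(final_state.get(k) == v for k, v in expected_state.items())
    path_ok = call_log in accepted_paths
    return state_ok and path_ok

# Invented file-system example: the task must end with /notes.txt containing "hi".
print(state_based_eval(
    final_state={"/notes.txt": "hi"},
    expected_state={"/notes.txt": "hi"},
    call_log=["create_file", "write_file"],
    accepted_paths=[["create_file", "write_file"], ["write_file"]],
))  # True
```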