Claw-Eval
Claw-Eval tests real-world agentic task completion across complex multi-step scenarios, evaluating a model's ability to use tools, navigate environments, and complete end-to-end tasks autonomously.
Kimi K2.6 from Moonshot AI currently leads the Claw-Eval leaderboard with a score of 0.809 across 11 evaluated AI models.
What Claw-Eval measures
Claw-Eval is a text benchmark that evaluates large language models on agentic and coding tasks. LLM Stats tracks 11 models on this benchmark, which has a maximum possible score of 1. The current average across reported models is 0.6, and the leader scores 0.809.
Compare leaders on the best AI for agents and best AI for coding leaderboards.
Kimi K2.6 leads with 80.9%, followed by GLM-5V-Turbo at 75.0% and MiniMax M3 at 74.5%.
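The page reports scores both on the benchmark's 0-to-1 scale (0.809) and as percentages (80.9%). A minimal sketch of that conversion and of ranking the three models named above follows; the helper names `as_percent` and `rank` are illustrative and not part of any LLM Stats tooling.

```python
# Scores for the three models named above, on Claw-Eval's 0-to-1 scale.
scores = {
    "GLM-5V-Turbo": 0.750,
    "MiniMax M3": 0.745,
    "Kimi K2.6": 0.809,
}


def as_percent(score: float) -> str:
    """Format a 0-to-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


def rank(model_scores: dict) -> list:
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(model_scores.items(), key=lambda item: item[1], reverse=True)


for position, (model, score) in enumerate(rank(scores), start=1):
    print(f"{position}. {model}: {as_percent(score)}")
```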
Claw-Eval Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Moonshot AI | 1.0T | 262K | $0.95 / $4.00 | ||
| 2 | Zhipu AI | — | — | — | ||
| 3 | MiniMax | — | 1.0M | $0.60 / $2.40 | ||
| 4 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 5 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 | ||
| 6 | Xiaomi | 311B | 1.0M | $0.17 / $0.34 | ||
| 7 | Xiaomi | 1.0T | — | — | ||
| 8 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 9 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 10 | Xiaomi | — | — | — | ||
| 11 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
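The Cost column lists two prices per model without stating a unit. Assuming they are USD per million input tokens and per million output tokens (an assumption, not confirmed by the table), the cost of a single run could be estimated roughly as sketched below. The function names and the token counts are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple


def parse_cost(cell: str) -> Optional[Tuple[float, float]]:
    """Split a cell such as "$0.95 / $4.00" into (input_price, output_price).

    Returns None for empty cells or the "—" placeholder used in the table.
    """
    cell = cell.strip()
    if cell in ("", "—"):
        return None
    first, second = (part.strip().lstrip("$") for part in cell.split("/"))
    return float(first), float(second)


def estimate_cost(cell: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Estimate the USD cost of one run.

    ASSUMPTION: prices are per one million tokens, input first and output second.
    """
    prices = parse_cost(cell)
    if prices is None:
        return None
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agentic run: 200,000 input tokens and 20,000 output tokens.
cost = estimate_cost("$0.95 / $4.00", input_tokens=200_000, output_tokens=20_000)
print(f"Estimated cost: ${cost:.2f}")
```

Rows without listed pricing return `None` rather than a guessed value, so they can be filtered out before any comparison.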
More evaluations to explore
Related benchmarks in the same category
BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to persist in information gathering, navigate the web creatively, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is easy to use, because predicted answers are short and easily checked against reference answers.
Terminal-Bench 2.0 is an updated benchmark for testing how well AI agents use tools to operate a computer through a terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.
MCP Atlas is a benchmark for evaluating AI models on scaled tool use capabilities, measuring how well models can coordinate and utilize multiple tools across complex multi-step tasks.
t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.