Kimi Claw 24/7 Bench

Kimi Claw 24/7 Bench is Moonshot AI's in-house benchmark for evaluating long-horizon agentic performance in persistent, multi-day coworking tasks. It spans 17 professional scenarios across 610 evaluation points, covering software engineering, ML research, recruiting, trading, and marketing tasks executed through the OpenClaw harness.

Kimi K2.7 Code from Moonshot AI currently leads the Kimi Claw 24/7 Bench leaderboard with a score of 0.469 across 1 evaluated AI model.

About this benchmark

What Kimi Claw 24/7 Bench measures

Kimi Claw 24/7 Bench is a text benchmark that evaluates large language models on agentic and coding tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.469, which equals the leader's score of 0.469 because only one model has been reported.
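The summary figures quoted for this benchmark (model count, average, leader, and the percentage form of the score) can be reproduced with a few lines of Python. This is a minimal sketch, not the site's actual code: the function name `summarize` and the dictionary layout are hypothetical, and the only data used is the single self-reported score from this page.

```python
from statistics import mean


def summarize(scores):
    """Return count, average, and leader for a {model_name: score} mapping.

    Scores are assumed to lie on a 0-to-1 scale, as stated on this page.
    """
    if not scores:
        raise ValueError("no scores reported")
    leader = max(scores, key=scores.get)
    return {
        "count": len(scores),
        "average": mean(scores.values()),
        "leader": leader,
        "leader_score": scores[leader],
        # A 0-to-1 score rendered as a percentage with one decimal place.
        "leader_percent": f"{scores[leader]:.1%}",
    }


# The only result reported on this page.
reported = {"Kimi K2.7 Code": 0.469}
summary = summarize(reported)
print(summary)
```

With a single reported model the average and the leader's score coincide, which is why both read 0.469 here.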

Compare leaders on the best AI for agents and best AI for coding leaderboards.

Moonshot AI's Kimi K2.7 Code leads with 46.9%.

Progress Over Time

[Interactive timeline of model performance on Kimi Claw 24/7 Bench. Legend: state-of-the-art frontier, open models, proprietary models.]

Kimi Claw 24/7 Bench Leaderboard

1 model

Rank: 1
Model: Kimi K2.7 Code (Moonshot AI)
Parameters: 1.0T
Context: 262K
Cost: $0.95 / $4.00

FAQ

Common questions about Kimi Claw 24/7 Bench.

What is the Kimi Claw 24/7 Bench benchmark?

Kimi Claw 24/7 Bench is Moonshot AI's in-house benchmark for evaluating long-horizon agentic performance in persistent, multi-day coworking tasks. It spans 17 professional scenarios across 610 evaluation points, covering software engineering, ML research, recruiting, trading, and marketing tasks executed through the OpenClaw harness.

What is the Kimi Claw 24/7 Bench leaderboard?

The Kimi Claw 24/7 Bench leaderboard ranks AI models by their performance on this benchmark and currently lists 1 model. Kimi K2.7 Code by Moonshot AI leads with a score of 0.469. The average score across all models is 0.469.

What is the highest Kimi Claw 24/7 Bench score?

The highest Kimi Claw 24/7 Bench score is 0.469, achieved by Kimi K2.7 Code from Moonshot AI.

How many models are evaluated on Kimi Claw 24/7 Bench?

1 model has been evaluated on the Kimi Claw 24/7 Bench benchmark, with 0 verified results and 1 self-reported result.

What categories does Kimi Claw 24/7 Bench cover?

Kimi Claw 24/7 Bench is categorized under agents and coding. The benchmark evaluates text models.

What's the difference between Kimi Claw 24/7 Bench and Claw-Eval?

Kimi Claw 24/7 Bench is a variant of Claw-Eval. See the Claw-Eval leaderboard for the broader benchmark and per-model comparison.

What is the best open-source model on Kimi Claw 24/7 Bench?

Kimi K2.7 Code by Moonshot AI is the top-ranked open-source model on Kimi Claw 24/7 Bench, with a score of 0.469 (rank #1).

Which model offers the best value on Kimi Claw 24/7 Bench?

Among models scoring within 10% of the leader, Kimi K2.7 Code from Moonshot AI is the cheapest, at $0.95 per million input tokens with a score of 0.469.
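The selection rule described above (keep models scoring within 10% of the leader, then pick the cheapest) can be sketched as follows. This is an illustration under stated assumptions rather than the site's actual method: "within 10%" is read as relative to the leader's score, cost is compared on input-token price only, and the names `Entry` and `best_value` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float       # benchmark score on a 0-to-1 scale
    input_cost: float  # USD per million input tokens


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry whose score is within `tolerance` of the top score.

    Assumption: "within 10%" means score >= top_score * (1 - tolerance).
    """
    if not entries:
        return None
    top_score = max(e.score for e in entries)
    eligible = [e for e in entries if e.score >= top_score * (1 - tolerance)]
    return min(eligible, key=lambda e: e.input_cost)


# The only entry reported on this page.
leaderboard = [Entry("Kimi K2.7 Code", 0.469, 0.95)]
winner = best_value(leaderboard)
print(winner.model, winner.input_cost)
```

With one entry on the leaderboard the rule trivially returns that entry; the filter only starts to matter once several models are listed.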

How recent are the Kimi Claw 24/7 Bench leaderboard results?

The Kimi Claw 24/7 Bench leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers.

agents
48 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents
46 models
SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

agents
29 models
Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

agents
25 models
MCP Atlas

MCP Atlas is a benchmark for evaluating AI models on scaled tool use capabilities, measuring how well models can coordinate and utilize multiple tools across complex multi-step tasks.

agents
23 models
t2-bench

t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

agents
23 models