WildClawBench

Name: WildClawBench Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

WildClawBench Leaderboard

4 models

Rank	Model	Organization	Parameters	Context	Cost
1	Seed 2.1 Turbo	ByteDance	—	—	—
2	Seed 2.1 Pro	ByteDance	—	—	—
3	Hy3	Tencent	295B	—	—
4	MiMo-V2.5-Pro	Xiaomi	1.0T	1.0M	$0.43 / $0.87

About this benchmark

What is WildClawBench?

WildClawBench is an agentic coding benchmark from InternLM/Claw-Eval that reports overall model performance on real-world tool-using development tasks.

WildClawBench is a text benchmark that evaluates models on agent and coding tasks. LLM Stats tracks 4 models on this benchmark, scored on a 0–1 scale. The current average is 0.553, with the leader at 0.628.

Current leaders

Seed 2.1 Turbo from ByteDance currently leads the WildClawBench leaderboard with a score of 0.628 across 4 evaluated AI models.

Seed 2.1 Turbo (ByteDance): 62.8%

Seed 2.1 Pro (ByteDance): 61.7%

Hy3 (Tencent): 53.6%

FAQ

Common questions about the WildClawBench benchmark and leaderboard.

What is the WildClawBench benchmark?

WildClawBench is an agentic coding benchmark from InternLM/Claw-Eval that reports overall model performance on real-world tool-using development tasks.

What is the WildClawBench leaderboard?

The WildClawBench leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, Seed 2.1 Turbo by ByteDance leads with a score of 0.628. The average score across all models is 0.553.
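
The page lists scores for only three of the four ranked models, so the reported average can be used as a rough consistency check. The sketch below assumes the 0.553 figure is a plain arithmetic mean over all four models and ignores rounding in the published numbers; the function name is hypothetical, and the result is an implied value, not a score reported by the leaderboard.

```python
# Scores as listed on the page (0-1 scale).
reported = {
    "Seed 2.1 Turbo": 0.628,
    "Seed 2.1 Pro": 0.617,
    "Hy3": 0.536,
}
n_models = 4           # models on the leaderboard
reported_mean = 0.553  # average stated on the page

def implied_missing_score(known_scores, mean, n):
    """Score the one unlisted model would need for the mean to hold.

    Assumes `mean` is the arithmetic mean of exactly `n` scores and that
    all but one of them appear in `known_scores`.
    """
    return mean * n - sum(known_scores)

implied = implied_missing_score(reported.values(), reported_mean, n_models)
print(f"Implied fourth score: {implied:.3f}")  # prints 0.431
```

Because every published figure is rounded to three decimals, the implied value is only accurate to within a few thousandths. It is consistent with the fourth-ranked model scoring below Hy3's 0.536.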

What is the highest WildClawBench score?

The highest WildClawBench score is 0.628, achieved by Seed 2.1 Turbo from ByteDance.

How many models are evaluated on WildClawBench?

Four models have been evaluated on the WildClawBench benchmark. All 4 results are self-reported; none have been verified.

Where can I find the WildClawBench dataset?

The WildClawBench dataset is available on Hugging Face at https://huggingface.co/datasets/internlm/WildClawBench.

What categories does WildClawBench cover?

WildClawBench is categorized under agents and coding. The benchmark evaluates text models.

What is the best open-source model on WildClawBench?

Hy3 by Tencent is the top-ranked open-source model on WildClawBench, with a score of 0.536 (rank #3).

How recent are the WildClawBench leaderboard results?

The WildClawBench leaderboard was last updated in July 2026 and currently includes 4 evaluated models.