
MM-ClawBench

MM-ClawBench evaluates models on MiniMax's Claw-style agent benchmark, measuring practical agentic task completion quality in real-world OpenClaw usage scenarios.

Progress Over Time

[Interactive timeline: model performance evolution on MM-ClawBench, comparing open and proprietary models against the state-of-the-art frontier]
MM-ClawBench Leaderboard

1 model evaluated

Rank  Model         Context  Cost (in / out)  License
1     MiniMax M2.7  205K     $0.30 / $1.20    (not listed)
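The listed prices can be turned into a per-run cost estimate. The sketch below assumes the "$0.30 / $1.20" figures are per million input/output tokens (a common pricing convention; the page does not state the unit), and the token counts in the example are hypothetical.

```python
# Hypothetical cost estimate for one MiniMax M2.7 run, assuming the
# leaderboard's "$0.30 / $1.20" prices are USD per 1M input/output
# tokens (an assumption; the unit is not stated on the page).

INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 1.20  # USD per 1M output tokens (assumed)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one model run."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a long agentic session with 150K input and 20K output tokens.
print(round(run_cost(150_000, 20_000), 4))  # → 0.069
```

Long agent sessions are input-heavy, so the lower input price tends to dominate the total.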

FAQ

Common questions about MM-ClawBench

What does MM-ClawBench measure?
MM-ClawBench evaluates models on MiniMax's Claw-style agent benchmark, measuring practical agentic task completion quality in real-world OpenClaw usage scenarios.

How are models ranked?
The MM-ClawBench leaderboard ranks 1 AI model by benchmark score. Currently, MiniMax M2.7 by MiniMax leads with a score of 0.627, which is also the average score across all listed models.

What is the highest MM-ClawBench score?
The highest MM-ClawBench score is 0.627, achieved by MiniMax M2.7 from MiniMax.

How many models have been evaluated?
1 model has been evaluated on the MM-ClawBench benchmark, with 0 verified results and 1 self-reported result.

What categories does MM-ClawBench fall under?
MM-ClawBench is categorized under agents and coding. The benchmark evaluates text models.