ResearchClawBench
ResearchClawBench evaluates research agents on realistic, tool-using research tasks that require code execution and filesystem workspace interaction.
MiMo-V2.5 from Xiaomi currently leads the ResearchClawBench leaderboard with a score of 0.169; it is the only model evaluated so far.
What ResearchClawBench measures
ResearchClawBench is a text benchmark that evaluates large language models on tool calling, research, and agent tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the average and the leading score are the same: 0.169, shown as 0.2 when rounded to one decimal place.
Compare leaders on the best AI for tool calling, best AI for research and best AI for agents leaderboards.
MiMo-V2.5 leads with 16.9%.
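The figures quoted above (a score of 0.169, a maximum of 1, a leader at 16.9%, and an average shown as 0.2) are consistent with one another. The sketch below reproduces that arithmetic from the single reported score. The names `scores`, `MAX_SCORE`, and `summarize` are illustrative assumptions, not part of any LLM Stats API, and the rounding rule is inferred from the numbers on this page.

```python
from statistics import mean

# Scores reported on this page, keyed by model name (only one model so far).
scores = {"MiMo-V2.5": 0.169}
MAX_SCORE = 1.0  # maximum possible score stated on this page


def summarize(scores, max_score=MAX_SCORE):
    """Return the leader, its score as a percentage, and the average score."""
    leader, best = max(scores.items(), key=lambda item: item[1])
    return {
        "leader": leader,
        "leader_score": best,
        # Express the score as a percentage of the maximum possible score.
        "leader_percent": round(100 * best / max_score, 1),
        "average": round(mean(scores.values()), 3),
        "models": len(scores),
    }


summary = summarize(scores)
print(summary)
```

With a single entry the average necessarily equals the leader's score, and rounding 0.169 to one decimal place gives the 0.2 quoted in the summary text.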
ResearchClawBench Leaderboard
| Rank | Model | Organization | Score | Parameters | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | MiMo-V2.5 | Xiaomi | 0.169 | 311B | 1.0M | $0.17 / $0.34 | |
More evaluations to explore
Related benchmarks in the same category
BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to persist in information gathering, navigate the web creatively, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easily verified against reference answers.
Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents use tools to operate a computer through a terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks involving cybersecurity vulnerabilities.
The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both the agent and the user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.
The τ²-Bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.
A benchmark for evaluating tool-agent-user interaction in retail environments. Tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Evaluates agents on tasks like order cancellations, address changes, and order status checks through multi-turn conversations.