Toolathlon
Tool Decathlon (Toolathlon) is a benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten tool-use scenarios.
Toolathlon Leaderboard
16 models
| Rank | Model / Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (OpenAI) | — | 1.0M | $5.00 / $30.00 | | |
| 2 | OpenAI | — | 1.0M | $2.50 / $15.00 | | |
| 3 | Kimi K2.6 (Moonshot AI) | 1.0T | 262K | $0.95 / $4.00 | | |
| 4 | Google | — | 1.0M | $0.50 / $3.00 | | |
| 5 | OpenAI | — | 400K | $1.75 / $14.00 | | |
| 5 | MiniMax | — | 205K | $0.30 / $1.20 | | |
| 7 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | | |
| 8 | OpenAI | — | 400K | $0.75 / $4.50 | | |
| 9 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | | |
| 10 | Alibaba Cloud / Qwen Team | — | — | — | | |
| 11 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | | |
| 12 | OpenAI | — | 400K | $0.20 / $1.25 | | |
| 13 | DeepSeek | 685B | — | — | | |
| 13 | DeepSeek | 685B | — | — | | |
| 13 | DeepSeek | 685B | 164K | $0.26 / $0.38 | | |
| 16 | Alibaba Cloud / Qwen Team | 35B | — | — | | |
FAQ
Common questions about Toolathlon
The Toolathlon leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.556. The average score across all models is 0.422.
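The tied ranks on the leaderboard (5, 5, 7 and 13, 13, 13, 16) are consistent with standard competition ranking, where tied models share a rank and the next rank skips ahead. The sketch below illustrates that scheme together with a mean-score calculation. The function name and the sample scores are hypothetical (only 0.556 is taken from the leaderboard), and the site may compute its ranks and average differently.

```python
from statistics import mean


def competition_ranks(scores):
    """Return 1-based ranks; tied scores share a rank and the next rank skips ahead."""
    ordered = sorted(scores, reverse=True)
    # A score's rank is the first position at which it appears
    # in the descending ordering.
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Hypothetical scores for illustration; only 0.556 (the reported top score)
# comes from the leaderboard.
sample_scores = [0.556, 0.500, 0.450, 0.450, 0.400]

ranks = competition_ranks(sample_scores)  # the two 0.450 entries share rank 3
average = mean(sample_scores)             # arithmetic mean of the sample
```

With these sample values, the two tied entries share rank 3 and the last entry receives rank 5, mirroring the 5, 5, 7 pattern in the table.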
The highest Toolathlon score is 0.556, achieved by GPT-5.5 from OpenAI.
16 models have been evaluated on the Toolathlon benchmark, with 0 verified results and 16 self-reported results.
Toolathlon is categorized under agents, reasoning, and tool calling. The benchmark evaluates text models.