
Toolathlon

Toolathlon (short for "Tool Decathlon") is a comprehensive benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten different tool-use scenarios.
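The three skills the benchmark measures — selecting the right tool, sequencing calls, and executing them correctly — can be illustrated with a minimal sketch. Everything below (the tool names, the plan structure, the registry) is an illustrative assumption, not Toolathlon's actual harness:

```python
# Hypothetical sketch of the kind of tool-use loop a Toolathlon-style
# task exercises. Tool names and the plan format are invented for
# illustration only.
from typing import Callable

# A toy tool registry: the agent must *select* a tool by name.
TOOLS: dict[str, Callable[..., str]] = {
    "search": lambda query: f"results for {query!r}",
    "calculator": lambda expr: str(eval(expr)),  # toy only; never eval untrusted input
}

def run_agent(plan: list[tuple[str, tuple]]) -> list[str]:
    """Execute a plan of (tool_name, args) steps in sequence.

    A real agent chooses and orders these calls itself; a benchmark
    like Toolathlon scores whether selection, sequencing, and
    execution all come out correct.
    """
    outputs = []
    for name, args in plan:
        tool = TOOLS.get(name)
        if tool is None:  # wrong selection is itself a scored failure mode
            outputs.append(f"error: unknown tool {name}")
            continue
        outputs.append(tool(*args))
    return outputs

# Example: a two-step plan (search, then compute).
print(run_agent([("search", ("Toolathlon",)), ("calculator", ("2 + 2",))]))
```

A real evaluation harness would additionally check each step's output against a task-specific rubric rather than just collecting it.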

Progress Over Time

[Interactive timeline showing model performance evolution on Toolathlon]


Toolathlon Leaderboard

16 models
| Rank | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | OpenAI | — | 1.0M | $5.00 / $30.00 |
| 2 | OpenAI | — | 1.0M | $2.50 / $15.00 |
| 3 | Moonshot AI | 1.0T | 262K | $0.95 / $4.00 |
| 4 | — | — | 1.0M | $0.50 / $3.00 |
| 5 | OpenAI | — | 400K | $1.75 / $14.00 |
| 5 | — | — | 205K | $0.30 / $1.20 |
| 7 | — | 230B | 1.0M | $0.30 / $1.20 |
| 8 | — | — | 400K | $0.75 / $4.50 |
| 9 | Zhipu AI | 754B | 200K | $1.40 / $4.40 |
| 10 | Alibaba Cloud / Qwen Team | — | — | — |
| 11 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 12 | — | — | 400K | $0.20 / $1.25 |
| 13 | — | 685B | — | — |
| 13 | — | 685B | — | — |
| 13 | — | 685B | 164K | $0.26 / $0.38 |
| 16 | Alibaba Cloud / Qwen Team | 35B | — | — |

FAQ

Common questions about Toolathlon

What is Toolathlon?
Tool Decathlon is a comprehensive benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten different tool-use scenarios.

How does the Toolathlon leaderboard work?
The Toolathlon leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.556. The average score across all models is 0.422.

What is the highest Toolathlon score?
The highest Toolathlon score is 0.556, achieved by GPT-5.5 from OpenAI.

How many models have been evaluated on Toolathlon?
16 models have been evaluated on the Toolathlon benchmark, with 0 verified results and 16 self-reported results.

What categories does Toolathlon cover?
Toolathlon is categorized under agents, reasoning, and tool calling. The benchmark evaluates text models.