Toolathlon Leaderboard

Progress Over Time

[Interactive chart: model performance on Toolathlon over time. Legend: state-of-the-art frontier, open models, proprietary models.]

Toolathlon Leaderboard

16 models

Rank	Model	Organization	Parameters	Context	Cost
1	GPT-5.5 (New)	OpenAI	—	1.0M	$5.00 / $30.00
2	GPT-5.4	OpenAI	—	1.0M	$2.50 / $15.00
3	Kimi K2.6 (New)	Moonshot AI	1.0T	262K	$0.95 / $4.00
4	Gemini 3 Flash	Google	—	1.0M	$0.50 / $3.00
5	GPT-5.2	OpenAI	—	400K	$1.75 / $14.00
5	MiniMax M2.7	MiniMax	—	205K	$0.30 / $1.20
7	MiniMax M2.1	MiniMax	230B	1.0M	$0.30 / $1.20
8	GPT-5.4 mini	OpenAI	—	400K	$0.75 / $4.50
9	GLM-5.1	Zhipu AI	754B	200K	$1.40 / $4.40
10	Qwen3.6 Plus	Alibaba Cloud / Qwen Team	—	—	—
11	Qwen3.5-397B-A17B	Alibaba Cloud / Qwen Team	397B	262K	$0.60 / $3.60
12	GPT-5.4 nano	OpenAI	—	400K	$0.20 / $1.25
13	DeepSeek-V3.2 (Thinking)	DeepSeek	685B	—	—
13	DeepSeek-V3.2-Speciale	DeepSeek	685B	—	—
13	DeepSeek-V3.2	DeepSeek	685B	164K	$0.26 / $0.38
16	Qwen3.6-35B-A3B	Alibaba Cloud / Qwen Team	35B	—	—
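
The rank column repeats a rank for tied entries and then skips ahead (5, 5, 7 and 13, 13, 13, 16), which is consistent with standard competition ranking. A minimal sketch of that scheme follows, assuming ties are decided on the displayed score. The function name and the example scores are hypothetical and are not taken from the leaderboard.

```python
from typing import List


def competition_ranks(scores: List[float]) -> List[int]:
    """Return 1-based ranks, highest score first.

    Tied scores share the same rank, and the following rank is skipped
    (for example 1, 2, 2, 4). This is an assumed scheme, inferred only
    from the pattern of ranks shown in the table.
    """
    # Indices of the scores, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        previous = order[position - 2] if position > 1 else None
        if previous is not None and scores[idx] == scores[previous]:
            # Same score as the entry just above: reuse its rank.
            ranks[idx] = ranks[previous]
        else:
            # Otherwise the rank is the 1-based position in the ordering.
            ranks[idx] = position
    return ranks


# Hypothetical scores, for illustration only.
hypothetical_scores = [0.50, 0.40, 0.40, 0.30]
print(competition_ranks(hypothetical_scores))  # [1, 2, 2, 4]
```

Exact equality is used for ties here, which only makes sense if the comparison is made on scores already rounded for display.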

FAQ

Common questions about Toolathlon

What is the Toolathlon benchmark?

Toolathlon (the Tool Decathlon) is a comprehensive benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten different tool-use scenarios.

What is the Toolathlon leaderboard?

The Toolathlon leaderboard ranks 16 AI models by their performance on this benchmark. GPT-5.5 by OpenAI currently leads with a score of 0.556. The average score across all models is 0.422.

What is the highest Toolathlon score?

The highest Toolathlon score is 0.556, achieved by GPT-5.5 from OpenAI.

How many models are evaluated on Toolathlon?

16 models have been evaluated on the Toolathlon benchmark. All 16 results are self-reported; none have been verified.

What categories does Toolathlon cover?

Toolathlon is categorized under agents, reasoning, and tool calling. The benchmark evaluates text models.