t2-bench
t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.
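To make "select, sequence, and use tools" concrete, here is a minimal illustrative sketch of the kind of multi-step tool-calling loop such benchmarks exercise. This is not the actual t2-bench harness; the tool registry, the `$PREV` chaining convention, and the toy task are all invented for illustration.

```python
# Illustrative sketch (NOT the actual t2-bench harness): a toy agent
# executes a tool-call plan where later calls consume earlier outputs.
from typing import Callable

# Hypothetical tool registry; a real harness exposes richer tool APIs.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_price": lambda item: {"widget": "4"}.get(item, "?"),
    "multiply": lambda args: str(int(args.split()[0]) * int(args.split()[1])),
}

def run_agent(plan: list[tuple[str, str]]) -> str:
    """Execute a fixed tool-call plan and return the final observation."""
    observation = ""
    for tool_name, raw_arg in plan:
        # "$PREV" splices the previous tool's output into the next call,
        # which is what makes the ordering of calls matter.
        arg = raw_arg.replace("$PREV", observation)
        observation = TOOLS[tool_name](arg)
    return observation

# Toy task: "What do 3 widgets cost?" -> look up the unit price,
# then multiply it by the quantity.
plan = [("lookup_price", "widget"), ("multiply", "$PREV 3")]
print(run_agent(plan))  # prints "12"
```

A benchmark like this scores whether the model produces a valid plan (right tools, right order, right arguments) and whether the final observation answers the task.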
Progress Over Time
[Interactive timeline showing model performance evolution on t2-bench, with a state-of-the-art frontier line and markers distinguishing open from proprietary models]
t2-bench Leaderboard
17 models • 0 verified
| Rank | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Google | — | 1.0M | $2.50 / $15.00 |
| 2 | Google | — | 1.0M | $0.50 / $3.00 |
| 3 | Zhipu AI | 744B | 200K | $1.00 / $3.20 |
| 4 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 5 | Google | — | — | — |
| 6 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 7 | DeepSeek | 685B | — | — |
| 8 | DeepSeek | 685B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 10 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 11 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 12 | Alibaba Cloud / Qwen Team | 27B | — | — |
| 13 | Alibaba Cloud / Qwen Team | 1.0T | 256K | $0.50 / $5.00 |
| 14 | LG AI Research | 236B | 33K | $0.60 / $1.00 |
| 15 | OpenAI | 117B | 131K | $0.10 / $0.50 |
| 16 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 17 | Alibaba Cloud / Qwen Team | 800M | — | — |
FAQ
Common questions about t2-bench
The t2-bench leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Gemini 3.1 Pro by Google leads with a score of 0.993. The average score across all models is 0.755.
The highest t2-bench score is 0.993, achieved by Gemini 3.1 Pro from Google.
17 models have been evaluated on the t2-bench benchmark, with 0 verified results and 17 self-reported results.
t2-bench is categorized under agents, reasoning, and tool calling. The benchmark evaluates text models.