t2-bench

t2-bench is a benchmark for evaluating agentic tool use, measuring how well models can select, sequence, and use tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.
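The loop below is a minimal sketch of the kind of episode such a benchmark exercises: at each step the model either answers the user or calls a tool, and the harness executes the call and feeds the result back. The two domain tools and the `model.next_action` client interface are hypothetical placeholders for illustration, not t2-bench's actual harness API.

```python
import json

def get_order(order_id: str) -> dict:
    """Hypothetical domain tool: look up an order."""
    return {"order_id": order_id, "status": "shipped"}

def cancel_order(order_id: str) -> dict:
    """Hypothetical domain tool: cancel an order."""
    return {"order_id": order_id, "status": "cancelled"}

TOOLS = {"get_order": get_order, "cancel_order": cancel_order}

def run_episode(model, task: str, max_steps: int = 10) -> list:
    """Drive the model until it answers the user or the step budget runs out."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Assumed client interface: returns either a user-facing message
        # or a tool call with a name and arguments.
        action = model.next_action(transcript, tools=list(TOOLS))
        if action["type"] == "message":
            # Model chose to answer the user: the episode ends here.
            transcript.append({"role": "assistant", "content": action["content"]})
            break
        # Model chose a tool: execute it with the model-supplied arguments
        # and append the result so the next step can build on it.
        tool = TOOLS[action["name"]]
        result = tool(**action["arguments"])
        transcript.append({"role": "tool", "name": action["name"],
                           "content": json.dumps(result)})
    return transcript
```

Scoring then typically compares the final environment state and transcript against the task's ground truth; the details of that check are benchmark-specific.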

Progress Over Time

[Interactive timeline: model performance evolution on t2-bench, showing the state-of-the-art frontier for open and proprietary models]

t2-bench Leaderboard

17 models • 0 verified
Rank  Model           Developer                  Params  Context  Cost (in / out)
1     Gemini 3.1 Pro  Google                     –       1.0M     $2.50 / $15.00
2     –               –                          –       1.0M     $0.50 / $3.00
3     –               Zhipu AI                   744B    200K     $1.00 / $3.20
4     –               Alibaba Cloud / Qwen Team  397B    262K     $0.60 / $3.60
5     –               –                          –       –        –
6     –               Alibaba Cloud / Qwen Team  35B     262K     $0.25 / $2.00
7     –               –                          685B    –        –
8     –               –                          685B    –        –
9     –               Alibaba Cloud / Qwen Team  4B      –        –
10    –               Alibaba Cloud / Qwen Team  122B    262K     $0.40 / $3.20
11    –               Alibaba Cloud / Qwen Team  9B      –        –
12    –               Alibaba Cloud / Qwen Team  27B     –        –
13    –               Alibaba Cloud / Qwen Team  1.0T    256K     $0.50 / $5.00
14    –               LG AI Research             236B    33K      $0.60 / $1.00
15    –               –                          117B    131K     $0.10 / $0.50
16    –               Alibaba Cloud / Qwen Team  2B      –        –
17    –               Alibaba Cloud / Qwen Team  800M    –        –
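As a worked example of reading the cost columns: assuming they follow the common leaderboard convention of USD per 1M tokens (the page does not state the unit, so this is an assumption), the price of one multi-step episode follows directly from its token counts.

```python
def episode_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost of one episode, assuming prices are USD per 1M tokens."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Rank-1 pricing ($2.50 in / $15.00 out) on a 20k-input, 2k-output episode:
print(episode_cost(20_000, 2_000, 2.50, 15.00))  # ~0.08 USD
```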

FAQ

Common questions about t2-bench

What is t2-bench?
t2-bench is a benchmark for evaluating agentic tool use, measuring how well models can select, sequence, and use tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

Which model leads the t2-bench leaderboard?
The t2-bench leaderboard ranks 17 AI models by their performance on this benchmark. Gemini 3.1 Pro by Google currently leads with a score of 0.993; the average score across all models is 0.755.

What is the highest t2-bench score?
The highest t2-bench score is 0.993, achieved by Gemini 3.1 Pro from Google.

How many models have been evaluated on t2-bench?
17 models have been evaluated on the t2-bench benchmark, with 0 verified results and 17 self-reported results.

What categories does t2-bench cover?
t2-bench is categorized under agents, reasoning, and tool calling. The benchmark evaluates text models.