t2-bench

Progress Over Time

Interactive timeline showing model performance evolution on t2-bench

Legend: State-of-the-art frontier, Open, Proprietary

t2-bench Leaderboard

23 models
Rank  Organization               Parameters  Context  Cost
1     Google                     -           1.0M     $2.50 / $15.00
2     Google                     -           1.0M     $0.50 / $3.00
3     Zhipu AI                   744B        200K     $1.00 / $3.20
4     Alibaba Cloud / Qwen Team  397B        -        -
5     -                          31B         262K     $0.13 / $0.38
6     -                          25B         262K     $0.13 / $0.40
7     -                          -           -        -
8     Alibaba Cloud / Qwen Team  35B         -        -
9     -                          685B        -        -
9     -                          685B        -        -
11    -                          685B        -        -
12    Alibaba Cloud / Qwen Team  4B          -        -
13    Alibaba Cloud / Qwen Team  122B        -        -
14    Alibaba Cloud / Qwen Team  9B          -        -
15    Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40
16    Alibaba Cloud / Qwen Team  1.0T        -        -
17    LG AI Research             236B        -        -
18    -                          117B        131K     $0.10 / $0.50
19    -                          8B          -        -
20    -                          25B         -        -
21    Alibaba Cloud / Qwen Team  2B          -        -
22    -                          5B          -        -
23    Alibaba Cloud / Qwen Team  800M        -        -
About this benchmark

What is t2-bench?

t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

t2-bench is a text benchmark that evaluates models on reasoning, agent, and tool-calling tasks. LLM Stats tracks 23 models on this benchmark, scored on a 0–1 scale. The current average is 0.730, and the leader scores 0.993.


Current leaders

Gemini 3.1 Pro from Google currently leads the t2-bench leaderboard with a score of 0.993 across 23 evaluated AI models.

Rank  Model           Organization  Score
1     Gemini 3.1 Pro  Google        99.3%
2     Gemini 3 Flash  Google        90.2%
3     GLM-5           Zhipu AI      89.7%

FAQ

Common questions about the t2-bench benchmark and leaderboard.

What is the t2-bench benchmark?

t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

What is the t2-bench leaderboard?

The t2-bench leaderboard ranks 23 AI models based on their performance on this benchmark. Currently, Gemini 3.1 Pro by Google leads with a score of 0.993. The average score across all models is 0.730.

What is the highest t2-bench score?

The highest t2-bench score is 0.993, achieved by Gemini 3.1 Pro from Google.

How many models are evaluated on t2-bench?

23 models have been evaluated on the t2-bench benchmark, with 0 verified results and 23 self-reported results.

What categories does t2-bench cover?

t2-bench is categorized under reasoning, agents, and tool calling. The benchmark evaluates text models.

What is the best open-source model on t2-bench?

GLM-5 by Zhipu AI is the top-ranked open-source model on t2-bench, with a score of 0.897 (rank #3).

Which model offers the best value on t2-bench?

Among models scoring within 10% of the leader, Gemini 3 Flash from Google is the cheapest, at $0.50 per million input tokens with a score of 0.902.
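
The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the leaderboard's actual method: "within 10% of the leader" is read here as a relative cutoff (score at least 90% of the top score), and the input prices for Gemini 3.1 Pro and GLM-5 are inferred by matching the top three ranks to the first three leaderboard rows. The names `Entry` and `best_value` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float        # t2-bench score on the 0-1 scale
    input_price: float  # USD per million input tokens


# Scores are taken from the leaderboard text. The Gemini 3 Flash price is
# stated in the answer above; the other two prices are an assumption based
# on matching ranks 1 and 3 to the first and third leaderboard rows.
ENTRIES = [
    Entry("Gemini 3.1 Pro", 0.993, 2.50),
    Entry("Gemini 3 Flash", 0.902, 0.50),
    Entry("GLM-5", 0.897, 1.00),
]


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry scoring within `tolerance` of the leader.

    Assumption: the tolerance is relative, so the cutoff is
    leader_score * (1 - tolerance).
    """
    leader_score = max(e.score for e in entries)
    cutoff = leader_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= cutoff]
    return min(eligible, key=lambda e: e.input_price)


print(best_value(ENTRIES).model)  # prints "Gemini 3 Flash"
```

With a cutoff of 0.993 × 0.9 ≈ 0.894, all three entries qualify, so the result is decided by price alone. Reading the 10% as an absolute margin (0.993 − 0.10 = 0.893) gives the same answer for this data.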

How recent are the t2-bench leaderboard results?

The t2-bench leaderboard was last updated in July 2026 and currently includes 23 evaluated models.