t2-bench
t2-bench Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Google | — | 1.0M | $2.50 / $15.00 | ||
| 2 | Google | — | 1.0M | $0.50 / $3.00 | ||
| 3 | Zhipu AI | 744B | 200K | $1.00 / $3.20 | ||
| 4 | Alibaba Cloud / Qwen Team | 397B | — | — | ||
| 5 | Google | 31B | 262K | $0.13 / $0.38 | ||
| 6 | Google | 25B | 262K | $0.13 / $0.40 | ||
| 7 | Google | — | — | — | ||
| 8 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 9 | DeepSeek | 685B | — | — | ||
| 9 | DeepSeek | 685B | — | — | ||
| 11 | DeepSeek | 685B | — | — | ||
| 12 | Alibaba Cloud / Qwen Team | 4B | — | — | ||
| 13 | Alibaba Cloud / Qwen Team | 122B | — | — | ||
| 14 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 15 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 16 | Alibaba Cloud / Qwen Team | 1.0T | — | — | ||
| 17 | LG AI Research | 236B | — | — | ||
| 18 | OpenAI | 117B | 131K | $0.10 / $0.50 | ||
| 19 | Google | 8B | — | — | ||
| 20 | Google | 25B | — | — | ||
| 21 | Alibaba Cloud / Qwen Team | 2B | — | — | ||
| 22 | Google | 5B | — | — | ||
| 23 | Alibaba Cloud / Qwen Team | 800M | — | — | ||
What is t2-bench?
t2-bench is a benchmark for evaluating agentic tool-use capabilities: it measures how well models select, sequence, and use tools to solve complex tasks, and it tests autonomous planning and execution in multi-step scenarios.
t2-bench is a text benchmark covering reasoning, agent, and tool-calling tasks. LLM Stats tracks 23 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leader scores 0.993.
Current leaders
Among the 23 evaluated AI models, Gemini 3.1 Pro from Google currently leads the t2-bench leaderboard with a score of 0.993.