Tau2 Telecom
Tau2 Telecom Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Meituan | 560B | — | — | ||
| 1 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 | ||
| 4 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 5 | Anthropic | — | — | — | ||
| 6 | OpenAI | — | 1.1M | $5.00 / $30.00 | ||
| 7 | Anthropic | — | 200K | $3.00 / $15.00 | ||
| 8 | Xiaomi | 1.0T | — | — | ||
| 9 | OpenAI | — | — | — | ||
| 10 | OpenAI | — | 400K | $1.25 / $10.00 | ||
| 10 | OpenAI | — | — | — | ||
| 10 | OpenAI | — | 400K | $1.25 / $10.00 | ||
| 13 | OpenAI | — | 400K | $0.75 / $4.50 | ||
| 14 | Amazon | — | — | — | ||
| 15 | OpenAI | — | 400K | $0.20 / $1.25 | ||
| 16 | Meta | — | — | — | ||
| 17 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 17 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 19 | Cohere | 218B | — | — | ||
| 20 | Meituan | 560B | — | — | ||
| 21 | Anthropic | — | 200K | $1.00 / $5.00 | ||
| 22 | Amazon | — | — | — | ||
| 23 | Amazon | — | 1.0M | $0.30 / $2.50 | ||
| 24 | Meituan | 560B | — | — | ||
| 25 | Meituan | 69B | 256K | $0.10 / $0.40 | ||
| 26 | Microsoft | — | — | — | ||
| 27 | Moonshot AI | 1.0T | — | — | ||
| 27 | Moonshot AI | 1.0T | — | — | ||
| 29 | | 120B | — | — | ||
| 30 | OpenAI | — | — | — | ||
| 31 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 32 | Alibaba Cloud / Qwen Team | 80B | — | — | ||
| 33 | | 32B | 262K | $0.06 / $0.24 | ||
| 34 | OpenAI | — | 128K | $2.50 / $10.00 | ||
| 35 | Alibaba Cloud / Qwen Team | 80B | — | — |
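The cost cells above list two dollar figures per model, but the page does not state the unit. As a rough sketch, assuming the pair means input price and output price per one million tokens (an assumption, not something the table confirms), the cost of a workload could be estimated like this. The function name and the token counts are hypothetical.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Estimate dollar cost of a workload.

    ASSUMPTION: both prices are quoted per 1,000,000 tokens, with the
    first figure applying to input and the second to output tokens.
    """
    per_million = 1_000_000
    input_cost = input_tokens / per_million * input_price
    output_cost = output_tokens / per_million * output_price
    return input_cost + output_cost


# Hypothetical workload priced at "$5.00 / $25.00":
# 200,000 input tokens and 50,000 output tokens.
cost = estimate_cost(200_000, 50_000, 5.00, 25.00)
print(f"${cost:.2f}")
```

Under that assumption, output tokens dominate the bill for any model whose second figure is several times the first, so verbose agents cost more than the input price alone suggests.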
What is Tau2 Telecom?
The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP. Both the agent and the user use tools in shared telecommunications troubleshooting scenarios, which tests coordination and communication capabilities.
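To make "dual-control" concrete, here is a toy sketch of a shared world in which the agent and the user each hold a tool the other lacks. It is not the benchmark's environment or API; the class, the tool names, and the roaming/airplane-mode scenario are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class SharedWorld:
    """Toy shared state (hypothetical): one part lives on the carrier
    side, the other on the user's phone."""
    roaming_enabled: bool = False  # only the agent's tool changes this
    airplane_mode: bool = True     # only the user's tool changes this

    def has_service(self) -> bool:
        # Service works only when both sides are in the right state.
        return self.roaming_enabled and not self.airplane_mode


def agent_enable_roaming(world: SharedWorld) -> str:
    """Agent-side tool: acts on the account."""
    world.roaming_enabled = True
    return "roaming enabled on the account"


def user_toggle_airplane_mode(world: SharedWorld) -> str:
    """User-side tool: acts on the device, which the agent cannot touch."""
    world.airplane_mode = not world.airplane_mode
    state = "on" if world.airplane_mode else "off"
    return f"airplane mode is now {state}"


world = SharedWorld()
agent_enable_roaming(world)
solved_by_agent_alone = world.has_service()  # False: phone is still offline

# The agent has to ask the user to act; the user's tool changes the world.
user_toggle_airplane_mode(world)
solved_with_user = world.has_service()       # True
print(solved_by_agent_alone, solved_with_user)
```

In this sketch the agent cannot read `airplane_mode` directly and would have to learn it through conversation, which is the partial-observability aspect the Dec-POMDP framing refers to.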
Tau2 Telecom is a text benchmark that evaluates models on reasoning, communication, and tool-calling tasks. LLM Stats tracks 35 models on this benchmark, scored on a 0–1 scale. The average score is about 0.8, and the top score is 0.993 (1.0 when rounded to one decimal place).
Current leaders
LongCat-Flash-Thinking-2601 from Meituan currently leads the Tau2 Telecom leaderboard with a score of 0.993 across 35 evaluated AI models.
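The leaderboard's rank numbers repeat and then skip (1, 1, 3 and 10, 10, 10, 13), which is consistent with standard competition ranking: tied scores share a rank and the following ranks are skipped. The page does not state its ranking rule, so the sketch below is an assumption about how such ranks could be produced, with made-up scores.

```python
def competition_ranks(scores: list[float]) -> list[int]:
    """Return a rank for each score, highest score first.

    Tied scores share the best rank among them, and the ranks that
    would have been used by the ties are skipped (1, 1, 3, ...).
    """
    ordered = sorted(scores, reverse=True)
    first_position: dict[float, int] = {}
    for position, score in enumerate(ordered, start=1):
        # Record only the first (best) position at which a score appears.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Hypothetical scores, not taken from the leaderboard.
example = [0.99, 0.99, 0.98, 0.95]
print(competition_ranks(example))  # [1, 1, 3, 4]
```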
Source paper
- Title: $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
- Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and 1 other
- arXiv: 2506.07982
Abstract
Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $τ^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $τ^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
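The second contribution in the abstract, composing tasks from atomic components, can be illustrated with a small sketch. This is not the paper's generator: the issue names, the actor labels, and the success check are hypothetical, chosen only to show how combining atomic pieces yields verifiable tasks of controlled complexity.

```python
from itertools import combinations

# Hypothetical atomic issues: name -> (who must act, required fix action).
ATOMIC_ISSUES = {
    "airplane_mode_on": ("user", "turn_off_airplane_mode"),
    "roaming_disabled": ("agent", "enable_roaming"),
    "data_limit_reached": ("agent", "add_data_allowance"),
}


def generate_tasks(max_issues: int) -> list[dict]:
    """Build one task per combination of up to `max_issues` atomic issues."""
    tasks = []
    names = sorted(ATOMIC_ISSUES)
    for k in range(1, max_issues + 1):
        for combo in combinations(names, k):
            tasks.append({
                "issues": combo,
                "required_actions": {ATOMIC_ISSUES[n][1] for n in combo},
                "complexity": k,  # number of issues composed into the task
            })
    return tasks


def is_solved(task: dict, actions_taken: list[str]) -> bool:
    """A task counts as solved once every required action was taken."""
    return task["required_actions"] <= set(actions_taken)


tasks = generate_tasks(max_issues=2)
print(len(tasks))  # 3 single-issue tasks + 3 two-issue tasks = 6
```

Because each task records exactly which actions resolve it, success can be checked programmatically, and the `complexity` field gives a direct handle on difficulty.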