Tau2 Telecom
The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a decentralized partially observable Markov decision process (Dec-POMDP): both the agent and the user can call tools in shared telecommunications troubleshooting scenarios, which tests coordination and communication capabilities.
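The "dual-control" setup above can be illustrated with a minimal sketch. This is not the actual τ²-bench API; all class, method, and tool names below are hypothetical, chosen only to show the key idea: some tools act on state only the agent controls (the carrier backend), others on state only the user controls (the handset), so the agent must instruct the user rather than fix everything itself.

```python
"""Illustrative sketch of a dual-control telecom episode (hypothetical names,
not the tau2-bench API): agent and user each hold tools that mutate a shared
environment, and success requires actions from both parties."""

from dataclasses import dataclass, field


@dataclass
class TelecomEnv:
    """Shared state visible to neither party in full (partial observability)."""
    airplane_mode: bool = True          # fault: user's handset is in airplane mode
    line_provisioned: bool = False      # fault: carrier has not provisioned the line
    log: list = field(default_factory=list)

    # Agent-side tool: acts on the carrier backend, which only the agent can reach.
    def provision_line(self) -> None:
        self.line_provisioned = True
        self.log.append("agent:provision_line")

    # User-side tool: acts on the handset, which only the user can touch.
    def toggle_airplane_mode(self) -> None:
        self.airplane_mode = not self.airplane_mode
        self.log.append("user:toggle_airplane_mode")

    def service_works(self) -> bool:
        return self.line_provisioned and not self.airplane_mode


def run_episode(env: TelecomEnv) -> bool:
    # Neither action alone restores service; coordination is required.
    env.provision_line()            # agent tool call on the backend
    env.toggle_airplane_mode()      # user follows the agent's instruction
    return env.service_works()


env = TelecomEnv()
assert run_episode(env)
```

The point of the split is that an agent scoring well here must communicate clearly enough for the simulated user to execute its side of the fix, not merely call its own tools correctly.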
Progress Over Time
Interactive timeline showing model performance evolution on Tau2 Telecom
Tau2 Telecom Leaderboard
28 models
| Rank | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Meituan | 560B | 128K | $0.30 / $1.20 | — |
| 1 | Anthropic | — | 1.0M | $5.00 / $25.00 | — |
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 | — |
| 4 | OpenAI | — | 400K | $1.75 / $14.00 | — |
| 5 | Anthropic | — | 200K | $5.00 / $25.00 | — |
| 6 | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 7 | Xiaomi | 1.0T | 1.0M | $1.00 / $3.00 | — |
| 8 | OpenAI | — | 400K | $1.25 / $10.00 | — |
| 9 | OpenAI | — | 400K | $1.25 / $10.00 | — |
| 9 | OpenAI | — | 400K | $1.25 / $10.00 | — |
| 9 | OpenAI | — | 400K | $1.25 / $10.00 | — |
| 12 | OpenAI | — | 400K | $0.75 / $4.50 | — |
| 13 | OpenAI | — | 400K | $0.20 / $1.25 | — |
| 14 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 14 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 16 | Meituan | 560B | 128K | $0.30 / $1.20 | — |
| 17 | Anthropic | — | 200K | $1.00 / $5.00 | — |
| 18 | Meituan | 560B | 128K | $0.30 / $1.20 | — |
| 19 | Meituan | 69B | 256K | $0.10 / $0.40 | — |
| 20 | Moonshot AI | 1.0T | — | — | — |
| 20 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | — |
| 22 | — | 120B | 262K | $0.10 / $0.50 | — |
| 23 | OpenAI | — | 200K | $2.00 / $8.00 | — |
| 24 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 | — |
| 25 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | — |
| 26 | — | 32B | 262K | $0.06 / $0.24 | — |
| 27 | OpenAI | — | 128K | $2.50 / $10.00 | — |
| 28 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | — |
FAQ
Common questions about Tau2 Telecom
What is Tau2 Telecom?
The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a decentralized partially observable Markov decision process (Dec-POMDP): both the agent and the user can call tools in shared telecommunications troubleshooting scenarios, which tests coordination and communication capabilities.

Where can I read the Tau2 Telecom paper?
The Tau2 Telecom paper is available at https://arxiv.org/abs/2506.07982. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Tau2 Telecom leaderboard?
The leaderboard ranks 28 AI models by their performance on this benchmark. LongCat-Flash-Thinking-2601 by Meituan currently leads with a score of 0.993; the average score across all models is 0.774.

What is the highest Tau2 Telecom score?
The highest Tau2 Telecom score is 0.993, achieved by LongCat-Flash-Thinking-2601 from Meituan.

How many models have been evaluated?
28 models have been evaluated on the Tau2 Telecom benchmark. All 28 results are self-reported; none have been independently verified.

What does Tau2 Telecom measure?
Tau2 Telecom is categorized under communication and reasoning, and it evaluates text models.