Tau2 Telecom

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both the agent and the user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.
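The dual-control idea can be sketched in a few lines: the agent and the user each hold a disjoint set of tools acting on a shared environment state, so the task only resolves if the agent coordinates the user's actions as well as its own. This is a minimal illustrative sketch, not the benchmark's actual API; all names (`SharedState`, `AGENT_TOOLS`, `provision_line`, etc.) are hypothetical.

```python
# Illustrative sketch of a dual-control loop in the spirit of the
# tau^2-bench telecom domain. All names are hypothetical, not the
# benchmark's real API.
from dataclasses import dataclass, field

@dataclass
class SharedState:
    """Environment state that both parties can modify via their tools."""
    airplane_mode: bool = True        # only the user's device tools can change this
    line_provisioned: bool = False    # only the agent's backend tools can change this
    log: list = field(default_factory=list)

# Tools are split by role: neither side can perform the other's actions,
# which is what forces coordination through conversation.
AGENT_TOOLS = {
    "provision_line": lambda s: setattr(s, "line_provisioned", True),
}
USER_TOOLS = {
    "disable_airplane_mode": lambda s: setattr(s, "airplane_mode", False),
}

def run_tool(state: SharedState, tools: dict, name: str) -> None:
    tools[name](state)
    state.log.append(name)

def resolved(state: SharedState) -> bool:
    # Connectivity is restored only when *both* sides have acted.
    return state.line_provisioned and not state.airplane_mode

state = SharedState()
run_tool(state, AGENT_TOOLS, "provision_line")        # agent acts on the backend
run_tool(state, USER_TOOLS, "disable_airplane_mode")  # user acts on the device
print(resolved(state))  # True
```

Because the reward depends on the user's actions too, an agent that merely executes its own tools correctly can still fail; it must also instruct the user clearly, which is the coordination skill the benchmark isolates.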


Progress Over Time

Interactive timeline showing model performance evolution on Tau2 Telecom


Tau2 Telecom Leaderboard

28 models
| Rank | Organization | Params | Context | Cost ($ / 1M tokens, in / out) |
|------|--------------|--------|---------|--------------------------------|
| 1 | — | 560B | 128K | $0.30 / $1.20 |
| 1 | — | — | 1.0M | $5.00 / $25.00 |
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 |
| 4 | OpenAI | — | 400K | $1.75 / $14.00 |
| 5 | — | — | 200K | $5.00 / $25.00 |
| 6 | — | — | 200K | $3.00 / $15.00 |
| 7 | — | 1.0T | 1.0M | $1.00 / $3.00 |
| 8 | OpenAI | — | 400K | $1.25 / $10.00 |
| 9 | — | — | 400K | $1.25 / $10.00 |
| 9 | — | — | 400K | $1.25 / $10.00 |
| 9 | OpenAI | — | 400K | $1.25 / $10.00 |
| 12 | — | — | 400K | $0.75 / $4.50 |
| 13 | — | — | 400K | $0.20 / $1.25 |
| 14 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 14 | — | 230B | 1.0M | $0.30 / $1.20 |
| 16 | — | 560B | 128K | $0.30 / $1.20 |
| 17 | — | — | 200K | $1.00 / $5.00 |
| 18 | — | 560B | 128K | $0.30 / $1.20 |
| 19 | — | 69B | 256K | $0.10 / $0.40 |
| 20 | — | 1.0T | — | — |
| 20 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 22 | — | 120B | 262K | $0.10 / $0.50 |
| 23 | OpenAI | — | 200K | $2.00 / $8.00 |
| 24 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 |
| 25 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 26 | — | 32B | 262K | $0.06 / $0.24 |
| 27 | OpenAI | — | 128K | $2.50 / $10.00 |
| 28 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |

FAQ

Common questions about Tau2 Telecom

What is Tau2 Telecom?
The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both the agent and the user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.

Where can I find the Tau2 Telecom paper?
The Tau2 Telecom paper is available at https://arxiv.org/abs/2506.07982. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Tau2 Telecom leaderboard?
The Tau2 Telecom leaderboard ranks 28 AI models by their performance on this benchmark. Currently, LongCat-Flash-Thinking-2601 by Meituan leads with a score of 0.993. The average score across all models is 0.774.

What is the highest Tau2 Telecom score?
The highest Tau2 Telecom score is 0.993, achieved by LongCat-Flash-Thinking-2601 from Meituan.

How many models have been evaluated on Tau2 Telecom?
28 models have been evaluated on the Tau2 Telecom benchmark, with 0 verified results and 28 self-reported results.

What categories does Tau2 Telecom cover?
Tau2 Telecom is categorized under communication and reasoning. The benchmark evaluates text models.