Tau2 Telecom

Tau2 Telecom Leaderboard

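35 models

Rank | Organization | Parameters | Context | Cost (input / output per 1M tokens)
1 | - | 560B | - | -
1 | - | - | 1.0M | $5.00 / $25.00
3 | OpenAI | - | 1.0M | $2.50 / $15.00
4 | OpenAI | - | 400K | $1.75 / $14.00
5 | - | - | - | -
6 | OpenAI | - | 1.1M | $5.00 / $30.00
7 | - | - | 200K | $3.00 / $15.00
8 | - | 1.0T | - | -
9 | OpenAI | - | - | -
10 | - | - | 400K | $1.25 / $10.00
10 | - | - | - | -
10 | OpenAI | - | 400K | $1.25 / $10.00
13 | - | - | 400K | $0.75 / $4.50
14 | - | - | - | -
15 | - | - | 400K | $0.20 / $1.25
16 | - | - | - | -
17 | - | 230B | 1.0M | $0.30 / $1.20
17 | MiniMax | 230B | 1.0M | $0.30 / $1.20
19 | - | 218B | - | -
20 | - | 560B | - | -
21 | - | - | 200K | $1.00 / $5.00
22 | - | - | - | -
23 | - | - | 1.0M | $0.30 / $2.50
24 | - | 560B | - | -
25 | - | 69B | 256K | $0.10 / $0.40
26 | - | - | - | -
27 | - | 1.0T | - | -
27 | Moonshot AI | 1.0T | - | -
29 | - | 120B | - | -
30 | OpenAI | - | - | -
31 | Alibaba Cloud / Qwen Team | 235B | - | -
32 | Alibaba Cloud / Qwen Team | 80B | - | -
33 | - | 32B | 262K | $0.06 / $0.24
34 | OpenAI | - | 128K | $2.50 / $10.00
35 | Alibaba Cloud / Qwen Team | 80B | - | -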
About this benchmark

What is Tau2 Telecom?

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP (decentralized partially observable Markov decision process). Both the agent and the user call tools in shared telecommunications troubleshooting scenarios, which tests the agent's coordination and communication capabilities.
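To make "dual control" concrete, here is a minimal toy sketch in Python. It is not the benchmark's actual environment: the `World` class, the two faults, and the tool names are all hypothetical, chosen only to show that neither party can restore service alone and that each sees only part of the shared state.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Hypothetical shared state with one device-side and one account-side fault."""
    airplane_mode: bool = True    # device-side: only the user can change it
    line_suspended: bool = True   # account-side: only the agent can change it

    def has_service(self) -> bool:
        return not self.airplane_mode and not self.line_suspended


def agent_resume_line(world: World) -> str:
    """Agent-side tool (hypothetical): acts on the backend, not on the phone."""
    world.line_suspended = False
    return "line resumed"


def user_disable_airplane_mode(world: World) -> str:
    """User-side tool (hypothetical): acts on the phone, not on the backend."""
    world.airplane_mode = False
    return "airplane mode off"


def agent_observation(world: World) -> dict:
    """The agent sees account data only (partial observability)."""
    return {"line_suspended": world.line_suspended}


def user_observation(world: World) -> dict:
    """The user sees the device only (partial observability)."""
    return {"airplane_mode": world.airplane_mode,
            "has_service": world.has_service()}


world = World()
agent_resume_line(world)
service_after_agent_only = world.has_service()   # False: the phone is still offline
user_disable_airplane_mode(world)                # the agent must talk the user into this
service_after_both = world.has_service()         # True: both sides acted
```

In this toy setup the agent's own tool call is not enough; it has to communicate the remaining step to the user, which is the coordination skill the benchmark description refers to.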

Tau2 Telecom is a text benchmark evaluating models on reasoning, communication, and tool calling tasks. LLM Stats tracks 35 models on this benchmark, scored on a 0–1 scale. The current average is 0.789, with the leader at 0.993.

Current leaders

Of the 35 evaluated AI models, LongCat-Flash-Thinking-2601 from Meituan currently leads the Tau2 Telecom leaderboard with a score of 0.993.

Rank 1: Claude Opus 4.6 (Anthropic), 99.3%
Rank 3: GPT-5.4 (OpenAI), 98.9%

Source paper

Title
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Authors
Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and 1 other
Published
Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce τ²-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, τ²-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
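The compositional task generator mentioned in the abstract can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the fault names, the `HEALTHY_STATE` dictionary, and the use of the fault count as a complexity measure are all hypothetical.

```python
from itertools import combinations

# Hypothetical atomic components: each fault sets one state key to a broken value.
ATOMIC_FAULTS = {
    "airplane_mode_on": ("airplane_mode", True),
    "data_disabled": ("mobile_data", False),
    "line_suspended": ("line_active", False),
}

# Hypothetical target state; a task counts as solved once the world matches it.
HEALTHY_STATE = {"airplane_mode": False, "mobile_data": True, "line_active": True}


def generate_tasks(max_faults: int):
    """Yield one task per combination of 1..max_faults atomic faults."""
    for k in range(1, max_faults + 1):
        for names in combinations(sorted(ATOMIC_FAULTS), k):
            state = dict(HEALTHY_STATE)
            for name in names:
                key, broken_value = ATOMIC_FAULTS[name]
                state[key] = broken_value
            # Complexity is controlled by how many faults are composed.
            yield {"faults": names, "initial_state": state, "complexity": k}


def is_solved(state: dict) -> bool:
    """Verifiable success check: compare the final state with the target."""
    return state == HEALTHY_STATE


tasks = list(generate_tasks(max_faults=2))
```

With three atomic faults and at most two per task, the sketch enumerates every single fault and every pair, and each generated task can be checked automatically by comparing the final state against the healthy one.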

FAQ

Common questions about the Tau2 Telecom benchmark and leaderboard.

What is the Tau2 Telecom benchmark?

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP (decentralized partially observable Markov decision process). Both the agent and the user call tools in shared telecommunications troubleshooting scenarios, which tests the agent's coordination and communication capabilities.

What is the Tau2 Telecom leaderboard?

The Tau2 Telecom leaderboard ranks 35 AI models based on their performance on this benchmark. Currently, LongCat-Flash-Thinking-2601 by Meituan leads with a score of 0.993. The average score across all models is 0.789.

What is the highest Tau2 Telecom score?

The highest Tau2 Telecom score is 0.993, achieved by LongCat-Flash-Thinking-2601 from Meituan.

How many models are evaluated on Tau2 Telecom?

35 models have been evaluated on the Tau2 Telecom benchmark. All 35 results are self-reported; none have been verified.

Where can I find the Tau2 Telecom paper?

The Tau2 Telecom paper is available at https://arxiv.org/abs/2506.07982. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Tau2 Telecom cover?

Tau2 Telecom is categorized under reasoning, communication, and tool calling. The benchmark evaluates text models.

What is the best open-source model on Tau2 Telecom?

LongCat-Flash-Thinking-2601 by Meituan is the top-ranked open-source model on Tau2 Telecom, with a score of 0.993 (rank #1).

Which model offers the best value on Tau2 Telecom?

Among models scoring within 10% of the leader, GPT-5.4 nano from OpenAI is the cheapest, at $0.20 per million input tokens with a score of 0.925.
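The selection rule in this answer can be sketched in Python as follows. The sketch assumes that "within 10% of the leader" means a score of at least 90% of the top score, which the page does not spell out. The `Entry` class and `best_value` function are hypothetical names, and the two entries marked hypothetical are made-up rows for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    name: str
    score: float                   # 0-1 scale
    input_price: Optional[float]   # USD per million input tokens; None if not listed


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    leader_score = max(e.score for e in entries)
    cutoff = leader_score * (1 - tolerance)   # assumption: relative, not absolute, margin
    eligible = [e for e in entries
                if e.score >= cutoff and e.input_price is not None]
    return min(eligible, key=lambda e: e.input_price, default=None)


entries = [
    Entry("LongCat-Flash-Thinking-2601", 0.993, None),  # score from this page; price left out
    Entry("GPT-5.4 nano", 0.925, 0.20),                 # score and price from this page
    Entry("hypothetical-premium", 0.950, 3.00),         # hypothetical row
    Entry("hypothetical-budget", 0.700, 0.06),          # hypothetical row, below the cutoff
]

winner = best_value(entries)
print(winner.name if winner else "no priced model within tolerance")
```

Note that the cheapest model overall is not necessarily the winner: the budget row above is cheaper but falls below the score cutoff, so it is filtered out before prices are compared.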

How recent are the Tau2 Telecom leaderboard results?

The Tau2 Telecom leaderboard was last updated in July 2026 and currently includes 35 evaluated models.