Tau2 Retail

Progress Over Time

[Figure: timeline of model performance on Tau2 Retail. Series: state-of-the-art frontier, open models, proprietary models.]

Tau2 Retail Leaderboard

26 models

Rank | Organization | Parameters | Context | Cost (input / output, USD per 1M tokens)
1 | - | - | 1.0M | $5.00 / $25.00
2 | - | - | 200K | $3.00 / $15.00
3 | - | - | - | -
4 | - | 560B | - | -
5 | - | - | 200K | $1.00 / $5.00
6 | OpenAI | - | 400K | $1.75 / $14.00
7 | OpenAI | - | - | -
8 | OpenAI | - | - | -
9 | - | - | - | -
10 | OpenAI | - | 400K | $1.25 / $10.00
10 | - | - | - | -
10 | - | - | 400K | $1.25 / $10.00
13 | - | - | - | -
14 | - | - | 1.0M | $0.30 / $2.50
15 | - | 69B | 256K | $0.10 / $0.40
16 | Alibaba Cloud / Qwen Team | 235B | - | -
17 | - | 560B | - | -
18 | Alibaba Cloud / Qwen Team | 235B | - | -
19 | - | 560B | - | -
20 | Moonshot AI | 1.0T | - | -
20 | - | 1.0T | - | -
22 | Alibaba Cloud / Qwen Team | 80B | - | -
23 | OpenAI | - | 128K | $2.50 / $10.00
24 | - | 120B | - | -
25 | Alibaba Cloud / Qwen Team | 80B | - | -
26 | - | 32B | 262K | $0.06 / $0.24
About this benchmark

What is Tau2 Retail?

The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

Tau2 Retail is a text benchmark that evaluates models on reasoning, communication, and tool calling tasks. LLM Stats tracks 26 models on this benchmark, scored on a 0–1 scale. The current average is 0.755, and the leader scores 0.919.

Current leaders

Claude Opus 4.6 from Anthropic currently leads the 26 models on the Tau2 Retail leaderboard with a score of 0.919.

1. Claude Opus 4.6 (Anthropic): 91.9%
2. Claude Sonnet 4.6 (Anthropic): 91.7%
3. Claude Opus 4.5 (Anthropic): 88.9%
4. LongCat-Flash-Thinking-2601 (top open-weight model): 88.6%

Source paper

Title: $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and 1 other
Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $τ^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $τ^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

FAQ

Common questions about the Tau2 Retail benchmark and leaderboard.

What is the Tau2 Retail benchmark?

The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

What is the Tau2 Retail leaderboard?

The Tau2 Retail leaderboard ranks 26 AI models based on their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 0.919. The average score across all models is 0.755.

What is the highest Tau2 Retail score?

The highest Tau2 Retail score is 0.919, achieved by Claude Opus 4.6 from Anthropic.

How many models are evaluated on Tau2 Retail?

26 models have been evaluated on the Tau2 Retail benchmark, with 0 verified results and 26 self-reported results.

Where can I find the Tau2 Retail paper?

The Tau2 Retail paper is available at https://arxiv.org/abs/2506.07982. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Tau2 Retail cover?

Tau2 Retail is categorized under reasoning, communication, and tool calling. The benchmark evaluates text models.

What is the best open-source model on Tau2 Retail?

LongCat-Flash-Thinking-2601 by Meituan is the top-ranked open-source model on Tau2 Retail, with a score of 0.886 (rank #4).

Which model offers the best value on Tau2 Retail?

Among models scoring within 10% of the leader, Claude Haiku 4.5 from Anthropic is the cheapest, at $1.00 per million input tokens with a score of 0.832.
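
The selection described above can be sketched in a few lines of Python. This is an illustration, not the leaderboard's actual implementation. It assumes "within 10% of the leader" means a relative threshold (score at least 90% of the top score), and the names `Entry` and `best_value` are hypothetical. The three entries use scores quoted on this page; the input prices for the top two models are read from the rank 1 and rank 2 leaderboard rows, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float       # benchmark score on the 0-1 scale
    input_cost: float  # USD per million input tokens


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry scoring within `tolerance` of the leader.

    Assumption: the tolerance is relative, so an entry qualifies when
    score >= top_score * (1 - tolerance).
    """
    top_score = max(e.score for e in entries)
    threshold = top_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= threshold]
    return min(eligible, key=lambda e: e.input_cost)


# Scores from this page; pairing of prices to models is assumed by rank.
entries = [
    Entry("Claude Opus 4.6", 0.919, 5.00),
    Entry("Claude Sonnet 4.6", 0.917, 3.00),
    Entry("Claude Haiku 4.5", 0.832, 1.00),
]

pick = best_value(entries)
print(pick.model, pick.score, pick.input_cost)
```

With a leader at 0.919, the relative threshold is about 0.827, so a score of 0.832 qualifies and the $1.00 entry is selected. Tightening `tolerance` to 0.05 raises the threshold to about 0.873 and excludes it.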

How recent are the Tau2 Retail leaderboard results?

The Tau2 Retail leaderboard was last updated in July 2026 and currently includes 26 evaluated models.