Tau2 Retail
Tau2 Retail Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 1.0M | $5.00 / $25.00 | |
| 2 | Anthropic | — | 200K | $3.00 / $15.00 | |
| 3 | Anthropic | — | — | — | |
| 4 | Meituan | 560B | — | — | |
| 5 | Anthropic | — | 200K | $1.00 / $5.00 | |
| 6 | OpenAI | — | 400K | $1.75 / $14.00 | |
| 7 | OpenAI | — | — | — | |
| 8 | OpenAI | — | — | — | |
| 9 | Amazon | — | — | — | |
| 10 | OpenAI | — | 400K | $1.25 / $10.00 | |
| 10 | OpenAI | — | — | — | |
| 10 | OpenAI | — | 400K | $1.25 / $10.00 | |
| 13 | Amazon | — | — | — | |
| 14 | Amazon | — | 1.0M | $0.30 / $2.50 | |
| 15 | Meituan | 69B | 256K | $0.10 / $0.40 | |
| 16 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 17 | Meituan | 560B | — | — | |
| 18 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 19 | Meituan | 560B | — | — | |
| 20 | Moonshot AI | 1.0T | — | — | |
| 20 | Moonshot AI | 1.0T | — | — | |
| 22 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 23 | OpenAI | — | 128K | $2.50 / $10.00 | |
| 24 | | 120B | — | — | |
| 25 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 26 | | 32B | 262K | $0.06 / $0.24 | |
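The repeated ranks in the table (three entries at 10 followed by 13, and two at 20 followed by 22) are consistent with standard competition ranking, where tied entries share a rank and the next rank skips ahead. The following is a minimal sketch of that scheme, assuming ties are decided on exact score equality. The scores are hypothetical and not taken from the leaderboard, and the site's actual tie rule is not stated.

```python
def competition_ranks(scores):
    """Return (rank, score) pairs using standard competition ranking.

    Scores are sorted from highest to lowest. Equal scores share a rank,
    and the rank after a tie skips ahead by the size of the tie.
    """
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered, start=1):
        # Reuse the previous rank when this score equals the previous score.
        if ranks and score == ordered[position - 2]:
            ranks.append(ranks[-1])
        else:
            ranks.append(position)
    return list(zip(ranks, ordered))


# Hypothetical scores for illustration only.
example_scores = [0.80, 0.75, 0.80, 0.80, 0.70]
ranked = competition_ranks(example_scores)
for rank, score in ranked:
    print(rank, score)
```

Three entries tie for first place here, so the next entry is ranked fourth, mirroring the 10, 10, 10, 13 pattern in the table.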
What is Tau2 Retail?
The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment, where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.
Tau2 Retail is a text benchmark that evaluates models on reasoning, communication, and tool-calling tasks. LLM Stats tracks 26 models on this benchmark, scored on a 0–1 scale. The current average is about 0.8, and the leader scores 0.919.
Compare leaders on the related leaderboards: best AI for reasoning, best AI for communication, and best AI for tool calling.
Current leaders
Claude Opus 4.6 from Anthropic currently leads the Tau2 Retail leaderboard of 26 evaluated AI models, with a score of 0.919.
Source paper
- Title: τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
- Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and 1 other
- Published: arXiv:2506.07982
Abstract
Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce τ²-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, τ²-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
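To make the dual-control idea in the abstract concrete, here is a minimal sketch of a shared world that both an agent tool and a user tool can modify, in a technical-support setting. Every name in it (`SharedWorld`, `agent_resume_line`, `user_disable_airplane_mode`, `has_service`) is hypothetical and chosen for illustration. This is not the τ²-bench API, and it leaves out the paper's Dec-POMDP formulation and user simulator.

```python
from dataclasses import dataclass, field


@dataclass
class SharedWorld:
    """Hypothetical shared state that both participants can change."""
    line_suspended: bool = True   # only the agent's tool can fix this
    airplane_mode: bool = True    # only the user's tool can fix this
    log: list = field(default_factory=list)


def agent_resume_line(world: SharedWorld) -> None:
    """Agent-side tool: acts on the provider's side of the world."""
    world.line_suspended = False
    world.log.append("agent:resume_line")


def user_disable_airplane_mode(world: SharedWorld) -> None:
    """User-side tool: acts on the device, which the agent cannot reach."""
    world.airplane_mode = False
    world.log.append("user:disable_airplane_mode")


def has_service(world: SharedWorld) -> bool:
    """Success is judged on the final shared state, not on the dialogue."""
    return not world.line_suspended and not world.airplane_mode


world = SharedWorld()
agent_resume_line(world)
solved_by_agent_alone = has_service(world)   # the agent's action is not enough
user_disable_airplane_mode(world)            # the agent must guide the user here
solved_with_user = has_service(world)
print(solved_by_agent_alone, solved_with_user)
```

In this sketch the task cannot succeed unless the user also acts, which is the situation the abstract contrasts with single-control benchmarks where only the agent uses tools.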
FAQ
Common questions about the Tau2 Retail benchmark and leaderboard.