Tau2 Retail

The τ²-bench retail domain evaluates conversational AI agents in customer-service scenarios within a dual-control environment, where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer-support contexts.
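The dual-control setup can be pictured as a shared environment that exposes separate tools to the agent and to the simulated user. A minimal sketch, assuming hypothetical tool and class names (this is illustrative only, not the actual tau2-bench API):

```python
from dataclasses import dataclass, field

@dataclass
class RetailEnv:
    """Toy shared state: order status keyed by order id (hypothetical)."""
    orders: dict = field(default_factory=lambda: {"W100": "placed"})

    def agent_cancel_order(self, order_id: str) -> str:
        # Agent-side tool: the agent modifies shared state per policy.
        self.orders[order_id] = "cancelled"
        return f"order {order_id} cancelled"

    def user_check_status(self, order_id: str) -> str:
        # User-side tool: the simulated user reads the same shared state.
        return self.orders[order_id]

def run_episode(env: RetailEnv) -> list[str]:
    """Alternate user and agent turns; each side calls its own tools."""
    transcript = []
    transcript.append("user: " + env.user_check_status("W100"))    # user acts
    transcript.append("agent: " + env.agent_cancel_order("W100"))  # agent acts
    transcript.append("user: " + env.user_check_status("W100"))    # user verifies
    return transcript

print(run_episode(RetailEnv()))
```

The point of dual control is that the user's tool calls observe state the agent has changed, so the agent must coordinate its actions with the user rather than act unilaterally.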

Paper: https://arxiv.org/abs/2506.07982

Progress Over Time

[Interactive timeline showing model performance evolution on Tau2 Retail; legend: state-of-the-art frontier, open vs. proprietary models]

Tau2 Retail Leaderboard

23 models · Columns: Context window, Cost (input / output), License
[Leaderboard table omitted: the per-model rows (rank, model name, provider, parameter count, context window, and pricing) did not survive extraction. Providers represented include OpenAI, Anthropic, Alibaba Cloud / Qwen Team, and Moonshot AI.]

FAQ

Common questions about Tau2 Retail

What does Tau2 Retail evaluate?
The τ²-bench retail domain evaluates conversational AI agents in customer-service scenarios within a dual-control environment, where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer-support contexts.

Where can I read the Tau2 Retail paper?
The Tau2 Retail paper is available at https://arxiv.org/abs/2506.07982. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How does the Tau2 Retail leaderboard work?
The Tau2 Retail leaderboard ranks 23 AI models based on their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 0.919. The average score across all models is 0.752.

What is the highest Tau2 Retail score?
The highest Tau2 Retail score is 0.919, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated?
23 models have been evaluated on the Tau2 Retail benchmark, with 0 verified results and 23 self-reported results.

What categories does Tau2 Retail cover?
Tau2 Retail is categorized under communication, reasoning, and tool calling. The benchmark evaluates text models.
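The summary statistics quoted above (top score and average across models) are plain aggregates over the per-model scores. A minimal sketch, using a hypothetical score list (only the top score, 0.919, comes from this page):

```python
# Per-model Tau2 Retail scores; all values after the first are hypothetical.
scores = [0.919, 0.87, 0.81, 0.74, 0.55]

best = max(scores)                  # leaderboard top score
mean = sum(scores) / len(scores)    # leaderboard average

print(f"top score: {best:.3f}")
print(f"average over {len(scores)} models: {mean:.3f}")
```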