
TAU-bench Retail

TAU-bench Retail is a benchmark for evaluating tool-agent-user interaction in retail environments. It tests a language agent's ability to handle dynamic conversations with users while calling domain-specific API tools and following policy guidelines, evaluating agents on tasks such as order cancellations, address changes, and order status checks carried out over multi-turn conversations.
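To make the interaction pattern concrete, below is a minimal sketch of the tool-agent-user loop the benchmark exercises. This is not TAU-bench's actual harness or API: the order database, the tool names (`get_order_status`, `cancel_order`), and the keyword-matching "agent" are hypothetical stand-ins for illustration only.

```python
# Illustrative tool-agent-user loop. The tools, data, and agent logic
# below are hypothetical stand-ins, not TAU-bench's actual API.

# A toy order database standing in for the retail environment's state.
ORDERS = {"W123": {"status": "pending", "address": "1 Main St"}}

def get_order_status(order_id: str) -> str:
    return ORDERS[order_id]["status"]

def cancel_order(order_id: str) -> str:
    # Policy guideline encoded as a hard check: only pending orders
    # may be cancelled.
    if ORDERS[order_id]["status"] != "pending":
        return "refused: order has already shipped"
    ORDERS[order_id]["status"] = "cancelled"
    return "cancelled"

TOOLS = {"get_order_status": get_order_status, "cancel_order": cancel_order}

def agent(user_msg: str):
    """Toy stand-in for the LLM agent: pick a tool for the user's request.

    The real agent is a language model that decides, turn by turn,
    whether to reply to the user or call a domain API tool.
    """
    if "cancel" in user_msg.lower():
        return "cancel_order", "W123"
    return "get_order_status", "W123"

def run_episode(user_turns):
    # The simulated user drives a multi-turn conversation; on each turn
    # the agent responds with a tool call against the environment.
    for user_msg in user_turns:
        tool_name, arg = agent(user_msg)
        result = TOOLS[tool_name](arg)
        print(f"user: {user_msg!r} -> {tool_name}({arg!r}) = {result}")

if __name__ == "__main__":
    run_episode(["What's the status of my order?", "Please cancel it."])
```

In the benchmark itself, the user is also simulated by a language model, and an episode is graded against the environment's ground-truth goal state (per the paper linked below), which is what makes tool use and policy compliance measurable.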

Paper: https://arxiv.org/abs/2406.12045

Progress Over Time

[Interactive timeline showing model performance evolution on TAU-bench Retail; legend: state-of-the-art frontier, open, proprietary]

TAU-bench Retail Leaderboard

25 models
Rank | Model | Organization | Params | Context | Cost (input / output)
1 | Claude Sonnet 4.5 | Anthropic | — | 200K | $3.00 / $15.00
2 | — | — | — | 200K | $15.00 / $75.00
3 | — | Anthropic | — | 200K | $15.00 / $75.00
4 | — | — | — | 200K | $3.00 / $15.00
5 | — | — | — | 200K | $3.00 / $15.00
6 | — | Zhipu AI | 355B | 131K | $0.40 / $1.60
7 | — | Zhipu AI | 106B | — | —
8 | — | Alibaba Cloud / Qwen Team | 480B | — | —
9 | — | OpenAI | — | 200K | $1.10 / $4.40
10 | — | OpenAI | — | 200K | $15.00 / $60.00
11 | — | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50
12 | — | — | — | 200K | $3.00 / $15.00
13 | — | OpenAI | — | 128K | $75.00 / $150.00
14 | — | OpenAI | — | 1.0M | $2.00 / $8.00
15 | — | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00
15 | — | — | 117B | 131K | $0.09 / $0.45
15 | — | — | 456B | — | —
18 | — | — | 456B | 1.0M | $0.55 / $2.20
19 | — | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50
20 | — | OpenAI | — | 128K | $2.50 / $10.00
21 | — | OpenAI | — | 200K | $1.10 / $4.40
22 | — | — | — | 1.0M | $0.40 / $1.60
23 | — | — | 21B | 131K | $0.05 / $0.20
24 | — | — | — | 200K | $0.80 / $4.00
25 | — | — | — | 1.0M | $0.10 / $0.40
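The Cost column lists input / output prices. Assuming these are the usual USD-per-million-token rates (the page does not state the unit explicitly), estimating what a run costs is simple arithmetic; the token counts in the sketch below are made-up placeholders, not measured TAU-bench usage.

```python
# Back-of-the-envelope cost estimate, assuming prices are USD per 1M tokens.
def run_cost(input_price: float, output_price: float,
             input_tokens: int, output_tokens: int) -> float:
    return (input_price * input_tokens + output_price * output_tokens) / 1_000_000

# Example with the rank-1 pricing ($3.00 in / $15.00 out) and hypothetical
# token counts for one multi-turn task: 40k prompt tokens, 4k completion tokens.
print(f"${run_cost(3.00, 15.00, 40_000, 4_000):.2f} per task")  # -> $0.18 per task
```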

FAQ

Common questions about TAU-bench Retail

What is TAU-bench Retail?
TAU-bench Retail is a benchmark for evaluating tool-agent-user interaction in retail environments. It tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines, evaluating them on tasks such as order cancellations, address changes, and order status checks over multi-turn conversations.

Where can I find the TAU-bench Retail paper?
The TAU-bench Retail paper is available at https://arxiv.org/abs/2406.12045. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the TAU-bench Retail leaderboard?
The leaderboard ranks 25 AI models by their performance on this benchmark. Claude Sonnet 4.5 by Anthropic currently leads with a score of 0.862; the average score across all models is 0.678.

What is the highest TAU-bench Retail score?
The highest TAU-bench Retail score is 0.862, achieved by Claude Sonnet 4.5 from Anthropic.

How many models have been evaluated on TAU-bench Retail?
25 models have been evaluated on the benchmark. All 25 results are self-reported; none have been independently verified.

What categories does TAU-bench Retail fall under?
TAU-bench Retail is categorized under communication, reasoning, and tool calling. The benchmark evaluates text models.