
TAU-bench Retail

A benchmark for evaluating tool-agent-user interaction in retail environments. It tests language agents' ability to handle dynamic conversations with simulated users while calling domain-specific API tools and following policy guidelines, on tasks such as order cancellations, address changes, and order status checks across multi-turn conversations.
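The interaction pattern described above can be sketched as a minimal tool-dispatch loop. This is an illustrative example, not the official TAU-bench harness: the tool names (`get_order_status`, `cancel_order`), the order data, and the policy rule are hypothetical stand-ins for the benchmark's retail API and policy guidelines.

```python
# Illustrative sketch of a tool-agent-user loop (NOT the official
# TAU-bench harness). Tool names and data below are hypothetical.

# Mock retail database the tools operate on.
ORDERS = {"W1001": {"status": "pending", "address": "12 Elm St"}}

def get_order_status(order_id: str) -> str:
    """Read-only tool: report an order's current status."""
    order = ORDERS.get(order_id)
    return order["status"] if order else "not_found"

def cancel_order(order_id: str) -> str:
    """Mutating tool with a policy guardrail:
    only pending orders may be cancelled."""
    order = ORDERS.get(order_id)
    if order and order["status"] == "pending":
        order["status"] = "cancelled"
        return "cancelled"
    return "rejected"

# Registry mapping tool names (as an agent would emit them) to functions.
TOOLS = {"get_order_status": get_order_status, "cancel_order": cancel_order}

def run_tool_call(tool_name: str, **kwargs) -> str:
    """Dispatch one agent tool call, as an evaluation harness would."""
    return TOOLS[tool_name](**kwargs)
```

In a full evaluation, a language agent would choose these calls turn by turn while conversing with a simulated user, and the harness would score the final database state and responses against the task goal.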

Paper: https://arxiv.org/abs/2406.12045

Progress Over Time

[Interactive timeline showing model performance evolution on TAU-bench Retail, with a state-of-the-art frontier line and series split into Open and Proprietary models.]
TAU-bench Retail Leaderboard

25 models • 0 verified
| Rank | Organization              | Params | Context | Input cost | Output cost |
|------|---------------------------|--------|---------|------------|-------------|
| 1    | —                         | —      | 200K    | $3.00      | $15.00      |
| 2    | —                         | —      | 200K    | $15.00     | $75.00      |
| 3    | Anthropic                 | —      | 200K    | $15.00     | $75.00      |
| 4    | —                         | —      | 200K    | $3.00      | $15.00      |
| 5    | —                         | —      | 200K    | $3.00      | $15.00      |
| 6    | Zhipu AI                  | 355B   | 131K    | $0.40      | $1.60       |
| 7    | Zhipu AI                  | 106B   | —       | —          | —           |
| 8    | Alibaba Cloud / Qwen Team | 480B   | —       | —          | —           |
| 9    | OpenAI                    | —      | 200K    | $1.10      | $4.40       |
| 10   | OpenAI                    | —      | 200K    | $15.00     | $60.00      |
| 11   | Alibaba Cloud / Qwen Team | 80B    | 66K     | $0.15      | $1.50       |
| 12   | —                         | —      | 200K    | $3.00      | $15.00      |
| 13   | OpenAI                    | —      | 128K    | $75.00     | $150.00     |
| 14   | OpenAI                    | —      | 1.0M    | $2.00      | $8.00       |
| 15   | Alibaba Cloud / Qwen Team | 235B   | 262K    | $0.30      | $3.00       |
| 15   | —                         | 117B   | 131K    | $0.09      | $0.45       |
| 15   | —                         | 456B   | —       | —          | —           |
| 18   | —                         | 456B   | 1.0M    | $0.55      | $2.20       |
| 19   | Alibaba Cloud / Qwen Team | 80B    | 66K     | $0.15     | $1.50       |
| 20   | OpenAI                    | —      | 128K    | $2.50      | $10.00      |
| 21   | OpenAI                    | —      | 200K    | $1.10      | $4.40       |
| 22   | —                         | —      | 1.0M    | $0.40      | $1.60       |
| 23   | —                         | 21B    | 131K    | $0.05      | $0.20       |
| 24   | —                         | —      | 200K    | $0.80      | $4.00       |
| 25   | —                         | —      | 1.0M    | $0.10      | $0.40       |

FAQ

Common questions about TAU-bench Retail

What is TAU-bench Retail?
A benchmark for evaluating tool-agent-user interaction in retail environments. It tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines, evaluating agents on tasks like order cancellations, address changes, and order status checks through multi-turn conversations.

Where can I read the TAU-bench Retail paper?
The TAU-bench Retail paper is available at https://arxiv.org/abs/2406.12045. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the TAU-bench Retail leaderboard?
The TAU-bench Retail leaderboard ranks 25 AI models based on their performance on this benchmark. Currently, Claude Sonnet 4.5 by Anthropic leads with a score of 0.862. The average score across all models is 0.678.

What is the highest TAU-bench Retail score?
The highest TAU-bench Retail score is 0.862, achieved by Claude Sonnet 4.5 from Anthropic.

How many models have been evaluated on TAU-bench Retail?
25 models have been evaluated on the TAU-bench Retail benchmark, with 0 verified results and 25 self-reported results.

What categories does TAU-bench Retail cover?
TAU-bench Retail is categorized under communication, reasoning, and tool calling. The benchmark evaluates text models.