
Tau-bench

τ-bench: A benchmark for tool-agent-user interaction in real-world domains. It tests language agents' ability to interact with users and follow domain-specific rules through dynamic conversations, using API tools and policy guidelines across retail and airline domains, and it evaluates the consistency and reliability of agent behavior over multiple trials.
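Conceptually, each τ-bench episode is a three-way loop: a simulated user states a goal, the agent replies or calls domain APIs that modify a database, and the episode is scored by whether the final database matches an annotated goal state while the agent respects the domain policy. The sketch below illustrates that loop in miniature; the task, tools, database, and the scripted agent/user are hypothetical stand-ins for the retail-domain data and LLM-driven roles used by the actual benchmark.

```python
"""Minimal sketch of a tau-bench-style episode (illustrative only)."""
from copy import deepcopy

# Toy retail database: one pending order the user will ask to cancel.
DB = {"orders": {"W100": {"status": "pending", "items": ["usb-c cable"]}}}

# Domain tools the agent may call (read and write APIs).
def get_order(db, order_id):
    return db["orders"].get(order_id, "not found")

def cancel_order(db, order_id):
    order = db["orders"].get(order_id)
    if order and order["status"] == "pending":  # policy: only pending orders may be cancelled
        order["status"] = "cancelled"
        return "cancelled"
    return "refused"

TOOLS = {"get_order": get_order, "cancel_order": cancel_order}

def run_episode(db, user_turns, agent_policy):
    """Alternate user and agent turns; each turn the agent may answer or call a tool."""
    history = []
    for user_msg in user_turns:
        history.append(("user", user_msg))
        action = agent_policy(history, db)  # in the real benchmark this is an LLM call
        if action["type"] == "tool":
            result = TOOLS[action["name"]](db, *action["args"])
            history.append(("tool", f"{action['name']} -> {result}"))
        else:
            history.append(("agent", action["text"]))
    return history

# Scripted stand-ins for the LLM agent and the LLM-simulated user.
user_turns = ["Hi, please cancel order W100.", "Thanks, that's all."]

def scripted_agent(history, db):
    wants_cancel = any("cancel" in msg for role, msg in history if role == "user")
    if wants_cancel and db["orders"]["W100"]["status"] == "pending":
        return {"type": "tool", "name": "cancel_order", "args": ["W100"]}
    return {"type": "text", "text": "Order W100 has been cancelled. Anything else?"}

# Annotated goal state: the order should end up cancelled.
goal_db = deepcopy(DB)
goal_db["orders"]["W100"]["status"] = "cancelled"

run_episode(DB, user_turns, scripted_agent)
print("task solved:", DB == goal_db)  # tau-bench-style check on the final database state
```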

Paper: https://arxiv.org/abs/2406.12045

Progress Over Time

Interactive timeline showing model performance evolution on Tau-bench


Tau-bench Leaderboard

6 models • 0 verified
Rank | Model | Organization | Params | Context | Cost, $ per 1M tokens (input / output)
1 | Step-3.5-Flash | StepFun | 196B | 66K | $0.10 / $0.40
2 | | Zhipu AI | 358B | 205K | $0.60 / $2.20
3 | | | 309B | 256K | $0.10 / $0.30
4 | | | 30B | 128K | $0.07 / $0.40
5 | | MiniMax | 230B | 1.0M | $0.30 / $1.20
6 | | OpenAI | | 200K | $2.00 / $8.00

FAQ

Common questions about Tau-bench

What is Tau-bench?
τ-bench is a benchmark for tool-agent-user interaction in real-world domains. It tests language agents' ability to interact with users and follow domain-specific rules through dynamic conversations, using API tools and policy guidelines across retail and airline domains, and it evaluates the consistency and reliability of agent behavior over multiple trials.

Where can I find the Tau-bench paper?
The Tau-bench paper is available at https://arxiv.org/abs/2406.12045. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Tau-bench leaderboard?
The Tau-bench leaderboard ranks 6 AI models by their performance on this benchmark. Currently, Step-3.5-Flash by StepFun leads with a score of 0.882; the average score across all models is 0.793.

What is the highest Tau-bench score?
The highest Tau-bench score is 0.882, achieved by Step-3.5-Flash from StepFun.

How many models have been evaluated on Tau-bench?
6 models have been evaluated on the Tau-bench benchmark, with 0 verified results and 6 self-reported results.

What categories does Tau-bench cover?
Tau-bench is categorized under agents, general, reasoning, and tool calling. The benchmark evaluates text models.
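The reliability evaluation over multiple trials mentioned above is reported in the τ-bench paper as a pass^k metric: the chance that an agent solves the same task in all of k independent trials, averaged over tasks. The sketch below shows how such an estimate can be computed from per-task trial outcomes using the combinatorial estimator C(c, k) / C(n, k); the task names and outcomes are made up for illustration, and the exact estimator used by any given leaderboard may differ.

```python
from math import comb
from statistics import mean

def pass_hat_k(trial_results: dict[str, list[bool]], k: int) -> float:
    """Estimate pass^k: the probability that an agent solves a task in *all* of
    k independent trials, averaged over tasks.

    trial_results maps each task id to its per-trial outcomes
    (True = task completed per the domain policy, False = failure).
    For a task with n trials and c successes, the estimate is C(c, k) / C(n, k).
    """
    per_task = []
    for outcomes in trial_results.values():
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError(f"need at least k={k} trials per task, got {n}")
        per_task.append(comb(c, k) / comb(n, k))
    return mean(per_task)

# Hypothetical outcomes for two retail-domain tasks, four trials each.
results = {
    "retail_task_01": [True, True, False, True],
    "retail_task_02": [True, True, True, True],
}
print(pass_hat_k(results, k=1))  # average single-trial success rate (0.875)
print(pass_hat_k(results, k=2))  # stricter: must succeed in both of 2 trials (0.75)
```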