Tau-bench
τ-bench: a benchmark for tool-agent-user interaction in real-world domains. It tests language agents' ability to interact with users and follow domain-specific rules through dynamic conversations that combine API tools with policy guidelines across retail and airline domains, and it evaluates the consistency and reliability of agent behavior over multiple trials.
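Reliability over multiple trials is scored in the τ-bench paper with the pass^k metric: the probability that k independent runs of the same task all succeed. Below is a minimal sketch of the estimator; the function name and example numbers are illustrative, not taken from the τ-bench codebase.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k
    independent trials of one task all succeed, given
    `successes` successes observed over `trials` runs."""
    if trials < k:
        raise ValueError("need at least k trials per task")
    return comb(successes, k) / comb(trials, k)

# A task solved in 6 of 8 trials looks strong at pass^1
# but much weaker once consistency is required:
print(pass_hat_k(6, 8, k=1))  # 0.75
print(pass_hat_k(6, 8, k=4))  # ~0.214
```

The benchmark-level number averages this quantity over all tasks, which is why pass^k drops sharply with k for agents that solve tasks inconsistently.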
Progress Over Time
[Interactive timeline: model performance evolution on Tau-bench, showing the state-of-the-art frontier across open and proprietary models.]
Tau-bench Leaderboard
6 models • 0 verified
| # | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | StepFun | 196B | 66K | $0.10 / $0.40 | — |
| 2 | — | Zhipu AI | 358B | 205K | $0.60 / $2.20 | — |
| 3 | — | Xiaomi | 309B | 256K | $0.10 / $0.30 | — |
| 4 | — | Zhipu AI | 30B | 128K | $0.07 / $0.40 | — |
| 5 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 6 | — | OpenAI | — | 200K | $2.00 / $8.00 | — |
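The paired Cost figures read as input/output prices; assuming the usual per-1M-token convention (the unit is not stated on this page), a run's cost is straightforward to estimate:

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float) -> float:
    # Prices assumed to be USD per 1M tokens (an assumption;
    # the leaderboard does not state the unit).
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# With row 1's prices ($0.10 in / $0.40 out), a conversation
# with 50K input and 5K output tokens costs under a cent:
print(run_cost(50_000, 5_000, 0.10, 0.40))  # 0.007
```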
FAQ
Common questions about Tau-bench
**What is Tau-bench?**
τ-bench evaluates how well language agents interact with users and follow domain-specific policies in dynamic, tool-using conversations across retail and airline domains, with an emphasis on the consistency and reliability of agent behavior over repeated trials.

**Where can I read the Tau-bench paper?**
The Tau-bench paper is available at https://arxiv.org/abs/2406.12045. It details the benchmark's methodology, dataset creation, and evaluation criteria.

**Which model leads the Tau-bench leaderboard?**
The leaderboard ranks 6 AI models on this benchmark. Step-3.5-Flash by StepFun currently leads with a score of 0.882; the average score across all models is 0.793.

**What is the highest Tau-bench score?**
The highest Tau-bench score is 0.882, achieved by Step-3.5-Flash from StepFun.

**How many models have been evaluated on Tau-bench?**
6 models have been evaluated on the Tau-bench benchmark, with 0 verified results and 6 self-reported results.

**What categories does Tau-bench cover?**
Tau-bench is categorized under agents, general, reasoning, and tool calling. The benchmark evaluates text models.