
TAU-bench Airline

Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.
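The Tool-Agent-User setup can be illustrated with a toy loop: an agent receives a user message, optionally calls an API tool, and the tool enforces a domain policy. This is a minimal sketch only — the tool name `change_flight`, the policy rule, and the `agent_step` logic are illustrative assumptions, not the benchmark's actual API.

```python
# Sketch of a TAU-bench-style Tool-Agent-User interaction.
# Tool names and the policy rule below are hypothetical examples,
# not the benchmark's real tool schema.

def change_flight(reservation: dict, new_date: str) -> str:
    """Hypothetical airline API tool with a domain-policy check."""
    if reservation["fare_class"] == "basic_economy":
        # Domain rule the agent must respect: basic economy is non-changeable.
        return "error: basic economy tickets cannot be changed"
    reservation["date"] = new_date
    return f"confirmed: flight moved to {new_date}"

TOOLS = {"change_flight": change_flight}

def agent_step(user_message: str, reservation: dict) -> str:
    """Toy agent: decide whether to call a tool or ask a follow-up."""
    if "change" in user_message and "2024-06-01" in user_message:
        return TOOLS["change_flight"](reservation, "2024-06-01")
    return "Could you share your reservation details?"

reservation = {"fare_class": "economy", "date": "2024-05-20"}
print(agent_step("Please change my flight to 2024-06-01", reservation))
```

The benchmark scores whether the agent both completes the task and obeys such policies across a multi-turn conversation.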

Paper: https://arxiv.org/abs/2406.12045

Progress Over Time

[Interactive timeline showing model performance evolution on TAU-bench Airline; series shown: state-of-the-art frontier, open models, and proprietary models]

TAU-bench Airline Leaderboard

23 models
| Rank | Model             | Organization              | Params | Context | Cost (input / output per 1M tokens) |
|------|-------------------|---------------------------|--------|---------|-------------------------------------|
| 1    | Claude Sonnet 4.5 | Anthropic                 | —      | 200K    | $3.00 / $15.00                      |
| 2    | —                 | —                         | 456B   | 1.0M    | $0.55 / $2.20                       |
| 3    | —                 | Zhipu AI                  | 106B   | —       | —                                   |
| 4    | —                 | Zhipu AI                  | 355B   | 131K    | $0.40 / $1.60                       |
| 5    | —                 | —                         | 456B   | —       | —                                   |
| 5    | —                 | —                         | —      | 200K    | $3.00 / $15.00                      |
| 5    | —                 | Alibaba Cloud / Qwen Team | 480B   | —       | —                                   |
| 8    | —                 | Anthropic                 | —      | 200K    | $15.00 / $75.00                     |
| 9    | —                 | —                         | —      | 200K    | $3.00 / $15.00                      |
| 10   | —                 | —                         | —      | 200K    | $15.00 / $75.00                     |
| 11   | —                 | OpenAI                    | —      | 200K    | $15.00 / $60.00                     |
| 11   | —                 | OpenAI                    | —      | 128K    | $75.00 / $150.00                    |
| 13   | —                 | OpenAI                    | —      | 1.0M    | $2.00 / $8.00                       |
| 14   | —                 | OpenAI                    | —      | 200K    | $1.10 / $4.40                       |
| 15   | —                 | Alibaba Cloud / Qwen Team | 80B    | 66K     | $0.15 / $1.50                       |
| 16   | —                 | Alibaba Cloud / Qwen Team | 235B   | 262K    | $0.30 / $3.00                       |
| 16   | —                 | —                         | —      | 200K    | $3.00 / $15.00                      |
| 18   | —                 | Alibaba Cloud / Qwen Team | 80B    | 66K     | $0.15 / $1.50                       |
| 19   | —                 | OpenAI                    | —      | 128K    | $2.50 / $10.00                      |
| 20   | —                 | —                         | —      | 1.0M    | $0.40 / $1.60                       |
| 21   | —                 | OpenAI                    | —      | 200K    | $1.10 / $4.40                       |
| 22   | —                 | —                         | —      | 200K    | $0.80 / $4.00                       |
| 23   | —                 | —                         | —      | 1.0M    | $0.10 / $0.40                       |
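The per-1M-token prices in the leaderboard can be turned into a cost estimate for a benchmark run. A minimal sketch, using the rank-1 pricing from the table ($3.00 input / $15.00 output per 1M tokens) and hypothetical token counts:

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float) -> float:
    """Cost in USD given token counts and per-1M-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical usage: 800K input tokens, 50K output tokens
# at $3.00 / $15.00 per 1M tokens.
cost = run_cost(800_000, 50_000, 3.00, 15.00)
print(f"${cost:.2f}")  # $3.15
```

The same function applies to any row: substitute that row's input/output prices and your measured token counts.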

FAQ

Common questions about TAU-bench Airline

Where can I find the TAU-bench Airline paper?
The TAU-bench Airline paper is available at https://arxiv.org/abs/2406.12045. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the TAU-bench Airline leaderboard?
The leaderboard ranks 23 AI models by their performance on this benchmark. Currently, Claude Sonnet 4.5 by Anthropic leads with a score of 0.700; the average score across all models is 0.495.

What is the highest TAU-bench Airline score?
The highest score is 0.700, achieved by Claude Sonnet 4.5 from Anthropic.

How many models have been evaluated?
23 models have been evaluated on the TAU-bench Airline benchmark, with 0 verified results and 23 self-reported results.

What does TAU-bench Airline measure?
TAU-bench Airline is categorized under tool calling, communication, and reasoning. The benchmark evaluates text models.