Tau2 Airline
Tau2 Airline is the TAU2 airline domain benchmark. It evaluates conversational agents in dual-control environments, where both the AI agent and the user interact with tools, using airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.
Tau2 Airline Leaderboard
20 models
| Rank | Organization | Parameters | Context | Cost |
|---|---|---|---|---|
| 1 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 2 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 3 | OpenAI | — | 400K | $1.25 / $10.00 |
| 3 | OpenAI | — | 400K | $1.25 / $10.00 |
| 3 | OpenAI | — | 400K | $1.25 / $10.00 |
| 6 | OpenAI | — | 200K | $2.00 / $8.00 |
| 7 | Anthropic | — | 200K | $1.00 / $5.00 |
| 8 | OpenAI | — | 400K | $1.25 / $10.00 |
| 9 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 10 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 |
| 10 | Meituan | 69B | 256K | $0.10 / $0.40 |
| 10 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 13 | Moonshot AI | 1.0T | — | — |
| 13 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 15 | | 120B | 262K | $0.10 / $0.50 |
| 16 | Inception | — | 128K | $0.25 / $0.75 |
| 17 | | 32B | 262K | $0.06 / $0.24 |
| 18 | OpenAI | — | 128K | $2.50 / $10.00 |
| 18 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 20 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 |
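
The rank column repeats values for ties: three entries share rank 3 and the next entry is rank 6, and two entries share rank 13 before rank 15. That pattern is consistent with standard competition ranking. The sketch below shows that scheme in Python, on the assumption that the leaderboard uses it. The function name `competition_ranks` and the scores are hypothetical and are not taken from the leaderboard.

```python
def competition_ranks(scores):
    """Return standard competition ranks for scores, highest score first.

    Tied scores share a rank, and the next distinct score skips ahead
    by the number of tied entries (1, 2, 3, 3, 3, 6, ...).
    """
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered, start=1):
        if ranks and score == ordered[position - 2]:
            ranks.append(ranks[-1])  # tie: reuse the previous entry's rank
        else:
            ranks.append(position)  # new score: rank is the 1-based position
    return ranks


# Hypothetical scores, chosen only to illustrate how ties are handled.
example_scores = [0.76, 0.70, 0.62, 0.62, 0.62, 0.60]
print(competition_ranks(example_scores))  # [1, 2, 3, 3, 3, 6]
```

Under this scheme, the rank after a tie always equals that entry's position in the sorted list, which is why the table has no ranks 4, 5, 11, 12, 14, or 19.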
FAQ
Common questions about Tau2 Airline
The Tau2 Airline paper is available at https://arxiv.org/abs/2506.07982. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The Tau2 Airline leaderboard ranks 20 AI models by their performance on this benchmark. LongCat-Flash-Thinking-2601 from Meituan currently leads with the highest score, 0.765. The average score across all models is 0.588.
20 models have been evaluated on the Tau2 Airline benchmark. All 20 results are self-reported; none are verified.
Tau2 Airline is categorized under communication, reasoning, and tool calling. The benchmark evaluates text models.