Tau2 Airline

Name: Tau2 Airline Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Tau2 Airline Leaderboard

23 models

Rank	Model (organization)	Parameters	Context	Cost
1	LongCat-Flash-Thinking-2601 Meituan	560B	—	—
2	Nova 2 Omni Amazon	—	—	—
3	LongCat-Flash-Thinking Meituan	560B	—	—
4	GPT-5.1 Thinking OpenAI	—	—	—
4	GPT-5.1 OpenAI	—	400K	$1.25 / $10.00
4	GPT-5.1 Instant OpenAI	—	400K	$1.25 / $10.00
7	Nova 2 Pro Amazon	—	—	—
8	Nova 2 Lite Amazon	—	1.0M	$0.30 / $2.50
8	o3 OpenAI	—	—	—
10	Claude Haiku 4.5 Anthropic	—	200K	$1.00 / $5.00
11	GPT-5 OpenAI	—	—	—
12	Qwen3-Next-80B-A3B-Thinking Alibaba Cloud / Qwen Team	80B	—	—
13	LongCat-Flash-Chat Meituan	560B	—	—
13	Qwen3-235B-A22B-Thinking-2507 Alibaba Cloud / Qwen Team	235B	—	—
13	LongCat-Flash-Lite Meituan	69B	256K	$0.10 / $0.40
16	Kimi K2 Instruct Moonshot AI	1.0T	—	—
16	Kimi K2-Instruct-0905 Moonshot AI	1.0T	—	—
18	Nemotron 3 Super (120B A12B) NVIDIA	120B	—	—
19	Mercury 2 Inception	—	128K	$0.25 / $0.75
20	Nemotron 3 Nano (30B A3B) NVIDIA	32B	262K	$0.06 / $0.24
21	Qwen3-Next-80B-A3B-Instruct Alibaba Cloud / Qwen Team	80B	—	—
21	GPT-4o OpenAI	—	128K	$2.50 / $10.00
23	Qwen3-235B-A22B-Instruct-2507 Alibaba Cloud / Qwen Team	235B	—	—
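The rank column above repeats a rank for tied models and then skips ahead (4, 4, 4, 7), which matches standard competition ranking. A minimal sketch of that scheme follows, assuming ties are decided on the displayed score. The function name `competition_ranks` and the sample scores are illustrative and are not taken from LLM Stats.

```python
from typing import Dict, List


def competition_ranks(scores: List[float]) -> List[int]:
    """Return the 1-based competition rank of each score (higher is better).

    Tied scores share a rank, and the next distinct score is ranked as if
    the tied entries had occupied consecutive positions.
    """
    ordered = sorted(scores, reverse=True)
    # Position of the first occurrence of each score in descending order.
    first_position: Dict[float, int] = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Illustrative scores only; these are not leaderboard values.
example_scores = [0.70, 0.65, 0.65, 0.65, 0.60, 0.55, 0.55, 0.50]
example_ranks = competition_ranks(example_scores)
print(example_ranks)
```

With three models tied in second place, the next model is ranked fifth, the same pattern as the 4, 4, 4, 7 sequence in the table.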

About this benchmark

What is Tau2 Airline?

Tau2 Airline is the airline domain of the TAU2 benchmark, which evaluates conversational agents in dual-control environments where both AI agents and users interact with tools. The scenarios are airline customer service tasks such as flight booking, modifications, cancellations, and refunds, and they test agent coordination, communication, and the ability to guide user actions.

Tau2 Airline is a text benchmark that evaluates models on reasoning, communication, and tool calling tasks. LLM Stats tracks 23 models on this benchmark, scored on a 0–1 scale. The current average is 0.598, with the leader at 0.765.

Current leaders

LongCat-Flash-Thinking-2601 from Meituan currently leads the Tau2 Airline leaderboard with a score of 0.765 across 23 evaluated AI models.

LongCat-Flash-Thinking-2601 (Meituan): 76.5%

Nova 2 Omni (Amazon): 68.8%

LongCat-Flash-Thinking (Meituan): 67.5%
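The leaders listed above are reported both as a fraction on the 0–1 scale (0.765) and as a percentage (76.5%). A minimal sketch of the conversion, and of the gap between neighbouring leaders in percentage points, is shown below. The helper names `to_percent` and `gap_in_points` are illustrative, and the 0–1 values for the second and third models are simply the listed percentages divided by 100.

```python
def to_percent(score: float) -> str:
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


def gap_in_points(higher: float, lower: float) -> float:
    """Difference between two 0-1 scores, in percentage points."""
    return round((higher - lower) * 100, 1)


# Top three entries as listed on this page.
leaders = {
    "LongCat-Flash-Thinking-2601": 0.765,
    "Nova 2 Omni": 0.688,
    "LongCat-Flash-Thinking": 0.675,
}

for name, score in leaders.items():
    print(f"{name}: {score} = {to_percent(score)}")

names = list(leaders)
first_gap = gap_in_points(leaders[names[0]], leaders[names[1]])
second_gap = gap_in_points(leaders[names[1]], leaders[names[2]])
print(f"Gap between 1st and 2nd: {first_gap} points")
print(f"Gap between 2nd and 3rd: {second_gap} points")
```

The gaps are rounded to one decimal place because binary floating point cannot represent most of these fractions exactly.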

Source paper

Title: $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and 1 other
Published: June 9, 2025
arXiv: 2506.07982

Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $τ^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $τ^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
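The dual-control idea in the abstract, where agent and user each hold their own tools but act on one shared state, can be illustrated with a toy sketch. This is not the benchmark's code or API: the `SharedWorld` class, the tool names, and the roaming and airplane-mode scenario are all hypothetical, chosen only to echo the technical support example the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedWorld:
    """Toy shared state that both parties can modify (hypothetical fields)."""
    roaming_enabled: bool = False  # changed only through the agent's tool
    airplane_mode: bool = True     # changed only through the user's tool

    def data_works(self) -> bool:
        # The goal state needs one action from each side.
        return self.roaming_enabled and not self.airplane_mode


def agent_enable_roaming(world: SharedWorld) -> None:
    world.roaming_enabled = True


def user_toggle_airplane_mode(world: SharedWorld) -> None:
    world.airplane_mode = not world.airplane_mode


# Each actor has its own tool set; neither can call the other's tools.
TOOLS: Dict[str, Dict[str, Callable[[SharedWorld], None]]] = {
    "agent": {"enable_roaming": agent_enable_roaming},
    "user": {"toggle_airplane_mode": user_toggle_airplane_mode},
}


def run_episode(script: List[Tuple[str, str]]) -> SharedWorld:
    """Apply a sequence of (actor, tool_name) calls to a fresh world."""
    world = SharedWorld()
    for actor, tool_name in script:
        if tool_name not in TOOLS[actor]:
            raise PermissionError(f"{actor} has no tool named {tool_name}")
        TOOLS[actor][tool_name](world)
    return world


# The agent acting alone cannot reach the goal state.
agent_only = run_episode([("agent", "enable_roaming")])
print(agent_only.data_works())

# The agent must also guide the user to act on the shared state.
both = run_episode([("agent", "enable_roaming"), ("user", "toggle_airplane_mode")])
print(both.data_works())
```

In this toy setup the agent's own tool call is not enough, so success depends on the user also taking an action, which is the kind of coordination the abstract describes.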

FAQ

Common questions about the Tau2 Airline benchmark and leaderboard.

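What is the Tau2 Airline benchmark?

Tau2 Airline is the airline domain of the TAU2 benchmark. It evaluates conversational agents on airline customer service tasks such as flight booking, modifications, cancellations, and refunds.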

What is the Tau2 Airline leaderboard?

The Tau2 Airline leaderboard ranks 23 AI models based on their performance on this benchmark. Currently, LongCat-Flash-Thinking-2601 by Meituan leads with a score of 0.765. The average score across all models is 0.598.

What is the highest Tau2 Airline score?

The highest Tau2 Airline score is 0.765, achieved by LongCat-Flash-Thinking-2601 from Meituan.

How many models are evaluated on Tau2 Airline?

23 models have been evaluated on the Tau2 Airline benchmark, with 0 verified results and 23 self-reported results.

Where can I find the Tau2 Airline paper?

The Tau2 Airline paper is available at https://arxiv.org/abs/2506.07982. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Tau2 Airline cover?

Tau2 Airline is categorized under reasoning, communication, and tool calling. The benchmark evaluates text models.

What is the best open-source model on Tau2 Airline?

LongCat-Flash-Thinking-2601 by Meituan is the top-ranked open-source model on Tau2 Airline, with a score of 0.765 (rank #1).

How recent are the Tau2 Airline leaderboard results?

The Tau2 Airline leaderboard was last updated in July 2026 and currently includes 23 evaluated models.