Multi-Challenge

Progress Over Time

[Interactive timeline: model performance on Multi-Challenge over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]

Multi-Challenge Leaderboard

29 models
Rank  Organization               Parameters  Context  Cost            License
1     -                          -           -        -               -
2     -                          -           1.0M     $0.30 / $2.50   -
3     -                          -           -        -               -
4     OpenAI                     -           -        -               -
5     Alibaba Cloud / Qwen Team  397B        -        -               -
6     -                          550B        -        -               -
7     -                          10B         -        -               -
8     Alibaba Cloud / Qwen Team  122B        -        -               -
9     Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40   -
10    OpenAI                     -           -        -               -
11    Alibaba Cloud / Qwen Team  35B         -        -               -
12    -                          120B        -        -               -
13    Alibaba Cloud / Qwen Team  9B          -        -               -
14    Moonshot AI                1.0T        -        -               -
14    -                          1.0T        -        -               -
16    -                          1.0T        -        -               -
17    Alibaba Cloud / Qwen Team  4B          -        -               -
18    -                          456B        -        -               -
18    -                          456B        -        -               -
20    OpenAI                     -           -        -               -
21    OpenAI                     -           -        -               -
22    OpenAI                     -           128K     $2.50 / $10.00  -
23    OpenAI                     -           -        -               -
24    -                          32B         262K     $0.06 / $0.24   -
25    OpenAI                     -           1.0M     $2.00 / $8.00   -
26    -                          -           1.0M     $0.40 / $1.60   -
27    Alibaba Cloud / Qwen Team  2B          -        -               -
28    Alibaba Cloud / Qwen Team  800M        -        -               -
29    -                          -           1.0M     $0.10 / $0.40   -
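
The rank column repeats 14 and 18 and skips 15 and 19, which is consistent with standard competition ranking, where tied scores share a rank and the next rank skips ahead. The following is a minimal sketch of that scheme, not the leaderboard's confirmed method. The function name `competition_ranks` and the example scores are hypothetical and chosen only for illustration.

```python
def competition_ranks(scores):
    """Rank scores from high to low; ties share a rank and the next rank skips ahead."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered, start=1):
        # Compare with the previous score in the sorted list (index position - 2).
        if ranks and score == ordered[position - 2]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(position)    # no tie: rank equals 1-based position
    return list(zip(ranks, ordered))


# Hypothetical scores, not taken from the leaderboard.
example = competition_ranks([0.48, 0.50, 0.45, 0.48])
print(example)
```

With these hypothetical inputs, the two tied scores share rank 2 and the next score takes rank 4, the same pattern as the 14, 14, 16 sequence in the rank column.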
About this benchmark

What is Multi-Challenge?

MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.

Multi-Challenge is a text benchmark that evaluates models on reasoning and communication tasks. LLM Stats tracks 29 models on this benchmark, scored on a 0–1 scale. The current average is 0.515, and the leader scores 0.777.

Compare leaders on the best AI for reasoning and best AI for communication leaderboards.

Current leaders

Nova 2 Pro from Amazon currently leads the Multi-Challenge leaderboard with a score of 0.777, the highest among the 29 evaluated AI models.

1. Nova 2 Pro (Amazon): 77.7%
2. Nova 2 Lite (Amazon): 76.6%
3. Nova 2 Omni (Amazon): 75.5%
OSS: Qwen3.5-397B-A17B (#5, open-weight): 67.6%

Source paper

Title
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Authors
Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, and 6 others
Published
January 2025 (arXiv:2501.17399)
Abstract

We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.

FAQ

Common questions about the Multi-Challenge benchmark and leaderboard.

What is the Multi-Challenge benchmark?

MultiChallenge is a multi-turn conversation benchmark for frontier LLMs. It tests four categories: instruction retention, inference memory, reliable versioned editing, and self-coherence. The conversations are sustained and contextually complex, covering topics such as travel planning, technical documentation, and professional communication.

What is the Multi-Challenge leaderboard?

The Multi-Challenge leaderboard ranks 29 AI models based on their performance on this benchmark. Currently, Nova 2 Pro by Amazon leads with a score of 0.777. The average score across all models is 0.515.

What is the highest Multi-Challenge score?

The highest Multi-Challenge score is 0.777, achieved by Nova 2 Pro from Amazon.

How many models are evaluated on Multi-Challenge?

29 models have been evaluated on the Multi-Challenge benchmark, with 0 verified results and 29 self-reported results.

Where can I find the Multi-Challenge paper?

The Multi-Challenge paper is available at https://arxiv.org/abs/2501.17399. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Multi-Challenge cover?

Multi-Challenge is categorized under reasoning and communication. The benchmark evaluates text models.

What is the best open-source model on Multi-Challenge?

Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on Multi-Challenge, with a score of 0.676 (rank #5).

Which model offers the best value on Multi-Challenge?

Among models scoring within 10% of the leader, Nova 2 Lite from Amazon is the cheapest, at $0.30 per million input tokens with a score of 0.766.
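
The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual implementation: "within 10% of the leader" is read as a relative cutoff (score at least 90% of the leader's score), models without a listed input price are skipped, and the names `Entry` and `best_value` are hypothetical. The scores come from the leaders list; only Nova 2 Lite's price is stated in the text, so the other prices are left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    name: str
    score: float                  # accuracy on the 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if not listed


def best_value(entries, tolerance=0.10):
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    leader_score = max(e.score for e in entries)
    # Assumption: "within 10%" is relative to the leader's score.
    cutoff = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= cutoff and e.input_price is not None
    ]
    return min(candidates, key=lambda e: e.input_price) if candidates else None


entries = [
    Entry("Nova 2 Pro", 0.777, None),         # price not stated in the text
    Entry("Nova 2 Lite", 0.766, 0.30),
    Entry("Nova 2 Omni", 0.755, None),        # price not stated in the text
    Entry("Qwen3.5-397B-A17B", 0.676, None),  # below the cutoff of about 0.699
]
winner = best_value(entries)
print(winner.name, winner.input_price)
```

An absolute reading of the threshold (a score no more than 0.10 below the leader's) would give a cutoff of 0.677 instead of roughly 0.699. With the scores listed here, both readings select the same model.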

How is Multi-Challenge scored?

Multi-Challenge is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
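
As a rough illustration of how an accuracy score on the 0–1 scale relates to the percentages shown in the leaders list, the sketch below assumes each evaluated conversation receives a single pass/fail verdict from the judge. That per-conversation verdict format, the function names `accuracy` and `as_percent`, and the example counts are assumptions for illustration, not details confirmed by this page.

```python
def accuracy(judgments):
    """Fraction of conversations judged as passing, on a 0-1 scale."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(1 for passed in judgments if passed) / len(judgments)


def as_percent(score, digits=1):
    """Format a 0-1 score as a percentage string."""
    return f"{score * 100:.{digits}f}%"


# Hypothetical verdicts: 777 passes out of 1000 conversations.
judgments = [True] * 777 + [False] * 223
score = accuracy(judgments)
print(score, as_percent(score))
```

Under these hypothetical counts the score is 0.777, which corresponds to the 77.7% shown for the current leader.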

How recent are the Multi-Challenge leaderboard results?

The Multi-Challenge leaderboard was last updated in July 2026 and currently includes 29 evaluated models.