
Multi-Challenge

MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.
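How a harness might drive such an evaluation can be sketched in a few lines: replay each recorded conversation turn by turn against the model under test, then have a judge score only the final response against the rubric for its category. The dataset layout and the `generate`/`judge` callables below are assumptions for illustration, not the benchmark's actual harness; see the paper for the real protocol.

```python
# Hypothetical sketch of a multi-turn evaluation loop in the spirit of
# MultiChallenge. Dataset layout, client API, and judge are assumptions.

AXES = ["instruction_retention", "inference_memory",
        "reliable_versioned_editing", "self_coherence"]

def evaluate(conversations, generate, judge):
    """Replay each conversation turn by turn, then score the final reply.

    conversations: list of {"axis": str, "turns": [user messages], "rubric": str}
    generate: fn(messages) -> assistant reply (the model under test)
    judge: fn(messages, reply, rubric) -> bool (pass/fail per the rubric)
    """
    per_axis = {axis: [] for axis in AXES}
    for conv in conversations:
        messages = []
        for user_turn in conv["turns"]:
            messages.append({"role": "user", "content": user_turn})
            reply = generate(messages)
            messages.append({"role": "assistant", "content": reply})
        # Only the final response is judged against the category rubric.
        passed = judge(messages[:-1], messages[-1]["content"], conv["rubric"])
        per_axis[conv["axis"]].append(passed)
    # Per-axis pass rate; an overall score would average across axes.
    return {axis: sum(r) / len(r) for axis, r in per_axis.items() if r}
```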

Paper: https://arxiv.org/abs/2501.17399

Progress Over Time

[Interactive timeline of model performance on Multi-Challenge over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]

Multi-Challenge Leaderboard

18 models • 0 verified
Costs are USD per million input and output tokens; tied models share a rank; fields absent from the source data are marked —.

Rank  Organization               Params  Context  Input $/1M  Output $/1M
1     Alibaba Cloud / Qwen Team  397B    262K     $0.60       $3.60
2     —                          10B     —        —           —
3     Alibaba Cloud / Qwen Team  122B    262K     $0.40       $3.20
4     Alibaba Cloud / Qwen Team  27B     —        —           —
5     Alibaba Cloud / Qwen Team  35B     262K     $0.25       $2.00
6     Alibaba Cloud / Qwen Team  9B      —        —           —
7     —                          1.0T    —        —           —
7     Moonshot AI                1.0T    200K     $0.50       $0.50
9     Alibaba Cloud / Qwen Team  4B      —        —           —
10    —                          456B    —        —           —
10    —                          456B    1.0M     $0.55       $2.20
12    OpenAI                     —       128K     $75.00      $150.00
13    OpenAI                     —       200K     $1.10       $4.40
14    OpenAI                     —       1.0M     $2.00       $8.00
15    —                          —       1.0M     $0.40       $1.60
16    Alibaba Cloud / Qwen Team  2B      —        —           —
17    Alibaba Cloud / Qwen Team  800M    —        —           —
18    —                          —       1.0M     $0.10       $0.40
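Since the two Cost figures in each row are USD per million input and output tokens, a rough price for an evaluation run follows directly. The token counts in this sketch are illustrative placeholders, not measured Multi-Challenge usage.

```python
# Rough run-cost estimate from the leaderboard's per-million-token prices.
def run_cost(input_price, output_price, input_tokens, output_tokens):
    """Prices are USD per 1M tokens, as listed in the Cost columns."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# Example: rank 1 pricing ($0.60 in / $3.60 out), assuming a hypothetical
# 5M input and 1M output tokens for a full benchmark run:
print(f"${run_cost(0.60, 3.60, 5_000_000, 1_000_000):.2f}")  # -> $6.60
```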

FAQ

Common questions about Multi-Challenge

Q: What is Multi-Challenge?
A: MultiChallenge is a realistic multi-turn conversation evaluation benchmark that tests frontier LLMs across four categories: instruction retention, inference memory, reliable versioned editing, and self-coherence (see the overview above for details).

Q: Where can I read the Multi-Challenge paper?
A: The Multi-Challenge paper is available at https://arxiv.org/abs/2501.17399. It details the benchmark methodology, dataset creation, and evaluation criteria.

Q: Which model leads the Multi-Challenge leaderboard?
A: The leaderboard ranks 18 AI models by their performance on this benchmark. Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team currently leads with a score of 0.676; the average score across all models is 0.466.

Q: What is the highest Multi-Challenge score?
A: The highest Multi-Challenge score is 0.676, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

Q: How many models have been evaluated?
A: 18 models have been evaluated on the Multi-Challenge benchmark, with 0 verified results and 18 self-reported results.

Q: What categories does Multi-Challenge belong to?
A: Multi-Challenge is categorized under communication and reasoning, and it evaluates text models.
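The aggregate figures quoted above (top score 0.676, mean 0.466 over 18 models) are simple reductions over the per-model scores. A minimal sketch follows, with a truncated stand-in for the score list, since this page does not reproduce the full set.

```python
# Reductions behind the FAQ's aggregate figures. The full per-model score
# list is not shown on this page, so `scores` is a truncated stand-in.
scores = {
    "Qwen3.5-397B-A17B": 0.676,  # rank 1, per the FAQ
    # ... 17 more models omitted here
}

top_model, top_score = max(scores.items(), key=lambda kv: kv[1])
mean_score = sum(scores.values()) / len(scores)
print(f"top: {top_model} ({top_score}), mean: {mean_score:.3f}")
```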