Multi-Challenge
MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.
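The four categories above can be pictured with a toy evaluation harness. Everything in this sketch (the `Turn`/`Case` classes, the substring judge) is illustrative, not the benchmark's actual implementation — a simple substring check stands in for MultiChallenge's model-based judging, but the overall shape (a conversation history, a per-case pass/fail, and a final fraction-passed score) matches how such benchmarks report a single number per model:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class Case:
    category: str      # one of the four MultiChallenge axes, e.g. "instruction_retention"
    history: list      # prior Turn objects, ending with the final user turn
    must_contain: str  # toy pass criterion (the real benchmark uses model-based judging)

def passes(case: Case, final_reply: str) -> bool:
    """Toy substring judge standing in for MultiChallenge's actual judge."""
    return case.must_contain.lower() in final_reply.lower()

def score(cases: list, replies: list) -> float:
    """Benchmark score = fraction of cases passed, one number per model."""
    return sum(passes(c, r) for c, r in zip(cases, replies)) / len(cases)

# Hypothetical instruction-retention case: an instruction from turn 1 must
# still be honored several turns later.
case = Case(
    category="instruction_retention",
    history=[Turn("user", "Answer only in French from now on."),
             Turn("assistant", "D'accord."),
             Turn("user", "What is the capital of Japan?")],
    must_contain="Tokyo",
)
score([case], ["La capitale du Japon est Tokyo."])  # 1.0 — reply passed
```

A leaderboard entry like the 0.676 below would be this fraction computed over the full test set rather than a single toy case.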
Progress Over Time
(Interactive timeline of model performance on Multi-Challenge over time; series are grouped into the state-of-the-art frontier, open models, and proprietary models.)
Multi-Challenge Leaderboard
18 models • 0 verified
| # | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 2 | StepFun | 10B | — | — |
| 3 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 4 | Alibaba Cloud / Qwen Team | 27B | — | — |
| 5 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 6 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 7 | Moonshot AI | 1.0T | — | — |
| 7 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 9 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 10 | MiniMax | 456B | — | — |
| 10 | MiniMax | 456B | 1.0M | $0.55 / $2.20 |
| 12 | OpenAI | — | 128K | $75.00 / $150.00 |
| 13 | OpenAI | — | 200K | $1.10 / $4.40 |
| 14 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 15 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 16 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 17 | Alibaba Cloud / Qwen Team | 800M | — | — |
| 18 | OpenAI | — | 1.0M | $0.10 / $0.40 |
FAQ
Common questions about Multi-Challenge
MultiChallenge is a realistic multi-turn conversation benchmark that tests frontier LLMs on four categories — instruction retention, inference memory, reliable versioned editing, and self-coherence — across sustained, contextually complex dialogues on topics such as travel planning, technical documentation, and professional communication.
The Multi-Challenge paper is available at https://arxiv.org/abs/2501.17399. It details the benchmark's methodology, dataset creation, and evaluation criteria.
The Multi-Challenge leaderboard ranks 18 AI models by their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.676; the average score across all 18 models is 0.466.
The highest Multi-Challenge score is 0.676, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.
18 models have been evaluated on the Multi-Challenge benchmark, with 0 verified results and 18 self-reported results.
Multi-Challenge is categorized under communication and reasoning. The benchmark evaluates text models.