Multi-Challenge
Multi-Challenge Leaderboard
| Rank | Organization | Parameters | Context | Cost |
|---|---|---|---|---|
| 1 | Amazon | — | — | — |
| 2 | Amazon | — | 1.0M | $0.30 / $2.50 |
| 3 | Amazon | — | — | — |
| 4 | OpenAI | — | — | — |
| 5 | Alibaba Cloud / Qwen Team | 397B | — | — |
| 6 | | 550B | — | — |
| 7 | StepFun | 10B | — | — |
| 8 | Alibaba Cloud / Qwen Team | 122B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 10 | OpenAI | — | — | — |
| 11 | Alibaba Cloud / Qwen Team | 35B | — | — |
| 12 | | 120B | — | — |
| 13 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 14 | Moonshot AI | 1.0T | — | — |
| 14 | Moonshot AI | 1.0T | — | — |
| 16 | Microsoft | 1.0T | — | — |
| 17 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 18 | MiniMax | 456B | — | — |
| 18 | MiniMax | 456B | — | — |
| 20 | OpenAI | — | — | — |
| 21 | OpenAI | — | — | — |
| 22 | OpenAI | — | 128K | $2.50 / $10.00 |
| 23 | OpenAI | — | — | — |
| 24 | | 32B | 262K | $0.06 / $0.24 |
| 25 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 26 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 27 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 28 | Alibaba Cloud / Qwen Team | 800M | — | — |
| 29 | OpenAI | — | 1.0M | $0.10 / $0.40 |
What is Multi-Challenge?
MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.
LLM Stats classifies Multi-Challenge as a text benchmark covering reasoning and communication tasks and tracks 29 models on it, scored on a 0–1 scale. The current average is about 0.5, and the leader scores 0.777 (0.8 when rounded to one decimal).
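To relate the 0–1 leaderboard scale to the percentage figures quoted in the source paper, the short Python sketch below converts the leader's score to a percentage and compares it with the 41.4% reported in the abstract. It assumes the leaderboard score and the paper's average accuracy measure the same quantity, which this page does not confirm; the helper name `to_percent` is illustrative only.

```python
def to_percent(score: float) -> float:
    """Convert a 0-1 leaderboard score to a percentage with one decimal."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie on the 0-1 scale")
    return round(score * 100, 1)


# Values quoted on this page.
leader_score = 0.777    # Nova 2 Pro, current leader
paper_best_pct = 41.4   # Claude 3.5 Sonnet (June 2024), from the abstract

leader_pct = to_percent(leader_score)
headline = round(leader_score, 1)  # the one-decimal figure used in summaries

# Assumption: both numbers are average accuracy on the same benchmark.
gap = round(leader_pct - paper_best_pct, 1)

print(f"leader: {leader_pct}% (headline {headline}), gap vs. paper: {gap} points")
```

Rounding to one decimal hides most of the spread between models, so the three-decimal score is the better basis for comparison.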
Current leaders
Nova 2 Pro from Amazon currently leads the Multi-Challenge leaderboard with a score of 0.777 across 29 evaluated AI models.
Source paper
- Title: MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
- Authors: Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, and 6 others
- Published: arXiv 2501.17399
Abstract
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.
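The abstract's "LLM as judge with instance-level rubrics" can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the `Example` fields, the judge's signature, the toy keyword judge standing in for a real LLM call, and the choice to report an unweighted mean of per-category pass rates. None of it is taken from the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Example:
    category: str         # e.g. "instruction_retention" (hypothetical label)
    response: str         # the model's final-turn response
    rubric_question: str  # instance-level yes/no question for the judge
    passing_answer: str   # "yes" or "no": the verdict that counts as a pass


# A judge maps (response, rubric question) to "yes" or "no".
Judge = Callable[[str, str], str]


def score(examples: List[Example], judge: Judge) -> Tuple[Dict[str, float], float]:
    """Return per-category pass rates and their unweighted mean."""
    if not examples:
        raise ValueError("need at least one example")
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        verdict = judge(ex.response, ex.rubric_question).strip().lower()
        total[ex.category] += 1
        passed[ex.category] += int(verdict == ex.passing_answer)
    per_category = {c: passed[c] / total[c] for c in total}
    average = sum(per_category.values()) / len(per_category)
    return per_category, average


def keyword_judge(response: str, question: str) -> str:
    """Toy stand-in for an LLM judge: answers 'yes' if peanuts are mentioned."""
    return "yes" if "peanut" in response.lower() else "no"


question = "Does the response suggest a dish containing peanuts?"
examples = [
    Example("instruction_retention", "Try a peanut satay.", question, "no"),
    Example("instruction_retention", "Try a lentil curry.", question, "no"),
    Example("self_coherence", "Grilled tofu works well.", question, "no"),
]

per_category, average = score(examples, keyword_judge)
print(per_category, average)
```

Because each rubric is a single yes/no question tied to one conversation, the judge only has to check one narrow fact per response rather than grade overall quality.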