Scale MultiChallenge
MultiChallenge is a realistic multi-turn conversation benchmark from Scale AI that evaluates large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, context allocation, and in-context reasoning. Despite achieving near-perfect scores on existing multi-turn benchmarks, every frontier model scores below 50% accuracy on MultiChallenge.
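As a rough illustration of what one of these categories tests, the sketch below shows the shape of an instruction-retention check: an early user turn imposes a constraint, and a later answer is judged on whether the constraint still holds. This is not Scale's evaluation harness; the conversation and the toy rule-based judge are invented for illustration (MultiChallenge itself uses more sophisticated judging).

```python
# Hedged sketch (not Scale's harness): an instruction-retention-style check.
# An early user turn imposes a constraint; the model's answer to a later,
# unrelated question is judged on whether the constraint was retained.
conversation = [
    {"role": "user", "content": "From now on, answer in exactly one sentence."},
    {"role": "assistant", "content": "Understood."},
    {"role": "user", "content": "Explain what a benchmark is."},
]

def retains_instruction(answer: str) -> bool:
    """Toy judge: pass only if the reply is a single non-empty sentence."""
    text = answer.strip()
    return len(text) > 0 and text.count(".") <= 1

# A one-sentence reply passes; a two-sentence reply fails the constraint.
print(retains_instruction("A benchmark is a standard test used to compare systems."))  # True
print(retains_instruction("A benchmark is a test. It compares systems."))  # False
```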
Progress Over Time
[Interactive timeline of model performance on Scale MultiChallenge over time, with a state-of-the-art frontier line and separate markers for open and proprietary models.]
Scale MultiChallenge Leaderboard
6 models
| Rank | Organization | Params | Context | Cost (input / output per 1M tokens) |
|---|---|---|---|---|
| 1 | OpenAI | — | 400K | $1.25 / $10.00 |
| 2 | OpenAI | — | 200K | $2.00 / $8.00 |
| 3 | — | 120B | 262K | $0.10 / $0.50 |
| 4 | OpenAI | — | 200K | $1.10 / $4.40 |
| 5 | OpenAI | — | 128K | $2.50 / $10.00 |
| 6 | — | 32B | 262K | $0.06 / $0.24 |
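The Cost column appears to list input and output rates in USD per million tokens (an assumption, since the scraped table lost its units). Under that assumption, the cost of a single request can be estimated as follows; the token counts in the example are made up:

```python
# Hedged sketch: applying per-1M-token pricing (assumed input / output rates,
# as in the leaderboard's Cost column) to estimate one request's cost.
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Prices are USD per 1M tokens; returns the request cost in USD."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# e.g. the rank-1 row's rates ($1.25 in / $10.00 out) on a hypothetical
# request with 50k input tokens and 2k output tokens:
print(round(request_cost(50_000, 2_000, 1.25, 10.00), 4))  # 0.0825
```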
FAQ
Common questions about Scale MultiChallenge
**What is Scale MultiChallenge?**
MultiChallenge is a realistic multi-turn conversation benchmark from Scale AI that evaluates large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, context allocation, and in-context reasoning. Despite achieving near-perfect scores on existing multi-turn benchmarks, every frontier model scores below 50% accuracy on MultiChallenge.

**Where can I read the MultiChallenge paper?**
The Scale MultiChallenge paper is available at https://arxiv.org/abs/2501.17399. It details the benchmark methodology, dataset creation, and evaluation criteria.

**How are models ranked on the leaderboard?**
The Scale MultiChallenge leaderboard ranks 6 AI models by their performance on the benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.696. The average score across all models is 0.512.

**What is the highest Scale MultiChallenge score?**
The highest score is 0.696, achieved by GPT-5 from OpenAI.

**How many models have been evaluated?**
6 models have been evaluated on the Scale MultiChallenge benchmark; all 6 results are self-reported, with 0 verified.

**What categories does the benchmark cover?**
Scale MultiChallenge is categorized under communication, general, and reasoning, and evaluates text models.