
Scale MultiChallenge

MultiChallenge is a realistic multi-turn conversation benchmark developed by Scale AI that tests large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, appropriate context allocation, and in-context reasoning. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models score below 50% accuracy on MultiChallenge.
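The evaluation pattern described above (replay a multi-turn conversation, then check the final response against per-instance criteria) can be sketched as follows. This is a hypothetical illustration, not the official harness: the `Conversation` layout, the `model` and `judge` callables, and the binary pass/fail scoring are assumptions; see the paper (https://arxiv.org/abs/2501.17399) for the actual methodology.

```python
# Hypothetical sketch of a multi-turn benchmark evaluation loop.
# Data layout and function signatures are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Conversation:
    turns: list[str]   # the user's messages, in order
    category: str      # e.g. "instruction_retention" (one of the four categories)
    rubric: str        # criterion the final response must satisfy

def evaluate(model: Callable[[list[dict]], str],
             judge: Callable[[str, str], bool],
             dataset: list[Conversation]) -> float:
    """Replay each conversation turn by turn, then judge only the final reply."""
    passed = 0
    for conv in dataset:
        history: list[dict] = []
        reply = ""
        for user_msg in conv.turns:
            history.append({"role": "user", "content": user_msg})
            reply = model(history)  # model sees the full history so far
            history.append({"role": "assistant", "content": reply})
        if judge(reply, conv.rubric):  # binary pass/fail per instance
            passed += 1
    return passed / len(dataset)  # overall accuracy
```

Judging only the final turn is what makes categories like instruction retention hard: an early instruction must survive many intervening turns of context before it is tested.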

Paper: https://arxiv.org/abs/2501.17399

Progress Over Time

[Interactive timeline showing model performance evolution on Scale MultiChallenge; series: state-of-the-art frontier, open, and proprietary models]

Scale MultiChallenge Leaderboard

6 models evaluated. Model names and scores beyond the top entry were not recoverable from the page; missing values are marked with a dash.

| Rank | Model | Organization | Context | Cost (input / output) | License |
|------|-------|--------------|---------|-----------------------|---------|
| 1 | GPT-5 | OpenAI | 400K | $1.25 / $10.00 | — |
| 2 | — | OpenAI | 200K | $2.00 / $8.00 | — |
| 3 | — (120B) | — | 262K | $0.10 / $0.50 | — |
| 4 | — | OpenAI | 200K | $1.10 / $4.40 | — |
| 5 | — | OpenAI | 128K | $2.50 / $10.00 | — |
| 6 | — (32B) | — | 262K | $0.06 / $0.24 | — |

FAQ

Common questions about Scale MultiChallenge

What is Scale MultiChallenge?
MultiChallenge is a realistic multi-turn conversation benchmark developed by Scale AI that tests large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Despite near-perfect scores on existing multi-turn benchmarks, all frontier models score below 50% accuracy on it.

Where can I find the Scale MultiChallenge paper?
The paper is available at https://arxiv.org/abs/2501.17399. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model performs best on Scale MultiChallenge?
GPT-5 by OpenAI currently leads the 6-model leaderboard with a score of 0.696. The average score across all models is 0.512.

How many models have been evaluated?
Six models have been evaluated on the Scale MultiChallenge benchmark, with 0 verified results and 6 self-reported results.

What categories does the benchmark cover?
Scale MultiChallenge is categorized under communication, general, and reasoning. The benchmark evaluates text models.