Scale MultiChallenge
MultiChallenge is a realistic multi-turn conversation benchmark from Scale AI that evaluates large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, context allocation, and in-context reasoning. Despite achieving near-perfect scores on existing multi-turn benchmarks, every frontier model scores below 50% accuracy on MultiChallenge.
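As a rough illustration of what one of these categories tests, the sketch below shows the shape of an instruction-retention check: an early user turn imposes a constraint, and a later answer is judged on whether the constraint still holds. This is not Scale's evaluation harness; the conversation and the toy rule-based judge are invented for illustration (MultiChallenge itself uses more sophisticated judging).

```python
# Hedged sketch (not Scale's harness): an instruction-retention-style check.
# An early user turn imposes a constraint; the model's answer to a later,
# unrelated question is judged on whether the constraint was retained.
conversation = [
    {"role": "user", "content": "From now on, answer in exactly one sentence."},
    {"role": "assistant", "content": "Understood."},
    {"role": "user", "content": "Explain what a benchmark is."},
]

def retains_instruction(answer: str) -> bool:
    """Toy judge: pass only if the reply is a single non-empty sentence."""
    text = answer.strip()
    return len(text) > 0 and text.count(".") <= 1

# A one-sentence reply passes; a two-sentence reply fails the constraint.
print(retains_instruction("A benchmark is a standard test used to compare systems."))  # True
print(retains_instruction("A benchmark is a test. It compares systems."))  # False
```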
Progress Over Time
[Interactive timeline of model performance on Scale MultiChallenge over time, with a state-of-the-art frontier line and separate markers for open and proprietary models.]
Scale MultiChallenge Leaderboard
6 models
| Rank | Organization | Params | Context | Cost (input / output per 1M tokens) |
|---|---|---|---|---|
| 1 | OpenAI | — | 400K | $1.25 / $10.00 |
| 2 | OpenAI | — | 200K | $2.00 / $8.00 |
| 3 | — | 120B | 262K | $0.10 / $0.50 |
| 4 | OpenAI | — | 200K | $1.10 / $4.40 |
| 5 | OpenAI | — | 128K | $2.50 / $10.00 |
| 6 | — | 32B | 262K | $0.06 / $0.24 |
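The Cost column appears to list input and output rates in USD per million tokens (an assumption, since the scraped table lost its units). Under that assumption, the cost of a single request can be estimated as follows; the token counts in the example are made up:

```python
# Hedged sketch: applying per-1M-token pricing (assumed input / output rates,
# as in the leaderboard's Cost column) to estimate one request's cost.
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Prices are USD per 1M tokens; returns the request cost in USD."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# e.g. the rank-1 row's rates ($1.25 in / $10.00 out) on a hypothetical
# request with 50k input tokens and 2k output tokens:
print(round(request_cost(50_000, 2_000, 1.25, 10.00), 4))  # 0.0825
```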
FAQ
Common questions about Scale MultiChallenge
**What is Scale MultiChallenge?**
MultiChallenge is a realistic multi-turn conversation benchmark from Scale AI that evaluates large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, context allocation, and in-context reasoning. Despite achieving near-perfect scores on existing multi-turn benchmarks, every frontier model scores below 50% accuracy on MultiChallenge.

**Where can I read the MultiChallenge paper?**
The Scale MultiChallenge paper is available at https://arxiv.org/abs/2501.17399. It details the benchmark methodology, dataset creation, and evaluation criteria.

**How are models ranked on the leaderboard?**
The Scale MultiChallenge leaderboard ranks 6 AI models by their performance on the benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.696. The average score across all models is 0.512.

**What is the highest Scale MultiChallenge score?**
The highest score is 0.696, achieved by GPT-5 from OpenAI.

**How many models have been evaluated?**
6 models have been evaluated on the Scale MultiChallenge benchmark; all 6 results are self-reported, with 0 verified.

**What categories does the benchmark cover?**
Scale MultiChallenge is categorized under communication, general, and reasoning, and evaluates text models.