
MRCR

MRCR (Multi-Round Coreference Resolution) is a synthetic long-context reasoning task where models must navigate long conversations to reproduce specific model outputs. It tests the ability to distinguish between similar requests and reason about ordering while maintaining attention across extended contexts.
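To make the task description concrete, here is a minimal toy sketch of an MRCR-style item: a conversation containing several similar requests, a query that asks for the response to one specific occurrence, and a score for the model's reproduction. The conversation contents and the helper names (`build_conversation`, `expected_output`, `similarity`) are hypothetical, and scoring with `difflib.SequenceMatcher` is an assumption made for illustration; this page does not state how MRCR is actually scored.

```python
from difflib import SequenceMatcher


def build_conversation():
    """Return hypothetical (user request, assistant response) turns.

    Two turns share the same request, so the ordering matters.
    """
    return [
        ("Write a poem about tapirs.", "Poem A: tapirs wander at dusk."),
        ("Write a riddle about clocks.", "Riddle: what has hands but cannot clap?"),
        ("Write a poem about tapirs.", "Poem B: a tapir snorts beside the stream."),
        ("Write a poem about rivers.", "Poem C: the river forgets its source."),
    ]


def expected_output(turns, request, occurrence):
    """Return the response to the occurrence-th (1-based) turn matching request."""
    matches = [response for req, response in turns if req == request]
    if not 1 <= occurrence <= len(matches):
        raise ValueError("no such occurrence of the request")
    return matches[occurrence - 1]


def similarity(prediction, reference):
    """Assumed scoring rule: string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, prediction, reference).ratio()


turns = build_conversation()
# Query: "Reproduce the second poem about tapirs."
reference = expected_output(turns, "Write a poem about tapirs.", 2)

exact = similarity("Poem B: a tapir snorts beside the stream.", reference)
wrong_occurrence = similarity("Poem A: tapirs wander at dusk.", reference)
print(exact, wrong_occurrence)
```

In this sketch a model that returns the first tapir poem instead of the second scores below 1.0, which mirrors the need to distinguish similar requests by their order in the conversation.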


Progress Over Time

[Figure: interactive timeline of model performance on MRCR. Legend: state-of-the-art frontier, open, proprietary.]

MRCR Leaderboard

7 models

Rank	Model (organization)	Parameters	Context	Cost
1	Gemini 2.5 Pro Google	—	1.0M	$1.25 / $10.00
2	Gemini 1.5 Pro Google	—	2.1M	$2.50 / $10.00
3	Gemini 1.5 Flash Google	—	1.0M	$0.15 / $0.60
4	Gemini 2.0 Flash Google	—	1.0M	$0.10 / $0.40
5	Gemini 1.5 Flash 8B Google	8B	1.0M	$0.07 / $0.30
6	MiMo-V2-Flash Xiaomi	309B	256K	$0.10 / $0.30
7	Gemini 2.5 Flash Google	—	1.0M	$0.30 / $2.50
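The Cost column lists two prices per model. Assuming these are USD per million input tokens and per million output tokens (the page does not state the units), the cost of a single long-context request can be estimated roughly as follows. The function name `request_cost` and the token counts are hypothetical, chosen to resemble a 128K-context MRCR item.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Estimate the USD cost of one request.

    Assumes both prices are quoted per 1,000,000 tokens.
    """
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Gemini 2.5 Pro row: $1.25 / $10.00.
# Hypothetical request: 128,000 prompt tokens and 1,000 generated tokens.
cost = request_cost(128_000, 1_000, 1.25, 10.00)
print(f"${cost:.2f}")  # prints $0.17
```

Under this assumption the prompt dominates the bill for long-context tasks: 128,000 input tokens account for $0.16 of the $0.17 total.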


FAQ

Common questions about MRCR

What is the MRCR benchmark?

MRCR (Multi-Round Coreference Resolution) is a synthetic long-context reasoning task in which a model must find and reproduce a specific earlier output from a long conversation.

Where can I find the MRCR paper?

The MRCR paper is available at https://arxiv.org/abs/2409.12640. It describes the benchmark methodology, dataset creation, and evaluation criteria.

What is the MRCR leaderboard?

The MRCR leaderboard ranks 7 AI models by their performance on this benchmark. Gemini 2.5 Pro by Google currently leads with a score of 0.930, and the average score across all models is 0.642.

What is the highest MRCR score?

The highest MRCR score is 0.930, achieved by Gemini 2.5 Pro from Google.

How many models are evaluated on MRCR?

Seven models have been evaluated on the MRCR benchmark. All 7 results are self-reported; none are verified.

What categories does MRCR cover?

MRCR is categorized under general, long context, and reasoning. The benchmark evaluates text models.

Sub-benchmarks

MRCR 128K (2-needle)

MRCR 128K (4-needle)

MRCR 128K (8-needle)

MRCR 64K (2-needle)

MRCR 64K (4-needle)

MRCR 64K (8-needle)