MRCR

MRCR (Multi-Round Coreference Resolution) is a synthetic long-context reasoning task where models must navigate long conversations to reproduce specific model outputs. It tests the ability to distinguish between similar requests and reason about ordering while maintaining attention across extended contexts.
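Scores on this benchmark are continuous values between 0 and 1 (the leaderboard's max score is 1), reflecting how closely the model's answer matches the target turn. A minimal sketch of string-similarity grading in that spirit, assuming a SequenceMatcher-style ratio (the exact official grader may differ):

```python
from difflib import SequenceMatcher

def mrcr_score(response: str, target: str) -> float:
    """Continuous match score in [0, 1] between a model's response and
    the exact earlier output it was asked to reproduce.
    Illustrative sketch only; the official MRCR grader may differ.
    """
    return SequenceMatcher(None, response, target).ratio()
```

A verbatim reproduction scores 1.0; partial or paraphrased reproductions score proportionally lower, which is why leaderboard averages fall between 0 and 1 rather than being simple pass rates.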

Paper

https://arxiv.org/abs/2409.12640

Progress Over Time

Interactive timeline showing model performance evolution on MRCR


MRCR Leaderboard

7 models

#   Params   Context   Cost (input / output, per 1M tokens)
1   -        1.0M      $1.25 / $10.00
2   -        2.1M      $2.50 / $10.00
3   -        1.0M      $0.15 / $0.60
4   -        1.0M      $0.10 / $0.40
5   8B       1.0M      $0.07 / $0.30
6   309B     256K      $0.10 / $0.30
7   -        1.0M      $0.30 / $2.50

FAQ

Common questions about MRCR

The MRCR paper is available at https://arxiv.org/abs/2409.12640. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The MRCR leaderboard ranks 7 AI models based on their performance on this benchmark. Currently, Gemini 2.5 Pro by Google leads with a score of 0.930. The average score across all models is 0.642.
The highest MRCR score is 0.930, achieved by Gemini 2.5 Pro from Google.
7 models have been evaluated on the MRCR benchmark, with 0 verified results and 7 self-reported results.
MRCR is categorized under general, long context, and reasoning. The benchmark evaluates text models.

Sub-benchmarks

MRCR 128K (2-needle)

MRCR at a 128K-token context with 2 needles: models must navigate a long conversation and reproduce 2 specific earlier outputs, testing attention and ordering across the full context.

Modality: text · Max score: 1

MRCR 128K (4-needle)

MRCR at a 128K-token context with 4 needles: models must navigate a long conversation and reproduce 4 specific earlier outputs, testing attention and ordering across the full context.

Modality: text · Max score: 1

MRCR 128K (8-needle)

MRCR at a 128K-token context with 8 needles: models must navigate a long conversation and reproduce 8 specific earlier outputs, testing attention and ordering across the full context.

Modality: text · Max score: 1

MRCR 64K (2-needle)

MRCR at a 64K-token context with 2 needles: models must navigate a long conversation and reproduce 2 specific earlier outputs, testing attention and ordering across the full context.

Modality: text · Max score: 1

MRCR 64K (4-needle)

MRCR at a 64K-token context with 4 needles: models must navigate a long conversation and reproduce 4 specific earlier outputs, testing attention and ordering across the full context.

Modality: text · Max score: 1

MRCR 64K (8-needle)

MRCR at a 64K-token context with 8 needles: models must navigate a long conversation and reproduce 8 specific earlier outputs, testing attention and ordering across the full context.

Modality: text · Max score: 1
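The sub-benchmarks above vary only two knobs: context length and needle count. A minimal sketch of how such a multi-needle conversation could be assembled (the function, message format, and final query wording are illustrative assumptions, not the official dataset generator):

```python
import random

def build_mrcr_conversation(needles, distractors, n_needles=2, seed=0):
    """Assemble a synthetic MRCR-style conversation (illustrative sketch).

    `needles` and `distractors` are (user_request, assistant_output) pairs;
    distractor requests are deliberately similar to the needles. Needle
    order is preserved, because the final query refers to a needle by its
    position among the look-alike requests.
    """
    rng = random.Random(seed)
    turns = list(needles[:n_needles])
    for d in distractors:
        # list.insert keeps the relative order of existing elements,
        # so the needles stay in their original sequence
        turns.insert(rng.randrange(len(turns) + 1), d)

    conversation = []
    for request, output in turns:
        conversation.append({"role": "user", "content": request})
        conversation.append({"role": "assistant", "content": output})

    # Final query: reproduce one specific earlier output verbatim
    conversation.append({
        "role": "user",
        "content": f"Reproduce, verbatim, answer #{n_needles} "
                   "among the matching requests above.",
    })
    return conversation
```

Scaling this sketch to the 64K/128K variants would mean padding with enough distractor turns to fill the target context, with 2, 4, or 8 needles interleaved among them.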