MRCR v2 (8-needle)
MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark in which 8 "needle" items must be retrieved from a single long context. It tests a model's ability to track and reason about multiple pieces of information simultaneously across an extended multi-turn conversation.
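To make the task concrete, here is a minimal sketch of how an 8-needle item might be graded, assuming an OpenAI-MRCR-style setup: the model must reproduce one of the 8 needles verbatim, prefixed by a per-item random string, and is scored by string similarity. The item format, `grade` function, and prefix gating below are illustrative assumptions, not the official harness.

```python
# Illustrative sketch only; the official MRCR v2 grader may differ.
from difflib import SequenceMatcher

def grade(response: str, answer: str, random_prefix: str) -> float:
    """Score in [0, 1]: string similarity between the model's response and
    the reference needle, gated on reproducing the required random prefix
    (which stops a model from scoring by echoing memorized text)."""
    if not response.startswith(random_prefix):
        return 0.0
    return SequenceMatcher(None, response, answer).ratio()

# A long conversation hides 8 needles (e.g., 8 poems about penguins) among
# distractor turns; the final prompt asks for, say, the 5th one.
prefix = "x7Qp-"  # hypothetical per-item random prefix
answer = prefix + "Needle five: the penguin poem text..."
print(grade(prefix + "Needle five: the penguin poem text", answer, prefix))
# -> close to 1.0 for a near-verbatim retrieval; 0.0 without the prefix
```

Scoring by similarity rather than exact match gives partial credit for near-verbatim retrievals, which matters when the needles are long.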
Progress Over Time
Interactive timeline (not reproduced here) showing model performance evolution on MRCR v2 (8-needle), with a state-of-the-art frontier line and a legend separating open and proprietary models.
MRCR v2 (8-needle) Leaderboard
8 models evaluated
| Rank | Organization | Model | Score | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Anthropic | Claude Opus 4.6 | 0.930 | 1.0M | $5.00 / $25.00 | — |
| 2 | Google | — | — | 1.0M | $0.25 / $1.50 | — |
| 3 | OpenAI | — | — | 400K | $0.75 / $4.50 | — |
| 4 | OpenAI | — | — | 400K | $0.20 / $1.25 | — |
| 5 | Google | — | — | — | — | — |
| 5 | Google | — | — | 1.0M | $2.50 / $15.00 | — |
| 7 | Google | — | — | 1.0M | $0.50 / $3.00 | — |
| 8 | — | — | — | 1.0M | $1.25 / $10.00 | — |
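As a rough guide to evaluation cost, the sketch below estimates the price of a single near-full-context query from the Cost column, under the assumption (not stated in the table) that the two figures are input and output prices per 1M tokens.

```python
# Hypothetical cost estimate; assumes the Cost column lists input/output
# prices per 1M tokens (verify against each provider's pricing page).
def query_cost(input_tokens: int, output_tokens: int,
               price_in: float, price_out: float) -> float:
    """Dollar cost of one request at per-1M-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# One 1M-token prompt with a 2K-token response against the rank-1 entry
# ($5.00 / $25.00) comes to about $5.05.
print(f"${query_cost(1_000_000, 2_000, 5.00, 25.00):.2f}")
```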
FAQ
Common questions about MRCR v2 (8-needle)
What is MRCR v2 (8-needle)?
MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark in which 8 "needle" items must be retrieved from a single long context. It tests a model's ability to track and reason about multiple pieces of information simultaneously across an extended multi-turn conversation.

Where can I read about the benchmark?
The paper behind MRCR v2 (8-needle) is available at https://arxiv.org/abs/2409.12640. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the MRCR v2 (8-needle) leaderboard?
The leaderboard ranks 8 AI models by their performance on this benchmark. Claude Opus 4.6 by Anthropic currently leads with a score of 0.930; the average score across all models is 0.389.

What is the highest MRCR v2 (8-needle) score?
The highest score is 0.930, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated?
8 models have been evaluated on the MRCR v2 (8-needle) benchmark, with 0 verified results and 8 self-reported results.

What categories does MRCR v2 (8-needle) cover?
The benchmark is categorized under general, long context, and reasoning, and evaluates text models.