Benchmarks/long context/OpenAI-MRCR: 2 needle 128k

OpenAI-MRCR: 2 needle 128k

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.

Paper

Progress Over Time

Interactive timeline showing model performance evolution on OpenAI-MRCR: 2 needle 128k

State-of-the-art frontier
Open
Proprietary

OpenAI-MRCR: 2 needle 128k Leaderboard

9 models
ContextCostLicense
1
OpenAI
OpenAI
400K$1.25 / $10.00
2456B
3456B1.0M$0.55 / $2.20
4
OpenAI
OpenAI
1.0M$2.00 / $8.00
51.0M$0.40 / $1.60
6
OpenAI
OpenAI
128K$75.00 / $150.00
71.0M$0.10 / $0.40
8
OpenAI
OpenAI
128K$2.50 / $10.00
9
OpenAI
OpenAI
200K$1.10 / $4.40
Notice missing or incorrect data?

FAQ

Common questions about OpenAI-MRCR: 2 needle 128k

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.
The OpenAI-MRCR: 2 needle 128k paper is available at https://arxiv.org/abs/2403.05530. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OpenAI-MRCR: 2 needle 128k leaderboard ranks 9 AI models based on their performance on this benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.952. The average score across all models is 0.528.
The highest OpenAI-MRCR: 2 needle 128k score is 0.952, achieved by GPT-5 from OpenAI.
9 models have been evaluated on the OpenAI-MRCR: 2 needle 128k benchmark, with 0 verified results and 9 self-reported results.
OpenAI-MRCR: 2 needle 128k is categorized under long context and reasoning. The benchmark evaluates text models.