OpenAI-MRCR: 2 needle 128k
Multi-round Co-reference Resolution (MRCR) is a benchmark that evaluates an LLM's ability to distinguish between multiple needles hidden in a long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, which requires reasoning and disambiguation skills beyond simple retrieval.
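The task shape described above can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the benchmark's actual data or grader: the helper names (`build_conversation`, `nth_reply`, `score`), the filler turns, the "tapir poem" request, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions made for illustration. It hides two replies to the same request among unrelated turns, then checks which reply corresponds to the second occurrence.

```python
import random
from difflib import SequenceMatcher


def build_conversation(filler_pairs, request, replies, seed=0):
    """Hide one (request, reply) pair per reply among the filler turns.

    Hypothetical helper: the needles keep their relative order, so the
    n-th reply in `replies` answers the n-th occurrence of `request`.
    """
    rng = random.Random(seed)
    pairs = list(filler_pairs)
    positions = sorted(rng.sample(range(len(pairs) + 1), len(replies)))
    # Insert from the back so earlier insertion positions stay valid.
    for pos, reply in reversed(list(zip(positions, replies))):
        pairs.insert(pos, (request, reply))
    return pairs


def nth_reply(pairs, request, n):
    """Reference answer: the reply to the n-th (1-based) occurrence of `request`."""
    matches = [reply for req, reply in pairs if req == request]
    return matches[n - 1]


def score(response, reference):
    """String similarity in [0, 1]; an assumed metric for this sketch only."""
    return SequenceMatcher(None, response, reference).ratio()


# Toy data: six unrelated exchanges plus two needles for the same request.
filler = [(f"Write a riddle about item {i}.", f"Riddle {i} text.") for i in range(6)]
request = "Write a poem about tapirs."
replies = ["First tapir poem.", "Second tapir poem."]

conversation = build_conversation(filler, request, replies)
reference = nth_reply(conversation, request, 2)

# A model that returns the wrong instance gets partial credit at best.
perfect = score("Second tapir poem.", reference)
wrong_instance = score("First tapir poem.", reference)
```

Because both needles answer an identical request, simple keyword retrieval cannot tell them apart; the model has to track which occurrence is being asked for, which is the disambiguation skill the benchmark description refers to.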
OpenAI-MRCR: 2 needle 128k Leaderboard
9 models
| Rank | Organization | Parameters | Context | Cost |
|---|---|---|---|---|
| 1 | OpenAI | — | 400K | $1.25 / $10.00 |
| 2 | MiniMax | 456B | — | — |
| 3 | MiniMax | 456B | 1.0M | $0.55 / $2.20 |
| 4 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 5 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 6 | OpenAI | — | 128K | $75.00 / $150.00 |
| 7 | OpenAI | — | 1.0M | $0.10 / $0.40 |
| 8 | OpenAI | — | 128K | $2.50 / $10.00 |
| 9 | OpenAI | — | 200K | $1.10 / $4.40 |
FAQ
Common questions about OpenAI-MRCR: 2 needle 128k
The OpenAI-MRCR: 2 needle 128k paper is available at https://arxiv.org/abs/2403.05530. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OpenAI-MRCR: 2 needle 128k leaderboard ranks 9 AI models based on their performance on this benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.952. The average score across all models is 0.528.
The highest OpenAI-MRCR: 2 needle 128k score is 0.952, achieved by GPT-5 from OpenAI.
9 models have been evaluated on the OpenAI-MRCR: 2 needle 128k benchmark, with 0 verified results and 9 self-reported results.
OpenAI-MRCR: 2 needle 128k is categorized under long context and reasoning. The benchmark evaluates text models.