OpenAI-MRCR: 2 needle 128k

Paper

Progress Over Time

Interactive timeline showing model performance evolution on OpenAI-MRCR: 2 needle 128k

State-of-the-art frontier
Open
Proprietary

OpenAI-MRCR: 2 needle 128k Leaderboard

9 models
ContextCostLicense
1
OpenAI
OpenAI
2456B
3456B
4
OpenAI
OpenAI
1.0M$2.00 / $8.00
51.0M$0.40 / $1.60
6
OpenAI
OpenAI
71.0M$0.10 / $0.40
8
OpenAI
OpenAI
128K$2.50 / $10.00
9
OpenAI
OpenAI
Notice missing or incorrect data?
About this benchmark

What is OpenAI-MRCR: 2 needle 128k?

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.

OpenAI-MRCR: 2 needle 128k is a text benchmark evaluating models on long context and reasoning tasks. LLM Stats tracks 9 models on this benchmark, scored on a 0–1 scale. The current average is 0.5, with the leader at 1.0.

Compare leaders on the best AI for long context and best AI for reasoning leaderboards.

Current leaders

GPT-5 from OpenAI currently leads the OpenAI-MRCR: 2 needle 128k leaderboard with a score of 0.952 across 9 evaluated AI models.

1GPT-5OpenAI95.2%
2MiniMax M1 40KMiniMax76.1%
3MiniMax M1 80KMiniMax73.4%

Source paper

Title
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, and 1133 others
Published
Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

FAQ

Common questions about the OpenAI-MRCR: 2 needle 128k benchmark and leaderboard.

What is the OpenAI-MRCR: 2 needle 128k benchmark?

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.

What is the OpenAI-MRCR: 2 needle 128k leaderboard?

The OpenAI-MRCR: 2 needle 128k leaderboard ranks 9 AI models based on their performance on this benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.952. The average score across all models is 0.528.

What is the highest OpenAI-MRCR: 2 needle 128k score?

The highest OpenAI-MRCR: 2 needle 128k score is 0.952, achieved by GPT-5 from OpenAI.

How many models are evaluated on OpenAI-MRCR: 2 needle 128k?

9 models have been evaluated on the OpenAI-MRCR: 2 needle 128k benchmark, with 0 verified results and 9 self-reported results.

Where can I find the OpenAI-MRCR: 2 needle 128k paper?

The OpenAI-MRCR: 2 needle 128k paper is available at https://arxiv.org/abs/2403.05530. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does OpenAI-MRCR: 2 needle 128k cover?

OpenAI-MRCR: 2 needle 128k is categorized under long context and reasoning. The benchmark evaluates text models.

What is the best open-source model on OpenAI-MRCR: 2 needle 128k?

MiniMax M1 40K by MiniMax is the top-ranked open-source model on OpenAI-MRCR: 2 needle 128k, with a score of 0.761 (rank #2).

How recent are the OpenAI-MRCR: 2 needle 128k leaderboard results?

The OpenAI-MRCR: 2 needle 128k leaderboard was last updated in July 2026 and currently includes 9 evaluated models.