OpenAI-MRCR: 2 needle 128k

Name: OpenAI-MRCR: 2 needle 128k Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Paper

Progress Over Time

Interactive timeline showing model performance evolution on OpenAI-MRCR: 2 needle 128k

State-of-the-art frontier

Open

Proprietary

OpenAI-MRCR: 2 needle 128k Leaderboard

9 models

			Context	Cost
1	GPT-5 OpenAI	—	—	—
2	MiniMax M1 40K MiniMax	456B	—	—
3	MiniMax M1 80K MiniMax	456B	—	—
4	GPT-4.1 OpenAI	—	1.0M	$2.00 / $8.00
5	GPT-4.1 mini OpenAI	—	1.0M	$0.40 / $1.60
6	GPT-4.5 OpenAI	—	—	—
7	GPT-4.1 nano OpenAI	—	1.0M	$0.10 / $0.40
8	GPT-4o OpenAI	—	128K	$2.50 / $10.00
9	o3-mini OpenAI	—	—	—

Notice missing or incorrect data?

About this benchmark

What is OpenAI-MRCR: 2 needle 128k?

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.

OpenAI-MRCR: 2 needle 128k is a text benchmark evaluating models on long context and reasoning tasks. LLM Stats tracks 9 models on this benchmark, scored on a 0–1 scale. The current average is 0.5, with the leader at 1.0.

Compare leaders on the best AI for long context and best AI for reasoning leaderboards.

Current leaders

GPT-5 from OpenAI currently leads the OpenAI-MRCR: 2 needle 128k leaderboard with a score of 0.952 across 9 evaluated AI models.

GPT-5OpenAI95.2%

MiniMax M1 40KMiniMax76.1%

MiniMax M1 80KMiniMax73.4%

Source paper

Title: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, and 1133 others
Published: March 8, 2024
arXiv: 2403.05530

Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

FAQ

Common questions about the OpenAI-MRCR: 2 needle 128k benchmark and leaderboard.

What is the OpenAI-MRCR: 2 needle 128k benchmark?

What is the OpenAI-MRCR: 2 needle 128k leaderboard?

The OpenAI-MRCR: 2 needle 128k leaderboard ranks 9 AI models based on their performance on this benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.952. The average score across all models is 0.528.

What is the highest OpenAI-MRCR: 2 needle 128k score?

The highest OpenAI-MRCR: 2 needle 128k score is 0.952, achieved by GPT-5 from OpenAI.

How many models are evaluated on OpenAI-MRCR: 2 needle 128k?

9 models have been evaluated on the OpenAI-MRCR: 2 needle 128k benchmark, with 0 verified results and 9 self-reported results.

Where can I find the OpenAI-MRCR: 2 needle 128k paper?

The OpenAI-MRCR: 2 needle 128k paper is available at https://arxiv.org/abs/2403.05530. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does OpenAI-MRCR: 2 needle 128k cover?

OpenAI-MRCR: 2 needle 128k is categorized under long context and reasoning. The benchmark evaluates text models.

What is the best open-source model on OpenAI-MRCR: 2 needle 128k?

MiniMax M1 40K by MiniMax is the top-ranked open-source model on OpenAI-MRCR: 2 needle 128k, with a score of 0.761 (rank #2).

How recent are the OpenAI-MRCR: 2 needle 128k leaderboard results?

The OpenAI-MRCR: 2 needle 128k leaderboard was last updated in July 2026 and currently includes 9 evaluated models.