MRCR 128K (2-needle)

Name: MRCR 128K (2-needle) Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

MRCR 128K (2-needle) Leaderboard

1 model

Rank	Model	Organization	Parameters	Context	Cost	License
1	MiniCPM-SALA	OpenBMB	9B	—	—

About this benchmark

What is MRCR 128K (2-needle)?

MRCR (Multi-Round Coreference Resolution) at 128K context length with 2 needles. Models must navigate long conversations to reproduce specific model outputs, testing attention and reasoning across 128K-token contexts with 2 items to retrieve.
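
The task shape described above can be illustrated with a toy sketch. This is not the benchmark's actual construction or scoring code: the turn format, the `build_conversation` and `score` helpers, the sample text, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the real contexts run to roughly 128K tokens rather than a handful of turns.

```python
import random
from difflib import SequenceMatcher

# A turn is a (user_request, model_output) pair. All text here is made up.
Turn = tuple[str, str]


def build_conversation(needles: list[Turn], fillers: list[Turn], seed: int = 0) -> list[Turn]:
    """Scatter the needle turns among the filler turns, keeping needle order."""
    rng = random.Random(seed)
    turns = list(fillers)
    # Distinct, sorted insertion slots keep the needles in their original order.
    slots = sorted(rng.sample(range(len(turns) + 1), len(needles)))
    # Insert from the back so earlier slot indices are not shifted.
    for slot, needle in reversed(list(zip(slots, needles))):
        turns.insert(slot, needle)
    return turns


def score(response: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means the response reproduces the reference exactly."""
    return SequenceMatcher(None, response, reference).ratio()


# Two needles share the same request, so the query must pick one by position.
needles = [
    ("Write a poem about penguins.", "Penguins waddle on the ice."),
    ("Write a poem about penguins.", "Black and white beneath the moon."),
]
fillers = [(f"Write a riddle about topic {i}.", f"Riddle text {i}.") for i in range(6)]

conversation = build_conversation(needles, fillers, seed=1)
query = "Reproduce the 2nd poem about penguins."
reference = needles[1][1]

# A model that returns the wrong needle gets partial credit at best.
print(score(needles[0][1], reference), score(reference, reference))
```

In this sketch a response earns full credit only when it reproduces the requested output verbatim, which is why confusing the two needles is penalized even though both answer the same request.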

MRCR 128K (2-needle) is a text benchmark evaluating models on long context, reasoning, and general tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. The current average is 0.286, which is also the leader's score.

Current leaders

MiniCPM-SALA from OpenBMB currently leads the MRCR 128K (2-needle) leaderboard with a score of 0.286. It is the only model evaluated so far.

MiniCPM-SALA (OpenBMB): 28.6%
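
The percentage above is the same 0–1 score written at a different precision. A minimal sketch of that arithmetic, assuming the page's summary figures are derived from a plain mapping of model name to score (the `scores` dictionary is a hypothetical stand-in, not LLM Stats' data format):

```python
from statistics import mean

# Hypothetical stand-in for the leaderboard data: the single self-reported entry.
scores = {"MiniCPM-SALA": 0.286}

# Leader is the entry with the highest score; average is the plain mean.
leader, best = max(scores.items(), key=lambda item: item[1])
average = mean(scores.values())

print(leader, f"{best:.1%}")  # percentage form of the 0-1 score
print(round(average, 1))      # one-decimal display of the same value
```

With a single entry, the leader's score and the average coincide, and a one-decimal display rounds both up.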

Source paper

Title: Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
Authors: Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, and 20 others
Published: September 19, 2024
arXiv: 2409.12640

Abstract

We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries framework (LSQ) is to construct tasks which require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and b) that there is significant room for improvement in synthesizing long-context information.

FAQ

Common questions about the MRCR 128K (2-needle) benchmark and leaderboard.

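What is the MRCR 128K (2-needle) benchmark?

MRCR 128K (2-needle) is the 128K-token, 2-needle variant of MRCR (Multi-Round Coreference Resolution). Models must navigate long conversations and reproduce specific earlier model outputs, with 2 items to retrieve.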

What is the MRCR 128K (2-needle) leaderboard?

The MRCR 128K (2-needle) leaderboard ranks AI models by their performance on this benchmark; 1 model is currently listed. MiniCPM-SALA by OpenBMB leads with a score of 0.286, which is also the average, since it is the only entry.

What is the highest MRCR 128K (2-needle) score?

The highest MRCR 128K (2-needle) score is 0.286, achieved by MiniCPM-SALA from OpenBMB.

How many models are evaluated on MRCR 128K (2-needle)?

1 model has been evaluated on the MRCR 128K (2-needle) benchmark, with 0 verified results and 1 self-reported result.

Where can I find the MRCR 128K (2-needle) paper?

The MRCR 128K (2-needle) paper is available at https://arxiv.org/abs/2409.12640. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MRCR 128K (2-needle) cover?

MRCR 128K (2-needle) is categorized under long context, reasoning, and general. The benchmark evaluates text models.

What's the difference between MRCR 128K (2-needle) and MRCR?

MRCR 128K (2-needle) is a variant of MRCR. See the MRCR leaderboard for the broader benchmark and per-model comparison.

What is the best open-source model on MRCR 128K (2-needle)?

MiniCPM-SALA by OpenBMB is the top-ranked open-source model on MRCR 128K (2-needle), with a score of 0.286 (rank #1).

How recent are the MRCR 128K (2-needle) leaderboard results?

The MRCR 128K (2-needle) leaderboard was last updated in August 2026 and currently includes 1 evaluated model.