OpenAI-MRCR: 2 needle 128k
OpenAI-MRCR: 2 needle 128k Leaderboard
| Rank | Organization | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | OpenAI | — | — | — | |
| 2 | MiniMax | 456B | — | — | |
| 3 | MiniMax | 456B | — | — | |
| 4 | OpenAI | — | 1.0M | $2.00 / $8.00 | |
| 5 | OpenAI | — | 1.0M | $0.40 / $1.60 | |
| 6 | OpenAI | — | — | — | |
| 7 | OpenAI | — | 1.0M | $0.10 / $0.40 | |
| 8 | OpenAI | — | 128K | $2.50 / $10.00 | |
| 9 | OpenAI | — | — | — | |
What is OpenAI-MRCR: 2 needle 128k?
Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.
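The task shape described above can be illustrated with a small sketch. Everything here is hypothetical and for illustration only: the helper names (`build_conversation`, `find_instance`, `score`), the toy conversation, and the use of a string-similarity ratio to get a 0–1 score are assumptions, not the benchmark's actual data or grading code.

```python
from difflib import SequenceMatcher

def build_conversation(filler_turns, request, replies, positions):
    """Hide the same request several times, each with a different reply.

    positions are indices into filler_turns where each needle pair goes.
    """
    turns = list(filler_turns)
    # Insert from the back so earlier insert positions stay valid.
    for pos, reply in sorted(zip(positions, replies), reverse=True):
        turns[pos:pos] = [("user", request), ("assistant", reply)]
    return turns

def find_instance(turns, request, k):
    """Return the assistant reply to the k-th (1-based) occurrence of request."""
    seen = 0
    for i, (role, text) in enumerate(turns):
        if role == "user" and text == request:
            seen += 1
            if seen == k:
                return turns[i + 1][1]
    raise ValueError(f"fewer than {k} occurrences of the request")

def score(response, expected):
    """Assumed 0-1 score: character-level similarity of response and target."""
    return SequenceMatcher(None, response, expected).ratio()

# Toy 2-needle example (hypothetical content).
REQUEST = "Write a poem about tapirs"
filler = [
    ("user", "Write a haiku about rain"),
    ("assistant", "Rain taps the window"),
    ("user", "Write a riddle about clocks"),
    ("assistant", "I have hands but no arms"),
]
conversation = build_conversation(
    filler, REQUEST, ["Tapir poem one", "Tapir poem two"], positions=[0, 2]
)

# The model is asked for the reply to the 2nd identical request.
expected = find_instance(conversation, REQUEST, 2)
right = score("Tapir poem two", expected)   # correct instance
wrong = score("Tapir poem one", expected)   # the other needle
```

Because both needles answer the same request, returning the wrong instance still produces a plausible-looking reply; the sketch shows why the task requires disambiguation rather than simple lookup.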
OpenAI-MRCR: 2 needle 128k is a text benchmark that evaluates models on long-context and reasoning tasks. LLM Stats tracks 9 models on this benchmark, scored on a 0–1 scale. The current average is roughly 0.5, and the top score is 0.952.
Current leaders
GPT-5 from OpenAI currently leads the OpenAI-MRCR: 2 needle 128k leaderboard with a score of 0.952, the highest among the 9 evaluated models.
Source paper
- Title: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
- Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, and 1133 others
- Published: arXiv:2403.05530
Abstract
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.