NIH/Multi-needle



About this benchmark

What is NIH/Multi-needle?

NIH/Multi-needle is a multi-needle-in-a-haystack benchmark. It evaluates the long-context comprehension of language models by testing whether they can retrieve multiple target pieces of information from extended documents.
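As a rough illustration of the idea, the sketch below hides several "needle" sentences in filler text and scores an answer by the fraction of needles recovered, which yields a value on a 0 to 1 scale. This is a minimal sketch, not the benchmark's actual harness: the function names, the filler text, the secret codes, and the substring-matching rule are all assumptions made for the example.

```python
import random


def build_haystack(filler, needles, seed=0):
    """Insert each needle sentence at a random position among the filler sentences."""
    rng = random.Random(seed)
    sentences = list(filler)
    for needle in needles:
        sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)


def multi_needle_score(answer, expected_values):
    """Return the fraction of expected values found in the answer (0.0 to 1.0)."""
    if not expected_values:
        raise ValueError("expected_values must be non-empty")
    answer_lower = answer.lower()
    found = sum(1 for value in expected_values if value.lower() in answer_lower)
    return found / len(expected_values)


# Hypothetical data: three needles hidden among 1000 filler sentences.
filler = [f"Filler sentence number {i}." for i in range(1000)]
secrets = {"Paris": "7421", "Oslo": "1593", "Lima": "8806"}
needles = [f"The secret code for {city} is {code}." for city, code in secrets.items()]
haystack = build_haystack(filler, needles)

# A made-up model answer that recovers two of the three needles.
model_answer = "Paris: 7421, Oslo: 1593, Lima: unknown"
score = multi_needle_score(model_answer, list(secrets.values()))
print(f"{score:.3f}")
```

Substring matching is the simplest possible scoring rule; a real evaluation may use stricter parsing or a judge model.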

NIH/Multi-needle is a text benchmark evaluating models on long context tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the current average and the leading score are both 0.847.


Current leaders

Llama 3.2 3B Instruct from Meta currently leads the NIH/Multi-needle leaderboard with a score of 0.847. It is the only model evaluated so far.

Source paper

Title
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Authors
Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, and 5 others
Published
Abstract

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
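The sub-image retrieval setup described in the abstract can be pictured with a small sketch. It assumes each haystack image is stitched from a square grid of sub-images, that a needle's label is an (image index, row, column) triple, and that a negative sample is labelled None; these conventions and the function names are illustrative assumptions, not the paper's actual label format or code.

```python
from typing import List, Optional, Tuple

Location = Tuple[int, int, int]  # (image index, row, column), all zero-based


def flat_to_location(flat_index: int, grid_size: int) -> Location:
    """Map a flat sub-image index to (image index, row, column) for square grids."""
    if flat_index < 0 or grid_size < 1:
        raise ValueError("flat_index must be >= 0 and grid_size must be >= 1")
    image_index, within_image = divmod(flat_index, grid_size * grid_size)
    row, col = divmod(within_image, grid_size)
    return (image_index, row, col)


def exact_match_accuracy(
    predictions: List[Optional[Location]],
    labels: List[Optional[Location]],
) -> float:
    """Fraction of samples where the predicted location equals the label exactly.

    A label of None marks a negative sample (the needle is absent), so any
    predicted location on such a sample counts as an error.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


# Sub-image 6 in a haystack of 2x2 stitched images sits in image 1, row 1, column 0.
needle_label = flat_to_location(6, grid_size=2)

# Hypothetical results: the second sample is negative, and the prediction
# hallucinates a location for it.
labels = [needle_label, None, (0, 0, 1)]
predictions = [(1, 1, 0), (0, 0, 0), (0, 0, 1)]
accuracy = exact_match_accuracy(predictions, labels)
print(needle_label, round(accuracy, 3))
```

Scoring negative samples with the same exact-match rule makes hallucinated locations show up directly as lost accuracy.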

FAQ

Common questions about the NIH/Multi-needle benchmark and leaderboard.

What is the NIH/Multi-needle benchmark?

NIH/Multi-needle is a multi-needle-in-a-haystack benchmark. It evaluates the long-context comprehension of language models by testing whether they can retrieve multiple target pieces of information from extended documents.

What is the NIH/Multi-needle leaderboard?

The NIH/Multi-needle leaderboard ranks AI models by their performance on this benchmark and currently lists 1 model. Llama 3.2 3B Instruct by Meta leads with a score of 0.847; with only one entry, that is also the average score.

What is the highest NIH/Multi-needle score?

The highest NIH/Multi-needle score is 0.847, achieved by Llama 3.2 3B Instruct from Meta.

How many models are evaluated on NIH/Multi-needle?

1 model has been evaluated on the NIH/Multi-needle benchmark, with 0 verified results and 1 self-reported result.

Where can I find the NIH/Multi-needle paper?

The NIH/Multi-needle paper is available at https://arxiv.org/abs/2406.11230. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does NIH/Multi-needle cover?

NIH/Multi-needle is categorized under long context. The benchmark evaluates text models.

What is the best open-source model on NIH/Multi-needle?

Llama 3.2 3B Instruct by Meta is the top-ranked open-source model on NIH/Multi-needle, with a score of 0.847 (rank #1).

How recent are the NIH/Multi-needle leaderboard results?

The NIH/Multi-needle leaderboard was last updated in July 2026 and currently includes 1 evaluated model.