What is the NIH/Multi-needle leaderboard?

The NIH/Multi-needle leaderboard ranks AI models by their performance on this benchmark. It currently lists one model: Llama 3.2 3B Instruct by Meta, which leads with a score of 0.847. Because there is only one entry, the average score across all models is also 0.847.

What is the highest NIH/Multi-needle score?

The highest NIH/Multi-needle score is 0.847, achieved by Llama 3.2 3B Instruct from Meta.

How many models are evaluated on NIH/Multi-needle?

One model has been evaluated on the NIH/Multi-needle benchmark, with 0 verified results and 1 self-reported result.

Where can I find the NIH/Multi-needle paper?

The NIH/Multi-needle paper is available at https://arxiv.org/abs/2406.11230. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does NIH/Multi-needle cover?

NIH/Multi-needle is categorized under long context. The benchmark evaluates text models.

What is the best open-source model on NIH/Multi-needle?

Llama 3.2 3B Instruct by Meta is the top-ranked open-source model on NIH/Multi-needle, with a score of 0.847 (rank #1).

How recent are the NIH/Multi-needle leaderboard results?

The NIH/Multi-needle leaderboard was last updated in June 2026 and currently includes one evaluated model.

NIH/Multi-needle

A multi-needle-in-a-haystack benchmark that evaluates the long-context comprehension of language models by testing whether they can retrieve multiple target pieces of information from extended documents.
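To make the description above concrete, here is a minimal Python sketch of how a multi-needle test could be set up and scored. It is an illustration only, not the benchmark's actual procedure: the function names build_haystack and multi_needle_recall, the filler text, the example needles, and the substring-based recall metric are all assumptions made for this example. The paper linked above describes the real methodology.

```python
import random


def build_haystack(filler_sentences, needles, seed=0):
    """Insert each needle sentence at a random position among the filler.

    Hypothetical helper: real benchmarks may control needle depth and
    context length far more carefully than this.
    """
    rng = random.Random(seed)
    doc = list(filler_sentences)
    for needle in needles:
        # randint is inclusive, so a needle can also land at the very end.
        doc.insert(rng.randint(0, len(doc)), needle)
    return " ".join(doc)


def multi_needle_recall(answer, expected_values):
    """Return the fraction of expected values found in the model's answer.

    Assumed metric: case-insensitive substring match per needle.
    """
    if not expected_values:
        return 0.0
    lowered = answer.lower()
    found = sum(1 for value in expected_values if value.lower() in lowered)
    return found / len(expected_values)


# Illustrative data, not taken from the benchmark.
filler = ["The sky was grey that morning."] * 50
needles = [
    "The secret code for Paris is 4821.",
    "The secret code for Lima is 1390.",
    "The secret code for Oslo is 7754.",
]
expected = ["4821", "1390", "7754"]

haystack = build_haystack(filler, needles)
# A model would be given `haystack` plus a question asking for all codes;
# here a hand-written answer stands in for the model output.
sample_answer = "The codes are 4821 and 1390."
score = multi_needle_recall(sample_answer, expected)
```

Under this assumed metric, a score such as 0.847 would mean that, averaged over test cases, the model recovered roughly 85% of the planted items. Whether the leaderboard score is computed this way is not stated on this page.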

Llama 3.2 3B Instruct from Meta currently leads the NIH/Multi-needle leaderboard with a score of 0.847 and is the only model evaluated so far.