NIH/Multi-needle
A multi-needle-in-a-haystack benchmark that evaluates the long-context comprehension of language models by testing whether they can retrieve multiple target pieces of information ("needles") embedded in extended documents.
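The paper defines the exact protocol, but to make the setup concrete, here is a minimal sketch of how a multi-needle retrieval test can be assembled and scored. The filler sentence, needle texts, context size, simple substring-recall scoring, and the call_your_model function are all illustrative assumptions, not the benchmark's actual implementation.

```python
import random

def build_haystack(needles, filler_sentence, target_chars=400_000, seed=0):
    """Embed several 'needle' facts at random positions in long filler text."""
    rng = random.Random(seed)
    sentences = [filler_sentence] * (target_chars // len(filler_sentence))
    for needle in needles:
        sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def recall_score(answer, expected_values):
    """Fraction of needle values found verbatim in the model's answer."""
    hits = sum(value.lower() in answer.lower() for value in expected_values)
    return hits / len(expected_values)

# Illustrative needles and the values a correct answer must contain.
needles = [
    "The secret ingredient in the recipe is saffron.",
    "The team meeting takes place on the 14th floor.",
    "The access code for the archive is 7291.",
]
expected = ["saffron", "14th floor", "7291"]

haystack = build_haystack(needles, "The grass is green and the sky is blue.")
prompt = (
    haystack
    + "\n\nQuestion: What is the secret ingredient, where does the meeting "
    "take place, and what is the access code?"
)
# answer = call_your_model(prompt)   # hypothetical model API call
# print(recall_score(answer, expected))
```

Needle-in-a-haystack evaluations typically average a recall score like this over many needle placements and context lengths rather than a single configuration.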
Progress Over Time
[Interactive timeline: model performance evolution on NIH/Multi-needle over time, showing the state-of-the-art frontier with open and proprietary models marked.]
NIH/Multi-needle Leaderboard
1 model
| Rank | Model | Parameters | Context | Cost (input / output) | Score | License |
|---|---|---|---|---|---|---|
| 1 | Llama 3.2 3B Instruct (Meta) | 3B | 128K | $0.01 / $0.02 | 0.847 | — |
FAQ
Common questions about NIH/Multi-needle
What is NIH/Multi-needle?
NIH/Multi-needle is a multi-needle-in-a-haystack benchmark that evaluates the long-context comprehension of language models by testing whether they can retrieve multiple target pieces of information ("needles") embedded in extended documents.
Where can I find the NIH/Multi-needle paper?
The NIH/Multi-needle paper is available at https://arxiv.org/abs/2406.11230. It details the benchmark's methodology, dataset creation, and evaluation criteria.
How do models rank on the NIH/Multi-needle leaderboard?
The NIH/Multi-needle leaderboard ranks 1 AI model by its performance on this benchmark. Currently, Llama 3.2 3B Instruct by Meta leads with a score of 0.847; with a single entry, this is also the average score across all models.
What is the highest NIH/Multi-needle score?
The highest NIH/Multi-needle score is 0.847, achieved by Llama 3.2 3B Instruct from Meta.
How many models have been evaluated on NIH/Multi-needle?
1 model has been evaluated on the NIH/Multi-needle benchmark, with 0 verified results and 1 self-reported result.
What type of benchmark is NIH/Multi-needle?
NIH/Multi-needle is categorized under long context. The benchmark evaluates text models.