NIH/Multi-needle
A multi-needle-in-a-haystack benchmark that evaluates the long-context comprehension of language models by testing whether they can retrieve multiple target pieces of information from an extended document.
Llama 3.2 3B Instruct from Meta currently leads the NIH/Multi-needle leaderboard with a score of 0.847. It is the only AI model evaluated so far.
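To make the task description concrete, here is a minimal Python sketch of how a multi-needle test could be set up and scored. It is an illustration under stated assumptions, not the benchmark's actual harness: the function names `build_haystack` and `needle_recall`, the filler text, the needle sentences, and the substring-based recall metric are all hypothetical, and the leaderboard score above is not necessarily computed this way.

```python
import random


def build_haystack(filler_sentences, needles, seed=0):
    """Insert each needle sentence at a random position among the filler sentences.

    Hypothetical helper: the real benchmark's document construction is not
    described in the source.
    """
    rng = random.Random(seed)
    doc = list(filler_sentences)
    for needle in needles:
        # randint is inclusive on both ends, so a needle may land at the very end.
        doc.insert(rng.randint(0, len(doc)), needle)
    return " ".join(doc)


def needle_recall(answer, expected_values):
    """Return the fraction of expected values found in the answer (case-insensitive).

    Assumed metric: simple substring recall, chosen only for illustration.
    """
    if not expected_values:
        return 0.0
    answer_lower = answer.lower()
    found = sum(1 for value in expected_values if value.lower() in answer_lower)
    return found / len(expected_values)


# Made-up example data: 200 filler sentences and three needles.
filler = [f"This is filler sentence {i}." for i in range(200)]
needles = [
    "The passphrase for the north gate is amber-falcon.",
    "The passphrase for the south gate is cobalt-heron.",
    "The passphrase for the east gate is violet-otter.",
]
expected = ["amber-falcon", "cobalt-heron", "violet-otter"]

haystack = build_haystack(filler, needles)

# A stand-in for a model response that recovers two of the three needles.
model_answer = "North gate: amber-falcon. South gate: cobalt-heron. East gate: unknown."
score = needle_recall(model_answer, expected)
```

In this sketch a model that recovers two of three needles scores about 0.667, so a score near 1.0 would mean nearly every inserted fact was retrieved. A real harness would also vary document length and needle positions.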