NQ

Natural Questions (NQ) is a benchmark containing real user questions issued to Google Search, with answers found in Wikipedia, designed for training and evaluating automatic question answering systems.

NQ Leaderboard

1 model
Granite 3.3 8B Base (IBM): 0.365 (self-reported)

FAQ

Common questions about NQ

The NQ paper is available at https://aclanthology.org/Q19-1026/. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The NQ leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Granite 3.3 8B Base by IBM leads with a score of 0.365; with only one model evaluated, the average score across all models is also 0.365.
The highest NQ score is 0.365, achieved by Granite 3.3 8B Base from IBM.
1 model has been evaluated on the NQ benchmark, with 0 verified results and 1 self-reported result.
NQ is categorized under the general and reasoning categories. The benchmark evaluates text models.