BoolQ
BoolQ is a reading comprehension dataset of 15,942 naturally occurring yes/no questions. Each example consists of a question, a passage, and a boolean answer, and the questions were generated in unprompted and unconstrained settings. They often query for complex, non-factoid information, so answering them typically requires entailment-like inference over the passage.
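The question/passage/answer structure can be sketched with a couple of invented records and a trivial majority-class baseline (BoolQ answers skew toward "yes"). The records below are illustrative stand-ins, not actual dataset rows; the field names mirror the structure described above.

```python
# Minimal sketch of the BoolQ example format and a toy evaluation loop.
# The two records are invented for illustration only.
examples = [
    {
        "question": "is the sky blue on a clear day",
        "passage": "On a clear day, sunlight scattered by the atmosphere "
                   "makes the sky appear blue to an observer on the ground.",
        "answer": True,
    },
    {
        "question": "do penguins live at the north pole",
        "passage": "Penguins are found almost exclusively in the Southern "
                   "Hemisphere, with none living in the wild at the North Pole.",
        "answer": False,
    },
]

def always_yes(question: str, passage: str) -> bool:
    """Majority-class baseline: always predict 'yes'."""
    return True

# Accuracy is the standard metric reported for BoolQ.
correct = sum(always_yes(ex["question"], ex["passage"]) == ex["answer"]
              for ex in examples)
accuracy = correct / len(examples)
print(f"accuracy: {accuracy:.2f}")  # → accuracy: 0.50
```

Swapping `always_yes` for a real model's predict function turns this loop into the evaluation that leaderboard scores like those below are based on.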
Progress Over Time
[Interactive timeline of model performance on BoolQ, showing the state-of-the-art frontier and distinguishing open from proprietary models]
BoolQ Leaderboard
10 models
| Rank | Organization | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Nous Research | 70B | — | — | — |
| 2 | Google | 27B | — | — | — |
| 3 | Microsoft | 60B | — | — | — |
| 4 | Google | 9B | — | — | — |
| 5 | — | 2B | — | — | — |
| 5 | Google | 8B | — | — | — |
| 7 | Microsoft | 4B | — | — | — |
| 8 | Microsoft | 4B | 128K | $0.10 / $0.10 | — |
| 9 | — | 2B | — | — | — |
| 9 | Google | 8B | — | — | — |
FAQ
Common questions about BoolQ
What is BoolQ?
BoolQ is a reading comprehension dataset of 15,942 naturally occurring yes/no questions. Each example consists of a question, a passage, and a boolean answer, and the questions were generated in unprompted and unconstrained settings. They often query for complex, non-factoid information, so answering them typically requires entailment-like inference over the passage.

Where can I find the BoolQ paper?
The BoolQ paper is available at https://arxiv.org/abs/1905.10044. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the BoolQ leaderboard?
The BoolQ leaderboard ranks 10 AI models by their performance on this benchmark. Currently, Hermes 3 70B by Nous Research leads with a score of 0.880. The average score across all models is 0.817.

What is the highest BoolQ score?
The highest BoolQ score is 0.880, achieved by Hermes 3 70B from Nous Research.

How many models have been evaluated on BoolQ?
10 models have been evaluated on the BoolQ benchmark, with 0 verified results and 10 self-reported results.

What kind of benchmark is BoolQ?
BoolQ is categorized under language and reasoning. The benchmark evaluates text models.