OpenBookQA
OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding. It contains 5,957 multiple-choice elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations. Answering them requires combining open-book facts with broad common knowledge through multi-hop reasoning.
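Each OpenBookQA item pairs a question stem with four labeled choices and a gold answer key. A minimal sketch of that item format and of accuracy scoring, assuming the field names used in the Hugging Face release of the dataset (`question_stem`, `choices`, `answerKey`); the sample item is the example question from the OpenBookQA paper:

```python
# Illustrative OpenBookQA item; field names assume the Hugging Face
# release of the dataset (question_stem, choices, answerKey).
sample = {
    "question_stem": "Which of these would let the most heat travel through?",
    "choices": {
        "text": [
            "a new pair of jeans",
            "a steel spoon in a cafeteria",
            "cotton candy at a store",
            "a calvin klein cotton hat",
        ],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

def accuracy(items, predictions):
    """Fraction of items whose predicted label matches the gold answerKey."""
    correct = sum(pred == item["answerKey"] for item, pred in zip(items, predictions))
    return correct / len(items)
```

Leaderboard scores on this page are exactly this kind of accuracy: the fraction of the multiple-choice questions a model answers correctly.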
Progress Over Time
[Interactive timeline of model performance on OpenBookQA; the chart marks the state-of-the-art frontier and distinguishes open from proprietary models.]
OpenBookQA Leaderboard
5 models
| Rank | Organization | Params | Context | Cost (in / out) | License | Score |
|---|---|---|---|---|---|---|
| 1 | Microsoft | 60B | — | — | — | 0.896 |
| 2 | Microsoft | 4B | 128K | $0.10 / $0.10 | — | — |
| 2 | Microsoft | 4B | — | — | — | — |
| 4 | Mistral AI | 12B | 128K | $0.15 / $0.15 | — | — |
| 5 | Nous Research | 70B | — | — | — | — |
FAQ
Common questions about OpenBookQA
What is OpenBookQA?

OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding. It contains 5,957 multiple-choice elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations. Answering them requires combining open-book facts with broad common knowledge through multi-hop reasoning.

Where can I find the OpenBookQA paper?

The OpenBookQA paper is available at https://arxiv.org/abs/1809.02789. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How does the OpenBookQA leaderboard work?

The OpenBookQA leaderboard ranks 5 AI models by their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.896. The average score across all models is 0.716.

What is the highest OpenBookQA score?

The highest OpenBookQA score is 0.896, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models have been evaluated on OpenBookQA?

5 models have been evaluated on the OpenBookQA benchmark, with 0 verified results and 5 self-reported results.

What categories does OpenBookQA cover?

OpenBookQA is categorized under general and reasoning. The benchmark evaluates text models.