
OpenBookQA

OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding of a subject. It contains 5,957 multiple-choice elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations, requiring models to combine open-book facts with broad common knowledge through multi-hop reasoning.
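As a minimal sketch, each OpenBookQA item is a 4-way multiple-choice question with a gold answer key (A-D), and models are typically scored by plain accuracy. The item below uses an example question quoted in the OpenBookQA paper; the dict field names here are illustrative, not a fixed schema.

```python
def accuracy(predictions, answer_keys):
    """Fraction of predicted answer keys that match the gold keys."""
    assert len(predictions) == len(answer_keys)
    correct = sum(p == a for p, a in zip(predictions, answer_keys))
    return correct / len(answer_keys)

# Example item in the style of OpenBookQA (question taken from the paper;
# field names are illustrative).
item = {
    "question_stem": "Which of these would let the most heat travel through?",
    "choices": {
        "A": "a new pair of jeans",
        "B": "a steel spoon in a cafeteria",
        "C": "a cotton candy at a store",
        "D": "a calvin klein cotton hat",
    },
    "answer_key": "B",
}

# A model that picks "B" for this item scores 1.0 on a one-item set.
print(accuracy(["B"], [item["answer_key"]]))  # → 1.0
```

Leaderboard scores like 0.896 are this accuracy computed over the full test split.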

Paper: https://arxiv.org/abs/1809.02789

Progress Over Time

[Interactive timeline showing model performance evolution on OpenBookQA; legend: state-of-the-art frontier, open vs. proprietary models]

OpenBookQA Leaderboard

[Leaderboard table: 5 models, with parameter count, context window, input/output cost, and license per entry. Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.896.]

FAQ

Common questions about OpenBookQA

What is OpenBookQA?
OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding of a subject. It contains 5,957 multiple-choice elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations, requiring models to combine open-book facts with broad common knowledge through multi-hop reasoning.

Where can I find the OpenBookQA paper?
The OpenBookQA paper is available at https://arxiv.org/abs/1809.02789. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

How are models ranked on the OpenBookQA leaderboard?
The leaderboard ranks 5 AI models by their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.896. The average score across all models is 0.716.

What is the highest OpenBookQA score?
The highest OpenBookQA score is 0.896, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models have been evaluated on OpenBookQA?
5 models have been evaluated on the OpenBookQA benchmark, with 0 verified results and 5 self-reported results.

What categories does OpenBookQA fall under?
OpenBookQA is categorized under general and reasoning. The benchmark evaluates text models.