OpenBookQA Leaderboard

Progress Over Time

[Interactive timeline: model performance on OpenBookQA over time, with the state-of-the-art frontier marked. Series: open models, proprietary models.]

OpenBookQA Leaderboard

5 models

Rank	Model	Organization	Parameters	Context	Cost
1	Phi-3.5-MoE-instruct	Microsoft	60B	—	—
2	Phi-3.5-mini-instruct	Microsoft	4B	128K	$0.10 / $0.10
2	Phi 4 Mini	Microsoft	4B	—	—
4	Mistral NeMo Instruct	Mistral AI	12B	128K	$0.15 / $0.15
5	Hermes 3 70B	Nous Research	70B	—	—
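
The rank column above runs 1, 2, 2, 4, 5, which is consistent with standard competition ranking: tied entries share a rank and the following rank is skipped. The sketch below illustrates that scheme under the assumption that the two rank-2 models have equal scores. The function name `competition_rank` and the scores in `example_scores` are hypothetical placeholders, not the leaderboard's actual implementation or results.

```python
from typing import Dict, List, Tuple


def competition_rank(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Rank names by descending score; ties share a rank and the next rank is skipped."""
    # Sort by score (highest first), then by name so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position  # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


# Hypothetical placeholder scores, NOT the leaderboard's actual results.
example_scores = {
    "model-a": 0.90,
    "model-b": 0.80,
    "model-c": 0.80,
    "model-d": 0.70,
    "model-e": 0.60,
}

for rank, name, score in competition_rank(example_scores):
    print(rank, name, score)
```

With these placeholder scores, the two tied entries both receive rank 2 and the next entry receives rank 4, matching the pattern in the table.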

FAQ

Common questions about OpenBookQA

What is the OpenBookQA benchmark?

OpenBookQA is a question-answering dataset modeled on the open-book exams used to assess human understanding. It contains 5,957 multiple-choice, elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations. Answering them requires combining the open-book facts with broad common knowledge through multi-hop reasoning.

Where can I find the OpenBookQA paper?

The OpenBookQA paper is available at https://arxiv.org/abs/1809.02789. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

What is the OpenBookQA leaderboard?

The OpenBookQA leaderboard ranks 5 AI models by their performance on this benchmark. Phi-3.5-MoE-instruct by Microsoft currently leads with a score of 0.896. The average score across all models is 0.716.

What is the highest OpenBookQA score?

The highest OpenBookQA score is 0.896, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models are evaluated on OpenBookQA?

Five models have been evaluated on the OpenBookQA benchmark. All 5 results are self-reported; none are verified.

What categories does OpenBookQA cover?

OpenBookQA is categorized under general and reasoning. The benchmark evaluates text models.
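
Because OpenBookQA is a multiple-choice benchmark, a score such as 0.896 is most naturally read as accuracy: the fraction of questions for which the model picks the correct option. The sketch below shows that computation under stated assumptions. The record layout (`question.stem`, `question.choices`, `answerKey`) is assumed to mirror the publicly released dataset files, the four sample questions are invented for illustration and are not taken from the dataset, and `always_first_choice` is a hypothetical stand-in for a real model.

```python
from typing import Callable, Dict, List

# One question record; the field names are an assumed layout, labeled hypothetical here.
Record = Dict[str, object]


def make_record(stem: str, options: List[str], answer_key: str) -> Record:
    """Build a record with choices labeled A, B, C, D in order."""
    choices = [{"label": label, "text": text} for label, text in zip("ABCD", options)]
    return {"question": {"stem": stem, "choices": choices}, "answerKey": answer_key}


def accuracy(records: List[Record], predict: Callable[[Record], str]) -> float:
    """Return the fraction of records where the predicted label equals the answer key."""
    if not records:
        raise ValueError("accuracy is undefined for an empty record list")
    correct = sum(1 for record in records if predict(record) == record["answerKey"])
    return correct / len(records)


def always_first_choice(record: Record) -> str:
    """Trivial baseline standing in for a model: always answer with the first label."""
    return record["question"]["choices"][0]["label"]


# Invented sample questions for illustration only, NOT taken from OpenBookQA.
sample_records = [
    make_record("Which object is attracted to a magnet?",
                ["a wooden spoon", "an iron nail", "a glass cup", "a plastic straw"], "B"),
    make_record("Which of these is a source of light?",
                ["the Sun", "a rock", "a shadow", "a mirror"], "A"),
    make_record("What do plants need to make their own food?",
                ["darkness", "sand", "plastic", "sunlight"], "D"),
    make_record("Which state of matter is ice?",
                ["solid", "liquid", "gas", "plasma"], "A"),
]

print(accuracy(sample_records, always_first_choice))
```

On a four-option benchmark, a random or constant guesser lands near 0.25 on a large, balanced question set, which is a useful reference point when reading leaderboard scores.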