AI2 Reasoning Challenge (ARC)
A dataset of 7,787 genuine grade-school level, multiple-choice science questions assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The questions cover multiple scientific domains, including biology, physics, earth science, and chemistry, and require scientific reasoning, causal understanding, and conceptual knowledge beyond simple fact retrieval. A supporting corpus of over 14 million science sentences is included.
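The dataset is also mirrored on the Hugging Face hub; below is a minimal loading sketch, assuming the `allenai/ai2_arc` hub copy with its `ARC-Challenge` and `ARC-Easy` configurations and the standard `question` / `choices` / `answerKey` record schema:

```python
# Minimal sketch of loading ARC, assuming the Hugging Face hub copy
# "allenai/ai2_arc" with its "ARC-Challenge" and "ARC-Easy" configs.
from datasets import load_dataset

challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
easy = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

# Each record is one multiple-choice question: the question stem,
# the candidate answers with their labels, and the gold answer label.
example = challenge[0]
print(example["question"])           # question stem
print(example["choices"]["label"])   # candidate labels, e.g. ["A", "B", "C", "D"]
print(example["choices"]["text"])    # candidate answer texts
print(example["answerKey"])          # gold label, e.g. "B"
```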
Progress Over Time

[Interactive timeline: model performance evolution on AI2 Reasoning Challenge (ARC) over time, with a state-of-the-art frontier; entries are marked Open or Proprietary.]
AI2 Reasoning Challenge (ARC) Leaderboard
1 model
| # | Model | Organization | Score | Context | Cost (input / output, per 1M tokens) | License |
|---|---|---|---|---|---|---|
| 1 | GPT-4 | OpenAI | 0.963 | 33K | $30.00 / $60.00 | — |
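Leaderboard scores on ARC are typically plain accuracy over the multiple-choice questions. A minimal scoring sketch, assuming records shaped like the Hugging Face copy above and a hypothetical `predict(question, choices)` function standing in for the model under evaluation:

```python
# Minimal accuracy scoring sketch. `predict` is a hypothetical stand-in
# for whatever model is being evaluated; it must return one candidate label.
def arc_accuracy(dataset, predict):
    correct = 0
    for ex in dataset:
        pred = predict(ex["question"], ex["choices"])
        # answerKey is the gold label; note that ARC mixes letter ("A"-"E")
        # and numeric ("1"-"4") label schemes across questions.
        if pred == ex["answerKey"]:
            correct += 1
    return correct / len(dataset)
```

A trivial baseline that always returns the first candidate label would score near chance (roughly 0.25 on four-way questions), which gives a sanity check for the harness.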
FAQ
Common questions about AI2 Reasoning Challenge (ARC)
The AI2 Reasoning Challenge (ARC) paper is available at https://arxiv.org/abs/1803.05457. It details the benchmark's methodology, dataset construction, and evaluation criteria.
The AI2 Reasoning Challenge (ARC) dataset and baseline solvers are available at https://github.com/allenai/ARC-Solvers.
The AI2 Reasoning Challenge (ARC) leaderboard currently ranks a single AI model on this benchmark: GPT-4 by OpenAI, which leads with a score of 0.963. With only one entry, the average score across all models is likewise 0.963.
The highest AI2 Reasoning Challenge (ARC) score is 0.963, achieved by GPT-4 from OpenAI.
One model has been evaluated on the AI2 Reasoning Challenge (ARC) benchmark, with 0 verified results and 1 self-reported result.
AI2 Reasoning Challenge (ARC) is categorized under general and reasoning. The benchmark evaluates text models.