AI2D

AI2D is a dataset of 4,903 illustrative diagrams from grade school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning capabilities, requiring models to interpret diagrammatic elements, relationships, and structure to answer questions about scientific concepts represented in visual form.
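Since AI2D is scored as multiple-choice accuracy, the evaluation loop can be sketched as below. This is a minimal illustration, not the official harness; the field names (`image`, `question`, `options`, `answer`) are assumptions about the example schema, and `predict` stands in for any vision-language model.

```python
# Minimal sketch of AI2D-style multiple-choice scoring.
# Field names are hypothetical; the actual dataset schema may differ.
def score_predictions(examples, predict):
    """Accuracy: fraction of questions where the model's chosen
    answer index matches the gold answer index."""
    correct = 0
    for ex in examples:
        # Each example pairs a diagram image with a question and answer options.
        choice = predict(ex["image"], ex["question"], ex["options"])
        if choice == ex["answer"]:
            correct += 1
    return correct / len(examples)

# Toy run with a trivial predictor that always picks option 0.
examples = [
    {"image": None, "question": "Which organism is a producer?",
     "options": ["grass", "fox", "hawk", "rabbit"], "answer": 0},
    {"image": None, "question": "What stage follows the larva?",
     "options": ["egg", "adult", "pupa", "nymph"], "answer": 2},
]
acc = score_predictions(examples, lambda img, q, opts: 0)
print(acc)  # 0.5
```

A real evaluation would replace the lambda with a model call that renders the diagram and options into a prompt and parses the chosen index from the response.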


Progress Over Time

[Interactive timeline showing model performance evolution on AI2D; legend distinguishes the state-of-the-art frontier and open vs. proprietary models.]

AI2D Leaderboard

7 models • 0 verified
[Leaderboard table with Context, Cost, and License columns listing 7 ranked models; recoverable entries include two DeepSeek models and parameter counts of 27B, 12B, 27B, 16B, 4B, and 3B.]

FAQ

Common questions about AI2D

What is AI2D?
AI2D is a dataset of 4,903 illustrative diagrams from grade school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning capabilities, requiring models to interpret diagrammatic elements, relationships, and structure to answer questions about scientific concepts represented in visual form.

Where can I find the AI2D paper?
The AI2D paper is available at https://arxiv.org/abs/1603.07396. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Where can I download the AI2D dataset?
The AI2D dataset is available at https://allenai.org/data/diagrams.

Which model performs best on AI2D?
The AI2D leaderboard ranks 7 AI models based on their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.947. The average score across all models is 0.816.

What is the highest AI2D score?
The highest AI2D score is 0.947, achieved by Claude 3.5 Sonnet from Anthropic.

How many models have been evaluated on AI2D?
7 models have been evaluated on the AI2D benchmark, with 0 verified results and 7 self-reported results.

What categories does AI2D fall under?
AI2D is categorized under multimodal, reasoning, and vision, and evaluates multimodal models.