AI2D

AI2D is a dataset of 4,903 illustrative diagrams from grade-school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple-choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning: models must interpret diagrammatic elements, relationships, and structure to answer questions about scientific concepts represented in visual form.
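A benchmark score on a multiple-choice dataset like this is commonly reported as plain accuracy: the fraction of questions for which the model picks the correct option. The sketch below illustrates that calculation. It assumes plain accuracy is the metric, which this page does not state explicitly; the `Item` schema, the `accuracy` helper, and the toy questions are hypothetical and are not taken from the AI2D dataset or any official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question about a diagram (hypothetical schema)."""
    question: str
    options: list[str]
    answer_index: int  # index of the correct option in `options`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Return the fraction of items whose predicted index matches the answer."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        return 0.0
    correct = sum(p == it.answer_index for it, p in zip(items, predictions))
    return correct / len(items)


# Toy data for illustration only; not real AI2D questions.
items = [
    Item("Which organism is the producer?", ["hawk", "grass", "snake", "frog"], 1),
    Item("What stage follows the larva?", ["egg", "adult", "pupa", "nymph"], 2),
    Item("Which organ pumps blood?", ["lung", "liver", "kidney", "heart"], 3),
    Item("What does the arrow point to?", ["root", "stem", "leaf", "petal"], 0),
]
predictions = [1, 2, 3, 2]  # the last prediction is wrong

score = accuracy(items, predictions)
print(f"{score:.3f}")  # 3 of 4 correct -> 0.750
```

Under this assumption, a leaderboard score of 0.947 would mean the model answered 94.7% of the evaluated questions correctly.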

Claude 3.5 Sonnet from Anthropic currently leads the AI2D leaderboard with a score of 0.947 across 32 evaluated AI models.

Anthropic's Claude 3.5 Sonnet leads with 94.7%, followed by Alibaba Cloud / Qwen Team's Qwen3.6 Plus at 94.4% and OpenAI's GPT-4o at 94.2%.
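This page quotes the same scores both as fractions (0.947) and as percentages (94.7%). The short sketch below shows the conversion and the gap between first and third place, using only the three scores quoted above. The `as_percent` helper and the `top_three` dictionary are illustrative names, not part of any leaderboard API.

```python
# Top-three scores as quoted on this page, expressed as fractions of 1.
top_three = {
    "Claude 3.5 Sonnet": 0.947,
    "Qwen3.6 Plus": 0.944,
    "GPT-4o": 0.942,
}


def as_percent(score: float) -> str:
    """Format a fractional score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Sort from highest to lowest score, as a leaderboard would.
ranked = sorted(top_three.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(rank, model, as_percent(score))

# Gap between first and third place, in percentage points.
gap_points = round((ranked[0][1] - ranked[-1][1]) * 100, 1)
print(gap_points)  # 0.5
```

The top three models sit within half a percentage point of one another.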

Progress Over Time

[Interactive timeline of model performance on AI2D over time. Legend: state-of-the-art frontier, open models, proprietary models.]

AI2D Leaderboard

32 models
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | | | | |
| 2 | Alibaba Cloud / Qwen Team | | 1.0M | $0.50 / $3.00 | |
| 3 | OpenAI | | 128K | $2.50 / $10.00 | |
| 4 | Mistral AI | 124B | | | |
| 5 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | |
| 6 | | 24B | | | |
| 7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | |
| 8 | Alibaba Cloud / Qwen Team | 35B | | | |
| 9 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | |
| 10 | | 90B | | | |
| 11 | | 11B | | | |
| 12 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | |
| 13 | Alibaba Cloud / Qwen Team | 33B | | | |
| 14 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | |
| 15 | Alibaba Cloud / Qwen Team | 33B | | | |
| 16 | Alibaba Cloud / Qwen Team | 72B | | | |
| 17 | | | | | |
| 18 | Alibaba Cloud / Qwen Team | 31B | | | |
| 19 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | |
| 20 | Alibaba Cloud / Qwen Team | 31B | | | |
| 21 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 21 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 23 | | 27B | | | |
| 24 | | 12B | | | |
| 25 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 26 | Alibaba Cloud / Qwen Team | 7B | | | |
| 27 | | 6B | | | |
| 28 | DeepSeek | 27B | | | |
| 29 | | 16B | | | |
| 30 | | 4B | | | |
| 31 | | 4B | | | |
| 32 | | 3B | | | |

FAQ

Common questions about AI2D.

What is the AI2D benchmark?

AI2D is a dataset of 4,903 illustrative diagrams from grade-school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple-choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning: models must interpret diagrammatic elements, relationships, and structure to answer questions about scientific concepts represented in visual form.

What is the AI2D leaderboard?

The AI2D leaderboard ranks 32 AI models based on their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.947. The average score across all models is 0.872.

What is the highest AI2D score?

The highest AI2D score is 0.947, achieved by Claude 3.5 Sonnet from Anthropic.

How many models are evaluated on AI2D?

32 models have been evaluated on the AI2D benchmark, with 0 verified results and 32 self-reported results.

Where can I find the AI2D paper?

The AI2D paper is available at https://arxiv.org/abs/1603.07396. The paper details the methodology, dataset construction, and evaluation criteria.

Where can I find the AI2D dataset?

The AI2D dataset is available at https://allenai.org/data/diagrams.

What categories does AI2D cover?

AI2D is categorized under multimodal, reasoning, and vision. The benchmark evaluates multimodal models.

More evaluations to explore

Related benchmarks in the same category

GPQA

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.

reasoning
214 models
MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

reasoning
119 models
AIME 2025

All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.

reasoning
108 models
MMLU

The Massive Multitask Language Understanding benchmark tests knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.

reasoning
99 models
SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

reasoning
89 models
Humanity's Last Exam

Humanity's Last Exam (HLE) is a multimodal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.

reasoning, multimodal
74 models