AI2D

AI2D is a dataset of 4,903 illustrative diagrams from grade-school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple-choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning: models must interpret diagrammatic elements, their relationships, and overall structure to answer questions about the scientific concepts depicted.
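As a rough illustration of how a multiple-choice benchmark of this kind is typically scored, the sketch below computes accuracy as the fraction of questions where the predicted option matches the answer key. The `Question` record, its field names, and the sample items are hypothetical; they are not taken from the AI2D release or from any particular evaluation harness.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one multiple-choice item about a diagram."""
    diagram_id: str
    text: str
    options: Tuple[str, ...]
    answer_index: int  # index of the correct option within `options`


def accuracy(questions: Sequence[Question], predictions: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted option index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("cannot score an empty question set")
    correct = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)


# Made-up items for illustration only; real items refer to a diagram image.
sample = [
    Question("d1", "Which organism is a producer?", ("hawk", "grass", "mouse", "snake"), 1),
    Question("d1", "What does the mouse eat?", ("hawk", "snake", "grass", "owl"), 2),
    Question("d2", "Which stage comes first?", ("egg", "larva", "pupa", "adult"), 0),
    Question("d2", "Which stage comes last?", ("egg", "larva", "pupa", "adult"), 3),
]
sample_predictions = [1, 2, 0, 1]  # the last prediction is wrong on purpose

sample_accuracy = accuracy(sample, sample_predictions)
print(f"{sample_accuracy:.3f}")
```

A leaderboard score such as 0.947 would correspond to this fraction computed over the benchmark's test questions, assuming plain accuracy is the reported metric.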

Of the 32 models evaluated, Claude 3.5 Sonnet from Anthropic currently leads the AI2D leaderboard with a score of 0.947 (94.7%).

Claude 3.5 Sonnet leads with 94.7%, followed by Qwen3.6 Plus at 94.4% and GPT-4o at 94.2%.

Progress Over Time

[Figure: interactive timeline of model performance on AI2D, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

AI2D Leaderboard

32 models

Rank	Model	Organization	Parameters	Context	Cost
1	Claude 3.5 Sonnet	Anthropic	—	—	—
2	Qwen3.6 Plus	Alibaba Cloud / Qwen Team	—	1.0M	$0.50 / $3.00
3	GPT-4o	OpenAI	—	128K	$2.50 / $10.00
4	Pixtral Large	Mistral AI	124B	—	—
5	Qwen3.5-122B-A10B	Alibaba Cloud / Qwen Team	122B	262K	$0.40 / $3.20
6	Mistral Small 3.2 24B Instruct	Mistral AI	24B	—	—
7	Qwen3.5-27B	Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
8	Qwen3.6-35B-A3B	Alibaba Cloud / Qwen Team	35B	—	—
9	Qwen3.5-35B-A3B	Alibaba Cloud / Qwen Team	35B	262K	$0.25 / $2.00
10	Llama 3.2 90B Instruct	Meta	90B	—	—
11	Llama 3.2 11B Instruct	Meta	11B	—	—
12	Qwen3 VL 235B A22B Instruct	Alibaba Cloud / Qwen Team	236B	262K	$0.30 / $1.50
13	Qwen3 VL 32B Instruct	Alibaba Cloud / Qwen Team	33B	—	—
14	Qwen3 VL 235B A22B Thinking	Alibaba Cloud / Qwen Team	236B	262K	$0.45 / $3.49
15	Qwen3 VL 32B Thinking	Alibaba Cloud / Qwen Team	33B	—	—
16	Qwen2.5 VL 72B Instruct	Alibaba Cloud / Qwen Team	72B	—	—
17	Grok-1.5V	xAI	—	—	—
18	Qwen3 VL 30B A3B Thinking	Alibaba Cloud / Qwen Team	31B	—	—
19	Qwen3 VL 8B Instruct	Alibaba Cloud / Qwen Team	9B	262K	$0.08 / $0.50
20	Qwen3 VL 30B A3B Instruct	Alibaba Cloud / Qwen Team	31B	—	—
21	Qwen3 VL 8B Thinking	Alibaba Cloud / Qwen Team	9B	262K	$0.18 / $2.09
21	Qwen3 VL 4B Thinking	Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
23	Gemma 3 27B	Google	27B	—	—
24	Gemma 3 12B	Google	12B	—	—
25	Qwen3 VL 4B Instruct	Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $0.60
26	Qwen2.5-Omni-7B	Alibaba Cloud / Qwen Team	7B	—	—
27	Phi-4-multimodal-instruct	Microsoft	6B	—	—
28	DeepSeek VL2	DeepSeek	27B	—	—
29	DeepSeek VL2 Small	DeepSeek	16B	—	—
30	Phi-3.5-vision-instruct	Microsoft	4B	—	—
31	Gemma 3 4B	Google	4B	—	—
32	DeepSeek VL2 Tiny	DeepSeek	3B	—	—
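The table lists rank 21 twice and then skips to 23, which is consistent with standard competition ranking: tied entries share a rank and the following rank is skipped. A minimal sketch of that scheme follows. The model names and scores in it are placeholders, and whether the site actually computes ranks this way is an assumption.

```python
def competition_ranks(scores):
    """Assign 1-based ranks, highest score first.

    Entries with equal scores share a rank, and the rank after a tie is
    skipped (1, 2, 2, 4), which is often called standard competition ranking.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    previous_score = None
    previous_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie; otherwise the rank is the position.
        rank = previous_rank if score == previous_score else position
        ranks[name] = rank
        previous_score, previous_rank = score, rank
    return ranks


# Placeholder names and scores, not values from the leaderboard.
placeholder_scores = {
    "model-a": 0.90,
    "model-b": 0.85,
    "model-c": 0.85,
    "model-d": 0.80,
}
placeholder_ranks = competition_ranks(placeholder_scores)
print(placeholder_ranks)
```

Comparing floats with `==` is adequate here because tied scores would come from the same reported value; scores computed independently might need rounding to a fixed number of decimals first.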

FAQ

Common questions about AI2D.

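What is the AI2D benchmark?

AI2D is a dataset of 4,903 grade-school science diagrams paired with over 15,000 multiple-choice questions. It evaluates diagram understanding and visual reasoning.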

What is the AI2D leaderboard?

The AI2D leaderboard ranks 32 AI models based on their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.947. The average score across all models is 0.872.
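The average quoted above is presumably an arithmetic mean over per-model scores. The sketch below shows the calculation using only the three scores stated on this page (0.947, 0.944, and 0.942), so it will not reproduce the 0.872 figure, which covers all 32 models.

```python
from statistics import mean

# The only per-model scores given on this page; the other 29 are not listed.
top_three = {
    "Claude 3.5 Sonnet": 0.947,
    "Qwen3.6 Plus": 0.944,
    "GPT-4o": 0.942,
}

# The leader is the model with the highest score.
leader = max(top_three, key=top_three.get)

# Arithmetic mean of the three listed scores (not the full-leaderboard average).
top_three_mean = mean(top_three.values())

print(leader, f"{top_three[leader]:.1%}")
print(f"{top_three_mean:.3f}")
```

The `.1%` format specifier converts the fractional score to the percentage form used elsewhere on the page.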

What is the highest AI2D score?

The highest AI2D score is 0.947, achieved by Claude 3.5 Sonnet from Anthropic.

How many models are evaluated on AI2D?

32 models have been evaluated on the AI2D benchmark. All 32 results are self-reported; none have been independently verified.

Where can I find the AI2D paper?

The AI2D paper is available at https://arxiv.org/abs/1603.07396. The paper details the methodology, dataset construction, and evaluation criteria.

Where can I find the AI2D dataset?

The AI2D dataset is available at https://allenai.org/data/diagrams.

What categories does AI2D cover?

AI2D is categorized under multimodal, reasoning, and vision. The benchmark evaluates multimodal models.

More evaluations to explore

Related benchmarks in the same category

GPQA

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.

reasoning

214 models

MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

reasoning

119 models

AIME 2025

All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.

reasoning

108 models

MMLU

The Massive Multitask Language Understanding benchmark tests knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.

reasoning

99 models

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

reasoning

89 models

Humanity's Last Exam

Humanity's Last Exam (HLE) is a multimodal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.

reasoning, multimodal

74 models