AI2D

Progress Over Time

[Interactive timeline: model performance evolution on AI2D, showing the state-of-the-art frontier for open and proprietary models]

AI2D Leaderboard

32 models
Rank  Organization               Parameters  Context  Cost (input / output, USD per 1M tokens)
1     -                          -           -        -
2     Alibaba Cloud / Qwen Team  -           1.0M     $0.50 / $3.00
3     OpenAI                     -           128K     $2.50 / $10.00
4     Mistral AI                 124B        -        -
5     Alibaba Cloud / Qwen Team  122B        -        -
6     -                          24B         -        -
7     Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40
8     Alibaba Cloud / Qwen Team  35B         -        -
9     Alibaba Cloud / Qwen Team  35B         -        -
10    -                          90B         -        -
11    -                          11B         -        -
12    Alibaba Cloud / Qwen Team  236B        -        -
13    Alibaba Cloud / Qwen Team  33B         -        -
14    Alibaba Cloud / Qwen Team  236B        -        -
15    Alibaba Cloud / Qwen Team  33B         -        -
16    Alibaba Cloud / Qwen Team  72B         -        -
17    -                          -           -        -
18    Alibaba Cloud / Qwen Team  31B         -        -
19    Alibaba Cloud / Qwen Team  9B          -        -
20    Alibaba Cloud / Qwen Team  31B         -        -
21    Alibaba Cloud / Qwen Team  4B          262K     $0.10 / $1.00
21    Alibaba Cloud / Qwen Team  9B          262K     $0.18 / $2.09
23    -                          27B         -        -
24    -                          12B         -        -
25    Alibaba Cloud / Qwen Team  4B          262K     $0.10 / $0.60
26    Alibaba Cloud / Qwen Team  7B          -        -
27    -                          6B          -        -
28    DeepSeek                   27B         -        -
29    -                          16B         -        -
30    -                          4B          -        -
31    -                          4B          -        -
32    -                          3B          -        -
About this benchmark

What is AI2D?

AI2D is a dataset of 4,903 illustrative diagrams from grade school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning capabilities, requiring models to interpret diagrammatic elements, relationships, and structure to answer questions about scientific concepts represented in visual form.

AI2D is categorized as a multimodal, reasoning, and vision benchmark. LLM Stats tracks 32 models on it, scored on a 0–1 scale. The current average is 0.872, and the leader scores 0.947.
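As a rough illustration of how a 0–1 score relates to the multiple choice format, the sketch below assumes the score is plain accuracy, that is, the fraction of questions answered correctly. The page does not spell out the scoring rule, so treat that as an assumption. The function name `accuracy` and the sample answers are hypothetical and not taken from the dataset.

```python
def accuracy(predictions, answers):
    """Return the fraction of predictions that match the reference answers.

    Assumes (not confirmed by the source) that a benchmark score on the
    0-1 scale is plain multiple choice accuracy.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Booleans sum as 0/1, so this counts the exact matches.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical answer key and model output for four questions.
answers = ["B", "A", "D", "C"]
predictions = ["B", "A", "C", "C"]

score = accuracy(predictions, answers)
print(f"{score:.3f} ({score:.1%})")  # 0.750 (75.0%)
```

Under that assumption, a score of 0.947 on the 0–1 scale and 94.7% are the same result written two ways.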

Current leaders

Claude 3.5 Sonnet from Anthropic currently leads the AI2D leaderboard with a score of 0.947 across 32 evaluated AI models.

1. Claude 3.5 Sonnet (Anthropic): 94.7%
2. Qwen3.6 Plus (Alibaba Cloud / Qwen Team): 94.4%
3. GPT-4o (OpenAI): 94.2%
Top open-weight model: Pixtral Large, rank #4, 93.8%

Source paper

Title
A Diagram Is Worth A Dozen Images
Authors
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, and 2 others
Published
Abstract

Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships for over 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.

FAQ

Common questions about the AI2D benchmark and leaderboard.

What is the AI2D leaderboard?

The AI2D leaderboard ranks 32 AI models based on their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.947. The average score across all models is 0.872.

What is the highest AI2D score?

The highest AI2D score is 0.947, achieved by Claude 3.5 Sonnet from Anthropic.

How many models are evaluated on AI2D?

32 models have been evaluated on the AI2D benchmark, with 0 verified results and 32 self-reported results.

Where can I find the AI2D paper?

The AI2D paper is available at https://arxiv.org/abs/1603.07396. The paper details the methodology, dataset construction, and evaluation criteria.

Where can I find the AI2D dataset?

The AI2D dataset is available at https://allenai.org/data/diagrams.

What categories does AI2D cover?

AI2D is categorized under multimodal, reasoning, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on AI2D?

Pixtral Large by Mistral AI is the top-ranked open-source model on AI2D, with a score of 0.938 (rank #4).

Which model offers the best value on AI2D?

Among models scoring within 10% of the leader, Qwen3.5-27B from Alibaba Cloud / Qwen Team is the cheapest, at $0.30 per million input tokens with a score of 0.929.
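The selection rule described here can be sketched in a few lines. This is a minimal sketch and not the site's actual implementation. It assumes "within 10% of the leader" means a score of at least 90% of the top score, and that models without a listed price are skipped. The `Entry` class and `best_value` function are hypothetical names. The scores and input prices are the ones quoted on this page, with prices matched to models by leaderboard rank.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    model: str
    score: float                  # AI2D score on the 0-1 scale
    input_price: Optional[float]  # USD per million input tokens, None if unlisted


def best_value(entries, tolerance=0.10):
    """Return the cheapest priced entry scoring within `tolerance` of the leader.

    Assumption: "within 10%" is relative, i.e. score >= leader * (1 - tolerance).
    """
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    cutoff = leader_score * (1 - tolerance)
    # Keep only models above the cutoff that have a listed price.
    candidates = [
        e for e in entries
        if e.score >= cutoff and e.input_price is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.input_price)


# Figures quoted on this page; None marks a price that is not listed here.
entries = [
    Entry("Claude 3.5 Sonnet", 0.947, None),
    Entry("Qwen3.6 Plus", 0.944, 0.50),
    Entry("GPT-4o", 0.942, 2.50),
    Entry("Pixtral Large", 0.938, None),
    Entry("Qwen3.5-27B", 0.929, 0.30),
]

winner = best_value(entries)
print(winner.model, winner.input_price)  # Qwen3.5-27B 0.3
```

With a leader score of 0.947 the cutoff is about 0.852, so all five entries qualify on score, and the choice comes down to the lowest listed input price.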

How recent are the AI2D leaderboard results?

The AI2D leaderboard was last updated in July 2026 and currently includes 32 evaluated models.