AI2D
AI2D Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | — | — | |
| 2 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | |
| 3 | OpenAI | — | 128K | $2.50 / $10.00 | |
| 4 | Mistral AI | 124B | — | — | |
| 5 | Alibaba Cloud / Qwen Team | 122B | — | — | |
| 6 | Mistral AI | 24B | — | — | |
| 7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | |
| 8 | Alibaba Cloud / Qwen Team | 35B | — | — | |
| 9 | Alibaba Cloud / Qwen Team | 35B | — | — | |
| 10 | | 90B | — | — | |
| 11 | | 11B | — | — | |
| 12 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 13 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 14 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 15 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 16 | Alibaba Cloud / Qwen Team | 72B | — | — | |
| 17 | xAI | — | — | — | |
| 18 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 19 | Alibaba Cloud / Qwen Team | 9B | — | — | |
| 20 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 21 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 21 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 23 | Google | 27B | — | — | |
| 24 | Google | 12B | — | — | |
| 25 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 26 | Alibaba Cloud / Qwen Team | 7B | — | — | |
| 27 | Microsoft | 6B | — | — | |
| 28 | DeepSeek | 27B | — | — | |
| 29 | DeepSeek | 16B | — | — | |
| 30 | Microsoft | 4B | — | — | |
| 31 | Google | 4B | — | — | |
| 32 | DeepSeek | 3B | — | — | |
What is AI2D?
AI2D is a dataset of 4,903 illustrative diagrams from grade-school natural science (such as food webs, human physiology, and life cycles), paired with over 15,000 multiple-choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning: to answer a question, a model must interpret the diagram's elements, their relationships, and the overall structure that together represent a scientific concept.
AI2D is a multimodal benchmark covering vision and reasoning tasks. LLM Stats tracks 32 models on this benchmark, scored on a 0–1 scale. The current average is about 0.9, with the leader at 0.947.
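As a rough illustration of how a 0–1 score and the leaderboard summary figures could be computed, here is a minimal Python sketch. It assumes the score is plain accuracy over multiple-choice questions, which this page does not state explicitly. The `Item` record layout and the `accuracy` and `summarize` helpers are hypothetical names, and the sample values are made up for the example rather than taken from the leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical record layout)."""
    question_id: str
    answer_index: int     # index of the correct option
    predicted_index: int  # index of the option the model chose


def accuracy(items: list[Item]) -> float:
    """Return the fraction of items answered correctly, between 0 and 1."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(1 for it in items if it.predicted_index == it.answer_index)
    return correct / len(items)


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean score, best score) for a list of per-model scores."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores), max(scores)


# Made-up predictions: three of the four answers match the key.
sample = [
    Item("q1", answer_index=0, predicted_index=0),
    Item("q2", answer_index=2, predicted_index=1),
    Item("q3", answer_index=3, predicted_index=3),
    Item("q4", answer_index=1, predicted_index=1),
]
print(accuracy(sample))  # 0.75
```

Passing one score per tracked model to `summarize` would give an average and a leader value of the kind quoted above, under the same accuracy assumption.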
Current leaders
Claude 3.5 Sonnet from Anthropic currently leads the AI2D leaderboard with a score of 0.947 across 32 evaluated AI models.
Source paper
- Title: A Diagram Is Worth A Dozen Images
- Authors: Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, and 2 others
- Published: arXiv:1603.07396
Abstract
Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships for over 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.
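To make the idea of a Diagram Parse Graph more concrete, the sketch below models a diagram as a small graph whose nodes are constituents and whose edges are relationships, as the abstract describes. It is an illustrative data structure only, not the paper's implementation: the class names, field names, and the `kind` values such as `"blob"` and `"eaten_by"` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Constituent:
    """A node: one element of the diagram (hypothetical fields)."""
    node_id: str
    kind: str        # illustrative values: "blob", "text", "arrow"
    label: str = ""


@dataclass
class Relationship:
    """A directed edge between two constituents (hypothetical fields)."""
    source: str
    target: str
    kind: str


@dataclass
class DiagramParseGraph:
    constituents: dict[str, Constituent] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def add_constituent(self, constituent: Constituent) -> None:
        self.constituents[constituent.node_id] = constituent

    def relate(self, source: str, target: str, kind: str) -> None:
        # Refuse edges that point at constituents the graph does not contain.
        for node_id in (source, target):
            if node_id not in self.constituents:
                raise KeyError(f"unknown constituent: {node_id}")
        self.relationships.append(Relationship(source, target, kind))

    def neighbors(self, node_id: str, kind: Optional[str] = None) -> list[str]:
        """Targets reachable from node_id, optionally filtered by edge kind."""
        return [
            r.target
            for r in self.relationships
            if r.source == node_id and (kind is None or r.kind == kind)
        ]


# A toy food-web diagram: grass is eaten by a rabbit, which is eaten by a fox.
graph = DiagramParseGraph()
for name in ("grass", "rabbit", "fox"):
    graph.add_constituent(Constituent(node_id=name, kind="blob", label=name))
graph.relate("grass", "rabbit", "eaten_by")
graph.relate("rabbit", "fox", "eaten_by")
print(graph.neighbors("grass"))  # ['rabbit']
```

Answering a question such as "what eats the rabbit?" then reduces to a lookup over the edges, which is the sort of structured reasoning a graph representation is meant to support.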