ARC-AGI v2
ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs (1×1 to 30×30) of colored cells (color codes 0-9), requiring models to identify the underlying transformation rule from demonstration examples and apply it to test cases. It is designed to be easy for humans but challenging for AI, and focuses on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
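To make the task format concrete, here is a minimal sketch of how such a grid-transformation task can be represented and solved. The toy grids and the inferred rule are illustrative inventions, not items from the actual dataset; the `train`/`test` dictionary layout is an assumption about the task schema.

```python
# Hypothetical ARC-style task: "train" holds demonstration input-output
# pairs, "test" holds inputs whose outputs the solver must predict.
# Grids are lists of rows; each cell is a color code 0-9.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 1]]}],
}

def apply_rule(grid):
    """Candidate rule inferred from the demonstrations: recolor 1 -> 2."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

# A candidate rule is accepted only if it reproduces every demonstration.
assert all(apply_rule(p["input"]) == p["output"] for p in task["train"])

prediction = apply_rule(task["test"][0]["input"])
print(prediction)  # [[0, 2], [2, 2]]
```

Real tasks require composing far richer rules (symmetry, object movement, counting), but the evaluation loop is the same: fit the rule to the demonstrations, then apply it to the held-out test input.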
Progress Over Time
Interactive timeline showing model performance evolution on ARC-AGI v2 (legend: state-of-the-art frontier; open vs. proprietary models).
ARC-AGI v2 Leaderboard
15 models
| Rank | Organization | Model | Score | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | OpenAI | GPT-5.5 | — | 1.0M | $5.00 / $30.00 | — |
| 2 | Google | — | — | 1.0M | $2.50 / $15.00 | — |
| 3 | OpenAI | — | — | 1.0M | $2.50 / $15.00 | — |
| 4 | Anthropic | — | — | 1.0M | $5.00 / $25.00 | — |
| 5 | Anthropic | — | — | 200K | $3.00 / $15.00 | — |
| 6 | OpenAI | — | — | 400K | $21.00 / $168.00 | — |
| 7 | OpenAI | — | — | 400K | $1.75 / $14.00 | — |
| 8 | Meta | — | — | — | — | — |
| 9 | Anthropic | — | — | 200K | $5.00 / $25.00 | — |
| 10 | Google | — | — | 1.0M | $0.50 / $3.00 | — |
| 11 | Google | — | — | — | — | — |
| 12 | xAI | — | — | — | — | — |
| 13 | Anthropic | — | — | — | — | — |
| 14 | OpenAI | — | — | 200K | $2.00 / $8.00 | — |
| 15 | Google | — | — | 1.0M | $1.25 / $10.00 | — |
FAQ
Common questions about ARC-AGI v2
The ARC-AGI v2 paper is available at https://arxiv.org/abs/2505.11831. It details the benchmark's methodology, dataset creation, and evaluation criteria.
The ARC-AGI v2 leaderboard ranks 15 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.850. The average score across all models is 0.434.
The highest ARC-AGI v2 score is 0.850, achieved by GPT-5.5 from OpenAI.
15 models have been evaluated on the ARC-AGI v2 benchmark, of which 0 results are verified and 12 are self-reported.
ARC-AGI v2 is categorized under reasoning, spatial reasoning, and vision. The benchmark evaluates multimodal models.