ARC-AGI v2

ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid-transformation tasks. It evaluates fluid intelligence via input-output grid pairs (1x1 to 30x30) of colored cells (values 0-9), requiring models to infer the underlying transformation rule from demonstration examples and apply it to test cases. The tasks are designed to be easy for humans but challenging for AI, targeting core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
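In the JSON layout used by the public ARC-AGI repositories, each task bundles "train" demonstration pairs with held-out "test" inputs, and every grid is a list of rows of color codes 0-9. A minimal sketch of checking a candidate transformation rule against the demonstrations (the task and the rule here are invented for illustration, not taken from the benchmark):

```python
# A tiny ARC-style task: "train" pairs demonstrate the rule,
# "test" holds the input the solver must transform.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [{"input": [[3, 0], [3, 3]]}],
}

def flip_horizontal(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def rule_fits(rule, pairs):
    """True if the rule reproduces every demonstration output exactly."""
    return all(rule(p["input"]) == p["output"] for p in pairs)

if rule_fits(flip_horizontal, task["train"]):
    prediction = flip_horizontal(task["test"][0]["input"])
    print(prediction)  # [[0, 3], [3, 3]]
```

Real tasks require searching a far richer space of rules; the point here is only the data shape and the exact-match check against the demonstrations.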

Paper

Progress Over Time

[Interactive timeline of model performance over time on ARC-AGI v2, tracing the state-of-the-art frontier across open and proprietary models.]

ARC-AGI v2 Leaderboard

15 models

Rank  Provider   Context  Cost (input / output)
1     OpenAI     1.0M     $5.00 / $30.00
2                1.0M     $2.50 / $15.00
3     OpenAI     1.0M     $2.50 / $15.00
4                1.0M     $5.00 / $25.00
5                200K     $3.00 / $15.00
6                400K     $21.00 / $168.00
7     OpenAI     400K     $1.75 / $14.00
8
9                200K     $5.00 / $25.00
10               1.0M     $0.50 / $3.00
11
12
13    Anthropic
14    OpenAI     200K     $2.00 / $8.00
15               1.0M     $1.25 / $10.00

FAQ

Common questions about ARC-AGI v2

ARC-AGI-2 is an upgraded benchmark that measures abstract reasoning and problem-solving in AI systems through visual grid-transformation tasks; see the overview at the top of this page for details.
The ARC-AGI v2 paper is available at https://arxiv.org/abs/2505.11831. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The ARC-AGI v2 leaderboard ranks 15 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.850. The average score across all models is 0.434.
The highest ARC-AGI v2 score is 0.850, achieved by GPT-5.5 from OpenAI.
15 models have been evaluated on the ARC-AGI v2 benchmark, with 0 verified results and 12 self-reported results.
ARC-AGI v2 is categorized under reasoning, spatial reasoning, and vision. The benchmark evaluates multimodal models.
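The leaderboard scores above are fractions of tasks solved, and ARC gives no partial credit: a predicted output grid counts only if it matches the target cell-for-cell. A sketch of that scoring logic (the two-attempt limit mirrors the common ARC evaluation setting and should be treated as an assumption here, as should the toy data):

```python
def grids_equal(a, b):
    """Exact cell-by-cell match -- no partial credit on ARC."""
    return a == b

def task_solved(attempts, target, max_attempts=2):
    """A task counts as solved if any allowed attempt matches the target.
    max_attempts=2 is an assumed evaluation setting, not taken from this page."""
    return any(grids_equal(a, target) for a in attempts[:max_attempts])

def benchmark_score(results):
    """results: list of (attempts, target_grid) per task -> fraction solved."""
    solved = sum(task_solved(attempts, target) for attempts, target in results)
    return solved / len(results)

# Toy run: the first task is solved on the second attempt, the second fails.
results = [
    ([[[1]], [[2]]], [[2]]),
    ([[[0, 0]], [[1, 1]]], [[9, 9]]),
]
print(benchmark_score(results))  # 0.5
```

A score like 0.850 would mean 85% of the evaluation tasks were solved under whatever attempt limit the reporting harness used.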