ARC-AGI v2
ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs, with grids ranging from 1x1 to 30x30 and each cell holding a color value from 0 to 9. Models must identify the underlying transformation rule from the demonstration examples and apply it to the test cases. The benchmark is designed to be easy for humans but challenging for AI, and it focuses on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
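To make the task format concrete, here is a minimal Python sketch of one task and a candidate rule. The dictionary layout (`train`/`test` lists of `input`/`output` grids), the helper names, and the toy "mirror each row" rule are illustrative assumptions, not taken from this page or from the benchmark's tooling; real tasks are far harder than this one.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

def is_valid_grid(grid: Grid) -> bool:
    """Check the constraints described above: 1x1 to 30x30, cell values 0-9."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    if not (1 <= width <= 30):
        return False
    return all(
        len(row) == width and all(0 <= cell <= 9 for cell in row)
        for row in grid
    )

def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]

def fits_demonstrations(rule: Callable[[Grid], Grid], task: Dict) -> bool:
    """A rule is plausible only if it reproduces every demonstration output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

# Toy task (assumed layout): two demonstration pairs and one test input.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 4]], "output": [[4, 3]]},
    ],
    "test": [{"input": [[5, 0], [0, 6], [7, 0]]}],
}

# Apply the rule to the test input only if it explains all demonstrations.
test_input = task["test"][0]["input"]
prediction = (
    flip_horizontal(test_input)
    if fits_demonstrations(flip_horizontal, task)
    else None
)
print(prediction)  # [[0, 5], [6, 0], [0, 7]]
```

The sketch separates checking a rule against the demonstrations from applying it to the test input, which mirrors the "identify, then apply" structure described above.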
GPT-5.5 from OpenAI currently leads the 16 models evaluated on the ARC-AGI v2 leaderboard with a score of 0.850.
What ARC-AGI v2 measures
ARC-AGI v2 is a multimodal benchmark that evaluates large language models on spatial reasoning, vision, and reasoning tasks. LLM Stats tracks 16 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.5, and the leader reaches 0.850.
Publication
- Paper: ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
- Authors: Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and 1 other
- Published: arXiv:2505.11831
Abstract
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
GPT-5.5 leads with 85.0%, followed by Gemini 3.1 Pro at 77.1% and GPT-5.4 at 73.3%.
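The percentages here and the 0-to-1 scores quoted elsewhere on this page are the same numbers on different scales (0.850 is 85.0%). As a small sketch, assuming the other two figures convert the same way, the following Python snippet prints each of the top three scores as a percentage along with its gap to the leader. The variable and function names are illustrative only.

```python
# Top three scores from the text above, expressed on the 0-1 scale
# (assumption: the percentages are the 0-1 scores multiplied by 100).
top_three = {"GPT-5.5": 0.850, "Gemini 3.1 Pro": 0.771, "GPT-5.4": 0.733}

def as_percent(score: float) -> str:
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"

# The leader is the model with the highest score.
leader = max(top_three, key=top_three.get)

for name, score in sorted(top_three.items(), key=lambda kv: kv[1], reverse=True):
    gap_points = (top_three[leader] - score) * 100
    print(f"{name}: {as_percent(score)} ({gap_points:.1f} points behind {leader})")
```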
ARC-AGI v2 Leaderboard
| Rank | Organization | | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | OpenAI | — | 1.1M | $5.00 / $30.00 | |
| 2 | Google | — | 1.0M | $2.50 / $15.00 | |
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 | |
| 4 | Google | — | 1.0M | $1.50 / $9.00 | |
| 5 | Anthropic | — | 1.0M | $5.00 / $25.00 | |
| 6 | Anthropic | — | 200K | $3.00 / $15.00 | |
| 7 | OpenAI | — | — | — | |
| 8 | OpenAI | — | 400K | $1.75 / $14.00 | |
| 9 | Meta | — | — | — | |
| 10 | Anthropic | — | — | — | |
| 11 | Google | — | 1.0M | $0.50 / $3.00 | |
| 12 | Google | — | — | — | |
| 13 | xAI | — | — | — | |
| 14 | Anthropic | — | — | — | |
| 15 | OpenAI | — | — | — | |
| 16 | Google | — | 1.0M | $1.25 / $10.00 | |
More evaluations to explore
Related benchmarks in the same category
GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.
AIME 2025: All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000 to 999. It is used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.
MMLU: The Massive Multitask Language Understanding benchmark, testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.
SWE-Bench Verified: A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators, for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.
Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.