ARC-AGI v2

ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs (1x1 to 30x30) whose cells hold color codes (0-9), requiring models to identify the underlying transformation rule from demonstration examples and apply it to test cases. It is designed to be easy for humans but challenging for AI, and focuses on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
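The task format described above can be illustrated with a small Python sketch. The task below is hypothetical and far simpler than real benchmark tasks, the "train"/"test" and "input"/"output" keys are an assumed layout, and `flip_horizontal` is a toy candidate rule chosen only for illustration. The sketch checks the stated grid constraints, tests whether a candidate rule reproduces every demonstration pair, and then applies it to the test input.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # rows of color codes


def is_valid_grid(grid: Grid) -> bool:
    """Check the stated constraints: rectangular, 1x1 to 30x30, cells in 0-9."""
    if not 1 <= len(grid) <= 30:
        return False
    width = len(grid[0])
    if not 1 <= width <= 30:
        return False
    return all(
        len(row) == width and all(isinstance(c, int) and 0 <= c <= 9 for c in row)
        for row in grid
    )


def flip_horizontal(grid: Grid) -> Grid:
    """Toy candidate rule (an assumption for this example): mirror each row."""
    return [list(reversed(row)) for row in grid]


def rule_fits(rule: Callable[[Grid], Grid], pairs: List[Dict[str, Grid]]) -> bool:
    """A rule fits only if it reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in pairs)


# Hypothetical task: two demonstration pairs and one test input.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5, 0], [7, 0, 0]]}],
}

prediction = None
if rule_fits(flip_horizontal, task["train"]):
    # The rule explains all demonstrations, so apply it to the test input.
    prediction = flip_horizontal(task["test"][0]["input"])

print(prediction)  # [[0, 5, 0], [0, 0, 7]]
```

Real tasks require discovering the rule rather than being handed one, which is where the difficulty for AI systems lies.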

GPT-5.5 from OpenAI currently leads the ARC-AGI v2 leaderboard with a score of 0.850 across 16 evaluated AI models.

About this benchmark

What ARC-AGI v2 measures

ARC-AGI v2 is a multimodal benchmark that evaluates large language models on spatial reasoning, vision, and reasoning tasks. LLM Stats tracks 16 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.451, and the leader scores 0.850.

Publication

Paper
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Authors
Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and 1 other
Published

Abstract

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.

OpenAI's GPT-5.5 leads with 85.0%, followed by Google's Gemini 3.1 Pro at 77.1% and OpenAI's GPT-5.4 at 73.3%.

Progress Over Time

[Interactive chart: timeline of model performance on ARC-AGI v2, distinguishing the state-of-the-art frontier, open models, and proprietary models.]

ARC-AGI v2 Leaderboard

16 models
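Rank  Organization  Model           Score  Context  Cost per 1M tokens (input / output)
1     OpenAI        GPT-5.5         85.0%  1.1M     $5.00 / $30.00
2     Google        Gemini 3.1 Pro  77.1%  1.0M     $2.50 / $15.00
3     OpenAI        GPT-5.4         73.3%  1.0M     $2.50 / $15.00
4     -             -               -      1.0M     $1.50 / $9.00
5     -             -               -      1.0M     $5.00 / $25.00
6     -             -               -      200K     $3.00 / $15.00
7     -             -               -      -        -
8     OpenAI        -               -      400K     $1.75 / $14.00
9     -             -               -      -        -
10    -             -               -      -        -
11    -             -               -      1.0M     $0.50 / $3.00
12    -             -               -      -        -
13    -             -               -      -        -
14    Anthropic     -               -      -        -
15    OpenAI        -               -      -        -
16    -             -               -      1.0M     $1.25 / $10.00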

FAQ

Common questions about ARC-AGI v2.

What is the ARC-AGI v2 benchmark?

ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs (1x1 to 30x30) whose cells hold color codes (0-9), requiring models to identify the underlying transformation rule from demonstration examples and apply it to test cases. It is designed to be easy for humans but challenging for AI, and focuses on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.

What is the ARC-AGI v2 leaderboard?

The ARC-AGI v2 leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.850. The average score across all models is 0.451.

What is the highest ARC-AGI v2 score?

The highest ARC-AGI v2 score is 0.850, achieved by GPT-5.5 from OpenAI.

How many models are evaluated on ARC-AGI v2?

16 models have been evaluated on the ARC-AGI v2 benchmark, with 0 verified results and 13 self-reported results.

Where can I find the ARC-AGI v2 paper?

The ARC-AGI v2 paper is available at https://arxiv.org/abs/2505.11831. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does ARC-AGI v2 cover?

ARC-AGI v2 is categorized under spatial reasoning, vision, and reasoning. The benchmark evaluates multimodal models.

Which model offers the best value on ARC-AGI v2?

Among models scoring within 10% of the leader, Gemini 3.1 Pro from Google is the cheapest, at $2.50 per million input tokens with a score of 0.771.
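The selection rule in this answer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's actual method: "within 10% of the leader" is read as a relative threshold (score at least 90% of the top score), only the three models whose scores are named on this page are included, and the GPT-5.5 and GPT-5.4 input prices are taken from the leaderboard rows at ranks 1 and 3 on the assumption that the first figure in each cost pair is the input price. The `Entry` class and `best_value` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    model: str
    score: float        # fraction of the maximum score of 1
    input_price: float  # USD per million input tokens


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Entry:
    """Return the cheapest entry whose score is within `tolerance` of the leader.

    Assumption: the tolerance is relative, so a leader at 0.850 gives a
    threshold of 0.850 * 0.90 = 0.765.
    """
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= threshold]
    return min(eligible, key=lambda e: e.input_price)


# Only the three models with scores stated on this page.
entries = [
    Entry("GPT-5.5", 0.850, 5.00),
    Entry("Gemini 3.1 Pro", 0.771, 2.50),
    Entry("GPT-5.4", 0.733, 2.50),
]

pick = best_value(entries)
print(pick.model, pick.score, pick.input_price)  # Gemini 3.1 Pro 0.771 2.5
```

GPT-5.4 has the same input price as Gemini 3.1 Pro but falls below the 0.765 threshold, so it is excluded before prices are compared.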

How recent are the ARC-AGI v2 leaderboard results?

The ARC-AGI v2 leaderboard was last updated in June 2026 and currently includes 16 evaluated models.

More evaluations to explore

Related benchmarks in the same category

GPQA

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.

reasoning
220 models
MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

reasoning
124 models
AIME 2025

All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.

reasoning
113 models
MMLU

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.

reasoning
99 models
SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

reasoning
97 models
Humanity's Last Exam

Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.

vision, multimodal
78 models