
Overview

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) is the successor to François Chollet’s original ARC benchmark and is widely regarded as one of the most important tests of genuine reasoning ability in AI. Each task presents a small set of input-output grid examples, and the model must infer the underlying transformation rule and apply it to a new input. ARC-AGI-2 is specifically designed to be unsolvable through memorization or scale alone; it requires the kind of fluid intelligence and abstraction that humans use effortlessly but machines find extraordinarily difficult.

Key Details

| Property | Value |
| --- | --- |
| Created by | François Chollet / ARC Prize Foundation |
| Version | 2.0 (evolved from ARC-AGI-1) |
| Task type | Visual abstract reasoning (grid transformations) |
| Dataset size | 1,000+ tasks |
| Format | Input-output grid pairs → predict new output |
| Evaluation | Accuracy (exact grid match) |
| Prize | ARC Prize ($1M+ for significant progress) |
| Leaderboard | arcprize.org |

How It Works

Each ARC task consists of:
  1. Demonstration pairs: 2-5 input grids and their corresponding output grids
  2. Test input: A new grid that follows the same transformation rule
  3. Challenge: The model must produce the exact correct output grid
```
Demo 1:  Input  → Output
Demo 2:  Input  → Output
Demo 3:  Input  → Output
         ─────────────
Test:    Input  → ???    (Model must predict this)
```
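For concreteness, the sketch below shows how a task of this shape can be loaded and inspected. The JSON layout (a `train` list of demonstration pairs and a `test` list of held-out inputs, with each grid a list of rows of integer colors 0–9) follows the format used in the public ARC repositories; the file name is a hypothetical placeholder.

```python
import json

# Load one ARC task. The file name below is a placeholder; public ARC
# repositories ship each task as a standalone JSON file.
with open("task_example.json") as f:
    task = json.load(f)

# "train" holds the demonstration pairs (typically 2-5 of them).
for i, pair in enumerate(task["train"], start=1):
    rows_in, cols_in = len(pair["input"]), len(pair["input"][0])
    rows_out, cols_out = len(pair["output"]), len(pair["output"][0])
    print(f"Demo {i}: {rows_in}x{cols_in} grid -> {rows_out}x{cols_out} grid")

# "test" holds the held-out input(s) whose output must be predicted.
test_input = task["test"][0]["input"]
```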

Task Example (Conceptual)

Imagine grids where colored squares form patterns:
  • Demo 1: A red L-shape → A red L-shape rotated 90°
  • Demo 2: A blue T-shape → A blue T-shape rotated 90°
  • Test: A green Z-shape → ??? (Must produce the Z-shape rotated 90°)
The model must infer the rule (“rotate the shape 90° clockwise”) and apply it perfectly to the test input. Real ARC tasks involve much more complex and varied transformations.
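As a toy illustration of that inference loop, here is a minimal sketch: a candidate rule (rotate 90° clockwise) is accepted only if it reproduces every demonstration pair exactly, and is then applied to the test input. The grids and the `rotate_90_clockwise` helper are illustrative inventions, not part of the benchmark.

```python
def rotate_90_clockwise(grid):
    """Rotate a grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

# Hypothetical demonstration pair: an L-shape (1 = colored, 0 = empty).
demo_input = [[1, 0],
              [1, 0],
              [1, 1]]
demo_output = [[1, 1, 1],
               [1, 0, 0]]

# A candidate rule is only kept if it reproduces every demo exactly.
assert rotate_90_clockwise(demo_input) == demo_output

# Apply the inferred rule to the test input (a Z-like shape).
test_input = [[0, 1],
              [1, 1],
              [1, 0]]
prediction = rotate_90_clockwise(test_input)
print(prediction)
```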

What Makes ARC-AGI-2 Unique

Novelty by design

Every task uses a unique transformation not repeated elsewhere in the dataset. There is no pattern to memorize across tasks — each one is a brand-new puzzle.

No training distribution

ARC tasks are deliberately outside any training distribution. They test on-the-fly learning from just a few examples, not recall of similar problems.

Human baseline is high

Average humans solve roughly 85% of ARC-AGI-1 tasks (about 80% on ARC-AGI-2). The benchmark was calibrated to be easy for humans but hard for machines, the opposite of benchmarks like MATH or MMLU.

Exact-match evaluation

Outputs must be pixel-perfect. There is no partial credit — either the model produces exactly the right grid or it fails.
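In code, the scoring rule is as blunt as it sounds. The sketch below uses a hypothetical `score_task` helper (not an official API) to show that a single wrong cell scores the same as an entirely wrong grid.

```python
def score_task(predicted, expected):
    """Exact-match scoring: 1 if the shape and every cell match, else 0.
    There is no partial credit."""
    return 1 if predicted == expected else 0

expected = [[1, 1, 1],
            [1, 0, 0]]
off_by_one = [[1, 1, 1],
              [1, 0, 1]]  # a single wrong cell

print(score_task(expected, expected))    # 1
print(score_task(off_by_one, expected))  # 0 -- "almost right" scores nothing
```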

ARC-AGI-2 vs. ARC-AGI-1

| Feature | ARC-AGI-1 | ARC-AGI-2 |
| --- | --- | --- |
| Difficulty | Hard | Significantly harder |
| Task complexity | Moderate transformations | Multi-step, compositional rules |
| Best AI score | ~55% (with heavy compute) | ~40% |
| Human score | ~85% | ~80% |
| Focus | Basic abstraction | Compositional generalization |

Why It Matters

ARC-AGI-2 is the benchmark most closely associated with the question: “Are we making progress toward AGI?”
  • Tests intelligence, not knowledge — no amount of pretraining data helps directly; the model must reason in real time
  • Resistant to scale — bigger models do not automatically perform better
  • Measures generalization — The core capability that separates intelligence from memorization
  • ARC Prize — François Chollet’s $1M+ prize has made this a flagship challenge in the AI community

Notable Results

| Model / System | Accuracy | Date |
| --- | --- | --- |
| ARC Prize 2024 Winner (program synthesis) | ~55% (ARC-1) | 2024 |
| OpenAI o3 (high compute) | ~40% (ARC-2) | 2025 |
| Claude 3.5 Sonnet | ~25% (ARC-2) | 2025 |
| GPT-4o | ~20% (ARC-2) | 2025 |
The gap between human performance (~80%) and the best AI systems (~40%) on ARC-AGI-2 remains one of the largest among major benchmarks, making it a key indicator for progress toward genuine reasoning.

Limitations

  • Grid-based only — Tests visual/spatial reasoning but not linguistic, social, or physical reasoning
  • Binary scoring — No partial credit for “almost correct” solutions
  • Compute-sensitive — Some approaches brute-force solutions with massive compute, gaming the benchmark
  • Narrow reasoning type — Does not capture all forms of intelligence
