
Overview

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) is the successor to François Chollet’s original ARC benchmark and is widely regarded as one of the most important tests of genuine reasoning ability in AI. Each task presents a small set of input-output grid examples, and the model must infer the underlying transformation rule and apply it to a new input. ARC-AGI-2 is specifically designed to be unsolvable through memorization or scale alone; it requires the kind of fluid intelligence and abstraction that humans use effortlessly but machines find extraordinarily difficult.

Key Details

| Property | Value |
| --- | --- |
| Created by | François Chollet / ARC Prize Foundation |
| Version | 2.0 (evolved from ARC-AGI-1) |
| Task type | Visual abstract reasoning (grid transformations) |
| Dataset size | 1,000+ tasks |
| Format | Input-output grid pairs → predict new output |
| Evaluation | Accuracy (exact grid match) |
| Prize | ARC Prize ($1M+ for significant progress) |
| Leaderboard | arcprize.org |

How It Works

Each ARC task consists of:
  1. Demonstration pairs: 2-5 input grids and their corresponding output grids
  2. Test input: A new grid that follows the same transformation rule
  3. Challenge: The model must produce the exact correct output grid
```
Demo 1:  Input  → Output
Demo 2:  Input  → Output
Demo 3:  Input  → Output
         ─────────────
Test:    Input  → ???    (Model must predict this)
```
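For concreteness, the sketch below shows how a task of this shape can be loaded and inspected. The JSON layout (a `train` list of demonstration pairs and a `test` list of held-out inputs, with each grid a list of rows of integer colors 0–9) follows the format used in the public ARC repositories; the file name is a hypothetical placeholder.

```python
import json

# Load one ARC task. The file name below is a placeholder; public ARC
# repositories ship each task as a standalone JSON file.
with open("task_example.json") as f:
    task = json.load(f)

# "train" holds the demonstration pairs (typically 2-5 of them).
for i, pair in enumerate(task["train"], start=1):
    rows_in, cols_in = len(pair["input"]), len(pair["input"][0])
    rows_out, cols_out = len(pair["output"]), len(pair["output"][0])
    print(f"Demo {i}: {rows_in}x{cols_in} grid -> {rows_out}x{cols_out} grid")

# "test" holds the held-out input(s) whose output must be predicted.
test_input = task["test"][0]["input"]
```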

Task Example (Conceptual)

Imagine grids where colored squares form patterns:
  • Demo 1: A red L-shape → A red L-shape rotated 90°
  • Demo 2: A blue T-shape → A blue T-shape rotated 90°
  • Test: A green Z-shape → ??? (Must produce the Z-shape rotated 90°)
The model must infer the rule (“rotate the shape 90° clockwise”) and apply it perfectly to the test input. Real ARC tasks involve much more complex and varied transformations.
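As a toy illustration of that inference loop, here is a minimal sketch: a candidate rule (rotate 90° clockwise) is accepted only if it reproduces every demonstration pair exactly, and is then applied to the test input. The grids and the `rotate_90_clockwise` helper are illustrative inventions, not part of the benchmark.

```python
def rotate_90_clockwise(grid):
    """Rotate a grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

# Hypothetical demonstration pair: an L-shape (1 = colored, 0 = empty).
demo_input = [[1, 0],
              [1, 0],
              [1, 1]]
demo_output = [[1, 1, 1],
               [1, 0, 0]]

# A candidate rule is only kept if it reproduces every demo exactly.
assert rotate_90_clockwise(demo_input) == demo_output

# Apply the inferred rule to the test input (a Z-like shape).
test_input = [[0, 1],
              [1, 1],
              [1, 0]]
prediction = rotate_90_clockwise(test_input)
print(prediction)
```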

What Makes ARC-AGI-2 Unique

Novelty by design

Every task uses a unique transformation not repeated elsewhere in the dataset. There is no pattern to memorize across tasks — each one is a brand-new puzzle.

No training distribution

ARC tasks are deliberately outside any training distribution. They test on-the-fly learning from just a few examples, not recall of similar problems.

Human baseline is high

Average humans solve roughly 85% of ARC-AGI-1 tasks (about 80% on ARC-AGI-2). The benchmark was calibrated to be easy for humans but hard for machines, the opposite of benchmarks like MATH or MMLU.

Exact-match evaluation

Outputs must be pixel-perfect. There is no partial credit — either the model produces exactly the right grid or it fails.
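In code, the scoring rule is as blunt as it sounds. The sketch below uses a hypothetical `score_task` helper (not an official API) to show that a single wrong cell scores the same as an entirely wrong grid.

```python
def score_task(predicted, expected):
    """Exact-match scoring: 1 if the shape and every cell match, else 0.
    There is no partial credit."""
    return 1 if predicted == expected else 0

expected = [[1, 1, 1],
            [1, 0, 0]]
off_by_one = [[1, 1, 1],
              [1, 0, 1]]  # a single wrong cell

print(score_task(expected, expected))    # 1
print(score_task(off_by_one, expected))  # 0 -- "almost right" scores nothing
```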

ARC-AGI-2 vs. ARC-AGI-1

| Feature | ARC-AGI-1 | ARC-AGI-2 |
| --- | --- | --- |
| Difficulty | Hard | Significantly harder |
| Task complexity | Moderate transformations | Multi-step, compositional rules |
| Best AI score | ~55% (with heavy compute) | ~40% |
| Human score | ~85% | ~80% |
| Focus | Basic abstraction | Compositional generalization |

Why It Matters

ARC-AGI-2 is the benchmark most closely associated with the question: “Are we making progress toward AGI?”
  • Tests intelligence, not knowledge — no amount of pretraining data helps directly; the model must reason in real time
  • Resistant to scale — bigger models do not automatically perform better
  • Measures generalization — The core capability that separates intelligence from memorization
  • ARC Prize — François Chollet’s $1M+ prize has made this a flagship challenge in the AI community

Notable Results

| Model / System | Accuracy | Date |
| --- | --- | --- |
| ARC Prize 2024 Winner (program synthesis) | ~55% (ARC-1) | 2024 |
| OpenAI o3 (high compute) | ~40% (ARC-2) | 2025 |
| Claude 3.5 Sonnet | ~25% (ARC-2) | 2025 |
| GPT-4o | ~20% (ARC-2) | 2025 |
The gap between human performance (~80%) and the best AI systems (~40%) on ARC-AGI-2 remains one of the largest among major benchmarks, making it a key indicator for progress toward genuine reasoning.

Limitations

  • Grid-based only — Tests visual/spatial reasoning but not linguistic, social, or physical reasoning
  • Binary scoring — No partial credit for “almost correct” solutions
  • Compute-sensitive — Some approaches brute-force solutions with massive compute, gaming the benchmark
  • Narrow reasoning type — Does not capture all forms of intelligence
