Overview
ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) is the successor to François Chollet’s original ARC benchmark, widely considered the most important test of genuine reasoning ability in AI. Each task presents a small set of input-output grid examples, and the model must infer the underlying transformation rule and apply it to a new input. ARC-AGI-2 is specifically designed to be unsolvable through memorization or scale alone — it requires the kind of fluid intelligence and abstraction that humans use effortlessly but machines find extraordinarily difficult.
Key Details
| Property | Value |
|---|---|
| Created by | François Chollet / ARC Prize Foundation |
| Version | 2.0 (evolved from ARC-AGI-1) |
| Task type | Visual abstract reasoning (grid transformations) |
| Dataset size | 1,000+ tasks |
| Format | Input-output grid pairs → predict new output |
| Evaluation | Accuracy (exact grid match) |
| Prize | ARC Prize — $1M+ for significant progress |
| Leaderboard | arcprize.org |
How It Works
Each ARC task consists of:
- Demonstration pairs: 2-5 input grids and their corresponding output grids
- Test input: A new grid that follows the same transformation rule
- Challenge: The model must produce the exact correct output grid
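The structure above maps directly onto the JSON format used in the public ARC repositories, where each task has `train` demonstration pairs and `test` inputs, and every grid is a list of rows of color codes 0-9. A minimal sketch (the specific grids and transformation rule here are made up for illustration), including the exact-match scoring the benchmark uses:

```python
# Sketch of an ARC task in the public JSON-style format.
# Each grid is a list of rows; cell values 0-9 are colors.
# The (hypothetical) rule in this toy task: reverse each row.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 0]]},  # model must predict the output grid
    ],
}

def exact_match(predicted, expected):
    """ARC scoring: the grid must match cell-for-cell; no partial credit."""
    return predicted == expected

# A correct prediction under the reverse-each-row rule:
prediction = [[0, 3], [0, 0]]
print(exact_match(prediction, [[0, 3], [0, 0]]))  # True
```

Note that a single wrong cell makes `exact_match` return `False` — there is no notion of "almost right" in ARC evaluation.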
Task Example (Conceptual)
Imagine grids where colored squares form patterns:
- Demo 1: A red L-shape → A red L-shape rotated 90°
- Demo 2: A blue T-shape → A blue T-shape rotated 90°
- Test: A green Z-shape → ??? (Must produce the Z-shape rotated 90°)
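The rotation rule in this toy example is simple to state in code. A sketch, assuming grids are represented as lists of rows (as in the public ARC task format):

```python
def rotate_90_clockwise(grid):
    """Rotate a rectangular grid 90 degrees clockwise.

    Reversing the rows and then zipping pairs up the columns of
    the reversed grid, which is exactly the rotated grid.
    """
    return [list(row) for row in zip(*grid[::-1])]

# A 2x3 L-like shape (1 = colored cell) becomes a 3x2 grid:
shape = [
    [1, 0, 0],
    [1, 1, 1],
]
print(rotate_90_clockwise(shape))  # [[1, 1], [1, 0], [1, 0]]
```

Of course, a solver cannot hard-code this function — the point of ARC is that it must infer "rotate 90°" (or any other rule) from the demonstration pairs alone.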
What Makes ARC-AGI-2 Unique
Novelty by design
Every task uses a unique transformation not repeated elsewhere in the dataset. There is no pattern to memorize across tasks — each one is a brand-new puzzle.
No training distribution
ARC tasks are deliberately outside any training distribution. They test on-the-fly learning from just a few examples, not recall of similar problems.
Human baseline is high
Average humans solve ~85% of ARC-AGI-1 tasks and ~80% of ARC-AGI-2 tasks. The benchmark was calibrated to be easy for humans but hard for machines — the opposite of benchmarks like MATH or MMLU.
Exact-match evaluation
Outputs must be pixel-perfect. There is no partial credit — either the model produces exactly the right grid or it fails.
ARC-AGI-2 vs. ARC-AGI-1
| Feature | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|
| Difficulty | Hard | Significantly harder |
| Task complexity | Moderate transformations | Multi-step, compositional rules |
| Best AI score | ~55% (with heavy compute) | ~40% |
| Human score | ~85% | ~80% |
| Focus | Basic abstraction | Compositional generalization |
Why It Matters
ARC-AGI-2 is the benchmark most closely associated with the question: “Are we making progress toward AGI?”
- Tests intelligence, not knowledge — No amount of training data helps; the model must reason in real time
- Immune to scale — Bigger models don’t automatically perform better
- Measures generalization — The core capability that separates intelligence from memorization
- ARC Prize — François Chollet’s $1M+ prize has made this a flagship challenge in the AI community
Notable Results
| Model / System | Accuracy | Date |
|---|---|---|
| ARC Prize 2024 Winner (program synthesis) | ~55% (ARC-1) | 2024 |
| OpenAI o3 (high compute) | ~40% (ARC-2) | 2025 |
| Claude 3.5 Sonnet | ~25% (ARC-2) | 2025 |
| GPT-4o | ~20% (ARC-2) | 2025 |
The gap between human performance (~80%) and the best AI systems (~40%) on ARC-AGI-2 remains one of the largest among major benchmarks, making it a key indicator for progress toward genuine reasoning.
Limitations
- Grid-based only — Tests visual/spatial reasoning but not linguistic, social, or physical reasoning
- Binary scoring — No partial credit for “almost correct” solutions
- Compute-sensitive — Some approaches brute-force solutions with massive compute, gaming the benchmark
- Narrow reasoning type — Does not capture all forms of intelligence
References
- ARC Prize — Official prize and leaderboard
- On the Measure of Intelligence — François Chollet’s original paper defining the ARC framework
- ARC-AGI-2 Announcement — Technical details on the updated benchmark