Overview
Humanity’s Last Exam (HLE) is a crowdsourced benchmark of approximately 3,000 questions spanning over 100 academic disciplines, created by domain experts from around the world. It was designed with a provocative mission: to be the last exam humanity would need to create — one so difficult that only a truly general intelligence could pass it.
Created by the Center for AI Safety (CAIS) and Scale AI, Humanity’s Last Exam is among the most ambitious attempts to test the frontier of AI knowledge and reasoning across the full breadth of human expertise.
Key Details
| Property | Value |
|---|---|
| Created by | Center for AI Safety (CAIS) & Scale AI |
| Released | January 2025 |
| Task type | Expert-level academic questions |
| Dataset size | ~3,000 questions |
| Domains | 100+ academic disciplines |
| Question types | Multiple-choice and short-answer |
| Evaluation | Accuracy (% correct) |
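The accuracy figure above is a straightforward percentage of questions answered correctly, with no partial credit. The sketch below illustrates the idea with exact string matching; it is an assumption-laden simplification, since grading of free-form short answers may involve answer normalization or a judge model.

```python
# Minimal illustration of the accuracy metric (not the official grader):
# the percentage of predictions that exactly match the reference answers.
def accuracy(predictions: list[str], answers: list[str]) -> float:
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

print(accuracy(["Paris", "7"], ["paris", "8"]))  # 50.0
```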
How It Works
- Expert contribution: Researchers and domain experts around the world submitted their hardest questions — problems that would challenge even PhD-level specialists
- Curation: Questions were filtered for quality, uniqueness, and difficulty
- Format: Mix of multiple-choice and short-answer questions
- Evaluation: Models are tested zero-shot with standard prompting (a minimal evaluation sketch follows this list)
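As a rough sketch of that zero-shot setup, the snippet below loads the dataset, prompts a model once per question, and grades by exact match. The Hugging Face dataset id `cais/hle` and the `question`/`answer` column names are assumptions, and `query_model` is a hypothetical placeholder for whatever inference client you use; treat this as an outline rather than the official evaluation harness.

```python
# Hypothetical zero-shot evaluation loop (assumed dataset id and column names).
from datasets import load_dataset


def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model under test and return its answer."""
    raise NotImplementedError


def evaluate_zero_shot(split: str = "test") -> float:
    ds = load_dataset("cais/hle", split=split)  # assumed dataset location
    correct = 0
    for item in ds:
        prompt = (
            "Answer the following question as concisely as possible.\n\n"
            f"Question: {item['question']}\nAnswer:"
        )
        prediction = query_model(prompt)
        # Exact-match grading; the official pipeline may normalize answers
        # or use a judge model for short-answer questions.
        correct += prediction.strip().lower() == item["answer"].strip().lower()
    return 100.0 * correct / len(ds)
```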
Example Domains
The benchmark spans an extraordinary breadth of human knowledge:
| Domain Cluster | Example Subjects |
|---|---|
| STEM | Quantum mechanics, algebraic topology, organic chemistry, genomics |
| Humanities | Classical philology, art history, medieval literature, philosophy of mind |
| Social Sciences | Behavioral economics, political theory, linguistics, anthropology |
| Professional | Clinical medicine, constitutional law, patent examination, actuarial science |
| Specialized | Entomology, numismatics, paleoclimate reconstruction, music theory |
Why It Matters
Humanity’s Last Exam addresses a critical problem in AI evaluation: benchmark saturation. As models quickly approach ceiling performance on existing benchmarks (MMLU, GPQA, etc.), the field needs harder tests that can differentiate frontier capabilities.
- Expert-caliber difficulty — Questions are designed to be challenging even for specialists in each field
- Breadth over depth — Tests the generality of knowledge, not just narrow expertise
- Contamination-resistant — Novel questions not found in training data
- AGI milestone — Performance on HLE is widely tracked as a proxy for progress toward general intelligence
Notable Results
| Model | Accuracy | Date |
|---|---|---|
| OpenAI o3 | ~26% | 2025 |
| Claude 3.5 Sonnet | ~18% | 2025 |
| Gemini 2.0 Ultra | ~22% | 2025 |
| GPT-4o | ~15% | 2025 |
| DeepSeek-V3 | ~16% | 2025 |
Humanity’s Last Exam remains far from solved. The best models get approximately 1 in 4 questions correct, highlighting the vast gap between current AI and comprehensive human-level expertise.
Scoring Breakdown
Performance varies dramatically across domains:
| Domain | Best Model Score | Observation |
|---|---|---|
| Mathematics | ~35% | Models perform relatively better on formal reasoning |
| Physics | ~30% | Strong on computational problems, weak on conceptual ones |
| Biology | ~20% | Specialized knowledge gaps are evident |
| Humanities | ~12% | Nuanced cultural and interpretive knowledge remains very hard |
| Niche specialties | ~8% | Rare domains expose the limits of training data breadth |
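To produce a breakdown like this for your own runs, you can bucket graded predictions by discipline. The field names below (`domain`, `prediction`, `answer`) are assumed for illustration; the dataset’s actual subject metadata may be named differently.

```python
# Sketch of a per-domain accuracy breakdown over already-graded predictions
# (the "domain", "prediction", and "answer" field names are assumptions).
from collections import defaultdict


def accuracy_by_domain(records: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        hits[r["domain"]] += r["prediction"].strip().lower() == r["answer"].strip().lower()
    return {d: 100.0 * hits[d] / totals[d] for d in totals}


example = [
    {"domain": "Mathematics", "prediction": "42", "answer": "42"},
    {"domain": "Mathematics", "prediction": "0", "answer": "1"},
    {"domain": "Humanities", "prediction": "Ovid", "answer": "Virgil"},
]
print(accuracy_by_domain(example))  # {'Mathematics': 50.0, 'Humanities': 0.0}
```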
Limitations
- Static dataset — Will become contaminated over time as questions leak into training data
- Expert bias — Questions reflect the priorities of contributing researchers, not uniform coverage
- Format constraints — Multiple-choice and short-answer don’t test open-ended reasoning or synthesis
- No partial credit — Binary scoring doesn’t capture “close” answers or sound reasoning with wrong conclusions
References