Humanity's Last Exam

Overview

Humanity’s Last Exam (HLE) is a crowdsourced benchmark of approximately 3,000 questions spanning more than 100 academic disciplines, written by domain experts from around the world. Organized by the Center for AI Safety (CAIS) and Scale AI, it was designed with a provocative mission: to be the last exam humanity would need to create, one so difficult that only a truly general intelligence could pass it. HLE is among the most ambitious attempts to test the frontier of AI knowledge and reasoning across the full breadth of human expertise.

Key Details

| Property | Value |
|---|---|
| Created by | Center for AI Safety (CAIS) & Scale AI |
| Released | January 2025 |
| Task type | Expert-level academic questions |
| Dataset size | ~3,000 questions |
| Domains | 100+ academic disciplines |
| Question types | Multiple-choice and short-answer |
| Evaluation | Accuracy (% correct) |
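
For readers who want to inspect the questions directly, here is a minimal sketch of loading the benchmark with the Hugging Face `datasets` library. The dataset id `cais/hle`, the split name, and the field names used below are assumptions about the public release, not details confirmed on this page.

```python
# Minimal sketch: load HLE for inspection with the Hugging Face `datasets` library.
# The dataset id "cais/hle", the split name, and the field names are assumptions.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")   # assumed dataset id and split
print(len(hle))                                # roughly 3,000 questions

example = hle[0]
print(example.get("question"))      # question text
print(example.get("answer"))        # reference answer
print(example.get("category"))      # academic discipline (assumed field name)
```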

How It Works

  1. Expert contribution: Thousands of researchers and domain experts submitted their hardest questions — problems that would challenge even PhD-level specialists
  2. Curation: Questions were filtered for quality, uniqueness, and difficulty
  3. Format: Mix of multiple-choice and short-answer questions
  4. Evaluation: Models are tested zero-shot with standard prompting and scored on accuracy (see the sketch below)
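
The following sketch illustrates the zero-shot protocol from step 4: each question is sent to the model with a plain instruction and no worked examples, and accuracy is the fraction of answers that match the reference. The `ask` callable and the `question`/`answer` field names are hypothetical placeholders, not the benchmark's actual interface.

```python
from typing import Callable

# Sketch of the zero-shot evaluation loop described in the steps above.
# `ask` is a hypothetical stand-in for whatever model API is under test, and the
# "question"/"answer" field names are assumptions, not confirmed by this page.
def evaluate(questions: list[dict], ask: Callable[[str], str]) -> float:
    correct = 0
    for q in questions:
        # Zero-shot: the question is sent with a plain instruction, no worked examples.
        prompt = (
            "Answer the following question. Respond with only the final answer.\n\n"
            + q["question"]
        )
        prediction = ask(prompt).strip().lower()
        # Binary scoring: each answer counts as either right or wrong, no partial credit.
        if prediction == q["answer"].strip().lower():
            correct += 1
    return correct / len(questions)  # accuracy = fraction answered correctly
```

In practice, grading free-form short answers usually requires more than literal string equality (normalization or an automated judge), so treat the comparison line as the simplest possible stand-in.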

Example Domains

The benchmark spans an extraordinary breadth of human knowledge:
| Domain Cluster | Example Subjects |
|---|---|
| STEM | Quantum mechanics, algebraic topology, organic chemistry, genomics |
| Humanities | Classical philology, art history, medieval literature, philosophy of mind |
| Social Sciences | Behavioral economics, political theory, linguistics, anthropology |
| Professional | Clinical medicine, constitutional law, patent examination, actuarial science |
| Specialized | Entomology, numismatics, paleoclimate reconstruction, music theory |

Why It Matters

Humanity’s Last Exam addresses a critical problem in AI evaluation: benchmark saturation. As models quickly approach ceiling performance on existing benchmarks (MMLU, GPQA, etc.), the field needs harder tests that can differentiate frontier capabilities.
  • Expert-caliber difficulty — Questions are designed to be challenging even for specialists in each field
  • Breadth over depth — Tests the generality of knowledge, not just narrow expertise
  • Contamination-resistant — Novel questions not found in training data
  • AGI milestone — Performance on HLE is widely tracked as a proxy for progress toward general intelligence

Notable Results

| Model | Accuracy | Date |
|---|---|---|
| OpenAI o3 | ~26% | 2025 |
| Claude 3.5 Sonnet | ~18% | 2025 |
| Gemini 2.0 Ultra | ~22% | 2025 |
| GPT-4o | ~15% | 2025 |
| DeepSeek-V3 | ~16% | 2025 |

Humanity’s Last Exam remains far from solved: the best models answer roughly one in four questions correctly, highlighting the vast gap between current AI and comprehensive human-level expertise.

Scoring Breakdown

Performance varies dramatically across domains:
| Domain | Best Model Score | Observation |
|---|---|---|
| Mathematics | ~35% | Models perform relatively better on formal reasoning |
| Physics | ~30% | Strong on computational problems, weak on conceptual ones |
| Biology | ~20% | Specialized knowledge gaps are evident |
| Humanities | ~12% | Nuanced cultural and interpretive knowledge remains very hard |
| Niche specialties | ~8% | Rare domains expose the limits of training data breadth |
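
A breakdown like the table above can be produced by grouping graded results by their domain label. The sketch below assumes each graded record is a dict with a `category` string and a boolean `correct` flag; both names are illustrative, not the benchmark's actual schema.

```python
from collections import defaultdict

def accuracy_by_domain(graded: list[dict]) -> dict[str, float]:
    """Compute per-domain accuracy from graded records (assumed schema, see above)."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for record in graded:
        totals[record["category"]] += 1                      # questions seen in this domain
        hits[record["category"]] += int(record["correct"])   # correct answers in this domain
    return {domain: hits[domain] / totals[domain] for domain in totals}
```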

Limitations

  • Static dataset — Will become contaminated over time as questions leak into training data
  • Expert bias — Questions reflect the priorities of contributing researchers, not uniform coverage
  • Format constraints — Multiple-choice and short-answer don’t test open-ended reasoning or synthesis
  • No partial credit — Binary scoring doesn’t capture “close” answers or sound reasoning with wrong conclusions
