Overview
Humanity’s Last Exam (HLE) is a crowdsourced benchmark of approximately 3,000 questions spanning over 100 academic disciplines, created by domain experts from around the world. It was designed with a provocative mission: to be the last exam humanity would need to create — one so difficult that only a truly general intelligence could pass it.
Created by the Center for AI Safety (CAIS) and Scale AI, Humanity’s Last Exam is among the most ambitious attempts to test the frontier of AI knowledge and reasoning across the full breadth of human expertise.
Key Details
| Property | Value |
|---|---|
| Created by | Center for AI Safety (CAIS) & Scale AI |
| Released | January 2025 |
| Task type | Expert-level academic questions |
| Dataset size | ~3,000 questions |
| Domains | 100+ academic disciplines |
| Question types | Multiple-choice and short-answer |
| Evaluation | Accuracy (% correct) |
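The accuracy figure above is a straightforward percentage of questions answered correctly, with no partial credit. The sketch below illustrates the idea with exact string matching; it is an assumption-laden simplification, since grading of free-form short answers may involve answer normalization or a judge model.

```python
# Minimal illustration of the accuracy metric (not the official grader):
# the percentage of predictions that exactly match the reference answers.
def accuracy(predictions: list[str], answers: list[str]) -> float:
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

print(accuracy(["Paris", "7"], ["paris", "8"]))  # 50.0
```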
How It Works
- Expert contribution: Researchers and domain experts around the world submitted their hardest questions — problems that would challenge even PhD-level specialists
- Curation: Questions were filtered for quality, uniqueness, and difficulty
- Format: Mix of multiple-choice and short-answer questions
- Evaluation: Models are tested zero-shot with standard prompting (a minimal evaluation sketch follows this list)
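As a rough sketch of that zero-shot setup, the snippet below loads the dataset, prompts a model once per question, and grades by exact match. The Hugging Face dataset id `cais/hle` and the `question`/`answer` column names are assumptions, and `query_model` is a hypothetical placeholder for whatever inference client you use; treat this as an outline rather than the official evaluation harness.

```python
# Hypothetical zero-shot evaluation loop (assumed dataset id and column names).
from datasets import load_dataset


def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model under test and return its answer."""
    raise NotImplementedError


def evaluate_zero_shot(split: str = "test") -> float:
    ds = load_dataset("cais/hle", split=split)  # assumed dataset location
    correct = 0
    for item in ds:
        prompt = (
            "Answer the following question as concisely as possible.\n\n"
            f"Question: {item['question']}\nAnswer:"
        )
        prediction = query_model(prompt)
        # Exact-match grading; the official pipeline may normalize answers
        # or use a judge model for short-answer questions.
        correct += prediction.strip().lower() == item["answer"].strip().lower()
    return 100.0 * correct / len(ds)
```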
Example Domains
The benchmark spans an extraordinary breadth of human knowledge:
| Domain Cluster | Example Subjects |
|---|---|
| STEM | Quantum mechanics, algebraic topology, organic chemistry, genomics |
| Humanities | Classical philology, art history, medieval literature, philosophy of mind |
| Social Sciences | Behavioral economics, political theory, linguistics, anthropology |
| Professional | Clinical medicine, constitutional law, patent examination, actuarial science |
| Specialized | Entomology, numismatics, paleoclimate reconstruction, music theory |
Why It Matters
Humanity’s Last Exam addresses a critical problem in AI evaluation: benchmark saturation. As models quickly approach ceiling performance on existing benchmarks (MMLU, GPQA, etc.), the field needs harder tests that can differentiate frontier capabilities.
- Expert-caliber difficulty — Questions are designed to be challenging even for specialists in each field
- Breadth over depth — Tests the generality of knowledge, not just narrow expertise
- Contamination-resistant — Novel questions not found in training data
- AGI milestone — Performance on HLE is widely tracked as a proxy for progress toward general intelligence
Notable Results
| Model | Accuracy | Date |
|---|---|---|
| OpenAI o3 | ~26% | 2025 |
| Claude 3.5 Sonnet | ~18% | 2025 |
| Gemini 2.0 Ultra | ~22% | 2025 |
| GPT-4o | ~15% | 2025 |
| DeepSeek-V3 | ~16% | 2025 |
Humanity’s Last Exam remains far from solved. The best models get approximately 1 in 4 questions correct, highlighting the vast gap between current AI and comprehensive human-level expertise.
Scoring Breakdown
Performance varies dramatically across domains:
| Domain | Best Model Score | Observation |
|---|---|---|
| Mathematics | ~35% | Models perform relatively better on formal reasoning |
| Physics | ~30% | Strong on computational problems, weak on conceptual ones |
| Biology | ~20% | Specialized knowledge gaps are evident |
| Humanities | ~12% | Nuanced cultural and interpretive knowledge remains very hard |
| Niche specialties | ~8% | Rare domains expose the limits of training data breadth |
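To produce a breakdown like this for your own runs, you can bucket graded predictions by discipline. The field names below (`domain`, `prediction`, `answer`) are assumed for illustration; the dataset’s actual subject metadata may be named differently.

```python
# Sketch of a per-domain accuracy breakdown over already-graded predictions
# (the "domain", "prediction", and "answer" field names are assumptions).
from collections import defaultdict


def accuracy_by_domain(records: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        hits[r["domain"]] += r["prediction"].strip().lower() == r["answer"].strip().lower()
    return {d: 100.0 * hits[d] / totals[d] for d in totals}


example = [
    {"domain": "Mathematics", "prediction": "42", "answer": "42"},
    {"domain": "Mathematics", "prediction": "0", "answer": "1"},
    {"domain": "Humanities", "prediction": "Ovid", "answer": "Virgil"},
]
print(accuracy_by_domain(example))  # {'Mathematics': 50.0, 'Humanities': 0.0}
```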
Limitations
- Static dataset — Will become contaminated over time as questions leak into training data
- Expert bias — Questions reflect the priorities of contributing researchers, not uniform coverage
- Format constraints — Multiple-choice and short-answer don’t test open-ended reasoning or synthesis
- No partial credit — Binary scoring doesn’t capture “close” answers or sound reasoning with wrong conclusions
References