Overview
SimpleBench is a benchmark designed to test AI models with questions that appear trivially easy to humans but consistently trip up even the most advanced AI systems. Created to expose fundamental gaps in spatial, temporal, social, and logical reasoning, SimpleBench reveals that high performance on complex benchmarks doesn’t always translate to reliable common-sense reasoning. The key insight: if a model can’t answer “simple” everyday questions correctly, can it truly be trusted with complex real-world tasks?
Key Details
| Property | Value |
|---|---|
| Created by | SimpleBench Team |
| Task type | Multiple-choice reasoning questions |
| Categories | Spatial, social, temporal, logical, causal reasoning |
| Question format | Simple scenarios + multiple-choice answers |
| Evaluation | Accuracy (% correct) |
| Leaderboard | simple-bench.com |
How It Works
- Input: A short, everyday scenario described in plain language
- Question: A straightforward question about the scenario
- Options: Multiple-choice answers (typically 3-4 options)
- Evaluation: The model must select the correct answer; accuracy (% correct) is the reported score, as sketched below
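To make the loop concrete, here is a minimal scoring sketch in Python. It is an illustration, not SimpleBench’s official harness: `query_model` is a hypothetical callable that returns a single option letter, and the item field names (`scenario`, `question`, `options`, `answer`) are assumptions for this example.

```python
from typing import Callable

LETTERS = "ABCD"

def format_prompt(scenario: str, question: str, options: list[str]) -> str:
    """Render one SimpleBench-style item as a plain-text prompt."""
    lines = [scenario, question]
    lines += [f"{LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def evaluate(items: list[dict], query_model: Callable[[str], str]) -> float:
    """Score a model on a list of items and return accuracy in percent.

    query_model is a stand-in for any model call that returns the
    chosen option letter; the item schema here is illustrative.
    """
    correct = sum(
        query_model(
            format_prompt(it["scenario"], it["question"], it["options"])
        ).strip().upper() == it["answer"]
        for it in items
    )
    return 100.0 * correct / len(items)
```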
Example Questions
Scenario: Alice puts a ball in a red box, then leaves the room. While Alice is away, Bob moves the ball to the blue box. Alice returns. Question: Where will Alice look for the ball?
- A) In the red box ✅
- B) In the blue box
- C) She won’t look for it
Scenario: A heavy rock is placed on top of a piece of paper on a table. You pull the paper quickly. Question: Where is the rock now?
- A) On the floor
- B) On the table ✅
- C) On the paper
These questions test theory of mind, physical intuition, and causal reasoning: areas where models frequently fail despite their apparent sophistication.
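For illustration, the two worked examples above can be encoded as plain records that plug into the `evaluate()` sketch shown earlier. The field names and category tags are assumptions for this sketch, not SimpleBench’s published schema.

```python
# The two example items above, encoded for the evaluate() sketch.
# Field names and category labels are illustrative, not official.
EXAMPLE_ITEMS = [
    {
        "scenario": (
            "Alice puts a ball in a red box, then leaves the room. "
            "While Alice is away, Bob moves the ball to the blue box. "
            "Alice returns."
        ),
        "question": "Where will Alice look for the ball?",
        "options": ["In the red box", "In the blue box", "She won't look for it"],
        "answer": "A",
        "category": "social",
    },
    {
        "scenario": (
            "A heavy rock is placed on top of a piece of paper on a table. "
            "You pull the paper quickly."
        ),
        "question": "Where is the rock now?",
        "options": ["On the floor", "On the table", "On the paper"],
        "answer": "B",
        "category": "causal",
    },
]
```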
Reasoning Categories
| Category | What It Tests | Example |
|---|---|---|
| Spatial Reasoning | Understanding physical relationships and positions | Object locations after movement |
| Social / Theory of Mind | Understanding others’ beliefs and knowledge states | False-belief tasks (Sally-Anne) |
| Temporal Reasoning | Understanding time sequences and ordering | What happens first/last in a sequence |
| Logical Reasoning | Basic deduction and inference | If A then B; not B; therefore not A |
| Causal Reasoning | Understanding cause and effect | Physical interactions and consequences |
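Because every item maps to one of these categories, per-category accuracy is a natural diagnostic for where a model breaks down. The sketch below aggregates results by category, reusing the hypothetical item schema and `format_prompt()` from the earlier example; the `category` field holds one of the labels in the table above.

```python
from collections import defaultdict
from typing import Callable

def accuracy_by_category(
    items: list[dict], query_model: Callable[[str], str]
) -> dict[str, float]:
    """Break accuracy down by reasoning category.

    Relies on format_prompt() from the earlier sketch and the same
    illustrative item schema (scenario/question/options/answer/category).
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for it in items:
        prompt = format_prompt(it["scenario"], it["question"], it["options"])
        total[it["category"]] += 1
        correct[it["category"]] += (
            query_model(prompt).strip().upper() == it["answer"]
        )
    return {cat: 100.0 * correct[cat] / total[cat] for cat in total}
```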
Why It Matters
SimpleBench is important because it reveals a critical gap in AI evaluation:
- Benchmarks can be misleading — Models that score 90%+ on graduate-level exams still fail trivial reasoning questions
- Real-world reliability — Users expect models to handle simple questions correctly; failures here erode trust
- Reasoning vs. pattern matching — SimpleBench distinguishes genuine understanding from statistical correlation
- Safety implications — If a model can’t reason about basic cause-and-effect, it shouldn’t be trusted with consequential decisions
Notable Results
| Model | Accuracy | Date |
|---|---|---|
| OpenAI o3 | ~83% | 2025 |
| Claude 3.5 Sonnet | ~75% | 2025 |
| GPT-4o | ~70% | 2025 |
| Gemini 2.0 Pro | ~68% | 2025 |
| Llama 3.1 405B | ~55% | 2025 |
Even the best models fail ~17% of questions that most humans would answer correctly. SimpleBench remains unsaturated despite rapid model improvements.
Limitations
- English-only — Does not test multilingual reasoning
- Multiple-choice format — Limits evaluation to discrete answers, not open-ended reasoning
- Static dataset — Subject to contamination as it becomes widely known
References
- SimpleBench — Official website and leaderboard: simple-bench.com