

Overview

SimpleBench is a benchmark designed to test AI models with questions that appear trivially easy to humans but consistently trip up even the most advanced AI systems. Created to expose fundamental gaps in spatial, temporal, social, and logical reasoning, SimpleBench reveals that high performance on complex benchmarks doesn’t always translate to reliable common-sense reasoning. The key insight: if a model can’t answer “simple” everyday questions correctly, can it truly be trusted with complex real-world tasks?

Key Details

| Property | Value |
| --- | --- |
| Created by | SimpleBench Team |
| Task type | Multiple-choice reasoning questions |
| Categories | Spatial, social, temporal, logical, causal reasoning |
| Question format | Simple scenarios + multiple-choice answers |
| Evaluation | Accuracy (% correct) |
| Leaderboard | simple-bench.com |

How It Works

  1. Input: A short, everyday scenario described in plain language
  2. Question: A straightforward question about the scenario
  3. Options: Multiple-choice answers (typically 3-4 options)
  4. Evaluation: The model must select the correct answer
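The four steps above can be sketched as a simple prompt builder. The field names and prompt layout here are illustrative assumptions, not the official SimpleBench harness:

```python
# Minimal sketch of presenting a SimpleBench-style item to a model.
# The record fields and prompt format are assumptions for illustration.

def build_prompt(scenario: str, question: str, options: list[str]) -> str:
    """Format a scenario, a question, and lettered options as one prompt."""
    letters = "ABCDEFG"
    lines = [scenario, "", question]
    for letter, option in zip(letters, options):
        lines.append(f"{letter}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

prompt = build_prompt(
    "Alice puts a ball in a red box, then leaves the room. "
    "While Alice is away, Bob moves the ball to the blue box. Alice returns.",
    "Where will Alice look for the ball?",
    ["In the red box", "In the blue box", "She won't look for it"],
)
print(prompt)
```

The model's reply would then be parsed for a single option letter and compared against the gold answer.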

Example Questions

Scenario: Alice puts a ball in a red box, then leaves the room. While Alice is away, Bob moves the ball to the blue box. Alice returns.
Question: Where will Alice look for the ball?
  • A) In the red box ✅
  • B) In the blue box
  • C) She won’t look for it

Scenario: A heavy rock is placed on top of a piece of paper on a table. You pull the paper quickly.
Question: Where is the rock now?
  • A) On the floor
  • B) On the table ✅
  • C) On the paper
These questions test theory of mind, physical intuition, and causal reasoning — areas where models frequently fail despite their apparent sophistication.
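Since SimpleBench scores by accuracy (% correct), the two examples above can be encoded as records and graded by exact letter match. The field names and grading rule are assumptions for illustration, not the official harness:

```python
# Sketch of accuracy scoring over the two example items.
# Field names and exact-letter grading are illustrative assumptions.

examples = [
    {"question": "Where will Alice look for the ball?", "gold": "A"},
    {"question": "Where is the rock now?", "gold": "B"},
]

def accuracy(predictions: list[str], items: list[dict]) -> float:
    """Fraction of items where the predicted letter matches the gold answer."""
    correct = sum(pred == item["gold"] for pred, item in zip(predictions, items))
    return correct / len(items)

print(accuracy(["A", "C"], examples))  # one of two correct -> 0.5
```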

Reasoning Categories

| Category | What It Tests | Example |
| --- | --- | --- |
| Spatial Reasoning | Understanding physical relationships and positions | Object locations after movement |
| Social / Theory of Mind | Understanding others’ beliefs and knowledge states | False-belief tasks (Sally-Anne) |
| Temporal Reasoning | Understanding time sequences and ordering | What happens first/last in a sequence |
| Logical Reasoning | Basic deduction and inference | If A then B; not B; therefore not A |
| Causal Reasoning | Understanding cause and effect | Physical interactions and consequences |

Why It Matters

SimpleBench is important because it reveals a critical gap in AI evaluation:
  • Benchmarks can be misleading — Models that score 90%+ on graduate-level exams still fail trivial reasoning questions
  • Real-world reliability — Users expect models to handle simple questions correctly; failures here erode trust
  • Reasoning vs. pattern matching — SimpleBench distinguishes genuine understanding from statistical correlation
  • Safety implications — If a model can’t reason about basic cause-and-effect, it shouldn’t be trusted with consequential decisions

Notable Results

| Model | Accuracy | Date |
| --- | --- | --- |
| OpenAI o3 | ~83% | 2025 |
| Claude 3.5 Sonnet | ~75% | 2025 |
| GPT-4o | ~70% | 2025 |
| Gemini 2.0 Pro | ~68% | 2025 |
| Llama 3.1 405B | ~55% | 2025 |
Even the best models fail ~17% of questions that most humans would answer correctly. SimpleBench remains unsaturated despite rapid model improvements.

Limitations

  • English-only — Does not test multilingual reasoning
  • Multiple-choice format — Limits evaluation to discrete answers, not open-ended reasoning
  • Static dataset — Subject to contamination as it becomes widely known
