

Overview

SimpleBench is a benchmark designed to test AI models with questions that appear trivially easy to humans but consistently trip up even the most advanced AI systems. Created to expose fundamental gaps in spatial, temporal, social, and logical reasoning, SimpleBench reveals that high performance on complex benchmarks doesn’t always translate to reliable common-sense reasoning. The key insight: if a model can’t answer “simple” everyday questions correctly, can it truly be trusted with complex real-world tasks?

Key Details

| Property | Value |
| --- | --- |
| Created by | SimpleBench Team |
| Task type | Multiple-choice reasoning questions |
| Categories | Spatial, social, temporal, logical, causal reasoning |
| Question format | Simple scenarios + multiple-choice answers |
| Evaluation | Accuracy (% correct) |
| Leaderboard | simple-bench.com |

How It Works

  1. Input: A short, everyday scenario described in plain language
  2. Question: A straightforward question about the scenario
  3. Options: Multiple-choice answers (typically 3-4 options)
  4. Evaluation: The model must select the correct answer
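The four steps above can be sketched as a simple prompt builder. The field names and prompt layout here are illustrative assumptions, not the official SimpleBench harness:

```python
# Minimal sketch of presenting a SimpleBench-style item to a model.
# The record fields and prompt format are assumptions for illustration.

def build_prompt(scenario: str, question: str, options: list[str]) -> str:
    """Format a scenario, a question, and lettered options as one prompt."""
    letters = "ABCDEFG"
    lines = [scenario, "", question]
    for letter, option in zip(letters, options):
        lines.append(f"{letter}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

prompt = build_prompt(
    "Alice puts a ball in a red box, then leaves the room. "
    "While Alice is away, Bob moves the ball to the blue box. Alice returns.",
    "Where will Alice look for the ball?",
    ["In the red box", "In the blue box", "She won't look for it"],
)
print(prompt)
```

The model's reply would then be parsed for a single option letter and compared against the gold answer.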

Example Questions

Scenario: Alice puts a ball in a red box, then leaves the room. While Alice is away, Bob moves the ball to the blue box. Alice returns.
Question: Where will Alice look for the ball?
  • A) In the red box ✅
  • B) In the blue box
  • C) She won’t look for it

Scenario: A heavy rock is placed on top of a piece of paper on a table. You pull the paper quickly.
Question: Where is the rock now?
  • A) On the floor
  • B) On the table ✅
  • C) On the paper
These questions test theory of mind, physical intuition, and causal reasoning — areas where models frequently fail despite their apparent sophistication.
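Since SimpleBench scores by accuracy (% correct), the two examples above can be encoded as records and graded by exact letter match. The field names and grading rule are assumptions for illustration, not the official harness:

```python
# Sketch of accuracy scoring over the two example items.
# Field names and exact-letter grading are illustrative assumptions.

examples = [
    {"question": "Where will Alice look for the ball?", "gold": "A"},
    {"question": "Where is the rock now?", "gold": "B"},
]

def accuracy(predictions: list[str], items: list[dict]) -> float:
    """Fraction of items where the predicted letter matches the gold answer."""
    correct = sum(pred == item["gold"] for pred, item in zip(predictions, items))
    return correct / len(items)

print(accuracy(["A", "C"], examples))  # one of two correct -> 0.5
```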

Reasoning Categories

| Category | What It Tests | Example |
| --- | --- | --- |
| Spatial Reasoning | Understanding physical relationships and positions | Object locations after movement |
| Social / Theory of Mind | Understanding others’ beliefs and knowledge states | False-belief tasks (Sally-Anne) |
| Temporal Reasoning | Understanding time sequences and ordering | What happens first/last in a sequence |
| Logical Reasoning | Basic deduction and inference | If A then B; not B; therefore not A |
| Causal Reasoning | Understanding cause and effect | Physical interactions and consequences |

Why It Matters

SimpleBench is important because it reveals a critical gap in AI evaluation:
  • Benchmarks can be misleading — Models that score 90%+ on graduate-level exams still fail trivial reasoning questions
  • Real-world reliability — Users expect models to handle simple questions correctly; failures here erode trust
  • Reasoning vs. pattern matching — SimpleBench distinguishes genuine understanding from statistical correlation
  • Safety implications — If a model can’t reason about basic cause-and-effect, it shouldn’t be trusted with consequential decisions

Notable Results

| Model | Accuracy | Date |
| --- | --- | --- |
| OpenAI o3 | ~83% | 2025 |
| Claude 3.5 Sonnet | ~75% | 2025 |
| GPT-4o | ~70% | 2025 |
| Gemini 2.0 Pro | ~68% | 2025 |
| Llama 3.1 405B | ~55% | 2025 |
Even the best models fail ~17% of questions that most humans would answer correctly. SimpleBench remains unsaturated despite rapid model improvements.

Limitations

  • English-only — Does not test multilingual reasoning
  • Multiple-choice format — Limits evaluation to discrete answers, not open-ended reasoning
  • Static dataset — Subject to contamination as it becomes widely known
