Overview
GeoBench is a benchmark that evaluates AI models on geospatial reasoning and geographic knowledge — the ability to understand maps, spatial relationships, coordinate systems, satellite imagery, and Earth science concepts. It tests whether models can serve as reliable tools for geographic analysis, urban planning, environmental science, and location-based reasoning.
As AI systems are increasingly used in GIS (Geographic Information Systems), climate science, and geospatial intelligence, GeoBench provides a standardized way to measure these capabilities.
Key Details
| Property | Value |
|---|---|
| Created by | GeoBench Research Team |
| Task type | Geospatial reasoning and knowledge |
| Categories | Map reading, spatial analysis, remote sensing, geoscience |
| Format | Multiple-choice, coordinate prediction, spatial reasoning |
| Evaluation | Accuracy, spatial error distance |
How It Works
- Input: A geographic question, which may include text descriptions, coordinate data, or map/satellite imagery
- Task: Answer questions about spatial relationships, identify locations, and analyze geographic patterns
- Evaluation: Answers are scored for correctness; coordinate predictions are scored by their distance from the ground-truth location (see the sketch below)
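GeoBench's own scoring code is not reproduced here, but the distance-based scoring of coordinate predictions can be illustrated with a great-circle (haversine) calculation. The function name, the 6371 km Earth radius, and the sample coordinates below are illustrative assumptions, not part of the benchmark.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees.

    Illustrative sketch of distance-based scoring: a coordinate prediction
    is graded by how far it lands from the ground-truth location.
    """
    r = 6371.0  # mean Earth radius in km (assumed constant)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: error between a predicted and an actual location
print(round(haversine_km(40.7128, -74.0060, 40.0, -75.0), 1))  # ~115.7 km
```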
Task Categories
| Category | Description | Example Questions |
|---|---|---|
| Geographic Knowledge | Factual knowledge about places, borders, features | "What country borders both France and Portugal?" |
| Spatial Reasoning | Understanding distances, directions, and relationships | "Which city is closest to the midpoint of NYC and LA?" |
| Map Interpretation | Reading and analyzing map data | "Based on this topographic map, which route avoids elevations above 2000m?" |
| Remote Sensing | Analyzing satellite/aerial imagery | "Identify land use categories in this satellite image" |
| Coordinate Systems | Working with lat/long, projections, and GIS formats | "Convert these UTM coordinates to decimal degrees" |
| Earth Science | Climate, geology, hydrology, and environmental systems | "Based on these soil and rainfall patterns, which area is at highest flood risk?" |
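GeoBench does not prescribe tooling for questions like the UTM-to-decimal-degrees example above, but they are typically handled with a coordinate reference system (CRS) transform. The sketch below assumes the pyproj library is available and uses illustrative values: UTM zone 18N (EPSG:32618) and a sample point near lower Manhattan.

```python
from pyproj import Transformer  # assumes pyproj is installed

# UTM zone 18N (EPSG:32618) -> WGS84 decimal degrees (EPSG:4326)
to_wgs84 = Transformer.from_crs("EPSG:32618", "EPSG:4326", always_xy=True)

easting, northing = 583_960.0, 4_507_523.0  # illustrative point near lower Manhattan
lon, lat = to_wgs84.transform(easting, northing)
print(f"lat={lat:.4f}, lon={lon:.4f}")  # roughly 40.71 N, -74.01 W
```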
Why It Matters
Geospatial reasoning is critical for many high-impact applications:
- Climate and environment — AI-assisted analysis of environmental change, disaster prediction, and resource management
- Urban planning — Evaluating site suitability, transportation routing, and infrastructure planning
- Intelligence and security — Geospatial analysis for situational awareness
- Navigation and logistics — Optimizing routes and understanding spatial constraints
- Scientific research — Supporting geologists, ecologists, and climate scientists
GeoBench tests whether models can move beyond textual knowledge to spatial understanding — a capability that requires fundamentally different reasoning skills.
Notable Results
| Model | Accuracy | Date |
|---|---|---|
| Gemini 2.0 Pro (multimodal) | ~70% | 2025 |
| GPT-4o (multimodal) | ~65% | 2025 |
| Claude 3.5 Sonnet | ~60% | 2025 |
Performance varies considerably between text-only and multimodal tasks; models with strong vision capabilities hold a clear advantage on map-reading and remote-sensing tasks.
Evaluation Metrics
| Metric | Description |
|---|---|
| Accuracy | Percentage of correct answers for discrete questions |
| Spatial Error | Average distance (km) between predicted and actual coordinates |
| Category Breakdown | Performance split across geographic knowledge, spatial reasoning, remote sensing, etc. |
| Multimodal Gap | Difference between text-based and image-based task performance |
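As a rough illustration of how these metrics combine, the sketch below aggregates a list of per-item results into overall accuracy, mean spatial error, and a per-category breakdown. The record fields and category names are assumptions for the example, not the benchmark's actual schema.

```python
from collections import defaultdict
from statistics import mean

def score(records):
    """Aggregate GeoBench-style metrics from per-item records.

    Each record is assumed to hold: category, correct (bool) for discrete
    questions, and optionally error_km for coordinate-prediction items.
    """
    per_category = defaultdict(list)
    errors_km = []
    for r in records:
        per_category[r["category"]].append(r["correct"])
        if r.get("error_km") is not None:
            errors_km.append(r["error_km"])
    return {
        "accuracy": mean(c for cs in per_category.values() for c in cs),
        "spatial_error_km": mean(errors_km) if errors_km else None,
        "category_breakdown": {k: mean(v) for k, v in per_category.items()},
    }

# Example with made-up records
print(score([
    {"category": "spatial_reasoning", "correct": True},
    {"category": "remote_sensing", "correct": False},
    {"category": "coordinate_systems", "correct": True, "error_km": 12.4},
]))
```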
Limitations
- English-centric — Geographic naming and descriptions are primarily in English
- Static data — Geographic and political boundaries change; the benchmark requires maintenance
- Image quality — Satellite imagery tasks depend on resolution and clarity of provided images
- Western bias — Coverage of geographic locations may be uneven
References
- GeoBench — Official benchmark and evaluation framework