## Overview
DeepResearchBench evaluates AI systems on their ability to conduct deep research — tasks that require gathering information from multiple sources, synthesizing findings, resolving contradictions, and producing comprehensive analytical reports. It tests the kind of research work that typically takes a human expert hours or days.
As “deep research” features are launched by major AI labs (OpenAI Deep Research, Gemini Deep Research, etc.), DeepResearchBench provides a standardized way to compare their capabilities.
## Key Details
| Property | Value |
| --- | --- |
| Created by | DeepResearchBench Team |
| Task type | Multi-source research and synthesis |
| Categories | Literature review, fact-checking, comparative analysis, investigative research |
| Evaluation | Research quality score (completeness, accuracy, sourcing) |
| Context requirements | Very long — requires processing 50-200+ pages of source material |
## How It Works
- Input: A research question or investigation brief (e.g., “Compare the effectiveness of three different approaches to carbon capture and provide a recommendation with evidence”)
- Sources: The system has access to a corpus of documents, papers, articles, and/or web search
- Research: The AI must gather relevant information, evaluate source quality, and synthesize findings
- Output: A comprehensive research report with citations, analysis, and conclusions
- Evaluation: Human experts and automated metrics assess quality across multiple dimensions
```
      Research Question
              │
              ▼
┌────────────────────────┐
│  Deep Research Agent   │
│ ┌────────────────────┐ │
│ │ Query Planning     │ │ → Identify what information is needed
│ │ Source Gathering   │ │ → Search, retrieve, filter documents
│ │ Analysis           │ │ → Extract key findings from each source
│ │ Synthesis          │ │ → Resolve contradictions, find patterns
│ │ Report Writing     │ │ → Structured output with citations
│ └────────────────────┘ │
└───────────┬────────────┘
            │
            ▼
Research Report + Citations
```
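The stages above map naturally onto a small amount of code. Below is a minimal sketch of how a task, a report, and the agent loop could be represented; the `ResearchTask`/`ResearchReport` fields and the `agent` methods are illustrative assumptions, not the benchmark's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchTask:
    """Illustrative task record; field names are assumptions, not the official schema."""
    question: str                                     # research question or investigation brief
    category: str                                     # e.g. "Comparative Analysis"
    corpus: list[str] = field(default_factory=list)   # document IDs / URLs available to the agent

@dataclass
class ResearchReport:
    body: str              # long-form analytical report
    citations: list[str]   # references backing the claims in the report

def run_deep_research(task: ResearchTask, agent) -> ResearchReport:
    """Hypothetical agent loop mirroring the five stages in the diagram above."""
    plan = agent.plan_queries(task.question)              # Query Planning
    sources = agent.gather_sources(plan, task.corpus)     # Source Gathering
    findings = [agent.analyze(doc) for doc in sources]    # Analysis
    synthesis = agent.synthesize(findings)                # Synthesis
    return agent.write_report(task.question, synthesis)   # Report Writing
```

A real agent would interleave these stages (for example, planning new queries after reading the first sources), but keeping them separate makes it easier to see where each evaluation dimension applies.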
## Evaluation Dimensions
| Dimension | What It Measures | Weight |
| --- | --- | --- |
| Completeness | Did the research cover all relevant aspects? | 25% |
| Accuracy | Are the facts and claims correct? | 25% |
| Source Quality | Were reliable, relevant sources used? | 15% |
| Synthesis | Were findings meaningfully integrated (not just summarized)? | 20% |
| Citation Quality | Are claims properly attributed with verifiable references? | 15% |
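To make the weighting concrete, here is a small sketch of how per-dimension scores could be combined into an overall research quality score. The weights come from the table above; the 0 to 1 score scale and the function name are assumptions for illustration.

```python
# Dimension weights from the table above (sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "source_quality": 0.15,
    "synthesis": 0.20,
    "citation_quality": 0.15,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

# Example: strong on accuracy, weaker on synthesis and source quality.
print(overall_score({
    "completeness": 0.70,
    "accuracy": 0.85,
    "source_quality": 0.60,
    "synthesis": 0.55,
    "citation_quality": 0.65,
}))  # ≈ 0.685
```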
## Task Categories
| Category | Description | Difficulty |
| --- | --- | --- |
| Literature Review | Summarize the state of research on a topic | Medium |
| Fact-Checking Investigation | Verify a complex claim using multiple sources | Medium-Hard |
| Comparative Analysis | Compare multiple approaches/products/policies | Hard |
| Trend Analysis | Identify patterns across temporal data | Hard |
| Investigative Research | Deep dive on a complex topic with conflicting sources | Very Hard |
| Multi-domain Synthesis | Research spanning multiple fields of expertise | Very Hard |
## Why It Matters
Deep research is one of the highest-value applications of AI:
- Knowledge work automation — Research tasks consume enormous amounts of human expert time
- Information overload — Modern research requires synthesizing far more sources than any human can read
- Quality assurance — Automated research must be accurate and well-sourced to be useful
- Long context stress test — Tests whether models can maintain coherence across very long information streams
- Real-world impact — Directly measures utility for analysts, researchers, journalists, and consultants
## Notable Results
| System | Research Quality Score | Date |
| --- | --- | --- |
| OpenAI Deep Research (o3) | ~62% | 2025 |
| Gemini Deep Research | ~58% | 2025 |
| Claude (extended thinking + search) | ~55% | 2025 |
| Perplexity Pro | ~50% | 2025 |
Deep research quality is subjective and harder to score automatically than most benchmark tasks. Human evaluation remains the gold standard, which makes large-scale benchmarking slower and more expensive.
## Key Challenges
- Source reliability — Models must assess whether sources are trustworthy, not just relevant
- Contradiction resolution — Real-world sources often disagree; the model must handle this explicitly
- Depth vs. breadth — Balancing comprehensive coverage with deep analysis of key findings
- Hallucinated citations — Models may fabricate references that don’t exist, a critical failure mode (a simple automated screen is sketched after this list)
- Recency — Training data cutoffs mean models may miss the most recent research
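As a rough illustration of how the hallucinated-citation failure mode can be screened for automatically, the sketch below checks whether each cited URL resolves at all. A reachable URL does not prove the source supports the claim, so this is only a first-pass filter; the helper names are hypothetical.

```python
import urllib.error
import urllib.request

def citation_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the cited URL responds with a non-error status.

    Fabricated references typically fail this check outright (bad domain,
    404, or a malformed URL), which catches the most blatant hallucinations.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def audit_citations(citations: list[str]) -> dict[str, bool]:
    """Map each citation in a report to whether it could be fetched at all."""
    return {url: citation_resolves(url) for url in citations}
```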
## Comparison with Other Long-Context Benchmarks
| Benchmark | Focus | Task Length | Output |
| --- | --- | --- | --- |
| DeepResearchBench | Multi-source research synthesis | Hours of research | Long-form report |
| RULER | Long-context retrieval | Single long document | Short answers |
| ∞Bench | Ultra-long context understanding | 100K+ tokens input | Short answers |
| LongBench | General long-context tasks | Medium-long documents | Various |
DeepResearchBench is unique in testing the full research pipeline — not just reading long documents, but actively gathering, evaluating, and synthesizing information.
## Limitations
- Subjective evaluation — Research quality assessment has inherent subjectivity
- Expensive to evaluate — Requires human expert reviewers for high-quality scoring
- Reproducibility — Web-based research tasks may yield different source material over time
- Domain coverage — Cannot cover all possible research domains equally
## References
- DeepResearchBench — Official benchmark and evaluation framework