Overview

DeepResearchBench evaluates AI systems on their ability to conduct deep research — tasks that require gathering information from multiple sources, synthesizing findings, resolving contradictions, and producing comprehensive analytical reports. It tests the kind of research work that typically takes a human expert hours or days. As “deep research” features are launched by major AI labs (OpenAI Deep Research, Gemini Deep Research, etc.), DeepResearchBench provides a standardized way to compare their capabilities.

Key Details

| Property | Value |
| --- | --- |
| Created by | DeepResearchBench Team |
| Task type | Multi-source research and synthesis |
| Categories | Literature review, fact-checking, comparative analysis, investigative research |
| Evaluation | Research quality score (completeness, accuracy, sourcing) |
| Context requirements | Very long — requires processing 50-200+ pages of source material |

How It Works

  1. Input: A research question or investigation brief (e.g., “Compare the effectiveness of three different approaches to carbon capture and provide a recommendation with evidence”)
  2. Sources: The system has access to a corpus of documents, papers, articles, and/or web search
  3. Research: The AI must gather relevant information, evaluate source quality, and synthesize findings
  4. Output: A comprehensive research report with citations, analysis, and conclusions
  5. Evaluation: Human experts and automated metrics assess quality across multiple dimensions

    Research Question
            │
            ▼
┌───────────────────────┐
│  Deep Research Agent  │
│  ┌──────────────────┐ │
│  │ Query Planning   │ │ → Identify what information is needed
│  │ Source Gathering │ │ → Search, retrieve, filter documents
│  │ Analysis         │ │ → Extract key findings from each source
│  │ Synthesis        │ │ → Resolve contradictions, find patterns
│  │ Report Writing   │ │ → Structured output with citations
│  └──────────────────┘ │
└───────────┬───────────┘
            │
            ▼
 Research Report + Citations
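
In code, that loop might look like the following minimal sketch. Every helper below is a deliberately naive stand-in (keyword retrieval, sentence-level findings) chosen only to make the five stages concrete; DeepResearchBench scores outputs and does not prescribe an agent architecture.

```python
# Hypothetical sketch of a deep research agent's control loop, following the
# five stages in the diagram above. All helpers are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Finding:
    claim: str    # an extracted key statement
    source: str   # provenance: document id or URL

def plan_queries(question: str) -> list[str]:
    # Stage 1, query planning: decompose the brief into sub-queries.
    return [part.strip() for part in question.split(" and ")]

def retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    # Stage 2, source gathering: keyword match stands in for real search.
    return [doc_id for doc_id, text in corpus.items()
            if any(word.lower() in text.lower() for word in query.split())]

def extract_findings(doc_id: str, text: str) -> list[Finding]:
    # Stage 3, analysis: treat each sentence as a candidate finding.
    return [Finding(s.strip(), doc_id) for s in text.split(".") if s.strip()]

def synthesize(findings: list[Finding]) -> list[Finding]:
    # Stage 4, synthesis: deduplicate claims. A real agent would also
    # resolve contradictions between sources and group findings by theme.
    seen: set[str] = set()
    merged = []
    for finding in findings:
        if finding.claim not in seen:
            seen.add(finding.claim)
            merged.append(finding)
    return merged

def write_report(question: str, findings: list[Finding]) -> str:
    # Stage 5, report writing: structured output with inline citations.
    lines = [question, "=" * len(question)]
    lines += [f"- {f.claim} [{f.source}]" for f in findings]
    return "\n".join(lines)

corpus = {"doc1": "Direct air capture is energy-intensive. Costs are falling."}
queries = plan_queries("carbon capture approaches and their costs")
docs = sorted({d for q in queries for d in retrieve(q, corpus)})
findings = [f for d in docs for f in extract_findings(d, corpus[d])]
print(write_report("Carbon capture: approaches and costs", synthesize(findings)))
```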

Evaluation Dimensions

| Dimension | What It Measures | Weight |
| --- | --- | --- |
| Completeness | Did the research cover all relevant aspects? | 25% |
| Accuracy | Are the facts and claims correct? | 25% |
| Source Quality | Were reliable, relevant sources used? | 15% |
| Synthesis | Were findings meaningfully integrated (not just summarized)? | 20% |
| Citation Quality | Are claims properly attributed with verifiable references? | 15% |
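
The weights above combine per-dimension scores into the single research quality score. A minimal sketch of that aggregation, assuming each dimension is scored on a 0-1 scale (the names and scale are illustrative assumptions, not the benchmark's official scoring code):

```python
# Hypothetical weighted aggregation of the five dimension scores, using the
# weights from the table above. Not the benchmark's official scoring code.
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "source_quality": 0.15,
    "synthesis": 0.20,
    "citation_quality": 0.15,
}

def research_quality_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; returns a value in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)

# Example: a report that is complete and accurate but weakly cited.
print(research_quality_score({
    "completeness": 0.8,
    "accuracy": 0.9,
    "source_quality": 0.6,
    "synthesis": 0.7,
    "citation_quality": 0.4,
}))  # 0.8*0.25 + 0.9*0.25 + 0.6*0.15 + 0.7*0.20 + 0.4*0.15 = 0.715
```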

Task Categories

| Category | Description | Difficulty |
| --- | --- | --- |
| Literature Review | Summarize the state of research on a topic | Medium |
| Fact-Checking Investigation | Verify a complex claim using multiple sources | Medium-Hard |
| Comparative Analysis | Compare multiple approaches/products/policies | Hard |
| Trend Analysis | Identify patterns across temporal data | Hard |
| Investigative Research | Deep dive on a complex topic with conflicting sources | Very Hard |
| Multi-domain Synthesis | Research spanning multiple fields of expertise | Very Hard |

Why It Matters

Deep research is one of the highest-value applications of AI:
  • Knowledge work automation — Research tasks consume enormous amounts of human expert time
  • Information overload — Modern research requires synthesizing far more sources than any human can read
  • Quality assurance — Automated research must be accurate and well-sourced to be useful
  • Long context stress test — Tests whether models can maintain coherence across very long information streams
  • Real-world impact — Directly measures utility for analysts, researchers, journalists, and consultants

Notable Results

| System | Research Quality Score | Date |
| --- | --- | --- |
| OpenAI Deep Research (o3) | ~62% | 2025 |
| Gemini Deep Research | ~58% | 2025 |
| Claude (extended thinking + search) | ~55% | 2025 |
| Perplexity Pro | ~50% | 2025 |

Deep research quality is subjective and harder to evaluate automatically than most benchmark tasks. Human evaluation remains the gold standard, which makes large-scale benchmarking more expensive and slower.

Key Challenges

  1. Source reliability — Models must assess whether sources are trustworthy, not just relevant
  2. Contradiction resolution — Real-world sources often disagree; the model must handle this explicitly
  3. Depth vs. breadth — Balancing comprehensive coverage with deep analysis of key findings
  4. Hallucinated citations — Models may fabricate references that don’t exist, a critical failure mode (see the sketch after this list)
  5. Recency — Training data cutoffs mean models may miss the most recent research
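
Hallucinated citations (challenge 4) are the failure mode most amenable to cheap automated screening: every cited reference should at least resolve. A minimal sketch, assuming the report exposes its citations as (claim, URL) pairs; this catches fabricated links, though not real sources quoted inaccurately:

```python
# Hypothetical screen for hallucinated citations: confirm each cited URL
# resolves. The (claim, url) report format is an illustrative assumption.
import urllib.error
import urllib.request

def check_citations(citations: list[tuple[str, str]]) -> list[str]:
    """Return the claims whose cited URL does not resolve."""
    broken = []
    for claim, url in citations:
        try:
            req = urllib.request.Request(url, method="HEAD")
            urllib.request.urlopen(req, timeout=10)
        except (urllib.error.URLError, ValueError):
            broken.append(claim)
    return broken

# Example with placeholder citations from a hypothetical report.
report_citations = [
    ("Direct air capture costs fell since 2020", "https://example.org/dac-costs"),
    ("BECCS is net-negative at scale", "https://doi.org/10.0000/fake-doi"),
]
for claim in check_citations(report_citations):
    print(f"Unverifiable citation: {claim}")
```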

Comparison with Other Long-Context Benchmarks

| Benchmark | Focus | Task Length | Output |
| --- | --- | --- | --- |
| DeepResearchBench | Multi-source research synthesis | Hours of research | Long-form report |
| RULER | Long-context retrieval | Single long document | Short answers |
| ∞Bench | Ultra-long context understanding | 100K+ tokens input | Short answers |
| LongBench | General long-context tasks | Medium-long documents | Various |

DeepResearchBench is unique in testing the full research pipeline — not just reading long documents, but actively gathering, evaluating, and synthesizing information.

Limitations

  • Subjective evaluation — Research quality assessment has inherent subjectivity
  • Expensive to evaluate — Requires human expert reviewers for high-quality scoring
  • Reproducibility — Web-based research tasks may yield different source material over time
  • Domain coverage — Cannot cover all possible research domains equally

References

  • DeepResearchBench — Official benchmark and evaluation framework