## Overview
DeepResearchBench evaluates AI systems on their ability to conduct deep research — tasks that require gathering information from multiple sources, synthesizing findings, resolving contradictions, and producing comprehensive analytical reports. It tests the kind of research work that typically takes a human expert hours or days.
As “deep research” features are launched by major AI labs (OpenAI Deep Research, Gemini Deep Research, etc.), DeepResearchBench provides a standardized way to compare their capabilities.
## Key Details
| Property | Value |
| --- | --- |
| Created by | DeepResearchBench Team |
| Task type | Multi-source research and synthesis |
| Categories | Literature review, fact-checking, comparative analysis, investigative research |
| Evaluation | Research quality score (completeness, accuracy, sourcing) |
| Context requirements | Very long — requires processing 50-200+ pages of source material |
## How It Works
- Input: A research question or investigation brief (e.g., “Compare the effectiveness of three different approaches to carbon capture and provide a recommendation with evidence”)
- Sources: The system has access to a corpus of documents, papers, articles, and/or web search
- Research: The AI must gather relevant information, evaluate source quality, and synthesize findings
- Output: A comprehensive research report with citations, analysis, and conclusions
- Evaluation: Human experts and automated metrics assess quality across multiple dimensions
```
      Research Question
              │
              ▼
┌────────────────────────┐
│  Deep Research Agent   │
│ ┌────────────────────┐ │
│ │ Query Planning     │ │ → Identify what information is needed
│ │ Source Gathering   │ │ → Search, retrieve, filter documents
│ │ Analysis           │ │ → Extract key findings from each source
│ │ Synthesis          │ │ → Resolve contradictions, find patterns
│ │ Report Writing     │ │ → Structured output with citations
│ └────────────────────┘ │
└───────────┬────────────┘
            │
            ▼
Research Report + Citations
```
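The stages above map naturally onto a small amount of code. Below is a minimal sketch of how a task, a report, and the agent loop could be represented; the `ResearchTask`/`ResearchReport` fields and the `agent` methods are illustrative assumptions, not the benchmark's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchTask:
    """Illustrative task record; field names are assumptions, not the official schema."""
    question: str                                     # research question or investigation brief
    category: str                                     # e.g. "Comparative Analysis"
    corpus: list[str] = field(default_factory=list)   # document IDs / URLs available to the agent

@dataclass
class ResearchReport:
    body: str              # long-form analytical report
    citations: list[str]   # references backing the claims in the report

def run_deep_research(task: ResearchTask, agent) -> ResearchReport:
    """Hypothetical agent loop mirroring the five stages in the diagram above."""
    plan = agent.plan_queries(task.question)              # Query Planning
    sources = agent.gather_sources(plan, task.corpus)     # Source Gathering
    findings = [agent.analyze(doc) for doc in sources]    # Analysis
    synthesis = agent.synthesize(findings)                # Synthesis
    return agent.write_report(task.question, synthesis)   # Report Writing
```

A real agent would interleave these stages (for example, planning new queries after reading the first sources), but keeping them separate makes it easier to see where each evaluation dimension applies.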
## Evaluation Dimensions
| Dimension | What It Measures | Weight |
| --- | --- | --- |
| Completeness | Did the research cover all relevant aspects? | 25% |
| Accuracy | Are the facts and claims correct? | 25% |
| Source Quality | Were reliable, relevant sources used? | 15% |
| Synthesis | Were findings meaningfully integrated (not just summarized)? | 20% |
| Citation Quality | Are claims properly attributed with verifiable references? | 15% |
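To make the weighting concrete, here is a small sketch of how per-dimension scores could be combined into an overall research quality score. The weights come from the table above; the 0 to 1 score scale and the function name are assumptions for illustration.

```python
# Dimension weights from the table above (sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "source_quality": 0.15,
    "synthesis": 0.20,
    "citation_quality": 0.15,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

# Example: strong on accuracy, weaker on synthesis and source quality.
print(overall_score({
    "completeness": 0.70,
    "accuracy": 0.85,
    "source_quality": 0.60,
    "synthesis": 0.55,
    "citation_quality": 0.65,
}))  # ≈ 0.685
```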
## Task Categories
| Category | Description | Difficulty |
| --- | --- | --- |
| Literature Review | Summarize the state of research on a topic | Medium |
| Fact-Checking Investigation | Verify a complex claim using multiple sources | Medium-Hard |
| Comparative Analysis | Compare multiple approaches/products/policies | Hard |
| Trend Analysis | Identify patterns across temporal data | Hard |
| Investigative Research | Deep dive on a complex topic with conflicting sources | Very Hard |
| Multi-domain Synthesis | Research spanning multiple fields of expertise | Very Hard |
## Why It Matters
Deep research is one of the highest-value applications of AI:
- Knowledge work automation — Research tasks consume enormous amounts of human expert time
- Information overload — Modern research requires synthesizing far more sources than any human can read
- Quality assurance — Automated research must be accurate and well-sourced to be useful
- Long context stress test — Tests whether models can maintain coherence across very long information streams
- Real-world impact — Directly measures utility for analysts, researchers, journalists, and consultants
## Notable Results
| System | Research Quality Score | Date |
| --- | --- | --- |
| OpenAI Deep Research (o3) | ~62% | 2025 |
| Gemini Deep Research | ~58% | 2025 |
| Claude (extended thinking + search) | ~55% | 2025 |
| Perplexity Pro | ~50% | 2025 |
Deep research quality is subjective and harder to score automatically than most benchmark tasks. Human evaluation remains the gold standard, which makes large-scale benchmarking slower and more expensive.
## Key Challenges
- Source reliability — Models must assess whether sources are trustworthy, not just relevant
- Contradiction resolution — Real-world sources often disagree; the model must handle this explicitly
- Depth vs. breadth — Balancing comprehensive coverage with deep analysis of key findings
- Hallucinated citations — Models may fabricate references that don’t exist, a critical failure mode (a simple automated screen is sketched after this list)
- Recency — Training data cutoffs mean models may miss the most recent research
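As a rough illustration of how the hallucinated-citation failure mode can be screened for automatically, the sketch below checks whether each cited URL resolves at all. A reachable URL does not prove the source supports the claim, so this is only a first-pass filter; the helper names are hypothetical.

```python
import urllib.error
import urllib.request

def citation_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the cited URL responds with a non-error status.

    Fabricated references typically fail this check outright (bad domain,
    404, or a malformed URL), which catches the most blatant hallucinations.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def audit_citations(citations: list[str]) -> dict[str, bool]:
    """Map each citation in a report to whether it could be fetched at all."""
    return {url: citation_resolves(url) for url in citations}
```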
## Comparison with Other Long-Context Benchmarks
| Benchmark | Focus | Task Length | Output |
| --- | --- | --- | --- |
| DeepResearchBench | Multi-source research synthesis | Hours of research | Long-form report |
| RULER | Long-context retrieval | Single long document | Short answers |
| ∞Bench | Ultra-long context understanding | 100K+ tokens input | Short answers |
| LongBench | General long-context tasks | Medium-long documents | Various |
DeepResearchBench is unique in testing the full research pipeline — not just reading long documents, but actively gathering, evaluating, and synthesizing information.
## Limitations
- Subjective evaluation — Research quality assessment has inherent subjectivity
- Expensive to evaluate — Requires human expert reviewers for high-quality scoring
- Reproducibility — Web-based research tasks may yield different source material over time
- Domain coverage — Cannot cover all possible research domains equally
## References
- DeepResearchBench — Official benchmark and evaluation framework