Overview
SWE-bench is the most widely cited benchmark for evaluating AI systems on real-world software engineering tasks. Created by researchers at Princeton University, it tests whether an AI model can resolve actual GitHub issues from popular open-source Python repositories. Unlike synthetic coding benchmarks, SWE-bench uses real bug reports and feature requests from projects such as Django, Flask, scikit-learn, and SymPy. The model must understand the codebase, localize the relevant files, and produce a patch that passes the project’s test suite.
Key Details
| Property | Value |
|---|---|
| Created by | Princeton NLP (Carlos E. Jimenez et al.) |
| Released | October 2023 |
| Task type | Code generation / Bug fixing |
| Dataset size | 2,294 task instances (full), 500 (Verified), 300 (Lite) |
| Languages | Python |
| Evaluation | Unit test pass rate |
| Leaderboard | swebench.com |
How It Works
- Input: The model receives a GitHub issue description and access to the repository codebase
- Task: Generate a code patch (diff) that resolves the issue
- Evaluation: The patch is applied to the repository and the project’s existing test suite is run
- Success: A task is “resolved” only if all relevant tests pass after applying the patch
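A minimal sketch of this loop for a single task instance is shown below. It assumes a local clone of the target repository, a candidate diff saved as `candidate.patch`, and pytest as the test runner; the official harness instead runs each instance in an isolated Docker environment with repository-specific install and test commands, so treat this purely as an illustration of the evaluation logic:

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str,
                   fail_to_pass: list[str]) -> bool:
    """Apply a candidate patch at the issue's base commit and check whether the
    tests the accepted fix is known to make pass now succeed."""
    # Check out the commit the issue was filed against.
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)

    # Apply the model-generated diff; a patch that fails to apply counts as unresolved.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False

    # Run the FAIL_TO_PASS tests. The real harness also re-runs the PASS_TO_PASS
    # tests to make sure the patch does not introduce regressions.
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0

# Hypothetical usage (paths, commit, and test ID are placeholders):
# resolved = evaluate_patch("path/to/repo", "<base-commit-sha>", "candidate.patch",
#                           ["tests/test_example.py::test_fix"])
```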
Variants
SWE-bench Full
The complete dataset of 2,294 task instances across 12 Python repositories. Due to noisy or ambiguous tasks, the full set is less commonly used for leaderboard ranking.
SWE-bench Verified
A human-validated subset of 500 tasks where annotators confirmed the problem statements are clear and the test cases are correct. This is the primary subset used for leaderboard rankings by most teams.
SWE-bench Lite
A smaller subset of 300 tasks designed for faster evaluation cycles. Useful for development and iteration.
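Each variant is distributed as a Hugging Face dataset. The snippet below assumes the `datasets` library and the dataset IDs `princeton-nlp/SWE-bench`, `princeton-nlp/SWE-bench_Verified`, and `princeton-nlp/SWE-bench_Lite`, and that each instance exposes fields such as `repo`, `base_commit`, `problem_statement`, and `FAIL_TO_PASS`; check the dataset cards for the authoritative schema:

```python
from datasets import load_dataset

# Test splits of the three variants (dataset IDs as published on Hugging Face).
full = load_dataset("princeton-nlp/SWE-bench", split="test")               # 2,294 instances
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")  # 500 instances
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")          # 300 instances

task = verified[0]
print(task["repo"], task["base_commit"])  # source repository and the commit the issue targets
print(task["problem_statement"][:300])    # the GitHub issue text the model receives
print(task["FAIL_TO_PASS"])               # tests the accepted fix turns from failing to passing
```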
Why It Matters
SWE-bench is the closest benchmark to real-world developer work:
- Tests end-to-end software engineering, not just code completion
- Requires understanding large codebases (thousands of files)
- Demands reasoning about test requirements and edge cases
- Patches must be production-quality (they run against real test suites)
Notable Results
| Model / System | SWE-bench Verified | Date |
|---|---|---|
| OpenAI o3 + Codex | ~65% | 2026 |
| Claude 3.5 Sonnet + SWE-agent | ~55% | 2025 |
| GPT-4o + Agentless | ~45% | 2025 |
| DeepSeek-V3 | ~42% | 2025 |
| Claude 3 Opus (baseline) | ~22% | 2024 |
Scores evolve rapidly. SWE-bench results depend heavily on the scaffolding (agent framework) used with the model, not just the model itself.
Limitations
- Python only — Does not test other programming languages
- Agent-dependent — Performance varies significantly based on the scaffolding / agent framework used
- Repository scope — Limited to 12 specific open-source projects
- Test-based evaluation — A correct fix that doesn’t match the expected test structure may be marked as failed
References
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Original paper by Jimenez et al. (2023)
- SWE-bench Leaderboard — Live rankings