Overview
SWE-bench is one of the most widely cited benchmarks for evaluating AI systems on real-world software engineering tasks. Created by researchers at Princeton University, it tests whether an AI model can resolve actual GitHub issues from popular open-source Python repositories.
Unlike synthetic coding benchmarks, SWE-bench uses real bug reports and feature requests from projects like Django, Flask, scikit-learn, sympy, and more. The model must understand the codebase, localize the relevant files, and produce a patch that passes the project’s test suite.
Key Details
| Property | Value |
|---|---|
| Created by | Princeton NLP (Carlos E. Jimenez et al.) |
| Released | October 2023 |
| Task type | Code generation / Bug fixing |
| Dataset size | 2,294 task instances (full), 500 (Verified), 300 (Lite) |
| Languages | Python |
| Evaluation | Percentage of task instances resolved (all relevant tests pass) |
| Leaderboard | swebench.com |
How It Works
- Input: The model receives a GitHub issue description and access to the repository codebase
- Task: Generate a code patch (diff) that resolves the issue
- Evaluation: The patch is applied to the repository and the tests tied to the issue are run: tests that the reference fix turns from failing to passing, plus existing tests that must keep passing
- Success: A task is “resolved” only if all relevant tests pass after applying the patch
GitHub Issue Description
         │
         ▼
┌──────────────────┐
│     AI Model     │ → Analyzes codebase + issue
│   (SWE-agent,    │ → Localizes relevant files
│   Devin, etc.)   │ → Generates patch
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    Test Suite    │ → Applies patch to repo
│    Evaluation    │ → Runs unit tests
│                  │ → Pass/Fail verdict
└──────────────────┘
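The pass/fail verdict at the end of this flow can be sketched in a few lines of Python. This is a simplified illustration, not the official harness: `TaskInstance` and `is_resolved` are hypothetical names, and the split into `fail_to_pass` and `pass_to_pass` test lists is modeled on the `FAIL_TO_PASS` and `PASS_TO_PASS` fields of the published dataset. Running the tests themselves is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Simplified stand-in for one benchmark task (hypothetical class)."""
    instance_id: str
    problem_statement: str
    # Tests that fail before the fix and should pass after it.
    fail_to_pass: list[str] = field(default_factory=list)
    # Tests that already pass and must not be broken by the patch.
    pass_to_pass: list[str] = field(default_factory=list)


def is_resolved(task: TaskInstance, test_outcomes: dict[str, bool]) -> bool:
    """Return True only if every relevant test passed after the patch.

    A test that is missing from `test_outcomes` (for example because the
    patch did not apply) is treated as failed.
    """
    relevant = task.fail_to_pass + task.pass_to_pass
    return all(test_outcomes.get(name, False) for name in relevant)


# Usage with made-up test names and outcomes.
task = TaskInstance(
    instance_id="demo-1",
    problem_statement="Example issue text",
    fail_to_pass=["test_new_behavior"],
    pass_to_pass=["test_existing_behavior"],
)
print(is_resolved(task, {"test_new_behavior": True, "test_existing_behavior": True}))
print(is_resolved(task, {"test_new_behavior": True, "test_existing_behavior": False}))
```

The second call shows why a patch that fixes the issue but breaks existing behavior does not count as resolved.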
Variants
SWE-bench Full
The complete dataset of 2,294 task instances across 12 Python repositories. Due to noisy or ambiguous tasks, the full set is less commonly used for leaderboard ranking.
SWE-bench Verified
A human-validated subset of 500 tasks where annotators confirmed the problem statements are clear and the test cases are correct. This is the primary leaderboard used by most teams.
SWE-bench Lite
A smaller subset of 300 tasks designed for faster evaluation cycles. Useful for development and iteration.
Why It Matters
SWE-bench is among the benchmarks closest to real-world developer work:
- Tests end-to-end software engineering, not just code completion
- Requires understanding large codebases (thousands of files)
- Demands reasoning about test requirements and edge cases
- Patches must actually work (they are run against real project test suites)
Performance on SWE-bench is widely treated as a proxy for how useful a model is as an AI coding assistant in production workflows, although a benchmark score alone does not guarantee it.
Notable Results
| Model / System | SWE-bench Verified | Date |
|---|---|---|
| OpenAI o3 + Codex | ~65% | 2026 |
| Claude 3.5 Sonnet + SWE-agent | ~55% | 2025 |
| GPT-4o + Agentless | ~45% | 2025 |
| DeepSeek-V3 | ~42% | 2025 |
| Claude 3 Opus (baseline) | ~22% | 2024 |
Scores evolve rapidly. SWE-bench results depend heavily on the scaffolding (agent framework) used with the model, not just the model itself.
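To read the percentages above, it helps to see how a resolved rate is derived from per-task verdicts. The sketch below assumes the score is simply resolved tasks divided by all tasks in the variant, with unattempted tasks counted as unresolved; `resolved_rate` and the task IDs are hypothetical.

```python
def resolved_rate(results: dict[str, bool], total_instances: int) -> float:
    """Percentage of benchmark instances resolved.

    `results` maps a task ID to its verdict. Tasks missing from `results`
    still count toward `total_instances`, so they lower the score.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    resolved = sum(1 for ok in results.values() if ok)
    return 100.0 * resolved / total_instances


# Made-up verdicts for a 500-task variant: 325 resolved, 175 not.
verdicts = {f"task-{i}": i < 325 for i in range(500)}
print(resolved_rate(verdicts, total_instances=500))  # 65.0

# On a 500-task set, each task is worth 0.2 percentage points.
print(100.0 / 500)  # 0.2
```

Because a single task moves a 500-task score by only 0.2 points, small gaps between entries can come down to a handful of tasks.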
Limitations
- Python only — Does not test other programming languages
- Agent-dependent — Performance varies significantly based on the scaffolding / agent framework used
- Repository scope — Limited to 12 specific open-source projects
- Test-based evaluation — A correct fix that doesn’t match the expected test structure may be marked as failed
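The test-based evaluation limitation can be made concrete with a toy case. Everything here is hypothetical (the functions, the error messages, and the test are invented for illustration): two fixes both reject negative input as an issue might request, but the test was written against the reference fix and checks its exact error message.

```python
def parse_age_reference(text: str) -> int:
    """Reference fix: rejects negative ages with a specific message."""
    value = int(text)
    if value < 0:
        raise ValueError("age must be non-negative")
    return value


def parse_age_alternative(text: str) -> int:
    """Alternative fix: same behavior, different wording of the error."""
    value = int(text)
    if value < 0:
        raise ValueError("negative ages are not allowed")
    return value


def hidden_test(parse) -> bool:
    """Test written against the reference fix; it checks the exact message."""
    try:
        parse("-1")
    except ValueError as exc:
        return str(exc) == "age must be non-negative"
    return False


print(hidden_test(parse_age_reference))    # True
print(hidden_test(parse_age_alternative))  # False
```

Both functions handle the reported problem, yet only the one that matches the reference wording would be scored as resolved under a test like this.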
References