SWE-bench

Overview

SWE-bench is one of the most widely cited benchmarks for evaluating AI systems on real-world software engineering tasks. Created by researchers at Princeton University, it tests whether an AI model can resolve actual GitHub issues from popular open-source Python repositories. Unlike synthetic coding benchmarks, SWE-bench uses real bug reports and feature requests from projects such as Django, Flask, scikit-learn, and sympy. The model must understand the codebase, localize the relevant files, and produce a patch that passes the tests tied to the issue.

Key Details

Property	Value
Created by	Princeton NLP (Carlos E. Jimenez et al.)
Released	October 2023
Task type	Code generation / Bug fixing
Dataset size	2,294 task instances (full), 500 (Verified), 300 (Lite)
Languages	Python
Evaluation	Percentage of task instances resolved, judged by unit tests
Leaderboard	swebench.com

How It Works

Input: The model receives a GitHub issue description and access to the repository codebase
Task: Generate a code patch (diff) that resolves the issue
Evaluation: The patch is applied to the repository and the tests associated with the issue’s original fix are run
Success: A task is “resolved” only if the tests that failed before the fix now pass (fail-to-pass) and the tests that already passed still pass (pass-to-pass)
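
The success criterion above can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function name `is_resolved`, the status strings, and the test names are all hypothetical. It assumes each task lists the tests a fix should turn from failing to passing (fail-to-pass) and the tests that must keep passing (pass-to-pass).

```python
from typing import Dict, Iterable


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    test_status maps a test name to "PASSED" or "FAILED" (hypothetical labels).
    A test missing from test_status counts as not passed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical outcome of running the tests after applying a model's patch.
status = {
    "test_parse_empty_string": "PASSED",  # failed before the fix
    "test_parse_basic": "PASSED",         # passed before, still passes
    "test_parse_unicode": "FAILED",       # passed before, now broken
}

fixed_only = is_resolved(status, ["test_parse_empty_string"], ["test_parse_basic"])
with_regression = is_resolved(
    status, ["test_parse_empty_string"], ["test_parse_basic", "test_parse_unicode"]
)
print(fixed_only, with_regression)  # True False
```

The second call shows why a patch that fixes the reported issue can still fail: breaking a single previously passing test is enough to lose the task.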

GitHub Issue Description
        │
        ▼
┌──────────────────┐
│  AI Model        │  → Analyzes codebase + issue
│  (SWE-agent,     │  → Localizes relevant files
│   Devin, etc.)   │  → Generates patch
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Test Suite      │  → Applies patch to repo
│  Evaluation      │  → Runs unit tests
│                  │  → Pass/Fail verdict
└──────────────────┘

Variants

SWE-bench Full

The complete dataset of 2,294 task instances across 12 Python repositories. Due to noisy or ambiguous tasks, the full set is less commonly used for leaderboard ranking.

SWE-bench Verified

A human-validated subset of 500 tasks where annotators confirmed that the problem statements are clear and the test cases are correct. This is the subset most teams report on, and it backs the primary leaderboard.

SWE-bench Lite

A smaller subset of 300 tasks designed for faster evaluation cycles, which makes it useful for development and iteration.

Why It Matters

SWE-bench is among the benchmarks closest to real-world developer work:

Tests end-to-end software engineering, not just code completion
Requires understanding large codebases (thousands of files)
Demands reasoning about test requirements and edge cases
Patches are checked against the projects’ real test suites, not toy test cases

Performance on SWE-bench is often treated as a proxy for how useful a model is as an AI coding assistant in production workflows.

Notable Results

Model / System	SWE-bench Verified (% resolved)	Year
OpenAI o3 + Codex	~65%	2026
Claude 3.5 Sonnet + SWE-agent	~55%	2025
GPT-4o + Agentless	~45%	2025
DeepSeek-V3	~42%	2025
Claude 3 Opus (baseline)	~22%	2024

Scores change quickly. SWE-bench results depend heavily on the scaffolding (agent framework) wrapped around the model, not just on the model itself, so entries are only comparable when the scaffold is taken into account.
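
Because a leaderboard score is the share of tasks resolved, a percentage can be converted back into an approximate task count. The sketch below assumes a hypothetical per-task results mapping; the instance IDs and outcomes are made up for illustration, and the helper names are not part of any SWE-bench tooling.

```python
from typing import Dict


def resolve_rate(results: Dict[str, bool]) -> float:
    """Percentage of task instances marked resolved."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results.values()) / len(results)


def tasks_resolved(score_percent: float, dataset_size: int) -> int:
    """Approximate number of tasks behind a reported percentage."""
    return round(score_percent / 100.0 * dataset_size)


# Made-up outcomes for four hypothetical task instances.
results = {
    "repo_a-1001": True,
    "repo_a-1002": False,
    "repo_b-2001": True,
    "repo_b-2002": True,
}
print(resolve_rate(results))    # 75.0
print(tasks_resolved(65, 500))  # 325
print(tasks_resolved(22, 500))  # 110
```

On the 500-task Verified subset, one percentage point corresponds to five tasks, so a small gap between two entries reflects only a handful of issues.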

Limitations

Python only — Does not test other programming languages
Agent-dependent — Performance varies significantly based on the scaffolding / agent framework used
Repository scope — Limited to 12 specific open-source projects
Test-based evaluation — A correct fix that doesn’t match the expected test structure may be marked as failed
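
The test-based limitation is easiest to see with a toy case. In this hypothetical example (the function, the issue, and the test are invented for illustration and are not taken from the benchmark), two patches both stop negative input from being silently accepted, but the hidden test checks for one specific exception type, so only the patch that matches it would count as resolved.

```python
def sqrt_reference(x: float) -> float:
    """Reference-style fix: reject negative input with ValueError."""
    if x < 0:
        raise ValueError("x must be non-negative")
    return x ** 0.5


def sqrt_alternative(x: float) -> float:
    """Alternative fix: also rejects negative input, but with ArithmeticError."""
    if x < 0:
        raise ArithmeticError("x must be non-negative")
    return x ** 0.5


def hidden_test(fn) -> bool:
    """Stand-in for a hidden test written against the reference fix."""
    try:
        fn(-1.0)
    except ValueError:
        return True   # the only outcome the test accepts
    except Exception:
        return False  # a different exception type fails the test
    return False      # no exception at all also fails


print(hidden_test(sqrt_reference))    # True
print(hidden_test(sqrt_alternative))  # False
```

Whether the alternative is "correct" depends on how strictly the issue text pins down the expected behavior, which is part of what human validation of the Verified subset was meant to check.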

References

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Original paper by Jimenez et al. (2023)
SWE-bench Leaderboard — Live rankings
