## Overview
Terminal-Bench 2.0 evaluates AI models on their ability to perform complex terminal operations, from file management and system administration to scripting, debugging, and infrastructure tasks. It tests whether models can operate effectively in a real command-line environment with multi-step reasoning. Unlike simple code-completion benchmarks, Terminal-Bench 2.0 requires models to interact with an actual terminal, chain multiple commands, handle errors, and achieve specific system-state objectives.

## Key Details
| Property | Value |
|---|---|
| Created by | Terminal-Bench Team |
| Version | 2.0 (updated from v1) |
| Task type | Terminal / CLI task completion |
| Categories | File ops, networking, system admin, scripting, debugging |
| Evaluation | Task completion success rate |
| Environment | Sandboxed Linux terminal |
## How It Works
- Input: The model receives a natural language description of a terminal task (e.g., “Find all Python files modified in the last 7 days and compress them into a tarball”)
- Execution: The model generates and executes terminal commands in a sandboxed environment
- Multi-step: Tasks often require multiple sequential commands with intermediate checks
- Evaluation: Success is determined by verifying that the final system state matches the expected outcome (a worked example follows)
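To make this concrete, here is a plausible command sequence for the example task above, followed by the kind of state check a verifier could run. This is a sketch, not the benchmark's actual solution or grading code; the tarball name `recent_py.tar.gz` is arbitrary.

```bash
# Find Python files modified in the last 7 days and pack them into a
# tarball. -print0 / --null handle filenames containing spaces.
find . -name '*.py' -mtime -7 -print0 \
  | tar --null -czf recent_py.tar.gz --files-from=-

# State-based verification (illustrative): the tarball exists and
# actually contains .py entries.
tar -tzf recent_py.tar.gz | grep -c '\.py$'
```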
## Task Categories
| Category | Examples | Difficulty |
|---|---|---|
| File Management | Find, move, rename, compress, parse files | Easy-Medium |
| Text Processing | grep, sed, awk pipelines, log analysis (see example below) | Medium |
| System Administration | User management, service config, cron jobs | Medium-Hard |
| Networking | curl, wget, SSH tunneling, port scanning | Hard |
| Scripting | Write bash/python scripts from requirements | Hard |
| Debugging | Diagnose failing services, fix configurations | Very Hard |
| Infrastructure | Docker, databases, web server configuration | Very Hard |
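For a sense of what a Medium-difficulty text-processing task looks like in practice, here is a hypothetical pipeline of the style the table refers to. The log path and field positions assume a standard combined access-log format and are invented for illustration.

```bash
# Hypothetical task: report the ten endpoints with the most 5xx
# responses. In combined log format, field 9 is the status code and
# field 7 is the request path.
awk '$9 ~ /^5[0-9][0-9]$/ { counts[$7]++ }
     END { for (p in counts) print counts[p], p }' /var/log/nginx/access.log \
  | sort -rn \
  | head -10
```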
## Why It Matters

Terminal-Bench 2.0 tests a fundamentally different skill set from code-generation benchmarks:

- Practical operations: Tests tasks that developers and sysadmins actually perform daily
- Stateful reasoning: Commands depend on previous results, so models must track system state
- Error recovery: Real terminals produce unexpected errors; models must adapt (see the sketch below)
- Security awareness: Some tasks test whether models avoid dangerous operations (e.g., `rm -rf /`)
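As a sketch of the stateful, adaptive behavior these bullets describe, consider the following recovery pattern. The service name `webapp` and the config path are hypothetical; the point is that the second attempt is conditioned on what the first attempt revealed.

```bash
# Hypothetical recovery flow: restart a service; if that fails, read
# the logs and fix the detected cause before retrying, rather than
# re-running the same failing command.
if ! systemctl restart webapp; then
    journalctl -u webapp --no-pager -n 20 > /tmp/webapp.log
    if grep -q 'Permission denied' /tmp/webapp.log; then
        chmod 644 /etc/webapp/config.yaml   # hypothetical fix for the observed error
        systemctl restart webapp
    fi
fi
```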
## Notable Results
| Model | Task Completion Rate | Date |
|---|---|---|
| Claude 3.5 Sonnet | ~45% | 2025 |
| GPT-4o | ~38% | 2025 |
| Gemini 2.0 Pro | ~35% | 2025 |
Terminal-Bench 2.0 remains highly challenging — even the best models fail on more than half the tasks, particularly multi-step operations that require error recovery.
## Improvements over v1
- 2x more tasks with broader coverage
- Multi-turn interactions: Models can execute multiple commands sequentially
- Realistic environments: Sandboxed Linux with real package managers, file systems, and services
- Partial credit scoring: Intermediate progress is tracked, not just binary pass/fail (see the sketch below)
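As a minimal sketch of what checkpoint-style partial credit could look like from the verifier's side, the script below awards one point per satisfied checkpoint. The specific checks (user, cron job, script) are assumptions for illustration, not the benchmark's actual scoring code.

```bash
#!/usr/bin/env bash
# Hypothetical partial-credit verifier: each passing checkpoint adds to
# the score instead of producing a single pass/fail verdict.
score=0
total=3

# Checkpoint 1: the expected user exists.
id deploy >/dev/null 2>&1 && score=$((score + 1))

# Checkpoint 2: the user's crontab references the backup script.
crontab -l -u deploy 2>/dev/null | grep -q 'backup.sh' && score=$((score + 1))

# Checkpoint 3: the backup script is present and executable.
[ -x /usr/local/bin/backup.sh ] && score=$((score + 1))

echo "partial credit: $score/$total"
```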
## Limitations
- Linux-only: Does not test Windows or macOS terminal operations
- Sandboxed: Some real-world scenarios (cloud APIs, external services) are not reproducible
- Command-line focus: Does not test GUI or IDE-based workflows
## References
- Terminal-Bench 2.0 — Official benchmark repository and leaderboard