
Overview

Terminal-Bench 2.0 evaluates AI models on their ability to perform complex terminal operations — from file management and system administration to scripting, debugging, and infrastructure tasks. It tests whether models can operate effectively in a real command-line environment with multi-step reasoning. Unlike simple code-completion benchmarks, Terminal-Bench 2.0 requires models to interact with an actual terminal, chain multiple commands, handle errors, and achieve specific system-state objectives.

Key Details

| Property | Value |
|---|---|
| Created by | Terminal-Bench Team |
| Version | 2.0 (updated from v1) |
| Task type | Terminal / CLI task completion |
| Categories | File ops, networking, system admin, scripting, debugging |
| Evaluation | Task completion success rate |
| Environment | Sandboxed Linux terminal |

How It Works

  1. Input: The model receives a natural language description of a terminal task (e.g., “Find all Python files modified in the last 7 days and compress them into a tarball”)
  2. Execution: The model generates and executes terminal commands in a sandboxed environment
  3. Multi-step: Tasks often require multiple sequential commands with intermediate checks
  4. Evaluation: Success is determined by verifying the final system state matches the expected outcome

Task Description (Natural Language)
         │
         ▼
┌──────────────────┐
│  AI Model        │  → Plans command sequence
│                  │  → Executes commands
│                  │  → Handles errors/retries
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Sandbox         │  → Verifies final state
│  Verification    │  → File existence, content,
│                  │    permissions, processes
└──────────────────┘
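
To make the flow concrete, here is a sketch of the kind of command sequence a model might emit for the tarball task in step 1, with an intermediate check between steps. It is illustrative only: the archive name recent_python.tar.gz and the temp-file path are assumptions, not part of the benchmark.

```bash
# Step 1: enumerate matching files and save the list for inspection
find . -name '*.py' -mtime -7 > /tmp/recent_py.txt

# Intermediate check: confirm the search actually matched something
wc -l /tmp/recent_py.txt

# Step 2: build the archive only if the list is non-empty
if [ -s /tmp/recent_py.txt ]; then
    tar -czf recent_python.tar.gz --files-from=/tmp/recent_py.txt
fi

# Step 3: self-verify the final state before the sandbox checks it
tar -tzf recent_python.tar.gz | head
```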

Task Categories

| Category | Examples | Difficulty |
|---|---|---|
| File Management | Find, move, rename, compress, parse files | Easy-Medium |
| Text Processing | grep, sed, awk pipelines, log analysis | Medium |
| System Administration | User management, service config, cron jobs | Medium-Hard |
| Networking | curl, wget, SSH tunneling, port scanning | Hard |
| Scripting | Write bash/python scripts from requirements | Hard |
| Debugging | Diagnose failing services, fix configurations | Very Hard |
| Infrastructure | Docker, databases, web server configuration | Very Hard |
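
As a concrete illustration of the Text Processing category, a Medium task might ask for a log-analysis pipeline along these lines. The log path and format here are assumptions for the example, not taken from the benchmark:

```bash
# Top 10 client IPs producing 5xx responses in an nginx access log
# (field 9 is the HTTP status code in the combined log format)
awk '$9 ~ /^5[0-9][0-9]$/ { print $1 }' /var/log/nginx/access.log \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -10
```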

Why It Matters

Terminal-Bench 2.0 tests a fundamentally different skill set from code-generation benchmarks:
  • Practical operations — Tests tasks that developers and sysadmins actually perform daily
  • Stateful reasoning — Commands depend on previous results; models must track system state
  • Error recovery — Real terminals produce unexpected errors; models must adapt
  • Security awareness — Some tasks test whether models avoid dangerous operations (e.g., rm -rf /)
As AI models are increasingly used as coding agents with terminal access (Claude Code, Cursor, Windsurf), this benchmark directly measures real-world utility.
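
A minimal sketch of the stateful reasoning and error recovery patterns described above: check exit codes, gather evidence on failure, and confirm system state before moving on. The service name myapp is hypothetical:

```bash
# Attempt a restart; on failure, collect diagnostics instead of blindly retrying
if ! systemctl restart myapp.service; then
    systemctl status myapp.service --no-pager
    journalctl -u myapp.service -n 20 --no-pager
fi

# Stateful check: confirm the process is actually running before proceeding
if pgrep -f myapp >/dev/null; then
    echo "myapp is up"
else
    echo "myapp still down"
fi
```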

Notable Results

| Model | Task Completion Rate | Date |
|---|---|---|
| Claude 3.5 Sonnet | ~45% | 2025 |
| GPT-4o | ~38% | 2025 |
| Gemini 2.0 Pro | ~35% | 2025 |

Terminal-Bench 2.0 remains highly challenging — even the best models fail on more than half the tasks, particularly multi-step operations that require error recovery.

Improvements over v1

  • 2x more tasks with broader coverage
  • Multi-turn interactions — Models can execute multiple commands sequentially
  • Realistic environments — Sandboxed Linux with real package managers, file systems, and services
  • Partial credit scoring — Intermediate progress is tracked, not just binary pass/fail
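
A hedged sketch of how partial credit scoring could work: the verifier runs independent checks against the final system state and counts how many pass. The specific checks, filename, and equal weighting below are illustrative assumptions, not the benchmark's actual scoring code:

```bash
# Hypothetical verifier: each independent check on the final state earns credit
score=0
total=3

# Check 1: the expected artifact exists
[ -f recent_python.tar.gz ] && score=$((score + 1))

# Check 2: it contains at least one .py entry
tar -tzf recent_python.tar.gz 2>/dev/null | grep -q '\.py$' && score=$((score + 1))

# Check 3: the artifact is readable (a stand-in for permission checks)
[ -r recent_python.tar.gz ] && score=$((score + 1))

echo "partial credit: $score/$total"
```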

Limitations

  • Linux-only — Does not test Windows or macOS terminal operations
  • Sandboxed — Some real-world scenarios (cloud APIs, external services) are not reproducible
  • Command-line focus — Does not test GUI or IDE-based workflows

References

  • Terminal-Bench 2.0 — Official benchmark repository and leaderboard