## Overview
Terminal-Bench 2.0 evaluates AI models on their ability to perform complex terminal operations, from file management and system administration to scripting, debugging, and infrastructure tasks. It tests whether models can operate effectively in a real command-line environment with multi-step reasoning. Unlike simple code-completion benchmarks, Terminal-Bench 2.0 requires models to interact with an actual terminal, chain multiple commands, handle errors, and achieve specific system-state objectives.

## Key Details
| Property | Value |
|---|---|
| Created by | Terminal-Bench Team |
| Version | 2.0 (updated from v1) |
| Task type | Terminal / CLI task completion |
| Categories | File ops, networking, system admin, scripting, debugging |
| Evaluation | Task completion success rate |
| Environment | Sandboxed Linux terminal |
## How It Works
- Input: The model receives a natural language description of a terminal task (e.g., “Find all Python files modified in the last 7 days and compress them into a tarball”)
- Execution: The model generates and executes terminal commands in a sandboxed environment
- Multi-step: Tasks often require multiple sequential commands with intermediate checks
- Evaluation: Success is determined by verifying that the final system state matches the expected outcome (a worked example follows)
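To make this concrete, here is a plausible command sequence for the example task above, followed by the kind of state check a verifier could run. This is a sketch, not the benchmark's actual solution or grading code; the tarball name `recent_py.tar.gz` is arbitrary.

```bash
# Find Python files modified in the last 7 days and pack them into a
# tarball. -print0 / --null handle filenames containing spaces.
find . -name '*.py' -mtime -7 -print0 \
  | tar --null -czf recent_py.tar.gz --files-from=-

# State-based verification (illustrative): the tarball exists and
# actually contains .py entries.
tar -tzf recent_py.tar.gz | grep -c '\.py$'
```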
## Task Categories
| Category | Examples | Difficulty |
|---|---|---|
| File Management | Find, move, rename, compress, parse files | Easy-Medium |
| Text Processing | grep, sed, awk pipelines, log analysis (see example below) | Medium |
| System Administration | User management, service config, cron jobs | Medium-Hard |
| Networking | curl, wget, SSH tunneling, port scanning | Hard |
| Scripting | Write bash/python scripts from requirements | Hard |
| Debugging | Diagnose failing services, fix configurations | Very Hard |
| Infrastructure | Docker, databases, web server configuration | Very Hard |
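For a sense of what a Medium-difficulty text-processing task looks like in practice, here is a hypothetical pipeline of the style the table refers to. The log path and field positions assume a standard combined access-log format and are invented for illustration.

```bash
# Hypothetical task: report the ten endpoints with the most 5xx
# responses. In combined log format, field 9 is the status code and
# field 7 is the request path.
awk '$9 ~ /^5[0-9][0-9]$/ { counts[$7]++ }
     END { for (p in counts) print counts[p], p }' /var/log/nginx/access.log \
  | sort -rn \
  | head -10
```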
## Why It Matters

Terminal-Bench 2.0 tests a fundamentally different skill set from code-generation benchmarks:

- Practical operations: Tests tasks that developers and sysadmins actually perform daily
- Stateful reasoning: Commands depend on previous results, so models must track system state
- Error recovery: Real terminals produce unexpected errors; models must adapt (see the sketch below)
- Security awareness: Some tasks test whether models avoid dangerous operations (e.g., `rm -rf /`)
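As a sketch of the stateful, adaptive behavior these bullets describe, consider the following recovery pattern. The service name `webapp` and the config path are hypothetical; the point is that the second attempt is conditioned on what the first attempt revealed.

```bash
# Hypothetical recovery flow: restart a service; if that fails, read
# the logs and fix the detected cause before retrying, rather than
# re-running the same failing command.
if ! systemctl restart webapp; then
    journalctl -u webapp --no-pager -n 20 > /tmp/webapp.log
    if grep -q 'Permission denied' /tmp/webapp.log; then
        chmod 644 /etc/webapp/config.yaml   # hypothetical fix for the observed error
        systemctl restart webapp
    fi
fi
```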
## Notable Results
| Model | Task Completion Rate | Date |
|---|---|---|
| Claude 3.5 Sonnet | ~45% | 2025 |
| GPT-4o | ~38% | 2025 |
| Gemini 2.0 Pro | ~35% | 2025 |
Terminal-Bench 2.0 remains highly challenging — even the best models fail on more than half the tasks, particularly multi-step operations that require error recovery.
## Improvements over v1
- 2x more tasks with broader coverage
- Multi-turn interactions: Models can execute multiple commands sequentially
- Realistic environments: Sandboxed Linux with real package managers, file systems, and services
- Partial credit scoring: Intermediate progress is tracked, not just binary pass/fail (see the sketch below)
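As a minimal sketch of what checkpoint-style partial credit could look like from the verifier's side, the script below awards one point per satisfied checkpoint. The specific checks (user, cron job, script) are assumptions for illustration, not the benchmark's actual scoring code.

```bash
#!/usr/bin/env bash
# Hypothetical partial-credit verifier: each passing checkpoint adds to
# the score instead of producing a single pass/fail verdict.
score=0
total=3

# Checkpoint 1: the expected user exists.
id deploy >/dev/null 2>&1 && score=$((score + 1))

# Checkpoint 2: the user's crontab references the backup script.
crontab -l -u deploy 2>/dev/null | grep -q 'backup.sh' && score=$((score + 1))

# Checkpoint 3: the backup script is present and executable.
[ -x /usr/local/bin/backup.sh ] && score=$((score + 1))

echo "partial credit: $score/$total"
```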
## Limitations
- Linux-only: Does not test Windows or macOS terminal operations
- Sandboxed: Some real-world scenarios (cloud APIs, external services) are not reproducible
- Command-line focus: Does not test GUI or IDE-based workflows
## References
- Terminal-Bench 2.0 — Official benchmark repository and leaderboard