Terminal-Bench
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data-science workflows, and security tasks such as analyzing cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.
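The harness pattern described here is, at its core, a simple loop: the model proposes a shell command, the sandbox executes it, and the output is fed back to the model until the task's success check passes or a turn budget runs out. The sketch below illustrates that loop in Python; it is not the actual Terminal-Bench harness, and `query_model` and the `tb-sandbox` container name are hypothetical placeholders.

```python
import subprocess

def query_model(transcript: list[str]) -> str:
    """Hypothetical stand-in for a language-model call that returns
    the next shell command, given the session transcript so far."""
    raise NotImplementedError("wire up your model provider here")

def run_episode(verify_cmd: str, max_turns: int = 20) -> bool:
    """Agent loop: the model proposes a command, the sandbox runs it,
    and the output is appended to the transcript. The episode ends when
    the task's verification command succeeds or the turn budget runs out."""
    transcript: list[str] = []
    for _ in range(max_turns):
        command = query_model(transcript)
        # Execute inside an isolated container, never on the host shell.
        result = subprocess.run(
            ["docker", "exec", "tb-sandbox", "bash", "-lc", command],
            capture_output=True, text=True, timeout=120,
        )
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
        # Re-run the task's success check after each command.
        check = subprocess.run(
            ["docker", "exec", "tb-sandbox", "bash", "-lc", verify_cmd],
            capture_output=True,
        )
        if check.returncode == 0:
            return True
    return False
```

In the real benchmark, each task ships its own container environment and verification tests; the loop above only shows the shape of the agent-terminal interaction.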
Progress Over Time
[Interactive timeline of model performance on Terminal-Bench over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]
Terminal-Bench Leaderboard
23 models • 0 verified
| # | Organization | Params | Context | Input cost (per 1M tokens) | Output cost (per 1M tokens) |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $3.00 | $15.00 |
| 2 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 3 | Moonshot AI | 1.0T | — | — | — |
| 4 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 5 | Anthropic | — | 200K | $15.00 | $75.00 |
| 6 | Anthropic | — | 200K | $1.00 | $5.00 |
| 7 | Zhipu AI | 357B | 131K | $0.55 | $2.19 |
| 8 | Meituan | 560B | 128K | $0.30 | $1.20 |
| 9 | Anthropic | — | 200K | $15.00 | $75.00 |
| 10 | DeepSeek | 685B | — | — | — |
| 11 | Zhipu AI | 355B | 131K | $0.40 | $1.60 |
| 12 | Anthropic | — | 200K | $3.00 | $15.00 |
| 13 | Anthropic | — | 200K | $3.00 | $15.00 |
| 14 | Meituan | 69B | 256K | $0.10 | $0.40 |
| 15 | Zhipu AI | 358B | 205K | $0.60 | $2.20 |
| 16 | DeepSeek | 671B | 164K | $0.27 | $1.00 |
| 17 | Xiaomi | 309B | 256K | $0.10 | $0.30 |
| 18 | Moonshot AI | 1.0T | 200K | $0.50 | $0.50 |
| 18 | Zhipu AI | 106B | — | — | — |
| 20 | — | 120B | 262K | $0.10 | $0.50 |
| 21 | Moonshot AI | 1.0T | — | — | — |
| 22 | — | 32B | 262K | $0.06 | $0.24 |
| 23 | DeepSeek | 671B | 131K | $0.50 | $2.15 |
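The two cost columns are input and output token prices. Assuming the usual per-1M-token quotation (the original table did not label the unit), the dollar cost of a run follows directly from its token counts, as in this sketch:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Dollar cost of one run, with prices quoted per 1M tokens
    (assumed convention; the source table did not label units)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 2M-input / 300K-output run at the rank-1 prices ($3.00 / $15.00).
print(run_cost(2_000_000, 300_000, 3.00, 15.00))  # -> 10.5
```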
FAQ
Common questions about Terminal-Bench
What is Terminal-Bench?
Terminal-Bench tests AI agents in real terminal environments. It combines a dataset of ~100 hand-crafted, human-verified, end-to-end tasks with an execution harness that connects language models to a terminal sandbox.

How are models ranked?
The leaderboard ranks 23 AI models by their Terminal-Bench score. Claude Sonnet 4.5 by Anthropic currently leads with a score of 0.500, and the average score across all models is 0.345.

What is the highest Terminal-Bench score?
The highest score is 0.500, achieved by Claude Sonnet 4.5 from Anthropic.

How many models have been evaluated?
23 models have been evaluated on Terminal-Bench. All 23 results are self-reported; none have been independently verified.

What does Terminal-Bench evaluate?
Terminal-Bench is categorized under agents, code, and reasoning, and it evaluates text models.