Terminal-Bench
Terminal-Bench Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $3.00 / $15.00 | |
| 2 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | |
| 3 | Moonshot AI | 1.0T | — | — | |
| 4 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | |
| 5 | Anthropic | — | — | — | |
| 6 | Amazon | — | — | — | |
| 7 | Anthropic | — | 200K | $1.00 / $5.00 | |
| 8 | Zhipu AI | 357B | — | — | |
| 9 | Meituan | 560B | — | — | |
| 10 | Anthropic | — | — | — | |
| 11 | DeepSeek | 685B | — | — | |
| 12 | Zhipu AI | 355B | — | — | |
| 13 | Anthropic | — | — | — | |
| 14 | Anthropic | — | — | — | |
| 15 | Meituan | 69B | 256K | $0.10 / $0.40 | |
| 16 | Zhipu AI | 358B | — | — | |
| 17 | Amazon | — | 1.0M | $0.30 / $2.50 | |
| 18 | DeepSeek | 671B | — | — | |
| 19 | Xiaomi | 309B | — | — | |
| 20 | Moonshot AI | 1.0T | — | — | |
| 20 | Zhipu AI | 106B | — | — | |
| 22 | | 120B | — | — | |
| 23 | Moonshot AI | 1.0T | — | — | |
| 24 | | 32B | 262K | $0.06 / $0.24 | |
| 25 | DeepSeek | 671B | 131K | $0.55 / $2.19 | |
Sub-benchmarks
Terminal-Bench 2.0
Terminal-Bench 2.0 is an updated version of the benchmark that tests AI agents' ability to use tools and operate a computer through the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks involving cybersecurity vulnerabilities.
Terminal-Bench 2.1
Terminal-Bench 2.1 is an updated release of the Terminal-Bench benchmark that tests AI agents' ability to operate a computer via the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks.
Terminal-Bench Hard
Terminal-Bench Hard is a harder variant of the terminal-agent benchmark. It is evaluated with the Terminus-2 harness in Cohere's Command A+ and North Mini Code releases.
What is Terminal-Bench?
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks involving cybersecurity vulnerabilities. The benchmark has two parts: a dataset of roughly 100 hand-crafted, human-verified tasks, and an execution harness that connects language models to a terminal sandbox.
Terminal-Bench is a text benchmark evaluating models on reasoning, agents, and code tasks. LLM Stats tracks 25 models on this benchmark, scored on a 0–1 scale. The current average is 0.3, with the leader at 0.5.
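To make the dataset-plus-harness structure concrete, here is a minimal Python sketch of how such a harness might score an agent. It is not the Terminal-Bench implementation: the `Task` record, `run_task`, `resolution_rate`, and `scripted_agent` are hypothetical names, a temporary directory stands in for the real sandbox, a fixed script stands in for the language model, and the sketch assumes the 0–1 score is the fraction of tasks whose check passes. It also assumes a POSIX shell is available.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record: an instruction plus a shell check."""
    instruction: str
    check_command: str  # exit code 0 means the task counts as solved


def run_task(task: Task, agent: Callable[[str], List[str]]) -> bool:
    """Run the agent's commands in a scratch directory, then run the check."""
    # Assumption: a temporary directory stands in for an isolated sandbox.
    with tempfile.TemporaryDirectory() as workdir:
        for command in agent(task.instruction):
            subprocess.run(command, shell=True, cwd=workdir,
                           capture_output=True, text=True, timeout=30)
        check = subprocess.run(task.check_command, shell=True, cwd=workdir,
                               capture_output=True, text=True, timeout=30)
        return check.returncode == 0


def resolution_rate(tasks: List[Task],
                    agent: Callable[[str], List[str]]) -> float:
    """Fraction of tasks solved, on a 0-1 scale (assumed scoring rule)."""
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results)


def scripted_agent(instruction: str) -> List[str]:
    """Stand-in for a language model: always returns the same command."""
    return ["echo hello > greeting.txt"]


# Two made-up tasks for illustration; they are not from the real dataset.
tasks = [
    Task("Create greeting.txt containing 'hello'.", "grep -q hello greeting.txt"),
    Task("Create report.csv.", "test -f report.csv"),
]
score = resolution_rate(tasks, scripted_agent)
print(score)
```

The scripted agent solves only the first of the two made-up tasks, so the sketch reports a rate of 0.5. A real harness would replace `scripted_agent` with calls to a model and would isolate each task far more strictly than a temporary directory does.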
Current leaders
Claude Sonnet 4.5 from Anthropic currently leads the Terminal-Bench leaderboard with a score of 0.500 across 25 evaluated AI models.
FAQ
Common questions about the Terminal-Bench benchmark and leaderboard.