Terminal-Bench 2.0
Terminal-Bench 2.0 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | OpenAI | — | 1.1M | $5.00 / $30.00 | ||
| 2 | Anthropic | — | — | — | ||
| 3 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 4 | Google | — | 1.0M | $1.50 / $9.00 | ||
| 5 | OpenAI | — | 1.0M | $2.50 / $15.00 | ||
| 6 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 7 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | ||
| 8 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 9 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 10 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | ||
| 11 | Google | — | 1.0M | $2.50 / $15.00 | ||
| 12 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 | ||
| 13 | DeepSeek | 1.6T | 1.0M | $1.60 / $3.20 | ||
| 14 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 | ||
| 15 | Xiaomi | 311B | 1.0M | $0.17 / $0.34 | ||
| 16 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 17 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 18 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 19 | OpenAI | — | 400K | $0.75 / $4.50 | ||
| 20 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 20 | Anthropic | — | — | — | ||
| 22 | Anthropic | — | 200K | $3.00 / $15.00 | ||
| 23 | Meta | — | — | — | ||
| 24 | Xiaomi | 1.0T | — | — | ||
| 25 | MiniMax | — | 205K | $0.30 / $1.20 | ||
| 26 | DeepSeek | 284B | 1.0M | $0.10 / $0.20 | ||
| 27 | Zhipu AI | 744B | 200K | $1.00 / $3.20 | ||
| 28 | Microsoft | — | — | — | ||
| 29 | Google | — | — | — | ||
| 30 | OpenAI | — | — | — | ||
| 31 | Alibaba Cloud / Qwen Team | 397B | — | — | ||
| 32 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 33 | StepFun | 196B | 66K | $0.10 / $0.40 | ||
| 34 | Moonshot AI | 1.0T | — | — | ||
| 35 | Alibaba Cloud / Qwen Team | 122B | — | — | ||
| 36 | Google | — | 1.0M | $0.50 / $3.00 | ||
| 37 | DeepSeek | 685B | — | — | ||
| 37 | DeepSeek | 685B | — | — | ||
| 37 | DeepSeek | 685B | — | — | ||
| 40 | OpenAI | — | 400K | $0.20 / $1.25 | ||
| 41 | Microsoft | 1.0T | — | — | ||
| 42 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 43 | Zhipu AI | 358B | — | — | ||
| 44 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 45 | Xiaomi | 309B | — | — | ||
| 46 | Alibaba Cloud / Qwen Team | 480B | — | — | ||
| 47 | Cohere | 30B | — | — | ||
| 48 | | 120B | — | — | ||
What is Terminal-Bench 2.0?
Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents can use tools to operate a computer through the terminal. It evaluates whether models can complete real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks involving cybersecurity vulnerabilities.
Terminal-Bench 2.0 is a text benchmark that evaluates models on reasoning, agentic, coding, and tool-calling tasks. LLM Stats tracks 48 models on this benchmark, scored on a 0–1 scale. The current average is about 0.6, and the leader scores 0.827.
Compare leaders on the best AI for reasoning, best AI for agents, best AI for code and best AI for tool calling leaderboards.
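The summary figures quoted for this benchmark (model count, average, and leader) can be derived from a table of per-model scores. The sketch below shows one way to do that, assuming scores are held in a plain dictionary. The `summarize` function and the placeholder entries are hypothetical and only illustrate the arithmetic; the only value taken from the leaderboard is GPT-5.5's 0.827.

```python
from statistics import mean


def summarize(scores: dict[str, float]) -> dict[str, object]:
    """Return the model count, mean score, and top model for a score table."""
    if not scores:
        raise ValueError("scores must not be empty")
    for name, score in scores.items():
        # The leaderboard reports scores on a 0-1 scale.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} is outside the 0-1 scale: {score}")
    leader = max(scores, key=scores.get)
    return {
        "models": len(scores),
        "average": round(mean(scores.values()), 3),
        "leader": leader,
        "leader_score": scores[leader],
    }


# Only the GPT-5.5 score (0.827) comes from the leaderboard text.
# The other two entries are made-up placeholders, not real results.
example_scores = {
    "GPT-5.5": 0.827,
    "placeholder-model-a": 0.600,
    "placeholder-model-b": 0.400,
}

summary = summarize(example_scores)
print(summary)
```

Run against the full set of 48 scores, the same function would yield the average and leader reported on the page.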
Current leaders
GPT-5.5 from OpenAI currently leads the 48 AI models evaluated on Terminal-Bench 2.0 with a score of 0.827.
FAQ
Common questions about the Terminal-Bench 2.0 benchmark and leaderboard.