Terminal-Bench 2.0
Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools and operate a computer via the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as analyzing vulnerabilities.
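Benchmarks of this kind typically score an agent by letting it execute shell commands in a sandbox and then running a verification command: the task counts as solved only if the check exits 0. A minimal sketch of that pattern (the task and file names here are hypothetical; this is not the Terminal-Bench harness itself):

```python
import subprocess

def run_in_terminal(cmd: str, timeout: int = 60) -> tuple[int, str]:
    """Run one shell command as a terminal agent would, capturing
    the exit code and combined output."""
    proc = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr

# Agent step: perform the (hypothetical) task.
code, out = run_in_terminal("printf 'hello' > /tmp/greeting.txt")

# Verification step: the grader re-checks the environment state.
check_code, _ = run_in_terminal("grep -q hello /tmp/greeting.txt")
print("task solved:", check_code == 0)
```

Real harnesses isolate each task in a fresh container and give the agent many such command/observation turns; the sketch above shows only the execute-then-verify core.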
Progress Over Time
[Interactive timeline showing model performance evolution on Terminal-Bench 2.0; series: state-of-the-art frontier, open, proprietary]
Terminal-Bench 2.0 Leaderboard
25 models • 0 verified
| Rank | Model | Organization | Params | Context | Cost ($/1M tokens, in / out) | License |
|---|---|---|---|---|---|---|
| 1 | — | OpenAI | — | 400K | $1.75 / $14.00 | — |
| 2 | — | OpenAI | — | 1.0M | $2.50 / $15.00 | — |
| 3 | — | Google | — | 1.0M | $2.50 / $15.00 | — |
| 4 | — | Anthropic | — | 1.0M | $5.00 / $25.00 | — |
| 5 | — | OpenAI | — | 400K | $1.75 / $14.00 | — |
| 6 | — | OpenAI | — | 400K | $0.75 / $4.50 | — |
| 7 | — | Anthropic | — | 200K | $5.00 / $25.00 | — |
| 8 | — | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 9 | — | Zhipu AI | 744B | 200K | $1.00 / $3.20 | — |
| 10 | — | Google | — | — | — | — |
| 11 | — | OpenAI | — | 400K | $1.25 / $10.00 | — |
| 12 | — | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 13 | — | StepFun | 196B | 66K | $0.10 / $0.40 | — |
| 14 | — | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 | — |
| 15 | — | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 16 | — | Google | — | 1.0M | $0.50 / $3.00 | — |
| 17 | — | DeepSeek | 685B | — | — | — |
| 17 | — | DeepSeek | 685B | — | — | — |
| 19 | — | OpenAI | — | 400K | $0.20 / $1.25 | — |
| 20 | — | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 21 | — | Zhipu AI | 358B | 205K | $0.60 / $2.20 | — |
| 22 | — | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 23 | — | Xiaomi | 309B | 256K | $0.10 / $0.30 | — |
| 24 | — | Alibaba Cloud / Qwen Team | 480B | — | — | — |
| 25 | — | — | 120B | 262K | $0.10 / $0.50 | — |
FAQ
Common questions about Terminal-Bench 2.0
Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools and operate a computer via the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as analyzing vulnerabilities.
The Terminal-Bench 2.0 leaderboard ranks 25 AI models based on their performance on this benchmark. Currently, GPT-5.3 Codex by OpenAI leads with a score of 0.773. The average score across all models is 0.525.
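The summary statistics reported here are plain aggregates over the per-model scores: the leader is the maximum and the average is the arithmetic mean. A small illustration (the scores below are placeholders, not the real leaderboard values):

```python
def leaderboard_stats(scores: list[float]) -> tuple[float, float]:
    """Return (top score, average score) for a list of benchmark scores."""
    return max(scores), sum(scores) / len(scores)

# Placeholder scores for illustration only.
top, avg = leaderboard_stats([0.70, 0.60, 0.50, 0.40])
print(f"top={top:.3f} avg={avg:.3f}")  # prints "top=0.700 avg=0.550"
```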
The highest Terminal-Bench 2.0 score is 0.773, achieved by GPT-5.3 Codex from OpenAI.
25 models have been evaluated on the Terminal-Bench 2.0 benchmark, with 0 verified results and 25 self-reported results.
Terminal-Bench 2.0 is categorized under agents, code, reasoning, and tool calling. The benchmark evaluates text models.