
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to operate a computer through the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as working with cybersecurity vulnerabilities.
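To make the evaluation setting concrete, an agent of the kind this benchmark targets reduces to a loop that issues shell commands and observes their output. The sketch below is illustrative only, not the Terminal-Bench harness; `propose_command` is a hypothetical stand-in for the model under test.

```python
import subprocess

def run_command(cmd: str, timeout: int = 60) -> tuple[int, str]:
    """Execute a shell command and return (exit code, combined output)."""
    proc = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr

def propose_command(history: list) -> str:
    # Hypothetical policy: a real agent would condition a model on the
    # full transcript of prior commands and their outputs.
    return "echo done" if history else "uname -s"

def agent_loop(max_steps: int = 2) -> list:
    """Minimal observe-act loop: pick a command, run it, record the result."""
    history = []
    for _ in range(max_steps):
        cmd = propose_command(history)
        code, output = run_command(cmd)
        history.append((cmd, code, output.strip()))
    return history

transcript = agent_loop()
```

A real task adds a success check (e.g., a test script run inside the container) after the loop; the benchmark scores whether the end state satisfies the task specification, not the individual commands.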

Progress Over Time

[Interactive timeline: model performance evolution on Terminal-Bench 2.0, with a state-of-the-art frontier line and markers for open vs. proprietary models]

Terminal-Bench 2.0 Leaderboard

25 models • 0 verified
| Rank | Model | Organization | Params | Context | Input $/M tok | Output $/M tok |
|------|-------|--------------|--------|---------|---------------|----------------|
| 1 | GPT-5.3 Codex | OpenAI | | 400K | $1.75 | $14.00 |
| 2 | | OpenAI | | 1.0M | $2.50 | $15.00 |
| 3 | | | | 1.0M | $2.50 | $15.00 |
| 4 | | | | 1.0M | $5.00 | $25.00 |
| 5 | | | | 400K | $1.75 | $14.00 |
| 6 | | | | 400K | $0.75 | $4.50 |
| 7 | | | | 200K | $5.00 | $25.00 |
| 8 | | | | 200K | $3.00 | $15.00 |
| 9 | | Zhipu AI | 744B | 200K | $1.00 | $3.20 |
| 10 | | | | | | |
| 11 | | | | 400K | $1.25 | $10.00 |
| 12 | | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 | $3.60 |
| 13 | | | 196B | 66K | $0.10 | $0.40 |
| 14 | | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 |
| 15 | | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 | $3.20 |
| 16 | | | | 1.0M | $0.50 | $3.00 |
| 17 | | | 685B | | | |
| 17 | | | 685B | | | |
| 19 | | | | 400K | $0.20 | $1.25 |
| 20 | | Alibaba Cloud / Qwen Team | 27B | | | |
| 21 | | Zhipu AI | 358B | 205K | $0.60 | $2.20 |
| 22 | | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 | $2.00 |
| 23 | | | 309B | 256K | $0.10 | $0.30 |
| 24 | | Alibaba Cloud / Qwen Team | 480B | | | |
| 25 | | | 120B | 262K | $0.10 | $0.50 |
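The cost columns are API prices quoted per million input and output tokens, so the dollar cost of a run scales linearly with token usage. As a worked example with hypothetical token counts (not benchmark data), here is the arithmetic at the rank-1 pricing of $1.75 input / $14.00 output; `run_cost` is an illustrative helper, not part of any benchmark tooling:

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """API cost in USD, given prices quoted per million tokens."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

# Hypothetical agentic run: 2,000,000 input tokens, 150,000 output tokens
cost = run_cost(2_000_000, 150_000, 1.75, 14.00)
# 2.0 * $1.75 + 0.15 * $14.00 = $3.50 + $2.10 = $5.60
print(f"${cost:.2f}")  # prints "$5.60"
```

Long agentic sessions are input-heavy (each step replays the growing transcript), so the input price usually dominates total cost.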

FAQ

Common questions about Terminal-Bench 2.0

What is Terminal-Bench 2.0?
Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to operate a computer through the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as working with cybersecurity vulnerabilities.

Which model leads the Terminal-Bench 2.0 leaderboard?
The leaderboard ranks 25 AI models by their performance on this benchmark. GPT-5.3 Codex by OpenAI currently leads with a score of 0.773; the average score across all models is 0.525.

What is the highest Terminal-Bench 2.0 score?
The highest score is 0.773, achieved by GPT-5.3 Codex from OpenAI.

How many models have been evaluated?
25 models have been evaluated on the Terminal-Bench 2.0 benchmark, with 0 verified results and 25 self-reported results.

What categories does Terminal-Bench 2.0 cover?
Terminal-Bench 2.0 is categorized under agents, code, reasoning, and tool calling. The benchmark evaluates text models.