Terminal-Bench 2.0

Progress Over Time

[Interactive timeline: model performance on Terminal-Bench 2.0 over time, showing the state-of-the-art frontier for open and proprietary models]

Terminal-Bench 2.0 Leaderboard

48 models

Rank  Organization               Params  Context  Cost (input / output, USD per 1M tokens)
1     OpenAI                     -       1.1M     $5.00 / $30.00
2     -                          -       -        -
3     -                          -       400K     $1.75 / $14.00
4     -                          -       1.0M     $1.50 / $9.00
5     OpenAI                     -       1.0M     $2.50 / $15.00
6     -                          -       1.0M     $5.00 / $25.00
7     Alibaba Cloud / Qwen Team  -       1.0M     $0.32 / $1.28
8     Alibaba Cloud / Qwen Team  -       1.0M     $1.25 / $3.75
9     -                          -       1.0M     $5.00 / $25.00
10    Zhipu AI                   754B    200K     $1.40 / $4.40
11    -                          -       1.0M     $2.50 / $15.00
12    -                          1.0T    1.0M     $0.43 / $0.87
13    -                          1.6T    1.0M     $1.60 / $3.20
14    Moonshot AI                1.0T    262K     $0.75 / $3.50
15    Xiaomi                     311B    1.0M     $0.17 / $0.34
16    -                          -       1.0M     $5.00 / $25.00
17    -                          -       400K     $1.75 / $14.00
18    Alibaba Cloud / Qwen Team  -       1.0M     $0.50 / $3.00
19    -                          -       400K     $0.75 / $4.50
20    Alibaba Cloud / Qwen Team  28B     262K     $0.60 / $3.60
20    -                          -       -        -
22    -                          -       200K     $3.00 / $15.00
23    -                          -       -        -
24    -                          1.0T    -        -
25    -                          -       205K     $0.30 / $1.20
26    -                          284B    1.0M     $0.10 / $0.20
27    Zhipu AI                   744B    200K     $1.00 / $3.20
28    -                          -       -        -
29    -                          -       -        -
30    -                          -       -        -
31    Alibaba Cloud / Qwen Team  397B    -        -
32    Alibaba Cloud / Qwen Team  35B     -        -
33    -                          196B    66K      $0.10 / $0.40
34    Moonshot AI                1.0T    -        -
35    Alibaba Cloud / Qwen Team  122B    -        -
36    -                          -       1.0M     $0.50 / $3.00
37    -                          685B    -        -
37    -                          685B    -        -
37    -                          685B    -        -
40    -                          -       400K     $0.20 / $1.25
41    -                          1.0T    -        -
42    Alibaba Cloud / Qwen Team  27B     262K     $0.30 / $2.40
43    Zhipu AI                   358B    -        -
44    Alibaba Cloud / Qwen Team  35B     -        -
45    -                          309B    -        -
46    Alibaba Cloud / Qwen Team  480B    -        -
47    -                          30B     -        -
48    -                          120B    -        -
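
The rank column above repeats 20 and then continues at 22, and repeats 37 three times before continuing at 40. That pattern is consistent with standard competition ranking, where tied entries share a rank and the following ranks are skipped. The sketch below illustrates that scheme; it is an assumption about how the ranks are assigned, not a description of the site's actual code, and the scores in the usage line are hypothetical.

```python
def competition_ranks(scores):
    """Return ranks for scores sorted from highest to lowest.

    Tied scores share a rank and the next distinct score skips ahead,
    for example 1, 2, 2, 4 (standard competition ranking).
    """
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered):
        if position > 0 and score == ordered[position - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: rank equals the 1-based position in the ordering.
            ranks.append(position + 1)
    return ranks


# Hypothetical scores with one tie.
print(competition_ranks([0.60, 0.58, 0.58, 0.55]))  # [1, 2, 2, 4]
```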
About this benchmark

What is Terminal-Bench 2.0?

Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents can use tools to operate a computer through the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and tasks involving cybersecurity vulnerabilities.

Terminal-Bench 2.0 is a text benchmark that evaluates models on reasoning, agent, code, and tool-calling tasks. LLM Stats tracks 48 models on this benchmark, scored on a 0–1 scale. The current average is 0.575, with the leader at 0.827.

Current leaders

GPT-5.5 from OpenAI currently leads the Terminal-Bench 2.0 leaderboard with a score of 0.827 across 48 evaluated AI models.

1. GPT-5.5 (OpenAI): 82.7%
2. Claude Mythos Preview (Anthropic): 82.0%
3. GPT-5.3 Codex (OpenAI): 77.3%
Top open-weight model: GLM-5.1 (rank #10): 69.0%

FAQ

Common questions about the Terminal-Bench 2.0 benchmark and leaderboard.

What is the Terminal-Bench 2.0 leaderboard?

The Terminal-Bench 2.0 leaderboard ranks 48 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.827. The average score across all models is 0.575.

What is the highest Terminal-Bench 2.0 score?

The highest Terminal-Bench 2.0 score is 0.827, achieved by GPT-5.5 from OpenAI.

How many models are evaluated on Terminal-Bench 2.0?

48 models have been evaluated on the Terminal-Bench 2.0 benchmark, with 0 verified results and 48 self-reported results.

What categories does Terminal-Bench 2.0 cover?

Terminal-Bench 2.0 is categorized under reasoning, agents, code, and tool calling. The benchmark evaluates text models.

What's the difference between Terminal-Bench 2.0 and Terminal-Bench?

Terminal-Bench 2.0 is a variant of Terminal-Bench. See the Terminal-Bench leaderboard for the broader benchmark and per-model comparison.

What is the best open-source model on Terminal-Bench 2.0?

GLM-5.1 by Zhipu AI is the top-ranked open-source model on Terminal-Bench 2.0, with a score of 0.690 (rank #10).

Which model offers the best value on Terminal-Bench 2.0?

Among models scoring within 10% of the leader, Gemini 3.5 Flash from Google is the cheapest, at $1.50 per million input tokens with a score of 0.762.
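
The selection rule in this answer can be sketched as a filter followed by a minimum. The example below is illustrative only: it assumes "within 10% of the leader" means a score of at least 90% of the top score, and it pairs the input prices listed on this page with models by rank (GPT-5.5 at rank 1, GPT-5.3 Codex at rank 3, GLM-5.1 at rank 10). `Entry` and `best_value` are hypothetical names, and models without a listed price are skipped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    model: str
    score: float                  # accuracy on a 0-1 scale
    input_price: Optional[float]  # USD per 1M input tokens, None if not listed


def best_value(entries, tolerance=0.10):
    """Cheapest priced entry whose score is within `tolerance` of the leader.

    Assumption: "within 10%" is relative, i.e. score >= leader * 0.9.
    """
    leader_score = max(e.score for e in entries)
    cutoff = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= cutoff and e.input_price is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.input_price)


# Scores and prices as quoted on this page; price-to-model pairing is by rank.
entries = [
    Entry("GPT-5.5", 0.827, 5.00),
    Entry("Claude Mythos Preview", 0.820, None),  # no price listed
    Entry("GPT-5.3 Codex", 0.773, 1.75),
    Entry("Gemini 3.5 Flash", 0.762, 1.50),
    Entry("GLM-5.1", 0.690, 1.40),  # cheaper, but below the cutoff
]

winner = best_value(entries)
print(winner.model, winner.input_price)  # Gemini 3.5 Flash 1.5
```

With these five entries the cutoff is about 0.744, so GLM-5.1 is excluded despite its lower price and Gemini 3.5 Flash is selected.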

How is Terminal-Bench 2.0 scored?

Terminal-Bench 2.0 is scored using accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
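
As a rough illustration of an accuracy score on a 0–1 scale, the sketch below computes the fraction of tasks passed and formats it as the percentage shown in the leaderboard (0.827 appears as 82.7%). The per-task pass/fail list is hypothetical, and the benchmark's actual harness may aggregate results differently.

```python
def accuracy(task_passed):
    """Fraction of tasks passed, on a 0-1 scale."""
    if not task_passed:
        raise ValueError("no task results to score")
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(task_passed) / len(task_passed)


# Hypothetical outcomes for four tasks: three passed, one failed.
results = [True, True, False, True]
score = accuracy(results)
print(f"{score:.3f} ({score:.1%})")  # 0.750 (75.0%)
```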

How recent are the Terminal-Bench 2.0 leaderboard results?

The Terminal-Bench 2.0 leaderboard was last updated in June 2026 and currently includes 48 evaluated models.