
Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, such as compiling code, training models, setting up servers, performing system administration, running data science workflows, and handling security tasks like analyzing cybersecurity vulnerabilities. The benchmark consists of a dataset of roughly 100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.
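Conceptually, a harness of this kind is an agent loop: the model is given the task statement and the terminal transcript so far, proposes the next shell command, the command runs inside an isolated sandbox, and its output is appended to the transcript until the model signals completion or a step limit is hit. The sketch below illustrates only that general loop under stated assumptions; `query_model` is a hypothetical placeholder for an LLM call, and this is not the actual Terminal-Bench harness or its API (the real harness runs each task in a containerized sandbox and grades the result with task-specific checks).

```python
# Minimal, illustrative agent loop for a terminal-based task.
# Assumptions: `query_model` stands in for a real LLM call and here
# just returns "DONE"; a real harness would also sandbox execution
# and verify the outcome with task-specific tests.

import subprocess


def query_model(transcript: str) -> str:
    """Hypothetical stand-in for a language-model call.

    A real harness would send the transcript to an LLM and receive the
    next shell command (or a stop marker). This placeholder stops at once.
    """
    return "DONE"


def run_episode(task_statement: str, max_steps: int = 20) -> str:
    """Drive the model/terminal loop and return the full transcript."""
    transcript = f"TASK:\n{task_statement}\n"
    for _ in range(max_steps):
        command = query_model(transcript).strip()
        if command == "DONE":  # model believes the task is finished
            break
        # Execute the proposed command in a shell and capture its output.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        transcript += f"\n$ {command}\n{result.stdout}{result.stderr}"
    return transcript


if __name__ == "__main__":
    print(run_episode("Compile the project in /app and run its test suite."))
```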

Progress Over Time

Interactive timeline of model performance on Terminal-Bench over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.

Terminal-Bench Leaderboard

23 models • 0 verified
(Params = reported parameter count; Context = context window; Cost = price per 1M input / output tokens.)

Rank  Developer    Params  Context  Input $/1M  Output $/1M
1     –            –       200K     $3.00       $15.00
2     –            230B    1.0M     $0.30       $1.20
3     –            1.0T    –        –           –
4     MiniMax      230B    1.0M     $0.30       $1.20
5     –            –       200K     $15.00      $75.00
6     –            –       200K     $1.00       $5.00
7     Zhipu AI     357B    131K     $0.55       $2.19
8     –            560B    128K     $0.30       $1.20
9     Anthropic    –       200K     $15.00      $75.00
10    –            685B    –        –           –
11    Zhipu AI     355B    131K     $0.40       $1.60
12    –            –       200K     $3.00       $15.00
13    –            –       200K     $3.00       $15.00
14    –            69B     256K     $0.10       $0.40
15    Zhipu AI     358B    205K     $0.60       $2.20
16    –            671B    164K     $0.27       $1.00
17    –            309B    256K     $0.10       $0.30
18    Moonshot AI  1.0T    200K     $0.50       $0.50
18    Zhipu AI     106B    –        –           –
20    –            120B    262K     $0.10       $0.50
21    –            1.0T    –        –           –
22    –            32B     262K     $0.06       $0.24
23    –            671B    131K     $0.50       $2.15

FAQ

Common questions about Terminal-Bench

What is Terminal-Bench?
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, such as compiling code, training models, setting up servers, performing system administration, running data science workflows, and handling security tasks like analyzing cybersecurity vulnerabilities. The benchmark consists of a dataset of roughly 100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

How is the Terminal-Bench leaderboard ranked?
The leaderboard ranks 23 AI models by their performance on this benchmark. Claude Sonnet 4.5 by Anthropic currently leads with a score of 0.500, and the average score across all models is 0.345.

What is the highest Terminal-Bench score?
The highest Terminal-Bench score is 0.500, achieved by Claude Sonnet 4.5 from Anthropic.

How many models have been evaluated on Terminal-Bench?
23 models have been evaluated on the Terminal-Bench benchmark, with 0 verified results and 23 self-reported results.

What categories does Terminal-Bench cover?
Terminal-Bench is categorized under agents, code, and reasoning. The benchmark evaluates text models.

Sub-benchmarks