Terminal-Bench

Progress Over Time

[Interactive timeline: model performance on Terminal-Bench over time, showing the state-of-the-art frontier with open and proprietary models distinguished.]

Terminal-Bench Leaderboard

25 models
Rank  Organization  Parameters  Context  Cost (input / output per 1M tokens)  License
1     -             -           200K     $3.00 / $15.00                       -
2     -             230B        1.0M     $0.30 / $1.20                        -
3     -             1.0T        -        -                                    -
4     MiniMax       230B        1.0M     $0.30 / $1.20                        -
5     -             -           -        -                                    -
6     -             -           -        -                                    -
7     -             -           200K     $1.00 / $5.00                        -
8     Zhipu AI      357B        -        -                                    -
9     -             560B        -        -                                    -
10    Anthropic     -           -        -                                    -
11    -             685B        -        -                                    -
12    Zhipu AI      355B        -        -                                    -
13    -             -           -        -                                    -
14    -             -           -        -                                    -
15    -             69B         256K     $0.10 / $0.40                        -
16    Zhipu AI      358B        -        -                                    -
17    -             -           1.0M     $0.30 / $2.50                        -
18    -             671B        -        -                                    -
19    -             309B        -        -                                    -
20    Zhipu AI      106B        -        -                                    -
20    Moonshot AI   1.0T        -        -                                    -
22    -             120B        -        -                                    -
23    -             1.0T        -        -                                    -
24    -             32B         262K     $0.06 / $0.24                        -
25    -             671B        131K     $0.55 / $2.19                        -

About this benchmark

What is Terminal-Bench?

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can complete real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, administering systems, running data science workflows, and handling security tasks such as work on cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.
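To make the "dataset of tasks plus execution harness" idea concrete, here is a minimal sketch in Python. It is not the Terminal-Bench harness: the `Task` class, the `run_task` function, and the scripted agents are hypothetical names invented for illustration, a temporary directory stands in for a real sandbox, and the agent returns all of its commands up front instead of reacting to terminal output as a real agent would.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task: an instruction plus a shell command that verifies success."""
    instruction: str
    check_command: str  # exit code 0 means the task counts as solved


def run_task(task: Task, agent: Callable[[str], List[str]]) -> bool:
    """Run the agent's commands in a scratch directory, then run the task's check."""
    with tempfile.TemporaryDirectory() as workdir:
        for command in agent(task.instruction):
            # The scratch directory is only a stand-in for an isolated sandbox.
            subprocess.run(command, shell=True, cwd=workdir,
                           capture_output=True, timeout=30)
        check = subprocess.run(task.check_command, shell=True, cwd=workdir,
                               capture_output=True, timeout=30)
        return check.returncode == 0


def scripted_agent(instruction: str) -> List[str]:
    """Stand-in for a language model: always returns the same shell commands."""
    return ["echo hello > greeting.txt"]


def idle_agent(instruction: str) -> List[str]:
    """An agent that does nothing, so the check should fail."""
    return []


tasks = [
    Task(
        instruction="Create greeting.txt containing the word hello.",
        check_command="grep -q hello greeting.txt",
    )
]

# Score on a 0-1 scale: the fraction of tasks whose check passes.
score = sum(run_task(t, scripted_agent) for t in tasks) / len(tasks)
print(score)
```

Scoring each task by a pass/fail check and averaging over tasks is one simple way to arrive at a 0-1 score like the ones reported on this page; the actual scoring procedure may differ.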

Terminal-Bench is a text benchmark that evaluates models on reasoning, agent, and code tasks. LLM Stats tracks 25 models on this benchmark, scored on a 0–1 scale. The current average is 0.347, with the leader at 0.500.

Current leaders

Claude Sonnet 4.5 from Anthropic currently leads the Terminal-Bench leaderboard with a score of 0.500 across 25 evaluated AI models.

1. Claude Sonnet 4.5 (Anthropic): 50.0%
2. MiniMax M2.1 (MiniMax): 47.9%
3. Kimi K2-Thinking-0905 (Moonshot AI): 47.1%

FAQ

Common questions about the Terminal-Bench benchmark and leaderboard.

What is the Terminal-Bench leaderboard?

The Terminal-Bench leaderboard ranks 25 AI models based on their performance on this benchmark. Currently, Claude Sonnet 4.5 by Anthropic leads with a score of 0.500. The average score across all models is 0.347.

What is the highest Terminal-Bench score?

The highest Terminal-Bench score is 0.500, achieved by Claude Sonnet 4.5 from Anthropic.

How many models are evaluated on Terminal-Bench?

25 models have been evaluated on the Terminal-Bench benchmark. All 25 results are self-reported; none are verified.

What categories does Terminal-Bench cover?

Terminal-Bench is categorized under reasoning, agents, and code. The benchmark evaluates text models.

Are there variants of Terminal-Bench?

Yes. Terminal-Bench has 3 related variants: Terminal-Bench 2.0, Terminal-Bench 2.1, and Terminal-Bench Hard.

What is the best open-source model on Terminal-Bench?

MiniMax M2.1 by MiniMax is the top-ranked open-source model on Terminal-Bench, with a score of 0.479 (rank #2).

Which model offers the best value on Terminal-Bench?

Among models scoring within 10% of the leader, MiniMax M2.1 from MiniMax is the cheapest, at $0.30 per million input tokens with a score of 0.479.
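The selection rule behind this answer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's actual method: "within 10% of the leader" is read here as a score of at least 90% of the top score, the `Entry` class and `best_value` function are hypothetical names, and only the three models whose scores appear on this page are included, with input prices taken from the leaderboard rows where one is listed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    model: str
    score: float                  # 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if not listed


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Cheapest priced entry scoring at least (1 - tolerance) times the top score."""
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)  # assumed reading of "within 10%"
    candidates = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    return min(candidates, key=lambda e: e.input_price, default=None)


entries = [
    Entry("Claude Sonnet 4.5", 0.500, 3.00),
    Entry("MiniMax M2.1", 0.479, 0.30),
    Entry("Kimi K2-Thinking-0905", 0.471, None),  # no price listed on this page
]

winner = best_value(entries)
print(winner.model)  # MiniMax M2.1
```

With a top score of 0.500 the threshold is 0.450, so all three models qualify on score and MiniMax M2.1 has the lowest listed input price. Models with no listed price are skipped rather than treated as free.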

How recent are the Terminal-Bench leaderboard results?

The Terminal-Bench leaderboard was last updated in June 2026 and currently includes 25 evaluated models.