Terminal-Bench Hard

Progress Over Time

[Interactive timeline: model performance on Terminal-Bench Hard over time. Legend: state-of-the-art frontier, open models, proprietary models.]

Terminal-Bench Hard Leaderboard

2 models
Columns: Context, Cost, License
1. 30B
2. 218B
About this benchmark

What is Terminal-Bench Hard?

Terminal-Bench Hard is a more difficult variant of the Terminal-Bench terminal-agent benchmark. Cohere reported results on it, evaluated with the Terminus-2 harness, in its Command A+ and North Mini Code releases.

Terminal-Bench Hard is a text benchmark that evaluates models on reasoning, agent, code, and tool-calling tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average score is 0.280, and the leader scores 0.311.

Compare leaders on the best AI for reasoning, best AI for agents, best AI for code and best AI for tool calling leaderboards.

Current leaders

Of the 2 evaluated models, North Mini Code 1.0 from Cohere currently leads the Terminal-Bench Hard leaderboard with a score of 0.311.

Command A+, also from Cohere, ranks second with a score of 0.250 (25.0%).

FAQ

Common questions about the Terminal-Bench Hard benchmark and leaderboard.

What is the Terminal-Bench Hard leaderboard?

The Terminal-Bench Hard leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, North Mini Code 1.0 by Cohere leads with a score of 0.311. The average score across all models is 0.280.
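
The leader and average quoted above can be checked from the two listed scores. The sketch below is only an illustration: it assumes the average is a plain arithmetic mean of the per-model scores and that Command A+'s 25.0% corresponds to 0.250 on the 0–1 scale. The function name `summarize` and the `scores` dictionary are hypothetical and not part of any LLM Stats API.

```python
def summarize(scores):
    """Return (leader_name, mean_score) for a mapping of model name -> score.

    Assumes every score is already on the same 0-1 scale.
    """
    leader = max(scores, key=scores.get)           # model with the highest score
    average = sum(scores.values()) / len(scores)   # simple arithmetic mean
    return leader, average


# Scores as listed on this page; Command A+ is shown as 25.0%,
# converted here to the 0-1 scale (an assumption about the site's convention).
scores = {
    "North Mini Code 1.0": 0.311,
    "Command A+": 25.0 / 100,
}

leader, average = summarize(scores)
print(leader, scores[leader])
print(round(average, 4))
```

Under these assumptions the mean is (0.311 + 0.250) / 2 = 0.2805, which agrees with the 0.280 average shown on the leaderboard to three decimal places.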

What is the highest Terminal-Bench Hard score?

The highest Terminal-Bench Hard score is 0.311, achieved by North Mini Code 1.0 from Cohere.

How many models are evaluated on Terminal-Bench Hard?

Two models have been evaluated on the Terminal-Bench Hard benchmark. Both results are self-reported; none has been verified.

What categories does Terminal-Bench Hard cover?

Terminal-Bench Hard is categorized under reasoning, agents, code, and tool calling. The benchmark evaluates text models.

What's the difference between Terminal-Bench Hard and Terminal-Bench?

Terminal-Bench Hard is a variant of Terminal-Bench. See the Terminal-Bench leaderboard for the broader benchmark and per-model comparison.

What is the best open-source model on Terminal-Bench Hard?

North Mini Code 1.0 by Cohere is the top-ranked open-source model on Terminal-Bench Hard, with a score of 0.311 (rank #1).

How recent are the Terminal-Bench Hard leaderboard results?

The Terminal-Bench Hard leaderboard was last updated in June 2026 and currently includes 2 evaluated models.