Terminal-Bench Hard

Name: Terminal-Bench Hard Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Progress Over Time

[Interactive timeline: model performance on Terminal-Bench Hard over time. Legend: state-of-the-art frontier, open models, proprietary models.]

Terminal-Bench Hard Leaderboard

2 models

Rank	Model	Organization	Parameters	Context	Cost	License
1	North Mini Code 1.0	Cohere	30B	—	—
2	Command A+	Cohere	218B	—	—


About this benchmark

What is Terminal-Bench Hard?

Terminal-Bench Hard is a harder variant of the Terminal-Bench terminal-agent benchmark. Cohere reported results on it, evaluated with the Terminus-2 harness, in its Command A+ and North Mini Code releases.

Terminal-Bench Hard is a text benchmark that evaluates models on reasoning, agent, code, and tool-calling tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.280, with the leader at 0.311.


Current leaders

North Mini Code 1.0 from Cohere currently leads the Terminal-Bench Hard leaderboard with a score of 0.311, the higher of the 2 evaluated models.

North Mini Code 1.0 (Cohere): 31.1%

Command A+ (Cohere): 25.0%

FAQ

Common questions about the Terminal-Bench Hard benchmark and leaderboard.

What is the Terminal-Bench Hard benchmark?

Terminal-Bench Hard is a harder variant of the Terminal-Bench terminal-agent benchmark. Cohere reported results on it, evaluated with the Terminus-2 harness, in its Command A+ and North Mini Code releases.

What is the Terminal-Bench Hard leaderboard?

The Terminal-Bench Hard leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, North Mini Code 1.0 by Cohere leads with a score of 0.311. The average score across all models is 0.280.
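The leader and average quoted above can be reproduced from the two listed scores. The following is a minimal sketch, assuming the average is a plain unweighted mean (the page does not state how it is computed); the variable names are illustrative and not part of any LLM Stats API.

```python
from statistics import mean

# Scores on the 0-1 scale, taken from the figures listed on this page
# (31.1% and 25.0%).
scores = {
    "North Mini Code 1.0": 0.311,
    "Command A+": 0.250,
}

# Rank models from highest to lowest score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_score = ranking[0]

# Assumption: the leaderboard average is an unweighted mean over all models.
average = mean(scores.values())

# Print the leader on both the 0-1 scale and as a percentage.
print(f"Leader: {leader_name} ({leader_score:.3f}, {leader_score:.1%})")
print(f"Average: {average:.3f}")
```

The same score appears on the page both as a fraction (0.311) and as a percentage (31.1%), so the sketch prints the leader in both forms.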

What is the highest Terminal-Bench Hard score?

The highest Terminal-Bench Hard score is 0.311, achieved by North Mini Code 1.0 from Cohere.

How many models are evaluated on Terminal-Bench Hard?

Two models have been evaluated on the Terminal-Bench Hard benchmark. Both results are self-reported; none has been independently verified.

What categories does Terminal-Bench Hard cover?

Terminal-Bench Hard is categorized under reasoning, agents, code, and tool calling. The benchmark evaluates text models.

What's the difference between Terminal-Bench Hard and Terminal-Bench?

Terminal-Bench Hard is a variant of Terminal-Bench. See the Terminal-Bench leaderboard for the broader benchmark and per-model comparison.

What is the best open-source model on Terminal-Bench Hard?

North Mini Code 1.0 by Cohere is the top-ranked open-source model on Terminal-Bench Hard, with a score of 0.311 (rank #1).

How recent are the Terminal-Bench Hard leaderboard results?

The Terminal-Bench Hard leaderboard was last updated in June 2026 and currently includes 2 evaluated models.