Terminal-Bench Hard
Terminal-Bench Hard Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Cohere | 30B | — | — | |
| 2 | Cohere | 218B | — | — | |
What is Terminal-Bench Hard?
Terminal-Bench Hard is a harder variant of a terminal-agent benchmark. It was evaluated with the Terminus-2 harness in Cohere's Command A+ and North Mini Code releases.
It is a text benchmark that evaluates models on reasoning, agentic, coding, and tool-calling tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is about 0.3, and the leading score is 0.311.
Current leaders
North Mini Code 1.0 from Cohere currently leads the Terminal-Bench Hard leaderboard with a score of 0.311, the higher of the 2 models evaluated.