Terminus
Terminal-Bench is a benchmark for testing AI agents in real terminal environments, evaluating how well agents can handle real-world, end-to-end tasks autonomously. The benchmark includes tasks spanning coding, system administration, security, data science, model training, file operations, version control, and web development. Terminus is the neutral test-bed agent designed to work with Terminal-Bench, operating purely through tmux sessions without dedicated tools.
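Because Terminus has no dedicated tools, its entire action space is typing keystrokes into a terminal and reading the screen back. The Python sketch below illustrates that interaction pattern using plain tmux commands via subprocess; the session name and helper functions are illustrative assumptions for this page, not Terminal-Bench's actual implementation.

```python
import subprocess
import time

SESSION = "terminus-demo"  # hypothetical session name, not Terminal-Bench's

def start_session() -> None:
    # Create a detached tmux session the agent can type into.
    subprocess.run(["tmux", "new-session", "-d", "-s", SESSION], check=True)

def send_keys(command: str) -> None:
    # Type a command into the pane and press Enter, exactly as a user would.
    subprocess.run(["tmux", "send-keys", "-t", SESSION, command, "Enter"], check=True)

def capture_pane() -> str:
    # Read back the visible screen contents; this is the agent's only "observation".
    result = subprocess.run(
        ["tmux", "capture-pane", "-t", SESSION, "-p"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

if __name__ == "__main__":
    start_session()
    send_keys("echo hello from the agent")
    time.sleep(0.5)  # naive wait; a real agent needs a smarter readiness check
    print(capture_pane())
    subprocess.run(["tmux", "kill-session", "-t", SESSION], check=True)
```

The agent loop then reduces to: capture the pane, decide the next command, send keystrokes, and repeat, which is what makes this setup a neutral test-bed across very different task types.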
Progress Over Time
[Interactive timeline of model performance on Terminus, showing the state-of-the-art frontier for open and proprietary models.]
Terminus Leaderboard
1 model
| Rank | Model | Organization | Score | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|---|
| 1 | Kimi K2 Instruct | Moonshot AI | 0.250 | 1.0T | 200K | $0.50 / $0.50 | — |
FAQ
Common questions about Terminus
What is Terminus?
Terminus is the neutral test-bed agent for Terminal-Bench, a benchmark that tests AI agents on real-world, end-to-end terminal tasks spanning coding, system administration, security, data science, model training, file operations, version control, and web development. Terminus operates purely through tmux sessions, without dedicated tools.
Where can I learn more about Terminus?
Terminus is documented in the Terminal-Bench repository at https://github.com/laude-institute/terminal-bench, which covers the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on the Terminus leaderboard?
The leaderboard ranks models by their benchmark score. It currently lists a single model: Kimi K2 Instruct by Moonshot AI, with a score of 0.250.
What is the highest Terminus score?
The highest Terminus score is 0.250, achieved by Kimi K2 Instruct from Moonshot AI.
How many models have been evaluated on Terminus?
1 model has been evaluated on the Terminus benchmark, with 0 verified results and 1 self-reported result.
What categories does Terminus cover?
Terminus is categorized under agents, code, and reasoning, and evaluates text models.