SWT-Bench

A software testing benchmark that evaluates the ability of LLMs to write tests for software repositories

MiniMax M2.1 from MiniMax currently leads the SWT-Bench leaderboard with a score of 0.693; it is the only AI model evaluated so far.

MiniMax M2.1 leads with 69.3%.

Progress Over Time

Interactive timeline showing model performance evolution on SWT-Bench

SWT-Bench Leaderboard

1 model

Model: MiniMax M2.1 (230B)
Context: 1.0M
Cost: $0.30 / $1.20 (input / output)
License: not listed

FAQ

Common questions about SWT-Bench.

What is the SWT-Bench benchmark?

SWT-Bench is a software testing benchmark that evaluates the ability of LLMs to write tests for software repositories.
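
For context, SWT-Bench-style scoring generally hinges on a "fail-to-pass" criterion: a model-written test should fail on the buggy repository and pass once the reference fix is applied. The sketch below is illustrative only; the function names and the git/pytest workflow are assumptions, not the official harness.

```python
import subprocess

def run_test(repo_dir: str, test_path: str) -> bool:
    """Run a single pytest file inside the repository; True if it passes."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_path, "-x", "-q"],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def fail_to_pass(repo_dir: str, generated_test: str, gold_patch: str) -> bool:
    """A generated test reproduces the issue if it fails on the buggy
    checkout and passes after the reference fix is applied."""
    fails_before = not run_test(repo_dir, generated_test)
    subprocess.run(["git", "apply", gold_patch], cwd=repo_dir, check=True)
    passes_after = run_test(repo_dir, generated_test)
    # Revert the reference fix so the checkout is left unchanged.
    subprocess.run(["git", "apply", "-R", gold_patch], cwd=repo_dir, check=True)
    return fails_before and passes_after
```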

What is the SWT-Bench leaderboard?

The SWT-Bench leaderboard ranks AI models by their performance on this benchmark. Only one model has been evaluated so far: MiniMax M2.1 by MiniMax leads with a score of 0.693, which is therefore also the average score across all models.

What is the highest SWT-Bench score?

The highest SWT-Bench score is 0.693, achieved by MiniMax M2.1 from MiniMax.

How many models are evaluated on SWT-Bench?

1 model has been evaluated on the SWT-Bench benchmark, with 0 verified results and 1 self-reported result.

What categories does SWT-Bench cover?

SWT-Bench is categorized under code. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code
89 models
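
As a rough illustration of how SWE-Bench-style grading works: an instance usually counts as resolved only if the model's patch applies cleanly, the issue's previously failing tests now pass, and the existing tests still pass. The snippet below is a simplified sketch under those assumptions, not the official evaluation harness; the function name and the fail_to_pass / pass_to_pass inputs are illustrative.

```python
import subprocess

def resolves_issue(repo_dir, model_patch, fail_to_pass, pass_to_pass):
    """Resolved = patch applies, failing tests now pass, no existing tests break."""
    applied = subprocess.run(["git", "apply", model_patch], cwd=repo_dir)
    if applied.returncode != 0:
        return False

    def passes(test_id: str) -> bool:
        # Run one test identifier with pytest inside the repository.
        result = subprocess.run(["python", "-m", "pytest", test_id, "-q"],
                                cwd=repo_dir, capture_output=True)
        return result.returncode == 0

    return all(passes(t) for t in fail_to_pass + pass_to_pass)
```
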
LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

code
71 models
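
The contamination control described above comes down to comparing each problem's release date with a model's training cutoff. A minimal sketch, with made-up problem IDs and dates:

```python
from datetime import date

# Hypothetical contest problems annotated with release dates.
problems = [
    {"id": "contest-problem-a", "release_date": date(2024, 5, 4)},
    {"id": "contest-problem-b", "release_date": date(2023, 4, 29)},
]

def uncontaminated(problems, training_cutoff):
    """Keep only problems released after the model's training cutoff,
    so they cannot have appeared in its training data."""
    return [p for p in problems if p["release_date"] > training_cutoff]

# A model with a 2024-01-01 cutoff is scored only on the later problem.
print(uncontaminated(problems, date(2024, 1, 1)))
```
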
HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.

code
66 models
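
Functional correctness here means executing the model's completion against hidden unit tests rather than comparing text. Below is a simplified sketch; the task and tests are invented for illustration (not an actual HumanEval item), and a real harness would execute candidates in a sandbox.

```python
def check_solution(candidate_src: str, test_src: str) -> bool:
    """Execute the model's completion, then run hidden assertions against it."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # defines the requested function
        exec(test_src, namespace)        # raises AssertionError on failure
        return True
    except Exception:
        return False

# Hypothetical docstring-style task and hidden tests:
candidate = '''
def running_max(xs):
    """Return a list where element i is the maximum of xs[:i+1]."""
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
'''
tests = "assert running_max([1, 3, 2, 5]) == [1, 3, 3, 5]"
print(check_solution(candidate, tests))  # True
```
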
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' tool use ability to operate a computer via terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

code
39 models
SWE-bench Multilingual

A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

code
27 models
Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

code
23 models
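
Both Terminal-Bench entries describe an agent operating a shell inside a sandbox via an execution harness. The sketch below shows the general shape of such a loop, with a scripted stand-in for the model; it is an assumption-laden illustration, not the actual Terminal-Bench harness, which isolates execution in containers rather than running commands on the host.

```python
import subprocess

# Hypothetical stand-in for an LLM: replays a fixed script of commands.
# A real harness would call a model API with the task and prior outputs.
SCRIPTED_COMMANDS = iter(["echo hello > /tmp/out.txt", "cat /tmp/out.txt", "DONE"])

def query_model(history):
    return next(SCRIPTED_COMMANDS)

def run_episode(task: str, max_steps: int = 20) -> str:
    """Minimal agent loop: the model drives a shell one command at a time,
    seeing each command's output before choosing the next."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = query_model(history)
        if command.strip() == "DONE":
            break
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=120)
        history.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return "\n".join(history)

print(run_episode("Write 'hello' to /tmp/out.txt and show it."))
```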