Multi-SWE-Bench
A multilingual benchmark for issue resolving that evaluates Large Language Models' ability to resolve software issues across diverse programming ecosystems. It covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances carefully annotated by 68 expert annotators, addressing a limitation of existing benchmarks, which focus almost exclusively on Python.
Progress Over Time
[Interactive timeline showing model performance evolution on Multi-SWE-Bench; the chart plots the state-of-the-art frontier and distinguishes open from proprietary models.]
Multi-SWE-Bench Leaderboard
4 models • 0 verified
| Rank | Model | Organization | Score | Params | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | MiniMax M2.7 | MiniMax | 0.527 | — | — | — | — |
| 2 | — | MiniMax | — | 230B | — | — | — |
| 3 | — | MiniMax | — | 230B | — | — | — |
| 4 | — | MiniMax | — | 230B | — | — | — |
FAQ
Common questions about Multi-SWE-Bench
What is Multi-SWE-Bench?
A multilingual benchmark for issue resolving that evaluates Large Language Models' ability to resolve software issues across diverse programming ecosystems. It covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances carefully annotated by 68 expert annotators, addressing a limitation of existing benchmarks, which focus almost exclusively on Python.
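For readers who want to inspect the instances themselves, the sketch below shows one way to load the dataset and count instances per language using the Hugging Face `datasets` library. The dataset ID, split name, and the `language` field are illustrative assumptions, not confirmed identifiers; check the Multi-SWE-Bench repository for the actual distribution format.

```python
# Minimal sketch (assumptions noted): load Multi-SWE-Bench and count instances per language.
from collections import Counter

from datasets import load_dataset

# Hypothetical dataset ID and split; the real hosting location may differ.
dataset = load_dataset("Multi-SWE-bench/Multi-SWE-bench", split="test")

# "language" is an assumed field name for the instance's programming language.
per_language = Counter(example["language"] for example in dataset)

for language, count in sorted(per_language.items()):
    print(f"{language}: {count} instances")

# Per the benchmark description, the total should be 1,632 instances.
print(f"total: {sum(per_language.values())} instances")
```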
Where can I find the Multi-SWE-Bench paper?
The Multi-SWE-Bench paper is available at https://arxiv.org/abs/2504.02605. It provides detailed information about the benchmark methodology, dataset construction, and evaluation criteria.
Which model performs best on Multi-SWE-Bench?
The Multi-SWE-Bench leaderboard ranks 4 AI models by their performance on this benchmark. MiniMax M2.7 by MiniMax currently leads with a score of 0.527, and the average score across all models is 0.474.
What is the highest Multi-SWE-Bench score?
The highest Multi-SWE-Bench score is 0.527, achieved by MiniMax M2.7 from MiniMax.
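If the leaderboard score is read as a resolved rate over the benchmark's 1,632 instances (an assumption; the page does not define the metric), the scores translate into approximate issue counts as sketched below.

```python
# Hedged sketch: interpret a leaderboard score as a resolved rate over all instances.
# Assumes score = resolved_instances / total_instances, which this page does not
# state explicitly.
TOTAL_INSTANCES = 1_632


def approx_resolved(score: float, total: int = TOTAL_INSTANCES) -> int:
    """Approximate number of resolved issues implied by a resolved-rate score."""
    return round(score * total)


print(approx_resolved(0.527))  # ~860 issues for the top-ranked model
print(approx_resolved(0.474))  # ~774 issues at the leaderboard average
```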
How many models have been evaluated on Multi-SWE-Bench?
4 models have been evaluated on the Multi-SWE-Bench benchmark; all 4 results are self-reported and none are verified.
What categories does Multi-SWE-Bench fall under?
Multi-SWE-Bench is categorized under code and reasoning. The benchmark evaluates text models, and its multilingual coverage refers to the multiple programming languages it spans.