
Multi-SWE-Bench

A multilingual benchmark for issue resolution that evaluates large language models' ability to resolve software issues across diverse programming ecosystems. It covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances annotated by 68 expert annotators, addressing the limitation that existing benchmarks focus almost exclusively on Python.
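As a minimal sketch of how one might load and inspect the benchmark instances, the example below uses the Hugging Face datasets library to count instances per language. The dataset ID, split name, and the "language" field are assumptions for illustration and are not confirmed by this page.

    # Minimal sketch (Python): count Multi-SWE-Bench instances per language.
    # Assumptions: the dataset ID, split name, and "language" field below are hypothetical.
    from collections import Counter
    from datasets import load_dataset  # pip install datasets

    dataset = load_dataset("Multi-SWE-bench/Multi-SWE-bench", split="test")  # hypothetical ID

    # The paper reports 1,632 instances across 7 languages.
    per_language = Counter(example["language"] for example in dataset)
    for language, count in sorted(per_language.items()):
        print(f"{language}: {count} instances")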

Paper: https://arxiv.org/abs/2504.02605

Progress Over Time

[Interactive timeline of model performance over time, marking the state-of-the-art frontier and distinguishing open from proprietary models]

Multi-SWE-Bench Leaderboard

4 models • 0 verified

Leading entry: MiniMax M2.7 by MiniMax, with a score of 0.527. All 4 results are self-reported; the average score across the listed models is 0.474.
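Leaderboard scores are commonly reported as the fraction of issues a model resolves. As a back-of-the-envelope sketch (assuming, without confirmation from this page, that the score is resolved instances divided by the 1,632 total), the top score of 0.527 corresponds to roughly 860 resolved issues:

    # Back-of-the-envelope sketch (Python), assuming score = resolved / total instances.
    # The exact scoring protocol is defined in the paper, not on this page.
    TOTAL_INSTANCES = 1632
    top_score = 0.527

    resolved_estimate = round(top_score * TOTAL_INSTANCES)
    print(f"~{resolved_estimate} of {TOTAL_INSTANCES} issues resolved")  # ~860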

FAQ

Common questions about Multi-SWE-Bench

What is Multi-SWE-Bench?
Multi-SWE-Bench is a multilingual benchmark for issue resolution, covering 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances annotated by 68 expert annotators. It addresses the limitation that existing benchmarks focus almost exclusively on Python.

Where can I find the Multi-SWE-Bench paper?
The Multi-SWE-Bench paper is available at https://arxiv.org/abs/2504.02605. It details the benchmark methodology, dataset construction, and evaluation criteria.

How do models rank on the Multi-SWE-Bench leaderboard?
The leaderboard ranks 4 AI models by their performance on this benchmark. MiniMax M2.7 by MiniMax currently leads with a score of 0.527, and the average score across all models is 0.474.

What is the highest Multi-SWE-Bench score?
The highest score is 0.527, achieved by MiniMax M2.7 from MiniMax.

How many models have been evaluated on Multi-SWE-Bench?
4 models have been evaluated, with 0 verified results and 4 self-reported results.

What categories does Multi-SWE-Bench cover?
Multi-SWE-Bench falls under the code and reasoning categories and evaluates text models with multilingual support.