Multi-SWE-Bench


Progress Over Time

[Interactive timeline: model performance evolution on Multi-SWE-Bench. Legend: state-of-the-art frontier, open models, proprietary models.]

Multi-SWE-Bench Leaderboard

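6 models

Rank  Model         Organization               Parameters  Context  Cost (input / output)  License
1     MiniMax M2.7  MiniMax                    -           205K     $0.30 / $1.20          -
2     MiniMax M2.5  MiniMax                    230B        1.0M     $0.30 / $1.20          -
3     MiniMax M2.1  MiniMax                    230B        1.0M     $0.30 / $1.20          -
4     -             -                          1.0T        -        -                      -
5     -             MiniMax                    230B        1.0M     $0.30 / $1.20          -
6     -             Alibaba Cloud / Qwen Team  480B        -        -                      -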
About this benchmark

What is Multi-SWE-Bench?

Multi-SWE-Bench is a multilingual issue-resolving benchmark that evaluates the ability of Large Language Models to resolve software issues across diverse programming ecosystems. It covers seven programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances, selected from 2,456 candidates by 68 expert annotators. It addresses a limitation of existing benchmarks such as SWE-bench, which focus almost exclusively on Python.

Multi-SWE-Bench is a text benchmark that evaluates models on reasoning and code tasks. LLM Stats tracks 6 models on this benchmark, scored on a 0–1 scale. The current average is 0.429, with the leader at 0.527.


Current leaders

MiniMax M2.7 from MiniMax currently leads the Multi-SWE-Bench leaderboard with a score of 0.527 across 6 evaluated AI models.

1. MiniMax M2.7 (MiniMax): 52.7%
2. MiniMax M2.5 (MiniMax): 51.3%
3. MiniMax M2.1 (MiniMax): 49.4%

Source paper

Title
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Authors
Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, and 15 others
Published
Abstract

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

FAQ

Common questions about the Multi-SWE-Bench benchmark and leaderboard.

What is the Multi-SWE-Bench benchmark?

Multi-SWE-Bench is a multilingual benchmark for issue resolving. It evaluates how well Large Language Models resolve software issues across diverse programming ecosystems. It covers seven programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances annotated by 68 expert annotators, and it addresses the limitation that existing benchmarks focus almost exclusively on Python.

What is the Multi-SWE-Bench leaderboard?

The Multi-SWE-Bench leaderboard ranks 6 AI models based on their performance on this benchmark. Currently, MiniMax M2.7 by MiniMax leads with a score of 0.527. The average score across all models is 0.429.

What is the highest Multi-SWE-Bench score?

The highest Multi-SWE-Bench score is 0.527, achieved by MiniMax M2.7 from MiniMax.

How many models are evaluated on Multi-SWE-Bench?

6 models have been evaluated on the Multi-SWE-Bench benchmark, with 0 verified results and 6 self-reported results.

Where can I find the Multi-SWE-Bench paper?

The Multi-SWE-Bench paper is available at https://arxiv.org/abs/2504.02605. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Multi-SWE-Bench cover?

Multi-SWE-Bench is categorized under reasoning and code. The benchmark evaluates text models with multilingual support.

What is the best open-source model on Multi-SWE-Bench?

MiniMax M2.7 by MiniMax is the top-ranked open-source model on Multi-SWE-Bench, with a score of 0.527 (rank #1).

Which model offers the best value on Multi-SWE-Bench?

Among models scoring within 10% of the leader, MiniMax M2.7 from MiniMax is the cheapest, at $0.30 per million input tokens with a score of 0.527.
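The best-value rule above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual implementation: it assumes "within 10% of the leader" means a score of at least 90% of the top score (a relative threshold), and that price ties are broken in favor of the higher score. The Entry type and best_value function are hypothetical names, and the sample data uses only the three models whose score and input price both appear on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float        # benchmark score on the 0-1 scale
    input_price: float  # USD per million input tokens


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry whose score is within `tolerance` of the leader.

    Assumption: the threshold is relative (leader score * (1 - tolerance)),
    and ties on price are broken by the higher score.
    """
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= threshold]
    return min(eligible, key=lambda e: (e.input_price, -e.score))


# Scores and input prices as listed on this page for the top three models.
ENTRIES = [
    Entry("MiniMax M2.7", 0.527, 0.30),
    Entry("MiniMax M2.5", 0.513, 0.30),
    Entry("MiniMax M2.1", 0.494, 0.30),
]

if __name__ == "__main__":
    print(best_value(ENTRIES).model)  # prints "MiniMax M2.7"
```

Under these assumptions all three models clear the threshold (0.527 × 0.9 ≈ 0.474) and share the same input price, so the tie-break on score selects MiniMax M2.7, which matches the answer given above.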

How recent are the Multi-SWE-Bench leaderboard results?

The Multi-SWE-Bench leaderboard was last updated in July 2026 and currently includes 6 evaluated models.