SWE-bench Multilingual

[Figure: Progress Over Time. Interactive timeline of model performance on SWE-bench Multilingual, showing the state-of-the-art frontier and distinguishing open from proprietary models.]

SWE-bench Multilingual Leaderboard

33 models
Rank  Organization               Parameters  Context  Cost (input / output)
1     -                          -           -        -
2     -                          -           1.0M     $5.00 / $25.00
3     Alibaba Cloud / Qwen Team  -           1.0M     $1.25 / $3.75
3     Anthropic                  -           1.0M     $3.00 / $15.00
5     -                          -           1.0M     $5.00 / $25.00
6     Moonshot AI                1.0T        262K     $0.75 / $3.50
7     -                          -           205K     $0.30 / $1.20
8     -                          1.6T        1.0M     $1.60 / $3.20
9     Alibaba Cloud / Qwen Team  -           1.0M     $0.32 / $1.28
10    Alibaba Cloud / Qwen Team  -           1.0M     $0.50 / $3.00
11    -                          284B        1.0M     $0.10 / $0.20
12    Moonshot AI                1.0T        -        -
13    -                          230B        1.0M     $0.30 / $1.20
14    -                          1.0T        -        -
14    -                          309B        -        -
16    Alibaba Cloud / Qwen Team  28B         262K     $0.60 / $3.60
17    -                          685B        -        -
17    -                          685B        -        -
19    Alibaba Cloud / Qwen Team  397B        -        -
20    -                          550B        -        -
21    Alibaba Cloud / Qwen Team  35B         -        -
22    Zhipu AI                   358B        -        -
23    -                          -           -        -
24    -                          1.0T        -        -
25    -                          685B        -        -
26    MiniMax                    230B        1.0M     $0.30 / $1.20
27    Alibaba Cloud / Qwen Team  480B        -        -
28    -                          671B        -        -
29    -                          1.0T        -        -
29    Moonshot AI                1.0T        -        -
31    -                          120B        -        -
32    -                          69B         256K     $0.10 / $0.40
33    -                          671B        131K     $0.55 / $2.19
About this benchmark

What is SWE-bench Multilingual?

SWE-bench Multilingual is a multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

SWE-bench Multilingual is a text benchmark that evaluates models on reasoning and code tasks. LLM Stats tracks 33 models on this benchmark, scored on a 0–1 scale. The current average is 0.663, and the leader scores 0.873.

Current leaders

Claude Mythos Preview from Anthropic currently leads the SWE-bench Multilingual leaderboard with a score of 0.873 across 33 evaluated AI models.

1. Claude Mythos Preview (Anthropic): 87.3%
2. Claude Opus 4.8 (Anthropic): 84.4%
3. Qwen3.7 Max (Alibaba Cloud / Qwen Team): 78.3%
Top open-weight: Kimi K2.6 (rank #6), 76.7%

Source paper

Title: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, and 15 others
Abstract

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

FAQ

Common questions about the SWE-bench Multilingual benchmark and leaderboard.

What is the SWE-bench Multilingual benchmark?

It is a multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

What is the SWE-bench Multilingual leaderboard?

The SWE-bench Multilingual leaderboard ranks 33 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.873. The average score across all models is 0.663.
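The leaderboard shows shared ranks (for example, two entries at rank 3 followed by rank 5), which is consistent with standard competition ranking. The sketch below illustrates that scheme in Python. It is an assumption about how the ranks could be computed, not the site's confirmed method. The four named scores come from this page; `hypothetical-model-a` and `hypothetical-model-b` and their scores are invented placeholders used only to produce a tie.

```python
from typing import Dict, List, Tuple


def competition_rank(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Rank names by descending score; equal scores share the same rank."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous entry's rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


scores = {
    "Claude Mythos Preview": 0.873,
    "Claude Opus 4.8": 0.844,
    "Qwen3.7 Max": 0.783,
    "hypothetical-model-a": 0.783,  # placeholder, not a real leaderboard entry
    "hypothetical-model-b": 0.770,  # placeholder, not a real leaderboard entry
    "Kimi K2.6": 0.767,
}

for rank, name, score in competition_rank(scores):
    print(f"{rank:>2}  {name:<24} {score:.1%}")
```

With a tie at rank 3, the next entry takes rank 5 rather than rank 4, which mirrors the gaps visible in the leaderboard.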

What is the highest SWE-bench Multilingual score?

The highest SWE-bench Multilingual score is 0.873, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on SWE-bench Multilingual?

33 models have been evaluated on the SWE-bench Multilingual benchmark, with 0 verified results and 33 self-reported results.

Where can I find the SWE-bench Multilingual paper?

The SWE-bench Multilingual paper is available at https://arxiv.org/abs/2504.02605. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does SWE-bench Multilingual cover?

SWE-bench Multilingual is categorized under reasoning and code. The benchmark evaluates text models with multilingual support.

What is the best open-source model on SWE-bench Multilingual?

Kimi K2.6 by Moonshot AI is the top-ranked open-source model on SWE-bench Multilingual, with a score of 0.767 (rank #6).

Which model offers the best value on SWE-bench Multilingual?

Among models scoring within 10% of the leader, Claude Opus 4.8 from Anthropic is the cheapest, at $5.00 per million input tokens with a score of 0.844.
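The selection described above can be sketched as a filter followed by a minimum. This is a rough illustration under two labeled assumptions: "within 10% of the leader" is read as a relative threshold (score at least 90% of the top score), and models without a listed price are excluded. The `Entry` type and `best_value` function are hypothetical names; the scores and input prices are the ones shown on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    name: str
    score: float                  # 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if unlisted


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    leader_score = max(e.score for e in entries)
    # Assumption: relative threshold, i.e. at least 90% of the leader's score.
    threshold = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.input_price)


entries = [
    Entry("Claude Mythos Preview", 0.873, None),  # no price listed on the page
    Entry("Claude Opus 4.8", 0.844, 5.00),
    Entry("Qwen3.7 Max", 0.783, 1.25),
    Entry("Kimi K2.6", 0.767, 0.75),
]

winner = best_value(entries)
print(winner.name if winner else "no priced model within tolerance")
```

Under the relative reading the cutoff is about 0.786, so Qwen3.7 Max (0.783) falls just below it. An absolute reading (within 0.10 of the leader) would admit cheaper models, so the choice of interpretation changes the answer.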

How recent are the SWE-bench Multilingual leaderboard results?

The SWE-bench Multilingual leaderboard was last updated in July 2026 and currently includes 33 evaluated models.