
SWE-bench Multilingual

A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate large language models across diverse software ecosystems beyond Python.

Paper: https://arxiv.org/abs/2504.02605

Progress Over Time

Interactive timeline showing model performance evolution on SWE-bench Multilingual

SWE-bench Multilingual Leaderboard

19 models • 0 verified
| Rank | Organization | Parameters | Context | Input Cost | Output Cost |
|------|--------------|------------|---------|------------|-------------|
| 1 | — | — | 1.0M | $5.00 | $25.00 |
| 2 | — | — | — | — | — |
| 3 | Alibaba Cloud / Qwen Team | — | — | — | — |
| 4 | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 |
| 5 | — | 230B | 1.0M | $0.30 | $1.20 |
| 6 | — | 309B | 256K | $0.10 | $0.30 |
| 7 | — | 685B | — | — | — |
| 8 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 | $3.60 |
| 9 | Zhipu AI | 358B | 205K | $0.60 | $2.20 |
| 10 | — | 1.0T | — | — | — |
| 11 | — | 685B | — | — | — |
| 12 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 13 | Alibaba Cloud / Qwen Team | 480B | — | — | — |
| 14 | — | 671B | 164K | $0.27 | $1.00 |
| 15 | Moonshot AI | 1.0T | 200K | $0.50 | $0.50 |
| 15 | — | 1.0T | — | — | — |
| 17 | — | 120B | 262K | $0.10 | $0.50 |
| 18 | — | 69B | 256K | $0.10 | $0.40 |
| 19 | — | 671B | 131K | $0.50 | $2.15 |

FAQ

Common questions about SWE-bench Multilingual

What is SWE-bench Multilingual?
A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate large language models across diverse software ecosystems beyond Python.

Where can I find the SWE-bench Multilingual paper?
The SWE-bench Multilingual paper is available at https://arxiv.org/abs/2504.02605. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the SWE-bench Multilingual leaderboard?
The leaderboard ranks 19 AI models by their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 0.778. The average score across all models is 0.603.

What is the highest SWE-bench Multilingual score?
The highest SWE-bench Multilingual score is 0.778, achieved by Claude Opus 4.6 from Anthropic.
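A leaderboard score is a resolution rate: the fraction of the benchmark's 1,632 instances a model resolves. A minimal sketch converting between a reported rate and the implied number of resolved instances, using numbers from this page (the round-to-nearest convention is an assumption):

```python
# SWE-bench Multilingual has 1,632 instances; scores are resolution rates.
TOTAL_INSTANCES = 1632

def resolved_count(score: float, total: int = TOTAL_INSTANCES) -> int:
    """Approximate number of resolved instances implied by a resolution rate."""
    return round(score * total)

def resolution_rate(resolved: int, total: int = TOTAL_INSTANCES) -> float:
    """Fraction of instances resolved."""
    return resolved / total

print(resolved_count(0.778))  # top score on this page -> 1270 instances
print(resolved_count(0.603))  # leaderboard average   -> 984 instances
```

The inverse direction works the same way: 1,270 resolved instances out of 1,632 gives a rate of about 0.778.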
How many models have been evaluated on SWE-bench Multilingual?
19 models have been evaluated on the SWE-bench Multilingual benchmark, with 0 verified results and 19 self-reported results.

What categories does SWE-bench Multilingual fall under?
SWE-bench Multilingual is categorized under code and reasoning. The benchmark evaluates text models with multilingual support.