SWE-bench Multilingual
A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate Large Language Models across diverse software ecosystems beyond Python.
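For readers who want to poke at the underlying data, the sketch below shows how a SWE-bench-style dataset is typically loaded and summarized with the Hugging Face `datasets` library. The dataset identifier and the `repo` field name are assumptions based on SWE-bench conventions, not details confirmed by this page; check the paper or project repository for the canonical ones.

```python
# Minimal sketch: loading and inspecting the benchmark with the
# Hugging Face `datasets` library. The dataset id and field names
# below are assumptions (SWE-bench-style conventions), not taken
# from this page -- see the paper/repo for the canonical ones.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("swe-bench/SWE-bench_Multilingual", split="test")  # hypothetical id

print(len(ds))  # expected: 1,632 instances

# SWE-bench-style instances typically carry the source repository and
# the issue text; counting repos gives a rough per-ecosystem breakdown
# across the seven covered languages.
repo_counts = Counter(row["repo"] for row in ds)
for repo, n in repo_counts.most_common(10):
    print(f"{repo}: {n}")
```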
Progress Over Time
[Interactive timeline showing model performance evolution on SWE-bench Multilingual, with a state-of-the-art frontier line and series split into open and proprietary models.]
SWE-bench Multilingual Leaderboard
19 models • 0 verified
| # | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | — | 1.0M | $5.00 / $25.00 | — |
| 2 | — | MiniMax | — | — | — | — |
| 3 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | — | — | — | — |
| 4 | — | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 | — |
| 5 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 6 | — | Xiaomi | 309B | 256K | $0.10 / $0.30 | — |
| 7 | — | DeepSeek | 685B | — | — | — |
| 8 | — | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 9 | — | Zhipu AI | 358B | 205K | $0.60 / $2.20 | — |
| 10 | — | Moonshot AI | 1.0T | — | — | — |
| 11 | — | DeepSeek | 685B | — | — | — |
| 12 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 13 | — | Alibaba Cloud / Qwen Team | 480B | — | — | — |
| 14 | — | DeepSeek | 671B | 164K | $0.27 / $1.00 | — |
| 15 | — | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | — |
| 15 | — | Moonshot AI | 1.0T | — | — | — |
| 17 | — | — | 120B | 262K | $0.10 / $0.50 | — |
| 18 | — | Meituan | 69B | 256K | $0.10 / $0.40 | — |
| 19 | — | DeepSeek | 671B | 131K | $0.50 / $2.15 | — |
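One note on reading the Cost column: the paired figures appear to be input and output prices in USD per million tokens, the common convention for API pricing, though the page does not state the unit. Under that assumption, a back-of-the-envelope estimate of an evaluation run's cost looks like the sketch below; the token counts are illustrative placeholders, not benchmark statistics.

```python
# Hedged sketch: estimating the API cost of an evaluation run from the
# table's paired prices, read here as USD per 1M input/output tokens
# (an assumption -- the page does not state the unit). Token counts
# below are illustrative placeholders, not benchmark statistics.
def run_cost(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float) -> float:
    """Cost in USD for one run, given per-1M-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Example: a hypothetical agent trajectory averaging 60K input and
# 4K output tokens per instance, across all 1,632 instances, at the
# rank-1 row's prices ($5.00 in / $25.00 out).
per_instance = run_cost(60_000, 4_000, 5.00, 25.00)
print(f"per instance:   ${per_instance:.2f}")          # $0.40
print(f"full benchmark: ${per_instance * 1632:.2f}")   # ~$652.80
```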
FAQ
Common questions about SWE-bench Multilingual
What is SWE-bench Multilingual?
A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

Where can I read more about SWE-bench Multilingual?
The SWE-bench Multilingual paper is available at https://arxiv.org/abs/2504.02605. It provides detailed information about the benchmark's methodology, dataset creation, and evaluation criteria.

How do models rank on SWE-bench Multilingual?
The SWE-bench Multilingual leaderboard ranks 19 AI models by their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 0.778. The average score across all models is 0.603.

What is the highest SWE-bench Multilingual score?
The highest SWE-bench Multilingual score is 0.778, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated?
19 models have been evaluated on the SWE-bench Multilingual benchmark, with 0 verified and 19 self-reported results.

What does SWE-bench Multilingual evaluate?
SWE-bench Multilingual is categorized under code and reasoning; it evaluates text models with multilingual support.