SWE-bench Multilingual
SWE-bench Multilingual Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Anthropic | — | — | — | ||
| 2 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 3 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 3 | Anthropic | — | 1.0M | $3.00 / $15.00 | ||
| 5 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 6 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 | ||
| 7 | MiniMax | — | 205K | $0.30 / $1.20 | ||
| 8 | DeepSeek | 1.6T | 1.0M | $1.60 / $3.20 | ||
| 9 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | ||
| 10 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 11 | DeepSeek | 284B | 1.0M | $0.10 / $0.20 | ||
| 12 | Moonshot AI | 1.0T | — | — | ||
| 13 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 14 | Xiaomi | 1.0T | — | — | ||
| 14 | Xiaomi | 309B | — | — | ||
| 16 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 17 | DeepSeek | 685B | — | — | ||
| 17 | DeepSeek | 685B | — | — | ||
| 19 | Alibaba Cloud / Qwen Team | 397B | — | — | ||
| 20 | | 550B | — | — | ||
| 21 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 22 | Zhipu AI | 358B | — | — | ||
| 23 | Microsoft | — | — | — | ||
| 24 | Moonshot AI | 1.0T | — | — | ||
| 25 | DeepSeek | 685B | — | — | ||
| 26 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 27 | Alibaba Cloud / Qwen Team | 480B | — | — | ||
| 28 | DeepSeek | 671B | — | — | ||
| 29 | Moonshot AI | 1.0T | — | — | ||
| 29 | Moonshot AI | 1.0T | — | — | ||
| 31 | | 120B | — | — | ||
| 32 | Meituan | 69B | 256K | $0.10 / $0.40 | ||
| 33 | DeepSeek | 671B | 131K | $0.55 / $2.19 |
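The table cells use compact notation such as `262K`, `1.0M`, `1.0T`, and `$5.00 / $25.00`. The following is a minimal Python sketch for turning those strings into numbers. It assumes the two cost figures are input and output prices in USD per million tokens, which the page does not state explicitly; the function names are hypothetical and not part of any LLM Stats API.

```python
import re
from typing import Optional, Tuple

# Multipliers for the K/M/B/T suffixes seen in the table (e.g. "262K", "1.0M", "284B", "1.0T").
_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_count(cell: str) -> Optional[float]:
    """Convert a cell such as '1.0M' or '284B' to a number; return None for placeholders or blanks."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)([KMBT])\s*", cell)
    if match is None:
        return None
    return float(match.group(1)) * _SUFFIX[match.group(2)]


def parse_cost(cell: str) -> Optional[Tuple[float, float]]:
    """Split a cell such as '$5.00 / $25.00' into its two prices, in the order shown."""
    match = re.fullmatch(r"\s*\$(\d+(?:\.\d+)?)\s*/\s*\$(\d+(?:\.\d+)?)\s*", cell)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def estimate_cost(cell: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Estimate a job's USD cost, ASSUMING the cell holds input/output prices per million tokens."""
    prices = parse_cost(cell)
    if prices is None:
        return None
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    print(parse_count("262K"))                                  # 262000.0
    print(parse_cost("$5.00 / $25.00"))                         # (5.0, 25.0)
    print(estimate_cost("$5.00 / $25.00", 1_000_000, 200_000))  # 10.0
```

Under that pricing assumption, a job with one million input tokens and 200,000 output tokens at `$5.00 / $25.00` would cost about $10. Cells that hold only a placeholder dash return `None` rather than zero, so missing data is not mistaken for a free or zero-parameter model.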
What is SWE-bench Multilingual?
A multilingual benchmark for issue resolving in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate large language models across software ecosystems beyond Python.
SWE-bench Multilingual is a text benchmark that evaluates models on reasoning and code tasks. LLM Stats tracks 33 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leader scores 0.873.
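As a rough illustration of how a 0–1 score and a leaderboard average could arise, the sketch below assumes a model's score is the fraction of benchmark instances it resolves, which is the usual convention for issue-resolving benchmarks but is not stated on this page. The model names and counts are hypothetical.

```python
from statistics import mean
from typing import Dict


def resolve_rate(resolved: int, total: int) -> float:
    """Fraction of instances resolved, on a 0-1 scale (assumed scoring convention)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return resolved / total


def leaderboard_average(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-model scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return mean(scores.values())


if __name__ == "__main__":
    # Hypothetical models and counts, for illustration only.
    scores = {
        "model-a": resolve_rate(270, 300),
        "model-b": resolve_rate(210, 300),
        "model-c": resolve_rate(150, 300),
    }
    print(scores)
    print(round(leaderboard_average(scores), 3))
```

The average here is unweighted across models; the page does not say whether its reported average is computed this way.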
Compare leaders on the best AI for reasoning and best AI for code leaderboards.
Current leaders
Claude Mythos Preview from Anthropic currently leads the SWE-bench Multilingual leaderboard with a score of 0.873, ranking first among the 33 evaluated models.
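The rank column in the leaderboard repeats some positions and then skips the next one (for example 3, 3, 5), which is consistent with standard competition ranking. The page does not document its tie rule, so the Python sketch below is an assumption: models with equal scores share a rank, and the following rank is skipped. Model names and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def competition_rank(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Rank models by descending score; equal scores share a rank and the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # tie: reuse the previous model's rank
        else:
            rank = position       # no tie: rank equals 1-based position
        ranked.append((rank, name, score))
    return ranked


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = {"model-a": 0.80, "model-b": 0.70, "model-c": 0.70, "model-d": 0.60}
    for rank, name, score in competition_rank(example):
        print(rank, name, score)
    # Ranks printed: 1, 2, 2, 4
```

Ties here are detected by exact equality, so scores should be rounded to the displayed precision before ranking if the goal is to reproduce a published table.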
Source paper
- Title: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
- Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, and 15 others
- Published: arXiv 2504.02605
Abstract
The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
FAQ
Common questions about the SWE-bench Multilingual benchmark and leaderboard.