MGSM

MGSM (Multilingual Grade School Math) is a benchmark of grade-school math problems. It contains 250 problems manually translated from the GSM8K dataset into ten typologically diverse languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, and Telugu. It evaluates multilingual mathematical reasoning capabilities.

Paper: https://arxiv.org/abs/2210.03057

Progress Over Time

[Interactive timeline showing model performance evolution on MGSM; the chart marks the state-of-the-art frontier and distinguishes open from proprietary models.]

MGSM Leaderboard

31 models • 0 verified
| Rank | Model / Organization | Score | Params | Context | Input cost | Output cost |
|------|----------------------|-------|--------|---------|------------|-------------|
| 1 | Llama 4 Maverick (Meta) | 0.923 | 400B | 1.0M | $0.17 | $0.85 |
| 2 | OpenAI | 0.920 | — | 200K | $1.10 | $4.40 |
| 3 | — | 0.916 | — | 200K | $3.00 | $15.00 |
| 3 | — | 0.916 | — | 200K | $3.00 | $15.00 |
| 5 | — | 0.911 | 70B | 128K | $0.20 | $0.20 |
| 6 | — | 0.908 | — | 128K | $15.00 | $60.00 |
| 7 | Anthropic | 0.907 | — | 200K | $15.00 | $75.00 |
| 8 | — | 0.906 | 109B | 10.0M | $0.08 | $0.30 |
| 9 | OpenAI | 0.905 | — | 128K | $2.50 | $10.00 |
| 10 | OpenAI | 0.893 | — | 200K | $15.00 | $60.00 |
| 11 | — | 0.885 | — | 128K | $10.00 | $30.00 |
| 12 | — | 0.875 | — | 2.1M | $2.50 | $10.00 |
| 13 | — | 0.870 | — | 128K | $0.15 | $0.60 |
| 14 | — | 0.869 | 90B | 128K | $0.35 | $0.40 |
| 15 | — | 0.856 | — | 200K | $0.80 | $4.00 |
| 16 | Alibaba Cloud / Qwen Team | 0.835 | 235B | 128K | $0.10 | $0.10 |
| 17 | — | 0.835 | — | 200K | $3.00 | $15.00 |
| 18 | — | 0.826 | — | 1.0M | $0.15 | $0.60 |
| 19 | Microsoft | 0.806 | 15B | 16K | $0.07 | $0.14 |
| 20 | — | 0.751 | — | 200K | $0.25 | $1.25 |
| 21 | OpenAI | 0.745 | — | 33K | $30.00 | $60.00 |
| 22 | — | 0.689 | 11B | 128K | $0.05 | $0.05 |
| 23 | — | 0.670 | 8B | 32K | $20.00 | $40.00 |
| 24 | Microsoft | 0.639 | 4B | — | — | — |
| 25 | — | 0.607 | 2B | — | — | — |
| 26 | — | 0.587 | 60B | — | — | — |
| 27 | — | 0.582 | 3B | 128K | $0.01 | $0.02 |
| 28 | — | 0.563 | — | 16K | $0.50 | $1.50 |
| 29 | — | 0.531 | 2B | — | — | — |
| 29 | — | 0.531 | 8B | — | — | — |
| 31 | — | 0.479 | 4B | 128K | $0.10 | $0.10 |

FAQ

Common questions about MGSM

What is MGSM?
MGSM (Multilingual Grade School Math) is a benchmark of 250 grade-school math problems manually translated from the GSM8K dataset into ten typologically diverse languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, and Telugu. It evaluates multilingual mathematical reasoning capabilities.

Where can I find the MGSM paper?
The MGSM paper is available at https://arxiv.org/abs/2210.03057. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How does the MGSM leaderboard work?
The MGSM leaderboard ranks 31 AI models by their performance on this benchmark. Currently, Llama 4 Maverick by Meta leads with a score of 0.923. The average score across all models is 0.779.

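The quoted average can be reproduced directly from the score column of the leaderboard table above:

```python
# The 31 MGSM scores from the leaderboard, in rank order.
scores = [
    0.923, 0.920, 0.916, 0.916, 0.911, 0.908, 0.907, 0.906, 0.905, 0.893,
    0.885, 0.875, 0.870, 0.869, 0.856, 0.835, 0.835, 0.826, 0.806, 0.751,
    0.745, 0.689, 0.670, 0.639, 0.607, 0.587, 0.582, 0.563, 0.531, 0.531,
    0.479,
]
average = sum(scores) / len(scores)
print(len(scores), round(average, 3))  # 31 0.779
```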
What is the highest MGSM score?
The highest MGSM score is 0.923, achieved by Llama 4 Maverick from Meta.

How many models have been evaluated on MGSM?
31 models have been evaluated on the MGSM benchmark, with 0 verified results and 31 self-reported results.

What categories does MGSM fall under?
MGSM is categorized under math and reasoning. The benchmark evaluates text models with multilingual support.