GSM8k

Grade School Math 8K (GSM8k) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.


Progress Over Time

[Interactive timeline: model performance on GSM8k over time. Legend: state-of-the-art frontier, open models, proprietary models.]

GSM8k Leaderboard

47 models
| Rank | Organization | Parameters | Context | Cost | License |
|------|--------------|------------|---------|------|---------|
| 1 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | |
| 2 | OpenAI | | 200K | $15.00 / $60.00 | |
| 3 | OpenAI | | 128K | $75.00 / $150.00 | |
| 4 | | 405B | 128K | $0.89 / $0.89 | |
| 5 | | | 200K | $3.00 / $15.00 | |
| 5 | | | 200K | $3.00 / $15.00 | |
| 7 | | 27B | 131K | $0.10 / $0.20 | |
| 7 | Alibaba Cloud / Qwen Team | 33B | | | |
| 9 | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | |
| 10 | | 236B | 8K | $0.14 / $0.28 | |
| 11 | Anthropic | | 200K | $15.00 / $75.00 | |
| 12 | Amazon | | 300K | $0.80 / $3.20 | |
| 12 | Alibaba Cloud / Qwen Team | 15B | | | |
| 14 | Amazon | | 300K | $0.06 / $0.24 | |
| 15 | | 12B | 131K | $0.05 / $0.10 | |
| 16 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 | |
| 17 | Mistral AI | 123B | 128K | $2.00 / $6.00 | |
| 18 | | | 200K | $3.00 / $15.00 | |
| 18 | | | 128K | $0.03 / $0.14 | |
| 20 | Moonshot AI | 1.0T | | | |
| 21 | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 | |
| 22 | | 70B | | | |
| 23 | Alibaba Cloud / Qwen Team | 32B | 128K | $0.09 / $0.09 | |
| 23 | Alibaba Cloud / Qwen Team | 72B | | | |
| 25 | | | 2.1M | $2.50 / $10.00 | |
| 26 | | | | | |
| 27 | | 4B | 131K | $0.02 / $0.04 | |
| 28 | | | 200K | $0.25 / $1.25 | |
| 29 | Alibaba Cloud / Qwen Team | 7B | | | |
| 29 | | 60B | | | |
| 31 | Microsoft | 4B | | | |
| 32 | | 398B | 256K | $2.00 / $8.00 | |
| 33 | | | 1.0M | $0.15 / $0.60 | |
| 33 | | 4B | 128K | $0.10 / $0.10 | |
| 35 | Alibaba Cloud / Qwen Team | 7B | | | |
| 36 | Alibaba Cloud / Qwen Team | 8B | | | |
| 37 | | 8B | 128K | $0.50 / $0.50 | |
| 38 | | 24B | | | |
| 39 | | 3B | 128K | $0.01 / $0.02 | |
| 40 | | 52B | 256K | $0.20 / $0.40 | |
| 41 | | 27B | | | |
| 42 | | 104B | 128K | $0.25 / $1.00 | |
| 43 | | 7B | | | |
| 44 | | 9B | | | |
| 45 | | 1B | | | |
| 46 | | 8B | | | |
| 47 | | 21B | 128K | $0.40 / $4.00 | |
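The ranks in the leaderboard repeat for tied entries and then skip the next position (5, 5, 7 and 12, 12, 14), which is consistent with standard competition ranking. The page does not state how ties are decided, so the following is only a sketch under the assumption that entries with equal scores share a rank. The function name and the example scores are hypothetical.

```python
def competition_ranks(scores: list[float]) -> list[int]:
    """Return standard competition ranks (1, 2, 2, 4, ...), highest score first."""
    ordered = sorted(scores, reverse=True)
    # A score's rank is the 1-based position of its first occurrence
    # in descending order, i.e. 1 + the number of strictly higher scores.
    first_position: dict[float, int] = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Hypothetical scores, not taken from the leaderboard.
example_scores = [0.95, 0.97, 0.95, 0.90]
print(competition_ranks(example_scores))  # [2, 1, 2, 4]
```

Two entries tied at rank 2 push the next entry to rank 4, which is the same pattern as the 5, 5, 7 sequence in the leaderboard.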

FAQ

Common questions about GSM8k

The GSM8k paper, "Training Verifiers to Solve Math Word Problems", is available at https://arxiv.org/abs/2110.14168. It describes the benchmark methodology, dataset creation, and evaluation criteria.
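GSM8k reference solutions end with a line of the form "#### <final answer>", and scores such as those on this leaderboard are usually reported as the fraction of problems whose final answer matches. The following is a minimal sketch of that kind of scoring, not the paper's or any vendor's exact harness. It assumes the model's answer is the last number in its output; the function names and the two example problems are hypothetical.

```python
import re
from typing import Optional

# Reference solutions end with "#### <final answer>".
ANSWER_MARKER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")
# Any integer or decimal, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(number_text: str) -> str:
    """Drop thousands separators and a trailing '.0' so '1,000.0' equals '1000'."""
    value = float(number_text.replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)


def reference_answer(solution: str) -> str:
    """Extract the final answer that follows the '####' marker."""
    match = ANSWER_MARKER.search(solution)
    if match is None:
        raise ValueError("no '####' marker found in reference solution")
    return normalize(match.group(1))


def predicted_answer(completion: str) -> Optional[str]:
    """Assumption: treat the last number in the model output as its answer."""
    numbers = NUMBER.findall(completion)
    return normalize(numbers[-1]) if numbers else None


def accuracy(completions: list[str], solutions: list[str]) -> float:
    """Fraction of problems whose predicted answer matches the reference."""
    correct = sum(
        predicted_answer(c) == reference_answer(s)
        for c, s in zip(completions, solutions)
    )
    return correct / len(solutions)


# Hypothetical problems, written for illustration only.
solutions = [
    "3 * 12 = 36 rolls baked. 36 - 30 = 6 rolls left.\n#### 6",
    "25 * 50 = 1250 dollars in total.\n#### 1,250",
]
completions = [
    "3 trays of 12 is 36 rolls; after selling 30 there are 6 left. The answer is 6.",
    "The total is 1200.",
]
print(accuracy(completions, solutions))  # 0.5
```

Because the results on this page are self-reported, the prompting setup and answer-extraction rule behind each score may differ from this sketch.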
The GSM8k leaderboard ranks 47 AI models based on their performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.973. The average score across all models is 0.864.
The highest GSM8k score is 0.973, achieved by Kimi K2 Instruct from Moonshot AI.
47 models have been evaluated on the GSM8k benchmark. All 47 results are self-reported; none are verified.
GSM8k is categorized under math and reasoning. The benchmark evaluates text models.