GSM8k
Grade School Math 8K (GSM8k) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic.
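GSM8k reference solutions end with a final line of the form `#### <answer>`, which is what evaluation harnesses typically parse. A minimal sketch of extracting that answer (the sample solution string below is illustrative of the format):

```python
import re

def extract_final_answer(solution: str) -> str:
    """Extract the final numeric answer from a GSM8k-style solution.

    GSM8k reference solutions terminate with a line like '#### 72';
    thousands separators (e.g. '1,200') are stripped for comparison.
    """
    match = re.search(r"####\s*(-?[\d,\.]+)", solution)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    return match.group(1).replace(",", "")

# Illustrative solution string in the dataset's answer format.
solution = (
    "Natalia sold 48 clips in April and half as many in May.\n"
    "48 / 2 = 24 clips in May.\n"
    "48 + 24 = 72 clips in total.\n"
    "#### 72"
)
print(extract_final_answer(solution))  # -> 72
```

Comparing this extracted string against a model's parsed answer (exact match) is the standard way GSM8k accuracy is computed.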
Progress Over Time
Timeline of model performance on GSM8k, tracing the state-of-the-art frontier and distinguishing open from proprietary models.
GSM8k Leaderboard
47 models
| Rank | Organization | Parameters | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 2 | OpenAI | — | 200K | $15.00 / $60.00 |
| 3 | OpenAI | — | 128K | $75.00 / $150.00 |
| 4 | — | 405B | 128K | $0.89 / $0.89 |
| 5 | Anthropic | — | 200K | $3.00 / $15.00 |
| 5 | Anthropic | — | 200K | $3.00 / $15.00 |
| 7 | Google | 27B | 131K | $0.10 / $0.20 |
| 7 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 |
| 10 | DeepSeek | 236B | 8K | $0.14 / $0.28 |
| 11 | Anthropic | — | 200K | $15.00 / $75.00 |
| 12 | Amazon | — | 300K | $0.80 / $3.20 |
| 12 | Alibaba Cloud / Qwen Team | 15B | — | — |
| 14 | Amazon | — | 300K | $0.06 / $0.24 |
| 15 | Google | 12B | 131K | $0.05 / $0.10 |
| 16 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 |
| 17 | Mistral AI | 123B | 128K | $2.00 / $6.00 |
| 18 | Anthropic | — | 200K | $3.00 / $15.00 |
| 18 | Amazon | — | 128K | $0.03 / $0.14 |
| 20 | Moonshot AI | 1.0T | — | — |
| 21 | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 |
| 22 | — | 70B | — | — |
| 23 | Alibaba Cloud / Qwen Team | 32B | 128K | $0.09 / $0.09 |
| 23 | Alibaba Cloud / Qwen Team | 72B | — | — |
| 25 | Google | — | 2.1M | $2.50 / $10.00 |
| 26 | xAI | — | — | — |
| 27 | Google | 4B | 131K | $0.02 / $0.04 |
| 28 | Anthropic | — | 200K | $0.25 / $1.25 |
| 29 | Alibaba Cloud / Qwen Team | 7B | — | — |
| 29 | Microsoft | 60B | — | — |
| 31 | Microsoft | 4B | — | — |
| 32 | AI21 Labs | 398B | 256K | $2.00 / $8.00 |
| 33 | Google | — | 1.0M | $0.15 / $0.60 |
| 33 | Microsoft | 4B | 128K | $0.10 / $0.10 |
| 35 | Alibaba Cloud / Qwen Team | 7B | — | — |
| 36 | Alibaba Cloud / Qwen Team | 8B | — | — |
| 37 | — | 8B | 128K | $0.50 / $0.50 |
| 38 | Mistral AI | 24B | — | — |
| 39 | — | 3B | 128K | $0.01 / $0.02 |
| 40 | AI21 Labs | 52B | 256K | $0.20 / $0.40 |
| 41 | Google | 27B | — | — |
| 42 | Cohere | 104B | 128K | $0.25 / $1.00 |
| 43 | — | 7B | — | — |
| 44 | Google | 9B | — | — |
| 45 | Google | 1B | — | — |
| 46 | — | 8B | — | — |
| 47 | Baidu | 21B | 128K | $0.40 / $4.00 |
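The Cost column lists paired input/output prices. Assuming these are quoted per million tokens, as is conventional for hosted-model APIs (the page does not state the unit), a rough per-request cost can be sketched as follows; the token counts below are illustrative:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Estimate the dollar cost of one request.

    Prices are assumed to be in $ per 1M tokens (a common API
    convention, not stated explicitly in the table above).
    """
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# A GSM8k problem is short: assume ~200 input and ~300 output tokens.
# Using the rank-5 row's prices ($3.00 input / $15.00 output):
cost = estimate_cost(200, 300, 3.00, 15.00)
print(f"${cost:.6f}")  # (200*3.00 + 300*15.00) / 1e6 = $0.005100
```

Multiplying by the dataset's 1,319-problem test set gives a ballpark cost for a full evaluation run at those prices.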
FAQ
Common questions about GSM8k
**What is GSM8k?**
Grade School Math 8K (GSM8k) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic.

**Where can I read the GSM8k paper?**
The GSM8k paper is available at https://arxiv.org/abs/2110.14168. It details the benchmark methodology, dataset creation, and evaluation criteria.

**Which model leads the GSM8k leaderboard?**
The GSM8k leaderboard ranks 47 AI models by their performance on this benchmark. Kimi K2 Instruct by Moonshot AI currently leads with a score of 0.973; the average score across all models is 0.864.

**What is the highest GSM8k score?**
The highest GSM8k score is 0.973, achieved by Kimi K2 Instruct from Moonshot AI.

**How many models have been evaluated on GSM8k?**
47 models have been evaluated on the GSM8k benchmark. All 47 results are self-reported; none have been independently verified.

**What does GSM8k measure?**
GSM8k is categorized under math and reasoning, and it evaluates text models.