GSM8k

GSM8k (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.
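As a rough illustration of how a GSM8k score between 0 and 1 can be computed, the sketch below measures exact-match accuracy on the final numeric answer. Reference solutions in the dataset end with a line of the form `#### <answer>`. Everything else here is an assumption made for illustration: the helper names (`reference_answer`, `predicted_answer`, `accuracy`), the rule that the last number in a model's output is its final answer, and the toy examples, which are not taken from the dataset. Self-reported results may use different prompts and answer-extraction rules.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, with optional thousands separators (e.g. "1,000").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
# GSM8k reference solutions end with "#### <final answer>".
_REFERENCE = re.compile(r"####\s*(" + _NUMBER.pattern + r")")


def _to_number(text: str) -> float:
    """Convert a matched number such as '1,000' to a float."""
    return float(text.replace(",", ""))


def reference_answer(solution: str) -> Optional[float]:
    """Extract the final answer that follows '####' in a reference solution."""
    match = _REFERENCE.search(solution)
    return _to_number(match.group(1)) if match else None


def predicted_answer(output: str) -> Optional[float]:
    """Assumption: treat the last number in the model output as its final answer."""
    numbers = _NUMBER.findall(output)
    return _to_number(numbers[-1]) if numbers else None


def accuracy(outputs: Sequence[str], solutions: Sequence[str]) -> float:
    """Fraction of problems whose predicted final answer equals the reference."""
    if not solutions or len(outputs) != len(solutions):
        raise ValueError("outputs and solutions must be non-empty and equal in length")
    correct = 0
    for output, solution in zip(outputs, solutions):
        ref = reference_answer(solution)
        pred = predicted_answer(output)
        if ref is not None and pred is not None and abs(pred - ref) < 1e-6:
            correct += 1
    return correct / len(solutions)


# Toy, hypothetical examples (not from the dataset).
toy_solutions = ["3 + 4 = 7\n#### 7", "10 * 100 = 1,000\n#### 1,000"]
toy_outputs = ["She has 3 + 4 = 7 apples. The answer is 7.", "The total is 999."]
print(accuracy(toy_outputs, toy_solutions))
```

Under this metric, a leaderboard score of 0.973 would correspond to 97.3% of evaluated problems answered correctly.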

Progress Over Time

[Interactive timeline: model performance on GSM8k over time, showing the state-of-the-art frontier for open and proprietary models.]

GSM8k Leaderboard

47 models

Rank	Model (Organization)	Parameters	Context	Cost (input / output, USD per 1M tokens)
1	Kimi K2 Instruct Moonshot AI	1.0T	200K	$0.50 / $0.50
2	o1 OpenAI	—	200K	$15.00 / $60.00
3	GPT-4.5 OpenAI	—	128K	$75.00 / $150.00
4	Llama 3.1 405B Instruct Meta	405B	128K	$0.89 / $0.89
5	Claude 3.5 Sonnet Anthropic	—	200K	$3.00 / $15.00
5	Claude 3.5 Sonnet Anthropic	—	200K	$3.00 / $15.00
7	Gemma 3 27B Google	27B	131K	$0.10 / $0.20
7	Qwen2.5 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
9	Qwen2.5 72B Instruct Alibaba Cloud / Qwen Team	73B	131K	$0.35 / $0.40
10	DeepSeek-V2.5 DeepSeek	236B	8K	$0.14 / $0.28
11	Claude 3 Opus Anthropic	—	200K	$15.00 / $75.00
12	Nova Pro Amazon	—	300K	$0.80 / $3.20
12	Qwen2.5 14B Instruct Alibaba Cloud / Qwen Team	15B	—	—
14	Nova Lite Amazon	—	300K	$0.06 / $0.24
15	Gemma 3 12B Google	12B	131K	$0.05 / $0.10
16	Qwen3 235B A22B Alibaba Cloud / Qwen Team	235B	128K	$0.10 / $0.10
17	Mistral Large 2 Mistral AI	123B	128K	$2.00 / $6.00
18	Claude 3 Sonnet Anthropic	—	200K	$3.00 / $15.00
18	Nova Micro Amazon	—	128K	$0.03 / $0.14
20	Kimi K2 Base Moonshot AI	1.0T	—	—
21	Qwen2.5 7B Instruct Alibaba Cloud / Qwen Team	8B	131K	$0.30 / $0.30
22	Llama 3.1 Nemotron 70B Instruct NVIDIA	70B	—	—
23	Qwen2.5-Coder 32B Instruct Alibaba Cloud / Qwen Team	32B	128K	$0.09 / $0.09
23	Qwen2 72B Instruct Alibaba Cloud / Qwen Team	72B	—	—
25	Gemini 1.5 Pro Google	—	2.1M	$2.50 / $10.00
26	Grok-1.5 xAI	—	—	—
27	Gemma 3 4B Google	4B	131K	$0.02 / $0.04
28	Claude 3 Haiku Anthropic	—	200K	$0.25 / $1.25
29	Qwen2.5-Omni-7B Alibaba Cloud / Qwen Team	7B	—	—
29	Phi-3.5-MoE-instruct Microsoft	60B	—	—
31	Phi 4 Mini Microsoft	4B	—	—
32	Jamba 1.5 Large AI21 Labs	398B	256K	$2.00 / $8.00
33	Gemini 1.5 Flash Google	—	1.0M	$0.15 / $0.60
33	Phi-3.5-mini-instruct Microsoft	4B	128K	$0.10 / $0.10
35	Qwen2.5-Coder 7B Instruct Alibaba Cloud / Qwen Team	7B	—	—
36	Qwen2 7B Instruct Alibaba Cloud / Qwen Team	8B	—	—
37	Granite 3.3 8B Instruct IBM	8B	128K	$0.50 / $0.50
38	Mistral Small 3 24B Base Mistral AI	24B	—	—
39	Llama 3.2 3B Instruct Meta	3B	128K	$0.01 / $0.02
40	Jamba 1.5 Mini AI21 Labs	52B	256K	$0.20 / $0.40
41	Gemma 2 27B Google	27B	—	—
42	Command R+ Cohere	104B	128K	$0.25 / $1.00
43	IBM Granite 4.0 Tiny Preview IBM	7B	—	—
44	Gemma 2 9B Google	9B	—	—
45	Gemma 3 1B Google	1B	—	—
46	Granite 3.3 8B Base IBM	8B	—	—
47	ERNIE 4.5 Baidu	21B	128K	$0.40 / $4.00

FAQ

Common questions about GSM8k

What is the GSM8k benchmark?

GSM8k (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.

Where can I find the GSM8k paper?

The GSM8k paper is available at https://arxiv.org/abs/2110.14168. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

What is the GSM8k leaderboard?

The GSM8k leaderboard ranks 47 AI models by their performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.973. The average score across all models is 0.864.

What is the highest GSM8k score?

The highest GSM8k score is 0.973, achieved by Kimi K2 Instruct from Moonshot AI.

How many models are evaluated on GSM8k?

47 models have been evaluated on the GSM8k benchmark. None of the results are verified; all 47 are self-reported.

What categories does GSM8k cover?

GSM8k is categorized under math and reasoning. The benchmark evaluates text models.