
GPQA

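GPQA is a dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof", meaning they remain hard to answer even with unrestricted web access, and experts who hold or are pursuing a PhD in the relevant field reach only 65% accuracy.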
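Accuracy on a multiple-choice benchmark like this is the fraction of questions for which the chosen option matches the answer key. The following is a minimal sketch of that calculation, not the official GPQA evaluation code; the function name `accuracy`, the option letters, and the toy data are all hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions where the prediction matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: option letters for five questions.
answer_key = ["A", "C", "B", "D", "A"]
model_output = ["A", "C", "D", "D", "b"]

score = accuracy(model_output, answer_key)
print(f"accuracy = {score:.3f}")  # 3 of 5 correct, prints "accuracy = 0.600"
```

Scores on the leaderboard are reported on the same 0 to 1 scale, so 0.600 here corresponds to 60% accuracy.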


Progress Over Time

[Interactive chart: model performance on GPQA over time, showing the state-of-the-art frontier for open and proprietary models]

GPQA Leaderboard

209 models

Rank	Model (organization)	Parameters	Context	Cost
1	Claude Mythos Preview Anthropic	—	—	$25.00 / $125.00
2	Gemini 3.1 Pro Google	—	1.0M	$2.50 / $15.00
3	Claude Opus 4.7 Anthropic	—	1.0M	$5.00 / $25.00
4	GPT-5.2 Pro OpenAI	—	400K	$21.00 / $168.00
5	GPT-5.4 OpenAI	—	1.0M	$2.50 / $15.00
6	GPT-5.2 OpenAI	—	400K	$1.75 / $14.00
7	Gemini 3 Pro Google	—	—	—
8	Claude Opus 4.6 Anthropic	—	1.0M	$5.00 / $25.00
9	Kimi K2.6 Moonshot AI	1.0T	262K	$0.95 / $4.00
10	Qwen3.6 Plus Alibaba Cloud / Qwen Team	—	—	—
10	Gemini 3 Flash Google	—	1.0M	$0.50 / $3.00
12	Claude Sonnet 4.6 Anthropic	—	200K	$3.00 / $15.00
13	Muse Spark Meta	—	—	—
14	Seed 2.0 Pro ByteDance	—	—	—
15	Grok-4 Heavy xAI	—	—	—
15	Qwen3.5-397B-A17B Alibaba Cloud / Qwen Team	397B	262K	$0.60 / $3.60
17	GPT-5.1 High OpenAI	—	—	—
17	GPT-5 Medium OpenAI	—	400K	$1.25 / $10.00
17	GPT-5.1 Thinking OpenAI	—	400K	$1.25 / $10.00
17	GPT-5.1 Instant OpenAI	—	400K	$1.25 / $10.00
17	GPT-5.1 OpenAI	—	400K	$1.25 / $10.00
22	GPT-5.4 mini OpenAI	—	400K	$0.75 / $4.50
23	Kimi K2.5 Moonshot AI	1.0T	262K	$0.60 / $3.00
24	Grok-4 xAI	—	—	—
25	GPT-5 High OpenAI	—	—	—
26	Claude Opus 4.5 Anthropic	—	200K	$5.00 / $25.00
27	Gemini 3.1 Flash-Lite Google	—	1.0M	$0.25 / $1.50
28	Qwen3.5-122B-A10B Alibaba Cloud / Qwen Team	122B	262K	$0.40 / $3.20
29	Gemini 2.5 Pro Preview 06-05 Google	—	1.0M	$1.25 / $10.00
30	GLM-5.1 Zhipu AI	754B	200K	$1.40 / $4.40
31	Qwen3.6-35B-A3B Alibaba Cloud / Qwen Team	35B	—	—
32	Grok 4 Fast xAI	—	2.0M	$0.20 / $0.50
32	GLM-4.7 Zhipu AI	358B	205K	$0.60 / $2.20
32	GPT-5 OpenAI	—	—	—
35	Qwen3.5-27B Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
36	Seed 2.0 Lite ByteDance	—	—	—
37	ERNIE 5.0 Baidu	—	—	—
38	Claude 3.7 Sonnet Anthropic	—	200K	$3.00 / $15.00
39	Grok-3 xAI	—	128K	$3.00 / $15.00
40	Kimi K2-Thinking-0905 Moonshot AI	1.0T	—	—
41	Gemma 4 31B Google	31B	262K	$0.14 / $0.40
42	Qwen3.5-35B-A3B Alibaba Cloud / Qwen Team	35B	262K	$0.25 / $2.00
43	Grok-3 Mini xAI	—	128K	$0.30 / $0.50
43	ChatGPT-4o Latest OpenAI	—	128K	$2.50 / $10.00
45	MiMo-V2-Flash Xiaomi	309B	256K	$0.10 / $0.30
46	Claude Sonnet 4.5 Anthropic	—	200K	$3.00 / $15.00
47	o3 OpenAI	—	200K	$2.00 / $8.00
48	Gemini 2.5 Pro Google	—	1.0M	$1.25 / $10.00
49	Gemini 2.5 Flash Google	—	1.0M	$0.30 / $2.50
49	GPT-5.4 nano OpenAI	—	400K	$0.20 / $1.25
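The rows above are tab-separated, so they can be loaded for further analysis with a few lines of code. The sketch below parses one row into typed fields. The `Row` type and `parse_row` function are hypothetical names. Reading the two cost figures as input and output prices is an assumption: the page does not label them.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    model: str                    # model name followed by organization
    parameters: Optional[str]     # e.g. "1.0T", or None when not listed
    context: Optional[str]        # e.g. "1.0M", or None when not listed
    cost_input: Optional[float]   # ASSUMPTION: first figure is the input price
    cost_output: Optional[float]  # ASSUMPTION: second figure is the output price


def _blank_to_none(cell: str) -> Optional[str]:
    """Map the table's placeholder dash to None."""
    cell = cell.strip()
    return None if cell == "—" else cell


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row with five cells."""
    rank, model, params, context, cost = line.rstrip("\n").split("\t")
    cost_input = cost_output = None
    if _blank_to_none(cost) is not None:
        # A cost cell looks like "$2.50 / $15.00".
        first, second = (part.strip().lstrip("$") for part in cost.split("/"))
        cost_input, cost_output = float(first), float(second)
    return Row(
        int(rank),
        model.strip(),
        _blank_to_none(params),
        _blank_to_none(context),
        cost_input,
        cost_output,
    )


row = parse_row("2\tGemini 3.1 Pro Google\t—\t1.0M\t$2.50 / $15.00")
empty = parse_row("7\tGemini 3 Pro Google\t—\t—\t—")
print(row)
print(empty)
```

Missing cells become `None` rather than zero, so they are not mistaken for real values when rows are sorted or averaged.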

Showing entries 1–50 of 209.


FAQ

Common questions about GPQA

What is the GPQA benchmark?

GPQA is a dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.

Where can I find the GPQA paper?

The GPQA paper is available at https://arxiv.org/abs/2311.12022. It describes the benchmark methodology, dataset creation, and evaluation criteria.

What is the GPQA leaderboard?

The GPQA leaderboard ranks 209 AI models by their score on this benchmark. Claude Mythos Preview by Anthropic currently leads with a score of 0.946, and the average score across all models is 0.652.

What is the highest GPQA score?

The highest GPQA score is 0.946, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on GPQA?

209 models have been evaluated on the GPQA benchmark, with 0 verified results and 207 self-reported results.

What categories does GPQA cover?

GPQA is categorized under biology, chemistry, general, physics, and reasoning. The benchmark evaluates text models.