GPQA

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are Google-proof and extremely difficult; even PhD-level experts reach only about 65% accuracy.
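Since GPQA is scored as plain multiple-choice accuracy, the headline numbers (e.g. 65% expert accuracy) can be reproduced with a few lines. The sketch below is a minimal, hypothetical illustration of that scoring; the `score` helper and the sample data are invented here, and the official evaluation harness is more involved.

```python
# Minimal sketch of multiple-choice accuracy scoring on GPQA-style data.
# The data and helper below are hypothetical, not the official harness.

def score(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("prediction/answer length mismatch")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Hypothetical example: four questions with options A-D.
preds = ["A", "C", "B", "D"]
key = ["A", "C", "D", "D"]
print(score(preds, key))  # 0.75
```

A leaderboard score of 0.946 on the 448-question set corresponds to roughly 424 questions answered correctly under this metric.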

Progress Over Time

[Interactive timeline showing model performance evolution on GPQA, with a state-of-the-art frontier line and open vs. proprietary models distinguished.]

GPQA Leaderboard

207 models
[Leaderboard table omitted: the extracted page retains only fragments of the per-model rows (rank, context window, parameter count, and input/output pricing in USD per 1M tokens) without model names. Organizations represented include Anthropic, OpenAI, Alibaba Cloud / Qwen Team, ByteDance, Moonshot AI, and Zhipu AI. The page shown 150 of 207 models, paginated 1/5.]

FAQ

Common questions about GPQA

What is GPQA?
GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are Google-proof and extremely difficult; even PhD-level experts reach only about 65% accuracy.

Where can I find the GPQA paper?
The GPQA paper is available at https://arxiv.org/abs/2311.12022. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How does the GPQA leaderboard work?
The GPQA leaderboard ranks 207 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.946. The average score across all models is 0.650.

What is the highest GPQA score?
The highest GPQA score is 0.946, achieved by Claude Mythos Preview from Anthropic.

How many models have been evaluated on GPQA?
207 models have been evaluated on the GPQA benchmark, with 0 verified results and 205 self-reported results.

What categories does GPQA cover?
GPQA is categorized under biology, chemistry, general, physics, and reasoning. The benchmark evaluates text models.