GPQA
A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof" and extremely difficult: even PhD-level experts reach only 65% accuracy.
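Since GPQA is scored as multiple-choice accuracy, a minimal scoring sketch looks like the following. This is an illustration only: the option letters and the example predictions are hypothetical, and a real harness would also handle answer extraction from free-form model output.

```python
def score(predictions, answers):
    """Accuracy = fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("prediction/answer length mismatch")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical example: 3 of 4 questions answered correctly.
print(score(["A", "C", "B", "D"], ["A", "C", "B", "A"]))  # → 0.75
```

On this scale, the reported PhD-expert baseline of 65% corresponds to a score of 0.65.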
Progress Over Time
[Interactive timeline showing model performance evolution on GPQA; legend: state-of-the-art frontier, open models, proprietary models]
GPQA Leaderboard
209 models
| Rank | Organization | Params | Context | Cost (input / output, per 1M tokens) |
|---|---|---|---|---|
| 1 | Anthropic | — | — | $25.00 / $125.00 |
| 2 | Google | — | 1.0M | $2.50 / $15.00 |
| 3 | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 4 | OpenAI | — | 400K | $21.00 / $168.00 |
| 5 | OpenAI | — | 1.0M | $2.50 / $15.00 |
| 6 | OpenAI | — | 400K | $1.75 / $14.00 |
| 7 | Google | — | — | — |
| 8 | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 9 | Kimi K2.6 (Moonshot AI) | 1.0T | 262K | $0.95 / $4.00 |
| 10 | Alibaba Cloud / Qwen Team | — | — | — |
| 10 | Google | — | 1.0M | $0.50 / $3.00 |
| 12 | Anthropic | — | 200K | $3.00 / $15.00 |
| 13 | Meta | — | — | — |
| 14 | ByteDance | — | — | — |
| 15 | xAI | — | — | — |
| 15 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 17 | OpenAI | — | — | — |
| 17 | OpenAI | — | 400K | $1.25 / $10.00 |
| 17 | OpenAI | — | 400K | $1.25 / $10.00 |
| 17 | OpenAI | — | 400K | $1.25 / $10.00 |
| 17 | OpenAI | — | 400K | $1.25 / $10.00 |
| 22 | OpenAI | — | 400K | $0.75 / $4.50 |
| 23 | Moonshot AI | 1.0T | 262K | $0.60 / $3.00 |
| 24 | xAI | — | — | — |
| 25 | OpenAI | — | — | — |
| 26 | Anthropic | — | 200K | $5.00 / $25.00 |
| 27 | Google | — | 1.0M | $0.25 / $1.50 |
| 28 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 29 | — | — | 1.0M | $1.25 / $10.00 |
| 30 | Zhipu AI | 754B | 200K | $1.40 / $4.40 |
| 31 | Alibaba Cloud / Qwen Team | 35B | — | — |
| 32 | xAI | — | 2.0M | $0.20 / $0.50 |
| 32 | Zhipu AI | 358B | 205K | $0.60 / $2.20 |
| 32 | OpenAI | — | — | — |
| 35 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 36 | ByteDance | — | — | — |
| 37 | Baidu | — | — | — |
| 38 | Anthropic | — | 200K | $3.00 / $15.00 |
| 39 | xAI | — | 128K | $3.00 / $15.00 |
| 40 | Moonshot AI | 1.0T | — | — |
| 41 | Google | 31B | 262K | $0.14 / $0.40 |
| 42 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 43 | xAI | — | 128K | $0.30 / $0.50 |
| 43 | OpenAI | — | 128K | $2.50 / $10.00 |
| 45 | Xiaomi | 309B | 256K | $0.10 / $0.30 |
| 46 | Anthropic | — | 200K | $3.00 / $15.00 |
| 47 | OpenAI | — | 200K | $2.00 / $8.00 |
| 48 | Google | — | 1.0M | $1.25 / $10.00 |
| 49 | Google | — | 1.0M | $0.30 / $2.50 |
| 49 | OpenAI | — | 400K | $0.20 / $1.25 |
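The cost column quotes input and output prices per 1M tokens, so the cost of one full GPQA run can be estimated as below. This is a hedged sketch: the per-question token counts are illustrative assumptions, not measured values, and real runs vary with prompt format and reasoning length.

```python
def run_cost(n_questions, in_tokens_per_q, out_tokens_per_q,
             price_in_per_m, price_out_per_m):
    """Total dollar cost; prices are quoted per 1M input/output tokens."""
    cost_in = n_questions * in_tokens_per_q * price_in_per_m / 1_000_000
    cost_out = n_questions * out_tokens_per_q * price_out_per_m / 1_000_000
    return cost_in + cost_out

# 448 questions at an assumed 600 input / 2,000 output tokens each,
# priced like the $2.50 / $15.00 rows in the table:
print(round(run_cost(448, 600, 2000, 2.50, 15.00), 2))  # → 14.11
```

Note that for reasoning-heavy models, output tokens dominate the bill, so the assumed output length is the sensitive parameter here.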
Showing 1–50 of 209 models (page 1 of 5).
FAQ
Common questions about GPQA
The GPQA paper is available at https://arxiv.org/abs/2311.12022. It describes the benchmark's methodology, dataset creation, and evaluation criteria in detail.
The GPQA leaderboard ranks 209 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.946. The average score across all models is 0.652.
The highest GPQA score is 0.946, achieved by Claude Mythos Preview from Anthropic.
209 models have been evaluated on the GPQA benchmark; 207 of the results are self-reported, and none have been independently verified.
GPQA is categorized under biology, chemistry, general, physics, and reasoning. The benchmark evaluates text models.