SimpleQA
SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.
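For readers who want to see how a SimpleQA number is produced, below is a minimal sketch of a SimpleQA-style evaluation loop. It is a sketch, not the official harness: the CSV location and the "problem"/"answer" column names are assumptions borrowed from OpenAI's simple-evals repository (https://github.com/openai/simple-evals), `ask_model` is a hypothetical stand-in for whatever model is being tested, and the exact-match grader is a toy substitute for the LLM-based grader the paper actually uses, which labels each response CORRECT, INCORRECT, or NOT_ATTEMPTED.

```python
"""Minimal SimpleQA-style evaluation harness (sketch, see assumptions above)."""
import csv
import io
import urllib.request
from collections import Counter
from typing import Callable

# Test-set CSV published alongside openai/simple-evals (assumed location).
CSV_URL = ("https://openaipublic.blob.core.windows.net/simple-evals/"
           "simple_qa_test_set.csv")

def grade(predicted: str, target: str) -> str:
    """Toy exact-match stand-in for the paper's LLM-based grader."""
    if not predicted.strip():
        return "NOT_ATTEMPTED"
    same = predicted.strip().lower() == target.strip().lower()
    return "CORRECT" if same else "INCORRECT"

def evaluate(ask_model: Callable[[str], str], limit: int = 100) -> None:
    raw = urllib.request.urlopen(CSV_URL).read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(raw)))[:limit]
    counts = Counter(grade(ask_model(r["problem"]), r["answer"]) for r in rows)
    # Overall accuracy rewards knowing the answer; "correct given attempted"
    # rewards abstaining when unsure, which is how the benchmark gets at
    # whether a model knows what it knows.
    attempted = counts["CORRECT"] + counts["INCORRECT"]
    print(f"accuracy: {counts['CORRECT'] / len(rows):.3f}")
    if attempted:
        print(f"correct given attempted: {counts['CORRECT'] / attempted:.3f}")

if __name__ == "__main__":
    # Hypothetical model that always abstains, just to exercise the harness.
    evaluate(lambda question: "")
```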
Progress Over Time
[Interactive timeline showing model performance evolution on SimpleQA: the state-of-the-art frontier over time, with open and proprietary models distinguished.]
SimpleQA Leaderboard
44 models
| Rank | Organization | Params | Context (tokens) | Cost ($/1M tokens, input / output) |
|---|---|---|---|---|
| 1 | DeepSeek | 685B | — | — |
| 2 | xAI | — | 2.0M | $0.20 / $0.50 |
| 3 | DeepSeek | 671B | 164K | $0.27 / $1.00 |
| 4 | DeepSeek | 671B | 131K | $0.50 / $2.15 |
| 5 | Baidu | — | — | — |
| 6 | Google | — | — | — |
| 7 | Google | — | 1.0M | $0.50 / $3.00 |
| 8 | OpenAI | — | 128K | $75.00 / $150.00 |
| 9 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 10 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 |
| 11 | Google | — | 1.0M | $1.25 / $10.00 |
| 12 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 |
| 13 | Google | — | 1.0M | $1.25 / $10.00 |
| 14 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 15 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
| 16 | OpenAI | — | 200K | $15.00 / $60.00 |
| 17 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 18 | Google | — | 1.0M | $0.25 / $1.50 |
| 19 | OpenAI | — | 128K | $15.00 / $60.00 |
| 20 | OpenAI | — | 128K | $2.50 / $10.00 |
| 21 | Moonshot AI | 1.0T | — | — |
| 22 | Moonshot AI | 1.0T | — | — |
| 22 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 24 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 25 | Google | — | 1.0M | $0.30 / $2.50 |
| 26 | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 27 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 28 | Mistral AI | 675B | — | — |
| 28 | Mistral AI | 675B | 262K | $0.50 / $1.50 |
| 28 | Mistral AI | 675B | — | — |
| 28 | Mistral AI | 675B | — | — |
| 32 | Google | — | 1.0M | $0.07 / $0.30 |
| 33 | MiniMax | 456B | 1.0M | $0.55 / $2.20 |
| 34 | MiniMax | 456B | — | — |
| 35 | OpenAI | — | 200K | $1.10 / $4.40 |
| 36 | Mistral AI | 24B | — | — |
| 37 | Google | — | 1.0M | $0.10 / $0.40 |
| 38 | Mistral AI | 24B | — | — |
| 39 | Google | 27B | 131K | $0.10 / $0.20 |
| 40 | Google | 12B | 131K | $0.05 / $0.10 |
| 41 | Google | 4B | 131K | $0.02 / $0.04 |
| 42 | Microsoft | 15B | 16K | $0.07 / $0.14 |
| 43 | Google | 1B | — | — |
| 44 | Baidu | 21B | 128K | $0.40 / $4.00 |
FAQ
Common questions about SimpleQA
What is SimpleQA?
SimpleQA is a factuality benchmark developed by OpenAI, comprising 4,326 adversarially collected, fact-seeking questions with single, indisputable answers; see the overview at the top of this page.

Where can I read the SimpleQA paper?
The SimpleQA paper is available at https://arxiv.org/abs/2411.04368. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the SimpleQA leaderboard?
The SimpleQA leaderboard ranks 44 AI models by their performance on the benchmark. Currently, DeepSeek-V3.2-Exp by DeepSeek leads with a score of 0.971. The average score across all models is 0.378.

What is the highest SimpleQA score?
The highest SimpleQA score is 0.971, achieved by DeepSeek-V3.2-Exp from DeepSeek.

How many models have been evaluated on SimpleQA?
44 models have been evaluated on the SimpleQA benchmark, with 0 verified results and 44 self-reported results.

What categories does SimpleQA fall under?
SimpleQA is categorized under factuality, general, and reasoning. The benchmark evaluates text models.