SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.
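A hedged sketch of how SimpleQA-style grades roll up into scores. Per the paper, each answer is graded as correct, incorrect, or not attempted, and the headline metrics are overall accuracy, accuracy on attempted questions, and their harmonic mean (an F-score). The official harness uses an LLM grader to assign these labels; here the grades are taken as given, and the function name is illustrative:

```python
from collections import Counter

def simpleqa_metrics(grades: list[str]) -> dict[str, float]:
    """Aggregate per-question grades ("correct", "incorrect",
    "not_attempted") into SimpleQA-style summary metrics."""
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]  # not-attempted answers excluded
    overall = correct / total if total else 0.0
    given_attempted = correct / attempted if attempted else 0.0
    # Harmonic mean of the two accuracies, zero if either is zero.
    f_score = (2 * overall * given_attempted / (overall + given_attempted)
               if overall + given_attempted else 0.0)
    return {"correct": overall,
            "correct_given_attempted": given_attempted,
            "f_score": f_score}
```

The F-score rewards models that both answer accurately and decline to answer rather than guess, which is how the benchmark connects raw accuracy to calibration.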

Paper
The SimpleQA paper, "Measuring short-form factuality in large language models," is available at https://arxiv.org/abs/2411.04368.

Progress Over Time

[Interactive timeline: model performance on SimpleQA over time, with the state-of-the-art frontier marked and models split into open vs. proprietary]
SimpleQA Leaderboard

44 models
[Leaderboard table: 44 models ranked by SimpleQA score, with columns for parameter count, context window, cost (input / output), and license. Providers represented include OpenAI, DeepSeek, Alibaba Cloud / Qwen Team, Moonshot AI, and Microsoft.]

FAQ

Common questions about SimpleQA

What is SimpleQA?

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.

Where can I read the SimpleQA paper?

The SimpleQA paper is available at https://arxiv.org/abs/2411.04368. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the SimpleQA leaderboard?

The SimpleQA leaderboard ranks 44 AI models by their performance on this benchmark. Currently, DeepSeek-V3.2-Exp by DeepSeek leads with a score of 0.971. The average score across all models is 0.378.

What is the highest SimpleQA score?

The highest SimpleQA score is 0.971, achieved by DeepSeek-V3.2-Exp from DeepSeek.

How many models have been evaluated?

44 models have been evaluated on the SimpleQA benchmark; all 44 results are self-reported, and none have been independently verified.

What does SimpleQA evaluate?

SimpleQA is categorized under factuality, general, and reasoning. The benchmark evaluates text models.
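The calibration side of the benchmark ("whether models know what they know") can be sketched as comparing a model's stated confidence to its observed accuracy. A minimal illustration, assuming each record pairs a stated confidence in [0, 1] with whether the answer was graded correct (the function and bucket count are illustrative, not the official harness):

```python
def calibration_buckets(records, num_buckets=10):
    """records: list of (stated_confidence in [0, 1], is_correct: bool).
    Returns (mean_confidence, accuracy, count) per non-empty bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for conf, correct in records:
        idx = min(int(conf * num_buckets), num_buckets - 1)
        buckets[idx].append((conf, correct))
    out = []
    for bucket in buckets:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            out.append((mean_conf, accuracy, len(bucket)))
    return out
```

A well-calibrated model's per-bucket accuracy tracks its mean stated confidence; large gaps indicate over- or under-confidence.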