What is the SimpleQA leaderboard?

The SimpleQA leaderboard ranks 46 AI models based on their performance on this benchmark. Currently, DeepSeek-V3.2-Exp by DeepSeek leads with a score of 0.971. The average score across all models is 0.382.

What is the highest SimpleQA score?

The highest SimpleQA score is 0.971, achieved by DeepSeek-V3.2-Exp from DeepSeek.

How many models are evaluated on SimpleQA?

46 models have been evaluated on the SimpleQA benchmark. All 46 results are self-reported; none have been independently verified.

Where can I find the SimpleQA paper?

The SimpleQA paper is available at https://arxiv.org/abs/2411.04368. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does SimpleQA cover?

SimpleQA is categorized under factuality, general, and reasoning. The benchmark evaluates text models.

SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.
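
To make the "know what they know" idea concrete, here is a minimal sketch of how per-question grades could be rolled up into summary numbers. It assumes each response has already been graded with one of three labels, "correct", "incorrect", or "not_attempted"; these label names, the `summarize_grades` function, and the example data are hypothetical illustrations and are not taken from this page or from the leaderboard's scoring code.

```python
from collections import Counter

# Hypothetical grade labels; the actual grading scheme is defined in the paper.
ALLOWED_GRADES = {"correct", "incorrect", "not_attempted"}


def summarize_grades(grades):
    """Aggregate per-question grades into simple summary metrics.

    grades: non-empty list of strings drawn from ALLOWED_GRADES.
    Returns a dict with overall accuracy, attempt rate, and accuracy
    restricted to the questions the model actually attempted.
    """
    if not grades:
        raise ValueError("grades must not be empty")
    unknown = set(grades) - ALLOWED_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    return {
        # Fraction of all questions answered correctly.
        "accuracy": counts["correct"] / total,
        # Fraction of questions where the model gave an answer at all.
        "attempt_rate": attempted / total,
        # Accuracy over attempted questions only; 0.0 if nothing was attempted.
        "accuracy_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }


# Made-up example: 10 graded questions, not real benchmark data.
example_grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
summary = summarize_grades(example_grades)
print(summary)
```

Under these assumptions, a model that declines to answer when it is unsure keeps its accuracy on attempted questions high even if its overall accuracy is lower, which is one simple way to see whether a model knows what it knows. Whether the leaderboard scores on this page correspond to any of these quantities is not stated here.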

DeepSeek-V3.2-Exp from DeepSeek currently leads the SimpleQA leaderboard with a score of 0.971 across 46 evaluated AI models.
