SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.

Among the 46 AI models evaluated, DeepSeek-V3.2-Exp from DeepSeek currently leads the SimpleQA leaderboard with a score of 0.971.
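
To make a score like the one above concrete, here is a minimal Python sketch of how per-question grades could be aggregated into summary numbers. It rests on assumptions not stated on this page: that each answer is graded as "correct", "incorrect", or "not_attempted", and that the leaderboard score is the fraction of all questions graded correct. The function name `summarize_grades` and the sample data are hypothetical and for illustration only.

```python
from collections import Counter


def summarize_grades(grades):
    """Aggregate per-question grades into simple accuracy figures.

    Assumes each grade is one of "correct", "incorrect", or
    "not_attempted" (an assumed labelling, not taken from this page).
    """
    total = len(grades)
    if total == 0:
        raise ValueError("grades must not be empty")

    counts = Counter(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    return {
        # Fraction of all questions answered correctly.
        "overall_correct": correct / total,
        # Accuracy restricted to questions the model chose to answer.
        "correct_given_attempted": correct / attempted if attempted else 0.0,
        # Fraction of questions the model declined to answer.
        "not_attempted_rate": counts["not_attempted"] / total,
    }


# Made-up grades for ten questions, not real benchmark results.
example_grades = ["correct"] * 7 + ["incorrect"] * 2 + ["not_attempted"]
summary = summarize_grades(example_grades)
print(summary)
```

Comparing accuracy over all questions with accuracy over attempted questions only gives a rough view of whether a model declines to answer when it is unsure, which relates to the calibration aspect described above.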

Paper