CommonSenseQA

Progress Over Time

[Figure: interactive timeline of model performance on CommonSenseQA, showing the state-of-the-art frontier for open and proprietary models.]

CommonSenseQA Leaderboard

1 model
Context | Cost | License
1 | 12B
About this benchmark

What is CommonSenseQA?

CommonSenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict correct answers. It contains 12,102 questions with one correct answer and four distractors, designed to test semantic reasoning and conceptual relationships. Questions are created based on ConceptNet concepts and require prior world knowledge for accurate reasoning.
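As a rough illustration of the format described above, the sketch below models one item as a question with five labeled choices (one correct answer plus four distractors) and scores predictions as plain accuracy on a 0–1 scale. The `Question` class, its field names, and the two example items are hypothetical and made up for this sketch; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One multiple-choice item: a stem, five labeled choices, one correct label."""
    stem: str
    choices: dict[str, str]  # label -> choice text (hypothetical layout)
    answer: str              # label of the single correct choice

    def __post_init__(self):
        if len(self.choices) != 5:
            raise ValueError("expected one correct answer plus four distractors")
        if self.answer not in self.choices:
            raise ValueError("answer label must be one of the choice labels")


def accuracy(questions, predictions):
    """Fraction of questions whose predicted label equals the correct label."""
    if not questions or len(questions) != len(predictions):
        raise ValueError("need at least one question and one prediction per question")
    correct = sum(pred == q.answer for q, pred in zip(questions, predictions))
    return correct / len(questions)


# Made-up items for illustration only; they are not taken from the dataset.
questions = [
    Question("Where would you keep a spare blanket?",
             {"A": "closet", "B": "oven", "C": "mailbox", "D": "river", "E": "chimney"},
             "A"),
    Question("What do people use to cut paper?",
             {"A": "spoon", "B": "scissors", "C": "pillow", "D": "sponge", "E": "rope"},
             "B"),
]
score = accuracy(questions, ["A", "C"])  # first prediction right, second wrong
print(f"{score:.3f} on the 0-1 scale, i.e. {score:.1%}")
```

With five choices per item, uniform random guessing would be expected to score about 0.2 under this metric.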

CommonSenseQA is a text benchmark evaluating models on reasoning and language tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. Because only one model is listed, the current average and the leading score are both 0.704.

Current leaders

Mistral NeMo Instruct from Mistral AI currently leads the CommonSenseQA leaderboard with a score of 0.704. It is the only model evaluated so far.

1. Mistral NeMo Instruct (Mistral AI): 70.4%

Source paper

Title
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Authors
Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant
Published
Abstract

When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from ConceptNet (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.
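The extraction step in the abstract (several target concepts sharing one semantic relation to a single source concept) can be sketched as a simple grouping over (source, relation, target) triples. This is only an illustration under stated assumptions: the function name `group_targets`, the `min_targets` threshold, and the toy triples are hypothetical, and the triples are hand-written rather than real ConceptNet data. It is not the authors' pipeline.

```python
from collections import defaultdict


def group_targets(triples, min_targets=2):
    """Group target concepts by (source, relation), keeping groups with enough targets.

    `min_targets` is an assumed threshold; the abstract only says "multiple".
    """
    grouped = defaultdict(list)
    for source, relation, target in triples:
        targets = grouped[(source, relation)]
        if target not in targets:  # skip duplicate triples
            targets.append(target)
    return {key: targets for key, targets in grouped.items()
            if len(targets) >= min_targets}


# Hand-written toy triples for illustration; not real ConceptNet data.
toy_triples = [
    ("river", "AtLocation", "waterfall"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "valley"),
    ("river", "UsedFor", "fishing"),  # only one target, so this group is dropped
]
question_sets = group_targets(toy_triples)
for (source, relation), targets in question_sets.items():
    print(source, relation, targets)
```

Each surviving group corresponds to one source concept for which a crowd-worker would write questions that mention the source and discriminate between the targets in turn.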

FAQ

Common questions about the CommonSenseQA benchmark and leaderboard.

What is the CommonSenseQA benchmark?

CommonSenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict correct answers. It contains 12,102 questions with one correct answer and four distractors, designed to test semantic reasoning and conceptual relationships. Questions are created based on ConceptNet concepts and require prior world knowledge for accurate reasoning.

What is the CommonSenseQA leaderboard?

The CommonSenseQA leaderboard ranks AI models by their performance on this benchmark; it currently lists 1 model. Mistral NeMo Instruct by Mistral AI leads with a score of 0.704, which is also the average score because it is the only entry.

What is the highest CommonSenseQA score?

The highest CommonSenseQA score is 0.704, achieved by Mistral NeMo Instruct from Mistral AI.

How many models are evaluated on CommonSenseQA?

1 model has been evaluated on the CommonSenseQA benchmark. Its result is self-reported; there are no verified results.

Where can I find the CommonSenseQA paper?

The CommonSenseQA paper is available at https://arxiv.org/abs/1811.00937. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does CommonSenseQA cover?

CommonSenseQA is categorized under reasoning and language. The benchmark evaluates text models.

What is the best open-source model on CommonSenseQA?

Mistral NeMo Instruct by Mistral AI is the top-ranked open-source model on CommonSenseQA, with a score of 0.704 (rank #1).

How recent are the CommonSenseQA leaderboard results?

The CommonSenseQA leaderboard was last updated in July 2026 and currently includes 1 evaluated model.