PIQA

PIQA Leaderboard

[Leaderboard table: 11 models; columns: Context, Cost, License]

About this benchmark

What is PIQA?

PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language. It tests whether AI systems can answer questions that require knowledge of the physical world, posed as multiple-choice questions about everyday situations, with an emphasis on atypical solutions inspired by instructables.com. The dataset contains about 21,000 questions; for each one, a model must choose the most appropriate solution to a physical-interaction goal.
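As a rough illustration of the task format, the sketch below models one question as a goal with two candidate solutions and scores predictions by plain accuracy. It assumes the two-solution format used in the source paper; the class name `PiqaItem`, its field names, and the `accuracy` helper are hypothetical and chosen for this example only. The sample item paraphrases the eyeshadow question from the paper's abstract, and the label marking the cotton swab as correct is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaItem:
    """One question: a goal, two candidate solutions, and the correct index."""
    goal: str
    solutions: tuple[str, str]
    label: int  # 0 or 1, index of the more appropriate solution


def accuracy(items: list[PiqaItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted index matches the label."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.label)
    return correct / len(items)


# Illustrative item paraphrased from the abstract; the label is an assumption.
example = PiqaItem(
    goal="Apply eyeshadow without a brush",
    solutions=("Use a cotton swab", "Use a toothpick"),
    label=0,
)

print(accuracy([example], [0]))  # a model that picks the first solution
```

Because every question has exactly two options, a model that guesses uniformly at random is expected to score about 0.5, which is a useful floor when reading the leaderboard scores.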

PIQA is a text benchmark evaluating models on physics, reasoning, and general tasks. LLM Stats tracks 11 models on this benchmark, scored on a 0–1 scale. The current average is 0.792, with the leader at 0.886.

Current leaders

Among the 11 evaluated AI models, Phi-3.5-MoE-instruct from Microsoft currently leads the PIQA leaderboard with a score of 0.886.

1. Phi-3.5-MoE-instruct (Microsoft): 88.6%
2. Hermes 3 70B (Nous Research): 84.4%
3. Gemma 2 27B (Google): 83.2%
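The page reports scores both on a 0–1 scale and as percentages. The sketch below shows one way to go from the former to the latter while ranking entries. The function name `rank_and_format` is hypothetical, and the 0–1 values are simply the three percentages above divided by 100.

```python
def rank_and_format(scores: dict[str, float]) -> list[str]:
    """Sort models by score (highest first) and render each as a percentage."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        f"{rank}. {name}: {score:.1%}"
        for rank, (name, score) in enumerate(ranked, start=1)
    ]


# Top three entries from the list above, expressed on the 0-1 scale.
top_three = {
    "Hermes 3 70B": 0.844,
    "Phi-3.5-MoE-instruct": 0.886,
    "Gemma 2 27B": 0.832,
}

for line in rank_and_format(top_three):
    print(line)
```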

Source paper

Title
PIQA: Reasoning about Physical Commonsense in Natural Language
Authors
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and 1 other
Published
November 2019 (arXiv:1911.11641)
Abstract

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

FAQ

Common questions about the PIQA benchmark and leaderboard.

What is the PIQA leaderboard?

The PIQA leaderboard ranks 11 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.886. The average score across all models is 0.792.
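The overall average can be combined with the three scores shown under "Current leaders" to estimate how the other eight models perform. The sketch below is a back-of-the-envelope calculation, not a figure published by the leaderboard: the helper name `mean_of_rest` is hypothetical, and because the published numbers are rounded, the result is only approximate.

```python
def mean_of_rest(overall_mean: float, total_count: int, known_scores: list[float]) -> float:
    """Estimate the mean of the unlisted scores from the overall mean."""
    rest_count = total_count - len(known_scores)
    if rest_count <= 0:
        raise ValueError("known_scores must cover fewer than total_count models")
    total_sum = overall_mean * total_count
    return (total_sum - sum(known_scores)) / rest_count


# Overall average and model count as stated on the page; top three from the leaders list.
estimate = mean_of_rest(0.792, 11, [0.886, 0.844, 0.832])
print(round(estimate, 3))  # roughly 0.769
```

Under these rounded inputs, the remaining eight models average about 0.77, roughly 12 points below the leader.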

What is the highest PIQA score?

The highest PIQA score is 0.886, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models are evaluated on PIQA?

11 models have been evaluated on the PIQA benchmark, with 0 verified results and 11 self-reported results.

Where can I find the PIQA paper?

The PIQA paper is available at https://arxiv.org/abs/1911.11641. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does PIQA cover?

PIQA is categorized under physics, reasoning, and general. The benchmark evaluates text models.

Are there variants of PIQA?

Yes. PIQA has 1 related variant: Global PIQA.

What is the best open-source model on PIQA?

Phi-3.5-MoE-instruct by Microsoft is the top-ranked open-source model on PIQA, with a score of 0.886 (rank #1).

How recent are the PIQA leaderboard results?

The PIQA leaderboard was last updated in June 2026 and currently includes 11 evaluated models.