PIQA

PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language. Each question presents an everyday goal together with two candidate solutions, and a model must choose the one that makes physical sense; the examples, inspired by instructables.com, often hinge on atypical uses of everyday objects. The dataset contains about 21,000 such multiple-choice questions.
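For concreteness, here is a minimal sketch of inspecting the dataset with the Hugging Face datasets library. The dataset id ("ybisk/piqa") and the goal/sol1/sol2/label field names are assumptions based on the publicly hosted copy of the benchmark, not something this page specifies.

```python
# Minimal sketch: inspect a PIQA example via Hugging Face datasets.
# Assumption: the hosted copy is "ybisk/piqa" with fields goal, sol1, sol2, label.
from datasets import load_dataset

piqa = load_dataset("ybisk/piqa", split="validation")

example = piqa[0]
print("Goal:      ", example["goal"])   # everyday physical goal or task
print("Solution 1:", example["sol1"])   # candidate solution A
print("Solution 2:", example["sol2"])   # candidate solution B
print("Label:     ", example["label"])  # 0 -> sol1 is correct, 1 -> sol2
```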

Paper

The PIQA paper, "PIQA: Reasoning about Physical Commonsense in Natural Language" (Bisk et al., AAAI 2020), is available at https://arxiv.org/abs/1911.11641.

Progress Over Time

[Interactive timeline: evolution of model performance on PIQA over time, with the state-of-the-art frontier marked and models split into open and proprietary.]

PIQA Leaderboard

11 models • 0 verified
Rank  Model / Organization              Size  Score  Context  Cost per 1M tokens (in / out)
1     Phi-3.5-MoE-instruct (Microsoft)  60B   0.886  -        -
2     Nous Research                     70B   -      -        -
3     -                                 27B   -      -        -
4     -                                 9B    -      -        -
5     -                                 4B    -      128K     $0.10 / $0.10
5     -                                 8B    -      -        -
5     -                                 2B    -      -        -
8     -                                 2B    -      -        -
8     -                                 8B    -      -        -
10    Microsoft                         4B    -      -        -
11    -                                 21B   -      128K     $0.40 / $4.00

FAQ

Common questions about PIQA

What is PIQA?
PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language, described in full at the top of this page.

Where can I find the PIQA paper?
The PIQA paper is available at https://arxiv.org/abs/1911.11641. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the PIQA leaderboard?
The PIQA leaderboard ranks 11 AI models by their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.886. The average score across all models is 0.792.

What is the highest PIQA score?
The highest PIQA score is 0.886, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models have been evaluated on PIQA?
11 models have been evaluated on the PIQA benchmark, with 0 verified results and 11 self-reported results.

What categories does PIQA cover?
PIQA is categorized under general, physics, and reasoning. The benchmark evaluates text models.
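
The scores above are accuracy on PIQA's two-way choice. A common way to obtain them for a causal language model, used by open evaluation harnesses such as lm-evaluation-harness, is to compare the log-likelihood the model assigns to each candidate solution given the goal. The sketch below illustrates that recipe; the gpt2 checkpoint and the exact prompt format are placeholder assumptions, not what any leaderboard entry used.

```python
# Hedged sketch: zero-shot PIQA scoring by comparing solution log-likelihoods.
# Checkpoint and prompt template are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def solution_loglik(goal: str, solution: str) -> float:
    """Sum of log-probs of the solution tokens, conditioned on the goal."""
    prompt = f"Goal: {goal}\nSolution:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    # Assumes the prompt tokens remain a prefix after appending the solution,
    # which holds for typical BPE tokenizers when the solution starts with a space.
    full_ids = tok(prompt + " " + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_ll = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the solution continuation.
    return token_ll[:, prompt_len - 1:].sum().item()

def predict(goal: str, sol1: str, sol2: str) -> int:
    """Return 0 if sol1 scores higher, 1 otherwise (PIQA's label convention)."""
    return 0 if solution_loglik(goal, sol1) >= solution_loglik(goal, sol2) else 1
```

Accuracy is then the fraction of questions where predict matches the gold label; length-normalizing each log-likelihood by its token count is a common variant of this recipe.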

Sub-benchmarks