PIQA
PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language. It tests an AI system's ability to apply physical-world knowledge: each question poses an everyday goal and asks the model to choose the more appropriate of two candidate solutions, which are often atypical and inspired by instructables.com. The dataset contains roughly 21,000 such questions.
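To make the task format concrete, here is a minimal sketch of how a two-choice PIQA item can be scored. The field names (`goal`, `sol1`, `sol2`, `label`) follow the public release of the dataset; the toy word-overlap scorer is purely illustrative — real evaluations typically compare a language model's log-likelihoods for the two solutions.

```python
# Minimal sketch of PIQA-style evaluation: each item has a goal, two
# candidate solutions, and a label (0 or 1) marking the correct one.

def evaluate(items, score_fn):
    """Pick the higher-scoring solution for each item; return accuracy."""
    correct = 0
    for item in items:
        candidates = [item["sol1"], item["sol2"]]
        prediction = max(range(2), key=lambda i: score_fn(item["goal"], candidates[i]))
        correct += int(prediction == item["label"])
    return correct / len(items)

# Toy scorer: prefers the solution sharing more words with the goal.
# A real evaluation would use model log-likelihoods here instead.
def overlap_score(goal, solution):
    return len(set(goal.lower().split()) & set(solution.lower().split()))

items = [
    {"goal": "keep bread fresh",
     "sol1": "store the bread in a sealed bag",
     "sol2": "leave it open on the counter",
     "label": 0},
]
print(evaluate(items, overlap_score))  # → 1.0 on this toy item
```

Because the task is binary choice, random guessing scores 0.5, which is the natural baseline to compare leaderboard accuracies against.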
Progress Over Time
[Interactive timeline showing model performance evolution on PIQA, with a state-of-the-art frontier and open vs. proprietary models distinguished.]
PIQA Leaderboard
11 models • 0 verified
| Rank | Organization | Params | Context | Cost (in / out) |
|---|---|---|---|---|
| 1 | Microsoft | 60B | — | — |
| 2 | Nous Research | 70B | — | — |
| 3 | Google | 27B | — | — |
| 4 | Google | 9B | — | — |
| 5 | Microsoft | 4B | 128K | $0.10 / $0.10 |
| 5 | Google | 8B | — | — |
| 5 |  | 2B | — | — |
| 8 |  | 2B | — | — |
| 8 | Google | 8B | — | — |
| 10 | Microsoft | 4B | — | — |
| 11 | Baidu | 21B | 128K | $0.40 / $4.00 |
FAQ
Common questions about PIQA
**What is PIQA?**
PIQA (Physical Interaction: Question Answering) is a benchmark for physical commonsense reasoning in natural language: given an everyday goal, a model must choose which of two candidate solutions is physically appropriate. The solutions are often atypical, inspired by instructables.com, and the dataset contains roughly 21,000 questions.

**Where can I find the PIQA paper?**
The PIQA paper is available at https://arxiv.org/abs/1911.11641. It details the benchmark methodology, dataset construction, and evaluation criteria.

**How are models ranked on the PIQA leaderboard?**
The leaderboard ranks 11 AI models by their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.886; the average score across all models is 0.792.

**What is the highest PIQA score?**
The highest PIQA score is 0.886, achieved by Phi-3.5-MoE-instruct from Microsoft.

**How many models have been evaluated on PIQA?**
11 models have been evaluated on the PIQA benchmark, with 0 verified results and 11 self-reported results.

**What categories does PIQA belong to?**
PIQA is categorized under general, physics, and reasoning. The benchmark evaluates text models.
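For readers who want to inspect the data themselves, a hedged sketch of turning a PIQA record into a two-choice prompt. The record schema (`goal`, `sol1`, `sol2`, `label`) follows the public release; the dataset id `piqa` on the Hugging Face hub and the prompt template itself are illustrative assumptions, not an official harness.

```python
# Sketch: formatting a PIQA record as a two-choice prompt string.
# The prompt template is an illustrative choice, not an official one.

def to_prompt(example):
    return (
        f"Goal: {example['goal']}\n"
        f"A) {example['sol1']}\n"
        f"B) {example['sol2']}\n"
        "Which solution is physically sensible?"
    )

# Loading the real data (requires the `datasets` package and network access;
# the dataset id is assumed here):
#   from datasets import load_dataset
#   piqa = load_dataset("piqa", split="validation")
#   print(to_prompt(piqa[0]))

# Hypothetical record in the same schema, for demonstration:
example = {"goal": "Open a stuck jar lid",
           "sol1": "Run the lid under hot water to expand the metal",
           "sol2": "Put the jar in the freezer overnight",
           "label": 0}
print(to_prompt(example))
```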