Qasper
QASPER is a dataset of 5,049 information-seeking questions anchored in 1,585 NLP research papers. Questions are written by NLP practitioners who read only a paper's title and abstract; a separate set of practitioners then answers each question from the full paper text and marks the supporting evidence. The benchmark therefore challenges models with complex reasoning across document sections for academic document question answering.
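Answer quality on Qasper is typically reported as SQuAD-style token-level F1 between the predicted and gold answer spans. A minimal sketch of that metric, assuming simple lowercased whitespace tokenization (the official evaluation script additionally normalizes punctuation and articles and handles unanswerable questions):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the BERT model", "BERT model")` yields 0.8 (precision 2/3, recall 1.0). When a question has multiple gold answers, the convention is to take the maximum F1 over them.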
Progress Over Time
Interactive timeline showing model performance evolution on Qasper (state-of-the-art frontier; open vs. proprietary models).
Qasper Leaderboard
2 models
| Rank | Model | Organization | Params | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Phi-3.5-mini-instruct | Microsoft | 4B | 128K | $0.10 / $0.10 | — |
| 2 | — | Microsoft | 60B | — | — | — |
FAQ
Common questions about Qasper
The Qasper paper is available at https://arxiv.org/abs/2105.03011. It details the benchmark methodology, dataset creation process, and evaluation criteria.
The Qasper leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Phi-3.5-mini-instruct by Microsoft leads with a score of 0.419. The average score across all models is 0.409.
The highest Qasper score is 0.419, achieved by Phi-3.5-mini-instruct from Microsoft.
2 models have been evaluated on the Qasper benchmark; both results are self-reported, and none are independently verified.
Qasper is categorized under long context and reasoning. The benchmark evaluates text models.