Qasper
Qasper Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Microsoft | 4B | — | — | |
| 2 | Microsoft | 60B | — | — | |
What is Qasper?
QASPER is a dataset of 5,049 information-seeking questions and answers anchored in 1,585 NLP research papers. Each question is written by an NLP practitioner who read only the title and abstract of a paper, and it seeks information present in the full text. A separate set of NLP practitioners then answers each question and provides supporting evidence for the answer. Answering requires complex reasoning across multiple sections of a paper, which makes the dataset a demanding test of question answering over academic documents.
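The structure described above can be sketched as a small data model. This is only an illustration: the class and field names (`Answer`, `QasperExample`, `evidence`, and so on) are assumptions made for this example and are not the official QASPER schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    """One annotator's answer. Field names are illustrative, not the official schema."""
    text: str
    # Passages from the full paper that support the answer.
    evidence: List[str] = field(default_factory=list)


@dataclass
class QasperExample:
    """A question written from the title and abstract, answered from the full text."""
    paper_title: str
    paper_abstract: str
    full_text: str
    question: str
    answers: List[Answer] = field(default_factory=list)


def has_evidence(example: QasperExample) -> bool:
    """Return True if at least one answer cites supporting evidence."""
    return any(answer.evidence for answer in example.answers)


# A made-up record, used only to show the shape of the data.
example = QasperExample(
    paper_title="A hypothetical NLP paper",
    paper_abstract="We propose a hypothetical model.",
    full_text="Section 1 ... Section 4: We train on a hypothetical corpus.",
    question="Which corpus is the model trained on?",
    answers=[
        Answer(
            text="A hypothetical corpus",
            evidence=["We train on a hypothetical corpus."],
        )
    ],
)
```

Keeping the evidence next to each answer mirrors the description: the question writer sees only `paper_title` and `paper_abstract`, while the answer and its evidence come from `full_text`.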
Qasper is a text benchmark that evaluates models on long-context and reasoning tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is about 0.4, with the leader at 0.419.
Compare leaders on the best AI for long context and best AI for reasoning leaderboards.
Current leaders
Phi-3.5-mini-instruct from Microsoft currently leads the Qasper leaderboard with a score of 0.419, the highest of the 2 evaluated AI models.
Source paper
- Title: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
- Authors: Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, and 2 others
- Published: arXiv:2105.03011
Abstract
Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
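The abstract reports results in F1 points. As a rough illustration of what such a score measures, here is a minimal sketch of token-overlap F1 between a predicted and a reference answer. It assumes lowercasing and whitespace tokenization, and it assumes that a prediction is scored against the best-matching reference when several annotators answered. The official QASPER scorer may normalize text and handle answer types differently, and the function names `token_f1` and `best_f1` are made up for this example.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; a single empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: List[str]) -> float:
    """Score against several reference answers and keep the best match (an assumption)."""
    return max(token_f1(prediction, ref) for ref in references)


# Precision is 2/3 and recall is 1, so F1 is 0.8.
score = token_f1("the cat sat", "the cat")
```

On this sketch's 0 to 1 scale, a value of 0.8 would correspond to 80 F1 points.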
FAQ
Common questions about the Qasper benchmark and leaderboard.