Qasper

About this benchmark

What is Qasper?

QASPER is a dataset of 5,049 information-seeking questions and answers anchored in 1,585 NLP research papers. Each question is written by an NLP practitioner who read only the paper's title and abstract, and it seeks information found in the full text. A separate set of NLP practitioners answers each question from the full paper and provides supporting evidence. Answering often requires reasoning across multiple sections of a paper, which makes the dataset a test of document-grounded question answering over academic text.

Qasper is a text benchmark that evaluates models on long-context and reasoning tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.409, and the leader scores 0.419.


Current leaders

Phi-3.5-mini-instruct from Microsoft currently leads the Qasper leaderboard with a score of 0.419, the higher of the 2 evaluated models.

Rank  Model                  Organization  Score
1     Phi-3.5-mini-instruct  Microsoft     41.9%
2     Phi-3.5-MoE-instruct   Microsoft     40.0%
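
The leaderboard summary figures can be rechecked from the two listed scores. The snippet below is a minimal sketch, assuming the percentages above map directly onto the 0–1 scale; the dictionary and variable names are illustrative and not part of any LLM Stats API. The unrounded mean of the two listed scores is 0.4095, which the page reports as 0.409.

```python
# Scores copied from the leaderboard above, converted to the 0-1 scale.
# The dictionary layout is an assumption made for illustration.
scores = {
    "Phi-3.5-mini-instruct": 0.419,
    "Phi-3.5-MoE-instruct": 0.400,
}

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# The average is the plain arithmetic mean over all listed models.
mean_score = sum(scores.values()) / len(scores)

print(f"Leader: {leader} ({scores[leader]:.3f})")
print(f"Mean over {len(scores)} models: {mean_score:.4f}")
```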

Source paper

Title: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Authors: Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, and 2 others
Published
Abstract

Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
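
The abstract reports the gap to human performance in F1 points. As a rough illustration of how an answer-level F1 of this kind is commonly computed, the sketch below implements a token-overlap F1 between a predicted answer and a reference answer. It assumes a SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the function names are hypothetical, and this is not the paper's official evaluation script.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, drop articles, split into tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (illustrative sketch)."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction with one correct token out of three, against a one-token reference.
print(token_f1("BERT and RoBERTa", "BERT"))
```

On this definition, a verbose prediction is penalized through precision and an incomplete one through recall, so both over-answering and under-answering lower the score.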

FAQ

Common questions about the Qasper benchmark and leaderboard.

What is the Qasper benchmark?

QASPER is a dataset of 5,049 information-seeking questions and answers anchored in 1,585 NLP research papers. Questions are written by NLP practitioners who read only each paper's title and abstract; a separate set of practitioners answers them from the full text and provides supporting evidence.

What is the Qasper leaderboard?

The Qasper leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Phi-3.5-mini-instruct by Microsoft leads with a score of 0.419. The average score across all models is 0.409.

What is the highest Qasper score?

The highest Qasper score is 0.419, achieved by Phi-3.5-mini-instruct from Microsoft.

How many models are evaluated on Qasper?

Two models have been evaluated on the Qasper benchmark. Both results are self-reported; none has been verified.

Where can I find the Qasper paper?

The Qasper paper is available at https://arxiv.org/abs/2105.03011. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Qasper cover?

Qasper is categorized under long context and reasoning. The benchmark evaluates text models.

What is the best open-source model on Qasper?

Phi-3.5-mini-instruct by Microsoft is the top-ranked open-source model on Qasper, with a score of 0.419 (rank #1).

How recent are the Qasper leaderboard results?

The Qasper leaderboard was last updated in July 2026 and currently includes 2 evaluated models.