DROP

Progress Over Time

Interactive timeline showing model performance evolution on DROP

Chart series: state-of-the-art frontier, open models, proprietary models.

DROP Leaderboard

30 models
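Rank  Organization  Parameters  Context  Cost
1     DeepSeek      671B        -        -
2     -             -           -        -
2     -             -           -        -
4     -             1.0T        1.0M     $0.43 / $0.87
5     -             -           128K     $10.00 / $30.00
6     Amazon        -           -        -
7     -             405B        -        -
8     OpenAI        -           128K     $2.50 / $10.00
9     Anthropic     -           -        -
9     -             -           -        -
11    OpenAI        -           -        -
12    Amazon        -           -        -
13    -             -           -        -
14    -             70B         -        -
15    -             -           -        -
16    -             560B        -        -
17    -             -           -        -
18    -             -           -        -
19    Microsoft     15B         -        -
20    -             -           -        -
21    -             -           16K      $0.50 / $1.50
22    -             8B          -        -
22    -             2B          -        -
24    -             8B          -        -
25    -             8B          -        -
26    -             2B          -        -
26    -             8B          -        -
28    -             7B          -        -
29    -             8B          -        -
30    -             21B         -        -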
About this benchmark

What is DROP?

DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowdsourced, adversarially-created questions that require resolving references and performing discrete operations like addition, counting, or sorting, demanding comprehensive paragraph understanding beyond paraphrase-and-entity-typing shortcuts.
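To make "discrete operations" concrete, here is a minimal sketch in Python. The passage is invented for illustration and is not taken from the DROP dataset, and the helper `numbers_in` is a hypothetical name. It shows the three operation types named above (addition, counting, sorting) applied to numbers pulled from a paragraph.

```python
import re

# Hypothetical passage written for this example (not a DROP item).
passage = (
    "The Bears scored 14 points in the first quarter, 3 in the second, "
    "and 10 in the fourth. The Lions scored 7 points in the first quarter "
    "and 13 in the third."
)

def numbers_in(sentence: str) -> list[int]:
    """Return every integer mentioned in a sentence, in order of appearance."""
    return [int(tok) for tok in re.findall(r"\d+", sentence)]

# Resolve references: in this toy passage each team has its own sentence.
bears_sentence, lions_sentence = [s for s in passage.split(". ") if s]
bears = numbers_in(bears_sentence)   # [14, 3, 10]
lions = numbers_in(lions_sentence)   # [7, 13]

# Addition: "How many points did the Bears score in total?"
bears_total = sum(bears)

# Counting: "In how many quarters did the Bears score?"
bears_scoring_quarters = len(bears)

# Sorting: "What was the largest single-quarter score by either team?"
largest_quarter = sorted(bears + lions, reverse=True)[0]
```

A real system has to find the relevant spans itself; the sentence split here stands in for that reference-resolution step.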

DROP is a text benchmark evaluating models on math and reasoning tasks. LLM Stats tracks 30 models on this benchmark, scored on a 0–1 scale. The current average is 0.725, with the leader at 0.916.

Compare leaders on the best AI for math and best AI for reasoning leaderboards.

Current leaders

DeepSeek-V3 from DeepSeek currently leads the DROP leaderboard with a score of 0.916 across 30 evaluated AI models.

1. DeepSeek-V3 (DeepSeek): 91.6%
2. Claude 3.5 Sonnet (Anthropic): 87.1%
2. Claude 3.5 Sonnet (Anthropic): 87.1%

Source paper

Title
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
Authors
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, and 2 others
Published
Abstract

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
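The abstract reports results as F1 on a "generalized accuracy metric". As a rough illustration of what an F1 score over answers measures, the sketch below computes a plain token-overlap F1 between a predicted and a gold answer. This is a simplification under stated assumptions, not the benchmark's official scorer: the function name `token_f1` is hypothetical, and the answer normalization and number handling a full evaluator would apply are omitted.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between two answer strings (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

exact = token_f1("27", "27")                      # identical answers
partial = token_f1("Chicago Bears", "the Bears")  # one of two tokens shared
wrong = token_f1("20", "27")                      # no shared tokens
```

With this definition a partially correct span earns partial credit, while a wrong number earns none.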

FAQ

Common questions about the DROP benchmark and leaderboard.

What is the DROP benchmark?

DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowdsourced, adversarially-created questions that require resolving references and performing discrete operations like addition, counting, or sorting, demanding comprehensive paragraph understanding beyond paraphrase-and-entity-typing shortcuts.

What is the DROP leaderboard?

The DROP leaderboard ranks 30 AI models based on their performance on this benchmark. Currently, DeepSeek-V3 by DeepSeek leads with a score of 0.916. The average score across all models is 0.725.

What is the highest DROP score?

The highest DROP score is 0.916, achieved by DeepSeek-V3 from DeepSeek.

How many models are evaluated on DROP?

30 models have been evaluated on the DROP benchmark, with 0 verified results and 29 self-reported results.

Where can I find the DROP paper?

The DROP paper is available at https://arxiv.org/abs/1903.00161. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does DROP cover?

DROP is categorized under math and reasoning. The benchmark evaluates text models.

What is the best open-source model on DROP?

DeepSeek-V3 by DeepSeek is the top-ranked open-source model on DROP, with a score of 0.916 (rank #1).

Which model offers the best value on DROP?

Among models scoring within 10% of the leader, MiMo-V2.5-Pro from Xiaomi is the cheapest, at $0.43 per million input tokens with a score of 0.863.
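The selection rule in this answer can be sketched as a filter followed by a minimum. The code below assumes "within 10% of the leader" means a score of at least 90% of the top score; the page does not say whether the margin is relative or absolute. The `Entry` type, the `best_value` function, and all three entries are hypothetical. Two of the scores and one price mirror figures quoted on this page, and the remaining values are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    score: float       # DROP score on the 0-1 scale
    input_cost: float  # USD per million input tokens

def best_value(entries: list[Entry], tolerance: float = 0.10) -> Entry:
    """Cheapest entry whose score is within `tolerance` of the leader.

    Assumes a relative margin: score >= leader_score * (1 - tolerance).
    """
    leader_score = max(e.score for e in entries)
    cutoff = leader_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= cutoff]
    return min(eligible, key=lambda e: e.input_cost)

# Hypothetical entries; costs for model-a and model-c are invented.
entries = [
    Entry("model-a", score=0.916, input_cost=1.00),
    Entry("model-b", score=0.863, input_cost=0.43),
    Entry("model-c", score=0.800, input_cost=0.10),  # below the cutoff
]
winner = best_value(entries)
```

Here the cutoff is 0.916 × 0.9 ≈ 0.824, so model-c is excluded despite being cheapest, and model-b is selected.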

How recent are the DROP leaderboard results?

The DROP leaderboard was last updated in July 2026 and currently includes 30 evaluated models.