DROP

DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowdsourced, adversarially-created questions that require resolving references and performing discrete operations like addition, counting, or sorting, demanding comprehensive paragraph understanding beyond paraphrase-and-entity-typing shortcuts.
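
To make "discrete operations" concrete, here is a minimal sketch of the counting, addition, and sorting steps such a question can require. The passage and questions are hypothetical, written for illustration only. They are not drawn from the DROP dataset, and the regex extraction stands in for the reading step a real model would perform.

```python
import re

# Hypothetical passage written for illustration; NOT taken from the DROP dataset.
passage = (
    "The Bears scored a 22-yard field goal in the first quarter. "
    "The Lions answered with a 35-yard field goal. "
    "In the second half, the Bears added field goals of 41 and 19 yards."
)

# Pull every integer out of the passage (here, all are field-goal distances).
yards = [int(n) for n in re.findall(r"\d+", passage)]

# Counting: "How many field goals were kicked?"
field_goal_count = len(yards)

# Addition: "How many total yards of field goals were kicked?"
total_yards = sum(yards)

# Sorting: "What was the longest field goal?"
longest_first = sorted(yards, reverse=True)
longest = longest_first[0]

# Subtraction: "How many yards longer was the longest than the shortest?"
spread = longest_first[0] - longest_first[-1]

print(field_goal_count, total_yards, longest, spread)
```

None of the four answers appears verbatim in the passage; each must be computed from several mentions, which is the property the benchmark description emphasizes.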

Paper: https://arxiv.org/abs/1903.00161

Progress Over Time

[Interactive timeline: model performance on DROP over time, with the state-of-the-art frontier shown for open and proprietary models.]

DROP Leaderboard

29 models
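| Rank | Organization | Parameters | Context | Cost (input / output) |
|------|--------------|------------|---------|-----------------------|
| 1 | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 2 | - | - | 200K | $3.00 / $15.00 |
| 2 | - | - | 200K | $3.00 / $15.00 |
| 4 | - | - | 128K | $10.00 / $30.00 |
| 5 | Amazon | - | 300K | $0.80 / $3.20 |
| 6 | - | 405B | 128K | $0.89 / $0.89 |
| 7 | OpenAI | - | 128K | $2.50 / $10.00 |
| 8 | - | - | 200K | $0.80 / $4.00 |
| 8 | Anthropic | - | 200K | $15.00 / $75.00 |
| 10 | OpenAI | - | 33K | $30.00 / $60.00 |
| 11 | Amazon | - | 300K | $0.06 / $0.24 |
| 12 | - | - | 128K | $0.15 / $0.60 |
| 13 | - | 70B | 128K | $0.20 / $0.20 |
| 14 | - | - | 128K | $0.03 / $0.14 |
| 15 | - | 560B | 128K | $0.30 / $1.20 |
| 16 | - | - | 200K | $3.00 / $15.00 |
| 17 | - | - | 200K | $0.25 / $1.25 |
| 18 | Microsoft | 15B | 16K | $0.07 / $0.14 |
| 19 | - | - | 2.1M | $2.50 / $10.00 |
| 20 | - | - | 16K | $0.50 / $1.50 |
| 21 | - | 2B | - | - |
| 21 | - | 8B | - | - |
| 23 | - | 8B | 131K | $0.03 / $0.03 |
| 24 | - | 8B | 128K | $0.50 / $0.50 |
| 25 | - | 8B | - | - |
| 25 | - | 2B | - | - |
| 27 | - | 7B | - | - |
| 28 | - | 8B | - | - |
| 29 | - | 21B | 128K | $0.40 / $4.00 |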
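
The leaderboard gives parameter counts, context windows, and costs as compact strings such as `671B`, `131K`, `2.1M`, and `$0.27 / $1.10`. The sketch below shows one way to turn them into numbers for sorting or comparison. The helper names `parse_count` and `parse_cost` are hypothetical, and treating K, M, and B as decimal multipliers (1,000, not 1,024) is an assumption, not something the page states.

```python
import re
from typing import Tuple

# Assumption: suffixes are decimal multipliers (K = 1,000, not 1,024).
SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert a value such as '131K', '2.1M', or '671B' to an integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * SUFFIX[suffix])


def parse_cost(text: str) -> Tuple[float, float]:
    """Split a string such as '$0.27 / $1.10' into (input_price, output_price)."""
    match = re.fullmatch(
        r"\$(\d+(?:\.\d+)?)\s*/\s*\$(\d+(?:\.\d+)?)", text.strip()
    )
    if match is None:
        raise ValueError(f"unrecognized cost: {text!r}")
    return float(match.group(1)), float(match.group(2))


print(parse_count("671B"), parse_count("131K"), parse_cost("$0.27 / $1.10"))
```

Raising `ValueError` on anything unrecognized, instead of guessing, keeps rows with missing cells from being silently read as zero.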

FAQ

Common questions about DROP

The DROP paper is available at https://arxiv.org/abs/1903.00161. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The DROP leaderboard ranks 29 AI models based on their performance on this benchmark. Currently, DeepSeek-V3 by DeepSeek leads with a score of 0.916. The average score across all models is 0.720.
The highest DROP score is 0.916, achieved by DeepSeek-V3 from DeepSeek.
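
As a rough illustration of how a leader and an average score could be derived from per-model results, consider the sketch below. Only the DeepSeek-V3 score of 0.916 comes from this page; the other two entries are hypothetical placeholders, so the computed average will not match the reported 0.720, which depends on all 29 scores.

```python
from statistics import mean

# Only DeepSeek-V3's 0.916 is taken from the leaderboard.
# The remaining entries are HYPOTHETICAL placeholders for illustration.
scores = {
    "DeepSeek-V3": 0.916,
    "hypothetical-model-a": 0.800,
    "hypothetical-model-b": 0.600,
}

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# The average is the arithmetic mean over all listed models.
average = mean(scores.values())

print(f"{leader} leads with {scores[leader]:.3f}; average {average:.3f}")
```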
29 models have been evaluated on the DROP benchmark, with 0 verified results and 28 self-reported results.
DROP is categorized under math and reasoning. The benchmark evaluates text models.