DROP
DROP Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | DeepSeek | 671B | — | — | |
| 2 | Anthropic | — | — | — | |
| 2 | Anthropic | — | — | — | |
| 4 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 | |
| 5 | OpenAI | — | 128K | $10.00 / $30.00 | |
| 6 | Amazon | — | — | — | |
| 7 | | 405B | — | — | |
| 8 | OpenAI | — | 128K | $2.50 / $10.00 | |
| 9 | Anthropic | — | — | — | |
| 9 | Anthropic | — | — | — | |
| 11 | OpenAI | — | — | — | |
| 12 | Amazon | — | — | — | |
| 13 | OpenAI | — | — | — | |
| 14 | | 70B | — | — | |
| 15 | Amazon | — | — | — | |
| 16 | Meituan | 560B | — | — | |
| 17 | Anthropic | — | — | — | |
| 18 | Anthropic | — | — | — | |
| 19 | Microsoft | 15B | — | — | |
| 20 | Google | — | — | — | |
| 21 | OpenAI | — | 16K | $0.50 / $1.50 | |
| 22 | Google | 8B | — | — | |
| 22 | | 2B | — | — | |
| 24 | | 8B | — | — | |
| 25 | | 8B | — | — | |
| 26 | | 2B | — | — | |
| 26 | Google | 8B | — | — | |
| 28 | | 7B | — | — | |
| 29 | | 8B | — | — | |
| 30 | Baidu | 21B | — | — | |
What is DROP?
DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowdsourced, adversarially-created questions that require resolving references and performing discrete operations like addition, counting, or sorting, demanding comprehensive paragraph understanding beyond paraphrase-and-entity-typing shortcuts.
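To make the three discrete operations concrete, here is a minimal Python sketch. The passage and questions are hypothetical illustrations written for this example, not items from the DROP dataset, and the regular expression stands in for the reference-resolution step that a real system would have to learn.

```python
import re

# Hypothetical passage, written for illustration (not from the DROP dataset).
passage = (
    "The Bears scored a 22-yard field goal in the first quarter, "
    "a 45-yard field goal in the second quarter, "
    "and a 31-yard field goal in the fourth quarter."
)

# Resolve the references: collect every field-goal distance in the passage.
yards = [int(n) for n in re.findall(r"(\d+)-yard field goal", passage)]

# Counting: "How many field goals were kicked?"
count_answer = len(yards)

# Addition: "How many total yards of field goals were kicked?"
sum_answer = sum(yards)

# Sorting: "How long was the longest field goal?"
longest_answer = sorted(yards)[-1]

print(count_answer, sum_answer, longest_answer)
```

None of the three answers appears as a contiguous span that could be copied from the passage by matching entity types alone, which is the kind of shortcut the benchmark is designed to defeat.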
DROP is a text benchmark evaluating models on math and reasoning tasks. LLM Stats tracks 30 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leading score is 0.916.
Compare leaders on the best AI for math and best AI for reasoning leaderboards.
Current leaders
Among the 30 evaluated AI models, DeepSeek-V3 from DeepSeek currently leads the DROP leaderboard with a score of 0.916.
Source paper
- Title: DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
- Authors: Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, and 2 others
- Published: arXiv:1903.00161
Abstract
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
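The F1 figures in the abstract come from the paper's generalized accuracy metric, whose details are not given on this page. As a rough illustration of how an answer-overlap F1 works, the sketch below computes a simplified bag-of-words F1 between a predicted and a gold answer. The function name `token_f1` and the whitespace tokenization are assumptions made for this example; it is not the official DROP evaluator and omits any special handling of numbers or multi-span answers.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Simplified bag-of-words F1 between two answer strings (illustrative only)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()

    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the Chicago Bears", "Chicago Bears"))
```

In this example the prediction contains both gold tokens plus one extra word, so precision is 2/3, recall is 1, and the F1 comes out at about 0.8. Scores on the leaderboard's 0–1 scale correspond to percentages like those in the abstract divided by 100.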