Bird-SQL (dev)
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQLs) is a comprehensive text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 databases (33.4 GB total) spanning 37+ professional domains. It evaluates large language models' ability to convert natural language to executable SQL queries in real-world scenarios with complex database schemas and dirty data.
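Text-to-SQL benchmarks like BIRD are typically scored by execution accuracy (EX): a predicted query counts as correct when running it returns the same result set as the gold query. The sketch below illustrates that idea on a toy in-memory SQLite database; the table, queries, and `execution_match` helper are illustrative stand-ins, not BIRD's official evaluation harness (which also reports additional metrics such as efficiency).

```python
import sqlite3

def execution_match(conn, pred_sql, gold_sql):
    """Return True if both queries yield the same rows (order-insensitive)."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # unexecutable predictions score 0
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as sorted multisets of rows; repr() sidesteps mixed-type sorting.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Toy database standing in for one of BIRD's 95 databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("a", "eng", 10.0), ("b", "eng", 20.0), ("c", "hr", 5.0)])

gold = "SELECT dept, AVG(salary) FROM emp GROUP BY dept"
pred = "SELECT dept, AVG(salary) FROM emp GROUP BY dept ORDER BY dept"
print(execution_match(conn, pred, gold))  # → True: same rows, order ignored
```

Note that order-insensitive comparison deliberately accepts semantically equivalent queries that differ only in row ordering, which is why the `ORDER BY` variant above still matches.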
Progress Over Time
[Interactive timeline of model performance on Bird-SQL (dev), distinguishing open and proprietary models and highlighting the state-of-the-art frontier.]
Bird-SQL (dev) Leaderboard
7 models
| Rank | Model | Developer | Params | Context | Cost (input / output) | License | Score |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 2.0 Flash-Lite | Google | — | 1.0M | $0.07 / $0.30 | — | 0.574 |
| 2 | — | Google | — | 1.0M | $0.10 / $0.40 | — | — |
| 3 | — | Google | 27B | 131K | $0.10 / $0.20 | — | — |
| 4 | — | Google | 12B | 131K | $0.05 / $0.10 | — | — |
| 5 | — | — | 120B | 262K | $0.10 / $0.50 | — | — |
| 6 | — | Google | 4B | 131K | $0.02 / $0.04 | — | — |
| 7 | — | Google | 1B | — | — | — | — |
(Model names and most scores were not preserved in this export; the rank-1 entry is identified from the FAQ below.)
FAQ
Common questions about Bird-SQL (dev)
The BIRD paper is available at https://arxiv.org/abs/2305.03111. It details the benchmark's methodology, dataset construction, and evaluation criteria.
The Bird-SQL (dev) leaderboard ranks 7 AI models based on their performance on this benchmark. Currently, Gemini 2.0 Flash-Lite by Google leads with a score of 0.574. The average score across all models is 0.430.
The highest Bird-SQL (dev) score is 0.574, achieved by Gemini 2.0 Flash-Lite from Google.
7 models have been evaluated on the Bird-SQL (dev) benchmark; all 7 results are self-reported, and none have been independently verified.
Bird-SQL (dev) is categorized under reasoning. The benchmark evaluates text models.