
Bird-SQL (dev)

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQLs) is a comprehensive text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 databases (33.4 GB total) spanning 37+ professional domains. It evaluates large language models' ability to convert natural language to executable SQL queries in real-world scenarios with complex database schemas and dirty data.
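To make the task concrete, here is a minimal text-to-SQL round trip against a toy SQLite database. The schema and data are hypothetical, invented for illustration; they are not drawn from the BIRD dataset.

```python
import sqlite3

# Hypothetical toy schema (not from BIRD) standing in for a real-world database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, county TEXT, enrollment INTEGER);
INSERT INTO schools VALUES
  (1, 'Lincoln High', 'Alameda', 1200),
  (2, 'Bayview Academy', 'Alameda', 450),
  (3, 'Summit Prep', 'Contra Costa', 800);
""")

# Natural-language question: "Which Alameda county school has the highest enrollment?"
# A text-to-SQL model must produce an executable query such as:
sql = """
SELECT name FROM schools
WHERE county = 'Alameda'
ORDER BY enrollment DESC
LIMIT 1;
"""
print(conn.execute(sql).fetchone()[0])  # -> Lincoln High
```

BIRD's databases are far larger and noisier than this sketch, which is exactly what makes generating a correct, executable query hard.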

Paper: https://arxiv.org/abs/2305.03111

Progress Over Time

[Interactive timeline showing model performance evolution on Bird-SQL (dev), with filters for the state-of-the-art frontier and open vs. proprietary models]

Bird-SQL (dev) Leaderboard

7 models
| # | Model | Params | Context | Cost (in / out) | License |
|---|-------|--------|---------|-----------------|---------|
| 1 | Gemini 2.0 Flash-Lite (Google) | – | 1.0M | $0.07 / $0.30 | – |
| 2 | – | – | 1.0M | $0.10 / $0.40 | – |
| 3 | – | 27B | 131K | $0.10 / $0.20 | – |
| 4 | – | 12B | 131K | $0.05 / $0.10 | – |
| 5 | – | 120B | 262K | $0.10 / $0.50 | – |
| 6 | – | 4B | 131K | $0.02 / $0.04 | – |
| 7 | – | 1B | – | – | – |

FAQ

Common questions about Bird-SQL (dev)

What is Bird-SQL (dev)?
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQLs) is a comprehensive text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 databases (33.4 GB total) spanning 37+ professional domains. It evaluates large language models' ability to convert natural language to executable SQL queries in real-world scenarios with complex database schemas and dirty data.

Where can I find the Bird-SQL (dev) paper?
The Bird-SQL (dev) paper is available at https://arxiv.org/abs/2305.03111. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Bird-SQL (dev) leaderboard?
The Bird-SQL (dev) leaderboard ranks 7 AI models based on their performance on this benchmark. Currently, Gemini 2.0 Flash-Lite by Google leads with a score of 0.574. The average score across all models is 0.430.

What is the highest Bird-SQL (dev) score?
The highest Bird-SQL (dev) score is 0.574, achieved by Gemini 2.0 Flash-Lite from Google.

How many models have been evaluated on Bird-SQL (dev)?
7 models have been evaluated on the Bird-SQL (dev) benchmark, with 0 verified results and 7 self-reported results.

What category does Bird-SQL (dev) belong to?
Bird-SQL (dev) is categorized under reasoning. The benchmark evaluates text models.
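BIRD scores models primarily by execution accuracy: a prediction counts as correct when running it returns the same result set as the gold SQL. A simplified sketch of that idea (the function name, schema, and order-insensitive comparison are illustrative assumptions, not BIRD's official harness):

```python
import sqlite3

def execution_match(db, predicted_sql, gold_sql):
    """Order-insensitive comparison of predicted vs. gold result sets --
    a simplified sketch of the execution-accuracy idea, not BIRD's harness."""
    try:
        pred = db.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # unexecutable predictions score zero
    gold = db.execute(gold_sql).fetchall()
    return sorted(map(repr, pred)) == sorted(map(repr, gold))

# Toy database for demonstration (hypothetical, not from BIRD).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (2), (3);
""")

# Syntactically different queries with identical results still match.
print(execution_match(db, "SELECT x FROM t WHERE x > 1",
                          "SELECT x FROM t WHERE x >= 2"))  # -> True
```

Note that result-set comparison deliberately ignores how the query was written, so the many valid SQL formulations of one question all score as correct.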