CRAG
CRAG (Comprehensive RAG Benchmark) is a factual question-answering benchmark of 4,409 question-answer pairs spanning five domains (finance, sports, music, movies, and open domain) and eight question categories. It includes mock APIs that simulate web search and knowledge-graph search, and it is designed to capture the diverse, dynamic nature of real-world QA, with facts whose temporal dynamism ranges from years down to seconds. The benchmark evaluates retrieval-augmented generation (RAG) systems for trustworthy question answering.
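The retrieval loop CRAG evaluates can be sketched as follows. This is a minimal, illustrative sketch only: the function names `web_search` and `kg_lookup` and their tiny in-memory backends are hypothetical stand-ins for CRAG's actual mock web-search and knowledge-graph APIs, and the hard-coded routing takes the place of an LLM's planning.

```python
# Hypothetical sketch of the RAG loop CRAG evaluates. The API names and
# backends below are illustrative assumptions, not CRAG's real interface.

def web_search(query):
    # Stand-in for CRAG's mock web-search API: query -> ranked snippets.
    corpus = {
        "oscar best picture 2020": ["Parasite won Best Picture at the 2020 Oscars."],
    }
    return corpus.get(query.lower(), [])

def kg_lookup(entity, relation):
    # Stand-in for CRAG's mock knowledge-graph API: (entity, relation) -> values.
    kg = {("Parasite", "director"): ["Bong Joon-ho"]}
    return kg.get((entity, relation), [])

def answer(question):
    # A real system would have an LLM plan these calls; routing is hard-coded here.
    snippets = web_search("oscar best picture 2020")
    if not snippets:
        return "I don't know"       # abstaining beats hallucinating under CRAG scoring
    film = snippets[0].split()[0]   # naive entity extraction, enough for the sketch
    directors = kg_lookup(film, "director")
    return f"{film}, directed by {directors[0]}" if directors else film

print(answer("Which film won Best Picture at the 2020 Oscars, and who directed it?"))
```

Note the abstention branch: because trustworthy-QA benchmarks like CRAG penalize confident wrong answers, a system that says "I don't know" when retrieval comes back empty typically scores better than one that guesses.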
Progress Over Time
[Interactive timeline: model performance evolution on CRAG, showing the state-of-the-art frontier and distinguishing open from proprietary models.]
CRAG Leaderboard
3 models • 0 verified
| Rank | Model | Organization | Score | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Nova Pro | Amazon | 0.503 | 300K | $0.80 / $3.20 | — |
| 2 | — | Amazon | — | 300K | $0.06 / $0.24 | — |
| 3 | — | Amazon | — | 128K | $0.03 / $0.14 | — |
FAQ
Common questions about CRAG
What is CRAG?
CRAG (Comprehensive RAG Benchmark) is a factual question-answering benchmark of 4,409 question-answer pairs spanning five domains (finance, sports, music, movies, and open domain) and eight question categories. It includes mock APIs that simulate web search and knowledge-graph search, and it is designed to capture the diverse, dynamic nature of real-world QA, with facts whose temporal dynamism ranges from years down to seconds. The benchmark evaluates retrieval-augmented generation (RAG) systems for trustworthy question answering.
Where can I read the CRAG paper?
The CRAG paper is available at https://arxiv.org/abs/2406.04744. It details the benchmark's methodology, dataset construction, and evaluation criteria.
Which model leads the CRAG leaderboard?
The CRAG leaderboard ranks 3 AI models by their performance on this benchmark. Nova Pro by Amazon currently leads with a score of 0.503; the average score across all models is 0.457.
What is the highest CRAG score?
The highest CRAG score is 0.503, achieved by Nova Pro from Amazon.
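For context on what a score like 0.503 means: the CRAG paper grades each answer as correct, missing (e.g. "I don't know"), or hallucinated, and, to my understanding, scores them roughly as +1, 0, and -1 respectively, so the mean rewards abstention over confident errors. A minimal sketch of that scoring scheme, under that assumption:

```python
# Sketch of CRAG-style truthfulness scoring (assumed scheme: +1 for a
# correct answer, 0 for a missing answer, -1 for a hallucinated one).

def truthfulness_score(grades):
    """grades: list of 'correct' | 'missing' | 'hallucinated' labels."""
    points = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}
    return sum(points[g] for g in grades) / len(grades)

# Two hypothetical systems over the same 10 questions:
honest  = ["correct"] * 6 + ["missing"] * 4        # abstains when unsure
bluffer = ["correct"] * 6 + ["hallucinated"] * 4   # always answers

print(truthfulness_score(honest))   # 0.6
print(truthfulness_score(bluffer))  # 0.2
```

Under this kind of metric, two systems with identical factual recall can score very differently depending on how they handle questions they cannot answer.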
How many models have been evaluated on CRAG?
3 models have been evaluated on the CRAG benchmark: 0 verified results and 3 self-reported results.
What categories does CRAG belong to?
CRAG is categorized under economics, finance, reasoning, and search, and it evaluates text models.