CRAG

About this benchmark

What is CRAG?

CRAG (Comprehensive RAG Benchmark) is a factual question answering benchmark consisting of 4,409 question-answer pairs across 5 domains (finance, sports, music, movie, open domain) and 8 question categories. It includes mock APIs that simulate web and Knowledge Graph search. The benchmark is designed to represent the diverse and dynamic nature of real-world QA tasks, covering facts whose temporal dynamism ranges from years to seconds. It evaluates retrieval-augmented generation systems for trustworthy question answering.

CRAG is a text benchmark evaluating models on reasoning, search, finance, and economics tasks. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is 0.457, with the leader at 0.503.

Current leaders

Nova Pro from Amazon currently leads the CRAG leaderboard with a score of 0.503, the highest among the 3 evaluated AI models.

1. Nova Pro (Amazon): 50.3%
2. Nova Lite (Amazon): 43.8%
3. Nova Micro (Amazon): 43.1%
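
The summary figures quoted on this page (leader and average on the 0–1 scale) follow directly from the three scores listed above. The short Python sketch below recomputes them; the dictionary name `scores` and the variable names are illustrative assumptions, not part of any LLM Stats or CRAG tooling.

```python
from statistics import mean

# Scores as listed on this page, converted from percentages to the 0-1 scale.
# The dictionary itself is a hypothetical container used only for illustration.
scores = {
    "Nova Pro": 0.503,
    "Nova Lite": 0.438,
    "Nova Micro": 0.431,
}

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# The average is the arithmetic mean of all scores, rounded to three decimals.
average = round(mean(scores.values()), 3)

# Ranking from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)

print(f"Leader: {leader} ({scores[leader]:.3f})")
print(f"Average: {average:.3f}")
print("Ranking:", ", ".join(ranking))
```

With only three models, all from the same vendor, the average says little about how retrieval-augmented systems perform on CRAG in general.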

Source paper

Title: CRAG -- Comprehensive RAG Benchmark
Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, and 23 others
Published: June 2024 (arXiv:2406.04744)
Abstract

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation of this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions only answer 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge and attracted thousands of participants and submissions. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions. CRAG is available at https://github.com/facebookresearch/CRAG/.

FAQ

Common questions about the CRAG benchmark and leaderboard.

What is the CRAG leaderboard?

The CRAG leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Nova Pro by Amazon leads with a score of 0.503. The average score across all models is 0.457.

What is the highest CRAG score?

The highest CRAG score is 0.503, achieved by Nova Pro from Amazon.

How many models are evaluated on CRAG?

Three models have been evaluated on the CRAG benchmark. All three results are self-reported; none has been independently verified.

Where can I find the CRAG paper?

The CRAG paper is available at https://arxiv.org/abs/2406.04744. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does CRAG cover?

CRAG is categorized under reasoning, search, finance, and economics. The benchmark evaluates text models.

How recent are the CRAG leaderboard results?

The CRAG leaderboard was last updated in July 2026 and currently includes 3 evaluated models.