
FACTS Grounding

A benchmark evaluating language models' ability to generate factually accurate, well-grounded responses from long-form input context. It comprises 1,719 examples with documents of up to 32k tokens, each requiring a detailed response that is fully grounded in the provided document.
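To make the evaluation setup concrete, below is a minimal sketch of how a judge-based grounding evaluation of this kind can be wired up. It is not the official FACTS Grounding harness: the prompt wording and the call_model / call_judges helpers are assumptions for illustration. The actual benchmark uses its own prompts and an ensemble of LLM judges, as described in the paper linked in the FAQ below.

def build_task_prompt(document: str, user_request: str) -> str:
    """Ask the model under test to answer strictly from the document."""
    return (
        "Answer the request using ONLY the document below. "
        "Do not rely on outside knowledge.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Request: {user_request}"
    )

def build_judge_prompt(document: str, response: str) -> str:
    """Ask a judge model whether every claim is supported by the document."""
    return (
        "You are grading factual grounding. Reply ACCURATE if every claim "
        "in the response is supported by the document, otherwise INACCURATE.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"<response>\n{response}\n</response>"
    )

def grounding_score(examples, call_model, call_judges) -> float:
    """Fraction of responses that all judges mark as fully grounded.

    `examples` is an iterable of (document, request) pairs;
    `call_model` and `call_judges` are caller-supplied API wrappers
    (hypothetical here), the latter returning one verdict per judge.
    """
    grounded = 0
    for doc, request in examples:
        response = call_model(build_task_prompt(doc, request))
        verdicts = call_judges(build_judge_prompt(doc, response))
        if all(v == "ACCURATE" for v in verdicts):
            grounded += 1
    return grounded / len(examples)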

Progress Over Time

[Interactive timeline showing model performance evolution on FACTS Grounding, with the state-of-the-art frontier marked and open vs. proprietary models distinguished.]

FACTS Grounding Leaderboard

13 models. Model names, scores, and license data did not survive extraction for most rows; blank cells mark missing data. Ranks 4 and 4 reflect a tie (standard competition ranking, hence the jump to rank 6).

Rank  Model                                   Params  Context  Cost (input / output, per 1M tokens)  License
1     Gemini 2.5 Pro Preview 06-05 (Google)           1.0M     $1.25 / $10.00
2                                                     1.0M     $0.30 / $2.50
3                                                     1.0M     $0.10 / $0.40
4                                                     1.0M     $0.10 / $0.40
4                                                     1.0M     $0.07 / $0.30
6                                             12B     131K     $0.05 / $0.10
7                                             27B     131K     $0.10 / $0.20
8
9                                             4B      131K     $0.02 / $0.04
10                                                    1.0M     $0.50 / $3.00
11    (Zhipu AI)
12                                                    1.0M     $0.25 / $1.50
13                                            1B
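The Cost column lists per-million-token prices for input and output. As a rough sketch of what a full benchmark run costs, the estimate below uses the rank-1 pricing from the table; the per-example token counts (~20k input, documents run up to 32k, and ~500 output) are illustrative assumptions, not measured figures.

def run_cost(n_examples: int, in_tokens: int, out_tokens: int,
             in_price: float, out_price: float) -> float:
    """Estimated USD cost; prices are per 1M tokens, as in the table above."""
    per_example = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return n_examples * per_example

# Rank-1 pricing ($1.25 in / $10.00 out per 1M tokens) over all
# 1,719 examples with the assumed token counts above.
print(f"${run_cost(1719, 20_000, 500, 1.25, 10.00):,.2f}")  # prints $51.57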

FAQ

Common questions about FACTS Grounding

What is FACTS Grounding?
FACTS Grounding is a benchmark that evaluates language models' ability to generate factually accurate, well-grounded responses from long-form input context. It comprises 1,719 examples with documents of up to 32k tokens, each requiring a detailed response that is fully grounded in the provided document.

Where can I learn more about the benchmark?
The FACTS Grounding paper is available at https://arxiv.org/abs/2501.03200. It details the benchmark's methodology, dataset creation, and evaluation criteria.

Which model leads the leaderboard?
The FACTS Grounding leaderboard ranks 13 AI models by their performance on this benchmark. Gemini 2.5 Pro Preview 06-05 by Google currently leads with a score of 0.878; the average score across all models is 0.702.

What is the highest FACTS Grounding score?
The highest FACTS Grounding score is 0.878, achieved by Gemini 2.5 Pro Preview 06-05 from Google.

How many models have been evaluated?
13 models have been evaluated on the FACTS Grounding benchmark, with 0 verified results and 13 self-reported results.

How is FACTS Grounding categorized?
FACTS Grounding is categorized under factuality, grounding, and reasoning. The benchmark evaluates text models.