FActScore

FActScore is a fine-grained evaluation metric for factual precision in long-form text generation. It breaks generated text into atomic facts, short statements that each convey a single piece of information, and computes the percentage of those facts supported by a reliable knowledge source, using retrieval and language models to automate the assessment.
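In spirit, the score is the supported fraction of atomic facts: given the set A_y of atomic facts extracted from a generation y and a knowledge source C, FActScore(y) = |{a in A_y : C supports a}| / |A_y|. The Python sketch below illustrates that computation. It is a minimal illustration rather than the authors' implementation; split_into_atomic_facts and is_supported are hypothetical stand-ins for the paper's LLM-based fact decomposition and retrieval-augmented verification steps.

```python
from typing import Callable, List

def factscore(
    generation: str,
    split_into_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of atomic facts in `generation` backed by the knowledge source.

    Both callables are placeholders: in the paper, decomposition is done by an
    LLM, and support is judged with retrieval over Wikipedia plus a language model.
    """
    facts = split_into_atomic_facts(generation)
    if not facts:
        return 0.0  # convention here: no checkable facts yields no credit
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    naive_split = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    toy_check = lambda fact: "France" in fact  # pretend knowledge-source lookup
    score = factscore("Paris is in France. Berlin is in Spain.", naive_split, toy_check)
    print(score)  # 0.5: one of the two atomic facts is supported
```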

Paper: https://arxiv.org/abs/2305.14251

Progress Over Time

[Interactive timeline showing model performance evolution on FActScore, with a state-of-the-art frontier line; open and proprietary models are distinguished.]

FActScore Leaderboard

2 models

Rank  Model         Provider  Score  Context  Cost (input / output)
1     Grok-4.1      xAI       0.970  256K     $3.00 / $15.00
2     (not listed)  OpenAI    n/a    400K     $1.25 / $10.00

FAQ

Common questions about FActScore

What is FActScore?
FActScore is a fine-grained metric for factual precision in long-form text generation: it decomposes generated text into atomic facts and reports the percentage supported by a reliable knowledge source, using retrieval and language models to automate the assessment.

Where can I read the FActScore paper?
The FActScore paper is available at https://arxiv.org/abs/2305.14251. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the FActScore leaderboard?
The leaderboard ranks 2 AI models by their performance on this benchmark. Grok-4.1 by xAI currently leads with a score of 0.970, and the average score across all models is 0.490.

What is the highest FActScore achieved?
The highest score is 0.970, achieved by Grok-4.1 from xAI.

How many models have been evaluated?
2 models have been evaluated on the FActScore benchmark, with 0 verified results and 2 self-reported results.

What does FActScore evaluate?
FActScore is categorized under reasoning, and the benchmark evaluates text models.