FActScore
What is FActScore?
FActScore is a fine-grained evaluation metric for factual precision in long-form text generation. It breaks generated text into atomic facts and computes the percentage of those facts supported by a reliable knowledge source. An automated variant estimates the score using retrieval and language models.
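Written as a formula, the definition above can be sketched as follows. The notation is assumed here for illustration: $\mathcal{A}_y$ is the set of atomic facts extracted from a generation $y$, $\mathcal{C}$ is the knowledge source, $\mathcal{M}_x$ is the model's response to prompt $x$, and $\mathcal{X}$ is the set of prompts. Conditioning on the model responding is one way to handle abstentions and should be read as an assumption of this sketch.

```latex
f(y) = \frac{1}{|\mathcal{A}_y|} \sum_{a \in \mathcal{A}_y}
       \mathbb{1}\left[\, a \text{ is supported by } \mathcal{C} \,\right],
\qquad
\mathrm{FActScore}(\mathcal{M}) =
       \mathbb{E}_{x \in \mathcal{X}}\left[\, f(\mathcal{M}_x) \mid \mathcal{M}_x \text{ responds} \,\right]
```

For example, a generation with 10 atomic facts, 7 of which are supported, has $f(y) = 7/10 = 0.7$.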
LLM Stats lists FActScore as a text benchmark under reasoning tasks and tracks 2 models on it, scored on a 0–1 scale. The current average is 0.5, with the leader at 0.970.
Current leaders
Of the 2 evaluated AI models, Grok-4.1 from xAI currently leads the FActScore leaderboard with a score of 0.970.
Source paper
- Title: FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
- Authors: Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, and 5 others
- Published: arXiv 2305.14251
Abstract
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
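The scoring step described in the abstract (split a generation into atomic facts, check each against a knowledge source, report the supported fraction) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the `factscore` package's API: the function names are hypothetical, sentence splitting stands in for the LM-based fact decomposition, and exact string lookup stands in for retrieval plus an LM judge.

```python
from typing import Callable, List, Optional


def split_into_atomic_facts(generation: str) -> List[str]:
    """Hypothetical stand-in: treat each sentence as one atomic fact."""
    return [s.strip() for s in generation.split(".") if s.strip()]


def fact_precision(
    generation: str,
    is_supported: Callable[[str], bool],
    splitter: Callable[[str], List[str]] = split_into_atomic_facts,
) -> Optional[float]:
    """Fraction of atomic facts judged supported; None if there are no facts."""
    facts = splitter(generation)
    if not facts:
        return None  # e.g. the model abstained or produced no checkable claims
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


def mean_fact_precision(
    generations: List[str], is_supported: Callable[[str], bool]
) -> Optional[float]:
    """Average the per-generation scores, skipping generations with no facts."""
    scores = [fact_precision(g, is_supported) for g in generations]
    kept = [s for s in scores if s is not None]
    return sum(kept) / len(kept) if kept else None


# Toy knowledge source (assumption): a set of statements treated as true.
knowledge = {
    "Marie Curie was born in Warsaw",
    "Marie Curie won two Nobel Prizes",
}
text = (
    "Marie Curie was born in Warsaw. "
    "Marie Curie won two Nobel Prizes. "
    "Marie Curie was born in 1900."
)
score = fact_precision(text, lambda fact: fact in knowledge)
print(score)  # two of the three facts are in the knowledge set
```

Passing `is_supported` and `splitter` as callables keeps the aggregation separate from the two hard parts, so a real decomposer or retrieval-backed judge could be substituted without changing the scoring logic.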