FActScore

About this benchmark

What is FActScore?

FActScore is a fine-grained evaluation metric for factual precision in long-form text generation. It breaks generated text into atomic facts and computes the percentage of those facts supported by a reliable knowledge source. An automated variant estimates the score using retrieval and a language model.
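As a minimal sketch of this definition, assume each atomic fact of a generation has already been extracted and labeled as supported or unsupported. The helper name `factscore` and the example labels are illustrative only; they are not the API of the `factscore` package.

```python
from typing import Sequence


def factscore(supported: Sequence[bool]) -> float:
    """Fraction of atomic facts judged supported by the knowledge source.

    `supported` holds one boolean per atomic fact. Illustrative helper,
    not the API of the `factscore` package.
    """
    if not supported:
        raise ValueError("need at least one atomic fact")
    return sum(supported) / len(supported)


# Hypothetical labels for a five-fact biography: three facts are supported.
labels = [True, True, False, True, False]
print(factscore(labels))  # 0.6
```

A binary judgment would mark this generation as simply wrong; the fine-grained score records that three of its five facts are supported.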

LLM Stats lists FActScore as a text benchmark in its reasoning category and tracks 2 models on it, scored on a 0–1 scale. The current average is 0.490, with the leader at 0.970.


Current leaders

Grok-4.1 from xAI currently leads the FActScore leaderboard with a score of 0.970 across 2 evaluated AI models.

1. Grok-4.1 (xAI): 97.0%
2. GPT-5 (OpenAI): 1.0%

Source paper

Title
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Authors
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, and 5 others
Published
Abstract

Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
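The automated estimator described in the abstract can be pictured as a pipeline: split a generation into atomic facts, retrieve evidence for each fact, judge support, then aggregate. The sketch below is an assumption-laden illustration of that shape, not the paper's implementation or the `factscore` package API. All three callables, the toy knowledge list, and the exact-match judge are hypothetical stand-ins.

```python
from typing import Callable, List


def estimate_factscore(
    generation: str,
    split_into_facts: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Estimate the score of one generation with pluggable components.

    The three callables are hypothetical: a fact splitter, a passage
    retriever over the knowledge source, and an LM-based judge.
    """
    facts = split_into_facts(generation)
    if not facts:
        raise ValueError("no atomic facts extracted")
    verdicts = [is_supported(fact, retrieve(fact)) for fact in facts]
    return sum(verdicts) / len(verdicts)


# Toy stand-ins so the sketch runs end to end.
KNOWLEDGE = ["Ada Lovelace was born in 1815.", "Ada Lovelace was a mathematician."]


def toy_split(text: str) -> List[str]:
    # One "atomic fact" per sentence; real decomposition is finer-grained.
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def toy_retrieve(fact: str) -> List[str]:
    # A real retriever would rank passages for the given fact.
    return KNOWLEDGE


def toy_judge(fact: str, passages: List[str]) -> bool:
    # Exact match stands in for a language-model judgment.
    return fact in passages


text = "Ada Lovelace was born in 1815. Ada Lovelace was a painter."
print(estimate_factscore(text, toy_split, toy_retrieve, toy_judge))  # 0.5
```

Keeping the splitter, retriever, and judge as separate arguments makes it easy to swap in stronger components without touching the aggregation step.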

FAQ

Common questions about the FActScore benchmark and leaderboard.

What is the FActScore benchmark?

FActScore is an evaluation metric for factual precision in long-form text generation: it breaks a generation into atomic facts and reports the percentage supported by a reliable knowledge source. An automated version estimates this percentage using retrieval and a language model.

What is the FActScore leaderboard?

The FActScore leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Grok-4.1 by xAI leads with a score of 0.970. The average score across all models is 0.490.
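The reported average follows directly from the two listed scores. A quick check, with the percentages from the leaderboard converted to the 0–1 scale (the variable names are illustrative):

```python
# Scores as listed on the leaderboard, converted from percent to the 0-1 scale.
scores = {"Grok-4.1": 0.970, "GPT-5": 0.010}

average = sum(scores.values()) / len(scores)
leader = max(scores, key=scores.get)

print(f"{average:.3f}")  # 0.490
print(leader)  # Grok-4.1
```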

What is the highest FActScore score?

The highest FActScore score is 0.970, achieved by Grok-4.1 from xAI.

How many models are evaluated on FActScore?

Two models have been evaluated on the FActScore benchmark. Both results are self-reported; neither has been verified.

Where can I find the FActScore paper?

The FActScore paper is available at https://arxiv.org/abs/2305.14251. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does FActScore cover?

FActScore is categorized under reasoning. The benchmark evaluates text models.

How recent are the FActScore leaderboard results?

The FActScore leaderboard was last updated in July 2026 and currently includes 2 evaluated models.