TruthfulQA

Progress Over Time

[Interactive timeline of model performance on TruthfulQA; series: state-of-the-art frontier, open models, proprietary models]

TruthfulQA Leaderboard

18 models

Rank  Organization               Size
1     -                          1.0T
2     -                          60B
3     -                          8B
4     Microsoft                  4B
5     -                          4B
6     Nous Research              70B
7     -                          70B
8     Alibaba Cloud / Qwen Team  15B
9     -                          398B
10    -                          7B
11    Alibaba Cloud / Qwen Team  33B
12    -                          104B
13    Alibaba Cloud / Qwen Team  72B
14    Alibaba Cloud / Qwen Team  32B
15    -                          52B
16    -                          8B
17    Alibaba Cloud / Qwen Team  7B
18    -                          12B

About this benchmark

What is TruthfulQA?

TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.

TruthfulQA is a text benchmark evaluating models on reasoning, legal, finance, general, and healthcare tasks. LLM Stats tracks 18 models on this benchmark, scored on a 0–1 scale. The current average is 0.605, with the leader at 0.880.

Current leaders

MAI-Thinking-1 from Microsoft currently leads the TruthfulQA leaderboard with a score of 0.880 across 18 evaluated AI models.

1. MAI-Thinking-1 (Microsoft): 88.0%
2. Phi-3.5-MoE-instruct (Microsoft): 77.5%
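
The text quotes scores on a 0–1 scale (0.880) while the ranked rows show percentages (88.0%). A minimal Python sketch of that conversion and ordering follows. The `Entry` class and the helper names are hypothetical, not part of any LLM Stats API, and only the two rows listed here are used as data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    organization: str
    score: float  # 0-1 scale, as quoted in the text


def rank_entries(entries):
    """Order entries from highest to lowest score."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


def format_row(rank, entry):
    """Render a row with the score shown as a percentage to one decimal."""
    return f"{rank}. {entry.model} ({entry.organization}): {entry.score * 100:.1f}%"


# The two scores published on this page, deliberately given out of order.
entries = [
    Entry("Phi-3.5-MoE-instruct", "Microsoft", 0.775),
    Entry("MAI-Thinking-1", "Microsoft", 0.880),
]

rows = [format_row(i, e) for i, e in enumerate(rank_entries(entries), start=1)]
for row in rows:
    print(row)
```

Sorting by the raw 0–1 score and formatting only at display time keeps the stored value unrounded.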

Source paper

Title: TruthfulQA: Measuring How Models Mimic Human Falsehoods
Authors: Stephanie Lin, Jacob Hilton, Owain Evans
Published: September 2021 (arXiv:2109.07958)
Abstract

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.
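
To make the abstract's percentages concrete, the sketch below converts a truthfulness rate into an approximate question count on the 817-question set. It assumes the rate is simply the fraction of questions answered truthfully. The reported percentages are rounded, so the counts are estimates, and the function name `truthful_count` is hypothetical.

```python
TOTAL_QUESTIONS = 817  # benchmark size stated in the abstract


def truthful_count(rate: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of questions answered truthfully at a given rate.

    Assumes `rate` is the fraction of questions judged truthful (0-1).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return round(rate * total)


best_model = truthful_count(0.58)  # best model in the paper: about 474 questions
human = truthful_count(0.94)       # human baseline: about 768 questions
gap = human - best_model           # about 294 questions

print(best_model, human, gap)
```

On these figures, the best model in the paper trailed the human baseline by roughly 294 of the 817 questions.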

FAQ

Common questions about the TruthfulQA benchmark and leaderboard.

What is the TruthfulQA leaderboard?

The TruthfulQA leaderboard ranks 18 AI models based on their performance on this benchmark. Currently, MAI-Thinking-1 by Microsoft leads with a score of 0.880. The average score across all models is 0.605.
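
The stated average and the two published top scores are enough to estimate the mean of the remaining 16 models. The sketch below does that arithmetic in Python. It assumes the 0.605 average is an unweighted mean over all 18 models and treats it as exact, although it is probably rounded. The function name `mean_of_rest` is hypothetical.

```python
def mean_of_rest(overall_mean: float, n_total: int, known_scores: list) -> float:
    """Mean of the entries not listed in `known_scores`.

    Assumes `overall_mean` is an unweighted mean over `n_total` entries.
    """
    n_rest = n_total - len(known_scores)
    if n_rest <= 0:
        raise ValueError("known_scores must leave at least one entry")
    return (overall_mean * n_total - sum(known_scores)) / n_rest


# 0.605 average over 18 models; top two scores are 0.880 and 0.775.
rest = mean_of_rest(0.605, 18, [0.880, 0.775])
print(f"{rest:.3f}")  # 0.577
```

Under these assumptions, the 16 models below the top two average about 0.577, slightly under the overall mean.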

What is the highest TruthfulQA score?

The highest TruthfulQA score is 0.880, achieved by MAI-Thinking-1 from Microsoft.

How many models are evaluated on TruthfulQA?

18 models have been evaluated on the TruthfulQA benchmark, with 0 verified results and 18 self-reported results.

Where can I find the TruthfulQA paper?

The TruthfulQA paper is available at https://arxiv.org/abs/2109.07958. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does TruthfulQA cover?

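LLM Stats categorizes TruthfulQA under reasoning, legal, finance, general, and healthcare. The benchmark's own 817 questions span 38 categories, including health, law, finance and politics. The benchmark evaluates text models.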

What is the best open-source model on TruthfulQA?

Phi-3.5-MoE-instruct by Microsoft is the top-ranked open-source model on TruthfulQA, with a score of 0.775 (rank #2).

How recent are the TruthfulQA leaderboard results?

The TruthfulQA leaderboard was last updated in June 2026 and currently includes 18 evaluated models.