TruthfulQA
TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.
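To explore the questions themselves, the short sketch below loads the dataset with the Hugging Face `datasets` library. It assumes the benchmark is published on the Hub under the `truthful_qa` identifier with a `generation` configuration, which is the layout of the official release; the field names follow that schema.

```python
# Minimal sketch: browse TruthfulQA via the Hugging Face Hub.
# Assumes the dataset id "truthful_qa" and its "generation" config;
# field names below follow that release's schema.
from datasets import load_dataset

data = load_dataset("truthful_qa", "generation")["validation"]
print(f"{len(data)} questions, {len(set(data['category']))} categories")

example = data[0]
print("Category:", example["category"])
print("Question:", example["question"])
print("Best answer:", example["best_answer"])
print("Incorrect answers:", example["incorrect_answers"])
```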
Progress Over Time
[Interactive timeline: model performance over time on TruthfulQA, plotting the state-of-the-art frontier and distinguishing open from proprietary models.]
TruthfulQA Leaderboard
17 models • 0 verified
| # | Model | Organization | Params | Context | Cost (in / out) | License |
|---|---|---|---|---|---|---|
| 1 | Phi-3.5-MoE-instruct | Microsoft | 60B | — | — | — |
| 2 | — | — | 8B | 128K | $0.50 / $0.50 | — |
| 3 | — | Microsoft | 4B | — | — | — |
| 4 | — | Microsoft | 4B | 128K | $0.10 / $0.10 | — |
| 5 | — | Nous Research | 70B | — | — | — |
| 6 | — | — | 70B | — | — | — |
| 7 | — | Alibaba Cloud / Qwen Team | 15B | — | — | — |
| 8 | — | AI21 Labs | 398B | 256K | $2.00 / $8.00 | — |
| 9 | — | — | 7B | — | — | — |
| 10 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 11 | — | Cohere | 104B | 128K | $0.25 / $1.00 | — |
| 12 | — | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 13 | — | Alibaba Cloud / Qwen Team | 32B | 128K | $0.09 / $0.09 | — |
| 14 | — | AI21 Labs | 52B | 256K | $0.20 / $0.40 | — |
| 15 | — | — | 8B | — | — | — |
| 16 | — | Alibaba Cloud / Qwen Team | 7B | — | — | — |
| 17 | — | Mistral AI | 12B | 128K | $0.15 / $0.15 | — |
FAQ
Common questions about TruthfulQA
What is TruthfulQA?
TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance, and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.
Where can I find the TruthfulQA paper?
The TruthfulQA paper is available at https://arxiv.org/abs/2109.07958. It details the benchmark's methodology, dataset construction, and evaluation criteria.
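For a sense of what those evaluation criteria look like in practice, here is a rough sketch of the paper's MC2 multiple-choice metric as we read it: a question's score is the normalized probability mass the model assigns to the set of true reference answers. The `mc2_score` helper and its inputs are illustrative; obtaining per-answer log-probabilities is model-specific and not shown here.

```python
import math

def mc2_score(true_log_probs, false_log_probs):
    """MC2 for one question: normalized probability mass on true answers.

    true_log_probs / false_log_probs are the total log-probabilities a
    model assigns to each true / false reference answer. How those are
    computed depends on the model; this helper only normalizes them.
    """
    true_p = [math.exp(lp) for lp in true_log_probs]
    false_p = [math.exp(lp) for lp in false_log_probs]
    return sum(true_p) / (sum(true_p) + sum(false_p))

# Toy usage with made-up log-probabilities:
print(mc2_score(true_log_probs=[-2.0, -3.1],
                false_log_probs=[-4.5, -5.0, -6.2]))  # ~0.90
```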
How are models ranked on the TruthfulQA leaderboard?
The TruthfulQA leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.775. The average score across all models is 0.589.
What is the highest TruthfulQA score?
The highest TruthfulQA score is 0.775, achieved by Phi-3.5-MoE-instruct from Microsoft.
How many models have been evaluated on TruthfulQA?
17 models have been evaluated on the TruthfulQA benchmark, with 0 verified results and 17 self-reported results.
What categories does TruthfulQA fall under?
TruthfulQA is categorized under finance, general, healthcare, legal, and reasoning. The benchmark evaluates text models.