
TruthfulQA

TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.
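The paper's multiple-choice variants score a model by the likelihood it assigns to the reference answers; in the MC2 setting, the score for a question is the normalized probability mass the model places on the set of true answers. A minimal sketch of that calculation, assuming per-answer likelihood scores have already been obtained from the model:

```python
def mc2_score(true_scores, false_scores):
    """MC2-style score for one question: the fraction of total
    (non-negative) likelihood mass assigned to the true answers.

    true_scores / false_scores are the model's scores for the
    true and false reference answers, respectively.
    """
    total = sum(true_scores) + sum(false_scores)
    return sum(true_scores) / total

# Example: a model that puts most of its mass on the true answers.
score = mc2_score(true_scores=[0.6, 0.2], false_scores=[0.1, 0.1])
print(round(score, 2))  # prints: 0.8
```

The per-question scores are then averaged over all 817 questions to produce a benchmark score between 0 and 1.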

Paper: https://arxiv.org/abs/2109.07958

Progress Over Time

Interactive timeline showing model performance evolution on TruthfulQA


TruthfulQA Leaderboard

17 models • 0 verified
| Rank | Organization | Params | Context | Cost (in / out) | License |
|------|--------------|--------|---------|-----------------|---------|
| 1  |                           | 60B  |      |               | |
| 2  |                           | 8B   | 128K | $0.50 / $0.50 | |
| 3  | Microsoft                 | 4B   |      |               | |
| 4  |                           | 4B   | 128K | $0.10 / $0.10 | |
| 5  | Nous Research             | 70B  |      |               | |
| 6  |                           | 70B  |      |               | |
| 7  | Alibaba Cloud / Qwen Team | 15B  |      |               | |
| 8  |                           | 398B | 256K | $2.00 / $8.00 | |
| 9  |                           | 7B   |      |               | |
| 10 | Alibaba Cloud / Qwen Team | 33B  |      |               | |
| 11 |                           | 104B | 128K | $0.25 / $1.00 | |
| 12 | Alibaba Cloud / Qwen Team | 72B  |      |               | |
| 13 | Alibaba Cloud / Qwen Team | 32B  | 128K | $0.09 / $0.09 | |
| 14 |                           | 52B  | 256K | $0.20 / $0.40 | |
| 15 |                           | 8B   |      |               | |
| 16 | Alibaba Cloud / Qwen Team | 7B   |      |               | |
| 17 |                           | 12B  | 128K | $0.15 / $0.15 | |

FAQ

Common questions about TruthfulQA

What is TruthfulQA?
TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance, and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.

Where can I find the TruthfulQA paper?
The TruthfulQA paper is available at https://arxiv.org/abs/2109.07958. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How do models rank on the TruthfulQA leaderboard?
The TruthfulQA leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.775. The average score across all models is 0.589.

What is the highest TruthfulQA score?
The highest TruthfulQA score is 0.775, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models have been evaluated on TruthfulQA?
17 models have been evaluated on the TruthfulQA benchmark, with 0 verified results and 17 self-reported results.

What categories does TruthfulQA cover?
TruthfulQA is categorized under finance, general, healthcare, legal, and reasoning. The benchmark evaluates text models.
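The leaderboard summary statistics above are simple aggregates over the reported per-model scores. A small sketch, using hypothetical score values for illustration (the real leaderboard averages 17 entries):

```python
# Hypothetical self-reported TruthfulQA scores for a few models.
scores = [0.775, 0.62, 0.55, 0.41]

average = sum(scores) / len(scores)  # mean score across models
best = max(scores)                   # state-of-the-art score

print(f"average={average:.3f}, best={best:.3f}")
# prints: average=0.589, best=0.775
```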