
Hallusion Bench

A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs), challenging models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion.

Paper

Progress Over Time

Interactive timeline showing model performance evolution on Hallusion Bench


Hallusion Bench Leaderboard

15 models
| Rank | Developer | Params | Context | Cost ($/1M tokens, input / output) | License |
|---:|---|---:|---:|---|---|
| 1 | Alibaba Cloud / Qwen Team | 27B | | | |
| 2 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | |
| 3 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | |
| 4 | Alibaba Cloud / Qwen Team | 33B | | | |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | |
| 6 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | |
| 7 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 8 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 9 | Alibaba Cloud / Qwen Team | 33B | | | |
| 10 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | |
| 11 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | |
| 12 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | |
| 13 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 14 | Alibaba Cloud / Qwen Team | 72B | | | |
| 15 | Alibaba Cloud / Qwen Team | 8B | | | |

FAQ

Common questions about Hallusion Bench

What is Hallusion Bench?
A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs), challenging models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion.

Where can I find the Hallusion Bench paper?
The Hallusion Bench paper is available at https://arxiv.org/abs/2310.14566. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Hallusion Bench leaderboard?
The Hallusion Bench leaderboard ranks 15 AI models by their performance on this benchmark. Currently, Qwen3.5-27B by Alibaba Cloud / Qwen Team leads with a score of 0.700. The average score across all models is 0.634.

What is the highest Hallusion Bench score?
The highest Hallusion Bench score is 0.700, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
15 models have been evaluated on the Hallusion Bench benchmark, with 0 verified results and 15 self-reported results.

What categories does Hallusion Bench belong to?
Hallusion Bench is categorized under reasoning and vision. The benchmark evaluates multimodal models.
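The scores above (e.g. 0.700) are accuracies on the benchmark's yes/no questions, i.e. the fraction answered correctly. The exact aggregation the leaderboard uses (per-question vs per-figure accuracy) is an assumption here; a minimal sketch of per-question scoring looks like this:

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of yes/no predictions matching the ground-truth answers.

    Comparison is case- and whitespace-insensitive; `predictions` and
    `answers` are hypothetical parallel lists, one entry per question.
    """
    assert len(predictions) == len(answers), "one prediction per question"
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy example: 3 of 4 answers match
print(accuracy(["Yes", "no", "Yes", "No"], ["yes", "no", "no", "no"]))  # → 0.75
```

In practice, parsing a free-form model response down to a clean "yes"/"no" is the harder step; the benchmark's own evaluation scripts handle that before a comparison like the one above.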