Hallusion Bench
A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs). It challenges models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion.
Progress Over Time
[Interactive timeline showing model performance evolution on Hallusion Bench, with a state-of-the-art frontier line and filters for open vs. proprietary models.]
Hallusion Bench Leaderboard (15 models)
| Rank | Organization | Params | Context | Cost (input / output, per 1M tokens) | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 2 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 3 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 4 | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | — |
| 6 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | — |
| 7 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | — |
| 8 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | — |
| 9 | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 10 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | — |
| 11 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | — |
| 12 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | — |
| 13 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | — |
| 14 | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 15 | Alibaba Cloud / Qwen Team | 8B | — | — | — |
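To make the Cost column concrete: each entry lists the input and output price per 1M tokens, and "—" marks models with no listed API pricing. A minimal sketch of parsing such a cell (a hypothetical helper, not part of the leaderboard site):

```python
# Parse a Cost cell like "$0.25 / $2.00" into (input, output) prices
# per 1M tokens; "—" (no listed pricing) yields None.
def parse_cost(cell: str):
    if cell.strip() in {"—", "-", ""}:
        return None
    inp, out = (part.strip().lstrip("$") for part in cell.split("/"))
    return float(inp), float(out)

print(parse_cost("$0.25 / $2.00"))  # -> (0.25, 2.0)
print(parse_cost("—"))              # -> None
```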
FAQ
Common questions about Hallusion Bench
What is Hallusion Bench?
A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs). It challenges models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion.
Where can I find the Hallusion Bench paper?
The Hallusion Bench paper is available at https://arxiv.org/abs/2310.14566. It details the benchmark's methodology, dataset creation, and evaluation criteria.
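As a rough illustration of how a benchmark like this can be scored: each question expects a yes/no answer, and besides plain per-question accuracy the paper also reports stricter grouped accuracies, where a group of related questions counts only if every member is answered correctly. A minimal sketch under assumed record and field names (this is not the official evaluation script):

```python
# Hypothetical HallusionBench-style scoring sketch. Each record carries a
# 'group_id' (related question variants), a ground truth 'gt', and the
# model's 'pred', both "yes"/"no" strings.
from collections import defaultdict

def score(records):
    correct = [r["pred"].strip().lower() == r["gt"].strip().lower()
               for r in records]
    q_acc = sum(correct) / len(correct)  # plain per-question accuracy

    groups = defaultdict(list)
    for r, ok in zip(records, correct):
        groups[r["group_id"]].append(ok)
    # stricter metric: a group scores only if ALL its variants are correct
    g_acc = sum(all(oks) for oks in groups.values()) / len(groups)
    return q_acc, g_acc

demo = [
    {"group_id": 0, "gt": "yes", "pred": "yes"},
    {"group_id": 0, "gt": "no",  "pred": "yes"},  # wrong -> group 0 fails
    {"group_id": 1, "gt": "no",  "pred": "no"},
]
print(score(demo))  # 2/3 of questions correct, 1/2 of groups fully correct
```

The grouped metric is what makes such benchmarks hard to game: a model that always answers "yes" can score near 0.5 per question but near zero per group.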
Which model leads the Hallusion Bench leaderboard?
The leaderboard ranks 15 AI models by their performance on this benchmark. Currently, Qwen3.5-27B by Alibaba Cloud / Qwen Team leads with a score of 0.700; the average score across all models is 0.634.
What is the highest Hallusion Bench score?
The highest Hallusion Bench score is 0.700, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.
How many models have been evaluated on Hallusion Bench?
15 models have been evaluated on the benchmark, with 0 verified results and 15 self-reported results.
What categories does Hallusion Bench belong to?
Hallusion Bench is categorized under reasoning and vision, and it evaluates multimodal models.