Hallusion Bench

Progress Over Time

[Chart: model performance on Hallusion Bench over time, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

Hallusion Bench Leaderboard

16 models

Rank  Organization               Size  Context  Cost (input / output)  License
1     Alibaba Cloud / Qwen Team  27B   262K     $0.30 / $2.40
2     Alibaba Cloud / Qwen Team  35B
3     Alibaba Cloud / Qwen Team  35B
4     Alibaba Cloud / Qwen Team  122B
5     Alibaba Cloud / Qwen Team  33B
6     Alibaba Cloud / Qwen Team  236B
7     Alibaba Cloud / Qwen Team  31B
8     Alibaba Cloud / Qwen Team  9B    262K     $0.18 / $2.09
9     Alibaba Cloud / Qwen Team  4B    262K     $0.10 / $1.00
10    Alibaba Cloud / Qwen Team  33B
11    Alibaba Cloud / Qwen Team  236B
12    Alibaba Cloud / Qwen Team  31B
13    Alibaba Cloud / Qwen Team  9B
14    Alibaba Cloud / Qwen Team  4B    262K     $0.10 / $0.60
15    Alibaba Cloud / Qwen Team  72B
16    Alibaba Cloud / Qwen Team  8B
About this benchmark

What is Hallusion Bench?

A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs). It challenges models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion.

Hallusion Bench is a multimodal benchmark that evaluates models on reasoning and vision tasks. LLM Stats tracks 16 models on this benchmark, scored on a 0–1 scale. The current average is 0.638, with the leader at 0.700.


Current leaders

Qwen3.5-27B from Alibaba Cloud / Qwen Team currently leads the Hallusion Bench leaderboard with a score of 0.700 across 16 evaluated AI models.

1. Qwen3.5-27B (Alibaba Cloud / Qwen Team): 70.0%
2. Qwen3.6-35B-A3B (Alibaba Cloud / Qwen Team): 69.8%
3. Qwen3.5-35B-A3B (Alibaba Cloud / Qwen Team): 67.9%

Source paper

Title: HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Authors: Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, and 8 others
Published
Abstract

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.
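
The abstract reports question-pair accuracy rather than plain per-question accuracy. The sketch below illustrates why a paired metric can be much stricter. It assumes, as an illustration only, that a pair counts as correct when every question belonging to it is answered correctly; the abstract does not spell out the exact definition, and the function name, the `pair_id` field, and the toy data are all hypothetical.

```python
from collections import defaultdict


def question_pair_accuracy(results):
    """Fraction of question pairs in which every question was answered correctly.

    `results` is an iterable of (pair_id, is_correct) tuples, one per question.
    Assumption: a pair is solved only if all of its questions are correct.
    """
    pairs = defaultdict(list)
    for pair_id, is_correct in results:
        pairs[pair_id].append(bool(is_correct))
    if not pairs:
        return 0.0
    solved = sum(all(answers) for answers in pairs.values())
    return solved / len(pairs)


# Hypothetical toy results: three pairs of two questions each.
toy_results = [
    ("p1", True), ("p1", True),    # both correct: pair solved
    ("p2", True), ("p2", False),   # one miss: pair not solved
    ("p3", False), ("p3", False),  # both wrong: pair not solved
]

per_question = sum(ok for _, ok in toy_results) / len(toy_results)
per_pair = question_pair_accuracy(toy_results)
print(f"per-question accuracy: {per_question:.3f}")
print(f"question-pair accuracy: {per_pair:.3f}")
```

In this toy data half of the individual answers are correct, yet only one pair in three is fully solved, which is the kind of gap a control-group structure is meant to expose.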

FAQ

Common questions about the Hallusion Bench benchmark and leaderboard.

What is the Hallusion Bench benchmark?

A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs). It challenges models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion.

What is the Hallusion Bench leaderboard?

The Hallusion Bench leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, Qwen3.5-27B by Alibaba Cloud / Qwen Team leads with a score of 0.700. The average score across all models is 0.638.

What is the highest Hallusion Bench score?

The highest Hallusion Bench score is 0.700, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.

How many models are evaluated on Hallusion Bench?

16 models have been evaluated on the Hallusion Bench benchmark, with 0 verified results and 16 self-reported results.

Where can I find the Hallusion Bench paper?

The Hallusion Bench paper is available at https://arxiv.org/abs/2310.14566. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Hallusion Bench cover?

Hallusion Bench is categorized under reasoning and vision. The benchmark evaluates multimodal models.

What is the best open-source model on Hallusion Bench?

Qwen3.5-27B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on Hallusion Bench, with a score of 0.700 (rank #1).

Which model offers the best value on Hallusion Bench?

Among models scoring within 10% of the leader, Qwen3 VL 4B Thinking from Alibaba Cloud / Qwen Team is the cheapest, at $0.10 per million input tokens with a score of 0.641.
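
The selection rule in this answer can be sketched as a filter followed by a minimum. This is a minimal sketch, assuming "within 10% of the leader" means a score of at least 90% of the top score (the page does not say whether the margin is relative or absolute). The function name and the third entry in the list are hypothetical; the first two rows use the scores and input prices quoted on this page.

```python
def best_value(models, tolerance=0.10):
    """Return the cheapest model whose score is within `tolerance` of the leader.

    Assumption: "within 10%" is relative, i.e. score >= leader * (1 - tolerance).
    Each model is a dict with "name", "score", and "input_price" (USD per 1M input tokens).
    """
    if not models:
        return None
    leader_score = max(m["score"] for m in models)
    threshold = leader_score * (1 - tolerance)
    eligible = [m for m in models if m["score"] >= threshold]
    return min(eligible, key=lambda m: m["input_price"])


models = [
    {"name": "Qwen3.5-27B", "score": 0.700, "input_price": 0.30},
    {"name": "Qwen3 VL 4B Thinking", "score": 0.641, "input_price": 0.10},
    # Hypothetical entry, not on the leaderboard: cheaper, but outside the margin.
    {"name": "hypothetical-budget-model", "score": 0.500, "input_price": 0.05},
]

pick = best_value(models)
print(pick["name"], pick["score"], pick["input_price"])
```

The hypothetical budget entry is excluded even though it is the cheapest, because its score falls below the 0.63 cutoff implied by a 0.700 leader.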

How recent are the Hallusion Bench leaderboard results?

The Hallusion Bench leaderboard was last updated in July 2026 and currently includes 16 evaluated models.