MMStar

Progress Over Time

[Interactive chart: timeline of model performance on MMStar, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

MMStar Leaderboard

22 models
| Rank | Organization | Size | Context | Cost |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | - | 1.0M | $0.50 / $3.00 |
| 2 | Alibaba Cloud / Qwen Team | 122B | - | - |
| 3 | Alibaba Cloud / Qwen Team | 35B | - | - |
| 4 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 |
| 5 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 6 | Alibaba Cloud / Qwen Team | 33B | - | - |
| 7 | Alibaba Cloud / Qwen Team | 236B | - | - |
| 8 | Alibaba Cloud / Qwen Team | 236B | - | - |
| 9 | Alibaba Cloud / Qwen Team | 33B | - | - |
| 10 | Alibaba Cloud / Qwen Team | 31B | - | - |
| 11 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 12 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 13 | Alibaba Cloud / Qwen Team | 31B | - | - |
| 14 | Alibaba Cloud / Qwen Team | 9B | - | - |
| 15 | Alibaba Cloud / Qwen Team | 72B | - | - |
| 16 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
| 17 | Alibaba Cloud / Qwen Team | 34B | - | - |
| 18 | Alibaba Cloud / Qwen Team | 7B | - | - |
| 19 | Alibaba Cloud / Qwen Team | 8B | - | - |
| 20 | DeepSeek | 27B | - | - |
| 21 | - | 16B | - | - |
| 22 | - | 3B | - | - |
About this benchmark

What is MMStar?

MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples, meticulously selected by humans, that evaluates 6 core capabilities and 18 detailed axes. It addresses two problems in existing multimodal evaluations: samples that can be answered without the visual content, and unintentional leakage of evaluation samples into training data.

LLM Stats files MMStar under its multimodal, reasoning, general, and vision categories and tracks 22 models on it, scored on a 0–1 scale. The current average is 0.725, with the leader at 0.833.

Current leaders

Qwen3.6 Plus from Alibaba Cloud / Qwen Team currently leads the MMStar leaderboard with a score of 0.833 across 22 evaluated AI models.

1. Qwen3.6 Plus (Alibaba Cloud / Qwen Team): 83.3%
2. Qwen3.5-122B-A10B (Alibaba Cloud / Qwen Team): 82.9%
3. Qwen3.5-35B-A3B (Alibaba Cloud / Qwen Team): 81.9%

Source paper

Title: Are We on the Right Way for Evaluating Large Vision-Language Models?
Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, and 7 others

Abstract

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLM. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline, human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
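
The abstract mentions two metrics for data leakage and multi-modal gain but does not define them. The sketch below assumes a common reading: gain is the model's score with images minus its score without them, and leakage is the image-free score minus the text-only LLM backbone's score, floored at zero. The function names are illustrative, and the definitions should be checked against the paper itself. The only numbers used are the Sphinx-X-MoE figures quoted in the abstract.

```python
def multimodal_gain(score_with_images: float, score_without_images: float) -> float:
    """Points the LVLM gains when it can actually see the images (assumed definition)."""
    return score_with_images - score_without_images


def multimodal_leakage(score_without_images: float, llm_backbone_score: float) -> float:
    """Points the image-blind LVLM scores above its LLM backbone, floored at 0 (assumed definition)."""
    return max(0.0, score_without_images - llm_backbone_score)


# Figures from the abstract: Sphinx-X-MoE scores 43.6% on MMMU without images,
# 17.9 points above its LLM backbone, which implies a backbone score of 25.7%.
sphinx_blind = 43.6
sphinx_backbone = round(sphinx_blind - 17.9, 1)

leakage = multimodal_leakage(sphinx_blind, sphinx_backbone)
print(f"leakage: {leakage:.1f} points")  # prints "leakage: 17.9 points"
```

Under this reading, a large leakage value suggests the benchmark samples were memorized during multi-modal training rather than solved from the image.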

FAQ

Common questions about the MMStar benchmark and leaderboard.

What is the MMStar benchmark?

MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples, meticulously selected by humans, that evaluates 6 core capabilities and 18 detailed axes. It addresses two problems in existing multimodal evaluations: samples that can be answered without the visual content, and unintentional leakage of evaluation samples into training data.

What is the MMStar leaderboard?

The MMStar leaderboard ranks 22 AI models based on their performance on this benchmark. Currently, Qwen3.6 Plus by Alibaba Cloud / Qwen Team leads with a score of 0.833. The average score across all models is 0.725.

What is the highest MMStar score?

The highest MMStar score is 0.833, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.

How many models are evaluated on MMStar?

22 models have been evaluated on the MMStar benchmark, with 0 verified results and 22 self-reported results.

Where can I find the MMStar paper?

The MMStar paper is available at https://arxiv.org/abs/2403.20330. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MMStar cover?

MMStar is categorized under multimodal, reasoning, general, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on MMStar?

Qwen3.5-122B-A10B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on MMStar, with a score of 0.829 (rank #2).

Which model offers the best value on MMStar?

Among models scoring within 10% of the leader, Qwen3 VL 8B Thinking from Alibaba Cloud / Qwen Team is the cheapest, at $0.18 per million input tokens with a score of 0.753.
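
The selection rule above can be sketched as a filter followed by a minimum. This is not the site's actual implementation. It assumes "within 10% of the leader" means a score of at least 90% of the top score, and that the first figure in each listed cost pair is the price per million input tokens. The `Entry` type and `best_value` function are hypothetical names. The data is limited to models whose names and scores appear on this page, and models with no listed price are skipped.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float                  # MMStar score on the 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if not listed


def best_value(entries: list[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring at least (1 - tolerance) * leader score."""
    leader_score = max(e.score for e in entries)
    cutoff = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.input_price is not None and e.score >= cutoff
    ]
    return min(candidates, key=lambda e: e.input_price) if candidates else None


entries = [
    Entry("Qwen3.6 Plus", 0.833, 0.50),
    Entry("Qwen3.5-122B-A10B", 0.829, None),
    Entry("Qwen3.5-35B-A3B", 0.819, None),
    Entry("Qwen3 VL 8B Thinking", 0.753, 0.18),
]

winner = best_value(entries)
print(winner.model)  # prints "Qwen3 VL 8B Thinking"
```

With a leader at 0.833 the cutoff is about 0.750, so a score of 0.753 narrowly qualifies. A slightly stricter tolerance would change the answer.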

How recent are the MMStar leaderboard results?

The MMStar leaderboard was last updated in July 2026 and currently includes 22 evaluated models.