MMStar

MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples, each meticulously selected by humans, that evaluate 6 core capabilities across 18 detailed axes. The benchmark addresses two issues in existing multimodal evaluations: visual content that is unnecessary for answering many questions, and unintentional data leakage from training corpora.

Paper

The MMStar paper is available at https://arxiv.org/abs/2403.20330.

Progress Over Time

[Interactive timeline: model performance evolution on MMStar over time, with the state-of-the-art frontier shown separately for open and proprietary models]

MMStar Leaderboard

21 models (all results self-reported; model names, scores, and license data did not survive extraction)

 #   Organization                Params  Context  Cost (input / output)
 1   Alibaba Cloud / Qwen Team   —       —        —
 2   Alibaba Cloud / Qwen Team   122B    262K     $0.40 / $3.20
 3   Alibaba Cloud / Qwen Team   35B     262K     $0.25 / $2.00
 4   Alibaba Cloud / Qwen Team   27B     —        —
 5   Alibaba Cloud / Qwen Team   33B     —        —
 6   Alibaba Cloud / Qwen Team   236B    262K     $0.45 / $3.49
 7   Alibaba Cloud / Qwen Team   236B    262K     $0.30 / $1.49
 8   Alibaba Cloud / Qwen Team   33B     —        —
 9   Alibaba Cloud / Qwen Team   31B     262K     $0.20 / $1.00
 10  Alibaba Cloud / Qwen Team   9B      262K     $0.18 / $2.09
 11  Alibaba Cloud / Qwen Team   4B      262K     $0.10 / $1.00
 12  Alibaba Cloud / Qwen Team   31B     262K     $0.20 / $0.70
 13  Alibaba Cloud / Qwen Team   9B      262K     $0.08 / $0.50
 14  Alibaba Cloud / Qwen Team   72B     —        —
 15  Alibaba Cloud / Qwen Team   4B      262K     $0.10 / $0.60
 16  Alibaba Cloud / Qwen Team   34B     —        —
 17  Alibaba Cloud / Qwen Team   7B      —        —
 18  Alibaba Cloud / Qwen Team   8B      —        —
 19  DeepSeek                    27B     129K     —
 20  —                           16B     —        —
 21  —                           3B      —        —

FAQ

Common questions about MMStar

What is MMStar?
MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples, each meticulously selected by humans, that evaluate 6 core capabilities across 18 detailed axes. The benchmark addresses two issues in existing multimodal evaluations: visual content that is unnecessary for answering many questions, and unintentional data leakage from training corpora.

Where can I find the MMStar paper?
The MMStar paper is available at https://arxiv.org/abs/2403.20330. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on MMStar?
The MMStar leaderboard ranks 21 AI models by their performance on this benchmark. Qwen3.6 Plus by Alibaba Cloud / Qwen Team currently leads with a score of 0.833; the average score across all models is 0.720.

What is the highest MMStar score?
The highest MMStar score is 0.833, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.

How many models have been evaluated on MMStar?
21 models have been evaluated on the MMStar benchmark: 0 verified results and 21 self-reported results.

How is MMStar categorized?
MMStar is categorized under general, multimodal, reasoning, and vision. The benchmark evaluates multimodal models.