MMStar
MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples meticulously selected by humans to evaluate 6 core capabilities across 18 detailed axes. The benchmark addresses two issues in existing multimodal evaluations: samples that can be answered without the visual content, and unintentional data leakage from training corpora.
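MMStar items are multiple-choice questions scored by accuracy. The sketch below shows one common way such benchmarks are scored: extract the predicted option letter from a free-form model response and compare it to the gold answer. The helper names and the letter-extraction heuristic are illustrative assumptions, not MMStar's official evaluation code.

```python
import re

def extract_choice(response: str):
    """Pull the first standalone option letter (A-D) from a model response.

    Illustrative heuristic only; real harnesses use more robust parsing.
    """
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None

def multiple_choice_accuracy(predictions, answers):
    """Fraction of responses whose extracted choice matches the gold letter."""
    correct = sum(
        extract_choice(pred) == gold
        for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)

preds = ["The answer is B.", "C", "I think (A) is right.", "unclear"]
golds = ["B", "C", "D", "A"]
print(multiple_choice_accuracy(preds, golds))  # 2 of 4 match -> 0.5
```

A leaderboard score such as 0.833 would correspond to this accuracy computed over all 1,500 samples.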
Progress Over Time
[Interactive timeline showing model performance evolution on MMStar, with a state-of-the-art frontier and open vs. proprietary markers; chart omitted.]
MMStar Leaderboard
21 models
| # | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | — | — | — | — |
| 2 | — | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 3 | — | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 4 | — | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 5 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 6 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | — |
| 7 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | — |
| 8 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 9 | — | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | — |
| 10 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | — |
| 11 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | — |
| 12 | — | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | — |
| 13 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | — |
| 14 | — | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 15 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | — |
| 16 | — | Alibaba Cloud / Qwen Team | 34B | — | — | — |
| 17 | — | Alibaba Cloud / Qwen Team | 7B | — | — | — |
| 18 | — | Alibaba Cloud / Qwen Team | 8B | — | — | — |
| 19 | — | DeepSeek | 27B | 129K | — | — |
| 20 | — | DeepSeek | 16B | — | — | — |
| 21 | — | DeepSeek | 3B | — | — | — |
FAQ
Common questions about MMStar
**What is MMStar?**
MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples meticulously selected by humans to evaluate 6 core capabilities across 18 detailed axes. The benchmark addresses two issues in existing multimodal evaluations: samples that can be answered without the visual content, and unintentional data leakage from training corpora.
**Where can I find the MMStar paper?**
The MMStar paper is available at https://arxiv.org/abs/2403.20330. It details the benchmark methodology, dataset creation, and evaluation criteria.
**How are models ranked on MMStar?**
The MMStar leaderboard ranks 21 AI models by their performance on this benchmark. Currently, Qwen3.6 Plus by Alibaba Cloud / Qwen Team leads with a score of 0.833. The average score across all models is 0.720.
**What is the highest MMStar score?**
The highest MMStar score is 0.833, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.
**How many models have been evaluated on MMStar?**
21 models have been evaluated on the MMStar benchmark, with 0 verified results and 21 self-reported results.
**What categories does MMStar cover?**
MMStar is categorized under general, multimodal, reasoning, and vision, and is designed to evaluate multimodal models.