MME
MME is a comprehensive evaluation benchmark for Multimodal Large Language Models that measures both perception and cognition abilities across 14 subtasks. It features manually designed instruction-answer pairs to avoid data leakage and provides a systematic quantitative assessment of MLLM capabilities.
MME Leaderboard (3 models)
| Rank | Model | Params | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | DeepSeek VL2 | 27B | 129K | — | — |
| 2 | DeepSeek | 16B | — | — | — |
| 3 | DeepSeek | 3B | — | — | — |
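Ranking and averaging over a leaderboard like the one above can be sketched in a few lines. Only the top score (0.225) and the 0.210 average are stated on this page; the scores for the two smaller models below are hypothetical placeholders.

```python
from statistics import mean

# Hypothetical leaderboard entries. The top score (0.225) comes from the page;
# the other two scores are invented placeholders for illustration only.
entries = [
    {"model": "DeepSeek VL2", "params": "27B", "score": 0.225},
    {"model": "DeepSeek (16B)", "params": "16B", "score": 0.215},
    {"model": "DeepSeek (3B)", "params": "3B", "score": 0.190},
]

# Rank descending by score, as the leaderboard does.
ranked = sorted(entries, key=lambda e: e["score"], reverse=True)
for rank, e in enumerate(ranked, start=1):
    print(rank, e["model"], e["score"])

# Average score across all evaluated models (rounded to 3 decimals).
avg = round(mean(e["score"] for e in entries), 3)
print("average:", avg)
```

With these placeholder values the average comes out to 0.21, matching the figure quoted in the FAQ below.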
FAQ
Common questions about MME
What is MME?
MME is a comprehensive evaluation benchmark for Multimodal Large Language Models that measures both perception and cognition abilities across 14 subtasks. It features manually designed instruction-answer pairs to avoid data leakage and provides a systematic quantitative assessment of MLLM capabilities.

Where can I read the MME paper?
The MME paper is available at https://arxiv.org/abs/2306.13394. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the MME leaderboard?
The MME leaderboard ranks 3 AI models by their performance on the benchmark. DeepSeek VL2 by DeepSeek currently leads with a score of 0.225; the average score across all models is 0.210.

What is the highest MME score?
The highest MME score is 0.225, achieved by DeepSeek VL2 from DeepSeek.

How many models have been evaluated on MME?
3 models have been evaluated on the MME benchmark, with 0 verified results and 3 self-reported results.

What categories does MME belong to?
MME is categorized under multimodal, reasoning, and vision, and it evaluates multimodal models.