MME

MME is a comprehensive evaluation benchmark for Multimodal Large Language Models that measures both perception and cognition abilities across 14 subtasks. It features manually designed instruction-answer pairs to avoid data leakage and provides a systematic quantitative assessment of MLLM capabilities.
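
The perception and cognition subtasks are scored from yes/no instruction-answer pairs: in the MME paper, each image comes with two such questions, accuracy is computed per question, accuracy+ per image (both questions must be answered correctly), and a subtask score is the sum of the two percentages. The sketch below illustrates that scheme under those assumptions; the score_subtask name and the prediction data layout are hypothetical conveniences, not the official evaluation script.

```python
# Minimal sketch of MME-style scoring, assuming the protocol described in the
# MME paper: every image carries two manually designed yes/no questions,
# "accuracy" is computed per question, "accuracy+" per image (both questions
# must be answered correctly), and a subtask score is the sum of the two
# percentages (0-200). The data layout and function name are illustrative
# assumptions, not the official evaluation code.
from typing import Dict, List, Tuple

def score_subtask(predictions: Dict[str, List[Tuple[str, str]]]) -> float:
    """predictions maps image id -> list of (predicted, ground_truth) yes/no pairs."""
    total_questions = correct_questions = 0
    total_images = fully_correct_images = 0

    for qa_pairs in predictions.values():
        total_images += 1
        image_all_correct = True
        for predicted, ground_truth in qa_pairs:
            total_questions += 1
            if predicted.strip().lower() == ground_truth.strip().lower():
                correct_questions += 1
            else:
                image_all_correct = False
        if image_all_correct:
            fully_correct_images += 1

    accuracy = correct_questions / total_questions        # per-question metric
    accuracy_plus = fully_correct_images / total_images   # per-image metric
    return 100 * (accuracy + accuracy_plus)                # subtask score, max 200

if __name__ == "__main__":
    demo = {
        "img_001": [("yes", "yes"), ("no", "no")],   # both questions correct
        "img_002": [("yes", "yes"), ("yes", "no")],  # one question wrong
    }
    print(f"subtask score: {score_subtask(demo):.1f}")  # 75% + 50% -> 125.0
```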

Paper: https://arxiv.org/abs/2306.13394

Progress Over Time

Interactive timeline showing model performance evolution on MME, with the state-of-the-art frontier and open vs. proprietary models marked.

MME Leaderboard

3 models

Rank  Model         Organization  Parameters  Context  Score
1     DeepSeek VL2  DeepSeek      27B         129K     0.225

FAQ

Common questions about MME

What is MME?
MME is a comprehensive evaluation benchmark for Multimodal Large Language Models that measures both perception and cognition abilities across 14 subtasks. It features manually designed instruction-answer pairs to avoid data leakage and provides a systematic quantitative assessment of MLLM capabilities.

Where can I find the MME paper?
The MME paper is available at https://arxiv.org/abs/2306.13394. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the MME leaderboard?
The MME leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, DeepSeek VL2 by DeepSeek leads with a score of 0.225. The average score across all models is 0.210.

What is the highest MME score?
The highest MME score is 0.225, achieved by DeepSeek VL2 from DeepSeek.

How many models have been evaluated on MME?
3 models have been evaluated on the MME benchmark, with 0 verified results and 3 self-reported results.

What categories does MME cover?
MME is categorized under multimodal, reasoning, and vision; the benchmark evaluates multimodal models.