Video-MME

Video-MME is the first-ever comprehensive evaluation benchmark for Multi-modal Large Language Models (MLLMs) in video analysis. It features 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs spanning 6 primary visual domains (Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual) and 30 subfields. The benchmark covers video lengths from 11 seconds to 1 hour, integrates multi-modal inputs including video frames, subtitles, and audio, and relies on rigorous manual labeling by expert annotators for precise assessment.
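Because Video-MME questions are multiple-choice with a single correct option, evaluation reduces to accuracy over answer letters. The sketch below shows one way to score a model's predictions and break accuracy down by domain; the JSONL layout and field names ("domain", "answer", "prediction") are illustrative assumptions, not the official release format.

```python
# Minimal sketch of scoring Video-MME-style multiple-choice predictions.
# The file name and field names below are assumptions for illustration,
# not the official dataset schema.
import json
from collections import defaultdict

def score(records):
    """Compute overall and per-domain accuracy for a list of QA records.

    Each record is expected to hold the ground-truth option letter under
    "answer" and the model's chosen letter under "prediction".
    """
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for r in records:
        correct = r["prediction"].strip().upper() == r["answer"].strip().upper()
        per_domain[r["domain"]][0] += int(correct)
        per_domain[r["domain"]][1] += 1
    total = sum(t for _, t in per_domain.values())
    overall = sum(c for c, _ in per_domain.values()) / total
    return overall, {d: c / t for d, (c, t) in per_domain.items()}

if __name__ == "__main__":
    # Hypothetical predictions file: one JSON object per line.
    with open("videomme_predictions.jsonl") as f:
        records = [json.loads(line) for line in f]
    overall, by_domain = score(records)
    print(f"overall accuracy: {overall:.3f}")
    for domain, acc in sorted(by_domain.items()):
        print(f"{domain}: {acc:.3f}")
```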

Paper: https://arxiv.org/abs/2405.21075

Progress Over Time

[Interactive timeline showing model performance evolution on Video-MME, with a state-of-the-art frontier and separate markers for open and proprietary models]

Video-MME Leaderboard

10 models • 0 verified
| Rank | Organization | Params | Context | Cost (input) | Cost (output) |
|------|--------------|--------|---------|--------------|---------------|
| 1 | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 |
| 2 | | | 1.0M | $1.25 | $10.00 |
| 3 | | | 2.1M | $2.50 | $10.00 |
| 4 | | | 1.0M | $0.15 | $0.60 |
| 5 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 | $0.70 |
| 6 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 | $1.00 |
| 7 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 | $2.09 |
| 8 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 | $0.50 |
| 9 | | 8B | 1.0M | $0.07 | $0.30 |
| 10 | | 6B | 128K | $0.05 | $0.10 |

FAQ

Common questions about Video-MME

What is Video-MME?
Video-MME is the first-ever comprehensive evaluation benchmark for Multi-modal Large Language Models (MLLMs) in video analysis, featuring 900 videos (254 hours) with 2,700 human-annotated question-answer pairs across 6 primary visual domains and 30 subfields; see the description above for details.

Where can I find the Video-MME paper?
The Video-MME paper is available at https://arxiv.org/abs/2405.21075. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Video-MME leaderboard?
The leaderboard ranks 10 AI models by their performance on this benchmark. Kimi K2.5 by Moonshot AI currently leads with a score of 0.874; the average score across all models is 0.739.

What is the highest Video-MME score?
The highest Video-MME score is 0.874, achieved by Kimi K2.5 from Moonshot AI.

How many models have been evaluated on Video-MME?
10 models have been evaluated on the Video-MME benchmark, with 0 verified results and 10 self-reported results.

What categories does Video-MME belong to?
Video-MME is categorized under multimodal, reasoning, and vision. The benchmark evaluates multimodal models with multilingual support.