Video-MME
Video-MME is the first comprehensive evaluation benchmark for Multi-modal Large Language Models (MLLMs) in video analysis. It features 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs spanning 6 primary visual domains (Knowledge, Film & Television, Sports Competition, Life Record, Multilingual, and others) and 30 subfields. The benchmark covers video durations ranging from 11 seconds to 1 hour, integrates multi-modal inputs including video frames, subtitles, and audio, and relies on rigorous manual labeling by expert annotators for precise assessment.
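Video-MME questions are multiple-choice, so evaluation reduces to exact-match accuracy over predicted option letters. The sketch below is illustrative only: the field names and normalization are assumptions, not the official evaluation harness.

```python
# Minimal sketch of multiple-choice accuracy scoring, as used for
# benchmarks like Video-MME. The prediction/answer format (single
# option letters such as "A"-"D") is an assumption for illustration.

def score(predictions, answers):
    """Fraction of predicted option letters matching the gold letters."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must align")
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy example with four hypothetical questions:
preds = ["A", "c", "B", "D"]
gold  = ["A", "C", "D", "D"]
print(score(preds, gold))  # 0.75
```

Case-insensitive comparison is a deliberate choice here, since model outputs often vary in letter casing even when the chosen option is correct.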
Progress Over Time
[Interactive timeline of model performance on Video-MME, showing the state-of-the-art frontier for open and proprietary models.]
Video-MME Leaderboard
10 models • 0 verified
| # | Organization | Params | Context | Input Cost | Output Cost |
|---|---|---|---|---|---|
| 1 | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 |
| 2 | Google | — | 1.0M | $1.25 | $10.00 |
| 3 | Google | — | 2.1M | $2.50 | $10.00 |
| 4 | Google | — | 1.0M | $0.15 | $0.60 |
| 5 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 | $0.70 |
| 6 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 | $1.00 |
| 7 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 | $2.09 |
| 8 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 | $0.50 |
| 9 | Google | 8B | 1.0M | $0.07 | $0.30 |
| 10 | Microsoft | 6B | 128K | $0.05 | $0.10 |
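The two cost figures per row appear to be separate input and output prices, presumably per 1M tokens. Under that assumption (and an assumed input:output token ratio), a blended price can be estimated for comparing rows:

```python
# Hedged sketch: estimate a blended per-1M-token price from separate
# input/output prices. The 80/20 input:output split is an assumption
# chosen purely for illustration, as is the per-1M-token unit.

def blended_cost(input_price, output_price, input_ratio=0.8):
    """Weighted average price for a given share of input tokens."""
    return input_price * input_ratio + output_price * (1 - input_ratio)

# Row 1 prices from the table above ($0.60 input, $2.50 output):
print(round(blended_cost(0.60, 2.50), 2))  # 0.98
```

Because output tokens are typically priced several times higher than input tokens, the blended figure is sensitive to the assumed ratio; workloads with long generations should lower `input_ratio` accordingly.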
FAQ
Common questions about Video-MME
**What is Video-MME?**
Video-MME is the first comprehensive evaluation benchmark for Multi-modal Large Language Models (MLLMs) in video analysis, featuring 900 videos (254 hours) with 2,700 human-annotated question-answer pairs spanning 6 primary visual domains and 30 subfields, as described above.

**Where can I read the Video-MME paper?**
The Video-MME paper is available at https://arxiv.org/abs/2405.21075. It details the benchmark methodology, dataset construction, and evaluation criteria.

**How does the Video-MME leaderboard work?**
The leaderboard ranks 10 AI models by their performance on this benchmark. Kimi K2.5 by Moonshot AI currently leads with a score of 0.874, and the average score across all models is 0.739.

**What is the highest Video-MME score?**
The highest Video-MME score is 0.874, achieved by Kimi K2.5 from Moonshot AI.

**How many models have been evaluated on Video-MME?**
10 models have been evaluated on the Video-MME benchmark: 0 verified results and 10 self-reported results.

**What categories does Video-MME cover?**
Video-MME is categorized under multimodal, reasoning, and vision. The benchmark evaluates multimodal models with multilingual support.