MMT-Bench
MMT-Bench is a comprehensive multimodal benchmark for evaluating Large Vision-Language Models towards multitask AGI. It comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.
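Because every MMT-Bench item is a multi-choice question, model evaluation reduces to exact-match accuracy over predicted option letters. The sketch below illustrates that idea only; the field names and data are hypothetical, not the official MMT-Bench evaluation harness.

```python
# Minimal sketch of multi-choice accuracy scoring, as used for
# benchmarks like MMT-Bench. Field names and example data are
# illustrative, not the official evaluation pipeline.

def score(questions, predictions):
    """Fraction of questions where the predicted option letter
    matches the ground-truth answer letter."""
    correct = sum(
        1 for q, p in zip(questions, predictions)
        if p.strip().upper() == q["answer"].strip().upper()
    )
    return correct / len(questions)

# Toy example: two multi-choice visual questions (hypothetical).
questions = [
    {"question": "What color is the traffic light?", "answer": "B"},
    {"question": "Which way is the agent facing?", "answer": "D"},
]
predictions = ["B", "A"]

print(score(questions, predictions))  # 0.5
```

A real harness would also parse free-form model output into an option letter before scoring; the comparison step itself stays this simple.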
Progress Over Time
[Interactive timeline showing model performance evolution on MMT-Bench, with a state-of-the-art frontier and separate series for open and proprietary models.]
MMT-Bench Leaderboard
4 models
| Rank | Organization | Params | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | DeepSeek | 27B | 129K | — | — |
| 1 | Alibaba Cloud / Qwen Team | 8B | — | — | — |
| 3 | DeepSeek | 16B | — | — | — |
| 4 | DeepSeek | 3B | — | — | — |
FAQ
Common questions about MMT-Bench
**What is MMT-Bench?**
MMT-Bench is a comprehensive multimodal benchmark for evaluating Large Vision-Language Models towards multitask AGI. It comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.

**Where can I find the MMT-Bench paper?**
The MMT-Bench paper is available at https://arxiv.org/abs/2404.16006. It details the benchmark's methodology, dataset construction, and evaluation criteria.

**How are models ranked on this leaderboard?**
The MMT-Bench leaderboard ranks 4 AI models by their benchmark scores. Currently, DeepSeek VL2 by DeepSeek leads with a score of 0.636; the average score across all models is 0.608.

**What is the highest MMT-Bench score?**
The highest MMT-Bench score is 0.636, achieved by DeepSeek VL2 from DeepSeek.

**How many models have been evaluated?**
4 models have been evaluated on MMT-Bench, with 0 verified results and 4 self-reported results.

**What categories does MMT-Bench cover?**
MMT-Bench is categorized under general, multimodal, reasoning, and vision, and evaluates multimodal models.