MMT-Bench

MMT-Bench is a comprehensive multimodal benchmark for evaluating Large Vision-Language Models towards multitask AGI. It comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.
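Since MMT-Bench consists of multi-choice questions, a model's score can be read as the fraction of questions answered correctly. A minimal sketch of that scoring rule (the data below is hypothetical; the official evaluation protocol is described in the paper):

```python
def accuracy(predictions, answers):
    """Fraction of multi-choice questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical example: 3 of 4 questions answered correctly.
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "B", "A"]
print(accuracy(preds, gold))  # 0.75
```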

Paper: https://arxiv.org/abs/2404.16006

Progress Over Time

[Interactive timeline showing model performance evolution on MMT-Bench]

MMT-Bench Leaderboard

4 models evaluated (columns: Context, Cost, License).

1. DeepSeek VL2 (DeepSeek): 27B parameters, 129K context, score 0.636
2-4. Models from Alibaba Cloud / Qwen Team

FAQ

Common questions about MMT-Bench

What is MMT-Bench?
MMT-Bench is a comprehensive multimodal benchmark for evaluating Large Vision-Language Models towards multitask AGI. It comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.

Where can I find the MMT-Bench paper?
The MMT-Bench paper is available at https://arxiv.org/abs/2404.16006. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

How are models ranked on the MMT-Bench leaderboard?
The MMT-Bench leaderboard ranks 4 AI models by their performance on this benchmark. DeepSeek VL2 by DeepSeek currently leads with a score of 0.636; the average score across all models is 0.608.

What is the highest MMT-Bench score?
The highest MMT-Bench score is 0.636, achieved by DeepSeek VL2 from DeepSeek.

How many models have been evaluated on MMT-Bench?
4 models have been evaluated on MMT-Bench, with 0 verified results and 4 self-reported results.

How is MMT-Bench categorized?
MMT-Bench falls under the general, multimodal, reasoning, and vision categories, and it evaluates multimodal models.