MM-MT-Bench

A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.
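The scoring protocol the description implies can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the assumption here is that a judge model rates each assistant turn on a 1–10 scale given the dialogue history, and the per-conversation score is the average of those ratings. The `judge` callable and the demo conversation are hypothetical stand-ins for a real LLM judge.

```python
# Minimal LLM-as-a-judge sketch (assumed protocol: judge rates each
# assistant turn 1-10 given the dialogue so far; conversation score is
# the mean rating). The judge backend here is a hypothetical callable.
from typing import Callable, Dict, List


def score_conversation(
    turns: List[Dict[str, str]],
    judge: Callable[[str, str], float],
) -> float:
    """Average the judge's rating over every assistant turn."""
    history: List[Dict[str, str]] = []
    ratings: List[float] = []
    for turn in turns:
        if turn["role"] == "assistant":
            # Give the judge the full dialogue up to this turn.
            context = "\n".join(
                f'{t["role"]}: {t["content"]}' for t in history
            )
            ratings.append(judge(context, turn["content"]))
        history.append(turn)
    if not ratings:
        raise ValueError("no assistant turns to rate")
    return sum(ratings) / len(ratings)


# Usage with a stub judge; a real setup would call a strong LLM here.
demo = [
    {"role": "user", "content": "Describe the chart in this image."},
    {"role": "assistant", "content": "It shows revenue rising each quarter."},
    {"role": "user", "content": "Now summarise it in one sentence."},
    {"role": "assistant", "content": "Quarterly revenue grew steadily."},
]
stub_judge = lambda context, answer: 8.0  # placeholder rating
print(score_conversation(demo, stub_judge))  # 8.0
```

In a real evaluation the stub would be replaced by a prompt to a judge model, and scores would be aggregated across all conversations in the benchmark.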

Progress Over Time

[Interactive timeline: evolution of model performance on MM-MT-Bench over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]

MM-MT-Bench Leaderboard

17 models • 0 verified
| Rank | Organization | Params | Context | Input cost | Output cost |
|------|--------------|--------|---------|------------|-------------|
| 1  | Mistral AI                | 675B | 128K | $2.00 | $5.00 |
| 2  | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 | $1.50 |
| 2  | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 | $3.49 |
| 4  | Alibaba Cloud / Qwen Team | 33B  |      |       |       |
| 5  | Alibaba Cloud / Qwen Team | 33B  |      |       |       |
| 6  | Alibaba Cloud / Qwen Team | 31B  | 262K | $0.20 | $0.70 |
| 7  | Alibaba Cloud / Qwen Team | 9B   | 262K | $0.18 | $2.09 |
| 8  | Alibaba Cloud / Qwen Team | 31B  | 262K | $0.20 | $1.00 |
| 9  | Alibaba Cloud / Qwen Team | 9B   | 262K | $0.08 | $0.50 |
| 9  | Alibaba Cloud / Qwen Team | 4B   | 262K | $0.10 | $1.00 |
| 11 | Alibaba Cloud / Qwen Team | 4B   | 262K | $0.10 | $0.60 |
| 12 | Mistral AI                | 124B | 128K | $2.00 | $6.00 |
| 13 | Mistral AI                | 12B  | 128K | $0.15 | $0.15 |
| 14 |                           | 14B  |      |       |       |
| 15 |                           | 8B   |      |       |       |
| 16 |                           | 3B   |      |       |       |
| 17 | Alibaba Cloud / Qwen Team | 7B   |      |       |       |

FAQ

Common questions about MM-MT-Bench

What is MM-MT-Bench?
A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.

How are models ranked on the leaderboard?
The MM-MT-Bench leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Mistral Large 3 by Mistral AI leads with a score of 84.900. The average score across all models is 9.832.

What is the highest MM-MT-Bench score?
The highest MM-MT-Bench score is 84.900, achieved by Mistral Large 3 from Mistral AI.

How many models have been evaluated?
17 models have been evaluated on the MM-MT-Bench benchmark, with 0 verified results and 17 self-reported results.

What categories does MM-MT-Bench belong to?
MM-MT-Bench is categorized under communication and multimodal, and it evaluates multimodal models.