MM-MT-Bench

Progress Over Time

Interactive timeline showing model performance evolution on MM-MT-Bench

State-of-the-art frontier
Open
Proprietary

MM-MT-Bench Leaderboard

17 models
ContextCostLicense
1
Mistral AI
Mistral AI
675B
2
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
236B
2
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
236B
4
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
33B
5
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
33B
6
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
31B
7
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
9B262K$0.18 / $2.09
8
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
31B
9
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
9B
9
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
4B262K$0.10 / $1.00
11
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
4B262K$0.10 / $0.60
12
Mistral AI
Mistral AI
124B
13
Mistral AI
Mistral AI
12B
1414B
158B
163B
17
Alibaba Cloud / Qwen Team
Alibaba Cloud / Qwen Team
7B
Notice missing or incorrect data?
About this benchmark

What is MM-MT-Bench?

A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.

MM-MT-Bench is a multimodal benchmark evaluating models on multimodal and communication tasks. LLM Stats tracks 17 models on this benchmark, scored on a 0–100 scale. The current average is 9.8, with the leader at 84.9.

Compare leaders on the best AI for multimodal and best AI for communication leaderboards.

Current leaders

Mistral Large 3 from Mistral AI currently leads the MM-MT-Bench leaderboard with a score of 84.900 across 17 evaluated AI models.

1Mistral Large 3Mistral AI84.9%
2Qwen3 VL 235B A22B InstructAlibaba Cloud / Qwen Team8.5%
2Qwen3 VL 235B A22B ThinkingAlibaba Cloud / Qwen Team8.5%

FAQ

Common questions about the MM-MT-Bench benchmark and leaderboard.

What is the MM-MT-Bench benchmark?

A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.

What is the MM-MT-Bench leaderboard?

The MM-MT-Bench leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Mistral Large 3 by Mistral AI leads with a score of 84.900. The average score across all models is 9.832.

What is the highest MM-MT-Bench score?

The highest MM-MT-Bench score is 84.900, achieved by Mistral Large 3 from Mistral AI.

How many models are evaluated on MM-MT-Bench?

17 models have been evaluated on the MM-MT-Bench benchmark, with 0 verified results and 17 self-reported results.

What categories does MM-MT-Bench cover?

MM-MT-Bench is categorized under multimodal and communication. The benchmark evaluates multimodal models.

What is the best open-source model on MM-MT-Bench?

Mistral Large 3 by Mistral AI is the top-ranked open-source model on MM-MT-Bench, with a score of 84.900 (rank #1).

How recent are the MM-MT-Bench leaderboard results?

The MM-MT-Bench leaderboard was last updated in July 2026 and currently includes 17 evaluated models.