MM-MT-Bench
MM-MT-Bench Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Mistral AI | 675B | — | — | ||
| 2 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 2 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 4 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 5 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 6 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 7 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | ||
| 8 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 9 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 9 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | ||
| 11 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | ||
| 12 | Mistral AI | 124B | — | — | ||
| 13 | Mistral AI | 12B | — | — | ||
| 14 | Mistral AI | 14B | — | — | ||
| 15 | Mistral AI | 8B | — | — | ||
| 16 | Mistral AI | 3B | — | — | ||
| 17 | Alibaba Cloud / Qwen Team | 7B | — | — |
What is MM-MT-Bench?
MM-MT-Bench is a multi-turn, LLM-as-a-judge benchmark that tests how well multimodal instruction-tuned models follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot setting.
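The multi-turn, judge-based setup described above can be pictured with a small sketch. It is illustrative only: the names `score_conversation`, `score_model`, and `stub_judge` are hypothetical, the judge is a placeholder rather than a real LLM call, and averaging per turn and then per conversation is an assumption, not the benchmark's documented aggregation rule.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical representation: a conversation is a sequence of
# (user_prompt, model_reply) turns.
Turn = Tuple[str, str]
# A judge takes the conversation and a turn index and returns a grade.
Judge = Callable[[Sequence[Turn], int], float]


def score_conversation(conversation: Sequence[Turn], judge: Judge) -> float:
    """Average the judge's grade over every turn of one conversation."""
    return mean(judge(conversation, i) for i in range(len(conversation)))


def score_model(conversations: Sequence[Sequence[Turn]], judge: Judge) -> float:
    """Average the per-conversation scores (assumed aggregation, see lead-in)."""
    return mean(score_conversation(c, judge) for c in conversations)


def stub_judge(conversation: Sequence[Turn], turn_index: int) -> float:
    """Placeholder for an LLM judge: 10.0 for a non-empty reply, else 1.0."""
    _, reply = conversation[turn_index]
    return 10.0 if reply.strip() else 1.0
```

In a real evaluation, `stub_judge` would be replaced by a call to a judge model that sees the dialogue history up to the graded turn, along with any images and reference answers.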
The benchmark covers multimodal and communication tasks. LLM Stats tracks 17 models on it, scored on a 0–100 scale. The current average score is 9.8, and the leading score is 84.9.
Current leaders
Mistral Large 3 from Mistral AI currently leads the MM-MT-Bench leaderboard with a score of 84.9, the highest among the 17 evaluated models.