MM-MT-Bench
A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.
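The evaluation flow described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `model_reply` and `judge_score` are hypothetical stand-ins for real API calls to the candidate model and the judge model, and the fixed rating is a placeholder for a parsed judge verdict.

```python
from statistics import mean

# Hypothetical stand-in: the candidate model answers the latest user
# turn given the conversation so far.
def model_reply(history, user_turn):
    return f"answer to: {user_turn}"

# Hypothetical stand-in: an LLM judge rates the reply 1-10 against a
# reference answer. A real judge call would parse the rating from the
# judge model's output.
def judge_score(history, user_turn, reply, reference):
    return 8

def evaluate_conversation(turns):
    """Score one multi-turn dialogue: every turn is answered in context
    and judged individually; the conversation score is the mean rating."""
    history, scores = [], []
    for user_turn, reference in turns:
        reply = model_reply(history, user_turn)
        scores.append(judge_score(history, user_turn, reply, reference))
        history += [("user", user_turn), ("assistant", reply)]
    return mean(scores)

turns = [("Describe the chart.", "ref A"), ("Now summarize it.", "ref B")]
print(evaluate_conversation(turns))  # 8
```

A benchmark run would repeat this over every conversation in the test set and average the per-conversation scores.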
Progress Over Time

[Interactive timeline of model performance evolution on MM-MT-Bench; the chart marks the state-of-the-art frontier and distinguishes open from proprietary models.]
MM-MT-Bench Leaderboard
17 models • 0 verified
| Rank | Provider | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Mistral AI | 675B | 128K | $2.00 / $5.00 |
| 2 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 |
| 2 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 4 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 5 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 6 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 7 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 8 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 9 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 |
| 9 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 11 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
| 12 | Mistral AI | 124B | 128K | $2.00 / $6.00 |
| 13 | Mistral AI | 12B | 128K | $0.15 / $0.15 |
| 14 | Mistral AI | 14B | — | — |
| 15 | Mistral AI | 8B | — | — |
| 16 | Mistral AI | 3B | — | — |
| 17 | Alibaba Cloud / Qwen Team | 7B | — | — |
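Cost figures in the table above appear to be paired input/output prices; assuming the common convention of dollars per million tokens (not stated on the page), the cost of a single request can be estimated as:

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one request, assuming prices are quoted per
    1M tokens (an assumption about this table's units)."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Example with the rank-1 row's prices ($2.00 input / $5.00 output):
# a request with 10k input tokens and 2k output tokens.
print(round(request_cost(10_000, 2_000, 2.00, 5.00), 4))  # 0.03
```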
FAQ
Common questions about MM-MT-Bench
What is MM-MT-Bench?
A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.

Which model performs best on MM-MT-Bench?
The MM-MT-Bench leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Mistral Large 3 by Mistral AI leads with a score of 84.900. The average score across all models is 9.832.

What is the highest MM-MT-Bench score?
The highest MM-MT-Bench score is 84.900, achieved by Mistral Large 3 from Mistral AI.

How many models have been evaluated on MM-MT-Bench?
17 models have been evaluated on the MM-MT-Bench benchmark, with 0 verified results and 17 self-reported results.

What does MM-MT-Bench measure?
MM-MT-Bench is categorized under communication and multimodal. The benchmark evaluates multimodal models.