MT-Bench
MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.
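To make the LLM-as-judge approach concrete, here is a minimal Python sketch of MT-Bench-style single-answer grading. The prompt wording, the `[[rating]]` extraction format, and the helper names are illustrative assumptions following the general pattern described in the MT-Bench paper, not the exact official templates:

```python
import re
from typing import Optional

# Hypothetical judge prompt in the spirit of MT-Bench single-answer grading:
# a strong LLM is asked to rate a response on a 1-10 scale and emit the
# rating in a machine-parseable "[[rating]]" format.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the "
    "response provided by an AI assistant to the user question below. "
    "Rate the response on a scale of 1 to 10 by strictly following this "
    'format: "Rating: [[rating]]".\n\n'
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}\n"
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the judge template with one turn of the conversation."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_rating(judge_output: str) -> Optional[float]:
    """Extract the numeric rating from the judge model's reply, if present."""
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(match.group(1)) if match else None


def mtbench_score(turn_ratings: list) -> float:
    """A model's benchmark score is the mean judge rating across turns."""
    return sum(turn_ratings) / len(turn_ratings)
```

In a real evaluation, `build_judge_prompt` would be sent to the judge model once per turn of each multi-turn conversation, and the parsed ratings averaged into the leaderboard score; the parseable rating format is what makes the evaluation scalable, and the judge's free-text reasoning before the rating is what makes it explainable.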
Progress Over Time
[Interactive timeline showing model performance evolution on MT-Bench, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]
MT-Bench Leaderboard
12 models • 0 verified
| Rank | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Hermes 3 70B | Nous Research | 70B | — | — | — |
| 2 | — | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | — |
| 3 | — | — | 50B | — | — | — |
| 4 | — | DeepSeek | 236B | 8K | $0.14 / $0.28 | — |
| 5 | — | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 | — |
| 6 | — | Mistral AI | 123B | 128K | $2.00 / $6.00 | — |
| 7 | — | Alibaba Cloud / Qwen Team | 8B | — | — | — |
| 8 | — | Mistral AI | 24B | 32K | $0.07 / $0.14 | — |
| 9 | — | Mistral AI | 8B | 128K | $0.10 / $0.10 | — |
| 10 | — | — | 8B | — | — | — |
| 11 | — | Mistral AI | 12B | 128K | $0.15 / $0.15 | — |
| 12 | — | — | 70B | — | — | — |
FAQ
Common questions about MT-Bench
**What is MT-Bench?**
MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.

**Where can I find the MT-Bench paper?**
The MT-Bench paper is available at https://arxiv.org/abs/2306.05685. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

**How are models ranked on the MT-Bench leaderboard?**
The MT-Bench leaderboard ranks 12 AI models by their performance on this benchmark. Currently, Hermes 3 70B by Nous Research leads with a score of 8.990. The average score across all models is 1.471.

**What is the highest MT-Bench score?**
The highest MT-Bench score is 8.990, achieved by Hermes 3 70B from Nous Research.

**How many models have been evaluated on MT-Bench?**
12 models have been evaluated on the MT-Bench benchmark, with 0 verified results and 12 self-reported results.

**What categories does MT-Bench cover?**
MT-Bench is categorized under communication, creativity, general, reasoning, and roleplay. The benchmark evaluates text models.