MT-Bench

MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.

Paper: https://arxiv.org/abs/2306.05685
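
The LLM-as-judge setup behind MT-Bench is easy to sketch. Below is a minimal, illustrative Python version of single-answer grading, in which a strong judge model rates a two-turn exchange on a 1-10 scale. `call_judge` is a hypothetical stand-in for whatever LLM client you use, and the prompt wording paraphrases the idea from the paper rather than reproducing the official judge template.

```python
import re

# Minimal sketch of MT-Bench-style single-answer grading: a judge LLM
# rates one two-turn exchange from 1 to 10. The prompt below is an
# approximation of the idea, not the official MT-Bench template.
JUDGE_PROMPT = """Please act as an impartial judge and evaluate the quality of
the responses provided by an AI assistant to the two user questions below.
Consider helpfulness, relevance, accuracy, depth, and level of detail.
End your reply with your verdict in exactly this format: Rating: [[N]]

[Question 1] {q1}
[Answer 1] {a1}
[Question 2] {q2}
[Answer 2] {a2}
"""


def call_judge(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a strong judge model (e.g. via
    your provider's chat API) and return the raw text of its reply."""
    raise NotImplementedError("wire this up to your LLM client")


def grade_exchange(q1: str, a1: str, q2: str, a2: str) -> float | None:
    """Grade one two-turn exchange; return the parsed 1-10 rating, if any."""
    reply = call_judge(JUDGE_PROMPT.format(q1=q1, a1=a1, q2=q2, a2=a2))
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else None
```

Having the judge end with a fixed `Rating: [[N]]` verdict makes the score machine-parseable, which is what makes this style of evaluation scalable across many models and questions.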

Progress Over Time

[Interactive timeline showing model performance evolution on MT-Bench, with open and proprietary models plotted against the state-of-the-art frontier.]

MT-Bench Leaderboard

12 models • 0 verified
| Rank | Model | Organization | Params | Context | Cost (input / output) | Score |
|------|-------|--------------|--------|---------|------------------------|-------|
| 1 | Hermes 3 70B | Nous Research | 70B | – | – | 8.990 |
| 2 | – | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | – |
| 3 | – | – | 50B | – | – | – |
| 4 | – | – | 236B | 8K | $0.14 / $0.28 | – |
| 5 | – | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 | – |
| 6 | – | Mistral AI | 123B | 128K | $2.00 / $6.00 | – |
| 7 | – | Alibaba Cloud / Qwen Team | 8B | – | – | – |
| 8 | – | – | 24B | 32K | $0.07 / $0.14 | – |
| 9 | – | – | 8B | 128K | $0.10 / $0.10 | – |
| 10 | – | – | 8B | – | – | – |
| 11 | – | Mistral AI | 12B | 128K | $0.15 / $0.15 | – |
| 12 | – | – | 70B | – | – | – |

FAQ

Common questions about MT-Bench

What is MT-Bench?
MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.

Where can I find the MT-Bench paper?
The MT-Bench paper is available at https://arxiv.org/abs/2306.05685. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the MT-Bench leaderboard?
The leaderboard ranks 12 AI models by their performance on this benchmark. Currently, Hermes 3 70B by Nous Research leads with a score of 8.990.

What is the highest MT-Bench score?
The highest MT-Bench score is 8.990, achieved by Hermes 3 70B from Nous Research.

How many models have been evaluated on MT-Bench?
12 models have been evaluated on the MT-Bench benchmark, with 0 verified results and 12 self-reported results.

What categories does MT-Bench cover?
MT-Bench is categorized under communication, creativity, general, reasoning, and roleplay, and it evaluates text models.