What is the MT-Bench leaderboard?

The MT-Bench leaderboard ranks 12 AI models by their scores on the benchmark. Hermes 3 70B by Nous Research currently leads with a score of 8.990. The average score across all 12 models is 1.471.

What is the highest MT-Bench score?

The highest MT-Bench score is 8.990, achieved by Hermes 3 70B from Nous Research.

How many models are evaluated on MT-Bench?

Twelve models have been evaluated on the MT-Bench benchmark. All 12 results are self-reported; none has been independently verified.

Where can I find the MT-Bench paper?

The MT-Bench paper is available at https://arxiv.org/abs/2306.05685. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MT-Bench cover?

MT-Bench is categorized under communication, creativity, general, reasoning, and roleplay. The benchmark evaluates text models.

MT-Bench

MT-Bench is a challenging multi-turn benchmark that measures how well large language models sustain coherent, informative, and engaging conversations. It uses strong LLMs as judges, which makes the evaluation of multi-turn dialogue scalable and explainable.
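
To make the scoring concrete, here is a minimal sketch of how judge ratings could be aggregated into a single score. It assumes the single-answer grading setup, in which a judge model rates each response on a 1 to 10 scale and the ratings are averaged. The question IDs, the ratings, and the function names are hypothetical and chosen only for illustration; they are not taken from the leaderboard.

```python
from statistics import mean

# Hypothetical judge ratings, keyed by (question_id, turn).
# Each value is a 1-10 rating assigned by the judge model (assumed scale).
ratings = {
    ("writing-01", 1): 9,
    ("writing-01", 2): 8,
    ("reasoning-01", 1): 7,
    ("reasoning-01", 2): 6,
}


def mt_bench_score(ratings):
    """Return the mean of all judge ratings across questions and turns."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    return mean(ratings.values())


def per_turn_scores(ratings):
    """Return the mean judge rating for each turn number."""
    by_turn = {}
    for (_question_id, turn), score in ratings.items():
        by_turn.setdefault(turn, []).append(score)
    return {turn: mean(scores) for turn, scores in by_turn.items()}


if __name__ == "__main__":
    print("overall:", mt_bench_score(ratings))
    print("per turn:", per_turn_scores(ratings))
```

Splitting the average by turn is useful because second-turn ratings show whether a model holds up when it has to build on its own earlier answer.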
