MT-Bench
MT-Bench is a challenging multi-turn benchmark that measures how well large language models sustain coherent, informative, and engaging conversations. It uses strong LLMs as judges, which makes the evaluation of multi-turn dialogue capabilities scalable and explainable.
Of the 12 AI models evaluated, Hermes 3 70B from Nous Research currently leads the MT-Bench leaderboard with a score of 8.990.
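As a rough illustration of how judge-based scoring can produce a single leaderboard number, the sketch below averages per-turn ratings into one score per model and ranks the results. It assumes the judge rates each turn on a 1 to 10 scale and that the final score is a plain mean of those ratings. The model names, ratings, and the helper `mt_bench_style_score` are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean

# Hypothetical judge ratings on an assumed 1-10 scale. Each tuple holds the
# (turn 1, turn 2) ratings for one question. This is NOT real MT-Bench data.
ratings = {
    "model_a": [(9, 8), (7, 9), (10, 8)],
    "model_b": [(6, 7), (8, 5), (7, 7)],
}


def mt_bench_style_score(pairs):
    """Average every per-turn rating into a single score (assumed aggregation)."""
    return mean(rating for pair in pairs for rating in pair)


# One score per model, then rank from highest to lowest.
scores = {name: mt_bench_style_score(pairs) for name, pairs in ratings.items()}
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {score:.3f}")
```

Averaging over all turns, rather than only the first, means a model that answers the opening question well but loses track of the conversation in the follow-up turn is penalized.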