What is the Arena-Hard v2 leaderboard?

The Arena-Hard v2 leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, MiMo-V2-Flash by Xiaomi leads with a score of 0.862. The average score across all models is 0.661.

What is the highest Arena-Hard v2 score?

The highest Arena-Hard v2 score is 0.862, achieved by MiMo-V2-Flash from Xiaomi.

How many models are evaluated on Arena-Hard v2?

Sixteen models have been evaluated on the Arena-Hard v2 benchmark. All 16 results are self-reported; none have been independently verified.

Where can I find the Arena-Hard v2 paper?

The Arena-Hard v2 paper is available at https://arxiv.org/abs/2406.11939. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Arena-Hard v2 cover?

Arena-Hard v2 is categorized under writing, creativity, general, and reasoning. The benchmark evaluates text models.

Arena-Hard v2

Arena-Hard-Auto v2 is a challenging benchmark of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains, including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while separating model performance 3x better than MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.
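As a rough illustration of how LLM-as-a-Judge verdicts can be reduced to a single score between 0 and 1, the sketch below computes a win rate against a fixed baseline, counting a tie as half a win. This is an assumption for illustration only: the text above does not state how the leaderboard scores are computed, and the verdict labels, the `win_rate` function, and the sample verdicts are all hypothetical.

```python
from collections import Counter

# Hypothetical verdict labels; the real judge output format is not given in the text.
WIN, TIE, LOSS = "win", "tie", "loss"


def win_rate(verdicts):
    """Return the share of prompts won against the baseline, counting a tie as half a win."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    unknown = set(counts) - {WIN, TIE, LOSS}
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    return (counts[WIN] + 0.5 * counts[TIE]) / len(verdicts)


# Illustrative verdicts only, not real benchmark data.
example = [WIN, WIN, TIE, LOSS]
print(win_rate(example))  # 0.625
```

Treating a tie as half a win is one common convention. A real pipeline would typically also aggregate over all 500 prompts and report confidence intervals, which this sketch omits.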

MiMo-V2-Flash from Xiaomi currently leads the Arena-Hard v2 leaderboard with a score of 0.862 across 16 evaluated AI models.
