
Arena-Hard v2

Arena-Hard-Auto v2 is a challenging benchmark of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark spans diverse domains, including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while offering roughly 3x higher separability of model performance than MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.
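
As a rough illustration of the LLM-as-a-Judge setup described above, the sketch below shows how a single pairwise comparison could be posed to a judge model. The prompt wording, verdict labels, and judge model name are illustrative assumptions, not Arena-Hard-Auto's actual templates or configuration.

```python
# Minimal sketch of an LLM-as-a-Judge pairwise comparison, in the spirit of
# Arena-Hard-Auto. The prompt text, verdict labels, and judge model are
# illustrative assumptions, not the benchmark's real templates.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant answers
to the user prompt below and decide which is better.

[User Prompt]
{prompt}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Reply with exactly one label: A>>B, A>B, A=B, B>A, or B>>A."""

def judge_pair(prompt: str, answer_a: str, answer_b: str,
               judge_model: str = "gpt-4.1") -> str:
    """Ask a judge model for a pairwise verdict between two candidate answers."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,  # keep verdicts as deterministic as possible
    )
    return response.choices[0].message.content.strip()
```

The benchmark's actual pipeline goes further: each candidate model is compared against a fixed baseline model, and each pair is typically judged in both orders to mitigate position bias; the sketch above shows only a single judgment.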

Paper: https://arxiv.org/abs/2406.11939

Progress Over Time

Interactive timeline showing model performance evolution on Arena-Hard v2, with the state-of-the-art frontier highlighted and open vs. proprietary models distinguished.

Arena-Hard v2 Leaderboard

16 models
Rank  Organization                 Params  Context  Cost (input / output)
1     Xiaomi                       309B    256K     $0.10 / $0.30
2     Alibaba Cloud / Qwen Team    80B     66K      $0.15 / $1.50
3     Alibaba Cloud / Qwen Team    235B    262K     $0.30 / $3.00
4     Alibaba Cloud / Qwen Team    235B    262K     $0.15 / $0.80
5     Alibaba Cloud / Qwen Team    236B    262K     $0.30 / $1.49
6     n/a                          120B    n/a      n/a
7     Sarvam AI                    105B    n/a      n/a
8     n/a                          32B     262K     $0.06 / $0.24
9     Alibaba Cloud / Qwen Team    33B     n/a      n/a
10    Alibaba Cloud / Qwen Team    80B     66K      $0.15 / $1.50
11    Alibaba Cloud / Qwen Team    33B     n/a      n/a
12    Alibaba Cloud / Qwen Team    31B     262K     $0.20 / $0.70
13    Alibaba Cloud / Qwen Team    31B     262K     $0.20 / $1.00
14    Alibaba Cloud / Qwen Team    9B      262K     $0.18 / $2.09
15    Sarvam AI                    30B     n/a      n/a
16    Alibaba Cloud / Qwen Team    4B      262K     $0.10 / $1.00
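The leaderboard scores quoted in the FAQ below (for example, 0.862 for the top-ranked model) sit on a 0-to-1 scale. As a simplified, hypothetical sketch of how pairwise judge verdicts like those above could be rolled up into such a score: the verdict weights below are assumptions for illustration, and Arena-Hard-Auto's actual scoring is more involved (the paper describes a Bradley-Terry-style aggregation against a baseline with bootstrapped confidence intervals).

```python
# Simplified illustration only: turn pairwise judge verdicts into a
# win-rate-style score against a baseline. The weights are assumptions;
# the benchmark's real scoring procedure is more sophisticated.
VERDICT_WEIGHTS = {
    "A>>B": 1.0,   # candidate clearly better than baseline
    "A>B": 0.75,
    "A=B": 0.5,
    "B>A": 0.25,
    "B>>A": 0.0,   # baseline clearly better
}

def score_model(verdicts: list[str]) -> float:
    """Average per-prompt verdict weights into a single 0-1 score."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)

# Example: mostly wins against the baseline yields a score near the top
# of the leaderboard's 0-1 range.
print(score_model(["A>>B", "A>B", "A=B", "A>>B"]))  # 0.8125
```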

FAQ

Common questions about Arena-Hard v2

What is Arena-Hard v2?
Arena-Hard-Auto v2 is a benchmark of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M that evaluates large language models on real-world user queries across software engineering, mathematics, creative writing, and technical problem-solving, using LLM-as-a-Judge for automatic evaluation (see the full description above).

Where can I find the Arena-Hard v2 paper?
The Arena-Hard v2 paper is available at https://arxiv.org/abs/2406.11939. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the Arena-Hard v2 leaderboard?
The leaderboard ranks 16 AI models by their performance on this benchmark. Currently, MiMo-V2-Flash by Xiaomi leads with a score of 0.862; the average score across all models is 0.661.

What is the highest Arena-Hard v2 score?
The highest Arena-Hard v2 score is 0.862, achieved by MiMo-V2-Flash from Xiaomi.

How many models have been evaluated?
16 models have been evaluated on the Arena-Hard v2 benchmark, with 0 verified results and 16 self-reported results.

What categories does Arena-Hard v2 fall under?
Arena-Hard v2 is categorized under creativity, general, reasoning, and writing. The benchmark evaluates text models.