Arena-Hard v2

Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while providing 3x higher separation of model performances compared to MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.
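To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of how per-prompt judge verdicts against a fixed baseline could be collapsed into a single score between 0 and 1. It is a simplification for illustration, not the benchmark's actual scoring procedure: the verdict labels ("win", "tie", "loss"), the function name win_rate, and the example data are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical verdict labels: one judge verdict per prompt, stating whether
# the candidate model's answer beat, tied with, or lost to a baseline answer.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the candidate's win rate, counting a tie as half a win."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Made-up verdicts for illustration only; not real benchmark data.
example_verdicts = ["win", "win", "tie", "loss", "win", "loss", "tie", "win"]
example_score = win_rate(example_verdicts)
print(f"win rate: {example_score:.3f}")
```

Counting a tie as half a win keeps the score symmetric: swapping the candidate and the baseline turns a score of s into 1 - s.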

Progress Over Time

[Interactive timeline showing model performance evolution on Arena-Hard v2. Legend entries: State-of-the-art frontier, Open, Proprietary.]

Arena-Hard v2 Leaderboard

16 models

Rank	Model (Organization)	Parameters	Context	Cost
1	MiMo-V2-Flash Xiaomi	309B	256K	$0.10 / $0.30
2	Qwen3-Next-80B-A3B-Instruct Alibaba Cloud / Qwen Team	80B	66K	$0.15 / $1.50
3	Qwen3-235B-A22B-Thinking-2507 Alibaba Cloud / Qwen Team	235B	262K	$0.30 / $3.00
4	Qwen3-235B-A22B-Instruct-2507 Alibaba Cloud / Qwen Team	235B	262K	$0.15 / $0.80
5	Qwen3 VL 235B A22B Instruct Alibaba Cloud / Qwen Team	236B	262K	$0.30 / $1.49
6	Nemotron 3 Super (120B A12B) NVIDIA	120B	—	—
7	Sarvam-105B Sarvam AI	105B	—	—
8	Nemotron 3 Nano (30B A3B) NVIDIA	32B	262K	$0.06 / $0.24
9	Qwen3 VL 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
10	Qwen3-Next-80B-A3B-Thinking Alibaba Cloud / Qwen Team	80B	66K	$0.15 / $1.50
11	Qwen3 VL 32B Thinking Alibaba Cloud / Qwen Team	33B	—	—
12	Qwen3 VL 30B A3B Instruct Alibaba Cloud / Qwen Team	31B	262K	$0.20 / $0.70
13	Qwen3 VL 30B A3B Thinking Alibaba Cloud / Qwen Team	31B	262K	$0.20 / $1.00
14	Qwen3 VL 8B Thinking Alibaba Cloud / Qwen Team	9B	262K	$0.18 / $2.09
15	Sarvam-30B Sarvam AI	30B	—	—
16	Qwen3 VL 4B Thinking Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
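For readers who want to work with the leaderboard programmatically, the rows above can be read as five tab-separated fields. The sketch below assumes that layout and treats a dash as a missing value. The names Row and parse_row are hypothetical, and because the page does not say what the two cost figures measure, the code only keeps them as an ordered pair.

```python
from typing import NamedTuple, Optional, Tuple

MISSING = "\u2014"  # the dash the table uses for unavailable values


class Row(NamedTuple):
    rank: int
    model: str  # model name and organization, as printed in one field
    params: str
    context: Optional[str]
    cost: Optional[Tuple[float, float]]  # the two listed prices, in order


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row into a Row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    rank, model, params, context, cost = fields
    prices = None
    if cost != MISSING:
        first, second = (float(p.strip().lstrip("$")) for p in cost.split("/"))
        prices = (first, second)
    return Row(
        rank=int(rank),
        model=model,
        params=params,
        context=None if context == MISSING else context,
        cost=prices,
    )


row = parse_row("1\tMiMo-V2-Flash Xiaomi\t309B\t256K\t$0.10 / $0.30")
print(row.rank, row.model, row.cost)
```

The model name and organization are left in one string because the table prints them in a single field with no delimiter between them.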

FAQ

Common questions about Arena-Hard v2

What is the Arena-Hard v2 benchmark?

Arena-Hard-Auto v2 is a benchmark of 500 curated prompts sourced from Chatbot Arena and WildChat-1M. It uses LLM-as-a-Judge to evaluate large language models on real-world user queries.

Where can I find the Arena-Hard v2 paper?

The Arena-Hard v2 paper is available at https://arxiv.org/abs/2406.11939. It describes the benchmark methodology, dataset creation, and evaluation criteria.

What is the Arena-Hard v2 leaderboard?

The Arena-Hard v2 leaderboard ranks 16 AI models by their performance on this benchmark. MiMo-V2-Flash by Xiaomi currently leads with a score of 0.862. The average score across all models is 0.661.

What is the highest Arena-Hard v2 score?

The highest Arena-Hard v2 score is 0.862, achieved by MiMo-V2-Flash from Xiaomi.

How many models are evaluated on Arena-Hard v2?

16 models have been evaluated on the Arena-Hard v2 benchmark. All 16 results are self-reported; none are verified.

What categories does Arena-Hard v2 cover?

Arena-Hard v2 is categorized under creativity, general, reasoning, and writing. The benchmark evaluates text models.