Arena Hard

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses an LLM-as-a-Judge methodology, with GPT-4.1 and Gemini-2.5 as automatic judges, to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x better separation of model performance than MT-Bench, making it highly effective at distinguishing between models of similar quality.
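
To make the judging setup concrete, the sketch below shows one way a candidate model's answers could be scored against a fixed baseline with an LLM judge, which is the general idea behind Arena-Hard-Auto's pairwise scoring. It assumes an OpenAI-compatible API; the model names, judge prompt, and helper functions (generate, judge_pair, win_rate) are illustrative only, not the official arena-hard-auto pipeline, which also swaps answer positions, uses a graded verdict scale, and aggregates judgments with bootstrapped confidence intervals.

```python
# Minimal, illustrative sketch of LLM-as-a-Judge pairwise scoring.
# Assumptions: an OpenAI-compatible endpoint and placeholder model names;
# this is NOT the official arena-hard-auto implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-4.1"        # judge model (assumed API name)
BASELINE_MODEL = "gpt-4o"      # source of fixed baseline answers (assumed)
CANDIDATE_MODEL = "my-model"   # model under evaluation (placeholder)


def generate(model: str, prompt: str) -> str:
    """Get one answer from a model for a single benchmark prompt."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def judge_pair(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer is better; expects a reply of 'A' or 'B'."""
    judge_prompt = (
        "You are an impartial judge. Given the user prompt and two answers, "
        "reply with exactly 'A' if answer A is better, or 'B' otherwise.\n\n"
        f"Prompt:\n{prompt}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return resp.choices[0].message.content.strip()[:1].upper()


def win_rate(prompts: list[str]) -> float:
    """Fraction of prompts on which the candidate beats the baseline."""
    wins = 0
    for prompt in prompts:
        baseline = generate(BASELINE_MODEL, prompt)
        candidate = generate(CANDIDATE_MODEL, prompt)
        # The candidate is always answer A here; the real pipeline also swaps
        # positions to control for the judge's position bias.
        if judge_pair(prompt, candidate, baseline) == "A":
            wins += 1
    return wins / len(prompts)
```

Running win_rate over the 500 benchmark prompts would produce the kind of 0-to-1 score reported on the leaderboard below.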

Paper

https://arxiv.org/abs/2406.11939

Progress Over Time

Interactive timeline showing model performance evolution on Arena Hard, tracing the state-of-the-art frontier and distinguishing open from proprietary models.

Arena Hard Leaderboard

26 models • 0 verified
Rank  Organization               Parameters
1     Alibaba Cloud / Qwen Team  235B
2     Alibaba Cloud / Qwen Team  33B
3     Alibaba Cloud / Qwen Team  31B
4     -                          50B
5     -                          24B
6     Alibaba Cloud / Qwen Team  73B
7     -                          14B
8     -                          236B
9     Microsoft                  15B
10    -                          14B
11    -                          8B
12    -                          398B
13    Mistral AI                 119B
14    -                          8B
14    -                          8B
16    -                          14B
16    Mistral AI                 675B
18    Alibaba Cloud / Qwen Team  8B
19    -                          8B
20    -                          52B
21    -                          24B
22    -                          60B
23    -                          4B
24    Microsoft                  4B
25    -                          3B
26    -                          7B

FAQ

Common questions about Arena Hard

What is Arena Hard?
Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs built from 500 challenging real-world prompts curated by BenchBuilder, spanning open-ended software engineering problems, mathematical questions, and creative writing tasks. It uses GPT-4.1 and Gemini-2.5 as LLM judges to approximate human preference, achieving 98.6% correlation with human preference rankings and 3x better separation of model performance than MT-Bench.

Where can I find the Arena Hard paper?
The Arena Hard paper is available at https://arxiv.org/abs/2406.11939. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the Arena Hard leaderboard?
The Arena Hard leaderboard ranks 26 AI models by their performance on this benchmark. Qwen3 235B A22B by Alibaba Cloud / Qwen Team currently leads with a score of 0.956. The average score across all models is 0.622.

What is the highest Arena Hard score?
The highest Arena Hard score is 0.956, achieved by Qwen3 235B A22B from Alibaba Cloud / Qwen Team.

How many models have been evaluated on Arena Hard?
26 models have been evaluated on the Arena Hard benchmark, with 0 verified results and 26 self-reported results.

What does Arena Hard evaluate?
Arena Hard is categorized under creativity, general, and reasoning, and it evaluates text models.