Arena Hard
Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by the BenchBuilder pipeline. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses an LLM-as-a-Judge methodology, with GPT-4.1 and Gemini-2.5 acting as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and 3x higher separation of model performance than MT-Bench, making it highly effective at distinguishing between models of similar quality.
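For intuition, the sketch below shows what a single pairwise LLM-as-a-Judge comparison can look like using the OpenAI Python SDK. The prompt wording, verdict labels, and helper name are assumptions made for this illustration; they are not the official Arena-Hard-Auto harness.

```python
# Minimal pairwise LLM-as-a-Judge sketch (illustrative only; prompt wording,
# verdict labels, and function name are assumptions, not the Arena-Hard-Auto harness).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Given a user prompt and two \
assistant answers, decide which answer is better overall.
Reply with exactly one label: A, B, or TIE.

[User prompt]
{prompt}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""


def judge_pair(prompt: str, answer_a: str, answer_b: str,
               judge_model: str = "gpt-4.1") -> str:
    """Ask the judge model to compare two answers and return 'A', 'B', or 'TIE'."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(prompt=prompt,
                                           answer_a=answer_a,
                                           answer_b=answer_b),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

In practice, pairwise judging is typically run twice per prompt with the answer positions swapped to reduce position bias, and the verdicts are aggregated across all 500 prompts before a model receives a score.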
Progress Over Time
Interactive timeline showing model performance evolution on Arena Hard. The chart traces the state-of-the-art frontier and distinguishes open from proprietary models.
Arena Hard Leaderboard
26 models • 0 verified
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 235B | — | — | — |
| 2 | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 3 | Alibaba Cloud / Qwen Team | 31B | — | — | — |
| 4 | — | 50B | — | — | — |
| 5 | Mistral AI | 24B | — | — | — |
| 6 | Alibaba Cloud / Qwen Team | 73B | — | — | — |
| 7 | Microsoft | 14B | — | — | — |
| 8 | DeepSeek | 236B | — | — | — |
| 9 | Microsoft | 15B | — | — | — |
| 10 | Microsoft | 14B | — | — | — |
| 11 | Mistral AI | 8B | — | — | — |
| 12 | AI21 Labs | 398B | — | — | — |
| 13 | Mistral AI | 119B | — | — | — |
| 14 | — | 8B | — | — | — |
| 14 | — | 8B | — | — | — |
| 16 | Mistral AI | 14B | — | — | — |
| 16 | Mistral AI | 675B | — | — | — |
| 18 | Alibaba Cloud / Qwen Team | 8B | — | — | — |
| 19 | Mistral AI | 8B | — | — | — |
| 20 | AI21 Labs | 52B | — | — | — |
| 21 | Mistral AI | 24B | — | — | — |
| 22 | Microsoft | 60B | — | — | — |
| 23 | Microsoft | 4B | — | — | — |
| 24 | Microsoft | 4B | — | — | — |
| 25 | Mistral AI | 3B | — | — | — |
| 26 | — | 7B | — | — | — |
FAQ
Common questions about Arena Hard
The Arena Hard paper is available at https://arxiv.org/abs/2406.11939. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The Arena Hard leaderboard ranks 26 AI models based on their performance on this benchmark. Currently, Qwen3 235B A22B by Alibaba Cloud / Qwen Team leads with a score of 0.956. The average score across all models is 0.622.
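For intuition, here is a toy sketch of how per-prompt judge verdicts against a fixed baseline could be aggregated into leaderboard scores in the 0–1 range shown above. The model names and verdicts are hypothetical, and the official Arena-Hard scoring pipeline is more involved (it reports win rates with confidence intervals), so treat this only as an illustration.

```python
# Toy leaderboard aggregation: turn hypothetical per-prompt verdicts vs. a
# fixed baseline into a win-rate score per model, then rank the models.
def win_rate(verdicts: list[str]) -> float:
    """Score a model from its verdicts against the baseline: win=1, tie=0.5, loss=0."""
    points = {"WIN": 1.0, "TIE": 0.5, "LOSS": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)


# Hypothetical per-prompt verdicts for three candidate models.
verdicts_by_model = {
    "model-x": ["WIN", "WIN", "TIE", "LOSS"],
    "model-y": ["WIN", "TIE", "TIE", "TIE"],
    "model-z": ["LOSS", "LOSS", "WIN", "TIE"],
}

leaderboard = sorted(
    ((name, win_rate(v)) for name, v in verdicts_by_model.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {score:.3f}")
```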
The highest Arena Hard score is 0.956, achieved by Qwen3 235B A22B from Alibaba Cloud / Qwen Team.
26 models have been evaluated on the Arena Hard benchmark, with 0 verified results and 26 self-reported results.
Arena Hard is categorized under creativity, general, and reasoning. The benchmark evaluates text models.