Arena Hard

Progress Over Time

[Figure: timeline of model performance on Arena Hard. Legend: state-of-the-art frontier, open models, proprietary models.]

Arena Hard Leaderboard

26 models

Rank  Organization               Parameters  Context  Cost           License
1     Alibaba Cloud / Qwen Team  235B        -        -              -
2     Alibaba Cloud / Qwen Team  33B         128K     $0.10 / $0.30  -
3     Alibaba Cloud / Qwen Team  31B         128K     $0.10 / $0.30  -
4     -                          50B         -        -              -
5     -                          24B         -        -              -
6     Alibaba Cloud / Qwen Team  73B         -        -              -
7     -                          14B         -        -              -
8     -                          236B        -        -              -
9     Microsoft                  15B         -        -              -
10    -                          14B         -        -              -
11    -                          8B          -        -              -
12    -                          398B        -        -              -
13    Mistral AI                 119B        256K     $0.15 / $0.60  -
14    -                          8B          -        -              -
14    -                          8B          -        -              -
16    -                          14B         -        -              -
16    Mistral AI                 675B        -        -              -
18    Alibaba Cloud / Qwen Team  8B          -        -              -
19    -                          8B          -        -              -
20    -                          52B         -        -              -
21    -                          24B         -        -              -
22    -                          60B         -        -              -
23    -                          4B          -        -              -
24    Microsoft                  4B          -        -              -
25    -                          3B          -        -              -
26    -                          7B          -        -              -
About this benchmark

What is Arena Hard?

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.
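As an illustration of what "correlation with human preference rankings" can mean, the sketch below computes a Spearman rank correlation between a benchmark's scores and a human-preference rating for the same models. It assumes rank correlation is the measure in question, which this page does not state; the function names and all numbers are hypothetical and are not results from the benchmark.

```python
def ranks(values):
    """Return 1-based ranks, highest value first. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical data: five models scored by a benchmark and rated by humans.
benchmark_scores = [0.90, 0.80, 0.70, 0.60, 0.50]
human_ratings = [1300, 1250, 1260, 1100, 1050]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

In this made-up example the two rankings disagree only on the order of the second and third models, so the correlation is high but not perfect.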

Arena Hard is a text benchmark that evaluates models on reasoning, general, creativity, and writing tasks. LLM Stats tracks 26 models on this benchmark, scored on a 0–1 scale. The current average is 0.622, and the leader scores 0.956.

Current leaders

Qwen3 235B A22B from Alibaba Cloud / Qwen Team currently leads the Arena Hard leaderboard with a score of 0.956 across 26 evaluated AI models.

1. Qwen3 235B A22B, Alibaba Cloud / Qwen Team, 95.6%
2. Qwen3 32B, Alibaba Cloud / Qwen Team, 93.8%
3. Qwen3 30B A3B, Alibaba Cloud / Qwen Team, 91.0%

Source paper

Title: From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Authors: Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, and 4 others
Abstract

The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
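One plausible reading of a benchmark's "ability to separate models" is the share of model pairs whose score intervals do not overlap. The sketch below works under that assumption; the abstract does not spell out the metric's definition, and the function name, model names, and intervals are all hypothetical.

```python
from itertools import combinations


def separability(intervals):
    """Fraction of model pairs whose (low, high) score intervals do not overlap."""
    pairs = list(combinations(intervals.values(), 2))
    separated = sum(
        1
        for (low_a, high_a), (low_b, high_b) in pairs
        if high_a < low_b or high_b < low_a
    )
    return separated / len(pairs)


# Hypothetical confidence intervals on a 0-1 score scale.
intervals = {
    "model_a": (0.80, 0.86),
    "model_b": (0.84, 0.90),
    "model_c": (0.60, 0.66),
    "model_d": (0.40, 0.48),
}

score = separability(intervals)
print(f"Separated pairs: {score:.1%}")
```

Here only model_a and model_b overlap, so five of the six pairs count as separated. A benchmark with tighter intervals or more widely spread scores would separate more pairs.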

FAQ

Common questions about the Arena Hard benchmark and leaderboard.

What is the Arena Hard benchmark?

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.

What is the Arena Hard leaderboard?

The Arena Hard leaderboard ranks 26 AI models based on their performance on this benchmark. Currently, Qwen3 235B A22B by Alibaba Cloud / Qwen Team leads with a score of 0.956. The average score across all models is 0.622.

What is the highest Arena Hard score?

The highest Arena Hard score is 0.956, achieved by Qwen3 235B A22B from Alibaba Cloud / Qwen Team.

How many models are evaluated on Arena Hard?

26 models have been evaluated on the Arena Hard benchmark. All 26 results are self-reported; none are verified.

Where can I find the Arena Hard paper?

The Arena Hard paper is available at https://arxiv.org/abs/2406.11939. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Arena Hard cover?

Arena Hard is categorized under reasoning, general, creativity, and writing. The benchmark evaluates text models.

What is the best open-source model on Arena Hard?

Qwen3 235B A22B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on Arena Hard, with a score of 0.956 (rank #1).

Which model offers the best value on Arena Hard?

Among models scoring within 10% of the leader, Qwen3 32B from Alibaba Cloud / Qwen Team is the cheapest, at $0.10 per million input tokens with a score of 0.938.
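The selection rule in this answer can be sketched as a short filter-and-sort. The sketch assumes "within 10% of the leader" means a score of at least 90% of the leader's score, and that ties on price go to the higher score; neither detail is stated on this page. The function name is hypothetical. Scores come from the top-three list, the Qwen3 30B A3B price is read from the rank 3 leaderboard row, and the leader has no listed price, so it is excluded from the price comparison.

```python
def best_value(models, tolerance=0.10):
    """Return the cheapest priced model scoring within `tolerance` of the leader."""
    leader_score = max(m["score"] for m in models)
    threshold = leader_score * (1 - tolerance)  # assumed reading of "within 10%"
    eligible = [
        m
        for m in models
        if m["score"] >= threshold and m["input_price"] is not None
    ]
    # Cheapest input price first; a higher score breaks price ties (assumption).
    return min(eligible, key=lambda m: (m["input_price"], -m["score"]))


# input_price is in dollars per million input tokens; None means not listed.
models = [
    {"name": "Qwen3 235B A22B", "score": 0.956, "input_price": None},
    {"name": "Qwen3 32B", "score": 0.938, "input_price": 0.10},
    {"name": "Qwen3 30B A3B", "score": 0.910, "input_price": 0.10},
]

winner = best_value(models)
print(winner["name"], winner["score"], winner["input_price"])
```

With only these three entries, both priced models clear the threshold at the same price, and the tie-break picks Qwen3 32B.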

How recent are the Arena Hard leaderboard results?

The Arena Hard leaderboard was last updated in July 2026 and currently includes 26 evaluated models.