AlignBench Leaderboard

Progress Over Time

[Interactive timeline: model performance on AlignBench over time, with the state-of-the-art frontier marked for open and proprietary models]

AlignBench Leaderboard

4 models

Rank	Model	Organization	Parameters	Context	Cost
1	Qwen2.5 72B Instruct	Alibaba Cloud / Qwen Team	73B	131K	$0.35 / $0.40
2	DeepSeek-V2.5	DeepSeek	236B	8K	$0.14 / $0.28
3	Qwen2.5 7B Instruct	Alibaba Cloud / Qwen Team	8B	131K	$0.30 / $0.30
4	Qwen2 7B Instruct	Alibaba Cloud / Qwen Team	8B	—	—

FAQ

Common questions about AlignBench

What is the AlignBench benchmark?

AlignBench is a comprehensive, multi-dimensional benchmark for evaluating how well large language models are aligned in Chinese. It covers 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The benchmark includes 683 queries rooted in real scenarios, each with a human-verified reference, and evaluates responses with a rule-calibrated, multi-dimensional LLM-as-Judge approach that uses Chain-of-Thought.
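As a rough illustration of the LLM-as-Judge idea described above, the sketch below averages a judge's per-dimension ratings into one score per category. The judge is a stub, and the dimension names, the 1-10 scale, and the helper names (`stub_judge`, `score_by_category`) are assumptions made for illustration; none of them are taken from the AlignBench code or paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical dimension names; the actual rubric is defined in the AlignBench paper.
DIMENSIONS = ("correctness", "helpfulness", "clarity")


def stub_judge(query, reference, answer):
    """Stand-in for an LLM judge: one rating per dimension (assumed 1-10 scale).

    A real judge would prompt a model to reason step by step against the
    reference before rating; this stub only checks for an exact match.
    """
    rating = 10 if answer.strip() == reference.strip() else 5
    return {dim: rating for dim in DIMENSIONS}


def score_by_category(samples, judge=stub_judge):
    """Average the judge's per-dimension ratings within each category."""
    per_category = defaultdict(list)
    for s in samples:
        ratings = judge(s["query"], s["reference"], s["answer"])
        # Collapse the dimensions of one sample into a single number.
        per_category[s["category"]].append(mean(list(ratings.values())))
    return {cat: mean(vals) for cat, vals in per_category.items()}


samples = [
    {"category": "Mathematics", "query": "1 + 1 = ?", "reference": "2", "answer": "2"},
    {"category": "Mathematics", "query": "2 + 3 = ?", "reference": "5", "answer": "6"},
    {"category": "Writing Ability", "query": "Write a greeting.", "reference": "Hello!", "answer": "Hello!"},
]
scores = score_by_category(samples)
```

Passing the judge as a parameter keeps the aggregation logic testable without calling a model; a real judge function with the same signature could be swapped in.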

Where can I find the AlignBench paper?

The AlignBench paper is available at https://arxiv.org/abs/2311.18743. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

What is the AlignBench leaderboard?

The AlignBench leaderboard ranks 4 AI models by their performance on this benchmark. Qwen2.5 72B Instruct by Alibaba Cloud / Qwen Team currently leads with a score of 0.816. The average score across all models is 0.769.
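Because the page lists only the top score, the two figures above also pin down the average of the three lower-ranked models. A small sketch of that arithmetic, assuming the reported average is a plain unweighted mean of the four scores and ignoring rounding in the published values:

```python
n_models = 4
top_score = 0.816   # Qwen2.5 72B Instruct
mean_score = 0.769  # reported average across all models

# Total of all scores, minus the leader, spread over the remaining models.
rest_mean = (mean_score * n_models - top_score) / (n_models - 1)
print(round(rest_mean, 3))
```

Under those assumptions the remaining three models average about 0.753, roughly 0.06 below the leader.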

What is the highest AlignBench score?

The highest AlignBench score is 0.816, achieved by Qwen2.5 72B Instruct from Alibaba Cloud / Qwen Team.

How many models are evaluated on AlignBench?

4 models have been evaluated on the AlignBench benchmark. All 4 results are self-reported; none have been verified.

What categories does AlignBench cover?

AlignBench is categorized under creativity, general, language, math, reasoning, roleplay, and writing. The benchmark evaluates text models with multilingual support.
