AlignBench

AlignBench is a comprehensive multi-dimensional benchmark for evaluating the Chinese alignment of Large Language Models. It contains 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The benchmark includes 683 real-scenario-rooted queries with human-verified references and uses a rule-calibrated, multi-dimensional LLM-as-Judge approach with Chain-of-Thought for evaluation.
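The multi-dimensional LLM-as-Judge setup can be sketched roughly as follows. This is a hypothetical illustration, not the official AlignBench implementation: the dimension names, prompt wording, and score format below are assumptions, and the judge reply is canned rather than fetched from a real model.

```python
import re

# Example dimensions a judge might rate; the official rubric varies by task
# category, so treat these names as placeholders.
DIMENSIONS = ["factual correctness", "user satisfaction", "clarity", "completeness"]

def build_judge_prompt(question, reference, answer, dimensions=DIMENSIONS):
    """Assemble a Chain-of-Thought judging prompt: the judge model is asked to
    reason about each dimension before committing to a final score."""
    dims = "\n".join(f"- {d}" for d in dimensions)
    return (
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Model answer:\n{answer}\n\n"
        "Evaluate the model answer against the reference. First reason step by "
        "step about each of the following dimensions, rating each on a 1-10 "
        f"scale:\n{dims}\n"
        "Finish with a line of the form: Final score: <1-10>"
    )

def parse_final_score(judge_output):
    """Extract the final score from the judge model's free-text response."""
    m = re.search(r"Final score:\s*([0-9]+(?:\.[0-9]+)?)", judge_output)
    return float(m.group(1)) if m else None

# Canned judge response in place of a real API call:
reply = "The answer matches the reference on all key facts...\nFinal score: 8"
print(parse_final_score(reply))  # → 8.0
```

In the real benchmark, `build_judge_prompt`'s output would be sent to a strong judge model (e.g. GPT-4 in the paper), and the parsed scores aggregated per category.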

Paper: https://arxiv.org/abs/2311.18743

Progress Over Time

[Interactive timeline showing model performance evolution on AlignBench, with a state-of-the-art frontier line and filters for open vs. proprietary models.]
AlignBench Leaderboard

4 models

| Rank | Model | Organization | Parameters | Context | Cost (input / output) |
|------|-------|--------------|------------|---------|------------------------|
| 1 | Qwen2.5 72B Instruct | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 |
| 2 | — | — | 236B | 8K | $0.14 / $0.28 |
| 3 | — | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 |
| 4 | — | Alibaba Cloud / Qwen Team | 8B | — | — |

FAQ

Common questions about AlignBench

What is AlignBench?
AlignBench is a multi-dimensional benchmark for evaluating the Chinese alignment of Large Language Models across 8 main categories (see the overview above). It includes 683 real-scenario-rooted queries with human-verified references, scored by a rule-calibrated, multi-dimensional LLM-as-Judge with Chain-of-Thought.

Where can I find the AlignBench paper?
The AlignBench paper is available at https://arxiv.org/abs/2311.18743. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the AlignBench leaderboard?
The leaderboard ranks 4 AI models based on their performance on this benchmark. Qwen2.5 72B Instruct by Alibaba Cloud / Qwen Team currently leads with the highest score of 0.816; the average score across all models is 0.769.

How many models have been evaluated?
4 models have been evaluated on the AlignBench benchmark, with 0 verified results and 4 self-reported results.

What categories does AlignBench cover?
AlignBench is categorized under creativity, general, language, math, reasoning, roleplay, and writing. It evaluates text models and includes multilingual support.