AlignBench
AlignBench is a comprehensive multi-dimensional benchmark for evaluating the Chinese alignment of Large Language Models. It covers 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The benchmark comprises 683 real-scenario-rooted queries with human-verified references and is scored with a rule-calibrated, multi-dimensional LLM-as-Judge approach that uses Chain-of-Thought reasoning.
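To make the evaluation recipe concrete, here is a minimal sketch of how a rule-calibrated, multi-dimensional LLM-as-Judge pipeline can be wired up. The dimension names, prompt wording, and helper functions below are illustrative assumptions, not AlignBench's actual implementation: the judge is asked to reason step by step (Chain-of-Thought) and then emit one `dimension: score` line per dimension, which we parse and average into a final rating.

```python
import re
from statistics import mean

# Hypothetical dimension set for illustration; AlignBench defines its own
# per-category dimensions in the paper.
DIMENSIONS = ["factual correctness", "user satisfaction", "clarity"]

def build_judge_prompt(question, reference, answer, dimensions=DIMENSIONS):
    """Assemble a multi-dimensional judge prompt that requests CoT reasoning
    followed by machine-parseable per-dimension scores."""
    rubric = "\n".join(f"- {d}: rate 1-10" for d in dimensions)
    return (
        "You are an impartial judge. Compare the answer against the "
        "human-verified reference.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
        "First reason step by step, then output one line per dimension "
        "in the form 'dimension: score'.\n" + rubric
    )

def parse_scores(judge_output, dimensions=DIMENSIONS):
    """Extract per-dimension integer scores from the judge's text and
    aggregate them into a final rating (simple mean here)."""
    scores = {}
    for d in dimensions:
        m = re.search(rf"{re.escape(d)}\s*:\s*(\d+)", judge_output, re.I)
        if m:
            scores[d] = int(m.group(1))
    final = mean(scores.values()) if scores else None
    return scores, final
```

A real harness would send `build_judge_prompt(...)` to a judge model and feed the reply to `parse_scores`; for example, a reply ending in `factual correctness: 8`, `user satisfaction: 7`, `clarity: 9` aggregates to a final rating of 8.0.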
Progress Over Time
Interactive timeline showing model performance evolution on AlignBench
AlignBench Leaderboard
4 models
| # | Model | Organization | Params | Context | Cost (input / output) | Score |
|---|---|---|---|---|---|---|
| 1 | Qwen2.5 72B Instruct | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | 0.816 |
| 2 | — | DeepSeek | 236B | 8K | $0.14 / $0.28 | — |
| 3 | — | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 | — |
| 4 | — | Alibaba Cloud / Qwen Team | 8B | — | — | — |
FAQ
Common questions about AlignBench
The AlignBench paper is available at https://arxiv.org/abs/2311.18743. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The AlignBench leaderboard ranks 4 AI models by their performance on this benchmark. Qwen2.5 72B Instruct by Alibaba Cloud / Qwen Team currently leads with a score of 0.816, and the average score across all models is 0.769.
The highest AlignBench score is 0.816, achieved by Qwen2.5 72B Instruct from Alibaba Cloud / Qwen Team.
4 models have been evaluated on the AlignBench benchmark, with 0 verified results and 4 self-reported results.
AlignBench is categorized under creativity, general, language, math, reasoning, roleplay, and writing. The benchmark evaluates text models and offers multilingual support.