WritingBench
WritingBench is a comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries and uses a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task; a fine-tuned critic model scores each response against these criteria along style, format, and length dimensions.
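To make the query-dependent scoring concrete, here is a minimal Python sketch of how such an evaluation loop can be wired together. It is a sketch under assumptions, not WritingBench's actual implementation: it assumes each of the five criteria is rated on a 1–10 scale by the critic model and that the per-query score is the normalized mean, and `generate_criteria` and `critic_score` are hypothetical placeholders for the criteria generator and the fine-tuned critic.

```python
# Minimal sketch of query-dependent scoring (assumptions noted above).
from statistics import mean
from typing import Callable, Iterable, List, Tuple

def score_response(
    query: str,
    response: str,
    generate_criteria: Callable[[str], List[str]],    # hypothetical: returns 5 instance-specific criteria
    critic_score: Callable[[str, str, str], float],   # hypothetical: returns a 1-10 rating for one criterion
) -> float:
    """Score one response: rate it against each generated criterion, then average."""
    criteria = generate_criteria(query)               # e.g. style, format, and length requirements
    ratings = [critic_score(query, response, c) for c in criteria]
    return mean(ratings) / 10.0                       # normalize to the 0-1 range used on the leaderboard

def benchmark_score(
    records: Iterable[Tuple[str, str]],
    generate_criteria: Callable[[str], List[str]],
    critic_score: Callable[[str, str, str], float],
) -> float:
    """Average the per-query scores over all (query, response) pairs."""
    return mean(
        score_response(q, r, generate_criteria, critic_score) for q, r in records
    )
```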
Progress Over Time
Interactive timeline showing model performance evolution on WritingBench
WritingBench Leaderboard
15 models
| Rank | Organization | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 | — |
| 2 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | — |
| 3 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | — |
| 4 | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 5 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | — |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | — |
| 7 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | — |
| 7 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 | — |
| 9 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | — |
| 10 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | — |
| 11 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | — |
| 12 | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 13 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | — |
| 14 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | — |
| 15 | Moonshot AI | 1.0T | — | — | — |
FAQ
Common questions about WritingBench
WritingBench is a comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries and uses a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task; a fine-tuned critic model scores each response against these criteria along style, format, and length dimensions.
The WritingBench paper is available at https://arxiv.org/abs/2503.05244. It describes the benchmark methodology, dataset construction, and evaluation criteria in detail.
The WritingBench leaderboard ranks 15 AI models based on their performance on this benchmark. Currently, Qwen3-235B-A22B-Thinking-2507 by Alibaba Cloud / Qwen Team leads with a score of 0.883. The average score across all models is 0.842.
The highest WritingBench score is 0.883, achieved by Qwen3-235B-A22B-Thinking-2507 from Alibaba Cloud / Qwen Team.
15 models have been evaluated on the WritingBench benchmark, with 0 verified results and 15 self-reported results.
WritingBench is categorized under communication, creativity, finance, legal, and writing. The benchmark evaluates text models with multilingual support.