WritingBench

A comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries and uses a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task; a fine-tuned critic model then scores responses along style, format, and length dimensions.
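The query-dependent evaluation loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the function names, the placeholder criterion generator, and the stand-in critic scorer are all hypothetical, and in practice both steps would be backed by model calls.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Criterion:
    name: str
    description: str

def generate_criteria(query: str) -> list[Criterion]:
    """Stand-in for the step that derives 5 instance-specific
    assessment criteria (covering style, format, length, etc.)
    from the writing query itself."""
    return [
        Criterion(name=f"criterion_{i}", description=f"derived from: {query[:40]}")
        for i in range(5)
    ]

def critic_score(response: str, criterion: Criterion) -> float:
    """Stand-in for the fine-tuned critic model.
    Returns a normalized score in [0, 1]; fixed here for illustration."""
    return 0.8

def evaluate(query: str, response: str) -> float:
    """Score one response: generate the task's 5 criteria,
    score against each, and average."""
    criteria = generate_criteria(query)
    scores = [critic_score(response, c) for c in criteria]
    return mean(scores)
```

A model's benchmark score would then be the average of `evaluate` over all 1,239 queries.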

Paper: https://arxiv.org/abs/2503.05244

Progress Over Time

Interactive timeline showing model performance evolution on WritingBench

WritingBench Leaderboard

15 models
Rank  Organization               Params  Context  Cost
1     Alibaba Cloud / Qwen Team  235B    262K     $0.30 / $3.00
2     Alibaba Cloud / Qwen Team  80B     66K      $0.15 / $1.50
3     Alibaba Cloud / Qwen Team  236B    262K     $0.45 / $3.49
4     Alibaba Cloud / Qwen Team  33B     n/a      n/a
5     Alibaba Cloud / Qwen Team  9B      262K     $0.18 / $2.09
5     Alibaba Cloud / Qwen Team  236B    262K     $0.30 / $1.49
7     Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $1.00
7     Alibaba Cloud / Qwen Team  235B    262K     $0.15 / $0.80
9     Alibaba Cloud / Qwen Team  80B     66K      $0.15 / $1.50
10    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $1.00
11    Alibaba Cloud / Qwen Team  9B      262K     $0.08 / $0.50
12    Alibaba Cloud / Qwen Team  33B     n/a      n/a
13    Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $0.70
14    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $0.60
15    n/a                        1.0T    n/a      n/a

FAQ

Common questions about WritingBench

What is WritingBench?
WritingBench is a comprehensive benchmark for evaluating large language models' generative writing across 6 core writing domains and 100 subdomains. Its 1,239 queries are scored by a query-dependent evaluation framework that generates 5 instance-specific assessment criteria per task, with a fine-tuned critic model rating responses on style, format, and length.
Where can I find the WritingBench paper?
The WritingBench paper is available at https://arxiv.org/abs/2503.05244. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on the WritingBench leaderboard?
The WritingBench leaderboard ranks 15 AI models based on their performance on this benchmark. Currently, Qwen3-235B-A22B-Thinking-2507 by Alibaba Cloud / Qwen Team leads with a score of 0.883. The average score across all models is 0.842.
What is the highest WritingBench score?
The highest WritingBench score is 0.883, achieved by Qwen3-235B-A22B-Thinking-2507 from Alibaba Cloud / Qwen Team.
How many models have been evaluated on WritingBench?
15 models have been evaluated on the WritingBench benchmark. All 15 results are self-reported; none have been independently verified.
What categories does WritingBench cover?
WritingBench is categorized under communication, creativity, finance, legal, and writing. The benchmark evaluates text models and includes multilingual support.