WritingBench
A comprehensive benchmark for evaluating the generative writing capabilities of large language models across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries and uses a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task. A fine-tuned critic model then scores responses on style, format, and length.
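As a rough illustration of how per-criterion scores could roll up into a single leaderboard number, here is a minimal sketch. The 10-point scale, the plain averaging, and the function names `response_score` and `benchmark_score` are assumptions made for this example; the description above does not specify them, and the actual WritingBench pipeline may aggregate differently.

```python
from statistics import mean

CRITERIA_PER_QUERY = 5  # from the description: 5 instance-specific criteria per task
MAX_SCORE = 10          # assumption: the critic rates each criterion on a 10-point scale


def response_score(criterion_scores):
    """Average the critic's scores for one response (hypothetical helper)."""
    if len(criterion_scores) != CRITERIA_PER_QUERY:
        raise ValueError(f"expected {CRITERIA_PER_QUERY} criterion scores")
    return mean(criterion_scores)


def benchmark_score(per_query_scores):
    """Average over all queries and normalize to the 0-1 range (assumed convention)."""
    return mean(response_score(scores) for scores in per_query_scores) / MAX_SCORE


# Made-up critic scores for two queries, five criteria each.
example = [
    [9, 8, 9, 10, 8],
    [7, 9, 8, 8, 8],
]
print(round(benchmark_score(example), 3))
```

Under these assumptions, a normalized value such as 0.883 corresponds to an average critic score of 8.83 out of 10.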
Qwen3-235B-A22B-Thinking-2507 from Alibaba Cloud / Qwen Team currently leads the WritingBench leaderboard with a score of 0.883 (88.3%) across 15 evaluated AI models, followed by Qwen3-Next-80B-A3B-Instruct at 87.3% and Qwen3 VL 235B A22B Thinking at 86.7%.
WritingBench Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 3 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | |
| 4 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 5 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | |
| 7 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 7 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 9 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 10 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 11 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | |
| 12 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 13 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 14 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 15 | Moonshot AI | 1.0T | — | — | |
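The rank column repeats 5 and 7 and then skips 6 and 8, which is consistent with standard competition ranking, where tied entries share a rank and the next rank is skipped. A minimal sketch of that scheme follows. The function name `competition_ranks` is hypothetical, and only the first three scores in the example come from this page; the remaining values are made up to show a tie.

```python
def competition_ranks(scores):
    """Return the 1-based competition rank of each score (higher is better).

    Tied scores share the rank of their first position in the sorted order,
    so the rank after a two-way tie at 5 is 7.
    """
    ordered = sorted(scores, reverse=True)
    return [ordered.index(score) + 1 for score in scores]


# First three scores are the leaders quoted on this page; the rest are
# illustrative placeholders, not leaderboard data.
scores = [0.883, 0.873, 0.867, 0.86, 0.85, 0.85, 0.84]
print(competition_ranks(scores))  # [1, 2, 3, 4, 5, 5, 7]
```

Whether the leaderboard breaks ties on rounded or unrounded scores is not stated here, so the sketch simply treats equal values as tied.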
More evaluations to explore
Related benchmarks in the same category
MMLU-Pro is a more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.
The Massive Multitask Language Understanding (MMLU) benchmark tests knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.
SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. The benchmark contains 25,957 questions covering 13 broad disciplinary areas including Engineering, Medicine, Science, and Law, with specialized fields in light industry, agriculture, and service-oriented domains. It employs a Human-LLM collaborative filtering mechanism with over 80 expert annotators to create challenging questions that assess graduate-level knowledge and reasoning capabilities.
τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both agent and user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.
Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. Built on the foundation of the Massive Multitask Language Understanding benchmark framework.
Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.