Creative Writing v3
EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts, with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons, and it challenges models in areas such as humor, romance, spatial awareness, and unique perspectives to assess emotional intelligence and creative writing ability.
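The Elo side of this hybrid scoring works like a chess rating: each judged pairwise comparison nudges the winner's rating up and the loser's down. The sketch below shows the standard Elo update; the K-factor of 32 and the 1500 starting rating are generic illustrative assumptions, not EQ-Bench's actual parameters.

```python
# Minimal sketch of Elo updates from judged pairwise comparisons.
# K = 32 and the 1500 starting rating are illustrative assumptions,
# not the benchmark's actual parameters.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one judged pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: two models start at 1500; A's story is judged better.
ra, rb = 1500.0, 1500.0
ra, rb = update_elo(ra, rb, a_won=True)
print(round(ra, 1), round(rb, 1))  # 1516.0 1484.0
```

Because the expected score depends only on the rating gap, repeated comparisons across many prompts converge toward a stable ranking even when individual judgments are noisy.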
Grok-4.1 Thinking from xAI currently leads the Creative Writing v3 leaderboard with an Elo score of 1721.9 across 13 evaluated AI models, followed by Grok-4.1 at 1708.6, with Qwen3-235B-A22B-Instruct-2507 in third place.
Progress Over Time
Interactive timeline showing model performance evolution on Creative Writing v3
Creative Writing v3 Leaderboard
| Rank | Model | Organization | Params | Context | Cost (per 1M tokens, input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Grok-4.1 Thinking | xAI | — | — | — | — |
| 2 | Grok-4.1 | xAI | — | — | — | — |
| 3 | Qwen3-235B-A22B-Instruct-2507 | Alibaba Cloud / Qwen Team | 235B | — | — | — |
| 4 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | — |
| 5 | — | Alibaba Cloud / Qwen Team | 235B | — | — | — |
| 6 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | — |
| 7 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 8 | — | Alibaba Cloud / Qwen Team | 80B | — | — | — |
| 9 | — | Alibaba Cloud / Qwen Team | 31B | — | — | — |
| 10 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 11 | — | Alibaba Cloud / Qwen Team | 31B | — | — | — |
| 12 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | — |
| 13 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | — |
More evaluations to explore
Related benchmarks in the same category
Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.
Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while providing 3x higher separation of model performances compared to MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities (a sketch of this pairwise judging loop appears after this list).
WritingBench is a comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries with a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task, using a fine-tuned critic model to score responses on style, format, and length dimensions.
MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.
COLLIE is a grammar-based framework for systematic construction of constrained text generation tasks. It allows specification of rich, compositional constraints across diverse generation levels and modeling challenges including language understanding, logical reasoning, and semantic planning. The COLLIE-v1 dataset contains 2,080 instances across 13 constraint structures.
Social IQa is the first large-scale benchmark for commonsense reasoning about social situations. It contains 38,000 multiple-choice questions probing emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.
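Since the Arena-Hard variants above (and Creative Writing v3 itself) rely on pairwise LLM-as-a-Judge comparisons, here is a minimal sketch of what such a loop can look like. The prompt wording is illustrative, and `toy_judge` is a hypothetical stand-in for a real judge-model API call (e.g., to GPT-4.1 or Gemini-2.5), not the benchmarks' actual implementation.

```python
# Hypothetical sketch of a pairwise LLM-as-a-Judge comparison loop.
# `toy_judge` is a trivial stand-in for a real judge-model API call;
# the prompt wording is illustrative only.

JUDGE_TEMPLATE = """You are an impartial judge. Read the user prompt and the
two responses, then answer with exactly "A" or "B".

[Prompt]
{prompt}

[Response A]
{a}

[Response B]
{b}
"""

def toy_judge(judge_prompt: str) -> str:
    # Stand-in heuristic so the sketch runs without an API key:
    # prefer whichever response section is longer.
    a = judge_prompt.split("[Response A]")[1].split("[Response B]")[0]
    b = judge_prompt.split("[Response B]")[1]
    return "A" if len(a.strip()) >= len(b.strip()) else "B"

def pairwise_winner(prompt: str, ans_a: str, ans_b: str, judge=toy_judge) -> str:
    """Judge twice with positions swapped to control for position bias."""
    first = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=ans_a, b=ans_b))
    second = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=ans_b, b=ans_a))
    if first == "A" and second == "B":
        return "A"   # A preferred in both orderings
    if first == "B" and second == "A":
        return "B"   # B preferred in both orderings
    return "tie"     # inconsistent verdicts count as a tie

print(pairwise_winner("Write a limerick about rain.",
                      "There once was a cloud full of rain...",
                      "Rain."))  # A
```

The double evaluation with swapped positions is a common mitigation for position bias in judge models; outcomes from many such comparisons can then feed an Elo update like the one sketched earlier.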