Creative Writing v3

EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models on 32 writing prompts, with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons. The prompts challenge models in areas such as humor, romance, spatial awareness, and unique perspectives, in order to assess emotional intelligence and creative writing capabilities.
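To make the "Elo ratings through pairwise comparisons" idea concrete, here is a minimal sketch of the textbook Elo update. It is not taken from the benchmark's code: the starting rating of 1500, the K-factor of 32, the model names, and the list of judge outcomes are all illustrative assumptions, and the benchmark may compute its ratings differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value from the benchmark.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical models, both starting at an assumed rating of 1500.
ratings = {"model_x": 1500.0, "model_y": 1500.0}

# Hypothetical judge verdicts as (winner, loser) pairs.
comparisons = [
    ("model_x", "model_y"),
    ("model_x", "model_y"),
    ("model_y", "model_x"),
]

for winner, loser in comparisons:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser], 1.0)

print(ratings)
```

In this sketch the model that wins two of the three comparisons ends with the higher rating, and the sum of the two ratings stays constant because each update is zero-sum.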

Grok-4.1 Thinking from xAI currently leads the Creative Writing v3 leaderboard with a score of 1721.900. The leaderboard covers 13 evaluated AI models.

Grok-4.1 Thinking (xAI) leads with 1721.900, followed by Grok-4.1 (xAI) at 1708.600 and Qwen3-235B-A22B-Instruct-2507 (Alibaba Cloud / Qwen Team) at 87.5%.

Progress Over Time

[Figure: interactive timeline of model performance on Creative Writing v3, showing the state-of-the-art frontier with open and proprietary models distinguished.]

Creative Writing v3 Leaderboard

13 models
Rank  Model                          Organization               Parameters  Context  Cost           License
1     Grok-4.1 Thinking              xAI                        -           -        -              -
2     Grok-4.1                       xAI                        -           -        -              -
3     Qwen3-235B-A22B-Instruct-2507  Alibaba Cloud / Qwen Team  235B        -        -              -
4     -                              Alibaba Cloud / Qwen Team  236B        262K     $0.30 / $1.50  -
5     -                              Alibaba Cloud / Qwen Team  235B        -        -              -
6     -                              Alibaba Cloud / Qwen Team  236B        262K     $0.45 / $3.49  -
7     -                              Alibaba Cloud / Qwen Team  33B         -        -              -
8     -                              Alibaba Cloud / Qwen Team  80B         -        -              -
9     -                              Alibaba Cloud / Qwen Team  31B         -        -              -
10    -                              Alibaba Cloud / Qwen Team  33B         -        -              -
11    -                              Alibaba Cloud / Qwen Team  31B         -        -              -
12    -                              Alibaba Cloud / Qwen Team  9B          262K     $0.18 / $2.09  -
13    -                              Alibaba Cloud / Qwen Team  4B          262K     $0.10 / $1.00  -

(- = not listed)

FAQ

Common questions about Creative Writing v3.

What is the Creative Writing v3 benchmark?

EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models on 32 writing prompts, with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons. The prompts challenge models in areas such as humor, romance, spatial awareness, and unique perspectives, in order to assess emotional intelligence and creative writing capabilities.

What is the Creative Writing v3 leaderboard?

The Creative Writing v3 leaderboard ranks 13 AI models based on their performance on this benchmark. Currently, Grok-4.1 Thinking by xAI leads with a score of 1721.900. The average score across all models is 264.597.
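The reported average is far below the leading score, which suggests that most entries are recorded on a different scale from the top ones. The following back-of-the-envelope check is a sketch under two assumptions that the page does not confirm: that the average is a plain arithmetic mean over all 13 scores, and that the runner-up score is 1708.600.

```python
n_models = 13
reported_mean = 264.597          # average stated in the FAQ
top_two = [1721.900, 1708.600]   # leader, plus the assumed runner-up score

# Total implied by the mean, then what is left for the other 11 models.
implied_total = reported_mean * n_models
remaining_total = implied_total - sum(top_two)
remaining_mean = remaining_total / (n_models - len(top_two))

print(f"implied total:   {implied_total:.3f}")
print(f"remaining total: {remaining_total:.3f}")
print(f"remaining mean:  {remaining_mean:.3f}")
```

Under these assumptions the other 11 models average well under 1.0 each, which would be consistent with scores stored as fractions (for example 0.875 shown as 87.5%) next to Elo-style scores in the thousands. Read this way, the overall average mixes two scales and is not a meaningful summary.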

What is the highest Creative Writing v3 score?

The highest Creative Writing v3 score is 1721.900, achieved by Grok-4.1 Thinking from xAI.

How many models are evaluated on Creative Writing v3?

13 models have been evaluated on the Creative Writing v3 benchmark, with 0 verified results and 13 self-reported results.

Where can I find the Creative Writing v3 paper?

The Creative Writing v3 paper is available at https://arxiv.org/abs/2312.06281. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Creative Writing v3 cover?

Creative Writing v3 is categorized under writing and creativity. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

Arena Hard

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.

writing
26 models
Arena-Hard v2

Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while providing 3x higher separation of model performances compared to MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.

writing
16 models
WritingBench

WritingBench is a comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries and uses a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task, with a fine-tuned critic model scoring responses on style, format, and length.

writing
15 models
MT-Bench

MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.

creativity
12 models
COLLIE

COLLIE is a grammar-based framework for systematic construction of constrained text generation tasks. It allows specification of rich, compositional constraints across diverse generation levels and modeling challenges including language understanding, logical reasoning, and semantic planning. The COLLIE-v1 dataset contains 2,080 instances across 13 constraint structures.

writing
9 models
Social IQa

Social IQa is the first large-scale benchmark for commonsense reasoning about social situations. It contains 38,000 multiple-choice questions that probe emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.

creativity
9 models