Creative Writing v3
EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts, with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons, and it challenges models in areas such as humor, romance, spatial awareness, and unique perspectives to assess emotional intelligence and creative writing ability.
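The Elo side of this hybrid scoring works like a chess rating: each judged pairwise comparison nudges the winner's rating up and the loser's down. The sketch below shows the standard Elo update; the K-factor of 32 and the 1500 starting rating are generic illustrative assumptions, not EQ-Bench's actual parameters.

```python
# Minimal sketch of Elo updates from judged pairwise comparisons.
# K = 32 and the 1500 starting rating are illustrative assumptions,
# not the benchmark's actual parameters.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one judged pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: two models start at 1500; A's story is judged better.
ra, rb = 1500.0, 1500.0
ra, rb = update_elo(ra, rb, a_won=True)
print(round(ra, 1), round(rb, 1))  # 1516.0 1484.0
```

Because the expected score depends only on the rating gap, repeated comparisons across many prompts converge toward a stable ranking even when individual judgments are noisy.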
Grok-4.1 Thinking from xAI currently leads the Creative Writing v3 leaderboard with an Elo score of 1721.9 across 13 evaluated AI models, followed by Grok-4.1 at 1708.6, with Qwen3-235B-A22B-Instruct-2507 in third place.
Progress Over Time
Interactive timeline showing model performance evolution on Creative Writing v3
Creative Writing v3 Leaderboard
| Rank | Model | Organization | Params | Context | Cost (per 1M tokens, input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Grok-4.1 Thinking | xAI | — | — | — | — |
| 2 | Grok-4.1 | xAI | — | — | — | — |
| 3 | Qwen3-235B-A22B-Instruct-2507 | Alibaba Cloud / Qwen Team | 235B | — | — | — |
| 4 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | — |
| 5 | — | Alibaba Cloud / Qwen Team | 235B | — | — | — |
| 6 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | — |
| 7 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 8 | — | Alibaba Cloud / Qwen Team | 80B | — | — | — |
| 9 | — | Alibaba Cloud / Qwen Team | 31B | — | — | — |
| 10 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 11 | — | Alibaba Cloud / Qwen Team | 31B | — | — | — |
| 12 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | — |
| 13 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | — |
More evaluations to explore
Related benchmarks in the same category
Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.
Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while providing 3x higher separation of model performances compared to MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities (a sketch of this pairwise judging loop appears after this list).
WritingBench is a comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries with a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task, using a fine-tuned critic model to score responses on style, format, and length dimensions.
MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.
COLLIE is a grammar-based framework for systematic construction of constrained text generation tasks. It allows specification of rich, compositional constraints across diverse generation levels and modeling challenges including language understanding, logical reasoning, and semantic planning. The COLLIE-v1 dataset contains 2,080 instances across 13 constraint structures.
Social IQa is the first large-scale benchmark for commonsense reasoning about social situations. It contains 38,000 multiple-choice questions probing emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.
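Since the Arena-Hard variants above (and Creative Writing v3 itself) rely on pairwise LLM-as-a-Judge comparisons, here is a minimal sketch of what such a loop can look like. The prompt wording is illustrative, and `toy_judge` is a hypothetical stand-in for a real judge-model API call (e.g., to GPT-4.1 or Gemini-2.5), not the benchmarks' actual implementation.

```python
# Hypothetical sketch of a pairwise LLM-as-a-Judge comparison loop.
# `toy_judge` is a trivial stand-in for a real judge-model API call;
# the prompt wording is illustrative only.

JUDGE_TEMPLATE = """You are an impartial judge. Read the user prompt and the
two responses, then answer with exactly "A" or "B".

[Prompt]
{prompt}

[Response A]
{a}

[Response B]
{b}
"""

def toy_judge(judge_prompt: str) -> str:
    # Stand-in heuristic so the sketch runs without an API key:
    # prefer whichever response section is longer.
    a = judge_prompt.split("[Response A]")[1].split("[Response B]")[0]
    b = judge_prompt.split("[Response B]")[1]
    return "A" if len(a.strip()) >= len(b.strip()) else "B"

def pairwise_winner(prompt: str, ans_a: str, ans_b: str, judge=toy_judge) -> str:
    """Judge twice with positions swapped to control for position bias."""
    first = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=ans_a, b=ans_b))
    second = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=ans_b, b=ans_a))
    if first == "A" and second == "B":
        return "A"   # A preferred in both orderings
    if first == "B" and second == "A":
        return "B"   # B preferred in both orderings
    return "tie"     # inconsistent verdicts count as a tie

print(pairwise_winner("Write a limerick about rain.",
                      "There once was a cloud full of rain...",
                      "Rain."))  # A
```

The double evaluation with swapped positions is a common mitigation for position bias in judge models; outcomes from many such comparisons can then feed an Elo update like the one sketched earlier.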