Creative Writing v3

EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons. Its prompts challenge models in areas such as humor, romance, spatial awareness, and unique perspectives to assess emotional intelligence and creative writing ability.
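
The pairwise half of the hybrid scoring can be illustrated with a minimal sketch of a textbook Elo update. This is not the benchmark's own code: the K-factor of 32, the starting rating of 1500, and the function names `expected_score` and `elo_update` are assumptions chosen for illustration, and the actual rating procedure used by the benchmark may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    outcome_a is 1.0 if the judge prefers A's text, 0.0 if it prefers B's,
    and 0.5 for a tie. k is an assumed K-factor, not a documented value.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Figures taken from the description: 32 prompts, 3 iterations per prompt.
PROMPTS = 32
ITERATIONS = 3
samples_per_model = PROMPTS * ITERATIONS  # 96 generated pieces per model

# Two models start level (assumed 1500); the judge prefers model A once.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
print(samples_per_model, new_a, new_b)  # 96 1516.0 1484.0
```

Because each comparison moves both ratings by the same amount in opposite directions, the total rating across models stays constant, so an Elo figure only has meaning relative to the other models in the same pool.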

Creative Writing v3 Leaderboard

13 models • 0 verified
Rank  Organization               Size  Context  Cost             License
1     -                          -     256K     $3.00 / $15.00   -
2     -                          -     256K     $3.00 / $15.00   -
3     Alibaba Cloud / Qwen Team  235B  262K     $0.15 / $0.80    -
4     Alibaba Cloud / Qwen Team  236B  262K     $0.30 / $1.50    -
5     Alibaba Cloud / Qwen Team  235B  262K     $0.30 / $3.00    -
6     Alibaba Cloud / Qwen Team  236B  262K     $0.45 / $3.49    -
7     Alibaba Cloud / Qwen Team  33B   -        -                -
8     Alibaba Cloud / Qwen Team  80B   66K      $0.15 / $1.50    -
9     Alibaba Cloud / Qwen Team  31B   262K     $0.20 / $0.70    -
10    Alibaba Cloud / Qwen Team  33B   -        -                -
11    Alibaba Cloud / Qwen Team  31B   262K     $0.20 / $1.00    -
12    Alibaba Cloud / Qwen Team  9B    262K     $0.18 / $2.09    -
13    Alibaba Cloud / Qwen Team  4B    262K     $0.10 / $1.00    -

FAQ

Common questions about Creative Writing v3

The Creative Writing v3 paper is available at https://arxiv.org/abs/2312.06281. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The Creative Writing v3 leaderboard ranks 13 AI models based on their performance on this benchmark. Currently, Grok-4.1 Thinking by xAI leads with a score of 1721.900. The average score across all models is 264.597.
The highest Creative Writing v3 score is 1721.900, achieved by Grok-4.1 Thinking from xAI.
13 models have been evaluated on the Creative Writing v3 benchmark, with 0 verified results and 13 self-reported results.
Creative Writing v3 is categorized under creativity and writing. The benchmark evaluates text models.