Creative Writing v3
EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts, with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons. The prompts challenge models in areas such as humor, romance, spatial awareness, and unique perspectives in order to assess emotional intelligence and creative writing capability.
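To make the pairwise part of the scoring more concrete, here is a minimal sketch of the textbook Elo update for a single head-to-head comparison, along with the generation count implied by 32 prompts and 3 iterations. This is an illustration under stated assumptions, not the benchmark's actual rating code: the function names, the K-factor of 32, and the 400-point scale are conventional Elo defaults chosen for the example, and the benchmark may compute its ratings differently.

```python
# Illustrative sketch only: textbook Elo, not the benchmark's own implementation.
# All names and the K-factor below are assumptions made for this example.

PROMPTS = 32      # number of writing prompts (from the benchmark description)
ITERATIONS = 3    # iterations per prompt (from the benchmark description)


def generations_per_model(prompts: int = PROMPTS, iterations: int = ITERATIONS) -> int:
    """Number of pieces one model writes: one per prompt per iteration."""
    return prompts * iterations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    outcome_a is 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(generations_per_model())                     # 96
    print(elo_update(1500.0, 1500.0, outcome_a=1.0))   # (1516.0, 1484.0)
```

In this sketch, two equally rated models each have an expected score of 0.5, so a win moves the winner up by half the K-factor and the loser down by the same amount.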
Progress Over Time
[Interactive timeline: model performance on Creative Writing v3 over time, with the state-of-the-art frontier shown for open and proprietary models]
Creative Writing v3 Leaderboard
13 models • 0 verified
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | xAI | — | 256K | $3.00 / $15.00 | |
| 2 | xAI | — | 256K | $3.00 / $15.00 | |
| 3 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 | |
| 4 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | |
| 5 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 | |
| 6 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | |
| 7 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 8 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | |
| 9 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | |
| 10 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 11 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | |
| 12 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 13 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
FAQ
Common questions about Creative Writing v3
The Creative Writing v3 paper is available at https://arxiv.org/abs/2312.06281. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The Creative Writing v3 leaderboard ranks 13 AI models by their performance on this benchmark. Grok-4.1 Thinking by xAI currently leads with a score of 1721.900. The average score across all 13 models is 264.597.
The highest Creative Writing v3 score is 1721.900, achieved by Grok-4.1 Thinking from xAI.
13 models have been evaluated on the Creative Writing v3 benchmark, with 0 verified results and 13 self-reported results.
Creative Writing v3 is categorized under creativity and writing. The benchmark evaluates text models.