Creative Writing v3
Creative Writing v3 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | — | — | — | | |
| 2 | xAI | — | — | — | |
| 3 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 4 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 5 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 7 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 8 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 9 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 10 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 11 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 12 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 13 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
What is Creative Writing v3?
EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts, with 3 iterations per prompt. It uses a hybrid scoring system that combines rubric assessment with Elo ratings derived from pairwise comparisons. The prompts challenge models in areas such as humor, romance, spatial awareness, and unique perspectives in order to assess emotional intelligence and creative writing ability.
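To make the pairwise part of the scoring more concrete, here is a minimal sketch of a textbook Elo update in Python. It is an illustration only: the description above does not specify the exact rating formula, so the K-factor of 32, the 400-point scale, the starting rating of 1500, and the function names `expected_score` and `elo_update` are all assumptions rather than EQ-Bench's actual implementation. Only the prompt and iteration counts come from the description.

```python
# Illustrative sketch: a textbook Elo update, NOT EQ-Bench's actual rating code.
# The K-factor, the 400-point scale, and the starting ratings are assumptions.

PROMPTS = 32      # number of writing prompts (from the benchmark description)
ITERATIONS = 3    # iterations per prompt (from the benchmark description)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one pairwise comparison.

    outcome_a is 1.0 if the judge preferred A, 0.0 if it preferred B,
    and 0.5 for a tie.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Each model produces one piece per prompt per iteration.
generations_per_model = PROMPTS * ITERATIONS

# Hypothetical comparison: two equally rated models, and the judge prefers A.
new_a, new_b = elo_update(1500.0, 1500.0, outcome_a=1.0)
```

In this sketch the winner gains exactly what the loser gives up, so the total rating in the pool stays constant. How the rubric scores are combined with the pairwise ratings is not described here, so that step is left out.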
Creative Writing v3 is a text benchmark that evaluates models on creativity and writing tasks. LLM Stats tracks 13 models on this benchmark. The page lists the scale as 0–1, but the reported figures are not on that scale: the current average is 264.6 and the leader scores 1721.9, values that look like Elo-style ratings.
Compare leaders on the best AI for creativity and best AI for writing leaderboards.
Current leaders
Grok-4.1 Thinking from xAI currently leads the 13 models evaluated on the Creative Writing v3 leaderboard, with a score of 1721.9.
Source paper
- Title: EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models
- Authors: Samuel J. Paech
- Published: arXiv:2312.06281
Abstract
We introduce EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://eqbench.com
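The correlation reported in the abstract (r=0.97) is a Pearson coefficient between two sets of per-model benchmark scores. The sketch below shows how such a coefficient can be computed in plain Python. The function name `pearson_r` and the toy score lists are hypothetical and exist only to exercise the formula; they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy scores for four hypothetical models (NOT values from the paper).
benchmark_a = [1.0, 2.0, 3.0, 4.0]
benchmark_b = [2.0, 4.0, 6.0, 8.0]   # perfectly linear in benchmark_a

r = pearson_r(benchmark_a, benchmark_b)
```

A value near 1 means the two benchmarks rank and space the models almost identically, which is the sense in which the abstract suggests EQ-Bench may capture similar aspects of broad intelligence as MMLU.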