EQ-Bench

EQ-Bench is an LLM-judged test of active emotional intelligence: understanding, insight, empathy, and interpersonal skill. The test set contains 45 challenging roleplay scenarios, most of which are pre-written prompts spanning three turns. The benchmark validates each model's responses against several criteria and conducts pairwise comparisons between models, reporting a normalized Elo rating for each.
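The page does not spell out the exact rating procedure, but the standard Elo update from judge-decided pairwise outcomes can be sketched as follows. This is a minimal illustration, not EQ-Bench's actual implementation; the model names, starting ratings, and K-factor are all assumptions.

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, a: str, b: str, score_a: float, k: float = 32.0) -> None:
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if the judge prefers A, 0.0 if it prefers B,
    and 0.5 for a tie. The update is zero-sum: points gained by
    one model are lost by the other.
    """
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

# Illustrative example: two models start at the same rating,
# and the judge prefers model_a in one comparison.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
update(ratings, "model_a", "model_b", 1.0)
```

In practice a benchmark would run many such comparisons across all scenario pairs and may normalize the resulting ratings to a fixed anchor; this sketch shows only the core update step.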


Progress Over Time

Interactive timeline showing model performance evolution on EQ-Bench


EQ-Bench Leaderboard

2 models
Rank  Model                     Context  Cost             License
1     Grok-4.1 Thinking (xAI)   256K     $3.00 / $15.00   —
2     —                         256K     $3.00 / $15.00   —

FAQ

Common questions about EQ-Bench

What is EQ-Bench?
EQ-Bench is an LLM-judged test of active emotional intelligence: understanding, insight, empathy, and interpersonal skill. The test set contains 45 challenging roleplay scenarios, most of which are pre-written prompts spanning three turns. The benchmark validates each model's responses against several criteria and conducts pairwise comparisons between models, reporting a normalized Elo rating for each.

Where can I find the EQ-Bench paper?
The EQ-Bench paper is available at https://github.com/EQ-Bench/EQ-Bench. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the EQ-Bench dataset?
The EQ-Bench dataset is available at https://github.com/EQ-Bench/EQ-Bench.

How are models ranked on the leaderboard?
The EQ-Bench leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Grok-4.1 Thinking by xAI leads with a score of 1586. The average score across all models is 1585.5.

What is the highest EQ-Bench score?
The highest EQ-Bench score is 1586, achieved by Grok-4.1 Thinking from xAI.

How many models have been evaluated?
2 models have been evaluated on the EQ-Bench benchmark, with 0 verified results and 2 self-reported results.

What categories does EQ-Bench cover?
EQ-Bench is categorized under creativity, general, reasoning, and roleplay. The benchmark evaluates text models.