
LMArena Text Leaderboard

LMArena Text Leaderboard is a blind human preference evaluation benchmark that ranks models based on pairwise comparisons in real-world conversations. The leaderboard uses Elo ratings computed from user preferences in head-to-head model battles, providing a comprehensive measure of overall model capability and style.
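To make the rating mechanism concrete, here is a minimal sketch of an online Elo update driven by pairwise battle outcomes. Note that LMArena's published methodology fits a statistical model (Bradley-Terry) over all battles rather than applying sequential updates; the model names and K-factor below are illustrative assumptions, not LMArena's actual parameters.

```python
# Simplified online Elo update from pairwise "battle" outcomes.
# Illustrative only: LMArena fits ratings over the full battle history
# rather than updating sequentially as shown here.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 4.0):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * (e_a - outcome)

# Hypothetical example: start both models at 1000 and replay battles.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
battles = [("model-a", "model-b", 1.0),   # user preferred model-a
           ("model-a", "model-b", 0.5),   # tie
           ("model-a", "model-b", 1.0)]   # user preferred model-a
for a, b, outcome in battles:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
```

Because model-a wins two of the three battles and ties the third, its rating ends above model-b's, which is the basic behavior the leaderboard's Elo-style scores encode.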


Progress Over Time

[Interactive timeline showing model performance evolution on the LMArena Text Leaderboard, with the state-of-the-art frontier highlighted and models marked as open or proprietary.]

Leaderboard

2 models • 0 verified

Rank  Context  Cost             License
1     256K     $3.00 / $15.00
2     256K     $3.00 / $15.00

FAQ

Common questions about LMArena Text Leaderboard

What is the LMArena Text Leaderboard?
LMArena Text Leaderboard is a blind human preference evaluation benchmark that ranks models based on pairwise comparisons in real-world conversations. The leaderboard uses Elo ratings computed from user preferences in head-to-head model battles, providing a comprehensive measure of overall model capability and style.

Where can I find the LMArena Text Leaderboard paper?
The LMArena Text Leaderboard paper is available at https://arena.lmsys.org/. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the dataset?
The LMArena Text Leaderboard dataset is available at https://arena.lmsys.org/.

How are models ranked?
The leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Grok-4.1 Thinking by xAI leads with a score of 1483. The average score across all models is 1474.

What is the highest score?
The highest LMArena Text Leaderboard score is 1483, achieved by Grok-4.1 Thinking from xAI.

How many models have been evaluated?
2 models have been evaluated on the LMArena Text Leaderboard benchmark, with 0 verified results and 2 self-reported results.

How is the benchmark categorized?
LMArena Text Leaderboard is categorized under general and reasoning. The benchmark evaluates text models.