Human Preferences

Model Arenas

Real human preference data from blind comparisons. Users judge outputs from AI models without knowing which model produced which, so each vote reflects the quality of the response rather than the reputation of the model.

About Rankings

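Rankings use TrueSkill, a Bayesian rating system that tracks both an estimated skill (μ) and the uncertainty in that estimate (σ) for each model. Models are ranked by the conservative rating μ - 3σ, so a model with only a few comparisons cannot reach the top on a handful of lucky wins; its rating rises as more votes shrink σ.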

μ (Mu): Skill estimate
σ (Sigma): Uncertainty
Rating: μ - 3σ
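
To make the conservative rating concrete, here is a minimal Python sketch that computes μ - 3σ and sorts models by it. The model names and the μ and σ values are made up for illustration; they are not taken from any arena. The only outside fact assumed is TrueSkill's usual default starting values (μ = 25, σ = 25/3), which give an unrated model a conservative rating of about 0.

```python
from dataclasses import dataclass


@dataclass
class ModelRating:
    name: str
    mu: float     # skill estimate
    sigma: float  # uncertainty in that estimate

    @property
    def conservative(self) -> float:
        """Conservative rating used for ranking: mu - 3 * sigma."""
        return self.mu - 3 * self.sigma


def rank(models):
    """Return models sorted from highest to lowest conservative rating."""
    return sorted(models, key=lambda m: m.conservative, reverse=True)


# Hypothetical values, chosen only to show the effect of sigma.
models = [
    ModelRating("model-a", mu=30.0, sigma=6.0),       # high mu, few votes
    ModelRating("model-b", mu=27.0, sigma=1.5),       # lower mu, many votes
    ModelRating("model-c", mu=25.0, sigma=25.0 / 3),  # unrated default
]

ranked = rank(models)
for position, m in enumerate(ranked, start=1):
    print(position, m.name, round(m.conservative, 2))
```

In this sketch model-b ranks above model-a even though its μ is lower, because its smaller σ leaves less room for the estimate to be wrong.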