LiveBench

LiveBench is a challenging, contamination-limited LLM benchmark. It limits test set contamination by releasing new questions monthly, drawn from recently released datasets, arXiv papers, news articles, and IMDb movie synopses. Its tasks span math, coding, reasoning, language, instruction following, and data analysis, and each question has a verifiable, objective ground-truth answer.
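
Because every question has an objective ground-truth answer, responses can be scored automatically without a human or LLM judge. The following is a minimal sketch of that idea, not LiveBench's actual scoring code: the `normalize` and `score_responses` helpers, the record layout, and the use of normalized exact match with an unweighted mean over categories are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(answer.strip().lower().split())


def score_responses(records):
    """Score (category, response, ground_truth) triples by normalized exact match.

    Hypothetical helper: returns per-category accuracy and the unweighted
    mean of those accuracies. Real benchmarks typically use task-specific
    scoring functions rather than plain exact match.
    """
    per_category = defaultdict(list)
    for category, response, truth in records:
        correct = normalize(response) == normalize(truth)
        per_category[category].append(1.0 if correct else 0.0)

    if not per_category:
        return {}, 0.0

    category_scores = {
        category: sum(marks) / len(marks)
        for category, marks in per_category.items()
    }
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall


if __name__ == "__main__":
    # Made-up records purely to exercise the function.
    sample = [
        ("math", "42", "42"),
        ("math", " 7 ", "8"),
        ("coding", "OK", "ok"),
    ]
    print(score_responses(sample))
```

Averaging per category first keeps a category with many questions from dominating the overall number; whether that matches a given leaderboard's aggregation should be checked against its documentation.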

Of the 13 AI models evaluated, OpenAI's o3-mini currently leads the LiveBench leaderboard with a score of 0.846.

Paper