LiveBench
LiveBench Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | OpenAI | — | — | — | ||
| 2 | OpenAI | — | 1.1M | $5.00 / $30.00 | ||
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 | ||
| 4 | Google | — | 1.0M | $2.50 / $15.00 | ||
| 5 | Anthropic | — | 1.0M | $10.00 / $50.00 | ||
| 6 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 7 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 8 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 9 | Moonshot AI | 1.0T | — | — | ||
| 9 | Moonshot AI | 1.0T | — | — | ||
| 11 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 12 | Anthropic | — | — | — | ||
| 13 | Anthropic | — | 200K | $3.00 / $15.00 | ||
| 14 | Google | — | 1.0M | $1.50 / $9.00 | ||
| 15 | Alibaba Cloud / Qwen Team | 33B | 128K | $0.10 / $0.44 | ||
| 16 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 17 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 17 | Alibaba Cloud / Qwen Team | 31B | 128K | $0.10 / $0.44 | ||
| 19 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 20 | DeepSeek | 1.6T | 1.0M | $1.60 / $3.20 | ||
| 21 | Google | — | — | — | ||
| 22 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 23 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 24 | Google | — | 1.0M | $0.50 / $3.00 | ||
| 25 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 | ||
| 26 | OpenAI | — | — | — | ||
| 27 | Moonshot AI | 1.0T | 262K | $0.74 / $3.50 | ||
| 28 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 29 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | ||
| 30 | OpenAI | — | 400K | $0.20 / $1.25 | ||
| 31 | MiniMax | — | 1.0M | $0.30 / $1.20 | ||
| 32 | Moonshot AI | 1.0T | — | — | ||
| 33 | OpenAI | — | — | — | ||
| 34 | OpenAI | — | — | — | ||
| 34 | Alibaba Cloud / Qwen Team | 73B | — | — | ||
| 36 | Microsoft | 15B | — | — | ||
| 37 | Alibaba Cloud / Qwen Team | 8B | — | — | ||
| 38 | Alibaba Cloud / Qwen Team | 7B | — | — |
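The Cost column lists two figures per model. As a rough illustration of how such a pair could be turned into a per-request estimate, here is a minimal sketch. It assumes the two figures are input and output prices in USD per million tokens, which the table itself does not state. The function name `request_cost_usd` and the token counts are hypothetical.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate the cost of one request.

    Assumption (not stated in the table): prices are USD per 1,000,000 tokens,
    with the first figure applying to input and the second to output.
    """
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Prices taken from a row listing "$2.50 / $15.00"; token counts are invented.
cost = request_cost_usd(10_000, 2_000, 2.50, 15.00)
print(f"${cost:.4f}")
```

Under this reading, output tokens dominate the bill for models whose second figure is several times the first, so long responses matter more than long prompts.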
What is LiveBench?
LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.
On LLM Stats, LiveBench is listed as a text benchmark under math, reasoning, and general tasks. The site tracks 38 models on this benchmark, scored on a 0–1 scale. The current average score is 0.7, and the leader scores about 0.8.
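To make the idea of scoring against objective ground-truth answers on a 0–1 scale concrete, here is a minimal sketch. It is not LiveBench's grading code: the exact-match rule, the category-then-overall averaging, the function names `grade` and `overall_score`, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def grade(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 on an exact match after trimming and case-folding, else 0.0.

    Assumption: a simple exact-match rule; real graders are likely task-specific.
    """
    same = model_answer.strip().casefold() == ground_truth.strip().casefold()
    return 1.0 if same else 0.0


def overall_score(records):
    """Average grades within each category, then average the category means."""
    per_category = defaultdict(list)
    for category, model_answer, ground_truth in records:
        per_category[category].append(grade(model_answer, ground_truth))
    category_means = {c: sum(g) / len(g) for c, g in per_category.items()}
    overall = sum(category_means.values()) / len(category_means)
    return overall, category_means


# Hypothetical records (category, model answer, ground truth), invented for illustration.
records = [
    ("math", "42", "42"),
    ("math", "7", "8"),
    ("reasoning", " Yes ", "yes"),
    ("coding", "O(n)", "O(n log n)"),
]
score, by_category = overall_score(records)
print(score, by_category)
```

Because every question has a single checkable answer, no human or LLM judge is involved in this sketch, and the result always falls between 0 and 1.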
Compare leaders on the best AI for math, best AI for reasoning and best AI for general leaderboards.
Current leaders
o3-mini from OpenAI currently leads the LiveBench leaderboard with a score of 0.846 across 38 evaluated AI models.
Source paper
- Title: LiveBench: A Challenging, Contamination-Limited LLM Benchmark
- Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, and 14 others
- Published: arXiv:2406.19314
Abstract
Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
FAQ
Common questions about the LiveBench benchmark and leaderboard.