LiveBench

Progress Over Time

[Interactive timeline: model performance evolution on LiveBench, showing the state-of-the-art frontier with open and proprietary models distinguished]

LiveBench Leaderboard

38 models
Rank | Organization | Parameters | Context | Cost (input / output)
1 | OpenAI | | |
2 | OpenAI | | 1.1M | $5.00 / $30.00
3 | OpenAI | | 1.0M | $2.50 / $15.00
4 | | | 1.0M | $2.50 / $15.00
5 | | | 1.0M | $10.00 / $50.00
6 | | | 1.0M | $5.00 / $25.00
7 | Alibaba Cloud / Qwen Team | 235B | |
8 | | | 1.0M | $5.00 / $25.00
9 | Moonshot AI | 1.0T | |
9 | | 1.0T | |
11 | | | 1.0M | $5.00 / $25.00
12 | | | |
13 | | | 200K | $3.00 / $15.00
14 | | | 1.0M | $1.50 / $9.00
15 | Alibaba Cloud / Qwen Team | 33B | 128K | $0.10 / $0.44
16 | OpenAI | | 400K | $1.75 / $14.00
17 | | | 400K | $1.75 / $14.00
17 | Alibaba Cloud / Qwen Team | 31B | 128K | $0.10 / $0.44
19 | Alibaba Cloud / Qwen Team | | 1.0M | $1.25 / $3.75
20 | | 1.6T | 1.0M | $1.60 / $3.20
21 | | | |
22 | Alibaba Cloud / Qwen Team | 33B | |
23 | | | 400K | $1.75 / $14.00
24 | | | 1.0M | $0.50 / $3.00
25 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50
26 | | | |
27 | Moonshot AI | 1.0T | 262K | $0.74 / $3.50
28 | Alibaba Cloud / Qwen Team | | 1.0M | $0.50 / $3.00
29 | Zhipu AI | 754B | 200K | $1.40 / $4.40
30 | | | 400K | $0.20 / $1.25
31 | MiniMax | | 1.0M | $0.30 / $1.20
32 | Moonshot AI | 1.0T | |
33 | OpenAI | | |
34 | | | |
34 | Alibaba Cloud / Qwen Team | 73B | |
36 | Microsoft | 15B | |
37 | Alibaba Cloud / Qwen Team | 8B | |
38 | Alibaba Cloud / Qwen Team | 7B | |
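
In the leaderboard above, ranks 9, 17, and 34 each appear twice and the following rank is skipped, which is consistent with standard competition ranking (tied entries share a rank, and the next rank accounts for the tie). The sketch below illustrates that scheme; it is an assumption about how the ranks are assigned, not the site's documented method, and the scores in the usage example are hypothetical.

```python
from typing import Dict, List


def competition_ranks(scores: List[float]) -> List[int]:
    """Return a 1-based competition rank for each score (higher is better).

    Tied scores share the rank of the first position they occupy in the
    descending order, so the rank after a two-way tie is skipped.
    Ties are detected by exact equality, which assumes scores are
    already rounded to the precision shown on the leaderboard.
    """
    ordered = sorted(scores, reverse=True)
    first_position: Dict[float, int] = {}
    for position, score in enumerate(ordered, start=1):
        # Keep only the first (best) position seen for each distinct score.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Hypothetical scores, chosen only to show a two-way tie.
example_scores = [0.70, 0.68, 0.68, 0.65]
print(competition_ranks(example_scores))  # [1, 2, 2, 4]
```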
About this benchmark

What is LiveBench?

LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.

LLM Stats lists LiveBench as a text benchmark under the math, reasoning, and general categories. It tracks 38 models on this benchmark, scored on a 0–1 scale. The current average is 0.705, and the leader scores 0.846.

Current leaders

o3-mini from OpenAI currently leads the LiveBench leaderboard with a score of 0.846 across 38 evaluated AI models.

1. o3-mini (OpenAI): 84.6%
2. GPT-5.5 (OpenAI): 80.7%
3. GPT-5.4 (OpenAI): 80.3%
Top open-weight model: Qwen3 235B A22B (rank #7): 77.1%

Source paper

Title
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Authors
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, and 14 others
Published
Abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

FAQ

Common questions about the LiveBench benchmark and leaderboard.

What is the LiveBench benchmark?

LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.

What is the LiveBench leaderboard?

The LiveBench leaderboard ranks 38 AI models based on their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.846. The average score across all models is 0.705.

What is the highest LiveBench score?

The highest LiveBench score is 0.846, achieved by o3-mini from OpenAI.

How many models are evaluated on LiveBench?

38 models have been evaluated on the LiveBench benchmark, with 0 verified results and 13 self-reported results.

Where can I find the LiveBench paper?

The LiveBench paper is available at https://arxiv.org/abs/2406.19314. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does LiveBench cover?

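On this leaderboard, LiveBench is categorized under math, reasoning, and general, and it evaluates text models. The benchmark's own tasks span math, coding, reasoning, language, instruction following, and data analysis.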

What is the best open-source model on LiveBench?

Qwen3 235B A22B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on LiveBench, with a score of 0.771 (rank #7).

Which model offers the best value on LiveBench?

Among models scoring within 10% of the leader, GPT-5.4 from OpenAI is the cheapest, at $2.50 per million input tokens with a score of 0.803.
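
The selection rule in this answer can be sketched as a filter followed by a minimum. The code below is an illustration under stated assumptions, not the site's actual calculation: it reads "within 10% of the leader" as a score of at least 90% of the top score, it skips models with no listed input price, and the `Entry` type and `best_value` function are hypothetical names. The scores and prices are the ones shown on this page.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float                  # LiveBench score on a 0-1 scale
    input_price: Optional[float]  # USD per 1M input tokens; None if not listed


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Pick the cheapest priced model scoring within `tolerance` of the leader.

    Assumption: "within 10%" means score >= leader_score * (1 - tolerance).
    Entries without a listed price are excluded from the comparison.
    """
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)
    eligible = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    if not eligible:
        return None
    # Cheapest input price first; a higher score breaks price ties.
    return min(eligible, key=lambda e: (e.input_price, -e.score))


entries = [
    Entry("o3-mini", 0.846, None),          # no price listed on this page
    Entry("GPT-5.5", 0.807, 5.00),
    Entry("GPT-5.4", 0.803, 2.50),
    Entry("Qwen3 235B A22B", 0.771, None),  # no price listed on this page
]
pick = best_value(entries)
print(pick.model if pick else "no priced model in range")  # GPT-5.4
```

With the leader at 0.846, the cutoff under this reading is 0.7614, so all four entries qualify on score and the comparison reduces to the two that carry a price.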

How recent are the LiveBench leaderboard results?

The LiveBench leaderboard was last updated in July 2026 and currently includes 38 evaluated models.