LiveBench

LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.
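Because every LiveBench question has a verifiable, objective ground-truth answer, scoring reduces to programmatic comparison rather than human or LLM judging. A minimal illustrative sketch of that idea (not LiveBench's actual grader; all names here are hypothetical):

```python
def normalize(ans: str) -> str:
    """Lowercase and strip surrounding whitespace and a trailing period
    so superficial formatting differences don't count as errors."""
    return ans.strip().lower().rstrip(".")

def score_exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the model's answer matches the ground truth, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0

def benchmark_score(predictions: list[str], ground_truths: list[str]) -> float:
    """Average exact-match score across all questions in a task."""
    scores = [score_exact_match(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)

print(benchmark_score(["42", "Paris "], ["42", "paris"]))  # → 1.0
```

In practice a real grader needs per-task logic (e.g., parsing a final boxed answer for math, or executing tests for code), but the principle is the same: the score is computed deterministically from the ground truth.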

Paper: https://arxiv.org/abs/2406.19314

Progress Over Time

[Interactive timeline showing model performance evolution on LiveBench, with the state-of-the-art frontier highlighted and open vs. proprietary models distinguished.]

LiveBench Leaderboard

13 models • 0 verified
| Rank | Organization | Model | Params | Context | Input $/1M tok | Output $/1M tok |
|------|--------------|-------|--------|---------|----------------|-----------------|
| 1 | OpenAI | o3-mini | — | 200K | $1.10 | $4.40 |
| 2 | Alibaba Cloud / Qwen Team | — | 235B | 128K | $0.10 | $0.10 |
| 3 | — | — | 1.0T | — | — | — |
| 3 | Moonshot AI | — | 1.0T | 200K | $0.50 | $0.50 |
| 5 | Alibaba Cloud / Qwen Team | — | 33B | 128K | $0.10 | $0.30 |
| 6 | Alibaba Cloud / Qwen Team | — | 31B | 128K | $0.10 | $0.30 |
| 7 | Alibaba Cloud / Qwen Team | — | 33B | — | — | — |
| 8 | OpenAI | — | — | 200K | $15.00 | $60.00 |
| 9 | Alibaba Cloud / Qwen Team | — | 73B | 131K | $0.35 | $0.40 |
| 9 | — | — | — | 128K | $15.00 | $60.00 |
| 11 | Microsoft | — | 15B | 16K | $0.07 | $0.14 |
| 12 | Alibaba Cloud / Qwen Team | — | 8B | 131K | $0.30 | $0.30 |
| 13 | Alibaba Cloud / Qwen Team | — | 7B | — | — | — |

FAQ

Common questions about LiveBench

What is LiveBench?
LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.

Where can I find the LiveBench paper?
The LiveBench paper is available at https://arxiv.org/abs/2406.19314. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the LiveBench leaderboard?
The LiveBench leaderboard ranks 13 AI models by their performance on the benchmark. Currently, o3-mini by OpenAI leads with a score of 0.846; the average score across all models is 0.632.

What is the highest LiveBench score?
The highest LiveBench score is 0.846, achieved by o3-mini from OpenAI.

How many models have been evaluated on LiveBench?
13 models have been evaluated on the LiveBench benchmark, with 0 verified results and 13 self-reported results.

What categories does LiveBench cover?
LiveBench is categorized under general, math, and reasoning. The benchmark evaluates text models.