LiveBench
LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.
Progress Over Time
[Interactive timeline: model performance on LiveBench over time. Legend: state-of-the-art frontier, open models, proprietary models.]
LiveBench Leaderboard
13 models • 0 verified
| Rank | Organization | Parameters | Context | Cost (input / output, USD per 1M tokens) | License |
|---|---|---|---|---|---|
| 1 | OpenAI | — | 200K | $1.10 / $4.40 | |
| 2 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 | |
| 3 | Moonshot AI | 1.0T | — | — | |
| 3 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | |
| 5 | Alibaba Cloud / Qwen Team | 33B | 128K | $0.10 / $0.30 | |
| 6 | Alibaba Cloud / Qwen Team | 31B | 128K | $0.10 / $0.30 | |
| 7 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 8 | OpenAI | — | 200K | $15.00 / $60.00 | |
| 9 | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | |
| 9 | OpenAI | — | 128K | $15.00 / $60.00 | |
| 11 | Microsoft | 15B | 16K | $0.07 / $0.14 | |
| 12 | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 | |
| 13 | Alibaba Cloud / Qwen Team | 7B | — | — | |
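To make the Cost column easier to read, here is a minimal sketch of how the two figures in a row could be turned into a per-request estimate. It assumes the first figure is the input price and the second the output price, both in USD per one million tokens; the page itself does not state the unit. The function name `estimate_cost_usd` and the token counts are hypothetical and chosen only for illustration.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimate the cost of one request in USD.

    Assumption: prices are quoted per 1,000,000 tokens, with separate
    rates for input (prompt) and output (completion) tokens.
    """
    tokens_per_price_unit = 1_000_000
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / tokens_per_price_unit


# Illustrative request: 2,000 input tokens and 500 output tokens,
# priced with the first row's figures ($1.10 input, $4.40 output).
cost = estimate_cost_usd(2_000, 500, 1.10, 4.40)
print(f"${cost:.4f}")
```

Rows that show a dash in the Cost column have no listed price, so no estimate can be derived for them.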
FAQ
Common questions about LiveBench
The LiveBench paper is available at https://arxiv.org/abs/2406.19314. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The LiveBench leaderboard ranks 13 AI models based on their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.846. The average score across all models is 0.632.
The highest LiveBench score is 0.846, achieved by o3-mini from OpenAI.
13 models have been evaluated on the LiveBench benchmark, with 0 verified results and 13 self-reported results.
LiveBench is categorized under general, math, and reasoning. The benchmark evaluates text models.