
LiveCodeBench

LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, enabling evaluation on unseen problems released after a model's training cutoff.
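The contamination control reduces to a date comparison: keep only problems whose annotated release date falls after the model's training cutoff. Below is a minimal sketch of that filter; the `Problem` fields, problem titles, and cutoff date are hypothetical illustrations, not LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    title: str
    platform: str        # e.g. "leetcode", "atcoder", "codeforces"
    release_date: date   # contest date annotated by the benchmark

def unseen_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training
    cutoff, so the evaluation set cannot have leaked into training data."""
    return [p for p in problems if p.release_date > training_cutoff]

# Hypothetical pool and cutoff: only the second problem survives the filter.
pool = [
    Problem("array-rotation", "leetcode", date(2024, 1, 15)),
    Problem("grid-paths", "atcoder", date(2024, 6, 8)),
]
print([p.title for p in unseen_problems(pool, date(2024, 4, 30))])  # ['grid-paths']
```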

Paper

The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974.

Progress Over Time

[Interactive timeline: model performance evolution on LiveCodeBench over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]

LiveCodeBench Leaderboard

68 models • 0 verified • showing 1-50 of 68

| Rank | Model | Organization | Score | Params | Context | Cost (input / output) |
|---|---|---|---|---|---|---|
| 1 | DeepSeek-V3.2 (Thinking) | DeepSeek | 0.833 | 685B | — | — |
| 2 | — | MiniMax | 0.830 | 230B | 1.0M | $0.30 / $1.20 |
| 3 | — | — | 0.828 | 560B | 128K | $0.30 / $1.20 |
| 4 | — | — | 0.812 | 120B | 262K | $0.10 / $0.50 |
| 5 | — | — | 0.804 | — | 128K | $0.30 / $0.50 |
| 6 | — | — | 0.800 | — | 2.0M | $0.20 / $0.50 |
| 7 | — | — | 0.794 | — | — | — |
| 7 | — | — | 0.794 | 560B | 128K | $0.30 / $1.20 |
| 7 | — | — | 0.794 | — | 128K | $3.00 / $15.00 |
| 10 | — | — | 0.790 | — | — | — |
| 11 | — | — | 0.780 | 230B | 1.0M | $0.30 / $1.20 |
| 12 | — | — | 0.741 | 685B | — | — |
| 13 | — | — | 0.733 | 671B | 131K | $0.50 / $2.15 |
| 14 | — | Zhipu AI | 0.729 | 355B | 131K | $0.40 / $1.60 |
| 15 | — | — | 0.711 | 9B | — | — |
| 16 | — | Zhipu AI | 0.707 | 106B | — | — |
| 16 | — | Alibaba Cloud / Qwen Team | 0.707 | 235B | 128K | $0.10 / $0.10 |
| 18 | — | — | 0.690 | — | 1.0M | $1.25 / $10.00 |
| 19 | — | Inception | 0.670 | — | 128K | $0.25 / $0.75 |
| 20 | — | — | 0.663 | 253B | — | — |
| 21 | — | Alibaba Cloud / Qwen Team | 0.657 | 33B | 128K | $0.10 / $0.30 |
| 22 | — | — | 0.650 | 456B | 1.0M | $0.55 / $2.20 |
| 23 | — | — | 0.646 | 14B | 262K | $0.20 / $0.20 |
| 24 | — | Mistral AI | 0.636 | 119B | 256K | $0.15 / $0.60 |
| 25 | — | Alibaba Cloud / Qwen Team | 0.634 | 33B | — | — |
| 26 | — | Alibaba Cloud / Qwen Team | 0.626 | 31B | 128K | $0.10 / $0.30 |
| 27 | — | — | 0.623 | 456B | — | — |
| 28 | — | — | 0.616 | 8B | 262K | $0.15 / $0.15 |
| 29 | — | — | 0.575 | 71B | 128K | $0.10 / $0.40 |
| 30 | — | — | 0.572 | 33B | 128K | $0.12 / $0.18 |
| 31 | — | — | 0.564 | 671B | 164K | $0.27 / $1.00 |
| 32 | — | Alibaba Cloud / Qwen Team | 0.555 | 73B | 131K | $0.35 / $0.40 |
| 33 | — | — | 0.548 | 3B | 131K | $0.10 / $0.10 |
| 34 | — | — | 0.538 | 14B | — | — |
| 35 | — | — | 0.537 | 1.0T | — | — |
| 36 | — | — | 0.531 | 14B | — | — |
| 36 | — | — | 0.531 | 15B | — | — |
| 38 | — | — | 0.513 | 24B | — | — |
| 39 | — | — | 0.503 | 24B | — | — |
| 40 | — | Alibaba Cloud / Qwen Team | 0.500 | 33B | 33K | $0.15 / $0.60 |
| 40 | — | — | 0.500 | 671B | — | — |
| 42 | — | — | 0.492 | 671B | 164K | $0.28 / $1.14 |
| 43 | — | — | 0.480 | 560B | 128K | $0.30 / $1.20 |
| 44 | — | — | 0.434 | 400B | 1.0M | $0.17 / $0.60 |
| 45 | — | — | 0.396 | 8B | — | — |
| 46 | — | DeepSeek | 0.376 | 671B | 131K | $0.27 / $1.10 |
| 46 | — | — | 0.376 | 8B | — | — |
| 48 | — | — | 0.351 | — | 1.0M | $0.10 / $0.40 |
| 49 | — | — | 0.344 | 675B | — | — |
| 49 | — | — | 0.344 | 675B | — | — |
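The Score column reports the fraction of problems a model solves. LiveCodeBench's headline metric is pass@1, typically estimated from several generations per problem with the standard unbiased pass@k estimator (Chen et al., 2021); here is a minimal sketch, assuming n samples per problem of which c pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, passes all tests. For k=1 this reduces to c/n."""
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=5, k=1))  # 0.5    (pass@1)
print(pass_at_k(n=10, c=5, k=3))  # ~0.917 (pass@3)
```

Averaging pass@1 over all problems in the evaluation window yields a leaderboard score such as the 0.833 at the top of this table.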

FAQ

Common questions about LiveCodeBench

What is LiveCodeBench?
LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, enabling evaluation on unseen problems released after a model's training cutoff.

Where can I find the LiveCodeBench paper?
The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the LiveCodeBench leaderboard?
The leaderboard ranks 68 AI models by their performance on the benchmark. DeepSeek-V3.2 (Thinking) by DeepSeek currently leads with a score of 0.833; the average score across all models is 0.514.

What is the highest LiveCodeBench score?
The highest LiveCodeBench score is 0.833, achieved by DeepSeek-V3.2 (Thinking) from DeepSeek.

How many models have been evaluated on LiveCodeBench?
68 models have been evaluated on the LiveCodeBench benchmark, with 0 verified results and 68 self-reported results.

What categories does LiveCodeBench cover?
LiveCodeBench is categorized under code, general, and reasoning, and it evaluates text models.