
LiveCodeBench

LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, so models can be evaluated only on problems released after their training cutoff, which they could not have seen during training.
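A minimal sketch of that date-filtering idea (the field names, problem IDs, and cutoff below are illustrative, not LiveCodeBench's actual schema):

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench annotates each problem with
# its contest release date (these field names are made up for illustration).
problems = [
    {"id": "lc-3402", "source": "leetcode",   "release_date": date(2024, 11, 3)},
    {"id": "abc-371", "source": "atcoder",    "release_date": date(2024, 9, 14)},
    {"id": "cf-1995", "source": "codeforces", "release_date": date(2024, 7, 23)},
]

# An assumed training-data cutoff for the model under evaluation.
training_cutoff = date(2024, 8, 31)

# Keep only problems released after the cutoff, so the model cannot have
# seen them during training: the core of the contamination-free protocol.
unseen = [p for p in problems if p["release_date"] > training_cutoff]
print([p["id"] for p in unseen])  # ['lc-3402', 'abc-371']
```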

Paper: https://arxiv.org/abs/2403.07974

Progress Over Time

[Interactive timeline showing model performance evolution on LiveCodeBench. Legend: state-of-the-art frontier, open models, proprietary models.]

LiveCodeBench Leaderboard

69 models evaluated; rows 1–50 of 69 shown (page 1 of 2). For each ranked model, the table lists its organization, parameter count, context window, cost (input / output per 1M tokens), and license. Organizations on this page include DeepSeek, MiniMax, Zhipu AI, Alibaba Cloud / Qwen Team, Mistral AI, and Inception.

[Interactive leaderboard table; per-model rows omitted.]
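For context on the cost column: prices are quoted per million input and output tokens, so a full evaluation run can be costed with simple arithmetic. A rough sketch (every number below is a hypothetical assumption, not a measured figure):

```python
# Hypothetical back-of-the-envelope cost estimate for one evaluation run.
# Prices follow the leaderboard's "input / output per 1M tokens" convention.
input_price = 0.26    # $ per 1M input tokens (example figure)
output_price = 0.38   # $ per 1M output tokens (example figure)

num_problems = 500    # assumed benchmark size for this sketch
input_tokens = 2_000  # assumed prompt length per problem
output_tokens = 1_500 # assumed completion length per problem

cost = num_problems * (
    input_tokens / 1e6 * input_price
    + output_tokens / 1e6 * output_price
)
print(f"estimated cost: ${cost:.3f}")  # $0.545 under these assumptions
```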

FAQ

Common questions about LiveCodeBench

What is LiveCodeBench?
LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, enabling evaluation on problems released after a model's training cutoff.

Where can I find the LiveCodeBench paper?
The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974. It details the benchmark's methodology, dataset construction, and evaluation criteria.

Which model leads the LiveCodeBench leaderboard?
The leaderboard ranks 69 AI models by their performance on this benchmark. DeepSeek-V3.2 by DeepSeek currently leads with a score of 0.833, and the average score across all models is 0.519.

What is the highest LiveCodeBench score?
The highest LiveCodeBench score is 0.833, achieved by DeepSeek-V3.2 from DeepSeek.

How many models have been evaluated on LiveCodeBench?
69 models have been evaluated on LiveCodeBench; all 69 results are self-reported, and none have been independently verified.

What categories does LiveCodeBench cover?
LiveCodeBench is categorized under code, general, and reasoning, and it evaluates text models.
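Scores on this leaderboard are fractions of problems solved. Benchmarks in this family typically report pass@1, often estimated from several samples per problem with the unbiased pass@k estimator of Chen et al. (2021). A minimal sketch, with made-up per-problem sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples generated per problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: 10 samples each, varying pass counts.
pass_counts = [10, 7, 0, 3, 10]
scores = [pass_at_k(n=10, c=c, k=1) for c in pass_counts]
print(sum(scores) / len(scores))  # mean pass@1 = 0.6
```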