LiveCodeBench
LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, Codeforces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, enabling evaluation on problems released after a model's training cutoff.
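The release-date annotation makes contamination filtering a simple date comparison. A minimal sketch in Python (the problem records and field names below are illustrative, not the benchmark's actual schema):

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench annotates each problem
# with its contest release date (field names here are illustrative).
problems = [
    {"id": "abc300_a", "source": "AtCoder",    "release": date(2024, 5, 11)},
    {"id": "lc-3105",  "source": "LeetCode",   "release": date(2024, 3, 2)},
    {"id": "cf-1900D", "source": "Codeforces", "release": date(2023, 11, 20)},
]

def contamination_free(problems, cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["release"] > cutoff]

eval_set = contamination_free(problems, cutoff=date(2024, 1, 1))
print([p["id"] for p in eval_set])  # → ['abc300_a', 'lc-3105']
```

Evaluating only on the post-cutoff subset is what lets scores be compared across models with different training cutoffs.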
Progress Over Time
[Interactive timeline: model performance on LiveCodeBench over time, split into open and proprietary models, with the state-of-the-art frontier highlighted.]
LiveCodeBench Leaderboard
68 models • 0 verified
| Rank | Organization | Score | Params | Context | Cost (in / out) | License |
|---|---|---|---|---|---|---|
| 1 | DeepSeek | 0.833 | 685B | — | — | — |
| 2 | MiniMax | 0.830 | 230B | 1.0M | $0.30 / $1.20 | — |
| 3 | Meituan | 0.828 | 560B | 128K | $0.30 / $1.20 | — |
| 4 | | 0.812 | 120B | 262K | $0.10 / $0.50 | — |
| 5 | xAI | 0.804 | — | 128K | $0.30 / $0.50 | — |
| 6 | xAI | 0.800 | — | 2.0M | $0.20 / $0.50 | — |
| 7 | xAI | 0.794 | — | — | — | — |
| 7 | Meituan | 0.794 | 560B | 128K | $0.30 / $1.20 | — |
| 7 | xAI | 0.794 | — | 128K | $3.00 / $15.00 | — |
| 10 | xAI | 0.790 | — | — | — | — |
| 11 | MiniMax | 0.780 | 230B | 1.0M | $0.30 / $1.20 | — |
| 12 | DeepSeek | 0.741 | 685B | — | — | — |
| 13 | DeepSeek | 0.733 | 671B | 131K | $0.50 / $2.15 | — |
| 14 | Zhipu AI | 0.729 | 355B | 131K | $0.40 / $1.60 | — |
| 15 | NVIDIA | 0.711 | 9B | — | — | — |
| 16 | Zhipu AI | 0.707 | 106B | — | — | — |
| 16 | Alibaba Cloud / Qwen Team | 0.707 | 235B | 128K | $0.10 / $0.10 | — |
| 18 | | 0.690 | — | 1.0M | $1.25 / $10.00 | — |
| 19 | Inception | 0.670 | — | 128K | $0.25 / $0.75 | — |
| 20 | | 0.663 | 253B | — | — | — |
| 21 | Alibaba Cloud / Qwen Team | 0.657 | 33B | 128K | $0.10 / $0.30 | — |
| 22 | MiniMax | 0.650 | 456B | 1.0M | $0.55 / $2.20 | — |
| 23 | Mistral AI | 0.646 | 14B | 262K | $0.20 / $0.20 | — |
| 24 | Mistral AI | 0.636 | 119B | 256K | $0.15 / $0.60 | — |
| 25 | Alibaba Cloud / Qwen Team | 0.634 | 33B | — | — | — |
| 26 | Alibaba Cloud / Qwen Team | 0.626 | 31B | 128K | $0.10 / $0.30 | — |
| 27 | MiniMax | 0.623 | 456B | — | — | — |
| 28 | Mistral AI | 0.616 | 8B | 262K | $0.15 / $0.15 | — |
| 29 | DeepSeek | 0.575 | 71B | 128K | $0.10 / $0.40 | — |
| 30 | DeepSeek | 0.572 | 33B | 128K | $0.12 / $0.18 | — |
| 31 | DeepSeek | 0.564 | 671B | 164K | $0.27 / $1.00 | — |
| 32 | Alibaba Cloud / Qwen Team | 0.555 | 73B | 131K | $0.35 / $0.40 | — |
| 33 | Mistral AI | 0.548 | 3B | 131K | $0.10 / $0.10 | — |
| 34 | Microsoft | 0.538 | 14B | — | — | — |
| 35 | Moonshot AI | 0.537 | 1.0T | — | — | — |
| 36 | Microsoft | 0.531 | 14B | — | — | — |
| 36 | DeepSeek | 0.531 | 15B | — | — | — |
| 38 | Mistral AI | 0.513 | 24B | — | — | — |
| 39 | Mistral AI | 0.503 | 24B | — | — | — |
| 40 | Alibaba Cloud / Qwen Team | 0.500 | 33B | 33K | $0.15 / $0.60 | — |
| 40 | DeepSeek | 0.500 | 671B | — | — | — |
| 42 | DeepSeek | 0.492 | 671B | 164K | $0.28 / $1.14 | — |
| 43 | Meituan | 0.480 | 560B | 128K | $0.30 / $1.20 | — |
| 44 | Meta | 0.434 | 400B | 1.0M | $0.17 / $0.60 | — |
| 45 | DeepSeek | 0.396 | 8B | — | — | — |
| 46 | DeepSeek | 0.376 | 671B | 131K | $0.27 / $1.10 | — |
| 46 | DeepSeek | 0.376 | 8B | — | — | — |
| 48 | Google | 0.351 | — | 1.0M | $0.10 / $0.40 | — |
| 49 | Mistral AI | 0.344 | 675B | — | — | — |
| 49 | | 0.344 | 675B | — | — | — |
Showing models 1-50 of 68 (page 1 of 2).
FAQ
Common questions about LiveCodeBench
**What is LiveCodeBench?**
LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, Codeforces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, enabling evaluation on problems released after a model's training cutoff.
**Where can I read the LiveCodeBench paper?**
The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974. It details the benchmark's methodology, dataset construction, and evaluation criteria.
**Which model leads the LiveCodeBench leaderboard?**
The leaderboard ranks 68 AI models by their performance on the benchmark. Currently, DeepSeek-V3.2 (Thinking) by DeepSeek leads with a score of 0.833; the average score across all models is 0.514.
**What is the highest LiveCodeBench score?**
The highest LiveCodeBench score is 0.833, achieved by DeepSeek-V3.2 (Thinking) from DeepSeek.
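Scores like 0.833 are fractions of problems solved. Code benchmarks commonly report pass@1 via the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); whether LiveCodeBench's leaderboard uses this exact estimator is an assumption, but the formula is standard enough to sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, passes."""
    if n - c < k:  # fewer failures than samples -> a passing sample is guaranteed
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With 10 generations and 3 passing, pass@1 reduces to c/n = 0.3
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3
```

For k = 1 the estimator is simply the fraction of passing generations, which is why pass@1 leaderboard scores read as solve rates.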
**How many models have been evaluated on LiveCodeBench?**
68 models have been evaluated on the LiveCodeBench benchmark, with 0 verified results and 68 self-reported results.
**What categories does LiveCodeBench cover?**
LiveCodeBench falls under the code, general, and reasoning categories, and it evaluates text models.