LiveCodeBench
LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
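The release-date filtering behind the contamination-free claim is simple to express. The sketch below assumes a hypothetical list of problem records with release-date metadata; the field names and IDs are illustrative, not LiveCodeBench's actual schema:

```python
from datetime import date

# Hypothetical problem records: LiveCodeBench annotates each problem with a
# release date so evaluation can be restricted to problems a model cannot
# have seen during training. Field names here are illustrative assumptions.
problems = [
    {"id": "abc300_a", "source": "AtCoder", "release_date": date(2023, 4, 29)},
    {"id": "lc-2873", "source": "LeetCode", "release_date": date(2023, 10, 1)},
    {"id": "cf-1900A", "source": "CodeForces", "release_date": date(2023, 11, 20)},
]

def contamination_free_subset(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["release_date"] > training_cutoff]

# Example: a model trained on data up to September 2023 is evaluated only on
# problems released afterwards.
eval_set = contamination_free_subset(problems, date(2023, 9, 30))
print([p["id"] for p in eval_set])  # ['lc-2873', 'cf-1900A']
```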
Progress Over Time
[Interactive timeline showing model performance evolution on LiveCodeBench, with a state-of-the-art frontier overlaid and models split into open and proprietary.]
LiveCodeBench Leaderboard
69 models
| Rank | Organization | Parameters | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | DeepSeek | 685B | 164K | $0.26 / $0.38 |
| 1 | DeepSeek | 685B | — | — |
| 3 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 4 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 5 | — | 120B | 262K | $0.10 / $0.50 |
| 6 | xAI | — | 128K | $0.30 / $0.50 |
| 7 | xAI | — | 2.0M | $0.20 / $0.50 |
| 8 | xAI | — | 128K | $3.00 / $15.00 |
| 8 | xAI | — | — | — |
| 8 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 11 | xAI | — | — | — |
| 12 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 13 | DeepSeek | 685B | — | — |
| 14 | DeepSeek | 671B | 131K | $0.50 / $2.15 |
| 15 | Zhipu AI | 355B | 131K | $0.40 / $1.60 |
| 16 | NVIDIA | 9B | — | — |
| 17 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 |
| 17 | Zhipu AI | 106B | — | — |
| 19 | — | — | 1.0M | $1.25 / $10.00 |
| 20 | Inception | — | 128K | $0.25 / $0.75 |
| 21 | — | 253B | — | — |
| 22 | Alibaba Cloud / Qwen Team | 33B | 128K | $0.10 / $0.30 |
| 23 | MiniMax | 456B | 1.0M | $0.55 / $2.20 |
| 24 | Mistral AI | 14B | 262K | $0.20 / $0.20 |
| 25 | Mistral AI | 119B | 256K | $0.15 / $0.60 |
| 26 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 27 | Alibaba Cloud / Qwen Team | 31B | 128K | $0.10 / $0.44 |
| 28 | MiniMax | 456B | — | — |
| 29 | Mistral AI | 8B | 262K | $0.15 / $0.15 |
| 30 | DeepSeek | 71B | 128K | $0.10 / $0.40 |
| 31 | DeepSeek | 33B | 128K | $0.12 / $0.18 |
| 32 | DeepSeek | 671B | 164K | $0.27 / $1.00 |
| 33 | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 |
| 34 | Mistral AI | 3B | 131K | $0.10 / $0.10 |
| 35 | Microsoft | 14B | — | — |
| 36 | Moonshot AI | 1.0T | — | — |
| 37 | Microsoft | 14B | — | — |
| 37 | DeepSeek | 15B | — | — |
| 39 | Mistral AI | 24B | — | — |
| 40 | Mistral AI | 24B | — | — |
| 41 | DeepSeek | 671B | — | — |
| 41 | Alibaba Cloud / Qwen Team | 33B | 33K | $0.15 / $0.60 |
| 43 | DeepSeek | 671B | 164K | $0.28 / $1.14 |
| 44 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 45 | Meta | 400B | 1.0M | $0.17 / $0.85 |
| 46 | DeepSeek | 8B | — | — |
| 47 | DeepSeek | 8B | — | — |
| 47 | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 49 | Google | — | 1.0M | $0.10 / $0.40 |
| 50 | Mistral AI | 675B | — | — |
Showing models 1–50 of 69.
FAQ
Common questions about LiveCodeBench
What is LiveCodeBench?
LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
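In the code-generation scenario, a candidate program is judged the way contest judges work: run it on each hidden test's stdin and compare its stdout against the expected output. Below is a minimal sketch of that idea; it is a simplified stand-in, not LiveCodeBench's actual harness, and `test_cases` is assumed to be a list of (stdin, expected stdout) pairs:

```python
import subprocess

def passes_tests(solution_path, test_cases, timeout=6.0):
    """Run a candidate Python program against stdin/stdout test cases.

    Returns True only if every test produces the expected output within
    the time limit. A simplified contest-style judge, for illustration.
    """
    for stdin, expected in test_cases:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # time limit exceeded
        if result.returncode != 0:
            return False  # runtime error
        if result.stdout.strip() != expected.strip():
            return False  # wrong answer
    return True
```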
Where can I read more about LiveCodeBench?
The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974 and details the benchmark's methodology, dataset construction, and evaluation criteria.
How do models perform on LiveCodeBench?
The LiveCodeBench leaderboard ranks 69 AI models by their performance on this benchmark. Currently, DeepSeek-V3.2 by DeepSeek leads with a score of 0.833; the average score across all models is 0.519.
What is the highest LiveCodeBench score?
The highest LiveCodeBench score is 0.833, achieved by DeepSeek-V3.2 from DeepSeek.
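Scores on code benchmarks of this kind are typically pass rates, most often pass@1 estimated from n sampled solutions per problem via the unbiased estimator of Chen et al. (2021); whether any given leaderboard entry was measured this way is an assumption. A minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # too few failures left for k draws to all miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 8 correct -> pass@1 = 0.8
print(pass_at_k(n=10, c=8, k=1))  # 0.8
```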
How many models have been evaluated on LiveCodeBench?
69 models have been evaluated on LiveCodeBench; all 69 results are self-reported, and none have been independently verified.
How is LiveCodeBench categorized?
LiveCodeBench is categorized under code, general, and reasoning, and it evaluates text models.