LiveCodeBench(01-09)
LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
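The release-date annotations make the contamination check mechanical: given a model's training cutoff, the evaluation keeps only problems released afterwards. A minimal sketch of that filtering step, using illustrative field names and dates rather than LiveCodeBench's actual schema:

```python
from datetime import date

# Toy problem records; LiveCodeBench annotates every problem with its
# contest release date (the field names here are assumptions, not the
# benchmark's real schema).
problems = [
    {"id": "lc-3200", "source": "LeetCode", "release_date": date(2024, 6, 15)},
    {"id": "abc-350-d", "source": "AtCoder", "release_date": date(2024, 4, 20)},
    {"id": "cf-1900-c", "source": "CodeForces", "release_date": date(2023, 11, 30)},
]

# Example training cutoff for a hypothetical model.
training_cutoff = date(2024, 1, 1)

# The contamination-free split: keep only problems released after the
# model's training cutoff, so they cannot appear in its training data.
unseen = [p for p in problems if p["release_date"] > training_cutoff]
print([p["id"] for p in unseen])  # -> ['lc-3200', 'abc-350-d']
```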
Progress Over Time
[Interactive timeline showing model performance evolution on LiveCodeBench(01-09); series: state-of-the-art frontier, open models, proprietary models]
LiveCodeBench(01-09) Leaderboard
1 model
| Rank | Model | Organization | Parameters | Context | Cost (input / output, per 1M tokens) | Score |
|---|---|---|---|---|---|---|
| 1 | DeepSeek-V2.5 | DeepSeek | 236B | 8K | $0.14 / $0.28 | 0.418 |
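The cost column lists input / output prices per 1M tokens, so estimating the cost of an evaluation run is a straightforward product. A small sketch using the listed prices; the token counts are made-up assumptions:

```python
# Listed per-1M-token prices for DeepSeek-V2.5 (input / output).
INPUT_PRICE = 0.14
OUTPUT_PRICE = 0.28

# Hypothetical token usage for one full benchmark run.
input_tokens = 2_000_000
output_tokens = 500_000

# Cost = tokens (in millions) times the per-million price for each side.
cost = (input_tokens / 1e6) * INPUT_PRICE + (output_tokens / 1e6) * OUTPUT_PRICE
print(f"Estimated run cost: ${cost:.2f}")  # -> Estimated run cost: $0.42
```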
FAQ
Common questions about LiveCodeBench(01-09)
What is LiveCodeBench(01-09)?
LiveCodeBench is a holistic, contamination-free benchmark for code LLMs. It continuously collects new problems from LeetCode, AtCoder, and CodeForces, evaluates four scenarios (code generation, self-repair, code execution, and test output prediction), and annotates each problem with its release date so models can be tested on problems published after their training cutoff.
Where can I read the LiveCodeBench paper?
The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974. It details the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on LiveCodeBench(01-09)?
The LiveCodeBench(01-09) leaderboard ranks 1 AI model by its performance on this benchmark. Currently, DeepSeek-V2.5 by DeepSeek leads with a score of 0.418; with a single entry, the average score across all models is also 0.418.
What is the highest LiveCodeBench(01-09) score?
The highest LiveCodeBench(01-09) score is 0.418, achieved by DeepSeek-V2.5 from DeepSeek.
How many models have been evaluated?
1 model has been evaluated on the LiveCodeBench(01-09) benchmark, with 0 verified results and 1 self-reported result.
How is LiveCodeBench(01-09) categorized?
LiveCodeBench(01-09) is categorized under general and reasoning, and it evaluates text models.