LiveCodeBench v5
LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
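The release-date annotation is what makes the contamination-aware protocol work: to evaluate a model fairly, you keep only problems published after its training cutoff. Below is a minimal sketch of that filtering step, assuming the public LiveCodeBench release on Hugging Face; the exact dataset name, version tag, and field names (e.g. `contest_date`) are assumptions and may differ.

```python
from datetime import datetime
from datasets import load_dataset  # Hugging Face `datasets` library

# Load the LiveCodeBench code-generation problems (dataset name and
# version_tag are assumptions based on the public release).
problems = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v5",
    split="test",
    trust_remote_code=True,
)

# Keep only problems released after the model's training cutoff, so the
# model cannot have seen them during training.
cutoff = datetime(2024, 9, 1)  # example cutoff; set this per model
fresh = [
    p for p in problems
    if datetime.fromisoformat(str(p["contest_date"])) > cutoff
]
print(f"{len(fresh)} of {len(problems)} problems postdate the cutoff")
```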
Progress Over Time
[Interactive timeline of model performance on LiveCodeBench v5 over time, with a state-of-the-art frontier line and models marked as open or proprietary.]
LiveCodeBench v5 Leaderboard
8 models
| Rank | Model | Organization | Parameters | Context | Cost (input / output per 1M tokens) | License |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | — | 1.0M | $1.25 / $10.00 | — |
| 2 | — | Google | — | 1.0M | $0.30 / $2.50 | — |
| 3 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | — |
| 4 | — | Google | — | 1.0M | $0.07 / $0.30 | — |
| 5 | — | Google | 8B | 32K | $20.00 / $40.00 | — |
| 5 | — | — | 2B | — | — | — |
| 7 | — | — | 2B | — | — | — |
| 7 | — | Google | 8B | — | — | — |
FAQ
Common questions about LiveCodeBench v5
What is LiveCodeBench?
LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
Where can I find the LiveCodeBench paper?
The LiveCodeBench paper is available at https://arxiv.org/abs/2403.07974. It describes the benchmark methodology, dataset construction, and evaluation criteria.
Which model leads the LiveCodeBench v5 leaderboard?
The LiveCodeBench v5 leaderboard ranks 8 AI models by their performance on this benchmark. Gemini 2.5 Pro by Google currently leads with a score of 0.756; the average score across all models is 0.398.
What is the highest LiveCodeBench v5 score?
The highest LiveCodeBench v5 score is 0.756, achieved by Gemini 2.5 Pro from Google.
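These scores are pass@1-style rates: the fraction of problems for which a sampled solution passes all hidden tests. When multiple samples are drawn per problem, benchmarks in this family typically use the unbiased pass@k estimator from Chen et al. (2021); a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the problem (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: with 10 samples per problem and 4 passing, pass@1 is 0.4.
print(pass_at_k(n=10, c=4, k=1))
```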
How many models have been evaluated on LiveCodeBench v5?
8 models have been evaluated on the LiveCodeBench v5 benchmark; all 8 results are self-reported and none are independently verified.
What categories does LiveCodeBench v5 belong to?
LiveCodeBench v5 is categorized under general and reasoning, and it evaluates text models.