
CodeForces

A competitive programming benchmark using problems from the CodeForces platform. It evaluates the code generation capabilities of LLMs on algorithmic problems with difficulty ratings from 800 to 2400. Problems span diverse categories, including dynamic programming, graph algorithms, data structures, and mathematics, and evaluation is standardized through direct submission to the platform.
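Scoring follows the CodeForces verdict model (Accepted, Wrong Answer, Time Limit Exceeded, Runtime Error). Since actual grading happens via direct platform submission, the following is only a minimal local sketch of that pass/fail verdict loop; the file name solution.py, the sample tests, and the 2-second limit are hypothetical stand-ins, not the benchmark's actual harness.

```python
import subprocess

# Hypothetical sample tests for a single problem: (stdin, expected stdout).
SAMPLE_TESTS = [
    ("3\n1 2 3\n", "6\n"),
    ("1\n5\n", "5\n"),
]

def judge(solution_path: str, tests: list[tuple[str, str]]) -> str:
    """Run a candidate solution on each test case and return a verdict."""
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=2,  # CodeForces-style per-test time limit
            )
        except subprocess.TimeoutExpired:
            return "TIME LIMIT EXCEEDED"
        if result.returncode != 0:
            return "RUNTIME ERROR"
        if result.stdout.strip() != expected.strip():
            return "WRONG ANSWER"
    return "ACCEPTED"

if __name__ == "__main__":
    print(judge("solution.py", SAMPLE_TESTS))
```

A solution passes only if it produces the expected output on every test within the time limit, which is why a single missed edge case yields Wrong Answer for the whole problem.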

Paper: https://arxiv.org/abs/2501.01257

Progress Over Time

Interactive timeline showing model performance evolution on CodeForces, with the state-of-the-art frontier and open vs. proprietary models indicated.

CodeForces Leaderboard

11 models • 0 verified
Rank | Model                  | Developer                 | Params | Context | Cost (input / output) | License
1    | DeepSeek-V3.2-Speciale | DeepSeek                  | 685B   |         |                       |
2    |                        | Alibaba Cloud / Qwen Team | 122B   |         |                       |
3    |                        | Alibaba Cloud / Qwen Team | 35B    |         |                       |
4    |                        |                           | 117B   | 131K    | $0.09 / $0.45         |
5    |                        | Alibaba Cloud / Qwen Team | 27B    |         |                       |
6    |                        |                           | 685B   |         |                       |
7    |                        |                           | 21B    |         |                       |
8    |                        |                           | 685B   |         |                       |
9    |                        |                           | 671B   | 164K    | $0.27 / $1.00         |
10   |                        | Alibaba Cloud / Qwen Team | 33B    | 128K    | $0.10 / $0.30         |
11   |                        |                           | 671B   | 131K    | $0.50 / $2.15         |

FAQ

Common questions about CodeForces

What is the CodeForces benchmark?
CodeForces is a competitive programming benchmark built from problems on the CodeForces platform. It evaluates the code generation capabilities of LLMs on algorithmic problems rated from 800 to 2400, covering dynamic programming, graph algorithms, data structures, and mathematics, with evaluation standardized through direct platform submission.
Is there a paper describing CodeForces?
The CodeForces paper is available at https://arxiv.org/abs/2501.01257. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on the CodeForces leaderboard?
The leaderboard ranks 11 AI models by their benchmark score. DeepSeek-V3.2-Speciale by DeepSeek currently leads with a score of 0.900, and the average score across all models is 0.768.
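For illustration, the headline figures quoted here (top score and average) are simple aggregates over the per-model scores. A minimal sketch, with hypothetical scores apart from the published 0.900:

```python
def leaderboard_stats(scores: list[float]) -> dict[str, float]:
    """Summarize per-model benchmark scores into headline figures."""
    return {"best": max(scores), "average": sum(scores) / len(scores)}

# Hypothetical scores: only the top score (0.900) is published on this page.
print(leaderboard_stats([0.900, 0.850, 0.800]))
# e.g. {'best': 0.9, 'average': 0.85} (up to float rounding)
```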
What is the highest CodeForces score?
The highest score is 0.900, achieved by DeepSeek-V3.2-Speciale from DeepSeek.
How many models have been evaluated on CodeForces?
11 models have been evaluated on the CodeForces benchmark, with 0 verified results and 11 self-reported results.
What categories does CodeForces belong to?
CodeForces is categorized under math and reasoning, and it evaluates text models.