CFEval

CFEval is a benchmark for evaluating code generation and problem-solving capabilities.

Qwen3-235B-A22B-Thinking-2507 from Alibaba Cloud / Qwen Team currently leads the CFEval leaderboard with a score of 2134.000, followed by Qwen3-Next-80B-A3B-Thinking at 2071.000. These are the only 2 models evaluated so far.

Progress Over Time

[Interactive timeline: model performance on CFEval over time, with the state-of-the-art frontier shown for open and proprietary models.]

CFEval Leaderboard

2 models

Rank	Model	Organization	Score	Parameters	Context	Cost	License
1	Qwen3-235B-A22B-Thinking-2507	Alibaba Cloud / Qwen Team	2134.000	235B	—	—	—
2	Qwen3-Next-80B-A3B-Thinking	Alibaba Cloud / Qwen Team	2071.000	80B	—	—	—

FAQ

Common questions about CFEval.

What is the CFEval benchmark?

CFEval is a benchmark for evaluating code generation and problem-solving capabilities.

What is the CFEval leaderboard?

The CFEval leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Qwen3-235B-A22B-Thinking-2507 by Alibaba Cloud / Qwen Team leads with a score of 2134.000. The average score across all models is 2102.500.
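The leader and the average quoted above follow directly from the two listed scores. A minimal Python sketch of that arithmetic, assuming the leaderboard simply sorts by score and takes an unweighted mean (the variable names are illustrative, not part of any leaderboard API):

```python
# Scores as listed on the CFEval leaderboard.
scores = {
    "Qwen3-235B-A22B-Thinking-2507": 2134.0,
    "Qwen3-Next-80B-A3B-Thinking": 2071.0,
}

# Rank models from highest to lowest score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader, top_score = ranking[0]

# Unweighted arithmetic mean across all listed models (an assumption
# about how the page computes its "average score").
average = sum(scores.values()) / len(scores)

print(f"Leader: {leader} ({top_score:.3f})")  # Leader: Qwen3-235B-A22B-Thinking-2507 (2134.000)
print(f"Average: {average:.3f}")              # Average: 2102.500
```

With only two entries, the average is just the midpoint of the two scores, so it will shift noticeably as more models are added.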

What is the highest CFEval score?

The highest CFEval score is 2134.000, achieved by Qwen3-235B-A22B-Thinking-2507 from Alibaba Cloud / Qwen Team.

How many models are evaluated on CFEval?

Two models have been evaluated on the CFEval benchmark. Both results are self-reported; none has been verified.

What categories does CFEval cover?

CFEval is categorized under code. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code

90 models

LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

code

71 models

HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.

code

66 models

Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer through a terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

code

40 models

SWE-bench Multilingual

A multilingual benchmark for issue resolution in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, and is designed to evaluate large language models across diverse software ecosystems beyond Python.

code

27 models

Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

code

23 models