HumanEval-Mul Leaderboard

Progress Over Time

[Interactive timeline: model performance on HumanEval-Mul over time, showing the state-of-the-art frontier for open and proprietary models.]

HumanEval-Mul Leaderboard

2 models

Rank	Model	Organization	Parameters	Context	Cost	License
1	DeepSeek-V3	DeepSeek	671B	131K	$0.27 / $1.10	not listed
2	DeepSeek-V2.5	DeepSeek	236B	8K	$0.14 / $0.28	not listed
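
The table does not show per-model scores, but the summary figures quoted in the FAQ on this page (a top score of 0.826 and a mean of 0.782 over two models) imply the second model's score. The following is a minimal sketch of that arithmetic. It assumes the reported mean is a plain unweighted average and ignores rounding in the published figures, so the result is a back-computed estimate, not a reported value.

```python
# Figures quoted on this page.
top_score = 0.826   # DeepSeek-V3
mean_score = 0.782  # average across all models
n_models = 2

# Assumption: mean_score is an unweighted average of the two scores,
# so the second score is whatever remains of the total.
implied_second_score = round(mean_score * n_models - top_score, 3)

print(f"Implied second score: {implied_second_score}")
```

Because both inputs are rounded to three decimals, the true value could differ slightly in the last digit.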

FAQ

Common questions about HumanEval-Mul

What is the HumanEval-Mul benchmark?

HumanEval-Mul is a multilingual variant of the HumanEval benchmark. It measures functional correctness for synthesizing programs from docstrings and builds on 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.

Where can I find the HumanEval-Mul paper?

The reference given for HumanEval-Mul is https://arxiv.org/abs/2107.03374, the paper that introduced the original HumanEval benchmark. It describes the benchmark methodology, dataset creation, and evaluation criteria.

What is the HumanEval-Mul leaderboard?

The HumanEval-Mul leaderboard ranks two AI models by their performance on this benchmark. DeepSeek-V3 by DeepSeek currently leads with a score of 0.826, and the average score across all models is 0.782.

What is the highest HumanEval-Mul score?

The highest HumanEval-Mul score is 0.826, achieved by DeepSeek-V3 from DeepSeek.

How many models are evaluated on HumanEval-Mul?

Two models have been evaluated on the HumanEval-Mul benchmark. Both results are self-reported, and none are verified.

What categories does HumanEval-Mul cover?

HumanEval-Mul is categorized under reasoning. The benchmark evaluates text models with multilingual support.
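
To make "functional correctness" concrete, here is a minimal sketch of how a HumanEval-style score could be computed: each generated solution is run against unit tests, and the score is the fraction of problems whose solution passes. The problem, the two completions, and the helper `passes_tests` are hypothetical and are not taken from the benchmark. The sketch also assumes one sample per problem (often called pass@1); this page does not state the sampling setup behind the reported scores.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate defines `entry_point` and every assertion passes.

    Illustrative only: real harnesses run untrusted code in a sandbox
    with timeouts instead of calling exec() directly.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)              # defines the candidate function
        exec(test_src, namespace)                   # defines check(candidate)
        namespace["check"](namespace[entry_point])  # raises on any failed assertion
        return True
    except Exception:
        return False


# Hypothetical toy problem in the HumanEval style: a unit test for `add`.
test_src = '''
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

# Two hypothetical model completions: one correct, one buggy.
completions = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
]

results = [passes_tests(src, test_src, "add") for src in completions]
score = sum(results) / len(results)  # fraction of completions that pass
print(results, score)
```

Under these assumptions, a leaderboard score such as 0.826 would be read as the share of problems solved on the first attempt, aggregated over the programming languages covered.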