
HumanEval-Mul

A multilingual variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of the 164 original programming problems, which assess language comprehension, algorithms, and simple mathematics.
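To make the scoring protocol concrete, the sketch below shows how functional correctness is typically measured for a HumanEval-style problem: the model completes a function from its docstring, and the completion counts as correct only if it passes the problem's unit tests. The toy problem, the `check` harness, and the `evaluate_candidate` helper are illustrative assumptions, not items from the actual HumanEval-Mul dataset.

```python
# Minimal sketch of HumanEval-style functional-correctness scoring (illustrative only).

def evaluate_candidate(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if the synthesized function passes the problem's unit tests."""
    program = prompt + completion + "\n" + test_code
    namespace: dict = {}
    try:
        exec(program, namespace)                    # define the candidate function and its checker
        namespace["check"](namespace[entry_point])  # HumanEval-style `check(candidate)` harness
        return True
    except Exception:
        return False                                # any error or failed assert counts as incorrect


# Illustrative problem: docstring-only prompt, model-produced body, and a test harness.
prompt = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
)
completion = "    return a + b\n"
test_code = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)

print(evaluate_candidate(prompt, completion, test_code, entry_point="add"))  # True
```

In HumanEval-Mul this check is repeated per programming language, with the prompt, completion, and tests expressed in that language.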

Paper: https://arxiv.org/abs/2107.03374

Progress Over Time

[Chart: interactive timeline showing model performance evolution on HumanEval-Mul, distinguishing open and proprietary models and tracking the state-of-the-art frontier.]

HumanEval-Mul Leaderboard

2 models

Rank  Model        Organization  Parameters  Context  Cost (input / output)
1     DeepSeek-V3  DeepSeek      671B        131K     $0.27 / $1.10
2     —            —             236B        8K       $0.14 / $0.28

FAQ

Common questions about HumanEval-Mul

What is HumanEval-Mul?
HumanEval-Mul is a multilingual variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of the 164 original programming problems, which assess language comprehension, algorithms, and simple mathematics.
Where can I find the HumanEval-Mul paper?
The paper underlying HumanEval-Mul (the original HumanEval paper) is available at https://arxiv.org/abs/2107.03374. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How does the HumanEval-Mul leaderboard work?
The HumanEval-Mul leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, DeepSeek-V3 by DeepSeek leads with a score of 0.826. The average score across all models is 0.782.
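As a quick check on those figures (assuming the reported average is a simple arithmetic mean over the two listed models), the score of the second, unnamed model can be recovered:

$$s_2 = 2\bar{s} - s_1 = 2 \times 0.782 - 0.826 = 0.738$$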
What is the highest HumanEval-Mul score?
The highest HumanEval-Mul score is 0.826, achieved by DeepSeek-V3 from DeepSeek.
How many models have been evaluated on HumanEval-Mul?
2 models have been evaluated on the HumanEval-Mul benchmark, with 0 verified results and 2 self-reported results.
What type of benchmark is HumanEval-Mul?
HumanEval-Mul is categorized under reasoning. The benchmark evaluates text models with multilingual support.