HumanEval-Mul
A multilingual variant of the HumanEval benchmark that measures the functional correctness of programs synthesized from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
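Each task pairs a function signature and docstring with hidden unit tests; a completion counts as correct only if the tests pass. A minimal sketch of this setup (the task shown is hypothetical, not one of the 164 actual problems):

```python
# Illustrative HumanEval-style task: the model sees the prompt (signature
# plus docstring) and must generate the function body.
PROMPT = '''
def add_elements(xs: list[int], ys: list[int]) -> list[int]:
    """Return the element-wise sum of two equal-length integer lists."""
'''

# A candidate completion, as a model might generate it.
COMPLETION = "    return [x + y for x, y in zip(xs, ys)]\n"

def check(completion: str) -> bool:
    """Functional correctness: execute the completed program and run
    hidden unit tests against it (simplified harness, no sandboxing)."""
    namespace: dict = {}
    exec(PROMPT + completion, namespace)
    f = namespace["add_elements"]
    return f([1, 2], [3, 4]) == [4, 6] and f([], []) == []
```

A real harness executes untrusted model output in a sandbox with timeouts; this sketch omits that for brevity.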
Progress Over Time
[Interactive timeline showing model performance evolution on HumanEval-Mul, with the state-of-the-art frontier and open vs. proprietary models marked.]
HumanEval-Mul Leaderboard
2 models
| Rank | Model | Params | Context | Cost (input / output per 1M tokens) |
|---|---|---|---|---|
| 1 | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 2 | DeepSeek | 236B | 8K | $0.14 / $0.28 |
FAQ
Common questions about HumanEval-Mul
HumanEval-Mul derives from the original HumanEval benchmark, whose paper is available at https://arxiv.org/abs/2107.03374. The paper details the benchmark's methodology, dataset creation, and evaluation criteria.
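That paper evaluates functional correctness with the pass@k metric, reporting for each problem the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of its unbiased estimator, where n samples are drawn per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that a
    random size-k subset of the n samples contains no passing solution."""
    if n - c < k:
        # Fewer than k failing samples exist, so every subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem estimates are then averaged over all 164 problems to give the benchmark score.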
The HumanEval-Mul leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, DeepSeek-V3 by DeepSeek leads with a score of 0.826. The average score across all models is 0.782.
The highest HumanEval-Mul score is 0.826, achieved by DeepSeek-V3 from DeepSeek.
Two models have been evaluated on the HumanEval-Mul benchmark; both results are self-reported, and none have been independently verified.
HumanEval-Mul is categorized under reasoning. The benchmark evaluates text models with multilingual support.