
MultiPL-E HumanEval

MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across multiple programming languages. It extends the HumanEval benchmark to 18 additional programming languages, enabling evaluation of code generation models across diverse programming paradigms and providing insights into how models generalize programming knowledge across language boundaries.
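To make the idea of translating unit tests across languages concrete, here is a minimal sketch that renders Python-style test cases as JavaScript assertions. It is not the MultiPL-E implementation: the helper names (`to_js_literal`, `translate_test_case`, `translate_test_suite`) and the choice of JavaScript as the target are assumptions made for illustration, and it only handles values that JSON can represent.

```python
import json


def to_js_literal(value):
    """Render a JSON-compatible Python value as a JavaScript literal.

    Hypothetical helper: json.dumps maps True/False/None to
    true/false/null and lists to array literals.
    """
    return json.dumps(value)


def translate_test_case(args, expected, func_name="candidate"):
    """Turn one (args, expected) pair into a Node.js assertion line."""
    js_args = ", ".join(to_js_literal(a) for a in args)
    return f"assert.deepStrictEqual({func_name}({js_args}), {to_js_literal(expected)});"


def translate_test_suite(cases, func_name="candidate"):
    """Translate a list of (args, expected) pairs into a JavaScript test script."""
    lines = ['const assert = require("node:assert");']
    lines += [translate_test_case(args, expected, func_name) for args, expected in cases]
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative test cases, not taken from HumanEval.
    cases = [([1, 2], 3), ([[1, 2, 3], True], None)]
    print(translate_test_suite(cases))
```

A real translator also has to handle the function signature, type annotations, and language-specific equality semantics, which this sketch leaves out.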

MultiPL-E HumanEval Leaderboard

3 models

Rank  Parameters  Context  Cost             License
1     405B        128K     $0.89 / $0.89
2     70B         128K     $0.20 / $0.20
3     8B          131K     $0.03 / $0.03

FAQ

Common questions about MultiPL-E HumanEval

The MultiPL-E paper is available at https://arxiv.org/abs/2208.08227. It describes the benchmark's methodology, dataset creation, and evaluation criteria.
The MultiPL-E HumanEval leaderboard ranks three models by their score on this benchmark. Llama 3.1 405B Instruct by Meta currently leads with a score of 0.752, and the average score across all three models is 0.638.
The highest MultiPL-E HumanEval score is 0.752, achieved by Llama 3.1 405B Instruct from Meta.
Three models have been evaluated on the MultiPL-E HumanEval benchmark. All three results are self-reported, and none are verified.
MultiPL-E HumanEval is categorized under general and language. The benchmark evaluates text models with multilingual support.
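The leaderboard average and top score quoted above constrain the two scores that are not listed individually. The short sketch below works through that arithmetic. The function name `mean_of_remaining` is hypothetical, and because the published figures are rounded to three decimals, the result is only approximate.

```python
def mean_of_remaining(overall_mean, n_models, known_scores):
    """Mean score of the models whose individual scores are not given.

    Assumes overall_mean is the plain arithmetic mean over n_models.
    """
    remaining = n_models - len(known_scores)
    if remaining <= 0:
        raise ValueError("no remaining models to average over")
    total = overall_mean * n_models
    return (total - sum(known_scores)) / remaining


if __name__ == "__main__":
    # 3 models, mean 0.638, top score 0.752 (figures from the leaderboard text).
    print(round(mean_of_remaining(0.638, 3, [0.752]), 3))  # 0.581
```

Under these assumptions, the second- and third-ranked models average roughly 0.581 between them, well below the leader's 0.752.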