MultiPL-E HumanEval
MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across multiple programming languages. It extends the HumanEval benchmark to 18 additional programming languages, enabling evaluation of code generation models across diverse programming paradigms and providing insights into how models generalize programming knowledge across language boundaries.
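In practice, each HumanEval problem becomes a target-language prompt (a function signature plus doc comment) and a set of translated unit tests; a model's completion is appended to the prompt and executed against those tests. Below is a minimal sketch of inspecting one translated problem, assuming the dataset is published on Hugging Face as nuprl/MultiPL-E with per-language configs such as humaneval-ts and fields named prompt, tests, and stop_tokens (these names are assumptions, not taken from this page):

```python
# Minimal sketch: inspect one TypeScript translation of a HumanEval problem.
# Assumes the "nuprl/MultiPL-E" Hugging Face dataset exposes per-language
# configs like "humaneval-ts" with "prompt", "tests", and "stop_tokens" fields.
from datasets import load_dataset

problems = load_dataset("nuprl/MultiPL-E", "humaneval-ts", split="test")
problem = problems[0]

print(problem["prompt"])       # translated signature + doc comment to complete
print(problem["tests"])        # translated unit tests; prompt + completion + tests
                               # are concatenated and executed to score the completion
print(problem["stop_tokens"])  # language-specific stop sequences for generation
```

A completion counts as passing only if the assembled program runs all of its tests successfully; aggregating per-problem pass rates yields the scores reported on the leaderboard below.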
Progress Over Time
[Interactive timeline: model performance over time on MultiPL-E HumanEval, showing the state-of-the-art frontier with open and proprietary models distinguished]
MultiPL-E HumanEval Leaderboard
3 models
| Rank | Model | Parameters | Context | Cost (input / output) | Score | License |
|---|---|---|---|---|---|---|
| 1 | Llama 3.1 405B Instruct (Meta) | 405B | 128K | $0.89 / $0.89 | 0.752 | |
| 2 | | 70B | 128K | $0.20 / $0.20 | | |
| 3 | | 8B | 131K | $0.03 / $0.03 | | |
FAQ
Common questions about MultiPL-E HumanEval
What is MultiPL-E HumanEval?
MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across multiple programming languages. It extends the HumanEval benchmark to 18 additional programming languages, enabling evaluation of code generation models across diverse programming paradigms and providing insights into how models generalize programming knowledge across language boundaries.
Where can I find the MultiPL-E paper?
The MultiPL-E paper is available at https://arxiv.org/abs/2208.08227. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on MultiPL-E HumanEval?
The MultiPL-E HumanEval leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Llama 3.1 405B Instruct by Meta leads with a score of 0.752. The average score across all models is 0.638.
What is the highest MultiPL-E HumanEval score?
The highest MultiPL-E HumanEval score is 0.752, achieved by Llama 3.1 405B Instruct from Meta.
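HumanEval-style scores are conventionally pass@k estimates; the single figure reported here is most likely pass@1, though this page does not say so explicitly. Below is a minimal sketch of the standard unbiased pass@k estimator introduced with the original HumanEval benchmark:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled completions per problem,
    of which c pass the unit tests, estimate P(at least one of k passes)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a passing completion
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 samples per problem, 150 of which pass the tests.
print(pass_at_k(200, 150, 1))  # 0.75
```

With k = 1 this reduces to the fraction of sampled completions that pass, which is the usual headline number.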
How many models have been evaluated on MultiPL-E HumanEval?
3 models have been evaluated on the MultiPL-E HumanEval benchmark, with 0 verified results and 3 self-reported results.
What categories does MultiPL-E HumanEval fall under?
MultiPL-E HumanEval is categorized under general and language. The benchmark evaluates text models, and its multilingual coverage refers to programming languages rather than natural languages.