HumanEval Plus
An enhanced version of HumanEval that extends the original test cases by 80x using the EvalPlus framework, enabling rigorous evaluation of the functional correctness of LLM-synthesized code and catching wrong code that the original tests let through.
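For context, a minimal sketch of how candidate solutions are typically collected and scored against the extended HumanEval+ test suite, assuming the open-source `evalplus` Python package (`pip install evalplus`). The import path and CLI invocation follow the EvalPlus project, and `generate_one_completion` is a hypothetical placeholder for your model's generation call.

```python
# Sketch only: collect one solution per HumanEval+ task, then score with EvalPlus.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical placeholder: query your LLM with the HumanEval prompt and
    # return a complete solution (function implementation) as a string.
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_one_completion(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring against the 80x-extended test suite is then run via the EvalPlus CLI, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```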
Progress Over Time
Timeline of model performance evolution on HumanEval Plus, showing the state-of-the-art frontier for open and proprietary models.
HumanEval Plus Leaderboard
1 model

| Rank | Model | Organization | Parameters | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | Mistral Small 3.2 24B Instruct | Mistral AI | 24B | 0.929 | — | — | — |
FAQ
Common questions about HumanEval Plus
The HumanEval Plus paper is available at https://arxiv.org/abs/2305.01210. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The HumanEval Plus leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Mistral Small 3.2 24B Instruct by Mistral AI leads with a score of 0.929, which is also the average score across all listed models (see the note after the FAQ on how such correctness scores are typically estimated).
The highest HumanEval Plus score is 0.929, achieved by Mistral Small 3.2 24B Instruct from Mistral AI.
1 model has been evaluated on the HumanEval Plus benchmark, with 0 verified results and 1 self-reported result.
HumanEval Plus is categorized under code and reasoning. The benchmark evaluates text models.
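The leaderboard does not state the exact metric behind the 0.929 figure; HumanEval-style benchmarks usually report pass@k (most often pass@1), the estimated fraction of problems for which a sampled solution passes every test. As an illustration only, and assuming the score is a pass@k-style rate, here is the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled solutions per problem,
    of which c pass all tests, estimates 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples for a problem, 9 passing the extended tests,
# gives a pass@1 estimate of 0.9 for that problem.
print(pass_at_k(n=10, c=9, k=1))
```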