
HumanEval Plus

An enhanced version of HumanEval that extends the original test cases by roughly 80x using the EvalPlus framework, enabling rigorous evaluation of the functional correctness of LLM-synthesized code and detecting wrong code that the original test suite missed.
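
To illustrate the core idea, here is a minimal, self-contained Python sketch. The problem, candidate solution, and test inputs are hypothetical and not drawn from the actual dataset; the point is that a buggy generated function can pass a small HumanEval-style test set while failing on a larger EvalPlus-style extension.

```python
# Hypothetical example of why extended tests matter: a candidate that passes
# the original (base) tests can still fail on the extended (plus) inputs.

def candidate_abs_diff(a: int, b: int) -> int:
    """LLM-generated candidate: looks plausible but is wrong when a < b."""
    return a - b  # bug: should be abs(a - b)

def reference_abs_diff(a: int, b: int) -> int:
    """Canonical solution used as ground truth."""
    return abs(a - b)

# Small base test set (HumanEval-style): the bug goes unnoticed.
base_inputs = [(5, 3), (10, 2), (7, 0)]

# Extended test set (EvalPlus-style): edge cases expose the bug.
plus_inputs = base_inputs + [(3, 5), (0, 7), (-4, 4), (2, 2)]

def passes(candidate, reference, inputs) -> bool:
    """Functionally correct only if the candidate matches the reference on every input."""
    return all(candidate(*args) == reference(*args) for args in inputs)

print("base tests:", passes(candidate_abs_diff, reference_abs_diff, base_inputs))  # True
print("plus tests:", passes(candidate_abs_diff, reference_abs_diff, plus_inputs))  # False
```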

Paper: https://arxiv.org/abs/2305.01210

Progress Over Time

Interactive timeline showing model performance evolution on HumanEval Plus, with open and proprietary models plotted against the state-of-the-art frontier.

HumanEval Plus Leaderboard

1 model listed (columns: Context, Cost, License):

1. Mistral Small 3.2 24B Instruct (Mistral AI): 0.929

FAQ

Common questions about HumanEval Plus

HumanEval Plus is an enhanced version of HumanEval that extends the original test cases by roughly 80x using the EvalPlus framework, enabling rigorous evaluation of the functional correctness of LLM-synthesized code and detecting wrong code that the original test suite missed.
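
For reference, below is a minimal sketch of how problems are typically loaded and samples prepared for scoring. It assumes the evalplus Python package (pip install evalplus) and its documented get_human_eval_plus and write_jsonl helpers; the sample field names and the CLI invocation shown should be verified against the current EvalPlus release.

```python
# Hedged sketch of preparing samples for EvalPlus scoring; helper names and
# field names follow the EvalPlus README and may differ in newer releases.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Placeholder for an actual LLM call; returns a trivial body here.
    return "    pass\n"

problems = get_human_eval_plus()  # dict keyed by task_id, each with a "prompt" field
samples = [
    {"task_id": task_id, "solution": problem["prompt"] + generate_solution(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run from the command line, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```
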
The HumanEval Plus paper is available at https://arxiv.org/abs/2305.01210. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The HumanEval Plus leaderboard currently ranks 1 AI model based on its performance on this benchmark: Mistral Small 3.2 24B Instruct by Mistral AI leads with a score of 0.929, which is also the average across all listed models.
The highest HumanEval Plus score is 0.929, achieved by Mistral Small 3.2 24B Instruct from Mistral AI.
1 model has been evaluated on the HumanEval Plus benchmark, with 0 verified results and 1 self-reported result.
HumanEval Plus is categorized under code and reasoning. The benchmark evaluates text models.