
HumanEval-ER

A variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
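To make the task format concrete, here is a minimal sketch of how a HumanEval-style problem is scored: the model is given a function signature and docstring and must synthesize a body that passes held-out unit tests. The problem, completion, and tests below are illustrative placeholders, not items from the actual dataset.

```python
# Hypothetical HumanEval-style problem: the prompt is a signature plus docstring,
# the model must synthesize the body, and correctness is judged by unit tests.
PROMPT = '''
def add_elements(numbers: list[int]) -> int:
    """Return the sum of all even numbers in the list."""
'''

# A candidate completion as a model might produce it (illustrative only).
COMPLETION = '''
    return sum(n for n in numbers if n % 2 == 0)
'''

# Held-out unit tests used to decide functional correctness.
TEST = '''
assert add_elements([1, 2, 3, 4]) == 6
assert add_elements([]) == 0
assert add_elements([1, 3, 5]) == 0
'''

def check(prompt: str, completion: str, test: str) -> bool:
    """Run prompt + completion + tests in a fresh namespace; pass iff nothing raises."""
    namespace: dict = {}
    try:
        exec(prompt + completion + "\n" + test, namespace)
        return True
    except Exception:
        return False

if __name__ == "__main__":
    print("passed" if check(PROMPT, COMPLETION, TEST) else "failed")
```

In practice the official HumanEval harness executes completions in a sandboxed process with timeouts; the inline exec above is only for illustration.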

Paper: https://arxiv.org/abs/2107.03374

Progress Over Time

Interactive timeline showing model performance evolution on HumanEval-ER


HumanEval-ER Leaderboard

1 model
Rank | Model | Organization | Parameters | Context | Cost (input / output) | Score
1 | Kimi K2 Instruct | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | 0.811

FAQ

Common questions about HumanEval-ER

What is HumanEval-ER?
A variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
Where can I find the HumanEval-ER paper?
The HumanEval-ER paper is available at https://arxiv.org/abs/2107.03374. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
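The paper above also defines the pass@k metric commonly used to report functional correctness on HumanEval-style benchmarks. As a hedged sketch (whether HumanEval-ER uses exactly this protocol is an assumption here), the unbiased pass@k estimator from that paper can be computed as follows; the sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: samples that pass the unit tests
    k: evaluation budget
    Returns the estimated probability that at least one of k samples is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples per problem, 162 passing -> pass@1.
print(round(pass_at_k(200, 162, 1), 3))  # 0.81
```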
How are models ranked on the HumanEval-ER leaderboard?
The HumanEval-ER leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.811. The average score across all models is 0.811.
What is the highest HumanEval-ER score?
The highest HumanEval-ER score is 0.811, achieved by Kimi K2 Instruct from Moonshot AI.
How many models have been evaluated on HumanEval-ER?
1 model has been evaluated on the HumanEval-ER benchmark, with 0 verified results and 1 self-reported result.
What category does HumanEval-ER fall under?
HumanEval-ER is categorized under reasoning. The benchmark evaluates text models.