HumanEval-ER
A variant of the HumanEval benchmark that measures functional correctness of programs synthesized from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
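Concretely, each HumanEval-style problem pairs a function signature and docstring with hidden unit tests, and a model's completion counts as correct only if those tests pass. A minimal sketch of that check, using the first HumanEval problem (the candidate body and `check` harness below are illustrative, not the benchmark's exact test suite):

```python
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    # A candidate completion a model might generate from the docstring:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

def check(candidate):
    # Functional correctness: hidden unit tests run against the completion.
    assert candidate([1.0, 2.0, 3.9, 4.0], 0.3) is True   # 3.9 and 4.0 are 0.1 apart
    assert candidate([1.0, 2.0, 3.0], 0.5) is False        # all gaps are >= 1.0

check(has_close_elements)  # raises AssertionError if the completion is wrong
```

The benchmark score is the fraction of problems whose generated solutions pass such tests, rather than any text-similarity metric.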
Progress Over Time
[Interactive timeline showing model performance evolution on HumanEval-ER, with a state-of-the-art frontier and open vs. proprietary models distinguished.]
HumanEval-ER Leaderboard
1 model
| Rank | Model | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|---|
| 1 | Kimi K2 Instruct | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
FAQ
Common questions about HumanEval-ER
HumanEval-ER builds on the original HumanEval benchmark, whose paper is available at https://arxiv.org/abs/2107.03374. The paper details the benchmark methodology, dataset creation, and evaluation criteria.
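The paper's headline evaluation criterion is pass@k: for each problem, generate n samples, count the c that pass the unit tests, and estimate the probability that at least one of k samples is correct with the unbiased estimator 1 − C(n−c, k)/C(n, k). A small sketch of that estimator (the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples that passed the tests,
    k = budget of samples considered."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must include at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # with k=1 this reduces to c/n = 0.3
```

With only one self-reported result on this leaderboard, the listed score is most plausibly a pass@1-style figure, though the page does not say so explicitly.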
The HumanEval-ER leaderboard currently ranks a single AI model: Kimi K2 Instruct by Moonshot AI leads with a score of 0.811. With only one entry, the average score across all models is also 0.811.
The highest HumanEval-ER score is 0.811, achieved by Kimi K2 Instruct from Moonshot AI.
One model has been evaluated on the HumanEval-ER benchmark, with 0 verified results and 1 self-reported result.
HumanEval-ER is categorized under reasoning. The benchmark evaluates text models.