Instruct HumanEval
An instruction-based variant of the HumanEval benchmark for evaluating large language models' code-generation capabilities, measuring functional correctness with the pass@k metric on programming problems.
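For reference, pass@k is the unbiased estimator introduced in the HumanEval paper: with n samples generated per problem, of which c pass the unit tests, pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch in Python (the function name and example numbers below are illustrative, not part of any official harness):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for a problem
    c: number of samples that pass all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of which pass
print(pass_at_k(200, 37, 1))   # 0.185 (equals c/n when k=1)
print(pass_at_k(200, 37, 10))  # ~0.88
```

Per-problem estimates are then averaged across all problems to produce the reported benchmark score.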
Progress Over Time
[Interactive timeline showing model performance evolution on Instruct HumanEval: a state-of-the-art frontier chart covering open and proprietary models.]
Instruct HumanEval Leaderboard
1 model
| Rank | Model | Params | Score | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Llama 3.1 Nemotron 70B Instruct (NVIDIA) | 70B | 0.738 | — | — | — |
FAQ
Common questions about Instruct HumanEval
What is Instruct HumanEval?
Instruct HumanEval is an instruction-based variant of the HumanEval benchmark for evaluating large language models' code-generation capabilities, measuring functional correctness with the pass@k metric on programming problems.
Where can I find the Instruct HumanEval paper?
Instruct HumanEval is based on the original HumanEval benchmark, introduced in "Evaluating Large Language Models Trained on Code" (Chen et al., 2021), available at https://arxiv.org/abs/2107.03374. That paper details the benchmark methodology, dataset creation, and evaluation criteria.
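As background on those evaluation criteria: a sample passes only if the generated code succeeds on every unit test for its problem. Below is a deliberately simplified sketch of such a check, assuming the HumanEval dataset schema (a prompt containing the function signature, a test block defining check(candidate), and an entry_point naming the target function); the real harness executes untrusted code in a sandboxed, time-limited subprocess rather than a bare exec():

```python
def check_correctness(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if the candidate completion passes all unit tests.

    WARNING: illustrative only. Never run untrusted model output with
    exec(); official harnesses use isolated, resource-limited subprocesses.
    """
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    try:
        exec(program, {"__name__": "__main__"})  # check() raises AssertionError on failure
        return True
    except Exception:
        return False

# Toy example (not a real benchmark problem):
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(check_correctness(prompt, completion, test_code, "add"))  # True
```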
Which model currently leads the Instruct HumanEval leaderboard?
The Instruct HumanEval leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Llama 3.1 Nemotron 70B Instruct by NVIDIA leads with a score of 0.738; with a single entry, the average score across all models is likewise 0.738.
What is the highest Instruct HumanEval score?
The highest Instruct HumanEval score is 0.738, achieved by Llama 3.1 Nemotron 70B Instruct from NVIDIA.
How many models have been evaluated on Instruct HumanEval?
1 model has been evaluated on the Instruct HumanEval benchmark, with 0 verified results and 1 self-reported result.
What does Instruct HumanEval evaluate?
Instruct HumanEval is categorized under general and evaluates text models.