
Instruct HumanEval

An instruction-based variant of the HumanEval benchmark for evaluating large language models' code-generation capabilities. Functional correctness on programming problems is measured with the pass@k metric.
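The pass@k metric mentioned above is commonly computed with the unbiased estimator introduced in the HumanEval paper (Chen et al., 2021): given n generated samples per problem of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), the probability that at least one
    of k samples drawn (without replacement) from n generations
    passes, given that c of the n generations pass.
    """
    if n - c < k:
        # Fewer failing samples than k: every size-k draw
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 samples of which c = 1 passes, pass@1 evaluates to 0.5, matching the intuitive per-sample pass rate.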

Paper

Progress Over Time

[Interactive timeline of model performance on Instruct HumanEval, split into open and proprietary models, with a state-of-the-art frontier line.]

Instruct HumanEval Leaderboard

1 model listed (columns: Context, Cost, License):

1. Llama 3.1 Nemotron 70B Instruct — score 0.738

FAQ

Common questions about Instruct HumanEval

Instruct HumanEval is an instruction-based variant of the HumanEval benchmark that evaluates large language models' code-generation capabilities; functional correctness on programming problems is measured with the pass@k metric.
The Instruct HumanEval paper is available at https://arxiv.org/abs/2107.03374. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The Instruct HumanEval leaderboard currently lists a single model: Llama 3.1 Nemotron 70B Instruct by NVIDIA leads with a score of 0.738, which is therefore also the average score across all listed models.
The highest Instruct HumanEval score is 0.738, achieved by Llama 3.1 Nemotron 70B Instruct from NVIDIA.
One model has been evaluated on the Instruct HumanEval benchmark, with zero verified results and one self-reported result.
Instruct HumanEval falls under the general category and evaluates text models.