
HumanEval

HumanEval is a benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
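Concretely, each problem supplies a Python function signature and docstring as the prompt, and a model's completion is judged purely by whether the assembled program passes the problem's unit tests when executed, not by how closely it matches a reference solution. The sketch below illustrates that pass/fail check with a made-up problem and completion (running_max is not an actual dataset entry); a real evaluation harness would sandbox the execution.

```python
# Illustrative sketch of the HumanEval task format and its functional-correctness check.
# The problem, completion, and tests below are made up for illustration only.

prompt = '''
def running_max(numbers: list) -> list:
    """Return a list where element i is the maximum of numbers[: i + 1]."""
'''

# Hypothetical model completion: the function body that continues the prompt.
completion = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

# HumanEval-style unit tests; the problem counts as solved only if every assert holds.
tests = '''
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
'''

program = prompt + completion + tests
namespace = {}
try:
    exec(program, namespace)   # any AssertionError or crash counts as a failure
    print("pass")
except Exception:
    print("fail")
```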

Paper

Progress Over Time

[Interactive timeline of model performance on HumanEval over time, highlighting the state-of-the-art frontier and distinguishing open from proprietary models.]

HumanEval Leaderboard

65 models

[Leaderboard table: 65 models ranked by HumanEval score. Each entry lists the developing organization (among them Moonshot AI, OpenAI, Alibaba Cloud / Qwen Team, Mistral AI, Anthropic, Amazon, Microsoft, and Sarvam AI), parameter count, context window, input/output cost, and license. Kimi K2 0905 by Moonshot AI holds the top position with a score of 0.945.]

FAQ

Common questions about HumanEval

What is HumanEval?
HumanEval is a benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
Where can I find the HumanEval paper?
The HumanEval paper is available at https://arxiv.org/abs/2107.03374. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
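The paper reports results as pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests, estimated without bias from n >= k samples per problem of which c pass. A minimal sketch of that estimator follows; the example numbers are illustrative, not taken from any leaderboard entry.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), computed in a numerically stable product form."""
    if n - c < k:
        return 1.0          # every draw of k samples contains a correct one
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples per problem, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))    # ~0.185 (equals c / n when k = 1)
print(pass_at_k(n=200, c=37, k=10))   # substantially higher: any of 10 tries may pass
```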
How do models rank on the HumanEval leaderboard?
The HumanEval leaderboard ranks 65 AI models by their performance on this benchmark. Currently, Kimi K2 0905 by Moonshot AI leads with a score of 0.945, and the average score across all models is 0.809.
What is the highest HumanEval score?
The highest HumanEval score is 0.945, achieved by Kimi K2 0905 from Moonshot AI.
How many models have been evaluated on HumanEval?
65 models have been evaluated on the HumanEval benchmark, with 0 verified results and 64 self-reported results.
What categories does HumanEval fall under?
HumanEval is categorized under code and reasoning. The benchmark evaluates text models.