HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
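
Each task supplies a Python function signature and a docstring; a model must generate the function body, and the completion is judged by executing unit tests rather than by comparing text. A minimal sketch of the task format (the problem and tests below are illustrative, not drawn from the actual dataset):

```python
from typing import List

# Prompt shown to the model: signature plus docstring only.
def running_max(numbers: List[int]) -> List[int]:
    """Return a list whose element i is the maximum of numbers[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
    # --- model-generated completion begins here ---
    result: List[int] = []
    best = float("-inf")
    for n in numbers:
        best = max(best, n)
        result.append(best)
    return result

# Grading: the sample passes only if the hidden tests run without error.
def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []

check(running_max)
```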

Paper: https://arxiv.org/abs/2107.03374

Progress Over Time

[Interactive timeline of model performance on HumanEval, charting the state-of-the-art frontier across open and proprietary models.]

HumanEval Leaderboard

65 models • 0 verified
| Rank | Organization | Score | Params | Context | Input $ | Output $ |
|-----:|:-------------|------:|-------:|--------:|--------:|---------:|
| 1 | Moonshot AI | 0.945 | 1.0T | 262K | $0.60 | $2.50 |
| 2 | – | 0.937 | – | 200K | $3.00 | $15.00 |
| 3 | OpenAI | 0.934 | – | 400K | $1.25 | $10.00 |
| 4 | Moonshot AI | 0.933 | 1.0T | 200K | $0.50 | $0.50 |
| 5 | Alibaba Cloud / Qwen Team | 0.927 | 32B | 128K | $0.09 | $0.09 |
| 6 | OpenAI | 0.924 | – | 128K | $3.00 | $12.00 |
| 7 | Sarvam AI | 0.921 | 30B | – | – | – |
| 8 | Mistral AI | 0.920 | 123B | 128K | $2.00 | $6.00 |
| 8 | – | 0.920 | – | 200K | $3.00 | $15.00 |
| 10 | Alibaba Cloud / Qwen Team | 0.915 | 34B | – | – | – |
| 11 | OpenAI | 0.902 | – | 128K | $2.50 | $10.00 |
| 12 | – | 0.897 | 8B | 128K | $0.50 | $0.50 |
| 12 | – | 0.897 | 8B | – | – | – |
| 14 | – | 0.896 | – | – | – | – |
| 15 | Amazon | 0.890 | – | 300K | $0.80 | $3.20 |
| 15 | – | 0.890 | 236B | 8K | $0.14 | $0.28 |
| 15 | – | 0.890 | 405B | 128K | $0.89 | $0.89 |
| 18 | – | 0.884 | 24B | – | – | – |
| 18 | – | 0.884 | 560B | 128K | $0.30 | $1.20 |
| 20 | – | 0.884 | 70B | 128K | $0.20 | $0.20 |
| 20 | Alibaba Cloud / Qwen Team | 0.884 | 7B | – | – | – |
| 20 | Alibaba Cloud / Qwen Team | 0.884 | 33B | – | – | – |
| 20 | – | 0.884 | – | 128K | $2.00 | $10.00 |
| 24 | – | 0.881 | – | 200K | $0.80 | $4.00 |
| 24 | OpenAI | 0.881 | – | 200K | $15.00 | $60.00 |
| 26 | OpenAI | 0.880 | – | 128K | $75.00 | $150.00 |
| 27 | – | 0.878 | 27B | 131K | $0.10 | $0.20 |
| 28 | – | 0.872 | – | 128K | $0.15 | $0.60 |
| 29 | – | 0.871 | – | 128K | $10.00 | $30.00 |
| 30 | Alibaba Cloud / Qwen Team | 0.866 | 73B | 131K | $0.35 | $0.40 |
| 31 | Alibaba Cloud / Qwen Team | 0.860 | 72B | – | – | – |
| 32 | – | 0.857 | – | – | – | – |
| 33 | Amazon | 0.854 | – | 300K | $0.06 | $0.24 |
| 33 | – | 0.854 | 12B | 131K | $0.05 | $0.10 |
| 35 | Anthropic | 0.849 | – | 200K | $15.00 | $75.00 |
| 36 | – | 0.848 | 24B | 32K | $0.07 | $0.14 |
| 36 | Alibaba Cloud / Qwen Team | 0.848 | 8B | 131K | $0.30 | $0.30 |
| 38 | – | 0.841 | – | 2.1M | $2.50 | $10.00 |
| 39 | Alibaba Cloud / Qwen Team | 0.835 | 15B | – | – | – |
| 40 | Microsoft | 0.826 | 15B | 16K | $0.07 | $0.14 |
| 41 | – | 0.824 | 7B | – | – | – |
| 42 | – | 0.811 | – | 128K | $0.03 | $0.14 |
| 42 | Mistral AI | 0.811 | 22B | – | – | – |
| 44 | – | 0.805 | 70B | 128K | $0.20 | $0.20 |
| 45 | Alibaba Cloud / Qwen Team | 0.799 | 8B | – | – | – |
| 46 | Alibaba Cloud / Qwen Team | 0.787 | 7B | – | – | – |
| 47 | – | 0.759 | – | 200K | $0.25 | $1.25 |
| 48 | – | 0.750 | 2B | – | – | – |
| 48 | – | 0.750 | 8B | 32K | $20.00 | $40.00 |
| 50 | – | 0.743 | – | 1.0M | $0.15 | $0.60 |
Showing 1-50 of 65
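
The scores above are functional-correctness pass rates, and nearly all of them are self-reported. For reference, results of this kind are typically produced with OpenAI's human-eval harness (https://github.com/openai/human-eval); the sketch below assumes that package is installed, and generate_one_completion is a hypothetical stand-in for your own model call:

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your model that
    # returns code completing the prompt's function body. Returning
    # "pass" keeps the script runnable but scores 0 on every task.
    return "    pass\n"

problems = read_problems()  # the 164 tasks, keyed by task_id
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Scoring executes untrusted generated code, so the harness is run
# separately (and ideally sandboxed):
#   evaluate_functional_correctness samples.jsonl
```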

FAQ

Common questions about HumanEval

What is HumanEval?
A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.

Where can I find the HumanEval paper?
The HumanEval paper is available at https://arxiv.org/abs/2107.03374. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
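
The paper's headline metric is pass@k: the probability that at least one of k sampled completions passes all unit tests. Given n samples per task of which c pass, the paper proposes the unbiased per-task estimator pass@k = 1 - C(n-c, k) / C(n, k), averaged over the 164 tasks; a sketch in the numerically stable product form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Sanity check: for k = 1 the estimator reduces to the pass rate c / n.
assert abs(pass_at_k(200, 53, 1) - 53 / 200) < 1e-9
```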

How are models ranked on the HumanEval leaderboard?
The HumanEval leaderboard ranks 65 AI models based on their performance on this benchmark. Currently, Kimi K2 0905 by Moonshot AI leads with a score of 0.945. The average score across all models is 0.809.

What is the highest HumanEval score?
The highest HumanEval score is 0.945, achieved by Kimi K2 0905 from Moonshot AI.

How many models have been evaluated on HumanEval?
65 models have been evaluated on the HumanEval benchmark, with 0 verified results and 65 self-reported results.

What categories does HumanEval fall under?
HumanEval is categorized under code and reasoning. The benchmark evaluates text models.