HumanEval
A benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original, hand-written programming problems assessing language comprehension, algorithms, and simple mathematics.
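Each problem provides a Python function signature and docstring, and a completion counts as correct only if it passes the task's hidden unit tests. As an illustration, here is a problem in the style of the benchmark's first task, together with one passing completion (the body below is a sketch, not the dataset's canonical solution):

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # One acceptable completion: compare every pair of elements.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# The benchmark's unit tests exercise cases like the docstring examples:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
```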
Progress Over Time
Interactive timeline showing model performance evolution on HumanEval. [Chart legend: state-of-the-art frontier; open models; proprietary models.]
HumanEval Leaderboard
65 models • 0 verified
| Rank | Organization | Score | Params | Context | Cost (input / output) |
|---|---|---|---|---|---|
| 1 | Moonshot AI (Kimi K2 0905) | 0.945 | 1.0T | 262K | $0.60 / $2.50 |
| 2 | Anthropic | 0.937 | — | 200K | $3.00 / $15.00 |
| 3 | OpenAI | 0.934 | — | 400K | $1.25 / $10.00 |
| 4 | Moonshot AI | 0.933 | 1.0T | 200K | $0.50 / $0.50 |
| 5 | Alibaba Cloud / Qwen Team | 0.927 | 32B | 128K | $0.09 / $0.09 |
| 6 | OpenAI | 0.924 | — | 128K | $3.00 / $12.00 |
| 7 | Sarvam AI | 0.921 | 30B | — | — |
| 8 | Mistral AI | 0.920 | 123B | 128K | $2.00 / $6.00 |
| 8 | Anthropic | 0.920 | — | 200K | $3.00 / $15.00 |
| 10 | Alibaba Cloud / Qwen Team | 0.915 | 34B | — | — |
| 11 | OpenAI | 0.902 | — | 128K | $2.50 / $10.00 |
| 12 | — | 0.897 | 8B | 128K | $0.50 / $0.50 |
| 12 | — | 0.897 | 8B | — | — |
| 14 | Google | 0.896 | — | — | — |
| 15 | Amazon | 0.890 | — | 300K | $0.80 / $3.20 |
| 15 | DeepSeek | 0.890 | 236B | 8K | $0.14 / $0.28 |
| 15 | — | 0.890 | 405B | 128K | $0.89 / $0.89 |
| 18 | Mistral AI | 0.884 | 24B | — | — |
| 18 | Meituan | 0.884 | 560B | 128K | $0.30 / $1.20 |
| 20 | — | 0.884 | 70B | 128K | $0.20 / $0.20 |
| 20 | Alibaba Cloud / Qwen Team | 0.884 | 7B | — | — |
| 20 | Alibaba Cloud / Qwen Team | 0.884 | 33B | — | — |
| 20 | xAI | 0.884 | — | 128K | $2.00 / $10.00 |
| 24 | Anthropic | 0.881 | — | 200K | $0.80 / $4.00 |
| 24 | OpenAI | 0.881 | — | 200K | $15.00 / $60.00 |
| 26 | OpenAI | 0.880 | — | 128K | $75.00 / $150.00 |
| 27 | Google | 0.878 | 27B | 131K | $0.10 / $0.20 |
| 28 | OpenAI | 0.872 | — | 128K | $0.15 / $0.60 |
| 29 | OpenAI | 0.871 | — | 128K | $10.00 / $30.00 |
| 30 | Alibaba Cloud / Qwen Team | 0.866 | 73B | 131K | $0.35 / $0.40 |
| 31 | Alibaba Cloud / Qwen Team | 0.860 | 72B | — | — |
| 32 | xAI | 0.857 | — | — | — |
| 33 | Amazon | 0.854 | — | 300K | $0.06 / $0.24 |
| 33 | Google | 0.854 | 12B | 131K | $0.05 / $0.10 |
| 35 | Anthropic | 0.849 | — | 200K | $15.00 / $75.00 |
| 36 | Mistral AI | 0.848 | 24B | 32K | $0.07 / $0.14 |
| 36 | Alibaba Cloud / Qwen Team | 0.848 | 8B | 131K | $0.30 / $0.30 |
| 38 | Google | 0.841 | — | 2.1M | $2.50 / $10.00 |
| 39 | Alibaba Cloud / Qwen Team | 0.835 | 15B | — | — |
| 40 | Microsoft | 0.826 | 15B | 16K | $0.07 / $0.14 |
| 41 | — | 0.824 | 7B | — | — |
| 42 | Amazon | 0.811 | — | 128K | $0.03 / $0.14 |
| 42 | Mistral AI | 0.811 | 22B | — | — |
| 44 | — | 0.805 | 70B | 128K | $0.20 / $0.20 |
| 45 | Alibaba Cloud / Qwen Team | 0.799 | 8B | — | — |
| 46 | Alibaba Cloud / Qwen Team | 0.787 | 7B | — | — |
| 47 | Anthropic | 0.759 | — | 200K | $0.25 / $1.25 |
| 48 | — | 0.750 | 2B | — | — |
| 48 | Google | 0.750 | 8B | 32K | $20.00 / $40.00 |
| 50 | Google | 0.743 | — | 1.0M | $0.15 / $0.60 |
Showing the top 50 of 65 models.
FAQ
Common questions about HumanEval
What is HumanEval?
HumanEval is a benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
Where can I read the HumanEval paper?
The HumanEval paper is available at https://arxiv.org/abs/2107.03374. It details the benchmark's methodology, dataset creation, and evaluation criteria.
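The paper's headline metric is pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the paper's unbiased estimator, computed from n samples of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn
    and c of them passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples with 50 passes gives a pass@10 estimate of ~0.95.
print(pass_at_k(200, 50, 10))
```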
How are models ranked on HumanEval?
The HumanEval leaderboard ranks 65 AI models by their performance on this benchmark. Currently, Kimi K2 0905 by Moonshot AI leads with a score of 0.945; the average score across all models is 0.809.
What is the highest HumanEval score?
The highest HumanEval score is 0.945, achieved by Kimi K2 0905 from Moonshot AI.
How many models have been evaluated on HumanEval?
65 models have been evaluated on the HumanEval benchmark, with 0 verified results and 64 self-reported results.
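Self-reported results are typically produced with OpenAI's reference harness (https://github.com/openai/human-eval). A minimal sketch of that workflow, assuming the human-eval package is installed locally; generate_one_completion is a hypothetical stand-in for an actual model call:

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical stand-in: replace with your model's sampling code.
    return "    pass\n"

problems = read_problems()  # maps task_id -> {"prompt", "test", "entry_point", ...}

samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then score the samples (runs each task's unit tests in a sandbox):
#   $ evaluate_functional_correctness samples.jsonl
```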
What categories does HumanEval belong to?
HumanEval is categorized under code and reasoning, and it evaluates text models.