EvalPlus
A rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases, generated via LLM-based and mutation-based strategies, to better assess the functional correctness of generated code. Its HumanEval+ dataset contains 80x more test cases than the original HumanEval.
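The core idea behind the extra test cases can be sketched in a few lines: start from seed inputs, apply type-aware mutations, and keep only inputs that a ground-truth solution accepts, using its outputs as the oracle. Below is a minimal sketch of that approach, assuming a reference solution is available; all function names and mutation rules here are illustrative, not taken from EvalPlus's codebase:

```python
import copy
import random

def mutate(value):
    """Type-aware mutation of one argument. These rules are
    illustrative stand-ins, not EvalPlus's actual mutation operators."""
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])
    if isinstance(value, float):
        return value * random.choice([0.5, 2.0])
    if isinstance(value, str):
        return value + random.choice("xyz")
    if isinstance(value, list):
        out = copy.deepcopy(value)
        if out and random.random() < 0.5:
            i = random.randrange(len(out))
            out[i] = mutate(out[i])  # mutate a random element
        else:
            out.append(0)  # or grow the list
        return out
    return value

def augment_tests(seed_inputs, ground_truth, rounds=1000):
    """Grow a seed test suite by mutating inputs, keeping only those
    the ground-truth solution handles; its outputs become the oracle."""
    tests = []
    for _ in range(rounds):
        args = [mutate(a) for a in random.choice(seed_inputs)]
        try:
            tests.append((args, ground_truth(*copy.deepcopy(args))))
        except Exception:
            continue  # mutated input violated the problem's preconditions
    return tests

def is_correct(candidate, tests):
    """Differential check: the candidate must match the oracle everywhere."""
    for args, expected in tests:
        try:
            if candidate(*copy.deepcopy(args)) != expected:
                return False
        except Exception:
            return False
    return True
```

A candidate that passes a small base suite but diverges from the ground truth on any augmented input is rejected; this is how the extra HumanEval+ tests catch bugs the original test cases miss.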
Progress Over Time
[Interactive timeline: model performance evolution on EvalPlus, tracing the state-of-the-art frontier across open and proprietary models]
EvalPlus Leaderboard
4 models
| Rank | Organization | Params | Context | Cost (in / out) | License |
|---|---|---|---|---|---|
| 1 | Moonshot AI | 1.0T | — | — | — |
| 2 | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 3 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 | — |
| 4 | Alibaba Cloud / Qwen Team | 8B | — | — | — |
FAQ
Common questions about EvalPlus
What is EvalPlus?
EvalPlus is a rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases, generated via LLM-based and mutation-based strategies, to better assess the functional correctness of generated code; its HumanEval+ dataset contains 80x more test cases than the original HumanEval.
Where can I read the EvalPlus paper?
The EvalPlus paper is available at https://arxiv.org/abs/2305.01210. It describes the benchmark methodology, dataset construction, and evaluation criteria in detail.
How are models ranked on EvalPlus?
The EvalPlus leaderboard ranks 4 AI models by their scores on the benchmark. Kimi K2 Base by Moonshot AI currently leads with a score of 0.803, and the average score across all models is 0.768.
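Scores like these behave as pass rates: a generated solution counts only if it passes every test for its problem, and the score is the fraction of problems solved. For context, here is the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), which EvalPlus-style leaderboards typically report as pass@1; the function name below is illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples generated per problem,
    c = samples that pass all tests,
    k = budget of attempts being scored."""
    if n - c < k:
        return 1.0  # every size-k draw contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 61 pass the augmented (plus) tests
print(round(pass_at_k(200, 61, 1), 3))  # 0.305 = expected pass@1
```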
What is the highest EvalPlus score?
The highest EvalPlus score is 0.803, achieved by Kimi K2 Base from Moonshot AI.
How many models have been evaluated on EvalPlus?
4 models have been evaluated on the EvalPlus benchmark: 0 with verified results and 4 with self-reported results.
What does EvalPlus evaluate?
EvalPlus is categorized under code and reasoning, and it evaluates text models.