
EvalPlus

A rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases generated via LLM-based and mutation-based strategies to better assess the functional correctness of generated code. Its HumanEval+ dataset contains 80x more test cases than the original HumanEval.
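The core idea behind the augmented "+" test suites can be illustrated with a small amount of code: start from the original seed inputs, mutate them (and optionally add LLM-generated edge cases), keep only inputs the canonical reference solution accepts, and then use the enlarged suite for differential testing of model-generated code. The sketch below is a minimal, self-contained illustration of that idea, not the actual EvalPlus implementation; all function names and the example task are hypothetical.

```python
import random

def mutate_int(x: int) -> int:
    """Type-aware mutation for integer inputs (illustrative only)."""
    return random.choice([x + 1, x - 1, x * 2, -x, 0])

def augment_inputs(seed_inputs, reference_fn, n_rounds=1000):
    """Grow the test-input pool by mutating existing inputs; keep only
    mutants that the canonical reference solution handles without error."""
    pool = list(seed_inputs)
    for _ in range(n_rounds):
        new_input = mutate_int(random.choice(pool))
        try:
            reference_fn(new_input)  # the reference defines expected behavior
        except Exception:
            continue                 # discard inputs outside the task's spec
        pool.append(new_input)
    return pool

def passes_plus_tests(candidate_fn, reference_fn, inputs):
    """Differential testing: a candidate counts as functionally correct
    only if it matches the reference on every augmented input."""
    for x in inputs:
        try:
            if candidate_fn(x) != reference_fn(x):
                return False
        except Exception:
            return False
    return True

# A plausible-but-buggy completion: int(x / 2) truncates toward zero,
# while the spec (floor division) rounds toward negative infinity.
reference = lambda x: x // 2
candidate = lambda x: int(x / 2)

seeds = [0, 4, 10]                                     # original, all-positive tests
print(passes_plus_tests(candidate, reference, seeds))  # True: the bug slips through
plus = augment_inputs(seeds, reference)
print(passes_plus_tests(candidate, reference, plus))   # almost surely False: negative
                                                       # odd inputs expose the bug
```

The example mirrors why the augmented suites matter: the candidate passes every original test yet fails on inputs the seed suite never exercised.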

Paper: https://arxiv.org/abs/2305.01210

Progress Over Time

Interactive timeline showing model performance evolution on EvalPlus


EvalPlus Leaderboard

4 models (parameters, context window, and cost listed where available):

1. Kimi K2 Base (Moonshot AI), 1.0T parameters, score 0.803
2. Alibaba Cloud / Qwen Team, 72B parameters
3. Alibaba Cloud / Qwen Team, 235B parameters, 128K context, cost $0.10 / $0.10
4. Alibaba Cloud / Qwen Team, 8B parameters

FAQ

Common questions about EvalPlus

What is EvalPlus?
EvalPlus is a rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases generated via LLM-based and mutation-based strategies to better assess the functional correctness of generated code; its HumanEval+ dataset contains 80x more test cases than the original HumanEval.
Where can I find the EvalPlus paper?
The EvalPlus paper is available at https://arxiv.org/abs/2305.01210. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
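For readers who want to run the benchmark themselves, the snippet below sketches the typical workflow with the evalplus Python package: generate one completion per task, write the samples to a JSONL file, and score them with the package's evaluation command. This follows the usage documented in the EvalPlus repository, but exact module names, sample fields, and command-line arguments may differ between versions, and `generate_with_your_model` is a hypothetical stand-in for the model under evaluation.

```python
# pip install evalplus
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_with_your_model(prompt: str) -> str:
    """Hypothetical stand-in for the model under evaluation: return a
    complete solution (as source code) for the given task prompt."""
    raise NotImplementedError

# One completion per HumanEval+ task, written in the JSONL format the
# evaluator expects.
samples = [
    {"task_id": task_id, "solution": generate_with_your_model(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run against both the base and the augmented (+) test
# suites, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```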
How does the EvalPlus leaderboard rank models?
The EvalPlus leaderboard ranks 4 AI models by their performance on this benchmark. Currently, Kimi K2 Base by Moonshot AI leads with a score of 0.803; the average score across all models is 0.768.

What is the highest EvalPlus score?
The highest EvalPlus score is 0.803, achieved by Kimi K2 Base from Moonshot AI.

How many models have been evaluated on EvalPlus?
Four models have been evaluated on the EvalPlus benchmark, with 0 verified results and 4 self-reported results.

What categories does EvalPlus cover?
EvalPlus is categorized under code and reasoning, and it evaluates text models.