PaperBench
PaperBench is a benchmark for evaluating AI agents on their ability to replicate research papers. It tests models on complex, multi-step workflows involving code implementation, experimentation, and reproducing scientific results from academic publications.
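PaperBench grades replication attempts against detailed, paper-specific rubrics. As an illustration only, the sketch below shows one way such a score can be aggregated, assuming a weighted rubric tree whose leaf requirements are judged pass/fail; the node structure, names, and weights here are hypothetical, not PaperBench's actual grading implementation.

```python
# Sketch of weighted-rubric score aggregation (illustrative, not the
# official PaperBench grader). Requires Python 3.10+ for `bool | None`.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    weight: float                        # relative weight among siblings
    passed: bool | None = None           # judge verdict for leaf requirements
    children: list["RubricNode"] = field(default_factory=list)

def replication_score(node: RubricNode) -> float:
    """Score in [0, 1]: a leaf scores 1.0 if its requirement passed,
    an internal node scores the weight-normalized mean of its children."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * replication_score(c) for c in node.children) / total

# Hypothetical rubric for one paper: runnable code weighted twice as
# heavily as each of two reproduced results.
rubric = RubricNode(weight=1.0, children=[
    RubricNode(weight=2.0, passed=True),   # code executes end to end
    RubricNode(weight=1.0, passed=True),   # result A reproduced
    RubricNode(weight=1.0, passed=False),  # result B not reproduced
])
print(replication_score(rubric))  # (2 + 1 + 0) / 4 = 0.75
```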
Progress Over Time
[Interactive timeline of model performance on PaperBench, showing the state-of-the-art frontier and distinguishing open from proprietary models.]
PaperBench Leaderboard
1 model
| # | Model | Organization | Parameters | Context | Cost (input / output per 1M tokens) | License | Score |
|---|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 | Moonshot AI | 1.0T | 262K | $0.60 / $3.00 | | 0.635 |
FAQ
Common questions about PaperBench
The PaperBench leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Kimi K2.5 by Moonshot AI leads with a score of 0.635, which is also the average score across all evaluated models.
The highest PaperBench score is 0.635, achieved by Kimi K2.5 from Moonshot AI.
1 model has been evaluated on the PaperBench benchmark, with 0 verified results and 1 self-reported result.
PaperBench is categorized under agents, code, and reasoning. The benchmark evaluates text models.