
PaperBench

PaperBench is a benchmark for evaluating AI agents on their ability to replicate research papers. It tests models on complex, multi-step workflows involving code implementation, experimentation, and reproducing scientific results from academic publications.
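Scoring a replication attempt against a paper involves grading many sub-criteria. As an illustrative sketch only (the structure and names below are assumptions, not PaperBench's actual grading code), a replication score can be aggregated as a weighted average over a tree of rubric criteria:

```python
# Hedged sketch: a toy rubric aggregator. The tree shape, weights, and
# example criteria are illustrative assumptions, not the official
# PaperBench implementation.

def aggregate(node):
    """Return the score of a rubric node in [0, 1].

    A leaf node is {"score": float}; an internal node is
    {"children": [(weight, subnode), ...]} whose score is the
    weighted mean of its children's scores.
    """
    if "score" in node:
        return node["score"]
    total = sum(w for w, _ in node["children"])
    return sum(w * aggregate(child) for w, child in node["children"]) / total

# Hypothetical rubric for one paper-replication attempt.
rubric = {
    "children": [
        (1.0, {"score": 1.0}),   # e.g. code runs end-to-end
        (2.0, {"score": 0.5}),   # e.g. results partially match the paper
    ],
}
print(aggregate(rubric))
```

A real grader would produce the leaf scores (e.g. via an automated judge), but the aggregation step would reduce to something like this weighted mean.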

Progress Over Time

[Interactive timeline showing model performance evolution on PaperBench, with the state-of-the-art frontier and open vs. proprietary models marked]
PaperBench Leaderboard

1 model evaluated.

Rank  Model      Organization  Params  Context  Cost (input / output)  License
1     Kimi K2.5  Moonshot AI   1.0T    262K     $0.60 / $3.00          —

FAQ

Common questions about PaperBench

The PaperBench leaderboard currently ranks a single AI model: Kimi K2.5 by Moonshot AI, which leads with a score of 0.635 (as the only entry, this is also the average score).
The highest PaperBench score is 0.635, achieved by Kimi K2.5 from Moonshot AI.
1 model has been evaluated on the PaperBench benchmark, with 0 verified results and 1 self-reported result.
PaperBench is categorized under agents, code, and reasoning. The benchmark evaluates text models.