PaperBench Leaderboard

Progress Over Time

Interactive timeline showing model performance evolution on PaperBench (chart series: state-of-the-art frontier, open models, proprietary models).

PaperBench Leaderboard

1 model

Rank	Model	Organization	Parameters	Context	Cost	License
1	Kimi K2.5	Moonshot AI	1.0T	262K	$0.60 / $3.00	(not listed)

FAQ

Common questions about PaperBench

What is the PaperBench benchmark?

PaperBench is a benchmark for evaluating AI agents on their ability to replicate research papers. It tests models on complex, multi-step workflows involving code implementation, experimentation, and reproducing scientific results from academic publications.

What is the PaperBench leaderboard?

The PaperBench leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: Kimi K2.5 by Moonshot AI, which leads with a score of 0.635. With a single entry, the average score across all models is also 0.635.

What is the highest PaperBench score?

The highest PaperBench score is 0.635, achieved by Kimi K2.5 from Moonshot AI.

How many models are evaluated on PaperBench?

1 model has been evaluated on the PaperBench benchmark, with 0 verified results and 1 self-reported result.

What categories does PaperBench cover?

PaperBench is categorized under agents, code, and reasoning. The benchmark evaluates text models.
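The summary figures quoted in the FAQ (model count, top score, average score, verified versus self-reported results) can all be derived from the leaderboard rows. The following is a minimal sketch of that arithmetic, not the site's actual implementation: the `Entry` class, its field names, and the `summarize` function are hypothetical, and the only data used is the single entry reported on this page.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    organization: str
    score: float
    verified: bool  # False means the result is self-reported


def summarize(entries):
    """Compute the summary figures quoted in the FAQ from a list of entries."""
    if not entries:
        raise ValueError("leaderboard has no entries")
    top = max(entries, key=lambda e: e.score)
    verified = sum(1 for e in entries if e.verified)
    return {
        "models": len(entries),
        "top_model": top.model,
        "top_score": top.score,
        "average": mean(e.score for e in entries),
        "verified": verified,
        "self_reported": len(entries) - verified,
    }


# The single entry reported on this page; the score is self-reported.
entries = [Entry("Kimi K2.5", "Moonshot AI", 0.635, verified=False)]
summary = summarize(entries)
print(summary)
```

With only one entry, the top score and the average coincide, which is why both are reported as 0.635.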