
RepoBench

RepoBench is a benchmark for evaluating repository-level code auto-completion systems through three interconnected tasks: RepoBench-R (retrieval of relevant code snippets across files), RepoBench-C (code completion with cross-file and in-file context), and RepoBench-P (a pipeline combining retrieval and prediction). It supports Python and Java, and it addresses a gap in evaluating real-world, multi-file programming scenarios, allowing a more complete comparison of auto-completion systems.
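
To make the three tasks concrete, here is a minimal sketch of a retrieve-then-complete flow in the spirit of RepoBench-R and RepoBench-P. It is an illustration under stated assumptions, not the benchmark's own code: the function names (`tokenize`, `jaccard`, `retrieve`, `build_prompt`), the token-level Jaccard ranking, and the prompt layout are all hypothetical choices made for this example.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split source text into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(in_file_context: str, candidates: list[str], top_k: int = 1) -> list[str]:
    """Retrieval step: rank cross-file snippets by lexical overlap with the in-file context."""
    query = tokenize(in_file_context)
    ranked = sorted(
        candidates,
        key=lambda snippet: jaccard(query, tokenize(snippet)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(in_file_context: str, retrieved: list[str]) -> str:
    """Pipeline step: place retrieved snippets before the in-file context.

    The resulting string would be passed to a completion model, which is
    outside the scope of this sketch.
    """
    cross_file = "\n\n".join(retrieved)
    return (
        "# Retrieved cross-file context\n"
        f"{cross_file}\n\n"
        "# Current file\n"
        f"{in_file_context}"
    )


# Hypothetical repository snippets and an unfinished line in the current file.
candidates = [
    "def load_config(path):\n    return parse(open(path).read())",
    "class Tokenizer:\n    def encode(self, text):\n        return text.split()",
]
in_file = "from utils import Tokenizer\n\ntok = Tokenizer()\nids = tok."

print(build_prompt(in_file, retrieve(in_file, candidates)))
```

A retrieval task scores only the ranking produced by something like `retrieve`, a completion task scores the model's next line given a fixed prompt, and a pipeline task scores the two together.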



RepoBench Leaderboard

1 model

Rank  Model          Organization  Parameters  Score
1     Codestral-22B  Mistral AI    22B         0.340

FAQ

Common questions about RepoBench

The RepoBench paper is available at https://arxiv.org/abs/2306.03091. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The RepoBench leaderboard currently lists one model: Codestral-22B by Mistral AI, with a score of 0.340. Because it is the only entry, 0.340 is both the highest and the average score on the leaderboard.
This result is self-reported; no verified results have been recorded for RepoBench.
RepoBench is categorized under code and reasoning. The benchmark evaluates text models.