RepoBench

Name: RepoBench Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

RepoBench Leaderboard

1 model

				Context	Cost	License
1	Codestral-22B Mistral AI		22B	—	—

About this benchmark

What is RepoBench?

RepoBench is a benchmark for evaluating repository-level code auto-completion systems through three interconnected tasks: RepoBench-R (retrieval of relevant code snippets across files), RepoBench-C (code completion with cross-file and in-file context), and RepoBench-P (a pipeline combining retrieval and prediction). It supports Python and Java, and it addresses the gap in evaluating real-world, multi-file programming scenarios by enabling a more complete comparison of auto-completion systems.
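
To make the retrieval task more concrete, here is a minimal sketch of ranking cross-file snippets against the in-file context by Jaccard similarity over tokens. This is an illustrative lexical baseline under stated assumptions, not RepoBench's own implementation: the tokenizer, the function names (`tokens`, `jaccard`, `retrieve`), and the sample snippets are all hypothetical.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, candidates: list[str], k: int = 1) -> list[int]:
    """Return the indices of the k candidates most similar to the query."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: jaccard(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical snippets standing in for code from other files in a repository.
candidates = [
    "def load_config(path):\n    return json.load(open(path))",
    "class UserStore:\n    def get_user(self, user_id):\n        return self.users[user_id]",
    "def render(template, context):\n    return template.format(**context)",
]

# Hypothetical in-file context that precedes the line to be completed.
in_file_context = "store = UserStore()\nuser = store.get_user(user_id)"

top = retrieve(in_file_context, candidates, k=1)
print(top)  # index of the snippet sharing the most tokens with the context
```

In a pipeline setting, the retrieved snippet would be prepended to the in-file context before asking a model for the next line. A token-overlap ranker is only one possible choice of retriever.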

RepoBench is a text benchmark that evaluates models on reasoning and code tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0 to 1 scale. With a single entry, the current average and the leading score are both 0.340.

Current leaders

Codestral-22B from Mistral AI currently leads the RepoBench leaderboard with a score of 0.340; it is the only model evaluated so far.

Codestral-22B (Mistral AI): 34.0%

Source paper

Title: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Authors: Tianyang Liu, Canwen Xu, Julian McAuley
Published: June 5, 2023
arXiv: 2306.03091

Abstract

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.
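
As a rough illustration of how next-line prediction can be scored, the sketch below computes an exact-match rate and an average edit similarity over prediction/reference pairs. It rests on assumptions: `difflib.SequenceMatcher.ratio()` is swapped in as a stand-in similarity measure, the function names and sample pairs are hypothetical, and nothing here is taken from the RepoBench codebase or claimed to reproduce the leaderboard score.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> bool:
    """True when prediction and reference agree after trimming whitespace."""
    return pred.strip() == ref.strip()


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is an assumed stand-in metric."""
    return SequenceMatcher(None, pred.strip(), ref.strip()).ratio()


def score(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average both metrics over (prediction, reference) pairs."""
    if not pairs:
        raise ValueError("need at least one (prediction, reference) pair")
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "edit_similarity": sum(edit_similarity(p, r) for p, r in pairs) / n,
    }


# Hypothetical model predictions paired with the true next lines.
pairs = [
    ("return self.users[user_id]", "return self.users[user_id]"),
    ("return users[id]", "return self.users[user_id]"),
]

results = score(pairs)
print(results)
```

Exact match rewards only a fully correct line, while a similarity score gives partial credit for near misses, which is why the two are often reported together.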

FAQ

Common questions about the RepoBench benchmark and leaderboard.

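What is the RepoBench benchmark?

RepoBench is a benchmark for repository-level code auto-completion in Python and Java, built around three tasks: retrieval (RepoBench-R), code completion (RepoBench-C), and a combined pipeline (RepoBench-P).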

What is the RepoBench leaderboard?

The RepoBench leaderboard ranks AI models by their performance on this benchmark. It currently lists one model, Codestral-22B by Mistral AI, with a score of 0.340. With a single entry, the average score is also 0.340.

What is the highest RepoBench score?

The highest RepoBench score is 0.340, achieved by Codestral-22B from Mistral AI.

How many models are evaluated on RepoBench?

One model has been evaluated on the RepoBench benchmark. Its result is self-reported; no results have been verified.

Where can I find the RepoBench paper?

The RepoBench paper is available at https://arxiv.org/abs/2306.03091. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does RepoBench cover?

RepoBench is categorized under reasoning and code. The benchmark evaluates text models.

What is the best open-source model on RepoBench?

Codestral-22B by Mistral AI is the top-ranked open-source model on RepoBench, with a score of 0.340 (rank #1).

How recent are the RepoBench leaderboard results?

The RepoBench leaderboard was last updated in July 2026 and currently includes 1 evaluated model.