Aider

Aider Leaderboard

4 models

Rank, organization, parameters, context, cost:
1. DeepSeek, 236B
2. Alibaba Cloud / Qwen Team, 235B
3. Alibaba Cloud / Qwen Team, 7B
4. Alibaba Cloud / Qwen Team, 33B, 128K, $0.10 / $0.44

About this benchmark

What is Aider?

Aider is a code editing benchmark built on 133 practice exercises from Exercism's Python repository. It evaluates how well an AI model translates natural language coding requests into executable code that passes unit tests. The benchmark measures end-to-end code editing: the model must both edit existing code and format its changes so they can be saved automatically to local files. The Aider Polyglot variant extends the evaluation to 225 challenging exercises spanning C++, Go, Java, JavaScript, Python, and Rust, making it a standard benchmark for assessing multilingual code editing performance.
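A minimal sketch of how a pass-rate score of this kind could be computed, assuming each exercise yields a single pass/fail outcome from its unit tests. The names `ExerciseResult` and `pass_rate` are hypothetical and are not taken from Aider's own harness, and the counts below are illustrative data only.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Outcome of one exercise (hypothetical structure, for illustration)."""
    name: str
    tests_passed: bool


def pass_rate(results):
    """Return the fraction of exercises whose unit tests passed (0-1 scale)."""
    if not results:
        raise ValueError("at least one result is required")
    return sum(r.tests_passed for r in results) / len(results)


# Illustrative data: 133 exercises, the first 96 of which pass their tests.
results = [ExerciseResult(f"exercise-{i}", i < 96) for i in range(133)]
score = pass_rate(results)
print(f"{score:.3f}")
```

Under this assumption a score on the 0–1 scale is simply the share of exercises solved, so multiplying by 100 gives the percentage form used in leaderboard listings.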

Aider is a text benchmark evaluating models on reasoning and code tasks. LLM Stats tracks 4 models on this benchmark, scored on a 0–1 scale. The current average is 0.599, with the leader at 0.722.


Current leaders

DeepSeek-V2.5 from DeepSeek currently leads the Aider leaderboard with a score of 0.722, the highest of the 4 evaluated AI models.

1. DeepSeek-V2.5 (DeepSeek): 72.2%
2. Qwen3 235B A22B (Alibaba Cloud / Qwen Team): 61.8%
3. Qwen2.5-Coder 7B Instruct (Alibaba Cloud / Qwen Team): 55.6%

FAQ

Common questions about the Aider benchmark and leaderboard.

What is the Aider benchmark?

Aider is a code editing benchmark built on 133 practice exercises from Exercism's Python repository. It evaluates whether an AI model can turn natural language coding requests into edits to existing code that pass unit tests and are formatted so they can be saved automatically to local files. The Aider Polyglot variant covers 225 challenging exercises across C++, Go, Java, JavaScript, Python, and Rust.

What is the Aider leaderboard?

The Aider leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, DeepSeek-V2.5 by DeepSeek leads with a score of 0.722. The average score across all models is 0.599.
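The leader and the average can be cross-checked with simple arithmetic. The sketch below uses only the three scores listed under "Current leaders" plus the reported average. The fourth score is not listed on the page, so the value derived here is an estimate implied by the rounded average, not a reported result; the variable names are illustrative.

```python
# Scores listed under "Current leaders", on the 0-1 scale.
listed_scores = {
    "DeepSeek-V2.5": 0.722,
    "Qwen3 235B A22B": 0.618,
    "Qwen2.5-Coder 7B Instruct": 0.556,
}
reported_average = 0.599  # average across all models, as stated on the page
model_count = 4

# The leader is the entry with the highest score.
leader = max(listed_scores, key=listed_scores.get)

# Average * count gives the total; subtracting the listed scores leaves
# an estimate of the one score that is not listed (assumption: the
# average is a plain arithmetic mean over the 4 models).
implied_fourth = reported_average * model_count - sum(listed_scores.values())

print(leader)
print(f"{implied_fourth:.2f}")
```

Because the published figures are rounded to three decimals, the implied fourth score is only accurate to within a few thousandths of roughly 0.50.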

What is the highest Aider score?

The highest Aider score is 0.722, achieved by DeepSeek-V2.5 from DeepSeek.

How many models are evaluated on Aider?

Four models have been evaluated on the Aider benchmark. All 4 results are self-reported; none are verified.

Where can I find the Aider dataset?

The Aider dataset is available at https://github.com/Aider-AI/aider.

What categories does Aider cover?

Aider is categorized under reasoning and code. The benchmark evaluates text models.

What is the best open-source model on Aider?

DeepSeek-V2.5 by DeepSeek is the top-ranked open-source model on Aider, with a score of 0.722 (rank #1).

How recent are the Aider leaderboard results?

The Aider leaderboard was last updated in July 2026 and currently includes 4 evaluated models.