Aider
Aider Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | DeepSeek | 236B | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 3 | Alibaba Cloud / Qwen Team | 7B | — | — | |
| 4 | Alibaba Cloud / Qwen Team | 33B | 128K | $0.10 / $0.44 | |
What is Aider?
Aider is a code editing benchmark built on 133 practice exercises from Exercism's Python repository. It evaluates how well an AI model translates natural language coding requests into executable code that passes unit tests. The benchmark measures end-to-end code editing: the model must edit existing code and format its changes so that they can be saved automatically to local files. The Aider Polyglot variant extends the evaluation to 225 challenging exercises spanning C++, Go, Java, JavaScript, Python, and Rust, which makes it a common benchmark for assessing multilingual code editing performance in AI research.
Aider is a text benchmark that evaluates models on reasoning and code tasks. LLM Stats tracks 4 models on this benchmark, scored on a 0–1 scale. The current average score is about 0.6, and the leader scores 0.722.
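A score on the 0–1 scale can be read as the fraction of exercises whose unit tests all pass. The sketch below illustrates that reading only; it is not the benchmark's actual harness. The `ExerciseResult` record, the `pass_rate` helper, and the 96-of-133 figure are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Outcome of one exercise (hypothetical record, not the harness's own type)."""
    name: str
    tests_passed: bool  # True only if every unit test for the exercise passed


def pass_rate(results: list[ExerciseResult]) -> float:
    """Return the fraction of exercises solved, on a 0-1 scale."""
    if not results:
        raise ValueError("no exercise results to score")
    solved = sum(1 for r in results if r.tests_passed)
    return solved / len(results)


# Hypothetical run: the first 96 of 133 exercises are marked as solved.
results = [ExerciseResult(f"exercise-{i}", i < 96) for i in range(133)]
score = pass_rate(results)
print(f"{score:.3f}")  # prints 0.722
```

Under this assumption, each additional exercise solved on a 133-exercise set moves the score by roughly 0.0075, so small gaps between leaderboard entries correspond to only a handful of exercises.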
Compare leaders on the best AI for reasoning and best AI for code leaderboards.
Current leaders
DeepSeek-V2.5 from DeepSeek currently leads the Aider leaderboard with a score of 0.722, the highest among the 4 evaluated AI models.