Aider

Aider is a code editing benchmark built on 133 practice exercises from Exercism's Python repository, designed to evaluate AI models' ability to translate natural language coding requests into executable code that passes unit tests. The benchmark measures end-to-end code editing: the model must edit existing code and format its changes so they can be saved automatically to local files. The Aider Polyglot variant extends this evaluation to 225 challenging exercises spanning C++, Go, Java, JavaScript, Python, and Rust, making it a standard benchmark for assessing multilingual code editing performance.

Implementation
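
The Aider repository ships its own benchmark harness; the sketch below is not that harness, only a minimal illustration of the flow it measures: each exercise's instructions and solution stub are sent to the model, the model's edit is written back to the file, and the exercise's unit tests decide pass or fail. The directory layout, file names, and the call_model helper are assumptions made for illustration, and the published harness additionally lets the model take a second attempt after seeing the failing test output.

```python
"""Minimal sketch of an Aider-style evaluation loop (illustrative, not the official harness).

Assumed layout: exercises/<slug>/ containing instructions.md, a solution stub
<slug>.py, and a pytest file <slug>_test.py. call_model() is a placeholder for
whatever chat completion API is being benchmarked.
"""
import subprocess
from pathlib import Path


def call_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model under test and return
    the complete contents of the edited solution file."""
    raise NotImplementedError


def run_exercise(exercise_dir: Path) -> bool:
    slug = exercise_dir.name
    instructions = (exercise_dir / "instructions.md").read_text()
    stub = (exercise_dir / f"{slug}.py").read_text()

    prompt = (
        f"{instructions}\n\n"
        f"Here is the current code in {slug}.py:\n\n{stub}\n\n"
        "Rewrite the file so that the unit tests pass. "
        "Return only the complete file contents."
    )

    # Save the model's edit back to the local file, as the benchmark requires.
    (exercise_dir / f"{slug}.py").write_text(call_model(prompt))

    # The pass/fail signal: do the exercise's unit tests pass?
    result = subprocess.run(
        ["pytest", f"{slug}_test.py", "-q"],
        cwd=exercise_dir,
        capture_output=True,
    )
    return result.returncode == 0


def main() -> None:
    exercise_dirs = [d for d in sorted(Path("exercises").iterdir()) if d.is_dir()]
    passed = sum(run_exercise(d) for d in exercise_dirs)
    print(f"pass rate: {passed}/{len(exercise_dirs)} = {passed / len(exercise_dirs):.3f}")


if __name__ == "__main__":
    main()
```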

Progress Over Time

[Interactive timeline showing model performance evolution on Aider, split into open and proprietary models with a state-of-the-art frontier line.]

Aider Leaderboard

The leaderboard currently lists 1 model: DeepSeek-V2.5 (236B parameters), with a score of 0.722 (self-reported).

FAQ

Common questions about Aider

Aider is a code editing benchmark of 133 Exercism Python practice exercises (225 exercises across six languages in the Polyglot variant) that tests whether a model can turn natural language requests into code edits that pass unit tests.
The Aider dataset is available at https://github.com/Aider-AI/aider.
The Aider leaderboard currently ranks 1 AI model. DeepSeek-V2.5 by DeepSeek leads with a score of 0.722, which is also the average score across all listed models.
The highest Aider score is 0.722, achieved by DeepSeek-V2.5 from DeepSeek.
1 model has been evaluated on the Aider benchmark, with 0 verified results and 1 self-reported result.
Aider is categorized under code and reasoning. The benchmark evaluates text models.