Aider-Polyglot Edit
A challenging multi-language coding benchmark that evaluates models' code-editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, selected because they were solved by three or fewer of seven top coding models. The benchmark focuses on code-editing tasks and measures both the correctness of solutions and proper use of the edit format. It was designed to re-calibrate the evaluation scale so that top models score between 5% and 50%.
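The selection rule is simple enough to express in a few lines. The sketch below is purely illustrative: the `solve_counts` data and the `MAX_SOLVERS` constant are hypothetical and not taken from the benchmark's actual tooling; it only shows the "solved by 3 or fewer of 7 models" filter.

```python
# Illustrative sketch of the Aider-Polyglot selection rule: keep only the
# Exercism problems that 3 or fewer of the 7 reference models solved.
# `solve_counts` is an assumed structure, not the benchmark's real data format.
solve_counts = {
    "cpp/pov": 1,        # solved by 1 of 7 models -> kept
    "rust/forth": 3,     # solved by 3 of 7 models -> kept
    "python/zipper": 5,  # solved by 5 of 7 models -> excluded
}

MAX_SOLVERS = 3  # "solved by 3 or fewer out of 7 top coding models"

selected = [name for name, solvers in solve_counts.items() if solvers <= MAX_SOLVERS]
print(selected)  # ['cpp/pov', 'rust/forth']
```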
Progress Over Time
Interactive timeline showing model performance evolution on Aider-Polyglot Edit, with a state-of-the-art frontier line and models grouped as open or proprietary.
Aider-Polyglot Edit Leaderboard
10 models
| Rank | Model | Organization | Parameters | Context | Cost (input / output per 1M tokens) | License |
|---|---|---|---|---|---|---|
| 1 | DeepSeek-V3 | DeepSeek | 671B | 131K | $0.27 / $1.10 | — |
| 2 | — | Google | — | 1.0M | $1.25 / $10.00 | — |
| 3 | — | OpenAI | — | 200K | $1.10 / $4.40 | — |
| 4 | — | OpenAI | — | 200K | $1.10 / $4.40 | — |
| 5 | — | Google | — | 1.0M | $0.30 / $2.50 | — |
| 6 | — | OpenAI | — | 1.0M | $2.00 / $8.00 | — |
| 7 | — | OpenAI | — | 128K | $75.00 / $150.00 | — |
| 8 | — | OpenAI | — | 1.0M | $0.40 / $1.60 | — |
| 9 | — | OpenAI | — | 128K | $2.50 / $10.00 | — |
| 10 | — | OpenAI | — | 1.0M | $0.10 / $0.40 | — |
FAQ
Common questions about Aider-Polyglot Edit
What is Aider-Polyglot Edit?
A challenging multi-language coding benchmark that evaluates models' code-editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, selected because they were solved by three or fewer of seven top coding models. The benchmark focuses on code-editing tasks and measures both the correctness of solutions and proper use of the edit format. It was designed to re-calibrate the evaluation scale so that top models score between 5% and 50%.
Where can I find the Aider-Polyglot Edit dataset?
The Aider-Polyglot Edit dataset is available at https://github.com/Aider-AI/polyglot-benchmark.
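For a quick look at what the dataset contains, a short script can count the exercises per language in a local clone. This is a minimal sketch that assumes an Exercism-style layout of `<language>/exercises/practice/<exercise>/` directories; check the repository itself for the actual structure.

```python
from pathlib import Path

# Assumed layout: <repo>/<language>/exercises/practice/<exercise>/
# Adjust the path pattern if the repository is organized differently.
repo = Path("polyglot-benchmark")

for lang_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
    practice = lang_dir / "exercises" / "practice"
    if practice.is_dir():
        exercises = [p for p in practice.iterdir() if p.is_dir()]
        print(f"{lang_dir.name}: {len(exercises)} exercises")
```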
Which model performs best on Aider-Polyglot Edit?
The Aider-Polyglot Edit leaderboard ranks 10 AI models based on their performance on this benchmark. Currently, DeepSeek-V3 by DeepSeek leads with a score of 0.797. The average score across all models is 0.481.
What is the highest score on Aider-Polyglot Edit?
The highest Aider-Polyglot Edit score is 0.797, achieved by DeepSeek-V3 from DeepSeek.
How many models have been evaluated on Aider-Polyglot Edit?
10 models have been evaluated on the Aider-Polyglot Edit benchmark, with 0 verified results and 10 self-reported results.
What type of benchmark is Aider-Polyglot Edit?
Aider-Polyglot Edit is categorized under code and general. The benchmark evaluates text models.