
Aider-Polyglot Edit

A challenging multi-language coding benchmark that evaluates models' code editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, selected because they were solved by three or fewer of seven top coding models. The benchmark focuses on code editing tasks and measures both solution correctness and proper edit-format usage; it is designed to re-calibrate the evaluation scale so that top models score between 5% and 50%.
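The selection rule above (keep problems solved by three or fewer of seven top models) can be sketched in a few lines of Python; the problem names and solve counts below are hypothetical, not actual benchmark data:

```python
# Hypothetical solve counts: how many of 7 top coding models solved each
# Exercism problem (names and counts are illustrative, not real data).
solve_counts = {
    "hello-world": 7,
    "two-bucket": 2,
    "react": 1,
    "grep": 5,
    "zipper": 3,
}

# Aider-Polyglot keeps only problems solved by 3 or fewer of the 7 models.
hard_problems = sorted(name for name, solved in solve_counts.items() if solved <= 3)
print(hard_problems)  # ['react', 'two-bucket', 'zipper']
```

Applying this filter to the full Exercism pool is what yields the 225-problem set.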

Progress Over Time

[Interactive timeline showing model performance evolution on Aider-Polyglot Edit, with models grouped as open or proprietary and a state-of-the-art frontier line]

Aider-Polyglot Edit Leaderboard

10 models

| Rank | Model | Organization | Context | Cost (input / output) |
|------|-------|--------------|---------|-----------------------|
| 1 | DeepSeek-V3 (671B) | DeepSeek | 131K | $0.27 / $1.10 |
| 2 | | | 1.0M | $1.25 / $10.00 |
| 3 | | OpenAI | 200K | $1.10 / $4.40 |
| 4 | | OpenAI | 200K | $1.10 / $4.40 |
| 5 | | | 1.0M | $0.30 / $2.50 |
| 6 | | OpenAI | 1.0M | $2.00 / $8.00 |
| 7 | | OpenAI | 128K | $75.00 / $150.00 |
| 8 | | | 1.0M | $0.40 / $1.60 |
| 9 | | OpenAI | 128K | $2.50 / $10.00 |
| 10 | | | 1.0M | $0.10 / $0.40 |

FAQ

Common questions about Aider-Polyglot Edit

What is Aider-Polyglot Edit?
A challenging multi-language coding benchmark that evaluates models' code editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, selected because they were solved by three or fewer of seven top coding models. The benchmark focuses on code editing tasks and measures both solution correctness and proper edit-format usage; it is designed to re-calibrate the evaluation scale so that top models score between 5% and 50%.
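Edit-format scoring refers to aider's diff edit format, which expresses each change as a SEARCH/REPLACE block against a named file. An illustrative example (the file name and code here are made up):

```
greeting.py
<<<<<<< SEARCH
def greet(name):
    return "Hello " + name
=======
def greet(name):
    return f"Hello, {name}!"
>>>>>>> REPLACE
```

A response that garbles these markers counts against the model even when the intended code change would have been correct.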
Where can I find the Aider-Polyglot Edit dataset?
The Aider-Polyglot Edit dataset is available at https://github.com/Aider-AI/polyglot-benchmark.
How are models ranked on the Aider-Polyglot Edit leaderboard?
The leaderboard ranks 10 AI models based on their performance on this benchmark. Currently, DeepSeek-V3 by DeepSeek leads with a score of 0.797; the average score across all models is 0.481.
What is the highest Aider-Polyglot Edit score?
The highest Aider-Polyglot Edit score is 0.797, achieved by DeepSeek-V3 from DeepSeek.
How many models have been evaluated?
10 models have been evaluated on the Aider-Polyglot Edit benchmark, with 0 verified results and 10 self-reported results.
What does the benchmark evaluate?
Aider-Polyglot Edit is categorized under code and general, and it evaluates text models.