
Aider-Polyglot

A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models receive two attempts to solve each problem, with test error feedback provided after the first attempt if it fails. The benchmark measures both initial problem-solving ability and capacity to edit code based on error feedback, providing an end-to-end evaluation of code generation and editing capabilities across multiple programming languages.
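The two-attempt protocol can be sketched as a simple loop. This is a minimal illustration, not the actual aider harness: `model_solve` and `run_tests` are hypothetical stand-ins for a real LLM call and a per-language test runner (pytest, cargo test, go test, and so on).

```python
# Minimal sketch of the Aider-Polyglot two-attempt protocol.
# `model_solve` and `run_tests` are hypothetical placeholders, not the
# real aider benchmark harness.

def model_solve(prompt: str) -> str:
    """Placeholder: ask the model for a solution file's contents."""
    return "def answer():\n    return 42\n"

def run_tests(solution: str) -> tuple[bool, str]:
    """Placeholder: run the exercise's test suite, return (passed, error_log)."""
    passed = "42" in solution
    return passed, "" if passed else "AssertionError: expected 42"

def evaluate_exercise(instructions: str) -> bool:
    # Attempt 1: the model sees only the exercise instructions.
    solution = model_solve(instructions)
    passed, errors = run_tests(solution)
    if passed:
        return True
    # Attempt 2: the model additionally sees the test error output
    # and must edit its previous solution.
    retry_prompt = f"{instructions}\n\nYour solution failed:\n{errors}\nFix it."
    solution = model_solve(retry_prompt)
    passed, _ = run_tests(solution)
    return passed

# The benchmark score is the pass rate over all 225 exercises.
exercises = ["exercise-1", "exercise-2"]
score = sum(evaluate_exercise(e) for e in exercises) / len(exercises)
```

A solution that passes on the first attempt and one that passes only after seeing test errors both count as solved, which is why the benchmark measures editing ability as well as initial problem-solving.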


Progress Over Time

Interactive timeline showing model performance evolution on Aider-Polyglot


Aider-Polyglot Leaderboard

22 models • 0 verified
[Leaderboard table: 22 ranked entries with columns for organization, parameter count, context window, and input/output cost. Model names did not survive in this copy; the listed organizations include OpenAI, DeepSeek, Alibaba Cloud / Qwen Team, Moonshot AI, and others.]

FAQ

Common questions about Aider-Polyglot

What is Aider-Polyglot?
Aider-Polyglot is a coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models receive two attempts per problem, with test error feedback provided after a failed first attempt, so the benchmark measures both initial problem-solving ability and the capacity to edit code based on errors.

Where can I find the dataset?
The Aider-Polyglot dataset is available at https://github.com/Aider-AI/polyglot-benchmark.

Which model performs best?
The Aider-Polyglot leaderboard ranks 22 AI models. Currently, GPT-5 by OpenAI leads with a score of 0.880. The average score across all models is 0.581.

What is the highest Aider-Polyglot score?
The highest score is 0.880, achieved by GPT-5 from OpenAI.

How many models have been evaluated?
22 models have been evaluated on the Aider-Polyglot benchmark, with 0 verified results and 22 self-reported results.

How is Aider-Polyglot categorized?
Aider-Polyglot is categorized under code and general. The benchmark evaluates text models.