Aider-Polyglot
A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models receive two attempts to solve each problem, with test error feedback provided after the first attempt if it fails. The benchmark measures both initial problem-solving ability and capacity to edit code based on error feedback, providing an end-to-end evaluation of code generation and editing capabilities across multiple programming languages.
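To make the two-attempt protocol concrete, here is a minimal sketch of the evaluation loop in Python. This is an illustration only, not the actual aider benchmark harness; the `Exercise` and `TestResult` shapes and the model/test-runner callables are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TestResult:
    passed: bool
    errors: str = ""  # test runner output fed back to the model on failure

@dataclass
class Exercise:
    instructions: str
    stub_files: Dict[str, str]  # filename -> starter code

def evaluate_exercise(
    exercise: Exercise,
    generate: Callable[[Exercise, str], Dict[str, str]],  # (exercise, feedback) -> files
    run_tests: Callable[[Dict[str, str]], TestResult],
) -> bool:
    # Attempt 1: the model sees only the instructions and stub files.
    result = run_tests(generate(exercise, ""))
    if result.passed:
        return True
    # Attempt 2: the model also sees the test errors from attempt 1.
    return run_tests(generate(exercise, result.errors)).passed
```

A model's leaderboard score then corresponds to the fraction of the 225 exercises it solves within these two attempts.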
Progress Over Time
[Interactive timeline showing model performance evolution on Aider-Polyglot, tracing the state-of-the-art frontier across open and proprietary models; chart not reproduced here.]
Aider-Polyglot Leaderboard
22 models • 0 verified
| Rank | Organization | Params | Context | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|---|---|
| 1 | OpenAI | — | 400K | $1.25 | $10.00 |
| 2 | — | — | 1.0M | $1.25 | $10.00 |
| 3 | OpenAI | — | 200K | $2.00 | $8.00 |
| 4 | Google | — | 1.0M | $1.25 | $10.00 |
| 5 | DeepSeek | 685B | — | — | — |
| 6 | DeepSeek | 671B | 131K | $0.50 | $2.15 |
| 7 | OpenAI | — | 200K | $1.10 | $4.40 |
| 8 | DeepSeek | 671B | 164K | $0.27 | $1.00 |
| 9 | OpenAI | — | 200K | $1.10 | $4.40 |
| 10 | Google | — | 1.0M | $0.30 | $2.50 |
| 11 | Alibaba Cloud / Qwen Team | 480B | — | — | — |
| 12 | Moonshot AI | 1.0T | — | — | — |
| 12 | Moonshot AI | 1.0T | 200K | $0.50 | $0.50 |
| 14 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 | $0.80 |
| 15 | OpenAI | — | 1.0M | $2.00 | $8.00 |
| 16 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 | $1.50 |
| 17 | DeepSeek | 671B | 131K | $0.27 | $1.10 |
| 18 | Mistral AI | 24B | — | — | — |
| 19 | OpenAI | — | 1.0M | $0.40 | $1.60 |
| 20 | OpenAI | — | 128K | $2.50 | $10.00 |
| 21 | Google | — | 1.0M | $0.10 | $0.40 |
| 22 | OpenAI | — | 1.0M | $0.10 | $0.40 |
FAQ
Common questions about Aider-Polyglot
**What is Aider-Polyglot?**
Aider-Polyglot is a coding benchmark of 225 challenging Exercism exercises spanning C++, Go, Java, JavaScript, Python, and Rust. Each model gets two attempts per exercise, with test-error feedback after a failed first attempt, so the benchmark measures both initial problem-solving ability and the capacity to edit code based on errors.
**Where can I find the dataset?**
The Aider-Polyglot dataset is available at https://github.com/Aider-AI/polyglot-benchmark.
**How are models ranked?**
The Aider-Polyglot leaderboard ranks 22 AI models by their performance on this benchmark. GPT-5 by OpenAI currently leads with a score of 0.880; the average score across all models is 0.581.
**What is the highest Aider-Polyglot score?**
The highest Aider-Polyglot score is 0.880, achieved by GPT-5 from OpenAI.
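Assuming the score is the fraction of the 225 exercises solved within two attempts (consistent with the figures above), it maps back to a concrete number of solved problems. A quick sketch, with a hypothetical helper that is not part of the benchmark code:

```python
TOTAL_EXERCISES = 225  # size of the Aider-Polyglot suite

def solved_count(score: float) -> int:
    """Convert a leaderboard score back to an approximate count of solved exercises."""
    return round(score * TOTAL_EXERCISES)

print(solved_count(0.880))  # 198 of 225 exercises (GPT-5's score)
print(solved_count(0.581))  # ~131, the all-model average
```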
**How many models have been evaluated?**
22 models have been evaluated on the Aider-Polyglot benchmark. All 22 results are self-reported; none have been independently verified.
**How is Aider-Polyglot categorized?**
Aider-Polyglot falls under the code and general categories, and it evaluates text models.