Aider-Polyglot

Aider-Polyglot Leaderboard

22 models
Rank | Organization | Parameters | Context | Cost
1 | OpenAI | - | - | -
2 | - | - | - | -
3 | OpenAI | - | - | -
4 | - | - | 1.0M | $1.25 / $10.00
5 | - | 685B | - | -
6 | - | 671B | 131K | $0.55 / $2.19
7 | OpenAI | - | - | -
8 | - | 671B | - | -
9 | OpenAI | - | - | -
10 | - | - | 1.0M | $0.30 / $2.50
11 | Alibaba Cloud / Qwen Team | 480B | - | -
12 | - | 1.0T | - | -
12 | Moonshot AI | 1.0T | - | -
14 | Alibaba Cloud / Qwen Team | 235B | - | -
15 | OpenAI | - | 1.0M | $2.00 / $8.00
16 | Alibaba Cloud / Qwen Team | 80B | - | -
17 | DeepSeek | 671B | - | -
18 | - | 24B | - | -
19 | - | - | 1.0M | $0.40 / $1.60
20 | OpenAI | - | 128K | $2.50 / $10.00
21 | - | - | - | -
22 | - | - | 1.0M | $0.10 / $0.40

About this benchmark

What is Aider-Polyglot?

A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models receive two attempts to solve each problem, with test error feedback provided after the first attempt if it fails. The benchmark measures both initial problem-solving ability and capacity to edit code based on error feedback, providing an end-to-end evaluation of code generation and editing capabilities across multiple programming languages.
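The two-attempt protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `solve`, `run_tests`, `RunResult`, and the `pass_rate_1` / `pass_rate_2` keys are hypothetical names chosen for this example, and the real tooling may differ in how it prompts the model and reports results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class RunResult:
    """Outcome of running an exercise's unit tests (hypothetical type)."""
    passed: bool
    error_output: str = ""


def evaluate(
    exercises: Iterable[str],
    solve: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str, str], RunResult],
) -> Dict[str, float]:
    """Score a model with up to two attempts per exercise.

    solve(exercise, feedback) returns candidate code; feedback is None on
    the first attempt and the test error output on the second.
    run_tests(exercise, code) runs the exercise's tests against the code.
    Both callables are assumptions supplied by the caller.
    """
    exercise_list = list(exercises)
    if not exercise_list:
        raise ValueError("at least one exercise is required")

    solved_first_try = 0
    solved_within_two = 0
    for exercise in exercise_list:
        # Attempt 1: no feedback available yet.
        result = run_tests(exercise, solve(exercise, None))
        if result.passed:
            solved_first_try += 1
            solved_within_two += 1
            continue
        # Attempt 2: the model sees the test errors and edits its solution.
        retry = run_tests(exercise, solve(exercise, result.error_output))
        if retry.passed:
            solved_within_two += 1

    total = len(exercise_list)
    return {
        "pass_rate_1": solved_first_try / total,
        "pass_rate_2": solved_within_two / total,
    }
```

Reporting both rates separates initial problem-solving ability (first attempt) from the ability to repair code from error feedback (second attempt), which are the two capabilities the description names.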

Aider-Polyglot is a text benchmark categorized under general and code tasks. LLM Stats tracks 22 models on this benchmark, scored on a 0–1 scale. The current average is 0.581, and the leader scores 0.880.

Current leaders

GPT-5 from OpenAI currently leads the Aider-Polyglot leaderboard with a score of 0.880 across 22 evaluated AI models.

#1 GPT-5 (OpenAI): 88.0%
#3 o3 (OpenAI): 81.3%
#5 DeepSeek-V3.2-Exp (top open-weight model): 74.5%

FAQ

Common questions about the Aider-Polyglot benchmark and leaderboard.

What is the Aider-Polyglot leaderboard?

The Aider-Polyglot leaderboard ranks 22 AI models based on their performance on this benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.880. The average score across all models is 0.581.
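As a rough sketch of how a leader and an average are derived from 0–1 scores, the snippet below uses only the three scores this page states explicitly. The function names `summarize` and `as_percent` are hypothetical, and because the sample covers 3 of the 22 models, its mean is not the leaderboard's 0.581 average.

```python
from statistics import mean
from typing import Dict

# Only the scores quoted on this page; the other 19 models are not listed here.
SAMPLE_SCORES: Dict[str, float] = {
    "GPT-5": 0.880,
    "o3": 0.813,
    "DeepSeek-V3.2-Exp": 0.745,
}


def summarize(scores: Dict[str, float]) -> Dict[str, object]:
    """Return the top-scoring model, its score, and the mean score."""
    if not scores:
        raise ValueError("scores must not be empty")
    leader = max(scores, key=lambda name: scores[name])
    return {
        "leader": leader,
        "top_score": scores[leader],
        "average": mean(list(scores.values())),
        "count": len(scores),
    }


def as_percent(score: float) -> str:
    """Format a 0-1 score the way the leaderboard rows show it."""
    return f"{score * 100:.1f}%"


if __name__ == "__main__":
    summary = summarize(SAMPLE_SCORES)
    print(summary["leader"], as_percent(summary["top_score"]))
    print("mean of sample:", round(summary["average"], 3))
```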

What is the highest Aider-Polyglot score?

The highest Aider-Polyglot score is 0.880, achieved by GPT-5 from OpenAI.

How many models are evaluated on Aider-Polyglot?

22 models have been evaluated on the Aider-Polyglot benchmark, with 0 verified results and 22 self-reported results.

Where can I find the Aider-Polyglot dataset?

The Aider-Polyglot dataset is available at https://github.com/Aider-AI/polyglot-benchmark.

What categories does Aider-Polyglot cover?

Aider-Polyglot is categorized under general and code. The benchmark evaluates text models.

What is the best open-source model on Aider-Polyglot?

DeepSeek-V3.2-Exp by DeepSeek is the top-ranked open-source model on Aider-Polyglot, with a score of 0.745 (rank #5).

How recent are the Aider-Polyglot leaderboard results?

The Aider-Polyglot leaderboard was last updated in July 2026 and currently includes 22 evaluated models.