NL2Repo

NL2Repo evaluates long-horizon coding capabilities, including repository-level understanding: models must generate or modify code across an entire repository from a natural language specification.

NL2Repo Leaderboard

4 models
Rank  Organization               Parameters  Context  Cost           License
1     Zhipu AI                   754B        200K     $1.40 / $4.40  -
2     -                          -           205K     $0.30 / $1.20  -
3     Alibaba Cloud / Qwen Team  -           -        -              -
4     Alibaba Cloud / Qwen Team  35B         -        -              -

FAQ

Common questions about NL2Repo

The NL2Repo leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, GLM-5.1 by Zhipu AI leads with a score of 0.427. The average score across all models is 0.374.
The highest NL2Repo score is 0.427, achieved by GLM-5.1 from Zhipu AI.
Four models have been evaluated on the NL2Repo benchmark. All four results are self-reported; none is verified.
NL2Repo is categorized under agents and coding. The benchmark evaluates text models.
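
The leader's score (0.427) and the four-model average (0.374) quoted above are enough to estimate how the remaining three models score on average. The sketch below is only that arithmetic; the variable and function names are illustrative and do not come from the leaderboard. Both inputs are rounded to three decimals, so the result is approximate.

```python
# Figures quoted in the FAQ text; both are rounded to three decimals.
top_score = 0.427   # GLM-5.1, the current leader
mean_score = 0.374  # average across all evaluated models
num_models = 4


def mean_of_others(top: float, mean: float, n: int) -> float:
    """Average score of the n - 1 models other than the leader."""
    total = mean * n              # sum of all n scores
    return (total - top) / (n - 1)


others = mean_of_others(top_score, mean_score, num_models)
print(f"Mean of the other {num_models - 1} models: {others:.3f}")
print(f"Leader's margin over that mean: {top_score - others:.3f}")
```

With the published figures this gives roughly 0.356 for the other three models, so the leader sits about 0.07 above them.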