Codegolf v2.2
Leaderboard and results for the Codegolf v2.2 benchmark.
Gemma 3n E4B Instructed from Google currently leads the Codegolf v2.2 leaderboard with a score of 0.168 (16.8%) across 4 evaluated AI models, followed by Gemma 3n E4B Instructed LiteRT Preview at 16.8% and Gemma 3n E2B Instructed at 11.0%.
Progress Over Time
Interactive timeline showing model performance evolution on Codegolf v2.2
Codegolf v2.2 Leaderboard
| Rank | Model | Organization | Params | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Gemma 3n E4B Instructed | Google | 8B | 32K | $20.00 / $40.00 | — |
| 1 | Gemma 3n E4B Instructed LiteRT Preview | — | 2B | — | — | — |
| 3 | Gemma 3n E2B Instructed | Google | 8B | — | — | — |
| 3 | — | — | 2B | — | — | — |
More evaluations to explore
Related benchmarks in the same category
SWE-bench Verified is a subset of 500 software engineering problems from real GitHub issues, validated by human annotators, for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.
LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
HumanEval is a benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
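To make the docstring-to-program setup concrete, the sketch below shows a small problem in the HumanEval style: the model is given a function signature and docstring and must produce a body whose behavior is then checked with unit tests. The task, completion, and assertions here are illustrative examples, not items from the benchmark itself.

```python
# Hypothetical HumanEval-style task (not from the actual benchmark):
# the prompt is the signature plus docstring, and a harness judges
# functional correctness by executing hidden unit tests.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where each element is the maximum value seen so far.

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # A candidate completion produced by the model:
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result


# The harness then checks functional correctness with assertions like these.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
```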
Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.
Multi-SWE-bench is a multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators and is designed to evaluate large language models across diverse software ecosystems beyond Python.
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.