Codegolf v2.2
Leaderboard and results for the Codegolf v2.2 benchmark.
Gemma 3n E4B Instructed from Google currently leads the Codegolf v2.2 leaderboard with a score of 0.168 (16.8%) across 4 evaluated AI models, followed by Gemma 3n E4B Instructed LiteRT Preview at 16.8% and Gemma 3n E2B Instructed at 11.0%.
Progress Over Time
Interactive timeline showing model performance evolution on Codegolf v2.2
Codegolf v2.2 Leaderboard
| Rank | Model | Organization | Params | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Gemma 3n E4B Instructed | Google | 8B | 32K | $20.00 / $40.00 | — |
| 1 | Gemma 3n E4B Instructed LiteRT Preview | — | 2B | — | — | — |
| 3 | Gemma 3n E2B Instructed | Google | 8B | — | — | — |
| 3 | — | — | 2B | — | — | — |
More evaluations to explore
Related benchmarks in the same category
SWE-bench Verified is a subset of 500 software engineering problems from real GitHub issues, validated by human annotators, for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.
LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
HumanEval is a benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
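To make the docstring-to-program setup concrete, the sketch below shows a small problem in the HumanEval style: the model is given a function signature and docstring and must produce a body whose behavior is then checked with unit tests. The task, completion, and assertions here are illustrative examples, not items from the benchmark itself.

```python
# Hypothetical HumanEval-style task (not from the actual benchmark):
# the prompt is the signature plus docstring, and a harness judges
# functional correctness by executing hidden unit tests.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where each element is the maximum value seen so far.

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # A candidate completion produced by the model:
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result


# The harness then checks functional correctness with assertions like these.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
```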
Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.
Multi-SWE-bench is a multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators and is designed to evaluate large language models across diverse software ecosystems beyond Python.
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.