Kernel Bench L3

Kernel Bench L3 evaluates agentic GPU kernel optimization across 50 problems. Qwen reports two metrics for this benchmark: median per-problem speedup over the PyTorch eager reference and the fraction of problems faster than torch.compile.
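As a rough illustration of the two metrics described above, the sketch below computes a median per-problem speedup over eager and the fraction of problems faster than torch.compile from a few made-up timings. The field names (`eager_ms`, `compile_ms`, `kernel_ms`), the `summarize` helper, and the numbers are hypothetical, and the treatment of failed or incorrect kernels is not specified here, so this is not the benchmark's actual harness.

```python
from statistics import median


def summarize(timings):
    """Return (median speedup over eager, fraction faster than torch.compile).

    `timings` is a list of dicts with hypothetical keys:
      eager_ms   - runtime of the PyTorch eager reference
      compile_ms - runtime of the torch.compile baseline
      kernel_ms  - runtime of the model-generated kernel
    """
    # Per-problem speedup relative to the eager reference.
    speedups = [t["eager_ms"] / t["kernel_ms"] for t in timings]
    # Count problems where the generated kernel beats torch.compile.
    faster = sum(1 for t in timings if t["kernel_ms"] < t["compile_ms"])
    return median(speedups), faster / len(timings)


# Made-up timings for three problems (illustrative only).
timings = [
    {"eager_ms": 10.0, "compile_ms": 8.0, "kernel_ms": 5.0},
    {"eager_ms": 12.0, "compile_ms": 6.0, "kernel_ms": 8.0},
    {"eager_ms": 9.0, "compile_ms": 4.0, "kernel_ms": 3.0},
]

median_speedup, frac_faster = summarize(timings)
print(f"median speedup over eager: {median_speedup:.2f}x")
print(f"fraction faster than torch.compile: {frac_faster:.1%}")
```

The median is used rather than the mean so that a single very large speedup on one problem does not dominate the summary.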

Qwen3.7 Max from Alibaba Cloud / Qwen Team currently leads the Kernel Bench L3 leaderboard with a score of 0.960 (96.0%). It is the only model evaluated so far.

Progress Over Time

[Figure: timeline of model performance on Kernel Bench L3, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

Kernel Bench L3 Leaderboard

1 model
Columns: Context, Cost, License
1. Alibaba Cloud / Qwen Team: context 1.0M, cost $1.25 / $3.75

FAQ

Common questions about Kernel Bench L3.

What is the Kernel Bench L3 benchmark?

Kernel Bench L3 evaluates agentic GPU kernel optimization across 50 problems. Qwen reports two metrics for this benchmark: median per-problem speedup over the PyTorch eager reference and the fraction of problems faster than torch.compile.

What is the Kernel Bench L3 leaderboard?

The Kernel Bench L3 leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model, Qwen3.7 Max by Alibaba Cloud / Qwen Team, with a score of 0.960. With a single entry, the average score across all models is also 0.960.

What is the highest Kernel Bench L3 score?

The highest Kernel Bench L3 score is 0.960, achieved by Qwen3.7 Max from Alibaba Cloud / Qwen Team.

How many models are evaluated on Kernel Bench L3?

1 model has been evaluated on the Kernel Bench L3 benchmark, with 0 verified results and 1 self-reported result.

What categories does Kernel Bench L3 cover?

Kernel Bench L3 is categorized under systems, agents, and coding. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, as predicted answers are short and easily verifiable against reference answers.

agents
46 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents
41 models
Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

agents
23 models
SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

agents
22 models
t2-bench

t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.

agents
22 models
MCP Atlas

MCP Atlas is a benchmark for evaluating AI models on scaled tool use capabilities, measuring how well models can coordinate and utilize multiple tools across complex multi-step tasks.

agents
19 models