QwenClawBench

Name: QwenClawBench Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

QwenClawBench is a real-user-distribution Claw agent benchmark for evaluating coding agents on realistic developer tasks.

Qwen3.7-Plus from Alibaba Cloud / Qwen Team currently leads the QwenClawBench leaderboard with a score of 0.618. It is the only model evaluated so far.

About this benchmark

What QwenClawBench measures

QwenClawBench is a text benchmark that evaluates large language models on agent and coding tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the average score and the top score are both 0.618.

Qwen3.7-Plus leads with 61.8%.

Progress Over Time

[Interactive timeline: model performance on QwenClawBench over time. Legend: state-of-the-art frontier, open models, proprietary models.]

QwenClawBench Leaderboard

1 model

Rank	Model	Organization	Score	Context	Cost	License
1	Qwen3.7-Plus	Alibaba Cloud / Qwen Team	0.618	—	—	—

FAQ

Common questions about QwenClawBench.

What is the QwenClawBench benchmark?

QwenClawBench is a real-user-distribution Claw agent benchmark for evaluating coding agents on realistic developer tasks.

What is the QwenClawBench leaderboard?

The QwenClawBench leaderboard ranks AI models by their performance on this benchmark and currently lists 1 model. Qwen3.7-Plus by Alibaba Cloud / Qwen Team leads with a score of 0.618. With a single entry, the average score across all models is also 0.618.
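As a rough illustration of how the leader, the average, and the percentage figure quoted on this page relate to one another, here is a minimal Python sketch. The `entries` list and the `summarize` helper are hypothetical names introduced for this example; they are not part of any LLM Stats or QwenClawBench API, and the data is simply the single row reported above.

```python
from statistics import mean

# Hypothetical in-memory copy of the leaderboard rows quoted on this page.
# The field names are assumptions made for this sketch.
entries = [
    {
        "model": "Qwen3.7-Plus",
        "organization": "Alibaba Cloud / Qwen Team",
        "score": 0.618,      # scores are on a 0-to-1 scale
        "verified": False,   # the page reports this result as self-reported
    },
]

def summarize(rows):
    """Return the leader, average score, and top score as a percentage."""
    leader = max(rows, key=lambda r: r["score"])
    return {
        "models": len(rows),
        "leader": leader["model"],
        "top_score": leader["score"],
        "average": mean(r["score"] for r in rows),
        # A score of 0.618 out of a maximum of 1 corresponds to 61.8%.
        "top_percent": round(leader["score"] * 100, 1),
    }

summary = summarize(entries)
print(summary)
```

With only one entry the average necessarily equals the top score, which is why both figures on the page are 0.618.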

What is the highest QwenClawBench score?

The highest QwenClawBench score is 0.618, achieved by Qwen3.7-Plus from Alibaba Cloud / Qwen Team.

How many models are evaluated on QwenClawBench?

1 model has been evaluated on the QwenClawBench benchmark, with 0 verified results and 1 self-reported result.

Where can I find the QwenClawBench dataset?

The QwenClawBench dataset is available at https://github.com/SKYLENAGE-AI/QwenClawBench.

What categories does QwenClawBench cover?

QwenClawBench is categorized under agents and code. The benchmark evaluates text models.

How recent are the QwenClawBench leaderboard results?

The QwenClawBench leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code

101 models

LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

code

73 models

HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.

code

66 models

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers.

agents

49 models

Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents

47 models

GDPval-AA

GDPval-AA is an evaluation of AI model performance on economically valuable knowledge work tasks across professional domains including finance, legal, and other sectors. Run independently by Artificial Analysis, it uses Elo scoring to rank models on real-world work task performance.

agents

33 models