Web Bench

Name: Web Bench Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Web Bench Leaderboard

2 models

Rank	Model	Organization	Context	Cost	License
1	Seed 2.1 Pro	ByteDance	—	—	—
2	Seed 2.1 Turbo	ByteDance	—	—	—

About this benchmark

What is Web Bench?

Web Bench evaluates agents on realistic web-development engineering tasks, measuring end-to-end implementation in browser-based workflows.

Web Bench is a text-based benchmark categorized under agents and coding. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.760, with the leader at 0.784.

Current leaders

Seed 2.1 Pro from ByteDance currently leads the Web Bench leaderboard with a score of 0.784 across 2 evaluated AI models.

Seed 2.1 Pro (ByteDance): 78.4%

Seed 2.1 Turbo (ByteDance): 73.6%

FAQ

Common questions about the Web Bench benchmark and leaderboard.

What is the Web Bench benchmark?

Web Bench evaluates agents on realistic web-development engineering tasks, measuring end-to-end implementation in browser-based workflows.

What is the Web Bench leaderboard?

The Web Bench leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.784. The average score across all models is 0.760.
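The leader and average quoted here follow directly from the two listed scores (78.4% and 73.6%, or 0.784 and 0.736 on the 0–1 scale). A minimal sketch of that arithmetic, assuming the average is a plain unweighted mean; the variable names are illustrative and not part of any LLM Stats API:

```python
from statistics import mean

# Scores as listed on this page, on the 0-1 scale.
# The dictionary layout is an assumption made for illustration only.
scores = {
    "Seed 2.1 Pro": 0.784,
    "Seed 2.1 Turbo": 0.736,
}

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# Unweighted mean across all listed models, rounded to three decimals.
average = round(mean(scores.values()), 3)

print(f"Leader: {leader} ({scores[leader]:.3f})")
print(f"Average: {average:.3f}")
```

With only two entries, the average is simply the midpoint of the two scores.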

What is the highest Web Bench score?

The highest Web Bench score is 0.784, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on Web Bench?

Two models have been evaluated on the Web Bench benchmark. Both results are self-reported; none have been verified.

What categories does Web Bench cover?

Web Bench is categorized under agents and coding. The benchmark evaluates text models.

How recent are the Web Bench leaderboard results?

The Web Bench leaderboard was last updated in June 2026 and currently includes 2 evaluated models.