Vending-Bench 2

Progress Over Time

[Interactive timeline of model performance on Vending-Bench 2 over time. Legend: state-of-the-art frontier, open models, proprietary models.]

Vending-Bench 2 Leaderboard

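4 models

Rank  Model            Organization  Parameters  Context  Cost per 1M tokens (input / output)  License
1     Claude Opus 4.6  Anthropic     -           1.0M     $5.00 / $25.00                       -
2     GLM-5.1          Zhipu AI      754B        200K     $1.40 / $4.40                        -
3     Gemini 3 Pro     Google        -           -        -                                    -
4     -                -             -           1.0M     $0.50 / $3.00                        -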
About this benchmark

What is Vending-Bench 2?

Vending-Bench 2 tests long-horizon planning by evaluating how well AI models manage a simulated vending machine business over an extended period. It measures a model's ability to keep its tool usage and decision-making consistent across a full simulated year of operation, increasing returns without drifting off task.

Vending-Bench 2 is a text benchmark that evaluates models on reasoning and agentic tasks. LLM Stats tracks 4 models on this benchmark. The current average score is 5691.3, and the leader scores 8017.6.

Current leaders

Claude Opus 4.6 from Anthropic currently leads the Vending-Bench 2 leaderboard, which covers 4 evaluated AI models, with a score of 8017.590.

Rank  Model            Organization  Score
1     Claude Opus 4.6  Anthropic     8017.59
2     GLM-5.1          Zhipu AI      5634.41
3     Gemini 3 Pro     Google        5478.16

FAQ

Common questions about the Vending-Bench 2 benchmark and leaderboard.

What is the Vending-Bench 2 leaderboard?

The Vending-Bench 2 leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 8017.590. The average score across all models is 5691.290.
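
As a quick consistency check, the average reported above can be combined with the three scores listed under "Current leaders" to back out the score of the fourth, unlisted model. The sketch below assumes the average is a plain arithmetic mean over all 4 models and reads the Gemini 3 Pro entry as 5478.16. The variable names are illustrative, and the implied value is a derivation, not a figure published on the leaderboard.

```python
# Scores listed for the top three models on the leaderboard.
listed_scores = {
    "Claude Opus 4.6": 8017.59,
    "GLM-5.1": 5634.41,
    "Gemini 3 Pro": 5478.16,  # assumed reading of the third leaderboard entry
}
reported_average = 5691.29  # average across all models, as stated in the FAQ
model_count = 4

# Assumption: the average is a simple mean, so the total of all scores is
# average * count, and the unlisted model's score is the remainder.
implied_total = reported_average * model_count
implied_fourth_score = round(implied_total - sum(listed_scores.values()), 2)

print(f"Implied score of the fourth model: {implied_fourth_score:.2f}")
```

Under these assumptions the implied score is about 3635, which is consistent with a fourth-place ranking below Gemini 3 Pro.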

What is the highest Vending-Bench 2 score?

The highest Vending-Bench 2 score is 8017.590, achieved by Claude Opus 4.6 from Anthropic.

How many models are evaluated on Vending-Bench 2?

Four models have been evaluated on the Vending-Bench 2 benchmark. All 4 results are self-reported; none are verified.

What categories does Vending-Bench 2 cover?

Vending-Bench 2 is categorized under reasoning and agents. The benchmark evaluates text models.

What is the best open-source model on Vending-Bench 2?

GLM-5.1 by Zhipu AI is the top-ranked open-source model on Vending-Bench 2, with a score of 5634.410 (rank #2).

Which model offers the best value on Vending-Bench 2?

Claude Opus 4.6 from Anthropic is the only model scoring within 10% of the top score (the runner-up, GLM-5.1, scores about 30% lower), so it is also the cheapest model in that group, at $5.00 per million input tokens with a score of 8017.590.
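
The selection rule behind this answer can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual implementation: it assumes "within 10% of the leader" means a score of at least 90% of the top score, and it includes only the two models whose scores and input prices can both be read from this page. The `Model` class and `best_value` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: float
    input_price: float  # USD per million input tokens


def best_value(models: list[Model], tolerance: float = 0.10) -> Model:
    """Return the cheapest model whose score is within `tolerance` of the top score."""
    top_score = max(m.score for m in models)
    # Assumption: "within 10%" means score >= (1 - 0.10) * top score.
    threshold = (1.0 - tolerance) * top_score
    candidates = [m for m in models if m.score >= threshold]
    return min(candidates, key=lambda m: m.input_price)


models = [
    Model("Claude Opus 4.6", 8017.59, 5.00),
    Model("GLM-5.1", 5634.41, 1.40),
]

winner = best_value(models)
print(winner.name)
```

With these inputs the threshold is about 7215.8, so GLM-5.1 is excluded despite its lower price and Claude Opus 4.6 is returned. Widening the tolerance changes the outcome, which shows how sensitive a "best value" label is to the chosen cutoff.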

How recent are the Vending-Bench 2 leaderboard results?

The Vending-Bench 2 leaderboard was last updated in June 2026 and currently includes 4 evaluated models.