Vending-Bench 2
Vending-Bench 2 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 1.0M | $5.00 / $25.00 | |
| 2 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | |
| 3 | Google | — | — | — | |
| 4 | Google | — | 1.0M | $0.50 / $3.00 | |
What is Vending-Bench 2?
Vending-Bench 2 tests long-horizon planning by evaluating how well AI models manage a simulated vending machine business over an extended period. The benchmark measures a model's ability to sustain consistent tool use and decision-making across a full simulated year of operation, increasing returns without drifting off task.
Vending-Bench 2 is a text benchmark that evaluates models on reasoning and agentic tasks. LLM Stats tracks 4 models on this benchmark. Scores are reported as the money balance at the end of the simulated year, not on a 0 to 1 scale. The current average is 5691.3, with the leader at 8017.6.
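To put the average and leader figures in context, the short Python sketch below works out what they imply for the rest of the field. It uses only the three numbers quoted above (4 models, average 5691.3, leader 8017.6); the function name `rest_of_field_mean` is illustrative, and because the quoted figures are rounded, the derived values are approximate.

```python
# Figures quoted in the text above (rounded to one decimal place there).
N_MODELS = 4
MEAN_SCORE = 5691.3
LEADER_SCORE = 8017.6


def rest_of_field_mean(mean: float, leader: float, n: int) -> float:
    """Mean score of the n - 1 models other than the leader.

    The total of all scores is mean * n; subtracting the leader's score
    leaves the total for the remaining models.
    """
    if n < 2:
        raise ValueError("need at least two models")
    return (mean * n - leader) / (n - 1)


others_mean = rest_of_field_mean(MEAN_SCORE, LEADER_SCORE, N_MODELS)
lead_over_mean = LEADER_SCORE - MEAN_SCORE

print(f"mean of the other {N_MODELS - 1} models: {others_mean:.1f}")
print(f"leader's margin over the overall mean: {lead_over_mean:.1f}")
```

Under these figures, the three models behind the leader average roughly 4915.9, so the leader sits about 2326.3 above the overall mean.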
Compare leaders on the best AI for reasoning and best AI for agents leaderboards.
Current leaders
Claude Opus 4.6 from Anthropic currently leads the Vending-Bench 2 leaderboard with a score of 8017.59, the highest of the 4 evaluated AI models.