Vending-Bench 2
Vending-Bench 2 tests long-horizon planning by evaluating how well AI models can manage a simulated vending-machine business over extended periods. The benchmark measures a model's ability to maintain consistent tool use and decision-making across a full simulated year of operation, driving higher returns without drifting off task.
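To make concrete what a long-horizon agent evaluation of this kind involves, the sketch below runs a toy vending-machine simulation for 365 simulated days, with a scripted policy standing in for the model's tool calls. Everything here (the `VendingSim` class, tool names like `check_status` and `restock`, the demand model, and the starting balances) is an illustrative assumption, not Vending-Bench 2's actual environment or API.

```python
import random

# Minimal sketch of the kind of long-horizon loop an agent benchmark like
# Vending-Bench 2 exercises. All names, prices, and the demand model are
# illustrative assumptions, not the benchmark's real environment.

class VendingSim:
    """Toy vending-machine environment: one product, daily demand, simple tools."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.cash = 500.0        # starting bankroll (assumed)
        self.inventory = 0       # units loaded in the machine
        self.price = 2.00        # sale price per unit
        self.unit_cost = 1.00    # wholesale cost per unit
        self.day = 0

    # --- "tools" a model-driven agent could call -------------------------
    def check_status(self):
        return {"day": self.day, "cash": round(self.cash, 2),
                "inventory": self.inventory, "price": self.price}

    def set_price(self, price):
        self.price = max(0.5, float(price))

    def restock(self, units):
        cost = units * self.unit_cost
        if cost <= self.cash:
            self.cash -= cost
            self.inventory += units

    def advance_day(self):
        """Simulate one day of sales; demand falls as the price rises."""
        demand = max(0, int(self.rng.gauss(30 - 5 * self.price, 5)))
        sold = min(demand, self.inventory)
        self.inventory -= sold
        self.cash += sold * self.price
        self.day += 1

    def net_worth(self):
        return self.cash + self.inventory * self.unit_cost


def scripted_agent(status):
    """Stand-in for the model: restock whenever inventory runs low."""
    actions = []
    if status["inventory"] < 20:
        actions.append(("restock", 50))
    return actions


if __name__ == "__main__":
    sim = VendingSim()
    for _ in range(365):                      # one simulated year
        for name, arg in scripted_agent(sim.check_status()):
            getattr(sim, name)(arg)           # dispatch the tool call
        sim.advance_day()
    print(f"Net worth after a simulated year: ${sim.net_worth():.2f}")
```

The point of the year-long horizon is that small errors compound: an agent that stops restocking, mis-prices the product, or abandons its tools partway through ends the year with a much lower net worth.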
Progress Over Time
Timeline of model performance on Vending-Bench 2, showing the state-of-the-art frontier across open and proprietary models.
Vending-Bench 2 Leaderboard
3 models • 0 verified
| Rank | Model | Organization | Score | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 8017.590 | 1.0M | $5.00 / $25.00 | — |
| 2 | — | Google | — | — | — | — |
| 3 | — | Google | — | 1.0M | $0.50 / $3.00 | — |
FAQ
Common questions about Vending-Bench 2
**Which model leads the Vending-Bench 2 leaderboard?**
The Vending-Bench 2 leaderboard ranks 3 AI models by their performance on the benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 8017.590. The average score across all models is 5710.250.

**What is the highest Vending-Bench 2 score?**
The highest Vending-Bench 2 score is 8017.590, achieved by Claude Opus 4.6 from Anthropic.

**How many models have been evaluated on Vending-Bench 2?**
3 models have been evaluated on the Vending-Bench 2 benchmark, with 0 verified results and 3 self-reported results.

**What categories does Vending-Bench 2 belong to?**
Vending-Bench 2 is categorized under agents and reasoning. The benchmark evaluates text models.