
Vending-Bench 2

Vending-Bench 2 tests longer-horizon planning by evaluating how well AI models can manage a simulated vending-machine business over an extended period. The benchmark measures a model's ability to maintain consistent tool usage and decision-making across a full simulated year of operation, driving higher returns without drifting off task.
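The setup described above can be sketched as a simple agent-environment loop: each simulated day the model issues tool calls (ordering stock, adjusting prices), the environment advances, and the final score reflects the business's returns. Everything below — starting cash, unit costs, the demand model, and the tool names — is an illustrative assumption, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class VendingSim:
    """Toy vending-machine environment (all parameters are assumptions)."""
    cash: float = 500.0     # assumed starting balance
    stock: int = 0          # units currently in the machine
    unit_cost: float = 1.0  # assumed wholesale price per unit
    price: float = 2.0      # assumed retail price per unit
    day: int = 0

    def order_stock(self, units: int) -> None:
        """Tool call: buy inventory, if the balance covers it."""
        cost = units * self.unit_cost
        if cost <= self.cash:
            self.cash -= cost
            self.stock += units

    def step_day(self, demand: int = 40) -> None:
        """Advance one simulated day: sell up to `demand` units."""
        sold = min(self.stock, demand)
        self.stock -= sold
        self.cash += sold * self.price
        self.day += 1

    def net_worth(self) -> float:
        """Score proxy: cash plus inventory at cost."""
        return self.cash + self.stock * self.unit_cost

def run_year(policy) -> float:
    """Run 365 simulated days, letting the policy act before each day."""
    sim = VendingSim()
    for _ in range(365):
        policy(sim)   # the agent's decision step (tool calls go here)
        sim.step_day()
    return sim.net_worth()

# Trivial baseline policy: keep at least 50 units on hand.
def restock_policy(sim: VendingSim) -> None:
    if sim.stock < 50:
        sim.order_stock(50 - sim.stock)

score = run_year(restock_policy)
```

The point of the long horizon is that a model must keep executing a coherent policy like this for hundreds of consecutive steps; a single stretch of drift (e.g. forgetting to restock) compounds into a much lower final score.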

Progress Over Time

Interactive timeline showing model performance evolution on Vending-Bench 2

Vending-Bench 2 Leaderboard

3 models • 0 verified
| # | Model | Context | Cost (input / output per 1M tokens) | License |
|---|-------|---------|-------------------------------------|---------|
| 1 | Claude Opus 4.6 (Anthropic) | 1.0M | $5.00 / $25.00 | — |
| 2 | — | — | — | — |
| 3 | — | 1.0M | $0.50 / $3.00 | — |

FAQ

Common questions about Vending-Bench 2

**What does Vending-Bench 2 measure?**
Vending-Bench 2 tests longer-horizon planning by evaluating how well AI models can manage a simulated vending-machine business over an extended period. The benchmark measures a model's ability to maintain consistent tool usage and decision-making across a full simulated year of operation, driving higher returns without drifting off task.

**Which model leads the leaderboard?**
The Vending-Bench 2 leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 8017.590. The average score across all models is 5710.250.

**What is the highest score?**
The highest Vending-Bench 2 score is 8017.590, achieved by Claude Opus 4.6 from Anthropic.

**How many models have been evaluated?**
3 models have been evaluated on the Vending-Bench 2 benchmark, with 0 verified results and 3 self-reported results.

**How is the benchmark categorized?**
Vending-Bench 2 is categorized under agents and reasoning. The benchmark evaluates text models.