VIBE Simulation

VIBE benchmark subset for simulation code generation

MiniMax M2.1 from MiniMax currently leads the VIBE Simulation leaderboard with a score of 0.871, with 1 AI model evaluated so far.

MiniMax M2.1 leads with 87.1%.

Progress Over Time

Timeline of model performance on VIBE Simulation, showing open and proprietary models against the state-of-the-art frontier.

VIBE Simulation Leaderboard

1 model evaluated

#1  MiniMax M2.1 — score 0.871 · 230B parameters · 1.0M context · $0.30 / $1.20 per 1M tokens (input / output)

FAQ

Common questions about VIBE Simulation.

What is the VIBE Simulation benchmark?

VIBE Simulation is the subset of the VIBE benchmark that evaluates models on simulation code generation tasks; a hypothetical example of such a task is sketched below.
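
As a purely illustrative sketch (a hypothetical prompt, not an actual VIBE item), a simulation code generation task might ask a model to produce a small numerical simulation along these lines; the function name and parameters below are assumptions for illustration only.

```python
# Hypothetical example of the kind of program a simulation task might require:
# projectile motion simulated with simple Euler integration.
import math

def simulate_projectile(speed, angle_deg, dt=0.001, g=9.81):
    """Return the horizontal distance travelled before the projectile lands."""
    vx = speed * math.cos(math.radians(angle_deg))
    vy = speed * math.sin(math.radians(angle_deg))
    x, y = 0.0, 0.0
    while True:
        x += vx * dt
        vy -= g * dt
        y += vy * dt
        if y <= 0.0:  # projectile has returned to the ground
            return x

if __name__ == "__main__":
    # The analytic range for 20 m/s at 45 degrees is about 40.8 m,
    # so the simulated value should land close to that.
    print(f"range = {simulate_projectile(20.0, 45.0):.1f} m")
```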

What is the VIBE Simulation leaderboard?

The VIBE Simulation leaderboard ranks the AI models evaluated on this benchmark. Currently, only one model has been evaluated: MiniMax M2.1 by MiniMax leads with a score of 0.871, which is therefore also the average score across all models.

What is the highest VIBE Simulation score?

The highest VIBE Simulation score is 0.871, achieved by MiniMax M2.1 from MiniMax.

How many models are evaluated on VIBE Simulation?

1 model has been evaluated on the VIBE Simulation benchmark, with 0 verified results and 1 self-reported result.

What categories does VIBE Simulation cover?

VIBE Simulation is categorized under code. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code
89 models
LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff. A hypothetical output-prediction item is sketched below.

code
71 models
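
To make the code execution and test output prediction scenarios concrete, here is a purely hypothetical item in that style (not an actual LiveCodeBench problem): the model is shown a short program and must predict what it prints.

```python
# Hypothetical "test output prediction" style item: given the snippet,
# the model must predict the printed output without running the code.

def prefix_maxima(xs):
    # Running maximum of the prefix sums of xs.
    total, best = 0, []
    for x in xs:
        total += x
        best.append(max(best[-1], total) if best else total)
    return best

print(prefix_maxima([3, -1, 4, -2]))  # expected prediction: [3, 3, 6, 6]
```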
HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics. A representative problem in this style is sketched below.

code
66 models
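
For a sense of that format, here is a representative problem in the HumanEval style; the docstring, completion, and test below are illustrative assumptions rather than an actual benchmark item. The model receives the signature and docstring and must complete the body, which is then checked for functional correctness against unit tests.

```python
def running_mean(values):
    """Return a list where the i-th element is the mean of values[:i+1].

    >>> running_mean([2, 4, 6])
    [2.0, 3.0, 4.0]
    """
    # A correct completion a model might generate:
    means, total = [], 0
    for i, v in enumerate(values, start=1):
        total += v
        means.append(total / i)
    return means

# Grading is by functional correctness, e.g. against unit tests:
assert running_mean([2, 4, 6]) == [2.0, 3.0, 4.0]
```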
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

code
39 models
SWE-bench Multilingual

A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

code
27 models
Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

code
23 models