Minerva

Progress Over Time

[Interactive timeline: model performance on Minerva over time. Legend: state-of-the-art frontier, open, proprietary.]

Minerva Leaderboard

2 models

About this benchmark

What is Minerva?

Minerva is a benchmark for complex video reasoning, evaluating models on multi-step reasoning over long and information-dense video content.

Minerva is a multimodal benchmark categorized under multimodal, reasoning, video, and vision tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.683, with the leader at 0.707.

Current leaders

Seed 2.1 Pro from ByteDance currently leads the Minerva leaderboard with a score of 0.707, ahead of the one other evaluated model.

1. Seed 2.1 Pro (ByteDance): 70.7%
2. Seed 2.1 Turbo (ByteDance): 65.9%

Source paper

Title: MINERVA: Evaluating Complex Video Reasoning
Authors: Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, and 8 others
Abstract

Multimodal LLMs are turning their focus to video benchmarks, however most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.
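
Because each question has 5 answer choices, a model guessing uniformly at random would be expected to score about 0.2 on a 0–1 scale. The sketch below shows how final-answer accuracy could be computed for such a benchmark. It assumes the leaderboard score is plain multiple-choice accuracy, which this page does not confirm, and the function name, index encoding, and toy data are hypothetical rather than taken from the MINERVA release. It does not cover the reasoning-trace scoring described in the abstract.

```python
from typing import Sequence

# Each MINERVA question has 5 answer choices (per the abstract).
NUM_CHOICES = 5


def mcq_accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted choice index matches the key.

    Hypothetical helper: choices are encoded as integer indices 0..4.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Expected accuracy of uniform random guessing over 5 choices.
chance_level = 1 / NUM_CHOICES

# Toy data for illustration only; not real benchmark items.
answers = [0, 3, 2, 4, 1]
predictions = [0, 3, 1, 4, 1]

accuracy = mcq_accuracy(predictions, answers)
print(f"accuracy={accuracy}, chance={chance_level}")
```

Under that assumption, the listed scores of 0.707 and 0.659 sit well above the 0.2 chance level.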

FAQ

Common questions about the Minerva benchmark and leaderboard.

What is the Minerva leaderboard?

The Minerva leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.707. The average score across all models is 0.683.
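
The 0.683 average follows directly from the two scores listed on this page (70.7% and 65.9%). A minimal Python sketch of that arithmetic, where the dictionary name and layout are chosen here purely for illustration:

```python
from statistics import mean

# Scores as listed on this page, on the 0-1 scale (both self-reported).
scores = {
    "Seed 2.1 Pro": 0.707,
    "Seed 2.1 Turbo": 0.659,
}

# Arithmetic mean of all tracked models, rounded to three decimals.
average = round(mean(scores.values()), 3)

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

print(f"average={average}, leader={leader} ({scores[leader]})")
```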

What is the highest Minerva score?

The highest Minerva score is 0.707, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on Minerva?

Two models have been evaluated on the Minerva benchmark. Both results are self-reported; none has been verified.

Where can I find the Minerva paper?

The Minerva paper is available at https://arxiv.org/abs/2505.00681. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Minerva cover?

Minerva is categorized under multimodal, reasoning, video, and vision. The benchmark evaluates multimodal models.

How recent are the Minerva leaderboard results?

The Minerva leaderboard was last updated in June 2026 and currently includes 2 evaluated models.