
LongVideoBench

LongVideoBench is a question-answering benchmark featuring video-language interleaved inputs up to an hour long. It includes 3,763 varying-length web-collected videos with subtitles across diverse themes and 6,678 human-annotated multiple-choice questions in 17 fine-grained categories for comprehensive evaluation of long-term multimodal understanding.
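Since the benchmark scores models on multiple-choice questions, the core metric is simple accuracy over the answer set. A minimal sketch (the option letters and data layout here are illustrative assumptions, not the official evaluation harness):

```python
# Illustrative sketch of multiple-choice accuracy scoring, as used by
# MCQ benchmarks like LongVideoBench. The letter-choice format below is
# an assumption for illustration, not the official data schema.

def mcq_accuracy(predictions, gold):
    """Fraction of questions where the predicted choice matches the gold choice."""
    assert len(predictions) == len(gold), "one prediction per question"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["B", "C", "A", "D"]
gold  = ["B", "C", "D", "D"]
print(mcq_accuracy(preds, gold))  # → 0.75
```

With 6,678 questions across 17 categories, per-category accuracies can be averaged the same way for a finer-grained breakdown.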

Paper: https://arxiv.org/abs/2407.15754

Progress Over Time

(Interactive timeline showing model performance evolution on LongVideoBench; models are marked as open or proprietary against the state-of-the-art frontier.)

LongVideoBench Leaderboard (2 models)

Rank  Model      Creator                    Params  Context  Cost (in/out per 1M tokens)  Score
1     Kimi K2.5  Moonshot AI                1.0T    262K     $0.60 / $2.50                0.798
2     -          Alibaba Cloud / Qwen Team  8B      -        -                            -

Both results are self-reported.
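The listed prices are per million input and output tokens. A hedged sketch of estimating a single request's cost under that convention (the token counts below are illustrative, not measured):

```python
# Sketch: estimating per-request API cost from per-million-token prices.
# Defaults match the listed Kimi K2.5 prices ($0.60 input / $2.50 output
# per 1M tokens); the example token counts are illustrative assumptions.

def request_cost(input_tokens, output_tokens, in_price=0.60, out_price=2.50):
    """Dollar cost of one request given token counts and per-1M-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# e.g. a long video-context prompt with a short multiple-choice answer:
cost = request_cost(input_tokens=100_000, output_tokens=1_000)
print(f"${cost:.4f}")
```

Long-context benchmarks like this one are dominated by input cost: even at 100K prompt tokens, the output side contributes only fractions of a cent.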

FAQ

Common questions about LongVideoBench

What is LongVideoBench?
LongVideoBench is a question-answering benchmark featuring video-language interleaved inputs up to an hour long. It includes 3,763 varying-length web-collected videos with subtitles across diverse themes and 6,678 human-annotated multiple-choice questions in 17 fine-grained categories for comprehensive evaluation of long-term multimodal understanding.
Where can I find the LongVideoBench paper?
The LongVideoBench paper is available at https://arxiv.org/abs/2407.15754. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on the LongVideoBench leaderboard?
The LongVideoBench leaderboard ranks 2 AI models by their performance on this benchmark. Currently, Kimi K2.5 by Moonshot AI leads with a score of 0.798. The average score across all models is 0.673.
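As a quick sanity check on the stated average, the second model's score can be backed out arithmetically (a derived figure, not one stated on the page):

```python
# The page lists Kimi K2.5 at 0.798 and an average of 0.673 across the
# two evaluated models, which implies the second (unnamed) model scored
# about 2 * 0.673 - 0.798 = 0.548. This is derived, not reported.
kimi_score = 0.798
average = 0.673
implied_second = 2 * average - kimi_score
print(round(implied_second, 3))  # → 0.548
```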
What is the highest LongVideoBench score?
The highest LongVideoBench score is 0.798, achieved by Kimi K2.5 from Moonshot AI.
How many models have been evaluated on LongVideoBench?
Two models have been evaluated on the LongVideoBench benchmark, with 0 verified results and 2 self-reported results.
What categories does LongVideoBench fall under?
LongVideoBench is categorized under long context, multimodal, and vision. The benchmark evaluates multimodal models.