TempCompass

TempCompass is a comprehensive benchmark for evaluating temporal perception capabilities of Video Large Language Models (Video LLMs). It constructs conflicting videos that share identical static content but differ in specific temporal aspects to prevent models from exploiting single-frame bias. The benchmark evaluates multiple temporal aspects including action, motion, speed, temporal order, and attribute changes across diverse task formats including multi-choice QA, yes/no QA, caption matching, and caption generation.
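The effect of conflicting videos can be illustrated with a toy calculation. The sketch below is not the official evaluation code. The pair structure, the questions, and the `static_only_model` function are hypothetical, and it assumes both videos of a pair are asked the same question with different correct answers. Under those assumptions, a model that ignores temporal cues must give the same answer for both videos, so it can be correct on at most one of them.

```python
# Hypothetical conflicting pairs: each pair shares static content, but the
# correct answer differs between video A and video B.
pairs = [
    {"question": "Which direction does the car move?",
     "answers": ("left", "right")},
    {"question": "Is the person speeding up or slowing down?",
     "answers": ("speeding up", "slowing down")},
]

def static_only_model(pair):
    """Stand-in for a model with single-frame bias.

    It sees only the shared static content, so its answer cannot depend on
    which video of the pair it was shown. Here it always picks the answer
    that happens to be correct for video A.
    """
    return pair["answers"][0]

correct = 0
total = 0
for pair in pairs:
    for reference in pair["answers"]:      # video A, then video B
        prediction = static_only_model(pair)
        correct += int(prediction == reference)
        total += 1

accuracy = correct / total
print(f"static-only accuracy: {accuracy:.2f}")
```

Because the prediction is identical for both videos of a pair, accuracy in this toy setup cannot exceed one half, whichever answer the model prefers.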

Progress Over Time

[Interactive timeline showing model performance evolution on TempCompass. Legend: State-of-the-art frontier, Open, Proprietary.]

TempCompass Leaderboard

2 models

Rank  Organization               Parameters
1     Alibaba Cloud / Qwen Team  72B
2     Alibaba Cloud / Qwen Team  8B

FAQ

Common questions about TempCompass

The TempCompass paper is available at https://arxiv.org/abs/2403.00476. It describes the benchmark methodology, dataset creation, and evaluation criteria.
The TempCompass leaderboard currently ranks 2 models. Both results are self-reported; none are verified. Qwen2.5 VL 72B Instruct from Alibaba Cloud / Qwen Team leads with the highest score, 0.748, and the average score across both models is 0.732.
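Since only the leading score and the two-model average are stated, the second model's score can be back-computed. The short sketch below does that arithmetic. The variable names are illustrative, and because the published figures are rounded to three decimals, the result is only an approximation of the true value.

```python
# Figures as reported on the leaderboard (rounded to three decimals).
top_score = 0.748
mean_score = 0.732
num_models = 2

# mean = (top + second) / 2, so second = mean * 2 - top
implied_second = round(mean_score * num_models - top_score, 3)
print(f"implied second-place score: {implied_second}")
```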
TempCompass is categorized under multimodal, reasoning, and vision. The benchmark evaluates multimodal models.