MMLongBench-128K
MMLongBench-128K Leaderboard
| Rank | Model | Organization | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Seed 2.1 Pro | ByteDance | — | — | — |
| 2 | | ByteDance | — | — | — |
What is MMLongBench-128K?
MMLongBench-128K evaluates multimodal long-context understanding at a 128K token context length, testing how well vision-language models reason over very long mixed text and image inputs.
LLM Stats lists MMLongBench-128K under its multimodal, reasoning, long-context, and vision categories and currently tracks 2 models on it, scored on a 0–1 scale. The current average is about 0.8, with the leader at 0.783.
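The summary figures appear to be rounded, while the leader's exact score is quoted on this page as 0.783. The sketch below is a small consistency check under two assumptions that the page does not state: that displayed values are rounded to one decimal place, and that the average is a plain mean over the 2 tracked models. The helper names `shown` and `runner_up_bounds` are hypothetical.

```python
LEADER = 0.783  # leader score quoted on this page


def shown(score: float) -> str:
    """Format a score to one decimal place (assumed display rounding)."""
    return f"{score:.1f}"


def runner_up_bounds(leader: float, shown_avg: float = 0.8, half_step: float = 0.05):
    """Range of runner-up scores consistent with the displayed two-model average.

    Assumes a plain mean of two scores and one-decimal display rounding.
    """
    lowest_avg = shown_avg - half_step   # smallest mean that still displays as shown_avg
    low = 2 * lowest_avg - leader        # solve (leader + x) / 2 = lowest_avg for x
    return low, leader                   # the runner-up cannot score above the leader


low, high = runner_up_bounds(LEADER)
print(shown(LEADER))                     # the leader's score at one decimal
print(round(low, 3), high)               # bounds on the second score under these assumptions
```

Under these assumptions, a displayed average of 0.8 and a leader at 0.783 are compatible; the bounds are an inference from rounding, not a reported score.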
Current leaders
Seed 2.1 Pro from ByteDance currently leads the MMLongBench-128K leaderboard with a score of 0.783 across 2 evaluated AI models.
Source paper
- Title: MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
- Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, and 8 others
- Published: arXiv:2505.10610
Abstract
The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.