MMLongBench-128K

Name: MMLongBench-128K Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

MMLongBench-128K Leaderboard

2 models

Rank	Model	Organization	Context	Cost	License
1	Seed 2.1 Pro	ByteDance	n/a	n/a	n/a
2	Seed 2.1 Turbo	ByteDance	n/a	n/a	n/a

About this benchmark

What is MMLongBench-128K?

MMLongBench-128K evaluates multimodal long-context understanding at a 128K token context length, testing how well vision-language models reason over very long mixed text and image inputs.

MMLongBench-128K is categorized under multimodal, reasoning, long context, and vision tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.776, with the leader at 0.783.

Current leaders

Seed 2.1 Pro from ByteDance currently leads the MMLongBench-128K leaderboard with a score of 0.783 across 2 evaluated AI models.

Seed 2.1 Pro (ByteDance): 78.3%

Seed 2.1 Turbo (ByteDance): 76.9%

Source paper

Title: MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, and 8 others
Published: May 15, 2025
arXiv: 2505.10610

Abstract

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
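
The abstract describes a cross-modal tokenization scheme that counts vision patches and text tokens together against a standardized input length. The Python sketch below illustrates that idea only; it is not the paper's implementation. The patch side of 28 pixels, the reading of "K" as 1,024 tokens, the doubling of lengths between 8K and 128K, and the function names are all assumptions made for illustration.

```python
import math

# Assumption: the five standardized lengths double from 8K to 128K,
# and "K" means 1,024 tokens. The abstract only states "8K-128K".
TARGET_LENGTHS = [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024]


def image_tokens(width, height, patch_side=28):
    """Count one token per square patch needed to cover the image.

    patch_side is a hypothetical value, not taken from the source.
    """
    return math.ceil(width / patch_side) * math.ceil(height / patch_side)


def total_tokens(text_tokens, image_sizes, patch_side=28):
    """Combined length: text tokens plus patch tokens for every image."""
    return text_tokens + sum(
        image_tokens(w, h, patch_side) for w, h in image_sizes
    )


def fits(total, target_length):
    """True when the combined count stays within the target length."""
    return total <= target_length


if __name__ == "__main__":
    total = total_tokens(1000, [(224, 224), (448, 448)])
    print(total, fits(total, TARGET_LENGTHS[0]))
```

Under these assumptions, a 224x224 image costs 64 tokens and a 448x448 image costs 256, so the example input totals 1,320 tokens and fits the smallest length.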

FAQ

Common questions about the MMLongBench-128K benchmark and leaderboard.

What is the MMLongBench-128K leaderboard?

The MMLongBench-128K leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.783. The average score across all models is 0.776.
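
The leader and average quoted above follow directly from the two listed scores. A minimal Python sketch, assuming the scores are the 78.3% and 76.9% shown on this page converted to the 0–1 scale (the `summarize` helper is a hypothetical name used only for this example):

```python
# Scores as listed on this page, converted from percentages to the 0-1 scale.
scores = {
    "Seed 2.1 Pro": 0.783,
    "Seed 2.1 Turbo": 0.769,
}


def summarize(scores):
    """Return (leader name, leader score, mean score rounded to 3 places)."""
    leader = max(scores, key=scores.get)
    average = sum(scores.values()) / len(scores)
    return leader, scores[leader], round(average, 3)


if __name__ == "__main__":
    leader, top, average = summarize(scores)
    print(f"leader={leader} top={top} average={average}")
```

The mean of 0.783 and 0.769 is 0.776, which matches the average reported for the leaderboard.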

What is the highest MMLongBench-128K score?

The highest MMLongBench-128K score is 0.783, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on MMLongBench-128K?

Two models have been evaluated on the MMLongBench-128K benchmark. Both results are self-reported; none has been verified.

Where can I find the MMLongBench-128K paper?

The MMLongBench-128K paper is available at https://arxiv.org/abs/2505.10610. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MMLongBench-128K cover?

MMLongBench-128K is categorized under multimodal, reasoning, long context, and vision. The benchmark evaluates multimodal models.

How recent are the MMLongBench-128K leaderboard results?

The MMLongBench-128K leaderboard was last updated in June 2026 and currently includes 2 evaluated models.