LongBench v2

Progress Over Time

[Interactive timeline: model performance on LongBench v2 over time, showing the state-of-the-art frontier for open and proprietary models]

LongBench v2 Leaderboard

16 models

Rank | Organization | Parameters | Context | Cost (input / output)
1 | Alibaba Cloud / Qwen Team | 397B | - | -
2 | Alibaba Cloud / Qwen Team | - | 1.0M | $0.50 / $3.00
3 | - | 550B | - | -
4 | - | 456B | - | -
5 | - | 456B | - | -
5 | Moonshot AI | 1.0T | - | -
5 | - | 1.0T | - | -
8 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40
8 | - | 309B | - | -
10 | Alibaba Cloud / Qwen Team | 122B | - | -
11 | Alibaba Cloud / Qwen Team | 35B | - | -
12 | Alibaba Cloud / Qwen Team | 9B | - | -
13 | Alibaba Cloud / Qwen Team | 4B | - | -
14 | DeepSeek | 671B | - | -
15 | Alibaba Cloud / Qwen Team | 2B | - | -
16 | Alibaba Cloud / Qwen Team | 800M | - | -

A dash marks a value that is not listed. Tied models share a rank.
About this benchmark

What is LongBench v2?

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

LongBench v2 is a text benchmark evaluating models on reasoning, structured output, long context, and general tasks. LLM Stats tracks 16 models on this benchmark, scored on a 0–1 scale. The current average is 0.557, with the leader at 0.632.

Current leaders

Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the LongBench v2 leaderboard of 16 evaluated AI models with a score of 0.632.

1. Qwen3.5-397B-A17B (Alibaba Cloud / Qwen Team): 63.2%
2. Qwen3.6 Plus (Alibaba Cloud / Qwen Team): 62.0%

Source paper

Title
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Authors
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, and 8 others
Published
Abstract

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.
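The abstract's "surpassing the human baseline by 4%" is a gap in percentage points (57.7% versus 53.7%), not a relative improvement. The short Python sketch below shows both readings of the same two figures. The function names are illustrative and do not come from the paper.

```python
def point_gap(model_acc: float, human_acc: float) -> float:
    """Absolute gap in percentage points between two accuracies given in percent."""
    return round(model_acc - human_acc, 1)


def relative_gap(model_acc: float, human_acc: float) -> float:
    """Relative improvement over the human baseline, expressed in percent."""
    return round(100 * (model_acc - human_acc) / human_acc, 1)


# Figures quoted in the abstract: o1-preview 57.7%, human experts 53.7%.
print(point_gap(57.7, 53.7))     # 4.0 percentage points
print(relative_gap(57.7, 53.7))  # 7.4 percent relative to the human baseline
```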

FAQ

Common questions about the LongBench v2 benchmark and leaderboard.

What is the LongBench v2 benchmark?

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

What is the LongBench v2 leaderboard?

The LongBench v2 leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.632. The average score across all models is 0.557.

What is the highest LongBench v2 score?

The highest LongBench v2 score is 0.632, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models are evaluated on LongBench v2?

16 models have been evaluated on the LongBench v2 benchmark. All 16 results are self-reported; none has been verified.

Where can I find the LongBench v2 paper?

The LongBench v2 paper is available at https://arxiv.org/abs/2412.15204. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does LongBench v2 cover?

LongBench v2 is categorized under reasoning, structured output, long context, and general. The benchmark evaluates text models with multilingual support.

What is the best open-source model on LongBench v2?

Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on LongBench v2, with a score of 0.632 (rank #1).

Which model offers the best value on LongBench v2?

Among models scoring within 10% of the leader, Qwen3.5-27B from Alibaba Cloud / Qwen Team is the cheapest, at $0.30 per million input tokens with a score of 0.606.
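The selection rule in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "within 10% of the leader" is read here as a relative threshold (score at least 90% of the top score), and `Entry` and `best_value` are hypothetical names, not part of any LLM Stats API. The rows use only scores and input prices that appear on this page; the leader's price is not listed, so it is left as `None` and skipped.

```python
from typing import NamedTuple, Optional, Sequence


class Entry(NamedTuple):
    model: str
    score: float                 # accuracy on a 0-1 scale
    input_cost: Optional[float]  # USD per million input tokens, None if not listed


def best_value(entries: Sequence[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry whose score is within `tolerance` of the leader."""
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    # Assumption: "within 10%" means relative to the leader's score.
    threshold = leader_score * (1 - tolerance)
    eligible = [
        e for e in entries
        if e.score >= threshold and e.input_cost is not None
    ]
    return min(eligible, key=lambda e: e.input_cost) if eligible else None


entries = [
    Entry("Qwen3.5-397B-A17B", 0.632, None),  # price not listed on this page
    Entry("Qwen3.6 Plus", 0.620, 0.50),
    Entry("Qwen3.5-27B", 0.606, 0.30),
]
print(best_value(entries).model)  # Qwen3.5-27B
```

Reading the 10% as an absolute margin instead would change only the `threshold` line.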

How is LongBench v2 scored?

LongBench v2 is scored using accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
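As a rough illustration of accuracy scoring on multiple-choice questions, the Python sketch below counts exact matches between predicted and reference answer letters. It is not the official evaluation script: the four-letter option set, the `accuracy` function, and the sample inputs are assumptions made for this example.

```python
from typing import Sequence

# Assumption: each question has four lettered options.
VALID_CHOICES = {"A", "B", "C", "D"}


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions answered correctly, on a 0-1 scale."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = 0
    for pred, gold in zip(predictions, answers):
        pred_letter = pred.strip().upper()
        # A prediction outside the option set counts as wrong.
        if pred_letter in VALID_CHOICES and pred_letter == gold.strip().upper():
            correct += 1
    return correct / len(answers)


score = accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])
print(f"{score:.3f} ({score:.1%})")  # 0.750 (75.0%)
```

The same formatting converts a leaderboard score such as 0.632 into the 63.2% shown in the rankings.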

How recent are the LongBench v2 leaderboard results?

The LongBench v2 leaderboard was last updated in July 2026 and currently includes 16 evaluated models.