
LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems that require deep understanding and reasoning across realistic multi-task scenarios. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

Paper: https://arxiv.org/abs/2412.15204

Progress Over Time

[Interactive timeline showing model performance evolution on LongBench v2; the state-of-the-art frontier is marked, with models split into open and proprietary.]

LongBench v2 Leaderboard

13 models • 0 verified

| Rank | Organization              | Params | Context | Cost (input / output) |
|------|---------------------------|--------|---------|-----------------------|
| 1    | Alibaba Cloud / Qwen Team | 397B   | 262K    | $0.60 / $3.60         |
| 2    | —                         | 456B   | 1.0M    | $0.55 / $2.20         |
| 3    | Moonshot AI               | 1.0T   | 262K    | $0.60 / $2.50         |
| 3    | —                         | 456B   | —       | —                     |
| 5    | Alibaba Cloud / Qwen Team | 27B    | —       | —                     |
| 5    | —                         | 309B   | 256K    | $0.10 / $0.30         |
| 7    | Alibaba Cloud / Qwen Team | 122B   | 262K    | $0.40 / $3.20         |
| 8    | Alibaba Cloud / Qwen Team | 35B    | 262K    | $0.25 / $2.00         |
| 9    | Alibaba Cloud / Qwen Team | 9B     | —       | —                     |
| 10   | Alibaba Cloud / Qwen Team | 4B     | —       | —                     |
| 11   | DeepSeek                  | 671B   | 131K    | $0.27 / $1.10         |
| 12   | Alibaba Cloud / Qwen Team | 2B     | —       | —                     |
| 13   | Alibaba Cloud / Qwen Team | 800M   | —       | —                     |
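The cost columns translate directly into a per-query price once a context length is fixed. Assuming the listed figures are USD per million input and output tokens (a common convention, but not stated on this page), the cost of one long-context query can be estimated:

```python
def query_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimated USD cost of a single query, with prices quoted per 1M tokens.
    The per-1M-token convention is an assumption, not stated by the leaderboard."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# A 200K-token context with a 1K-token answer at the rank-1 model's
# listed prices ($0.60 in / $3.60 out):
cost = query_cost(200_000, 1_000, 0.60, 3.60)
print(f"${cost:.4f}")  # $0.1236
```

At these rates a single near-full-context query costs on the order of ten cents, which is why the input price dominates for long-context workloads.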

FAQ

Common questions about LongBench v2

What is LongBench v2?
LongBench v2 is a benchmark of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words, designed to assess deep understanding and reasoning in LLMs across six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

Where can I find the LongBench v2 paper?
The paper is available at https://arxiv.org/abs/2412.15204 and provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How do models rank on LongBench v2?
The leaderboard ranks 13 AI models by their performance on this benchmark. Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team currently leads with a score of 0.632; the average score across all models is 0.543.

What is the highest LongBench v2 score?
The highest score is 0.632, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
13 models have been evaluated, with 0 verified and 13 self-reported results.

How is LongBench v2 categorized?
LongBench v2 falls under general, long context, reasoning, and structured output. It evaluates text models with multilingual support.