LongBench v2
LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.
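Since LongBench v2 scores models by accuracy on multiple-choice questions, a minimal sketch can clarify how such items are turned into prompts and graded. The field names below (`context`, `question`, `choice_A`…`choice_D`, `answer`) are assumptions about the item schema, not the benchmark's confirmed API; the toy item is illustrative, not real benchmark data.

```python
# Hedged sketch of multiple-choice evaluation in the LongBench v2 style.
# Item field names are assumptions; the grading metric is plain accuracy.

def build_prompt(item: dict) -> str:
    """Format a long-context multiple-choice item into a single prompt."""
    choices = "\n".join(f"({label}) {item[f'choice_{label}']}" for label in "ABCD")
    return (
        f"{item['context']}\n\n"
        f"Question: {item['question']}\n"
        f"{choices}\n"
        "Answer with the letter of the correct choice."
    )

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Benchmark score = fraction of items answered with the correct letter."""
    correct = sum(p.strip().upper() == g.upper() for p, g in zip(predictions, golds))
    return correct / len(golds)

# Toy item (not real benchmark data); a real context would be 8k-2M words.
item = {
    "context": "A short stand-in for a very long document.",
    "question": "What does the document describe?",
    "choice_A": "A stand-in document", "choice_B": "A recipe",
    "choice_C": "A poem", "choice_D": "A contract",
    "answer": "A",
}
prompt = build_prompt(item)
score = accuracy(["A", "b", "C"], ["A", "B", "D"])  # → 2/3
```

A leaderboard score such as 0.632 corresponds to this accuracy computed over all 503 questions.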
Progress Over Time
[Interactive timeline of model performance on LongBench v2 over time, distinguishing open and proprietary models and tracing the state-of-the-art frontier.]
LongBench v2 Leaderboard
13 models • 0 verified
| # | Organization | Params | Context | Cost (in / out) | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 2 | MiniMax | 456B | 1.0M | $0.55 / $2.20 | — |
| 3 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 | — |
| 3 | MiniMax | 456B | — | — | — |
| 5 | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 5 | Xiaomi | 309B | 256K | $0.10 / $0.30 | — |
| 7 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 8 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 9 | Alibaba Cloud / Qwen Team | 9B | — | — | — |
| 10 | Alibaba Cloud / Qwen Team | 4B | — | — | — |
| 11 | DeepSeek | 671B | 131K | $0.27 / $1.10 | — |
| 12 | Alibaba Cloud / Qwen Team | 2B | — | — | — |
| 13 | Alibaba Cloud / Qwen Team | 800M | — | — | — |
FAQ
Common questions about LongBench v2
What is LongBench v2?
LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

Where can I find the LongBench v2 paper?
The LongBench v2 paper is available at https://arxiv.org/abs/2412.15204. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on LongBench v2?
The LongBench v2 leaderboard ranks 13 AI models by their performance on the benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.632. The average score across all models is 0.543.

What is the highest LongBench v2 score?
The highest LongBench v2 score is 0.632, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
13 models have been evaluated on the LongBench v2 benchmark, with 0 verified results and 13 self-reported results.

What categories does LongBench v2 cover?
LongBench v2 is categorized under general, long context, reasoning, and structured output. The benchmark evaluates text models with multilingual support.