LongBench v2

LongBench v2 is a benchmark that assesses how well LLMs handle long-context problems requiring deep understanding and reasoning across a range of real-world tasks. It consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, spread over six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

Of the 14 AI models evaluated, Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the LongBench v2 leaderboard with a score of 0.632.
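
Because every question is multiple-choice, a score on this kind of benchmark can be read as the fraction of questions answered correctly. The sketch below illustrates that reading, assuming the leaderboard score is plain accuracy. The record fields (`category`, `answer`, `prediction`) and the function name are hypothetical and are not taken from the official dataset schema or evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) for multiple-choice records.

    Each record is a dict with hypothetical keys:
      "category"   - task category name
      "answer"     - gold choice letter, e.g. "B"
      "prediction" - the model's chosen letter
    """
    if not records:
        raise ValueError("records must not be empty")

    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["category"]] += 1

    overall = sum(correct.values()) / sum(totals.values())
    per_category = {name: correct[name] / totals[name] for name in totals}
    return overall, per_category


# Toy data for illustration only; these are not real benchmark results.
toy_records = [
    {"category": "single-document QA", "answer": "A", "prediction": "A"},
    {"category": "single-document QA", "answer": "C", "prediction": "B"},
    {"category": "code repository understanding", "answer": "D", "prediction": "D"},
    {"category": "code repository understanding", "answer": "B", "prediction": "B"},
]

overall, per_category = accuracy_by_category(toy_records)
print(overall)
print(per_category)
```

Under this assumption, a score of 0.632 on 503 questions would correspond to roughly 318 correct answers.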