What is the LongBench v2 leaderboard?

The LongBench v2 leaderboard ranks 14 AI models based on their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.632. The average score across all models is 0.548.

What is the highest LongBench v2 score?

The highest LongBench v2 score is 0.632, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models are evaluated on LongBench v2?

Fourteen models have been evaluated on the LongBench v2 benchmark. All 14 results are self-reported; none has been independently verified.

Where can I find the LongBench v2 paper?

The LongBench v2 paper is available at https://arxiv.org/abs/2412.15204. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does LongBench v2 cover?

LongBench v2 is categorized under structured output, general, long context, and reasoning. The benchmark evaluates text models with multilingual support.

What is the best open-source model on LongBench v2?

Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on LongBench v2, with a score of 0.632 (rank #1).

Which model offers the best value on LongBench v2?

Among models scoring within 10% of the leader's score, Qwen3.5-35B-A3B from Alibaba Cloud / Qwen Team is the cheapest, at $0.25 per million input tokens, with a score of 0.590.
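
The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not the leaderboard's actual code. It assumes "within 10%" means a relative cutoff (score at least 90% of the top score), which is consistent with 0.590 qualifying against 0.632 but is not confirmed by the page. The `Entry` type, the `best_value` function, and the two "hypothetical" rows are invented for the example; the leader's price is not stated on this page, so it is left as unknown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    model: str
    score: float                  # accuracy on a 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if unknown


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest entry whose score is within `tolerance` of the leader.

    Assumption: the tolerance is relative to the top score, not an
    absolute number of points.
    """
    if not entries:
        return None
    top_score = max(e.score for e in entries)
    cutoff = top_score * (1 - tolerance)
    # Entries without a known price cannot be compared on cost.
    eligible = [e for e in entries
                if e.score >= cutoff and e.input_price is not None]
    if not eligible:
        return None
    return min(eligible, key=lambda e: e.input_price)


entries = [
    Entry("Qwen3.5-397B-A17B", 0.632, None),     # price not given on this page
    Entry("Qwen3.5-35B-A3B", 0.590, 0.25),       # figures from this page
    Entry("hypothetical-model-a", 0.540, 0.10),  # made up: cheap, below cutoff
    Entry("hypothetical-model-b", 0.600, 1.00),  # made up: eligible, pricier
]

winner = best_value(entries)
print(winner.model)  # Qwen3.5-35B-A3B
```

With a relative 10% tolerance the cutoff is about 0.569, so the 0.540 entry is excluded even though it is cheaper.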

How is LongBench v2 scored?

LongBench v2 is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
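
Because the benchmark consists of multiple-choice questions, accuracy is simply the fraction of questions answered correctly. The snippet below is a minimal sketch of that calculation, not the official evaluation script. The `accuracy` function and the five-question sample are hypothetical, and the letter labels are used only for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy over zero questions")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical five-question sample: three of five predictions match.
answers = ["A", "C", "B", "D", "A"]
predictions = ["A", "C", "D", "D", "B"]

print(accuracy(predictions, answers))  # 0.6
```

On the full benchmark the denominator would be the number of questions evaluated, so a score of 0.632 corresponds to roughly 63 correct answers per 100 questions.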

How recent are the LongBench v2 leaderboard results?

The LongBench v2 leaderboard was last updated in May 2026 and currently includes 14 evaluated models.

LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the LongBench v2 leaderboard with a score of 0.632 across 14 evaluated AI models.