GovReport

Name: GovReport Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

GovReport is a long-document summarization dataset of reports from U.S. government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office. Its documents and summaries are significantly longer than those in other summarization datasets.

Phi-3.5-MoE-instruct from Microsoft currently leads the GovReport leaderboard with a score of 0.264 across 2 evaluated AI models.

About this benchmark

What GovReport measures

GovReport is a text benchmark that evaluates large language models on long-context and summarization tasks. LLM Stats tracks 2 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.262, and the leader reaches 0.264.

Publication

Paper: Efficient Attentions for Long Document Summarization
Authors: Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and 1 other
Published: April 5, 2021
arXiv: 2104.02112

Abstract

The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.
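The abstract reports results as ROUGE scores, but this page does not say which ROUGE variant or tokenization underlies the leaderboard numbers. As a rough illustration only, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence of whitespace-separated, lowercased tokens. The function names `lcs_length` and `rouge_l_f1` are hypothetical, and this is not the scoring script used by the paper or by LLM Stats (it does no stemming and no sentence splitting).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 (assumption: whitespace tokens, lowercased)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Toy sentences, not taken from the dataset.
score = rouge_l_f1(
    "the agency audited the program",
    "the agency reviewed the program",
)
print(round(score, 3))
```

Scores from a sketch like this fall between 0 and 1, which matches the scale the leaderboard uses, but they should not be expected to reproduce the reported values.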

Phi-3.5-MoE-instruct leads with 26.4%, followed by Phi-3.5-mini-instruct at 25.9%.
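The leader and the average can be checked directly from the two scores reported on this page. The sketch below is a minimal example, assuming the scores are 0.264 and 0.259 on a 0 to 1 scale; the `summarize` helper is hypothetical. It uses `Decimal` so that rounding to three places is not affected by binary floating-point error.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores as listed on this page (26.4% and 25.9%).
scores = {
    "Phi-3.5-MoE-instruct": Decimal("0.264"),
    "Phi-3.5-mini-instruct": Decimal("0.259"),
}


def summarize(scores):
    """Return (leader name, leader score, mean rounded to 3 places)."""
    leader = max(scores, key=scores.get)
    mean = sum(scores.values(), Decimal(0)) / len(scores)
    mean = mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
    return leader, scores[leader], mean


leader, top, mean = summarize(scores)
print(f"{leader}: {top} (mean {mean})")
# prints "Phi-3.5-MoE-instruct: 0.264 (mean 0.262)"
```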

Progress Over Time

[Interactive timeline: model performance on GovReport over time, showing the state-of-the-art frontier for open and proprietary models.]

GovReport Leaderboard

2 models

Rank	Model	Organization	Score	Parameters	Context	Cost	License
1	Phi-3.5-MoE-instruct	Microsoft	0.264	60B	—	—	—
2	Phi-3.5-mini-instruct	Microsoft	0.259	4B	—	—	—

FAQ

Common questions about GovReport.

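What is the GovReport benchmark?

GovReport is a long-document summarization dataset of reports from U.S. government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office. Its documents and summaries are significantly longer than those in other summarization datasets.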

What is the GovReport leaderboard?

The GovReport leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.264. The average score across all models is 0.262.

What is the highest GovReport score?

The highest GovReport score is 0.264, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models are evaluated on GovReport?

2 models have been evaluated on the GovReport benchmark, with 0 verified results and 2 self-reported results.

Where can I find the GovReport paper?

The GovReport paper is available at https://arxiv.org/abs/2104.02112. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does GovReport cover?

GovReport is categorized under long context and summarization. The benchmark evaluates text models.

What is the best open-source model on GovReport?

Phi-3.5-MoE-instruct by Microsoft is the top-ranked open-source model on GovReport, with a score of 0.264 (rank #1).

How recent are the GovReport leaderboard results?

The GovReport leaderboard was last updated in June 2026 and currently includes 2 evaluated models.

More evaluations to explore

Related benchmarks in the same category

LVBench

LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.

long context, multimodal

20 models

LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

long context

16 models

AA-LCR

Artificial Analysis Long Context Reasoning benchmark.

long context

13 models

MRCR v2 (8-needle)

MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations.

long context

10 models

EgoSchema

A diagnostic benchmark for very long-form video-language understanding, consisting of over 5000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D and covering a broad range of natural human activities and behaviors.

long context, video

9 models