GovReport
A long-document summarization dataset consisting of reports from government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office, with significantly longer documents and summaries than other datasets.
Phi-3.5-MoE-instruct from Microsoft currently leads the GovReport leaderboard with a score of 0.264 (26.4%) across 2 evaluated AI models, followed by Phi-3.5-mini-instruct at 25.9%.
GovReport Leaderboard
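| Rank | Model | Organization | Parameters | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | Phi-3.5-MoE-instruct | Microsoft | 60B | 0.264 | — | — | |
| 2 | Phi-3.5-mini-instruct | Microsoft | 4B | 0.259 | 128K | $0.10 / $0.10 | |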
More evaluations to explore
Related benchmarks in the same category
LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.
LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.
Agent Arena Long Context Reasoning benchmark
A diagnostic benchmark for very long-form video-language understanding, consisting of over 5000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D and covering a broad range of natural human activities and behaviors.
A comprehensive benchmark for multi-task long video understanding that evaluates multimodal large language models on videos ranging from 3 minutes to 2 hours across 9 distinct tasks including reasoning, captioning, recognition, and summarization.