SummScreenFD
SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization, comprising pairs of TV series transcripts and human-written recaps from 88 different shows. The dataset provides a challenging testbed for abstractive summarization where plot details are often expressed indirectly in character dialogues and scattered across the entirety of the transcript, requiring models to find and integrate these details to form succinct plot descriptions.
Phi-3.5-MoE-instruct from Microsoft currently leads the SummScreenFD leaderboard with a score of 0.169. Two AI models have been evaluated so far.
What SummScreenFD measures
SummScreenFD is a text benchmark that evaluates large language models on long-context and summarization tasks. LLM Stats tracks 2 models on this benchmark, with a maximum possible score of 1. The current average across reported models is about 0.165 (0.2 when rounded to one decimal place), with the leader reaching 0.169.
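The summary figures can be checked directly from the two scores reported on this page. The snippet below is only a small arithmetic sketch; the dictionary layout and variable names are illustrative and not part of any LLM Stats API.

```python
from statistics import mean

# Scores as reported on this page, as fractions of the maximum score of 1.
scores = {
    "Phi-3.5-MoE-instruct": 0.169,
    "Phi-3.5-mini-instruct": 0.160,
}

# The leader is the model with the highest reported score.
leader = max(scores, key=scores.get)

# Unweighted mean over all reported models.
average = mean(list(scores.values()))

print(f"leader: {leader} ({scores[leader]:.3f})")
print(f"average: {average:.4f} (one decimal place: {average:.1f})")
```

At one decimal place both the average and the leading score round to 0.2, which is why the two figures can look identical in a rounded summary.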
Compare leaders on the best AI for long context and best AI for summarization leaderboards.
Publication
- Paper: SummScreen: A Dataset for Abstractive Screenplay Summarization
- Authors: Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel
- Published: arXiv 2104.07091
Abstract
We introduce SummScreen, a summarization dataset comprised of pairs of TV series transcripts and human written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.
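To make the idea of an entity-centric metric concrete, here is a minimal sketch that compares which characters a generated recap mentions against a reference recap. It is a simplified illustration, not the paper's exact metric definitions: the function names, the precision/recall formulation over name counts, and the assumption that a list of character names is supplied up front are all choices made for this example.

```python
import re
from collections import Counter


def character_counts(text, names):
    """Count whole-word, case-insensitive mentions of each character name."""
    counts = Counter()
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[name] = hits
    return counts


def character_overlap(generated, reference, names):
    """Return (precision, recall) of character mentions, treated as multisets."""
    gen = character_counts(generated, names)
    ref = character_counts(reference, names)
    # Counter & Counter keeps the minimum count per name (multiset intersection).
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Hypothetical character names and recaps, used only for illustration.
names = ["Alice", "Bob", "Carol"]
reference = "Alice and Bob argue while Carol watches."
generated = "Alice argues with Bob. Alice leaves."

precision, recall = character_overlap(generated, reference, names)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A score like this rewards recaps that mention the right characters but says nothing about whether the described events are faithful, so it would complement rather than replace an overlap metric computed on the full text.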
Phi-3.5-MoE-instruct leads with 16.9%, followed by Phi-3.5-mini-instruct at 16.0%.
SummScreenFD Leaderboard
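| Rank | Model | Organization | Size | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | Phi-3.5-MoE-instruct | Microsoft | 60B | 0.169 | — | — | |
| 2 | Phi-3.5-mini-instruct | Microsoft | 4B | 0.160 | — | — | |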
More evaluations to explore
Related benchmarks in the same category
LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.
LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.
Agent Arena Long Context Reasoning benchmark
MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations.
EgoSchema is a diagnostic benchmark for very long-form video language understanding, consisting of over 5000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D and covering a broad range of natural human activities and behaviors.