SummScreenFD

SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization, comprising pairs of TV series transcripts and human-written recaps from 88 different shows. The dataset provides a challenging testbed for abstractive summarization where plot details are often expressed indirectly in character dialogues and scattered across the entirety of the transcript, requiring models to find and integrate these details to form succinct plot descriptions.

Phi-3.5-MoE-instruct from Microsoft currently leads the SummScreenFD leaderboard with a score of 0.169 across 2 evaluated AI models.

About this benchmark

What SummScreenFD measures

SummScreenFD is a text benchmark that evaluates large language models on long-context and summarization tasks. LLM Stats tracks 2 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.165, with the leader reaching 0.169.


Publication

Paper: SummScreen: A Dataset for Abstractive Screenplay Summarization
Authors: Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel

Abstract

We introduce SummScreen, a summarization dataset comprised of pairs of TV series transcripts and human written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.
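The abstract mentions entity-centric evaluation metrics without defining them here. As a rough illustration of the idea only, the sketch below compares which characters a generated recap mentions against those in a reference recap. The function names, the name-matching rule, and the example sentences and character names are all assumptions made for this sketch; it is not the paper's exact metric definition.

```python
import re
from collections import Counter


def bag_of_characters(text, character_names):
    """Count whole-word, case-insensitive mentions of each known character name."""
    counts = Counter()
    for name in character_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            counts[name] = n
    return counts


def character_overlap(generated, reference, character_names):
    """Return (precision, recall) of character mentions, treated as multisets."""
    gen = bag_of_characters(generated, character_names)
    ref = bag_of_characters(reference, character_names)
    # Counter & Counter keeps the minimum count per name (multiset intersection).
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Made-up example data, for illustration only.
names = ["Sheldon", "Leonard", "Penny"]
reference = "Sheldon and Leonard argue about the couch. Penny mediates."
generated = "Leonard buys a couch and Sheldon objects. Sheldon sulks."
precision, recall = character_overlap(generated, reference, names)
```

In this toy case the generated recap never mentions one of the three reference characters and repeats another, so both precision and recall fall below 1.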

Microsoft Phi-3.5-MoE-instruct leads with 16.9%, followed by Microsoft Phi-3.5-mini-instruct at 16.0%.

Progress Over Time

[Interactive timeline: model performance on SummScreenFD over time. Legend: state-of-the-art frontier, open models, proprietary models.]

SummScreenFD Leaderboard

2 models

Rank  Model                              Parameters  Score
1     Phi-3.5-MoE-instruct (Microsoft)   60B         0.169
2     Phi-3.5-mini-instruct (Microsoft)  4B          0.160

FAQ

Common questions about SummScreenFD.

What is the SummScreenFD leaderboard?

The SummScreenFD leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.169. The average score across all models is 0.165.
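The leader and average quoted on this page can be cross-checked from the two reported scores (16.9% and 16.0%, written here as 0.169 and 0.160). The sketch below is a minimal check; the `summarize` helper is hypothetical, and rounding half up to three decimals is an assumption about how the page derives its displayed average.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores as reported on this page; Decimal avoids binary floating-point rounding surprises.
scores = {
    "Phi-3.5-MoE-instruct": Decimal("0.169"),
    "Phi-3.5-mini-instruct": Decimal("0.160"),
}


def summarize(scores):
    """Return (leader name, leader score, mean rounded to 3 decimals)."""
    leader = max(scores, key=scores.get)
    mean = sum(scores.values()) / len(scores)
    # Assumption: the displayed average is rounded half up to three decimals.
    rounded_mean = mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
    return leader, scores[leader], rounded_mean


leader, top_score, average = summarize(scores)
print(leader, top_score, average)
```

The exact mean of the two scores is 0.1645, so the displayed 0.165 depends on the rounding rule; a round-half-even rule would give 0.164 instead.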

What is the highest SummScreenFD score?

The highest SummScreenFD score is 0.169, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models are evaluated on SummScreenFD?

Two models have been evaluated on the SummScreenFD benchmark. Both results are self-reported; none has been independently verified.

Where can I find the SummScreenFD paper?

SummScreenFD is introduced in the SummScreen paper, available at https://arxiv.org/abs/2104.07091. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does SummScreenFD cover?

SummScreenFD is categorized under long context and summarization. The benchmark evaluates text models.

What is the best open-source model on SummScreenFD?

Phi-3.5-MoE-instruct by Microsoft is the top-ranked open-source model on SummScreenFD, with a score of 0.169 (rank #1).

How recent are the SummScreenFD leaderboard results?

The SummScreenFD leaderboard was last updated in June 2026 and currently includes 2 evaluated models.

More evaluations to explore

Related benchmarks in the same category

NoLiMa
long context
52 models
LVBench

LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.

long context, multimodal
20 models
LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

long context
16 models
AA-LCR

Artificial Analysis Long Context Reasoning benchmark.

long context
13 models
MRCR v2 (8-needle)

MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations.

long context
10 models
EgoSchema

A diagnostic benchmark for very long-form video-language understanding, consisting of over 5000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D and covering a broad range of natural human activities and behaviors.

long context, video
9 models