SummScreenFD

SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization, comprising pairs of TV series transcripts and human-written recaps from 88 different shows. The dataset provides a challenging testbed for abstractive summarization where plot details are often expressed indirectly in character dialogues and scattered across the entirety of the transcript, requiring models to find and integrate these details to form succinct plot descriptions.

Phi-3.5-MoE-instruct from Microsoft currently leads the SummScreenFD leaderboard with a score of 0.169. Only 2 models have been evaluated so far.


Phi-3.5-MoE-instruct leads with 0.169 (16.9%), followed by Phi-3.5-mini-instruct at 0.160 (16.0%).

Progress Over Time

[Interactive timeline of model performance on SummScreenFD. Legend: state-of-the-art frontier, open models, proprietary models.]

SummScreenFD Leaderboard

2 models

Rank	Model	Organization	Score	Size	Context	Cost	License
1	Phi-3.5-MoE-instruct	Microsoft	0.169	60B	—	—	—
2	Phi-3.5-mini-instruct	Microsoft	0.160	4B	128K	$0.10 / $0.10	—


FAQ

Common questions about SummScreenFD.

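What is the SummScreenFD benchmark?

SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization. It pairs TV series transcripts with human-written recaps from 88 different shows.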

What is the SummScreenFD leaderboard?

The SummScreenFD leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.169. The average score across all models is 0.165.
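The leader and the average quoted above can be reproduced from the two scores listed on this page. The snippet below is a minimal sketch, not the leaderboard's own code: the variable names are illustrative, the 0.160 value is taken from the 16.0% figure reported for Phi-3.5-mini-instruct, and rounding half up to three decimals is an assumption about how the page displays the average.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores as listed on this page (both are self-reported results).
scores = {
    "Phi-3.5-MoE-instruct": Decimal("0.169"),
    "Phi-3.5-mini-instruct": Decimal("0.160"),
}

# Arithmetic mean over all evaluated models.
mean_score = sum(scores.values()) / len(scores)

# Assumed display rule: round half up to three decimal places.
rounded_mean = mean_score.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

print(f"leader: {leader}, average: {rounded_mean}")
```

The exact mean is 0.1645, so the displayed 0.165 depends on the rounding rule. Decimal arithmetic is used here so that the half-way case is not affected by binary floating-point representation.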

What is the highest SummScreenFD score?

The highest SummScreenFD score is 0.169, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models are evaluated on SummScreenFD?

Two models have been evaluated on the SummScreenFD benchmark. Both results are self-reported; none are verified.

Where can I find the SummScreenFD paper?

The SummScreenFD paper is available at https://arxiv.org/abs/2104.07091. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does SummScreenFD cover?

SummScreenFD is categorized under long context and summarization. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category


LVBench

LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than those in existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.

long context, multimodal

20 models

LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

long context

14 models

AA-LCR

Artificial Analysis Long Context Reasoning benchmark.

long context

13 models

EgoSchema

A diagnostic benchmark for very long-form video-language understanding, consisting of over 5,000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D and covering a broad range of natural human activities and behaviors.

long context, video

9 models

MLVU

A comprehensive benchmark for multi-task long video understanding that evaluates multimodal large language models on videos ranging from 3 minutes to 2 hours across 9 distinct tasks including reasoning, captioning, recognition, and summarization.

long context, multimodal

9 models