QMSum

QMSum is a benchmark for query-based, multi-domain meeting summarization. It consists of 1,808 query-summary pairs over 232 meetings in the academic, product, and committee domains. Given a query, a model must locate the relevant spans of a long meeting transcript and summarize them. The benchmark was published at NAACL 2021.

Phi-3.5-mini-instruct from Microsoft currently leads the QMSum leaderboard with a score of 0.213. Two models have been evaluated so far.

Phi-3.5-mini-instruct leads with 0.213 (21.3%), followed by Phi-3.5-MoE-instruct at 0.199 (19.9%).

Progress Over Time

[Interactive timeline: model performance on QMSum over time, with the state-of-the-art frontier and separate markers for open and proprietary models]

QMSum Leaderboard

2 models

Rank	Model	Organization	Score	Parameters	Context	Cost	License
1	Phi-3.5-mini-instruct	Microsoft	0.213	4B	—	—	—
2	Phi-3.5-MoE-instruct	Microsoft	0.199	60B	—	—	—

FAQ

Common questions about QMSum.

What is the QMSum benchmark?

QMSum is a benchmark for query-based, multi-domain meeting summarization: given a query, a model must summarize the relevant parts of a long meeting transcript.

What is the QMSum leaderboard?

The QMSum leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Phi-3.5-mini-instruct by Microsoft leads with a score of 0.213. The average score across all models is 0.206.
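
The leader and the average quoted above can be checked from the two reported scores. The following is a minimal Python sketch, assuming the scores 0.213 and 0.199 listed on this page; the variable names are illustrative and not part of any leaderboard API.

```python
from statistics import mean

# Scores as reported on this page (self-reported, not verified).
scores = {
    "Phi-3.5-mini-instruct": 0.213,
    "Phi-3.5-MoE-instruct": 0.199,
}

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# The average is the arithmetic mean over all listed models.
average = mean(list(scores.values()))

print(leader, round(average, 3))
```

With only two entries, the average is simply the midpoint of the two scores.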

What is the highest QMSum score?

The highest QMSum score is 0.213, achieved by Phi-3.5-mini-instruct from Microsoft.

How many models are evaluated on QMSum?

Two models have been evaluated on the QMSum benchmark. Both results are self-reported; neither has been verified.

Where can I find the QMSum paper?

The QMSum paper is available at https://arxiv.org/abs/2104.05938. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does QMSum cover?

QMSum is categorized under summarization and long context. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

LVBench

LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.

long context, multimodal

20 models

LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

long context

14 models

AA-LCR

Artificial Analysis Long Context Reasoning benchmark.

long context

13 models

MRCR v2 (8-needle)

MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations.

long context

10 models

EgoSchema

EgoSchema is a diagnostic benchmark for very long-form video language understanding. It consists of over 5,000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D, covering a broad range of natural human activities and behaviors.

long context, video

9 models