XLSum English

Name: XLSum English Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Large-scale multilingual abstractive summarization dataset comprising 1 million professionally annotated article-summary pairs from BBC, covering 44 languages. XL-Sum is highly abstractive, concise, and of high quality, designed to encourage research on multilingual abstractive summarization tasks.

Llama 3.1 Nemotron 70B Instruct from NVIDIA currently leads the XLSum English leaderboard with a score of 0.316. It is the only model evaluated so far.

About this benchmark

What XLSum English measures

XLSum English is a text benchmark that evaluates large language models on language and summarization tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the current average and the leading score are the same: 0.316.

Compare leaders on the best AI for language and best AI for summarization leaderboards.

Publication

Paper: XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Authors: Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, and 4 others
Published: June 25, 2021
arXiv: 2106.13822

Abstract

Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/xl-sum.
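The abstract reports results as ROUGE-2, which measures bigram overlap between a generated summary and a reference summary. The following is a minimal sketch of a ROUGE-2 F1 computation, assuming lowercase whitespace tokenization and no stemming. It is not the official ROUGE implementation, and this page does not state which metric underlies the leaderboard score, so the function names `bigrams` and `rouge2_f1` are illustrative only.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs, using lowercase whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Clipped overlap: each bigram counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge2_f1("the cat sat", "the cat ran"))
```

In this sketch, a candidate that shares one of its two bigrams with a two-bigram reference gets precision and recall of 0.5 each. Published scores typically come from a standard ROUGE package with its own tokenization and stemming, so values from this sketch are not directly comparable.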

XLSum English Leaderboard

1 model

Rank	Model	Organization	Score	Parameters	Context	Cost	License
1	Llama 3.1 Nemotron 70B Instruct	NVIDIA	0.316	70B	—	—	

FAQ

Common questions about XLSum English.

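What is the XLSum English benchmark?

XLSum English is a text benchmark built on XL-Sum, a large-scale multilingual abstractive summarization dataset of 1 million professionally annotated article-summary pairs from BBC covering 44 languages. It evaluates large language models on language and summarization tasks.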

What is the XLSum English leaderboard?

The XLSum English leaderboard ranks AI models based on their performance on this benchmark; 1 model is currently ranked. Llama 3.1 Nemotron 70B Instruct by NVIDIA leads with a score of 0.316. With a single model reported, the average score is also 0.316.

What is the highest XLSum English score?

The highest XLSum English score is 0.316, achieved by Llama 3.1 Nemotron 70B Instruct from NVIDIA.

How many models are evaluated on XLSum English?

1 model has been evaluated on the XLSum English benchmark, with 0 verified results and 1 self-reported result.

Where can I find the XLSum English paper?

The XLSum English paper is available at https://arxiv.org/abs/2106.13822. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does XLSum English cover?

XLSum English is categorized under language and summarization. The benchmark evaluates text models with multilingual support.

What is the best open-source model on XLSum English?

Llama 3.1 Nemotron 70B Instruct by NVIDIA is the top-ranked open-source model on XLSum English, with a score of 0.316 (rank #1).

How recent are the XLSum English leaderboard results?

The XLSum English leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

language

127 models

MMLU

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.

language

100 models

MMMLU

Multilingual Massive Multitask Language Understanding dataset released by OpenAI, featuring professionally translated MMLU test questions across 14 languages including Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, and Chinese. Contains approximately 15,908 multiple-choice questions per language covering 57 subjects.

language

48 models

MMLU-Redux

An improved version of the MMLU benchmark featuring manually re-annotated questions to identify and correct errors in the original dataset. Provides more reliable evaluation metrics for language models by addressing dataset quality issues found in the original MMLU.

language

47 models

MMLU-ProX

Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. Built on the foundation of the Massive Multitask Language Understanding benchmark framework.

language

30 models

Winogrande

WinoGrande: An Adversarial Winograd Schema Challenge at Scale. A large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. Uses adversarial filtering to reduce spurious biases and provides a more robust evaluation of whether AI systems truly understand commonsense or exploit statistical shortcuts. Current best AI methods achieve 59.4-79.1% accuracy, significantly below human performance of 94.0%.

language

22 models