SuperGLUE

SuperGLUE is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. It includes 8 primary tasks: BoolQ (Boolean Questions), CB (CommitmentBank), COPA (Choice of Plausible Alternatives), MultiRC (Multi-Sentence Reading Comprehension), ReCoRD (Reading Comprehension with Commonsense Reasoning), RTE (Recognizing Textual Entailment), WiC (Word-in-Context), and WSC (Winograd Schema Challenge). The benchmark evaluates diverse language understanding capabilities including reading comprehension, commonsense reasoning, causal reasoning, coreference resolution, textual entailment, and word sense disambiguation across multiple domains.
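The overall SuperGLUE score is the macro-average of per-task scores; tasks that report two metrics (CB: accuracy and F1, MultiRC: F1a and EM, ReCoRD: F1 and EM) first average those metrics into one task score. A minimal sketch of that aggregation, using hypothetical per-task numbers purely for illustration:

```python
# Sketch: overall SuperGLUE score as the macro-average of per-task scores.
# Tasks with two reported metrics (CB, MultiRC, ReCoRD) are first reduced
# to a single task score by averaging their metrics.
# All numbers below are hypothetical illustration values.

def task_score(metrics):
    """Average a task's metric values into a single task score."""
    return sum(metrics.values()) / len(metrics)

def superglue_score(results):
    """Macro-average the per-task scores into one benchmark score."""
    return sum(task_score(m) for m in results.values()) / len(results)

# Hypothetical per-task results on a 0-100 scale.
results = {
    "BoolQ":   {"acc": 80.0},
    "CB":      {"acc": 90.0, "f1": 85.0},  # reduced to 87.5
    "COPA":    {"acc": 75.0},
    "MultiRC": {"f1a": 70.0, "em": 40.0},  # reduced to 55.0
    "ReCoRD":  {"f1": 72.0, "em": 71.0},   # reduced to 71.5
    "RTE":     {"acc": 78.0},
    "WiC":     {"acc": 69.0},
    "WSC":     {"acc": 65.0},
}

print(superglue_score(results))  # prints 72.625
```

Note that macro-averaging weights all eight tasks equally regardless of dataset size, so a small task like CB moves the overall score as much as a large one like ReCoRD.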

Paper: https://arxiv.org/abs/1905.00537

Progress Over Time

[Interactive timeline showing model performance evolution on SuperGLUE, with a state-of-the-art frontier line and markers distinguishing open from proprietary models.]

SuperGLUE Leaderboard

1 model evaluated

Rank  Model    Organization  Score  Context  Cost
1     o1-mini  OpenAI        0.750  128K     $3.00 / $12.00

FAQ

Common questions about SuperGLUE

Q: What is SuperGLUE?
A: SuperGLUE is a benchmark styled after GLUE with a set of more difficult language understanding tasks, improved resources, and a public leaderboard. Its 8 primary tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC) cover reading comprehension, commonsense reasoning, causal reasoning, coreference resolution, textual entailment, and word sense disambiguation across multiple domains.

Q: Where can I read the SuperGLUE paper?
A: The SuperGLUE paper is available at https://arxiv.org/abs/1905.00537. It details the benchmark methodology, dataset creation, and evaluation criteria.

Q: Which model leads the SuperGLUE leaderboard?
A: The leaderboard currently ranks 1 AI model. o1-mini by OpenAI leads with a score of 0.750, which is also the average score across all listed models.

Q: What is the highest SuperGLUE score?
A: The highest SuperGLUE score is 0.750, achieved by o1-mini from OpenAI.

Q: How many models have been evaluated?
A: 1 model has been evaluated on the SuperGLUE benchmark, with 0 verified results and 1 self-reported result.

Q: How is SuperGLUE categorized?
A: SuperGLUE is categorized under general, language, and reasoning, and evaluates text models.