RULER

Progress Over Time

[Figure: timeline of model performance on RULER, showing the state-of-the-art frontier for open and proprietary models]

RULER Leaderboard

4 models
Rank  Parameters
1     550B
2     120B
3     60B
4     4B
About this benchmark

What is RULER?

RULER v1 is a synthetic long-context benchmark for measuring how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation with 13 official tasks spanning retrieval, multi-hop tracing, aggregation, and QA.

RULER is a text benchmark that evaluates models on long-context and reasoning tasks. LLM Stats tracks 4 models on this benchmark, scored on a 0–1 scale. The current average is 0.894, with the leader at 0.947.

Current leaders

Nemotron 3 Ultra (550B A55B) from NVIDIA currently leads the RULER leaderboard, scoring 0.947 among the 4 evaluated models.

Source paper

Title: RULER: What's the Real Context Size of Your Long-Context Language Models?
Authors: Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, and 4 others
Published: April 2024 (arXiv:2404.06654)
Abstract

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
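The vanilla NIAH setup described in the abstract can be illustrated with a minimal sketch. This is not the NVIDIA RULER implementation: the filler sentence, the needle template, the function names, and the substring-match scoring are all assumptions chosen for illustration.

```python
# Hypothetical distractor text; the real benchmark uses its own haystack sources.
FILLER_SENTENCE = "The grass is green and the sky is blue."


def build_niah_prompt(num_filler, depth, key, value):
    """Build a haystack of filler sentences with one needle inserted.

    depth is the relative position of the needle, from 0.0 (start) to 1.0 (end).
    Returns the prompt and the answer the model is expected to retrieve.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Hypothetical needle template: a key-value fact hidden in the haystack.
    needle = f"The special magic number for {key} is {value}."
    sentences = [FILLER_SENTENCE] * num_filler
    sentences.insert(round(depth * num_filler), needle)
    question = f"What is the special magic number for {key}?"
    return " ".join(sentences) + "\n\n" + question, value


def score_response(response, expected):
    """Return 1.0 if the expected value appears in the response, else 0.0."""
    return 1.0 if expected in response else 0.0


prompt, answer = build_niah_prompt(num_filler=10, depth=0.5, key="apple", value="7421")
print(score_response("The magic number is 7421.", answer))
```

Raising `num_filler` lengthens the haystack, which is the axis along which the abstract reports performance drops. The variations the abstract mentions (more needles, different needle types) would extend this same pattern.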

FAQ

Common questions about the RULER benchmark and leaderboard.

What is the RULER benchmark?

RULER v1 is a synthetic long-context benchmark for measuring how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation with 13 official tasks spanning retrieval, multi-hop tracing, aggregation, and QA.

What is the RULER leaderboard?

The RULER leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, Nemotron 3 Ultra (550B A55B) by NVIDIA leads with a score of 0.947. The average score across all models is 0.894.
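The reported figures also imply how the other three models score on average. The sketch below is simple arithmetic on the numbers stated above, which are rounded, so the result is approximate; the variable names are illustrative.

```python
# Figures reported on this page (rounded to three decimals).
num_models = 4
average_score = 0.894
leader_score = 0.947

# The average times the model count gives the sum of all scores.
# Subtracting the leader leaves the combined score of the other three models.
others_total = average_score * num_models - leader_score
others_average = others_total / (num_models - 1)

print(f"Implied average of the other models: {others_average:.3f}")
```

This works out to roughly 0.876, about 0.07 below the leader.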

What is the highest RULER score?

The highest RULER score is 0.947, achieved by Nemotron 3 Ultra (550B A55B) from NVIDIA.

How many models are evaluated on RULER?

Four models have been evaluated on the RULER benchmark. All 4 results are self-reported; none have been verified.

Where can I find the RULER paper?

The RULER paper is available at https://arxiv.org/abs/2404.06654. The paper details the methodology, dataset construction, and evaluation criteria.

Where can I find the RULER dataset?

RULER is a synthetic benchmark, so its data is generated with the code at https://github.com/NVIDIA/RULER.

What categories does RULER cover?

RULER is categorized under long context and reasoning. The benchmark evaluates text models.

Are there variants of RULER?

Yes. RULER has 9 related variants, including RULER 1000K, RULER 128k, RULER 16k, and RULER 2048K.

What is the best open-source model on RULER?

Nemotron 3 Ultra (550B A55B) by NVIDIA is the top-ranked open-source model on RULER, with a score of 0.947 (rank #1).

How recent are the RULER leaderboard results?

The RULER leaderboard was last updated in July 2026 and currently includes 4 evaluated models.