RULER

RULER v1 is a synthetic long-context benchmark for measuring how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation with 13 official tasks spanning retrieval, multi-hop tracing, aggregation, and QA.

Paper: https://arxiv.org/abs/2404.06654 · Implementation: https://github.com/NVIDIA/RULER
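To make the task style concrete, the sketch below shows the flavor of RULER's simplest retrieval tasks (single-needle "needle in a haystack"): a key-value fact is hidden at a random depth inside repeated distractor text, and the model is asked to recover the value. The filler sentences, key/value wording, and function name here are illustrative assumptions, not the official generation templates from the NVIDIA repository; they are only a minimal sketch of how a synthetic example can be scaled to arbitrary context lengths while keeping the answer automatically checkable.

```python
import random

# Hypothetical filler text; the official RULER tasks use their own templates.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again.")


def make_niah_example(num_filler_blocks: int = 500, seed: int = 0) -> dict:
    """Build one synthetic key-value retrieval example; more filler blocks means a longer context."""
    rng = random.Random(seed)
    key = f"item-{rng.randint(1000, 9999)}"
    value = str(rng.randint(100000, 999999))
    needle = f"One of the special magic numbers for {key} is {value}."

    # Hide the needle at a random depth inside repeated distractor text.
    blocks = [FILLER] * num_filler_blocks
    blocks.insert(rng.randint(0, num_filler_blocks), needle)

    question = f"What is the special magic number for {key} mentioned in the text?"
    return {"input": " ".join(blocks) + "\n\n" + question, "answer": value}


if __name__ == "__main__":
    example = make_niah_example()
    print("expected answer:", example["answer"])
    print("context length (words):", len(example["input"].split()))
```

The official task set extends this recipe with multi-key, multi-value, and multi-query variants, alongside the multi-hop tracing, aggregation, and QA categories mentioned above.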

Progress Over Time

Interactive timeline showing model performance evolution on RULER, with a state-of-the-art frontier and separate markers for open and proprietary models.

RULER Leaderboard

3 models • 0 verified
1. Nemotron 3 Super (120B A12B): 120B parameters, 262K context, cost $0.10 / $0.50
2. 60B-parameter model
3. 4B-parameter model: 128K context, cost $0.10 / $0.10

FAQ

Common questions about RULER

RULER v1 is a synthetic long-context benchmark for measuring how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation with 13 official tasks spanning retrieval, multi-hop tracing, aggregation, and QA.
The RULER paper is available at https://arxiv.org/abs/2404.06654. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The RULER implementation and data-generation scripts are available at https://github.com/NVIDIA/RULER.
The RULER leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Nemotron 3 Super (120B A12B) by NVIDIA leads with a score of 0.917. The average score across all models is 0.877.
The highest RULER score is 0.917, achieved by Nemotron 3 Super (120B A12B) from NVIDIA.
3 models have been evaluated on the RULER benchmark, with 0 verified results and 3 self-reported results.
RULER is categorized under long context and reasoning. The benchmark evaluates text models.

Sub-benchmarks