RULER
RULER v1 is a synthetic long-context benchmark that measures how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation and includes its 13 official tasks. The tasks span retrieval (needle-in-a-haystack variants), multi-hop tracing, aggregation, and question answering (QA).
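The sketch below makes the retrieval category concrete. It builds a single-needle prompt in the spirit of RULER's needle-in-a-haystack tasks, then scores an answer by checking how many reference strings appear in the model output. Several parts are hypothetical illustrations, not the official RULER generator:

- the function names `make_niah_sample` and `score_references`
- the filler text
- the needle template

The scoring rule is assumed to resemble RULER's string-match metric.

```python
import random


def make_niah_sample(num_filler: int, depth: float, seed: int = 0) -> tuple[str, str, str]:
    """Build a hypothetical single-needle retrieval prompt.

    Not the official RULER data generator. `depth` is the relative position
    of the needle in the haystack, from 0.0 (start) to 1.0 (end).
    Returns (context, question, expected_answer).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    rng = random.Random(seed)
    key = f"key-{rng.randint(1000, 9999)}"
    value = str(rng.randint(1_000_000, 9_999_999))
    # Repetitive filler stands in for the haystack; longer filler = longer context.
    filler = ["The grass is green. The sky is blue."] * num_filler
    needle = f"The special magic number for {key} is: {value}."
    pos = round(depth * len(filler))
    haystack = filler[:pos] + [needle] + filler[pos:]
    question = f"What is the special magic number for {key}?"
    return " ".join(haystack), question, value


def score_references(prediction: str, references: list[str]) -> float:
    """Fraction of reference strings found (case-insensitively) in the prediction.

    Assumed to resemble RULER's string-match scoring; not taken from its code.
    """
    pred = prediction.lower()
    hits = sum(1 for ref in references if ref.lower() in pred)
    return hits / len(references)
```

Increasing `num_filler` while keeping the task fixed is what lets the same question probe progressively longer inputs.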
RULER Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | | 120B | 262K | $0.10 / $0.50 | |
| 2 | Microsoft | 60B | — | — | |
| 3 | Microsoft | 4B | 128K | $0.10 / $0.10 | |
Sub-benchmarks
RULER 4k
RULER 4k evaluates the official 13-task RULER v1 suite at a 4096-token context budget.
RULER 8k
RULER 8k evaluates the official 13-task RULER v1 suite at an 8192-token context budget.
RULER 16k
RULER 16k evaluates the official 13-task RULER v1 suite at a 16384-token context budget.
RULER 32k
RULER 32k evaluates the official 13-task RULER v1 suite at a 32768-token context budget.
RULER 64k
RULER 64k evaluates the official 13-task RULER v1 suite at a 65536-token context budget.
RULER 128k
RULER 128k evaluates the official 13-task RULER v1 suite at a 131072-token context budget.
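Each sub-benchmark runs the same 13 tasks at a different budget. Degradation can therefore be summarized by averaging task scores per length and comparing each average against the shortest length. The sketch below does this. Some parts are illustrative assumptions, not necessarily how this leaderboard aggregates results:

- the helper names `budget_tokens` and `degradation_report`
- the `{label: {task: score}}` input shape
- the "drop versus shortest budget" summary

The label-to-budget conversion matches the budgets listed for the sub-benchmarks, where 1k = 1024 tokens.

```python
def budget_tokens(label: str) -> int:
    """Convert a sub-benchmark label such as '4k' or '128k' to tokens (1k = 1024)."""
    return int(label.lower().rstrip("k")) * 1024


def degradation_report(
    scores: dict[str, dict[str, float]],
) -> list[tuple[str, int, float, float]]:
    """Summarize per-length results (hypothetical helper, not official RULER code).

    `scores` maps a length label to {task_name: score}. Returns rows of
    (label, budget_tokens, mean_score, drop_vs_shortest), sorted by budget.
    """
    if not scores:
        raise ValueError("no scores given")
    rows = []
    for label, per_task in scores.items():
        mean = sum(per_task.values()) / len(per_task)  # unweighted task average
        rows.append((label, budget_tokens(label), mean))
    rows.sort(key=lambda row: row[1])
    baseline = rows[0][2]  # mean score at the shortest budget
    return [(label, budget, mean, baseline - mean) for label, budget, mean in rows]
```

An unweighted mean keeps every task equally important at every length. A per-task breakdown would show which categories, such as aggregation versus retrieval, account for most of the drop.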