Model Release · Technical Deep Dive

Nemotron 3 Super: Pricing, Benchmarks, Architecture & API

NVIDIA's Nemotron 3 Super: 120B hybrid Mamba-Transformer MoE with 12B active params, 1M context, 2.2x throughput vs GPT-OSS-120B. Pricing, benchmarks, and API.

Jonathan Chavez · 14 min read

Architecture at a Glance

  • Total Parameters: 120B
  • Active per Token: 12B
  • Total Experts: 512
  • Active Experts: 22
  • Context Window: 1M tokens
  • Throughput Gain: 2.2x (vs. GPT-OSS-120B)

Expert Activation

Each token activates a different subset of 22 experts from the 512 total.

NVIDIA released Nemotron 3 Super on March 11, 2026 at GTC. It's a 120-billion-parameter hybrid Mamba-Transformer model with only 12 billion active parameters per token, a 1-million-token context window, and inference throughput 2.2x higher than GPT-OSS-120B.

The model introduces LatentMoE, a new expert routing architecture that activates 4x more experts at the same computational cost by compressing tokens into a latent space before routing. It also features native NVFP4 pretraining (trained in 4-bit precision from the first gradient update) and Multi-Token Prediction for built-in speculative decoding.

Open weights, datasets, and the complete training recipe ship under the NVIDIA Nemotron Open Model License. NVIDIA is also releasing 10 trillion pretraining tokens, 40 million post-training samples, and 21 RL environment configurations.


Architecture: LatentMoE, Mamba-2, and Multi-Token Prediction

The technical report describes three innovations stacked together, each solving a different bottleneck in serving large MoE models.

LatentMoE: More Experts at the Same Cost

Standard MoE architectures route tokens from the model's full hidden dimension directly to experts. As models grow, this routing becomes the bottleneck: expert weight loads dominate latency, and all-to-all communication scales with the hidden dimension times the number of active experts.

LatentMoE (Elango et al., 2026) projects each token from the full hidden dimension (4096) into a compressed latent space (1024) before routing. Expert computation happens entirely in this smaller dimension. This 4x reduction means:

  • Expert weight loads drop by 4x, directly reducing latency in memory-bandwidth-bound serving
  • All-to-all routing traffic drops by 4x, reducing communication overhead in distributed serving
  • These savings fund 4x more experts: 512 total experts with 22 active per token (vs. a standard MoE with ~128 experts and 6 active)

With 512 experts, the router can activate distinct specialists for Python syntax vs. SQL logic vs. mathematical reasoning, and only pay for them when needed.

Standard MoE vs LatentMoE

Compressing tokens to a 4x smaller latent space before routing enables 4x more experts at the same inference cost.
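The routing path can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not NVIDIA's implementation: the single-matrix expert, tanh activation, and softmax-over-selected-experts gating are simplifications; only the dimensions (4096 hidden, 1024 latent, 512 experts, 22 active) come from the release.

```python
import numpy as np

def latent_moe(x, W_down, W_up, experts, W_router, k=22):
    """Toy LatentMoE forward pass for a single token.

    x        : (4096,)        token hidden state
    W_down   : (4096, 1024)   compression into the latent space
    W_up     : (1024, 4096)   expansion back to the hidden dimension
    experts  : (512, 1024, 1024) one weight matrix per expert (simplified MLP)
    W_router : (1024, 512)    router logits computed in the latent space
    """
    z = x @ W_down                          # compress 4096 -> 1024 before routing
    logits = z @ W_router
    top = np.argsort(logits)[-k:]           # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    gates = w / w.sum()                     # softmax over the selected experts only
    # All expert compute and all-to-all traffic stay in the 1024-dim latent space
    mixed = sum(g * np.tanh(z @ experts[e]) for g, e in zip(gates, top))
    return mixed @ W_up                     # expand back to 4096

# Tiny demo with scaled-down dimensions so it runs instantly
rng = np.random.default_rng(0)
d, dl, n, k = 64, 16, 32, 4
y = latent_moe(rng.normal(size=d),
               rng.normal(size=(d, dl)), rng.normal(size=(dl, d)),
               rng.normal(size=(n, dl, dl)), rng.normal(size=(dl, n)), k=k)
print(y.shape)  # (64,)
```

Note that the down-projection happens once per token, so both the router and every selected expert operate on the compressed vector; that is where the 4x savings in expert weight loads and all-to-all traffic come from.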

Hybrid Mamba-Transformer Backbone

The 88-layer stack interleaves three types of layers in a periodic pattern:

  • Mamba-2 layers handle the bulk of sequence processing. State-space models provide linear-time complexity with respect to sequence length, making the 1M-token context window practical.
  • Transformer attention layers are placed at strategic depths as "global anchors." Pure SSMs can struggle with precise associative recall. The attention layers preserve this capability with Grouped-Query Attention (32 query heads, 2 KV heads).
  • LatentMoE layers are paired with every Mamba-2 block, providing sparse scaling.

This hybrid design is why Nemotron 3 Super can offer a 1M context window at the throughput levels it does. A pure 120B-parameter transformer would need a KV cache that grows linearly with context length (and attention compute that grows quadratically), which becomes prohibitive at 1M tokens. The Mamba layers carry a fixed-size state regardless of context length, eliminating that bottleneck for most of the network.
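A back-of-envelope KV-cache calculation shows why attention layers are rationed at 1M tokens. The head dimension (4096 / 32 = 128), FP16 cache entries, and the figure of 8 attention anchor layers are assumptions for illustration; the article states only the 88-layer depth and the 32-query/2-KV-head GQA configuration.

```python
def kv_cache_gib(context, n_layers, n_kv_heads=2, head_dim=128, bytes_per=2):
    """KV cache size in GiB: two tensors (K and V) per attention layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per * context / 2**30

ctx = 1_000_000
full = kv_cache_gib(ctx, 88)     # if all 88 layers used attention
hybrid = kv_cache_gib(ctx, 8)    # hypothetical: only ~8 attention "anchor" layers
print(f"all-attention: {full:.1f} GiB, hybrid: {hybrid:.1f} GiB")
```

Even with GQA's 2 KV heads already shrinking the cache 16x versus full multi-head attention, an all-attention stack would need tens of gigabytes of cache per 1M-token sequence; confining attention to a few anchor layers cuts that by another order of magnitude.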

Multi-Token Prediction

Standard language models predict one token at a time. Nemotron 3 Super is trained with Multi-Token Prediction, where specialized prediction heads forecast several future tokens simultaneously.

The MTP heads serve as a built-in draft model for speculative decoding. On SPEED-Bench, Nemotron 3 Super achieves an average acceptance length of 3.45 tokens per verification step (vs. 2.70 for DeepSeek-R1), enabling up to 3x wall-clock speedups without requiring a separate draft model.

The architecture choices are tightly coupled. LatentMoE keeps per-token compute low. Mamba layers keep memory linear. MTP accelerates generation. Together, they enable a 120B-parameter model that serves faster than dense models a fraction of its size.
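The effect of acceptance length on wall-clock speed can be approximated with a simple cost model. The 15% verification-overhead parameter is an assumption; only the acceptance lengths (3.45 and 2.70) come from the article.

```python
def speculative_speedup(accept_len, verify_overhead=0.15):
    """Rough wall-clock speedup for self-speculative decoding.

    Each verification step emits `accept_len` tokens on average but costs
    roughly one forward pass plus a drafting/verification overhead.
    """
    return accept_len / (1 + verify_overhead)

for name, a in [("Nemotron 3 Super (MTP)", 3.45), ("DeepSeek-R1", 2.70)]:
    print(f"{name}: ~{speculative_speedup(a):.1f}x")
```

Under the assumed 15% overhead, the 3.45-token acceptance length maps to roughly the 3x speedup the article cites.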


NVFP4: Training in 4-Bit Precision from Day One

Most quantized models are trained in full precision and compressed afterward, a step that typically costs accuracy. Nemotron 3 Super trains natively in NVFP4, NVIDIA's 4-bit floating-point format, from the first gradient update.

The NVFP4 format uses an E2M1 element format with 16-element micro-blocks, E4M3 scaling factors, and a second-level FP32 global scale. Select layers (attention projections, latent projections, MTP layers, and the final 15% of the network) run in BF16 or MXFP8 for stability.
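A minimal sketch of quantizing one micro-block helps make the format concrete. The E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} follows from the 1-sign/2-exponent/1-mantissa layout, but the scale-selection rule is simplified here (a plain float scale rather than an E4M3-rounded one), so treat this as an approximation of the real pipeline.

```python
import numpy as np

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bits)
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Quantize one 16-element micro-block: per-block scale + E2M1 values."""
    scale = np.abs(block).max() / 6.0        # map the largest magnitude to 6.0
    if scale == 0.0:
        return 0.0, np.zeros_like(block)
    # round each scaled magnitude to the nearest E2M1 grid point, keep the sign
    idx = np.abs(np.abs(block) / scale - E2M1[:, None]).argmin(axis=0)
    return scale, np.sign(block) * E2M1[idx]

block = np.array([0.9, -0.1, 0.45, 2.4] * 4)     # a 16-element micro-block
scale, q = quantize_block(block)
dequant = scale * q
print(np.abs(dequant - block).max())             # worst-case absolute error
```

The per-block scale is what keeps 4-bit elements usable: each micro-block gets its own dynamic range, so one outlier weight only degrades 15 of its neighbors rather than the whole tensor.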

On NVIDIA Blackwell GPUs, NVFP4 inference runs up to 4x faster than FP8 on Hopper, with no reported loss in accuracy. The release demonstrates that 4-bit pretraining is stable and viable at 25-trillion-token scale.


Benchmarks

All scores are self-reported from the model card, evaluated using NeMo Evaluator SDK. NVIDIA provides reproducibility configs.

Reasoning & Math

| Benchmark | Nemotron 3 Super | Qwen3.5-122B | GPT-OSS-120B |
| --- | --- | --- | --- |
| MMLU-Pro | 83.73% | 86.70% | 81.00% |
| AIME 2025 | 90.21% | 90.36% | 92.50% |
| HMMT Feb25 (no tools) | 93.67% | 91.40% | 90.00% |
| HMMT Feb25 (with tools) | 94.73% | – | 89.55% |
| GPQA (no tools) | 79.23% | 86.60% | 80.10% |
| GPQA (with tools) | 82.70% | – | 80.09% |
| LiveCodeBench v5 | 81.19% | 78.93% | 88.00% |
| HLE (no tools) | 18.26% | 25.30% | 14.90% |

Nemotron 3 Super leads on HMMT Feb25 by 2+ points over both competitors. The GPQA "with tools" score (82.70%) jumps 3.5 points over the "no tools" variant, suggesting tool-calling training translates directly into better science reasoning. The HLE gap (18.26% vs. Qwen3.5's 25.30%) reveals that raw scientific breadth remains an area where denser models have an edge.

Agentic & Coding

| Benchmark | Nemotron 3 Super | Qwen3.5-122B | GPT-OSS-120B |
| --- | --- | --- | --- |
| SWE-Bench Verified (OpenHands) | 60.47% | 66.40% | 41.90% |
| SWE-Bench Multilingual | 45.78% | – | 30.80% |
| Terminal Bench (hard) | 25.78% | 26.80% | 24.00% |
| Terminal Bench Core 2.0 | 31.00% | 37.50% | 18.70% |

At 60.47% on SWE-Bench Verified, Nemotron 3 Super sits ~6 points behind Qwen3.5 but delivers 2.2x the throughput. For multi-agent systems running many agents concurrently, that throughput-per-accuracy trade-off matters. The SWE-Bench Multilingual result (45.78% vs. GPT-OSS's 30.80%) stands out.

Long Context

RULER Accuracy by Context Length

| Context | Nemotron 3 Super | GPT-OSS-120B |
| --- | --- | --- |
| 256K | 96.30% | ~52% |
| 512K | 95.67% | – |
| 1M | 91.75% | 22.30% |

GPT-OSS-120B drops from roughly 52% to 22% between 256K and 1M tokens; Nemotron 3 Super loses under 5 points across the same 4x context increase.


Training Pipeline

The complete recipe is published on the Nemotron Developer Repository.

Pretraining (25T tokens)

Pretrained on 25 trillion tokens using NVFP4, spanning 10 trillion unique curated tokens. Phase 1 (80%, 20T tokens) covers broad data. Phase 2 (20%, 5T tokens) focuses on high-quality data for reasoning and coding. The pretraining data is released as Nemotron-Pre-Training-Datasets.

Supervised Fine-Tuning

Fine-tuned on ~7 million samples from a broader corpus of 40 million covering reasoning, instruction following, coding, safety, and multi-step agent tasks. Released as Nemotron-Post-Training-v3.

Multi-Environment RL

RL across 21 environment configurations using NeMo Gym and NeMo RL, generating 1.2 million environment rollouts. The RL uses asynchronous GRPO that decouples training from inference, with in-flight weight updates and MTP to accelerate rollout generation. This is the primary driver of improvements over Nemotron 3 Nano on software engineering and tool use benchmarks.
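GRPO's core idea, scoring each rollout against the other rollouts for the same prompt rather than against a learned value function, fits in a few lines. This sketch shows only the advantage computation; the asynchronous training loop, in-flight weight updates, and policy-ratio clipping are omitted.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# 8 rollouts of the same agent task, scored 0/1 by the environment
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
print(adv)  # positive for successful rollouts, negative for failures
```

Because the baseline is the group mean, no critic network is needed, which is part of what makes the asynchronous, rollout-heavy setup described above tractable at 1.2 million rollouts.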


API, Pricing & Deployment

| Provider | Input / 1M | Output / 1M | Max Context |
| --- | --- | --- | --- |
| DeepInfra | $0.10 | $0.50 | 262K |
| Fireworks AI | On-demand | On-demand | 262K |
| Together AI | Available | Available | 262K |
| Baseten | Available | Available | 262K |
| OpenRouter | Available | Available | 262K |

DeepInfra's pricing at $0.10/$0.50 makes Nemotron 3 Super one of the cheapest frontier-class models via API. That's less than double Nemotron 3 Nano's rates ($0.06/$0.24), a small premium for significantly better agentic performance.
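At those rates, the cost of an agent workload is easy to estimate. The token volumes below are made-up workload numbers for illustration; only the per-token prices come from the table above.

```python
def monthly_cost(in_tokens_m, out_tokens_m, in_rate=0.10, out_rate=0.50):
    """Monthly API cost in USD, given token volumes in millions."""
    return in_tokens_m * in_rate + out_tokens_m * out_rate

# Hypothetical: an agent fleet consuming 500M input / 80M output tokens per month
print(f"${monthly_cost(500, 80):,.2f}")   # $90.00 at DeepInfra's rates
```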

Self-hosting requires 8x H100-80GB GPUs at BF16 precision. The model is packaged as an NVIDIA NIM microservice with support for vLLM, TensorRT-LLM, and SGLang.
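Querying a hosted deployment typically goes through an OpenAI-compatible endpoint, which can be sketched with the standard library alone. Everything provider-specific here is an assumption: the base URL and the model identifier `nvidia/nemotron-3-super` are placeholders, so check your provider's documentation and model list before use.

```python
import json
import urllib.request

API_BASE = "https://api.deepinfra.com/v1/openai"   # assumed; check your provider's docs
MODEL = "nvidia/nemotron-3-super"                   # hypothetical id; check the model list

def build_request(prompt, api_key, max_tokens=512):
    """Build an OpenAI-compatible chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_request("Summarize LatentMoE in one sentence.", api_key="sk-...")
print(req.full_url)
# To send: json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"]
```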


Open Resources

NVIDIA is releasing the full stack:

  • Open model weights under the NVIDIA Nemotron Open Model License
  • Nemotron-Pre-Training-Datasets: 10 trillion curated pretraining tokens
  • Nemotron-Post-Training-v3: 40 million post-training samples
  • 21 RL environment configurations and 1.2 million environment rollouts
  • The complete training recipe, published on the Nemotron Developer Repository

Releasing RL environments and rollout data, not just weights and SFT data, allows researchers to replicate or improve the agentic training pipeline. This level of openness is rare at this capability tier.


Limitations

  • Hardware: 8x H100-80GB for full BF16. MoE requires memory for all 120B parameters, even though only 12B are active.
  • Conversational quality: Arena-Hard V2 at 73.88% trails GPT-OSS-120B's 90.26%. Optimized for agentic execution, not chat.
  • Scientific reasoning: GPQA at 79.23% lags behind Qwen3.5's 86.60%.
  • Provider context limits: Most APIs cap at 262K. Self-hosting required for 1M.
  • New architecture: The Mamba-Transformer hybrid with LatentMoE is less production-tested than pure transformer MoE designs.

Nemotron 3 Super makes a clear bet: optimize for throughput and agentic accuracy at the expense of conversational polish and raw scientific reasoning. At 2.2x the throughput of GPT-OSS-120B with comparable benchmark scores, and at $0.10/$0.50 per million tokens, it's positioned as the efficiency play for multi-agent systems.

Download weights from HuggingFace, try the API on build.nvidia.com, or read the full technical report.

Frequently Asked Questions

What is the Nemotron 3 Super context window?
The Nemotron 3 Super context window supports up to 1 million tokens. RULER benchmark scores show 96.30% accuracy at 256K, 95.67% at 512K, and 91.75% at 1M tokens. Most API providers cap at 262K; self-hosting is required for full 1M context.

How much does Nemotron 3 Super cost?
Nemotron 3 Super pricing on DeepInfra is $0.10/1M input and $0.50/1M output tokens. Self-hosting requires 8x H100-80GB GPUs at BF16 precision.

How does Nemotron 3 Super compare to GPT-OSS-120B?
Nemotron 3 Super delivers 2.2x higher throughput than GPT-OSS-120B with comparable accuracy. It scores 60.47% on SWE-Bench Verified (OpenHands) vs. GPT-OSS's 41.90%, and 91.75% on RULER at 1M tokens vs. GPT-OSS's 22.30%.

What is LatentMoE?
LatentMoE projects tokens from the full hidden dimension (4096) into a compressed latent space (1024) before expert routing. This 4x reduction enables 512 total experts with 22 active per token at the same cost as a standard MoE with fewer experts.
