Nemotron 3 Super: Pricing, Benchmarks, Architecture & API
NVIDIA's Nemotron 3 Super: 120B hybrid Mamba-Transformer MoE with 12B active params, 1M context, 2.2x throughput vs GPT-OSS-120B. Pricing, benchmarks, and API.


Architecture at a Glance
Expert Activation: each token activates a different subset of 22 of the 512 experts.
NVIDIA released Nemotron 3 Super on March 11, 2026 at GTC. It's a 120-billion-parameter hybrid Mamba-Transformer model with only 12 billion active parameters per token, a 1-million-token context window, and inference throughput 2.2x that of GPT-OSS-120B.
The model introduces LatentMoE, a new expert routing architecture that activates 4x more experts at the same computational cost by compressing tokens into a latent space before routing. It also features native NVFP4 pretraining (trained in 4-bit precision from the first gradient update) and Multi-Token Prediction for built-in speculative decoding.
Open weights, datasets, and the complete training recipe ship under the NVIDIA Nemotron Open Model License. NVIDIA is also releasing 10 trillion pretraining tokens, 40 million post-training samples, and 21 RL environment configurations.
Architecture: LatentMoE, Mamba-2, and Multi-Token Prediction
The technical report describes three innovations stacked together, each solving a different bottleneck in serving large MoE models.
LatentMoE: More Experts at the Same Cost
Standard MoE architectures route tokens from the model's full hidden dimension directly to experts. As models grow, this routing becomes the bottleneck: expert weight loads dominate latency, and all-to-all communication scales with the hidden dimension times the number of active experts.
LatentMoE (Elango et al., 2026) projects each token from the full hidden dimension (4096) into a compressed latent space (1024) before routing. Expert computation happens entirely in this smaller dimension. This 4x reduction means:
- Expert weight loads drop by 4x, directly reducing latency in memory-bandwidth-bound serving
- All-to-all routing traffic drops by 4x, reducing communication overhead in distributed serving
- These savings fund 4x more experts: 512 total experts with 22 active per token (vs. a standard MoE with ~128 experts and 6 active)
With 512 experts, the router can activate distinct specialists for Python syntax vs. SQL logic vs. mathematical reasoning, and only pay for them when needed.
Compressing tokens to a 4x smaller latent space before routing enables 4x more experts at the same inference cost.
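The routing path can be sketched in NumPy. Dimensions are scaled down to keep the sketch small (the real model uses 4096 → 1024 latent, 512 experts, 22 active), each expert is reduced to a single latent-space matrix rather than an MLP, and all weight names are illustrative, not NVIDIA's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins; real model: D_MODEL=4096, D_LATENT=1024, 512 experts, top-22
D_MODEL, D_LATENT = 512, 128
N_EXPERTS, TOP_K = 64, 8

W_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02   # compress before routing
W_up = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02     # project back out
W_router = rng.standard_normal((D_LATENT, N_EXPERTS)) * 0.02
experts = rng.standard_normal((N_EXPERTS, D_LATENT, D_LATENT)) * 0.02

def latent_moe(x):
    """Route one token through a LatentMoE-style layer (sketch)."""
    z = x @ W_down                                   # token now lives in the latent dim
    logits = z @ W_router                            # routing happens in latent space
    top = np.argpartition(logits, -TOP_K)[-TOP_K:]   # pick the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                             # softmax over selected experts
    out = sum(g * (z @ experts[e]) for g, e in zip(gates, top))
    return out @ W_up                                # back to the model dimension

y = latent_moe(rng.standard_normal(D_MODEL))
print(y.shape)
```

Because both the expert matrices and the all-to-all payload scale with the latent dimension, shrinking it 4x is what pays for the 4x larger expert pool.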
Hybrid Mamba-Transformer Backbone
The 88-layer stack interleaves three types of layers in a periodic pattern:
- Mamba-2 layers handle the bulk of sequence processing. State-space models provide linear-time complexity with respect to sequence length, making the 1M-token context window practical.
- Transformer attention layers are placed at strategic depths as "global anchors." Pure SSMs can struggle with precise associative recall. The attention layers preserve this capability with Grouped-Query Attention (32 query heads, 2 KV heads).
- LatentMoE layers are paired with every Mamba-2 block, providing sparse scaling.
This hybrid design is why Nemotron 3 Super can offer a 1M context window at the throughput levels it does. In a pure 120B-parameter transformer, the KV cache would grow linearly with context length and attention compute quadratically. The Mamba layers, whose state is fixed-size regardless of context, eliminate that bottleneck for most of the network.
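A back-of-envelope comparison shows why this matters at 1M tokens. The head dimension (4096 / 32 query heads = 128) follows from the stated GQA configuration; the count of attention layers in the hybrid (8 of 88) is an assumption for illustration, since the exact placement isn't given here:

```python
# KV-cache size: a full-attention 88-layer transformer vs. the hybrid, where
# only a few GQA anchor layers keep a KV cache at all (Mamba state is constant).
# kv_heads=2 and head_dim=128 follow from the article's GQA numbers.

def kv_cache_gib(n_attn_layers, context_tokens, kv_heads=2, head_dim=128, bytes_per=2):
    # K and V per layer, per token, in BF16 (2 bytes per value)
    per_token = n_attn_layers * 2 * kv_heads * head_dim * bytes_per
    return per_token * context_tokens / 2**30

ctx = 1_000_000
full = kv_cache_gib(88, ctx)    # if every layer were attention
hybrid = kv_cache_gib(8, ctx)   # assumed: attention only at a few anchor depths
print(f"full: {full:.1f} GiB, hybrid: {hybrid:.1f} GiB")
```

Even under generous assumptions, the hybrid's cache is an order of magnitude smaller, which is what makes 1M-token serving practical on a single node.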
Multi-Token Prediction
Standard language models predict one token at a time. Nemotron 3 Super is trained with Multi-Token Prediction, where specialized prediction heads forecast several future tokens simultaneously.
The MTP heads serve as a built-in draft model for speculative decoding. On SPEED-Bench, Nemotron 3 Super achieves an average acceptance length of 3.45 tokens per verification step (vs. 2.70 for DeepSeek-R1), enabling up to 3x wall-clock speedups without requiring a separate draft model.
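The accept/verify loop behind speculative decoding can be illustrated with toy stand-ins for the target model and the MTP draft heads. Both functions below are fabricated for the sketch; only the control flow mirrors the real scheme:

```python
K = 4  # tokens drafted per verification step

def target_next(seq):          # stand-in for the full model's next-token argmax
    return (sum(seq) * 31 + len(seq)) % 100

def draft_next(seq):           # stand-in for the MTP heads: an imperfect draft
    t = target_next(seq)
    return t if len(seq) % 3 else (t + 1) % 100   # wrong at every 3rd position

def speculative_step(seq):
    drafts, s = [], list(seq)
    for _ in range(K):                 # cheap: draft K tokens autoregressively
        drafts.append(draft_next(s))
        s.append(drafts[-1])
    accepted, s = [], list(seq)
    for d in drafts:                   # in reality one batched verify pass
        t = target_next(s)
        if d == t:
            accepted.append(d); s.append(d)        # draft matched: keep it
        else:
            accepted.append(t); s.append(t)        # correct the draft and stop
            break
    else:
        accepted.append(target_next(s))            # bonus token if all accepted
    return accepted

out = speculative_step([1, 2])
print(len(out))   # tokens emitted from a single target-model pass
```

The reported 3.45 average acceptance length means each verification pass emits roughly three and a half tokens, which is where the up-to-3x wall-clock speedup comes from.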
The architecture choices are tightly coupled. LatentMoE keeps per-token compute low. Mamba layers keep memory linear. MTP accelerates generation. Together, they enable a 120B-parameter model that serves faster than dense models a fraction of its size.
NVFP4: Training in 4-Bit Precision from Day One
Most quantized models are trained in full precision and compressed afterward, which typically costs accuracy. Nemotron 3 Super instead trains natively in NVFP4, NVIDIA's 4-bit floating-point format, from the first gradient update.
The NVFP4 format uses an E2M1 element format with 16-element micro-blocks, E4M3 scaling factors, and a second-level FP32 global scale. Select layers (attention projections, latent projections, MTP layers, and the final 15% of the network) run in BF16 or MXFP8 for stability.
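A simplified version of the block-quantization step, assuming the standard E2M1 value grid ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. The per-block scale is kept in full precision here rather than E4M3, so this is a sketch of the idea, not the exact format:

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])    # signed representable values

def quantize_block(x16):
    """Quantize one 16-element micro-block to the E2M1 grid and dequantize."""
    scale = max(np.abs(x16).max() / 6.0, 1e-12)            # map block max onto ±6
    idx = np.abs(x16[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale                                # nearest grid point, rescaled

rng = np.random.default_rng(0)
x = rng.standard_normal(16).astype(np.float32)
xq = quantize_block(x)
err = np.abs(x - xq).max()
print(f"max abs error: {err:.3f}")
```

The per-block scale is why 4-bit elements stay usable: each group of 16 values gets its own dynamic range, so outliers in one block don't crush the resolution of another.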
On NVIDIA Blackwell GPUs, NVFP4 inference runs up to 4x faster than FP8 on Hopper with no loss in accuracy. This demonstrates that 4-bit pretraining at 25T tokens is stable and viable at scale.
Benchmarks
All scores are self-reported from the model card, evaluated using NeMo Evaluator SDK. NVIDIA provides reproducibility configs.
Reasoning & Math
| Benchmark | Nemotron 3 Super | Qwen3.5-122B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro | 83.73% | 86.70% | 81.00% |
| AIME 2025 | 90.21% | 90.36% | 92.50% |
| HMMT Feb25 (no tools) | 93.67% | 91.40% | 90.00% |
| HMMT Feb25 (with tools) | 94.73% | 89.55% | — |
| GPQA (no tools) | 79.23% | 86.60% | 80.10% |
| GPQA (with tools) | 82.70% | — | 80.09% |
| LiveCodeBench v5 | 81.19% | 78.93% | 88.00% |
| HLE (no tools) | 18.26% | 25.30% | 14.90% |
Nemotron 3 Super leads on HMMT Feb25 by 2+ points over both competitors. The GPQA "with tools" score (82.70%) jumps 3.5 points over the "no tools" variant, suggesting tool-calling training translates directly into better science reasoning. The HLE gap (18.26% vs. Qwen3.5's 25.30%) reveals that raw scientific breadth remains an area where denser models have an edge.
Agentic & Coding
| Benchmark | Nemotron 3 Super | Qwen3.5-122B | GPT-OSS-120B |
|---|---|---|---|
| SWE-Bench Verified (OpenHands) | 60.47% | 66.40% | 41.90% |
| SWE-Bench Multilingual | 45.78% | — | 30.80% |
| Terminal Bench (hard) | 25.78% | 26.80% | 24.00% |
| Terminal Bench Core 2.0 | 31.00% | 37.50% | 18.70% |
At 60.47% on SWE-Bench Verified, Nemotron 3 Super sits ~6 points behind Qwen3.5 but delivers 2.2x the throughput. For multi-agent systems running many agents concurrently, that throughput-per-accuracy trade-off matters. The SWE-Bench Multilingual result stands out: 45.78% vs. GPT-OSS's 30.80%, a nearly 15-point lead.
Long Context
GPT-OSS-120B drops from 52% to 22% between 256K and 1M tokens. Nemotron 3 Super loses under 5 points across a 4x context increase.
Training Pipeline
The complete recipe is published on the Nemotron Developer Repository.
Pretraining (25T tokens)
Pretrained on 25 trillion tokens using NVFP4, spanning 10 trillion unique curated tokens. Phase 1 (80%, 20T tokens) covers broad data. Phase 2 (20%, 5T tokens) focuses on high-quality data for reasoning and coding. The pretraining data is released as Nemotron-Pre-Training-Datasets.
Supervised Fine-Tuning
Fine-tuned on ~7 million samples from a broader corpus of 40 million covering reasoning, instruction following, coding, safety, and multi-step agent tasks. Released as Nemotron-Post-Training-v3.
Multi-Environment RL
RL across 21 environment configurations using NeMo Gym and NeMo RL, generating 1.2 million environment rollouts. The RL uses asynchronous GRPO that decouples training from inference, with in-flight weight updates and MTP to accelerate rollout generation. This is the primary driver of improvements over Nemotron 3 Nano on software engineering and tool use benchmarks.
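The core of GRPO is a critic-free, group-relative advantage: each rollout's reward is normalized against the other rollouts of the same prompt. A minimal sketch (the asynchronous scheduling and in-flight weight updates are omitted):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Advantage for each rollout in a group, relative to its own group."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)   # no learned value function needed

# e.g. 6 rollouts of one prompt in an RL environment, scored pass/fail
adv = grpo_advantages([1, 0, 0, 1, 1, 0])
print(adv.round(3))
```

Because the baseline is the group mean rather than a critic's estimate, rollout generation and training can be decoupled cleanly, which is what the asynchronous setup exploits.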
API, Pricing & Deployment
| Provider | Input / 1M | Output / 1M | Max Context |
|---|---|---|---|
| DeepInfra | $0.10 | $0.50 | 262K |
| Fireworks AI | On-demand | On-demand | 262K |
| Together AI | Available | Available | 262K |
| Baseten | Available | Available | 262K |
| OpenRouter | Available | Available | 262K |
DeepInfra's pricing at $0.10/$0.50 makes Nemotron 3 Super one of the cheapest frontier-class models via API. That's less than double the rates of Nemotron 3 Nano ($0.06/$0.24), a modest premium for significantly better agentic performance.
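At those rates, a quick cost model makes the comparison concrete (prices per 1M tokens as listed above; the token counts in the example are hypothetical):

```python
RATES = {  # (input, output) USD per 1M tokens, DeepInfra list prices
    "Nemotron 3 Super": (0.10, 0.50),
    "Nemotron 3 Nano":  (0.06, 0.24),
}

def cost_usd(model, input_tokens, output_tokens):
    inp, out = RATES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# e.g. one agent run with 2M input tokens (context re-reads) and 200K output tokens
for m in RATES:
    print(m, round(cost_usd(m, 2_000_000, 200_000), 3))
```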
Self-hosting requires 8x H100-80GB GPUs at BF16 precision. The model is packaged as an NVIDIA NIM microservice with support for vLLM, TensorRT-LLM, and SGLang.
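The 8-GPU requirement follows from weight memory alone; a quick sanity check, with the remainder left over for KV cache, Mamba state, and activations:

```python
# MoE caveat: all 120B parameters must be resident in GPU memory,
# even though only 12B are active for any given token.
PARAMS = 120e9
BYTES_BF16 = 2

weights_gb = PARAMS * BYTES_BF16 / 1e9   # BF16 weight footprint
cluster_gb = 8 * 80                      # 8x H100-80GB
print(f"weights: {weights_gb:.0f} GB, cluster: {cluster_gb} GB")
```

Roughly 240 GB of weights against 640 GB of aggregate HBM, so the configuration fits with headroom for long-context serving state.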
Open Resources
NVIDIA is releasing the full stack:
- Checkpoints: BF16, FP8, NVFP4, and base model
- Pretraining data: 10T curated tokens
- Post-training: 40M samples for SFT, preference data, and RL trajectories
- RL environments: 21 configurations and 37 datasets for multi-step agent training
- Training recipe: Complete pretraining, SFT, and RL recipes in the Nemotron repo
Releasing RL environments and rollout data, not just weights and SFT data, allows researchers to replicate or improve the agentic training pipeline. This level of openness is rare at this capability tier.
Limitations
- Hardware: 8x H100-80GB for full BF16. MoE requires memory for all 120B parameters, even though only 12B are active.
- Conversational quality: Arena-Hard V2 at 73.88% trails GPT-OSS-120B's 90.26%. Optimized for agentic execution, not chat.
- Scientific reasoning: GPQA at 79.23% lags behind Qwen3.5's 86.60%.
- Provider context limits: Most APIs cap at 262K. Self-hosting required for 1M.
- New architecture: The Mamba-Transformer hybrid with LatentMoE is less production-tested than pure transformer MoE designs.
Nemotron 3 Super makes a clear bet: optimize for throughput and agentic accuracy at the expense of conversational polish and raw scientific reasoning. At 2.2x the throughput of GPT-OSS-120B with comparable benchmark scores, and at $0.10/$0.50 per million tokens, it's positioned as the efficiency play for multi-agent systems.
Download weights from HuggingFace, try the API on build.nvidia.com, or read the full technical report.