Model Release · Technical Deep Dive

Nemotron 3 Nano: Complete Guide to Pricing, Context Window, Benchmarks & API

A comprehensive look at NVIDIA's Nemotron 3 Nano — the hybrid Mamba-Transformer MoE model with 1M context window, 4x faster inference, open weights under NVIDIA Open Model License, and what it means for agentic AI.

Sebastian Crossa
Co-Founder @ LLM Stats
·7 min read

NVIDIA released Nemotron 3 Nano on December 15, 2025. It's a hybrid Mamba-Transformer model with a 1 million token context window and inference speeds up to 4x faster than its predecessor. The architecture is built around efficiency: 31.6 billion total parameters with only 3.6 billion active per token through Mixture-of-Experts (MoE) routing.

The model ships under the NVIDIA Open Model License, which allows commercial use. According to NVIDIA's official announcement, Nemotron 3 Nano is the first in a family of models designed for agentic AI workloads.


Quick Takeaways

  • Nemotron 3 Nano uses a hybrid Mamba2-Transformer architecture with MoE routing, activating only 3.6B of 31.6B parameters per token
  • The 1 million token context window is supported by RULER benchmark scores of 87.5% at 64K and 70.56% at 512K tokens
  • 4x faster inference than Nemotron 2 Nano, optimized for NVIDIA A100 and H100 GPUs
  • Outperforms Qwen3 30B-A3B on math (+21.74 points on MATH), code (HumanEval 78.05%), and reasoning benchmarks
  • Available through 7+ API providers including Baseten, DeepInfra, Fireworks, Together AI, and Amazon Bedrock
  • Open weights under NVIDIA Open Model License allow commercial use, fine-tuning, and redistribution
  • Trained on 10.6 trillion tokens with 33% synthetic data for math, code, and tool-calling

Nemotron 3 Nano Key Specifications

Nemotron 3 Nano Overview

View Nemotron 3 Nano details on LLM Stats ->

  • Release Date: December 15, 2025
  • Total Parameters: 31.6 billion (30B-A3B variant)
  • Active Parameters: ~3.6 billion per token
  • Architecture: Hybrid Mamba2-Transformer Mixture-of-Experts (MoE)
  • Context Window: 1 million tokens
  • Training Data: ~10.6 trillion tokens (with ~3.5 trillion synthetic)
  • Supported Languages: 20 languages including English, Spanish, French, German, Japanese, Chinese, Korean, Arabic
  • License: NVIDIA Open Model License (commercial use permitted)
  • Deployment: Hugging Face, vLLM, TensorRT-LLM, SGLang

The Nemotron 3 Nano API is available through Baseten, DeepInfra, Fireworks, FriendliAI, OpenRouter, and Together AI.
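
Most of these providers expose an OpenAI-compatible endpoint, so a standard chat-completions client is enough to get started. Here's a minimal sketch assuming OpenRouter's endpoint; the model slug is illustrative, so check your provider's catalog for the exact identifier:

```python
# Minimal sketch: Nemotron 3 Nano via an OpenAI-compatible endpoint.
# The base_url and model slug are illustrative -- confirm both with your provider.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # any OpenAI-compatible provider works
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # hypothetical slug
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the trade-offs of Mamba-Transformer hybrids."},
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The same pattern works against Together AI, DeepInfra, or Fireworks by swapping the base URL and model identifier.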


Architecture: The Mamba-Transformer Hybrid

The Nemotron 3 Nano technical report describes a hybrid architecture that uses both state-space models and transformers.

Nemotron 3 Nano Architecture Diagram

*Nemotron 3 Nano's hybrid architecture combines Mamba-2 layers for efficient long-context processing, Transformer attention layers for precise reasoning, and MoE routing that activates only 3.6B of 31.6B parameters per token. This design enables a 1 million token context window with 4x faster inference than previous generations.*

Why Hybrid?

Traditional transformers scale quadratically with sequence length. Every token attends to every other token, which gets expensive at 1 million tokens. State-space models like Mamba handle long sequences efficiently but can struggle with fine-grained reasoning. Research on hybrid architectures shows that combining both approaches can capture the benefits of each.

The hybrid splits the work (a conceptual routing sketch follows this list):

  • Mamba-2 layers handle long-context processing with near-linear scaling
  • Transformer attention layers handle precise reasoning where token relationships matter
  • MoE routing activates only 3.6B of the 31.6B parameters per token
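
To make the routing idea concrete, here's a conceptual top-k MoE sketch in PyTorch. This is not NVIDIA's implementation; the dimensions and expert counts are made up, and it only illustrates why most parameters stay idle for any given token:

```python
# Conceptual top-k MoE routing (illustrative only, not NVIDIA's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=2048, n_experts=64, top_k=2, d_ff=4096):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # only top_k experts run per token
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask][:, k:k + 1] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(8, 2048)
print(moe(tokens).shape)  # torch.Size([8, 2048])
```

Because only the routed experts execute, the total parameter count (31.6B) and the active compute per token (~3.6B) can diverge sharply.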

Inference Speed

The Nemotron 3 Nano latency numbers:

  • 4x faster than Nemotron 2 Nano
  • 3.3x faster than comparable models in its size class
  • Optimized for NVIDIA A100 and H100 GPUs

For agentic workflows, coding assistants, or customer-facing applications, the speed difference is noticeable.


Nemotron 3 Nano Benchmarks: How It Performs

Nemotron 3 Nano Benchmarks

View full benchmark data on Hugging Face ->

Nemotron 3 Nano benchmarks compared to Qwen3 30B-A3B, a similar MoE model:

| Benchmark | Nemotron 3 Nano | Qwen3 30B-A3B |
| --- | --- | --- |
| MMLU (5-shot) | 78.56% | 81.07% |
| MMLU-Pro (5-shot, CoT) | 65.05% | 61.71% |
| AGIEval-En (CoT) | 68.32% | 63.12% |
| HumanEval (0-shot) | 78.05% | 70.73% |
| MBPP-Sanitized (3-shot) | 75.49% | 73.15% |
| GSM8K (8-shot) | 92.34% | 89.01% |
| MATH (4-shot) | 82.88% | 61.14% |
| HellaSwag (10-shot) | 85.56% | 83.14% |
| PIQA (0-shot) | 84.33% | 81.01% |

Observations:

  • Math: The MATH benchmark shows a +21.74 point improvement over Qwen3
  • Code: Leads on HumanEval (78.05% vs 70.73%) and MBPP (75.49% vs 73.15%)
  • Reasoning: Outperforms on MMLU-Pro and AGIEval
  • Trade-off: Trails on vanilla MMLU, but leads on the harder MMLU-Pro variant

Long Context Performance (RULER Benchmark)

Nemotron 3 Nano Long Context Performance

*Nemotron 3 Nano RULER benchmark scores across context lengths from 64K to 512K tokens. The model maintains 87.5% accuracy at 64K tokens and 70.56% at 512K tokens, while Qwen3 30B-A3B caps out at 128K with only 60.69% accuracy. These results support Nemotron 3 Nano's 1 million token context window claim.*

The Nemotron 3 Nano context window holds up well at length:

| Context Length | Nemotron 3 Nano | Qwen3 30B-A3B |
| --- | --- | --- |
| 64K tokens | 87.50% | 63.55% |
| 128K tokens | 82.92% | 60.69% |
| 256K tokens | 75.44% | Not Supported |
| 512K tokens | 70.56% | Not Supported |

Qwen3 doesn't support contexts beyond 128K. Nemotron 3 Nano maintains accuracy at 512K tokens with room to scale to the full 1M window.


Training: 10.6 Trillion Tokens

The Nemotron 3 Nano technical report breaks down the training data:

Data Composition:

| Category | Tokens |
| --- | --- |
| Web Data (Nemotron-CC) | ~3.9T |
| Multilingual | ~2.2T |
| Synthetic Data | ~3.5T |
| Code | ~747B |
| Papers & Academic | ~192B |
| Math (including synthetic) | ~73B |
| Total | ~10.6T |

Synthetic Data

About 33% of the training corpus (3.5 trillion tokens) is synthetic, covering:

  • Mathematical reasoning problems
  • Code generation and explanation
  • Multilingual content
  • Tool-calling and agentic instruction following

Training Stages:

  1. Pre-training: Next-token prediction on the full 10.6T token corpus
  2. Instruction fine-tuning: Code, math, and tool-use scenarios
  3. Alignment: Safety and helpfulness optimization

Training used NVIDIA's Megatron-LM framework on H800 GPU clusters.


Nemotron 3 Nano API & Pricing

Nemotron 3 Nano Pricing

View current pricing and providers ->

The Nemotron 3 Nano API is available through multiple providers. Pricing varies.

Available Providers:

| Provider | Availability | Notes |
| --- | --- | --- |
| Baseten | Available | Enterprise-focused deployment |
| DeepInfra | Available | Serverless inference |
| Fireworks AI | Available | Low-latency optimization |
| FriendliAI | Available | Dedicated instances |
| OpenRouter | Available | Multi-model routing |
| Together AI | Available | Serverless and dedicated |
| Amazon Bedrock | Available | Serverless on AWS |
| Google Cloud | Coming Soon | |
| Microsoft Foundry | Coming Soon | |

Self-Hosting:

For running on your own infrastructure:

  • vLLM: High-throughput serving with PagedAttention
  • TensorRT-LLM: NVIDIA-optimized inference
  • SGLang: Efficient multi-turn conversation serving

Nemotron 3 Nano pricing depends on your deployment. Self-hosting on A100 or H100 GPUs gives the lowest per-token cost at scale, while managed APIs are simpler to set up. For full precision (BF16), approximately 60GB of VRAM is recommended; quantized versions can fit in 20-32GB of VRAM. A minimal self-hosting sketch with vLLM follows.
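
The Hugging Face model ID below is illustrative; use the exact repo name from NVIDIA's model card, and keep `max_model_len` well below 1M unless you have the memory to back it:

```python
# Minimal vLLM sketch for self-hosting (model ID is illustrative --
# use the exact Hugging Face repo name from NVIDIA's model card).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-30B-A3B",  # hypothetical ID
    max_model_len=131072,        # raise toward 1M only with sufficient memory
    tensor_parallel_size=1,      # increase for multi-GPU serving
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain Mamba-2 state-space layers in two sentences."], params)
print(outputs[0].outputs[0].text)
```

For an OpenAI-compatible HTTP server instead of offline inference, `vllm serve <model-id>` exposes the same model behind a chat-completions API.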


The NVIDIA Open Model License

The NVIDIA Open Model License is less restrictive than some "open" licenses:

  • Commercial use: Permitted without royalties
  • Modification: You can fine-tune for your use case
  • Distribution: You can redistribute the model and derivatives
  • Attribution: Required to maintain license notices

Real-World Applications

Nemotron 3 Nano is designed for agentic AI applications.

Agentic Workflows

The model fits agent use cases for a few reasons (a tool-calling sketch follows this list):

  • 1M token context for maintaining state across complex tasks
  • Strong tool-calling from specialized fine-tuning
  • Low latency for responsive multi-step execution
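
As a rough illustration of the tool-calling flow, here's a sketch over an OpenAI-compatible endpoint. The model slug, endpoint, and tool schema are all assumptions for the example; the exact format depends on your provider:

```python
# Illustrative tool-calling request (model slug, endpoint, and tool are assumptions).
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",  # hypothetical tool
        "description": "Look up the status of a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # hypothetical slug
    messages=[{"role": "user", "content": "What's the status of ticket 8912?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON strings.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```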

Code Generation

With HumanEval at 78.05% and strong MBPP scores:

  • Code completion and generation
  • Code review and explanation
  • Debugging
  • Repository-wide understanding (using the full context window)

Document Processing

The 1M token context window handles:

  • Complete legal contracts without chunking
  • Full technical documentation sets
  • Financial reports with cross-references
  • Multi-document research synthesis

Multilingual

The model supports 20 languages and scores 74.47% on MMLU Global Lite.


Limitations

  • Base model only: The BF16 variant is a base model. Instruction following typically requires fine-tuning or an instruct variant.
  • Hardware requirements: A 31.6B-parameter model needs significant GPU memory, even with MoE efficiency.
  • New architecture: The Mamba-Transformer hybrid is less tested in production than pure transformers.
  • Benchmark variance: Trails on vanilla MMLU while leading on harder variants.

Evaluate on your specific use cases before production deployment.


Conclusion

Nemotron 3 Nano represents a meaningful step forward in efficient open-weight language models. The hybrid Mamba-Transformer architecture addresses the fundamental tension between long-context capability and inference cost. By combining state-space models for sequence efficiency with transformer attention for precise reasoning, NVIDIA has produced a model that handles 1 million tokens while running 4x faster than its predecessor.

The benchmark results tell a clear story. Nemotron 3 Nano leads on mathematical reasoning (82.88% on MATH vs 61.14% for Qwen3), code generation (78.05% HumanEval), and long-context tasks (87.5% on RULER at 64K tokens). The trade-offs are real but manageable: slightly lower vanilla MMLU scores and the need for ~60GB VRAM at full precision.

For developers building agentic applications, the combination of open weights, commercial licensing, and strong tool-calling performance makes Nemotron 3 Nano worth evaluating. The model is available today through Hugging Face, Amazon Bedrock, and multiple inference providers.

If you're working on long-context applications, code assistants, or multi-step AI agents, test Nemotron 3 Nano against your workloads. The NVIDIA Nemotron 3 research page provides additional technical details and deployment guides.



Frequently Asked Questions

  • The Nemotron 3 Nano context window supports up to 1 million tokens. This is validated by RULER benchmark scores showing 87.5% accuracy at 64K tokens, 82.92% at 128K tokens, and 70.56% at 512K tokens. The 1M context window makes it suitable for processing entire codebases, long legal documents, or extended multi-turn conversations without truncation.
  • Nemotron 3 Nano pricing varies by provider and deployment method. Managed API providers like Baseten, DeepInfra, Fireworks, and Together AI offer per-token pricing. For self-hosting, you'll need approximately 60GB VRAM for BF16 precision (e.g., a single A100 80GB or H100 80GB), or 20-32GB for quantized versions. Self-hosting typically offers the lowest per-token cost at scale.
  • Nemotron 3 Nano benchmarks show it outperforms Qwen3 30B-A3B on math (82.88% vs 61.14% on MATH), code (78.05% vs 70.73% on HumanEval), and long-context tasks. Its hybrid Mamba-Transformer architecture provides a larger context window (1M vs 128K) and faster inference (4x improvement). The NVIDIA Open Model License allows commercial use similar to Meta's Llama license.
  • To run Nemotron 3 Nano at full BF16 precision, you need approximately 60GB of VRAM (NVIDIA A100 80GB or H100 80GB recommended). Quantized versions (INT8, INT4) can run on GPUs with 20-32GB VRAM. The model is optimized for NVIDIA hardware and works with vLLM, TensorRT-LLM, and SGLang for efficient serving.
  • According to NVIDIA's research page, Nemotron 3 Super (~100B parameters) and Nemotron 3 Ultra (~500B parameters) are planned for release in the first half of 2026. Nemotron 3 Super is optimized for collaborative agents and high-volume workloads, while Nemotron 3 Ultra targets state-of-the-art accuracy and reasoning performance.
