Nemotron 3 Nano: Complete Guide to Pricing, Context Window, Benchmarks & API
December 16, 2025

A comprehensive look at NVIDIA's Nemotron 3 Nano — the hybrid Mamba-Transformer MoE model with 1M context window, 4x faster inference, open weights under NVIDIA Open Model License, and what it means for agentic AI.

Model Release · Technical Deep Dive
Sebastian Crossa
Co-Founder @ LLM Stats

NVIDIA released Nemotron 3 Nano on December 15, 2025. It's a hybrid Mamba-Transformer model with a 1 million token context window and inference speeds up to 4x faster than its predecessor. The architecture is built around efficiency: 31.6 billion total parameters with only 3.6 billion active per token through Mixture-of-Experts (MoE) routing.

The model ships under the NVIDIA Open Model License, which allows commercial use. According to NVIDIA's official announcement, Nemotron 3 Nano is the first in a family of models designed for agentic AI workloads.


Quick Takeaways

  • Nemotron 3 Nano uses a hybrid Mamba2-Transformer architecture with MoE routing, activating only 3.6B of 31.6B parameters per token
  • The 1 million token context window is validated by RULER benchmark scores of 87.5% at 64K and 70.56% at 512K tokens
  • 4x faster inference than Nemotron 2 Nano, optimized for NVIDIA A100 and H100 GPUs
  • Outperforms Qwen3 30B-A3B on math (+21.74 points on MATH), code (HumanEval 78.05%), and reasoning benchmarks
  • Available through 7+ API providers including Baseten, DeepInfra, Fireworks, Together AI, and Amazon Bedrock
  • Open weights under NVIDIA Open Model License allow commercial use, fine-tuning, and redistribution
  • Trained on 10.6 trillion tokens with 33% synthetic data for math, code, and tool-calling

Nemotron 3 Nano Key Specifications

Nemotron 3 Nano Overview

View Nemotron 3 Nano details on LLM Stats ->

  • Release Date: December 15, 2025
  • Total Parameters: 31.6 billion (30B-A3B variant)
  • Active Parameters: ~3.6 billion per token
  • Architecture: Hybrid Mamba2-Transformer Mixture-of-Experts (MoE)
  • Context Window: 1 million tokens
  • Training Data: ~10.6 trillion tokens (with ~3.5 trillion synthetic)
  • Supported Languages: 20 languages including English, Spanish, French, German, Japanese, Chinese, Korean, Arabic
  • License: NVIDIA Open Model License (commercial use permitted)
  • Deployment: Hugging Face, vLLM, TensorRT-LLM, SGLang

The Nemotron 3 Nano API is available through Baseten, DeepInfra, Fireworks, FriendliAI, OpenRouter, Together AI, and Amazon Bedrock.
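
Most of these providers expose an OpenAI-compatible endpoint, so a chat request looks roughly like the sketch below. The base URL shown is OpenRouter's; the model ID is an assumption, so check your provider's catalog for the exact string.

```python
# Minimal sketch of calling Nemotron 3 Nano via an OpenAI-compatible API.
# The model ID below is an assumption -- verify it in your provider's catalog.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter; swap for your provider
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # assumed ID; check the provider's model list
    messages=[{"role": "user", "content": "Summarize the Mamba-2 architecture in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```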


Architecture: The Mamba-Transformer Hybrid

The Nemotron 3 Nano technical report describes a hybrid architecture that uses both state-space models and transformers.

Nemotron 3 Nano Architecture Diagram

*Nemotron 3 Nano's hybrid architecture combines Mamba-2 layers for efficient long-context processing, Transformer attention layers for precise reasoning, and MoE routing that activates only 3.6B of 31.6B parameters per token. This design enables a 1 million token context window with 4x faster inference than previous generations.*

Why Hybrid?

Traditional transformers scale quadratically with sequence length. Every token attends to every other token, which gets expensive at 1 million tokens. State-space models like Mamba handle long sequences efficiently but can struggle with fine-grained reasoning. Research on hybrid architectures shows that combining both approaches can capture the benefits of each.

The hybrid splits the work (an illustrative sketch follows the list):

  • Mamba-2 layers handle long-context processing with near-linear scaling
  • Transformer attention layers handle precise reasoning where token relationships matter
  • MoE routing activates only 3.6B of the 31.6B parameters per token
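
To make the division of labor concrete, here is an illustrative PyTorch sketch, not NVIDIA's implementation: a stack that is mostly cheap sequence mixers (a gated depthwise convolution stands in for a real Mamba-2 layer), interleaves attention every few layers, and routes each token through only `top_k` of the MoE experts.

```python
# Illustrative only -- not NVIDIA's actual layer layout or Mamba-2 code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvMixer(nn.Module):
    """Cheap near-linear stand-in for a Mamba-2 layer (gated depthwise conv)."""
    def __init__(self, d_model: int, kernel: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, seq, d_model)
        h = self.conv(x.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
        return h * torch.sigmoid(self.gate(x))

class SelfAttention(nn.Module):
    """Quadratic-cost attention, used sparingly for precise token interactions."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class MoEFFN(nn.Module):
    """Top-k routing: each token is processed by only top_k of n_experts MLPs."""
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):       # naive dispatch: clarity over speed
            for k in range(self.top_k):
                mask = (idx[..., k] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

def build_hybrid_stack(d_model=256, n_layers=8, attn_every=4):
    """Mostly SSM-style mixers, attention every few layers, MoE FFN after each."""
    layers = []
    for i in range(n_layers):
        mixer = SelfAttention(d_model) if (i + 1) % attn_every == 0 else GatedConvMixer(d_model)
        layers += [mixer, MoEFFN(d_model)]
    return nn.Sequential(*layers)

x = torch.randn(2, 16, 256)                 # (batch, seq, d_model)
print(build_hybrid_stack()(x).shape)        # torch.Size([2, 16, 256])
```

Real implementations fuse the routing and dispatch for speed and add residual connections and normalization; the point here is only the shape of the design: cheap mixers most of the time, attention occasionally, and a small active parameter count per token.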

Inference Speed

NVIDIA's reported inference-speed numbers for Nemotron 3 Nano:

  • 4x faster than Nemotron 2 Nano
  • 3.3x faster than comparable models in its size class
  • Optimized for NVIDIA A100 and H100 GPUs

For agentic workflows, coding assistants, or customer-facing applications, the speed difference is noticeable.


Nemotron 3 Nano Benchmarks: How It Performs

Nemotron 3 Nano Benchmarks

View full benchmark data on Hugging Face ->

Nemotron 3 Nano benchmarks compared to Qwen3 30B-A3B, a similar MoE model:

| Benchmark | Nemotron 3 Nano | Qwen3 30B-A3B |
|---|---|---|
| MMLU (5-shot) | 78.56% | 81.07% |
| MMLU-Pro (5-shot, CoT) | 65.05% | 61.71% |
| AGIEval-En (CoT) | 68.32% | 63.12% |
| HumanEval (0-shot) | 78.05% | 70.73% |
| MBPP-Sanitized (3-shot) | 75.49% | 73.15% |
| GSM8K (8-shot) | 92.34% | 89.01% |
| MATH (4-shot) | 82.88% | 61.14% |
| HellaSwag (10-shot) | 85.56% | 83.14% |
| PIQA (0-shot) | 84.33% | 81.01% |

Observations:

  • Math: The MATH benchmark shows a +21.74 point improvement over Qwen3
  • Code: Leads on HumanEval (78.05% vs 70.73%) and MBPP (75.49% vs 73.15%)
  • Reasoning: Outperforms on MMLU-Pro and AGIEval
  • Trade-off: Trails on vanilla MMLU, but leads on the harder MMLU-Pro variant

Long Context Performance (RULER Benchmark)

Nemotron 3 Nano Long Context Performance

*Nemotron 3 Nano RULER benchmark scores across context lengths from 64K to 512K tokens. The model maintains 87.5% accuracy at 64K tokens and 70.56% at 512K tokens, while Qwen3 30B-A3B caps out at 128K with only 60.69% accuracy. This validates Nemotron 3 Nano's 1 million token context window claims.*

The Nemotron 3 Nano context window holds up well at length:

| Context Length | Nemotron 3 Nano | Qwen3 30B-A3B |
|---|---|---|
| 64K tokens | 87.50% | 63.55% |
| 128K tokens | 82.92% | 60.69% |
| 256K tokens | 75.44% | Not supported |
| 512K tokens | 70.56% | Not supported |

Qwen3 doesn't support contexts beyond 128K. Nemotron 3 Nano maintains accuracy at 512K tokens with room to scale to the full 1M window.


Training: 10.6 Trillion Tokens

The Nemotron 3 Nano technical report breaks down the training data:

Data Composition:

| Category | Tokens |
|---|---|
| Web Data (Nemotron-CC) | ~3.9T |
| Multilingual | ~2.2T |
| Synthetic Data | ~3.5T |
| Code | ~747B |
| Papers & Academic | ~192B |
| Math (including synthetic) | ~73B |
| Total | ~10.6T |

Synthetic Data

About 33% of the training corpus (3.5 trillion tokens) is synthetic, covering:

  • Mathematical reasoning problems
  • Code generation and explanation
  • Multilingual content
  • Tool-calling and agentic instruction following

Training Stages:

  1. Pre-training: Next-token prediction on the full 10.6T token corpus
  2. Instruction fine-tuning: Code, math, and tool-use scenarios
  3. Alignment: Safety and helpfulness optimization

Training used NVIDIA's Megatron-LM framework on H800 GPU clusters.


Nemotron 3 Nano API & Pricing

Nemotron 3 Nano Pricing

View current pricing and providers ->

The Nemotron 3 Nano API is available through multiple providers. Pricing varies.

Available Providers:

| Provider | Availability | Notes |
|---|---|---|
| Baseten | Available | Enterprise-focused deployment |
| DeepInfra | Available | Serverless inference |
| Fireworks AI | Available | Low-latency optimization |
| FriendliAI | Available | Dedicated instances |
| OpenRouter | Available | Multi-model routing |
| Together AI | Available | Serverless and dedicated |
| Amazon Bedrock | Available | Serverless on AWS |
| Google Cloud | Coming Soon | |
| Microsoft Foundry | Coming Soon | |

Self-Hosting:

For running on your own infrastructure (a minimal serving sketch follows the list):

  • vLLM: High-throughput serving with PagedAttention
  • TensorRT-LLM: NVIDIA-optimized inference
  • SGLang: Efficient multi-turn conversation serving
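
A minimal vLLM sketch is below. The Hugging Face repository ID is an assumption; confirm the exact name and any flags the hybrid architecture requires on the model card.

```python
# Minimal self-hosting sketch with vLLM. The model ID is an assumption --
# confirm the exact Hugging Face repository name on the model card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-30B-A3B",   # assumed ID; verify on Hugging Face
    tensor_parallel_size=1,                   # one A100/H100 80GB for BF16
    max_model_len=131072,                     # raise toward 1M as memory allows
)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain MoE routing in one paragraph."], params)
print(outputs[0].outputs[0].text)
```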

The Nemotron 3 Nano price depends on your deployment. Self-hosting on A100 or H100 GPUs gives the lowest per-token cost at scale, while managed APIs are simpler to set up. For full precision (BF16), approximately 60GB of VRAM is recommended; quantized versions can fit in 20-32GB VRAM.
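
The ~60GB figure is easy to sanity-check: weight memory is roughly parameter count times bytes per parameter. A back-of-envelope sketch (weights only; KV cache, SSM state, and activations add real overhead on top):

```python
# Rough weight-memory estimate for 31.6B parameters at common precisions.
# Excludes KV cache, SSM state, and activations, which add overhead.
PARAMS = 31.6e9

for precision, bytes_per_param in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{precision}: ~{gb:.0f} GB")  # BF16 ~59 GB, INT8 ~29 GB, INT4 ~15 GB
```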


The NVIDIA Open Model License

The NVIDIA Open Model License is less restrictive than some "open" licenses:

  • Commercial use: Permitted without royalties
  • Modification: You can fine-tune for your use case
  • Distribution: You can redistribute the model and derivatives
  • Attribution: Required to maintain license notices

Real-World Applications

Nemotron 3 Nano is designed for agentic AI applications.

Agentic Workflows

The model fits agent use cases for a few reasons (a tool-calling sketch follows the list):

  • 1M token context for maintaining state across complex tasks
  • Strong tool-calling from specialized fine-tuning
  • Low latency for responsive multi-step execution
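
As a sketch of what tool-calling looks like in practice, here is an OpenAI-style request with a hypothetical `get_ticket_status` tool. The model ID and tool schema are illustrative assumptions, not from NVIDIA's docs.

```python
# Hedged tool-calling sketch using an OpenAI-compatible chat API.
# Both the model ID and the tool are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",          # hypothetical tool
        "description": "Look up a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",           # assumed ID; verify with your provider
    messages=[{"role": "user", "content": "What's the status of ticket TK-1042?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured call for the agent to execute
```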

Code Generation

With HumanEval at 78.05% and strong MBPP scores:

  • Code completion and generation
  • Code review and explanation
  • Debugging
  • Repository-wide understanding (using the full context window)

Document Processing

The 1M token context window handles:

  • Complete legal contracts without chunking
  • Full technical documentation sets
  • Financial reports with cross-references
  • Multi-document research synthesis

Multilingual

The model supports 20 languages and scores 74.47% on MMLU Global Lite.


Limitations

  • Base model only: The BF16 variant is a base model. Instruction following typically requires fine-tuning or an instruct variant.
  • Hardware requirements: A 31.6B-parameter model needs significant GPU memory, even with MoE efficiency.
  • New architecture: The Mamba-Transformer hybrid is less tested in production than pure transformers.
  • Benchmark variance: Trails on vanilla MMLU while leading on harder variants.

Evaluate on your specific use cases before production deployment.


Conclusion

Nemotron 3 Nano represents a meaningful step forward in efficient open-weight language models. The hybrid Mamba-Transformer architecture addresses the fundamental tension between long-context capability and inference cost. By combining state-space models for sequence efficiency with transformer attention for precise reasoning, NVIDIA has produced a model that handles 1 million tokens while running 4x faster than its predecessor.

The benchmark results tell a clear story. Nemotron 3 Nano leads on mathematical reasoning (82.88% on MATH vs 61.14% for Qwen3), code generation (78.05% HumanEval), and long-context tasks (87.5% on RULER at 64K tokens). The trade-offs are real but manageable: slightly lower vanilla MMLU scores and the need for ~60GB VRAM at full precision.

For developers building agentic applications, the combination of open weights, commercial licensing, and strong tool-calling performance makes Nemotron 3 Nano worth evaluating. The model is available today through Hugging Face, Amazon Bedrock, and multiple inference providers.

If you're working on long-context applications, code assistants, or multi-step AI agents, test Nemotron 3 Nano against your workloads. The NVIDIA Nemotron 3 research page provides additional technical details and deployment guides.


Frequently Asked Questions

What is the Nemotron 3 Nano context window size?

The Nemotron 3 Nano context window supports up to 1 million tokens. This is validated by RULER benchmark scores showing 87.5% accuracy at 64K tokens, 82.92% at 128K tokens, and 70.56% at 512K tokens. The 1M context window makes it suitable for processing entire codebases, long legal documents, or extended multi-turn conversations without truncation.

How much does Nemotron 3 Nano cost to run?

Nemotron 3 Nano pricing varies by provider and deployment method. Managed API providers like Baseten, DeepInfra, Fireworks, and Together AI offer per-token pricing. For self-hosting, you'll need approximately 60GB VRAM for BF16 precision (e.g., a single A100 80GB or H100 80GB), or 20-32GB for quantized versions. Self-hosting typically offers the lowest per-token cost at scale.

How does Nemotron 3 Nano compare to other open models like Llama and Qwen?

Nemotron 3 Nano benchmarks show it outperforms Qwen3 30B-A3B on math (82.88% vs 61.14% on MATH), code (78.05% vs 70.73% on HumanEval), and long-context tasks. Its hybrid Mamba-Transformer architecture provides a larger context window (1M vs 128K) and faster inference (4x improvement). The NVIDIA Open Model License allows commercial use similar to Meta's Llama license.

What hardware do I need to run Nemotron 3 Nano locally?

To run Nemotron 3 Nano at full BF16 precision, you need approximately 60GB of VRAM (NVIDIA A100 80GB or H100 80GB recommended). Quantized versions (INT8, INT4) can run on GPUs with 20-32GB VRAM. The model is optimized for NVIDIA hardware and works with vLLM, TensorRT-LLM, and SGLang for efficient serving.

When will Nemotron 3 Super and Ultra be released?

According to NVIDIA's research page, Nemotron 3 Super (~100B parameters) and Nemotron 3 Ultra (~500B parameters) are planned for release in the first half of 2026. Nemotron 3 Super is optimized for collaborative agents and high-volume workloads, while Nemotron 3 Ultra targets state-of-the-art accuracy and reasoning performance.