Model Release · Technical Deep Dive

Nemotron 3 Nano: Complete Guide to Pricing, Context Window, Benchmarks & API

A comprehensive look at NVIDIA's Nemotron 3 Nano — the hybrid Mamba-Transformer MoE model with 1M context window, 4x faster inference, open weights under NVIDIA Open Model License, and what it means for agentic AI.

Sebastian Crossa
Co-Founder @ LLM Stats
·7 min read

NVIDIA released Nemotron 3 Nano on December 15, 2025. It's a hybrid Mamba-Transformer model with a 1 million token context window and inference speeds up to 4x faster than its predecessor. The architecture is built around efficiency: 31.6 billion total parameters with only 3.6 billion active per token through Mixture-of-Experts (MoE) routing.

The model ships under the NVIDIA Open Model License, which allows commercial use. According to NVIDIA's official announcement, Nemotron 3 Nano is the first in a family of models designed for agentic AI workloads.


Quick Takeaways

  • Nemotron 3 Nano uses a hybrid Mamba2-Transformer architecture with MoE routing, activating only 3.6B of 31.6B parameters per token
  • The 1 million token context window is supported by RULER benchmark scores of 87.5% at 64K and 70.56% at 512K tokens
  • 4x faster inference than Nemotron 2 Nano, optimized for NVIDIA A100 and H100 GPUs
  • Outperforms Qwen3 30B-A3B on math (+21.74 points on MATH), code (HumanEval 78.05%), and reasoning benchmarks
  • Available through 7+ API providers including Baseten, DeepInfra, Fireworks, Together AI, and Amazon Bedrock
  • Open weights under NVIDIA Open Model License allow commercial use, fine-tuning, and redistribution
  • Trained on 10.6 trillion tokens with 33% synthetic data for math, code, and tool-calling

Nemotron 3 Nano Key Specifications

Nemotron 3 Nano Overview

View Nemotron 3 Nano details on LLM Stats ->

  • Release Date: December 15, 2025
  • Total Parameters: 31.6 billion (30B-A3B variant)
  • Active Parameters: ~3.6 billion per token
  • Architecture: Hybrid Mamba2-Transformer Mixture-of-Experts (MoE)
  • Context Window: 1 million tokens
  • Training Data: ~10.6 trillion tokens (with ~3.5 trillion synthetic)
  • Supported Languages: 20 languages including English, Spanish, French, German, Japanese, Chinese, Korean, Arabic
  • License: NVIDIA Open Model License (commercial use permitted)
  • Deployment: Hugging Face, vLLM, TensorRT-LLM, SGLang

The Nemotron 3 Nano API is available through Baseten, DeepInfra, Fireworks, FriendliAI, OpenRouter, and Together AI.
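
Most of these providers expose an OpenAI-compatible endpoint, so a standard chat-completions client is enough to get started. Here's a minimal sketch assuming OpenRouter's endpoint; the model slug is illustrative, so check your provider's catalog for the exact identifier:

```python
# Minimal sketch: Nemotron 3 Nano via an OpenAI-compatible endpoint.
# The base_url and model slug are illustrative -- confirm both with your provider.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # any OpenAI-compatible provider works
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # hypothetical slug
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the trade-offs of Mamba-Transformer hybrids."},
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The same pattern works against Together AI, DeepInfra, or Fireworks by swapping the base URL and model identifier.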


Architecture: The Mamba-Transformer Hybrid

The Nemotron 3 Nano technical report describes a hybrid architecture that uses both state-space models and transformers.

Nemotron 3 Nano Architecture Diagram

*Nemotron 3 Nano's hybrid architecture combines Mamba-2 layers for efficient long-context processing, Transformer attention layers for precise reasoning, and MoE routing that activates only 3.6B of 31.6B parameters per token. This design enables a 1 million token context window with 4x faster inference than previous generations.*

Why Hybrid?

Traditional transformers scale quadratically with sequence length. Every token attends to every other token, which gets expensive at 1 million tokens. State-space models like Mamba handle long sequences efficiently but can struggle with fine-grained reasoning. Research on hybrid architectures shows that combining both approaches can capture the benefits of each.

The hybrid splits the work (a conceptual routing sketch follows this list):

  • Mamba-2 layers handle long-context processing with near-linear scaling
  • Transformer attention layers handle precise reasoning where token relationships matter
  • MoE routing activates only 3.6B of the 31.6B parameters per token
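
To make the routing idea concrete, here's a conceptual top-k MoE sketch in PyTorch. This is not NVIDIA's implementation; the dimensions and expert counts are made up, and it only illustrates why most parameters stay idle for any given token:

```python
# Conceptual top-k MoE routing (illustrative only, not NVIDIA's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=2048, n_experts=64, top_k=2, d_ff=4096):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # only top_k experts run per token
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask][:, k:k + 1] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(8, 2048)
print(moe(tokens).shape)  # torch.Size([8, 2048])
```

Because only the routed experts execute, the total parameter count (31.6B) and the active compute per token (~3.6B) can diverge sharply.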

Inference Speed

The Nemotron 3 Nano latency numbers:

  • 4x faster than Nemotron 2 Nano
  • 3.3x faster than comparable models in its size class
  • Optimized for NVIDIA A100 and H100 GPUs

For agentic workflows, coding assistants, or customer-facing applications, the speed difference is noticeable.


Nemotron 3 Nano Benchmarks: How It Performs

Nemotron 3 Nano Benchmarks

View full benchmark data on Hugging Face ->

Nemotron 3 Nano benchmarks compared to Qwen3 30B-A3B, a similar MoE model:

| Benchmark | Nemotron 3 Nano | Qwen3 30B-A3B |
| --- | --- | --- |
| MMLU (5-shot) | 78.56% | 81.07% |
| MMLU-Pro (5-shot, CoT) | 65.05% | 61.71% |
| AGIEval-En (CoT) | 68.32% | 63.12% |
| HumanEval (0-shot) | 78.05% | 70.73% |
| MBPP-Sanitized (3-shot) | 75.49% | 73.15% |
| GSM8K (8-shot) | 92.34% | 89.01% |
| MATH (4-shot) | 82.88% | 61.14% |
| HellaSwag (10-shot) | 85.56% | 83.14% |
| PIQA (0-shot) | 84.33% | 81.01% |

Observations:

  • Math: The MATH benchmark shows a +21.74 point improvement over Qwen3
  • Code: Leads on HumanEval (78.05% vs 70.73%) and MBPP (75.49% vs 73.15%)
  • Reasoning: Outperforms on MMLU-Pro and AGIEval
  • Trade-off: Trails on vanilla MMLU, but leads on the harder MMLU-Pro variant

Long Context Performance (RULER Benchmark)

Nemotron 3 Nano Long Context Performance

*Nemotron 3 Nano RULER benchmark scores across context lengths from 64K to 512K tokens. The model maintains 87.5% accuracy at 64K tokens and 70.56% at 512K tokens, while Qwen3 30B-A3B caps out at 128K with only 60.69% accuracy. These results support Nemotron 3 Nano's 1 million token context window claim.*

The Nemotron 3 Nano context window holds up well at length:

| Context Length | Nemotron 3 Nano | Qwen3 30B-A3B |
| --- | --- | --- |
| 64K tokens | 87.50% | 63.55% |
| 128K tokens | 82.92% | 60.69% |
| 256K tokens | 75.44% | Not Supported |
| 512K tokens | 70.56% | Not Supported |

Qwen3 doesn't support contexts beyond 128K. Nemotron 3 Nano maintains accuracy at 512K tokens with room to scale to the full 1M window.


Training: 10.6 Trillion Tokens

The Nemotron 3 Nano technical report breaks down the training data:

Data Composition:

| Category | Tokens |
| --- | --- |
| Web Data (Nemotron-CC) | ~3.9T |
| Multilingual | ~2.2T |
| Synthetic Data | ~3.5T |
| Code | ~747B |
| Papers & Academic | ~192B |
| Math (including synthetic) | ~73B |
| Total | ~10.6T |

Synthetic Data

About 33% of the training corpus (3.5 trillion tokens) is synthetic, covering:

  • Mathematical reasoning problems
  • Code generation and explanation
  • Multilingual content
  • Tool-calling and agentic instruction following

Training Stages:

  1. Pre-training: Next-token prediction on the full 10.6T token corpus
  2. Instruction fine-tuning: Code, math, and tool-use scenarios
  3. Alignment: Safety and helpfulness optimization

Training used NVIDIA's Megatron-LM framework on H800 GPU clusters.


Nemotron 3 Nano API & Pricing

Nemotron 3 Nano Pricing

View current pricing and providers ->

The Nemotron 3 Nano API is available through multiple providers. Pricing varies.

Available Providers:

| Provider | Availability | Notes |
| --- | --- | --- |
| Baseten | Available | Enterprise-focused deployment |
| DeepInfra | Available | Serverless inference |
| Fireworks AI | Available | Low-latency optimization |
| FriendliAI | Available | Dedicated instances |
| OpenRouter | Available | Multi-model routing |
| Together AI | Available | Serverless and dedicated |
| Amazon Bedrock | Available | Serverless on AWS |
| Google Cloud | Coming Soon | |
| Microsoft Foundry | Coming Soon | |

Self-Hosting:

For running on your own infrastructure:

  • vLLM: High-throughput serving with PagedAttention
  • TensorRT-LLM: NVIDIA-optimized inference
  • SGLang: Efficient multi-turn conversation serving

Nemotron 3 Nano pricing depends on your deployment. Self-hosting on A100 or H100 GPUs gives the lowest per-token cost at scale, while managed APIs are simpler to set up. For full precision (BF16), approximately 60GB of VRAM is recommended; quantized versions can fit in 20-32GB of VRAM. A minimal self-hosting sketch with vLLM follows.
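
The Hugging Face model ID below is illustrative; use the exact repo name from NVIDIA's model card, and keep `max_model_len` well below 1M unless you have the memory to back it:

```python
# Minimal vLLM sketch for self-hosting (model ID is illustrative --
# use the exact Hugging Face repo name from NVIDIA's model card).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-30B-A3B",  # hypothetical ID
    max_model_len=131072,        # raise toward 1M only with sufficient memory
    tensor_parallel_size=1,      # increase for multi-GPU serving
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain Mamba-2 state-space layers in two sentences."], params)
print(outputs[0].outputs[0].text)
```

For an OpenAI-compatible HTTP server instead of offline inference, `vllm serve <model-id>` exposes the same model behind a chat-completions API.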


The NVIDIA Open Model License

The NVIDIA Open Model License is less restrictive than some "open" licenses:

  • Commercial use: Permitted without royalties
  • Modification: You can fine-tune for your use case
  • Distribution: You can redistribute the model and derivatives
  • Attribution: Required to maintain license notices

Real-World Applications

Nemotron 3 Nano is designed for agentic AI applications.

Agentic Workflows

The model fits agent use cases for a few reasons (a tool-calling sketch follows this list):

  • 1M token context for maintaining state across complex tasks
  • Strong tool-calling from specialized fine-tuning
  • Low latency for responsive multi-step execution
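
As a rough illustration of the tool-calling flow, here's a sketch over an OpenAI-compatible endpoint. The model slug, endpoint, and tool schema are all assumptions for the example; the exact format depends on your provider:

```python
# Illustrative tool-calling request (model slug, endpoint, and tool are assumptions).
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",  # hypothetical tool
        "description": "Look up the status of a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # hypothetical slug
    messages=[{"role": "user", "content": "What's the status of ticket 8912?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON strings.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```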

Code Generation

With HumanEval at 78.05% and strong MBPP scores:

  • Code completion and generation
  • Code review and explanation
  • Debugging
  • Repository-wide understanding (using the full context window)

Document Processing

The 1M token context window handles:

  • Complete legal contracts without chunking
  • Full technical documentation sets
  • Financial reports with cross-references
  • Multi-document research synthesis

Multilingual

The model supports 20 languages and scores 74.47% on MMLU Global Lite.


Limitations

  • Base model only: The BF16 variant is a base model. Instruction following typically requires fine-tuning or an instruct variant.
  • Hardware requirements: A 31.6B-parameter model needs significant GPU memory, even with MoE efficiency.
  • New architecture: The Mamba-Transformer hybrid is less tested in production than pure transformers.
  • Benchmark variance: Trails on vanilla MMLU while leading on harder variants.

Evaluate on your specific use cases before production deployment.


Conclusion

Nemotron 3 Nano represents a meaningful step forward in efficient open-weight language models. The hybrid Mamba-Transformer architecture addresses the fundamental tension between long-context capability and inference cost. By combining state-space models for sequence efficiency with transformer attention for precise reasoning, NVIDIA has produced a model that handles 1 million tokens while running 4x faster than its predecessor.

The benchmark results tell a clear story. Nemotron 3 Nano leads on mathematical reasoning (82.88% on MATH vs 61.14% for Qwen3), code generation (78.05% HumanEval), and long-context tasks (87.5% on RULER at 64K tokens). The trade-offs are real but manageable: slightly lower vanilla MMLU scores and the need for ~60GB VRAM at full precision.

For developers building agentic applications, the combination of open weights, commercial licensing, and strong tool-calling performance makes Nemotron 3 Nano worth evaluating. The model is available today through Hugging Face, Amazon Bedrock, and multiple inference providers.

If you're working on long-context applications, code assistants, or multi-step AI agents, test Nemotron 3 Nano against your workloads. The NVIDIA Nemotron 3 research page provides additional technical details and deployment guides.



Frequently Asked Questions

  • The Nemotron 3 Nano context window supports up to 1 million tokens. This is validated by RULER benchmark scores showing 87.5% accuracy at 64K tokens, 82.92% at 128K tokens, and 70.56% at 512K tokens. The 1M context window makes it suitable for processing entire codebases, long legal documents, or extended multi-turn conversations without truncation.
  • Nemotron 3 Nano pricing varies by provider and deployment method. Managed API providers like Baseten, DeepInfra, Fireworks, and Together AI offer per-token pricing. For self-hosting, you'll need approximately 60GB VRAM for BF16 precision (e.g., a single A100 80GB or H100 80GB), or 20-32GB for quantized versions. Self-hosting typically offers the lowest per-token cost at scale.
  • Nemotron 3 Nano benchmarks show it outperforms Qwen3 30B-A3B on math (82.88% vs 61.14% on MATH), code (78.05% vs 70.73% on HumanEval), and long-context tasks. Its hybrid Mamba-Transformer architecture provides a larger context window (1M vs 128K) and faster inference (4x improvement). The NVIDIA Open Model License allows commercial use similar to Meta's Llama license.
  • To run Nemotron 3 Nano at full BF16 precision, you need approximately 60GB of VRAM (NVIDIA A100 80GB or H100 80GB recommended). Quantized versions (INT8, INT4) can run on GPUs with 20-32GB VRAM. The model is optimized for NVIDIA hardware and works with vLLM, TensorRT-LLM, and SGLang for efficient serving.
  • According to NVIDIA's research page, Nemotron 3 Super (~100B parameters) and Nemotron 3 Ultra (~500B parameters) are planned for release in the first half of 2026. Nemotron 3 Super is optimized for collaborative agents and high-volume workloads, while Nemotron 3 Ultra targets state-of-the-art accuracy and reasoning performance.
