
Nemotron 3 Nano: Complete Guide to Pricing, Context Window, Benchmarks & API
A comprehensive look at NVIDIA's Nemotron 3 Nano — the hybrid Mamba-Transformer MoE model with 1M context window, 4x faster inference, open weights under NVIDIA Open Model License, and what it means for agentic AI.

NVIDIA released Nemotron 3 Nano on December 15, 2025. It's a hybrid Mamba-Transformer model with a 1 million token context window and inference speeds up to 4x faster than its predecessor. The architecture is built around efficiency: 31.6 billion total parameters with only 3.6 billion active per token through Mixture-of-Experts (MoE) routing.
The model ships under the NVIDIA Open Model License, which allows commercial use. According to NVIDIA's official announcement, Nemotron 3 Nano is the first in a family of models designed for agentic AI workloads.
Quick Takeaways
- Nemotron 3 Nano uses a hybrid Mamba2-Transformer architecture with MoE routing, activating only 3.6B of 31.6B parameters per token
- The 1 million token context window is validated by RULER benchmark scores of 87.5% at 64K and 70.56% at 512K tokens
- 4x faster inference than Nemotron 2 Nano, optimized for NVIDIA A100 and H100 GPUs
- Outperforms Qwen3 30B-A3B on math (+21.74 points on MATH), code (HumanEval 78.05%), and reasoning benchmarks
- Available through 7+ API providers including Baseten, DeepInfra, Fireworks, Together AI, and Amazon Bedrock
- Open weights under NVIDIA Open Model License allow commercial use, fine-tuning, and redistribution
- Trained on 10.6 trillion tokens with 33% synthetic data for math, code, and tool-calling
Nemotron 3 Nano Key Specifications
View Nemotron 3 Nano details on LLM Stats ->
- Release Date: December 15, 2025
- Total Parameters: 31.6 billion (30B-A3B variant)
- Active Parameters: ~3.6 billion per token
- Architecture: Hybrid Mamba2-Transformer Mixture-of-Experts (MoE)
- Context Window: 1 million tokens
- Training Data: ~10.6 trillion tokens (with ~3.5 trillion synthetic)
- Supported Languages: 20 languages including English, Spanish, French, German, Japanese, Chinese, Korean, Arabic
- License: NVIDIA Open Model License (commercial use permitted)
- Deployment: Hugging Face, vLLM, TensorRT-LLM, SGLang
The Nemotron 3 Nano API is available through Baseten, DeepInfra, Fireworks AI, FriendliAI, OpenRouter, Together AI, and Amazon Bedrock.
Architecture: The Mamba-Transformer Hybrid
The Nemotron 3 Nano technical report describes a hybrid architecture that uses both state-space models and transformers.
*Nemotron 3 Nano's hybrid architecture combines Mamba-2 layers for efficient long-context processing, Transformer attention layers for precise reasoning, and MoE routing that activates only 3.6B of 31.6B parameters per token. This design enables a 1 million token context window with 4x faster inference than previous generations.*
Why Hybrid?
Traditional transformers scale quadratically with sequence length. Every token attends to every other token, which gets expensive at 1 million tokens. State-space models like Mamba handle long sequences efficiently but can struggle with fine-grained reasoning. Research on hybrid architectures shows that combining both approaches can capture the benefits of each.
The hybrid splits the work (a simplified sketch follows this list):
- Mamba-2 layers handle long-context processing with near-linear scaling
- Transformer attention layers handle precise reasoning where token relationships matter
- MoE routing activates only 3.6B of the 31.6B parameters per token
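For intuition, here is a deliberately tiny sketch of how such a hybrid stack can be wired together. It is not NVIDIA's implementation: the layer ratio, dimensions, expert count, and the stand-in recurrent layer are all illustrative placeholders, and the real model follows the layout in the technical report.

```python
# Illustrative only: a toy hybrid block that interleaves a stand-in
# state-space layer with standard attention, plus top-1 MoE routing.
import torch
import torch.nn as nn


class ToySSMLayer(nn.Module):
    """Stand-in for a Mamba-2 block: cost grows linearly with sequence length."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.GRU(dim, dim, batch_first=True)  # recurrent proxy, not real Mamba

    def forward(self, x):
        out, _ = self.mix(x)
        return x + out


class MoEFeedForward(nn.Module):
    """Routed experts: only a fraction of the parameters is active per token."""
    def __init__(self, dim, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x):
        choice = self.router(x).argmax(dim=-1)         # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (choice == i).unsqueeze(-1)
            out = out + mask * expert(x)               # only the chosen expert contributes
        return x + out


class HybridBlock(nn.Module):
    """Several linear-cost SSM layers per quadratic-cost attention layer, then MoE."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.ssm = nn.ModuleList([ToySSMLayer(dim) for _ in range(3)])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.moe = MoEFeedForward(dim)

    def forward(self, x):
        for layer in self.ssm:
            x = layer(x)                               # cheap long-range mixing
        attn_out, _ = self.attn(x, x, x)               # precise token-to-token attention
        return self.moe(x + attn_out)


x = torch.randn(1, 128, 64)                            # (batch, seq_len, hidden_dim)
print(HybridBlock(64)(x).shape)                        # torch.Size([1, 128, 64])
```

The point of the sketch is the division of labor: most layers scale roughly linearly with sequence length, attention is applied sparingly, and the router ensures only a small slice of the parameters runs for any given token.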
Inference Speed
NVIDIA reports the following inference-speed figures for Nemotron 3 Nano:
- 4x faster than Nemotron 2 Nano
- 3.3x faster than comparable models in its size class
- Optimized for NVIDIA A100 and H100 GPUs
For agentic workflows, coding assistants, or customer-facing applications, the speed difference is noticeable.
Nemotron 3 Nano Benchmarks: How It Performs
View full benchmark data on Hugging Face ->
Nemotron 3 Nano benchmarks compared to Qwen3 30B-A3B, a similar MoE model:
| Benchmark | Nemotron 3 Nano | Qwen3 30B-A3B |
|---|---|---|
| MMLU (5-shot) | 78.56% | 81.07% |
| MMLU-Pro (5-shot, CoT) | 65.05% | 61.71% |
| AGIEval-En (CoT) | 68.32% | 63.12% |
| HumanEval (0-shot) | 78.05% | 70.73% |
| MBPP-Sanitized (3-shot) | 75.49% | 73.15% |
| GSM8K (8-shot) | 92.34% | 89.01% |
| MATH (4-shot) | 82.88% | 61.14% |
| HellaSwag (10-shot) | 85.56% | 83.14% |
| PIQA (0-shot) | 84.33% | 81.01% |
Observations:
- Math: The MATH benchmark shows a +21.74 point improvement over Qwen3
- Code: Leads on HumanEval (78.05% vs 70.73%) and MBPP (75.49% vs 73.15%)
- Reasoning: Outperforms on MMLU-Pro and AGIEval
- Trade-off: Trails on vanilla MMLU, but leads on the harder MMLU-Pro variant
Long Context Performance (RULER Benchmark)
*Nemotron 3 Nano RULER benchmark scores across context lengths from 64K to 512K tokens. The model maintains 87.5% accuracy at 64K tokens and 70.56% at 512K tokens, while Qwen3 30B-A3B caps out at 128K with only 60.69% accuracy. This validates Nemotron 3 Nano's 1 million token context window claims.*
The Nemotron 3 Nano context window holds up well at length:
| Context Length | Nemotron 3 Nano | Qwen3 30B-A3B |
|---|---|---|
| 64K tokens | 87.50% | 63.55% |
| 128K tokens | 82.92% | 60.69% |
| 256K tokens | 75.44% | Not Supported |
| 512K tokens | 70.56% | Not Supported |
Qwen3 doesn't support contexts beyond 128K, while Nemotron 3 Nano still scores 70.56% at 512K tokens and is designed to scale to the full 1M window.
Training: 10.6 Trillion Tokens
The Nemotron 3 Nano technical report breaks down the training data as follows:
Data Composition:
| Category | Tokens |
|---|---|
| Web Data (Nemotron-CC) | ~3.9T |
| Multilingual | ~2.2T |
| Synthetic Data | ~3.5T |
| Code | ~747B |
| Papers & Academic | ~192B |
| Math (including synthetic) | ~73B |
| Total | ~10.6T |
Synthetic Data
About 33% of the training corpus (3.5 trillion tokens) is synthetic, covering:
- Mathematical reasoning problems
- Code generation and explanation
- Multilingual content
- Tool-calling and agentic instruction following
Training Stages:
- Pre-training: Next-token prediction on the full 10.6T token corpus
- Instruction fine-tuning: Code, math, and tool-use scenarios
- Alignment: Safety and helpfulness optimization
Training used NVIDIA's Megatron-LM framework on H800 GPU clusters.
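As a reminder of what the pre-training stage actually optimizes, the snippet below shows the standard next-token prediction loss with a toy embedding and output head. This is the generic language-modeling objective, not Megatron-LM or any Nemotron-specific code.

```python
# Generic next-token prediction objective (standard language modeling);
# the embedding and head stand in for the full model body.
import torch
import torch.nn.functional as F

vocab_size, seq_len, dim = 1000, 16, 32
tokens = torch.randint(0, vocab_size, (1, seq_len))    # a batch of token IDs

embed = torch.nn.Embedding(vocab_size, dim)
lm_head = torch.nn.Linear(dim, vocab_size)

logits = lm_head(embed(tokens))                        # (batch, seq, vocab)

# Shift by one so position t predicts token t+1, then average cross-entropy.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```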
Nemotron 3 Nano API & Pricing
View current pricing and providers ->
The Nemotron 3 Nano API is available through multiple providers. Pricing varies.
Available Providers:
| Provider | Availability | Notes |
|---|---|---|
| Baseten | Available | Enterprise-focused deployment |
| DeepInfra | Available | Serverless inference |
| Fireworks AI | Available | Low-latency optimization |
| FriendliAI | Available | Dedicated instances |
| OpenRouter | Available | Multi-model routing |
| Together AI | Available | Serverless and dedicated |
| Amazon Bedrock | Available | Serverless on AWS |
| Google Cloud | Coming Soon | |
| Microsoft Foundry | Coming Soon | |
Self-Hosting:
For running on your own infrastructure:
- vLLM: High-throughput serving with PagedAttention
- TensorRT-LLM: NVIDIA-optimized inference
- SGLang: Efficient multi-turn conversation serving
Nemotron 3 Nano pricing depends on your deployment. Self-hosting on A100 or H100 GPUs gives the lowest per-token cost at scale, while managed APIs are simpler to set up. For full precision (BF16), approximately 60GB of VRAM is recommended; quantized versions can fit in 20-32GB of VRAM.
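As a concrete starting point for self-hosting, the sketch below runs offline inference with vLLM. The Hugging Face model ID is a placeholder (check the official model card for the exact repository name), and the context length is capped well below 1M to keep GPU memory requirements modest.

```python
# Minimal vLLM self-hosting sketch. The model ID is a placeholder;
# use the exact Hugging Face repository name from the official model card.
from vllm import LLM, SamplingParams

MODEL_ID = "nvidia/Nemotron-3-Nano-30B-A3B"   # hypothetical identifier

llm = LLM(
    model=MODEL_ID,
    trust_remote_code=True,
    max_model_len=131072,        # cap the context to fit available VRAM
)
params = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(
    ["Explain the trade-off between attention layers and state-space layers."],
    params,
)
print(outputs[0].outputs[0].text)
```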
The NVIDIA Open Model License
The NVIDIA Open Model License is less restrictive than some "open" licenses:
- Commercial use: Permitted without royalties
- Modification: You can fine-tune for your use case
- Distribution: You can redistribute the model and derivatives
- Attribution: License notices must be retained
Real-World Applications
Nemotron 3 Nano is designed for agentic AI applications.
Agentic Workflows
The model fits agent use cases, as sketched below, because of:
- 1M token context for maintaining state across complex tasks
- Strong tool-calling from specialized fine-tuning
- Low latency for responsive multi-step execution
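Most of the providers listed earlier expose an OpenAI-compatible endpoint, so a tool-calling request looks roughly like the sketch below. The base URL, model ID, and `get_weather` tool are placeholders for illustration, not a documented Nemotron API.

```python
# Sketch of a tool-calling request against an OpenAI-compatible endpoint.
# Base URL, API key, model ID, and the tool schema are all placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",   # placeholder model ID
    messages=[{"role": "user", "content": "Do I need an umbrella in Oslo today?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON here.
print(response.choices[0].message.tool_calls)
```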
Code Generation
With HumanEval at 78.05% and strong MBPP scores, it suits tasks like these (a repository-scale example follows the list):
- Code completion and generation
- Code review and explanation
- Debugging
- Repository-wide understanding (using the full context window)
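One way to use the large context for repository-wide understanding is to concatenate the source files into a single prompt, as in the sketch below against an OpenAI-compatible endpoint. The base URL and model ID are again placeholders, and a real workflow should respect the provider's context and rate limits.

```python
# Sketch: send a whole (small) repository in one prompt for review.
# Endpoint and model ID are placeholders; mind the provider's context limit.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

repo = Path("./my_project")
sources = [
    f"# file: {path}\n{path.read_text(encoding='utf-8')}"
    for path in sorted(repo.rglob("*.py"))
]
prompt = "Review this codebase and list likely bugs:\n\n" + "\n\n".join(sources)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",   # placeholder model ID
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```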
Document Processing
The 1M token context window handles the following (a token-count check follows the list):
- Complete legal contracts without chunking
- Full technical documentation sets
- Financial reports with cross-references
- Multi-document research synthesis
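Before sending a very long document, it is worth verifying that it actually fits in the window. A quick check with the Hugging Face tokenizer might look like this; the repository ID is again a placeholder.

```python
# Check whether a long document fits the context window before sending it.
# The tokenizer ID is a placeholder for the model's Hugging Face repository.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/Nemotron-3-Nano-30B-A3B",   # hypothetical identifier
    trust_remote_code=True,
)

with open("contract.txt", encoding="utf-8") as f:
    document = f.read()

n_tokens = len(tokenizer.encode(document))
print(f"{n_tokens} tokens; fits in the 1M window: {n_tokens <= 1_000_000}")
```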
Multilingual
The model supports 20 languages and scores 74.47% on MMLU Global Lite.
Limitations
- Base model only: The BF16 release is a base model; instruction following typically requires fine-tuning or an instruct variant.
- Hardware requirements: 31.6B parameters needs significant GPU memory, even with MoE efficiency.
- New architecture: The Mamba-Transformer hybrid is less tested in production than pure transformers.
- Benchmark variance: Trails on vanilla MMLU while leading on harder variants.
Evaluate on your specific use cases before production deployment.
Conclusion
Nemotron 3 Nano represents a meaningful step forward in efficient open-weight language models. The hybrid Mamba-Transformer architecture addresses the fundamental tension between long-context capability and inference cost. By combining state-space models for sequence efficiency with transformer attention for precise reasoning, NVIDIA has produced a model that handles 1 million tokens while running 4x faster than its predecessor.
The benchmark results tell a clear story. Nemotron 3 Nano leads on mathematical reasoning (82.88% on MATH vs 61.14% for Qwen3), code generation (78.05% HumanEval), and long-context tasks (87.5% on RULER at 64K tokens). The trade-offs are real but manageable: slightly lower vanilla MMLU scores and the need for ~60GB VRAM at full precision.
For developers building agentic applications, the combination of open weights, commercial licensing, and strong tool-calling performance makes Nemotron 3 Nano worth evaluating. The model is available today through Hugging Face, Amazon Bedrock, and multiple inference providers.
If you're working on long-context applications, code assistants, or multi-step AI agents, test Nemotron 3 Nano against your workloads. The NVIDIA Nemotron 3 research page provides additional technical details and deployment guides.
Frequently Asked Questions
What is the Nemotron 3 Nano context window size?
The Nemotron 3 Nano context window supports up to 1 million tokens. This is validated by RULER benchmark scores showing 87.5% accuracy at 64K tokens, 82.92% at 128K tokens, and 70.56% at 512K tokens. The 1M context window makes it suitable for processing entire codebases, long legal documents, or extended multi-turn conversations without truncation.
How much does Nemotron 3 Nano cost to run?
Nemotron 3 Nano pricing varies by provider and deployment method. Managed API providers like Baseten, DeepInfra, Fireworks, and Together AI offer per-token pricing. For self-hosting, you'll need approximately 60GB VRAM for BF16 precision (e.g., a single A100 80GB or H100 80GB), or 20-32GB for quantized versions. Self-hosting typically offers the lowest per-token cost at scale.
How does Nemotron 3 Nano compare to other open models like Llama and Qwen?
Nemotron 3 Nano benchmarks show it outperforms Qwen3 30B-A3B on math (82.88% vs 61.14% on MATH), code (78.05% vs 70.73% on HumanEval), and long-context tasks. Its hybrid Mamba-Transformer architecture provides a larger context window (1M vs 128K) and faster inference (4x improvement). The NVIDIA Open Model License allows commercial use similar to Meta's Llama license.
What hardware do I need to run Nemotron 3 Nano locally?
To run Nemotron 3 Nano at full BF16 precision, you need approximately 60GB of VRAM (NVIDIA A100 80GB or H100 80GB recommended). Quantized versions (INT8, INT4) can run on GPUs with 20-32GB VRAM. The model is optimized for NVIDIA hardware and works with vLLM, TensorRT-LLM, and SGLang for efficient serving.
When will Nemotron 3 Super and Ultra be released?
According to NVIDIA's research page, Nemotron 3 Super (~100B parameters) and Nemotron 3 Ultra (~500B parameters) are planned for release in the first half of 2026. Nemotron 3 Super is optimized for collaborative agents and high-volume workloads, while Nemotron 3 Ultra targets state-of-the-art accuracy and reasoning performance.
