NVIDIA AI brings Nemotron-3 Nano-30B to NVFP4 using Quantization Aware Distillation for Efficient Inference

Tech | By Gavin Wallace | 02/02/2026 | 5 Mins Read

NVIDIA has released Nemotron-Nano-3-30B-A3B-NVFP4, a production checkpoint that runs a 30B-parameter reasoning model in 4-bit NVFP4 while keeping accuracy close to its BF16 baseline. The model combines a hybrid Mamba2 Mixture-of-Experts architecture with a Quantization Aware Distillation (QAD) recipe designed specifically for NVFP4 deployment. It is the ultra-efficient, NVFP4-precision version of Nemotron-3 Nano.

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

What is Nemotron-Nano-3-30B-A3B-NVFP4?

Nemotron-Nano-3-30B-A3B-NVFP4 is the quantized version of Nemotron-3-Nano-30B-A3B-BF16, which the NVIDIA team trained from the ground up as a unified reasoning and chat model. It is built on a hybrid Mamba2 Transformer MoE network (see the configuration sketch after the list):

  • 30B parameters in total
  • 52 layers deep
  • Alternating Mamba2 layers and MoE layers
  • Six grouped-query attention layers with two query groups
  • Each MoE layer has 128 routed experts plus a shared expert
  • 6 active experts per token, giving about 3.5B active parameters per token
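
To keep the numbers above in one place, here is a minimal configuration sketch. The class and field names are hypothetical; only the values quoted in the list come from the article.

```python
from dataclasses import dataclass

# Hypothetical summary of the architecture described above; field names are
# illustrative and do not come from the official model config.
@dataclass
class NemotronNano3Config:
    total_params: float = 30e9               # ~30B parameters in total
    active_params_per_token: float = 3.5e9   # ~3.5B active per token
    num_layers: int = 52                     # depth of the hybrid stack
    num_attention_layers: int = 6            # grouped-query attention layers
    num_query_groups: int = 2                # query groups for GQA
    num_routed_experts: int = 128            # experts available to the router
    num_active_experts: int = 6              # experts selected per token

    @property
    def active_fraction(self) -> float:
        """Share of total parameters that fire for each token."""
        return self.active_params_per_token / self.total_params


cfg = NemotronNano3Config()
print(f"Active fraction per token: {cfg.active_fraction:.1%}")  # ~11.7%
```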

The model is pre-trained on 25T tokens using a Warmup-Stable-Decay learning-rate schedule with a batch size of 3072, a peak learning rate of 1e-3, and a minimum of 1e-5.
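
For readers unfamiliar with the schedule, below is a minimal sketch of a Warmup-Stable-Decay curve using the reported peak (1e-3) and minimum (1e-5) learning rates. The warmup and decay fractions are assumptions for illustration, not values from the report.

```python
def wsd_lr(step: int, total_steps: int,
           peak_lr: float = 1e-3, min_lr: float = 1e-5,
           warmup_frac: float = 0.01, decay_frac: float = 0.2) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay.

    peak_lr / min_lr match the values quoted for Nemotron-3 Nano pre-training;
    warmup_frac and decay_frac are illustrative assumptions.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:                      # linear warmup from 0 to peak
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                        # stable plateau at peak
        return peak_lr
    # linear decay from peak down to min_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress


# Example: learning rate at a few points of a 100k-step run.
for s in (0, 500, 50_000, 90_000, 99_999):
    print(s, f"{wsd_lr(s, 100_000):.2e}")
```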

Three stages of post-training follow:

  1. Supervised fine-tuning on synthetic and curated data covering code, mathematics, science, tool calling, instruction following, and structured outputs.
  2. Reinforcement learning with synchronous GRPO across multi-step tool use, multi-turn chat, structured environments, and RLHF with a generative reward model.
  3. Post-training quantization to NVFP4, keeping selected layers in higher precision and using an FP8 KV cache, followed by QAD.

The NVFP4 checkpoint keeps some layers in BF16, including the Mamba and attention layers, quantizes the remaining layers to NVFP4, and stores the KV cache in FP8.
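
As a rough illustration of such a mixed-precision plan, here is a minimal sketch. The module-name patterns and selection rule are hypothetical and do not come from the released configuration.

```python
# Hypothetical precision map for a quantized checkpoint: sensitive layers stay
# in BF16, the rest drop to NVFP4, and the KV cache is stored in FP8.
SENSITIVE_PATTERNS = ("mamba", "attention")   # kept in BF16 for stability

def assign_precision(module_name: str) -> str:
    """Return the storage format to use for a given module (illustrative rule)."""
    if any(p in module_name.lower() for p in SENSITIVE_PATTERNS):
        return "bf16"
    return "nvfp4"

layers = ["layers.0.mamba_mixer", "layers.1.moe.experts.7.up_proj",
          "layers.2.attention.qkv_proj", "layers.3.moe.experts.42.down_proj"]
plan = {name: assign_precision(name) for name in layers}
plan["kv_cache"] = "fp8"
print(plan)
```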

The NVFP4 Format and Why It Matters

NVFP4 is a 4-bit floating-point format for both training and inference on the latest NVIDIA GPUs. Its main features are:

  • Compared to FP8, it delivers two to three times faster arithmetic throughput.
  • It reduces memory consumption for weights and activations by roughly 1.8x.
  • It extends MXFP4 by reducing the block size from 32 to 16 and introducing two-level scaling.

Two-level scaling uses an FP8 (E4M3) scale per block plus an FP32 scale per tensor. The dual scaling widens the dynamic range while keeping quantization error low.
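
To make the two-level scaling concrete, here is a minimal NumPy sketch that simulates quantizing a tensor with 16-element blocks, a per-block scale (bounded to the E4M3 range) and one FP32 scale per tensor. The rounding and the E4M3 handling of the block scales are simplified stand-ins for the real hardware format.

```python
import numpy as np

FP4_MAX = 6.0      # largest magnitude representable by E2M1 (FP4) values
E4M3_MAX = 448.0   # largest magnitude representable by an E4M3 (FP8) scale
# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Simulate NVFP4 two-level scaling: per-tensor FP32 scale + per-block scale."""
    x = x.reshape(-1, block).astype(np.float32)
    # Level 1: per-tensor FP32 scale maps block scales into the E4M3 range.
    tensor_scale = np.abs(x).max() / (FP4_MAX * E4M3_MAX)
    # Level 2: one scale per 16-element block (would be stored as E4M3).
    block_scale = np.abs(x).max(axis=1, keepdims=True) / (FP4_MAX * tensor_scale)
    block_scale = np.maximum(block_scale, 1e-12)  # avoid division by zero
    # Quantize each element to the nearest FP4 grid point.
    scaled = x / (block_scale * tensor_scale)
    idx = np.abs(scaled[..., None] - np.sign(scaled[..., None]) * FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    # Dequantize back to float for error measurement.
    return q * block_scale * tensor_scale

x = np.random.randn(4, 64).astype(np.float32)
x_hat = quantize_nvfp4(x).reshape(x.shape)
print("mean abs quantization error:", np.abs(x - x_hat).mean())
```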

Simple post-training quantization (PTQ) to NVFP4 already gives decent accuracy on benchmarks, but the research team notes that for smaller models and demanding pipelines it can cause non-negligible accuracy drops, which motivates a training-based recovery technique.

From QAT to QAD

Standard Quantization-Aware Training (QAT) inserts pseudo-quantization into the forward pass and reuses the original task loss, typically next-token cross-entropy. This works well for convolutional networks, but the research team lists two major issues for modern LLMs:

  • It is difficult to replicate complex multi-stage post-training pipelines that include SFT, RL, and model merging.
  • The original training data for open models is not always publicly available.

Quantization Aware Distillation (QAD) changes the training target rather than re-running the pipeline. The BF16 model serves as a frozen teacher and the NVFP4 model as the student, and training minimizes the KL divergence between their output distributions instead of the original supervised or RL objective.
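
A minimal PyTorch-style sketch of the objective as described: a frozen BF16 teacher, a fake-quantized student, and a KL-divergence loss on the output distributions. The models are assumed to return HuggingFace-style outputs with a `.logits` attribute, and the fake-quantization of the student is assumed to happen inside its layers; none of this is NVIDIA's actual training code.

```python
import torch
import torch.nn.functional as F

def qad_step(student, teacher, input_ids, optimizer, temperature: float = 1.0):
    """One Quantization Aware Distillation step: match the frozen BF16 teacher.

    `student` is assumed to run with fake-quantized (NVFP4-simulated) weights;
    `teacher` is the frozen BF16 checkpoint. Only text inputs are needed --
    no labels and no reward model.
    """
    with torch.no_grad():                       # teacher stays frozen
        teacher_logits = teacher(input_ids).logits

    student_logits = student(input_ids).logits  # forward through fake-quant layers

    # KL(teacher || student) over the vocabulary (batchmean reduction).
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()                             # gradients flow through the fake-quant ops
    optimizer.step()
    return loss.item()
```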

Three properties of QAD are highlighted by the research team:

  1. QAD aligns the quantized student with the high-precision teacher more closely than standard QAT does.
  2. QAD remains stable even when the teacher has gone through multiple stages such as supervised fine-tuning, reinforcement learning, and model merging, because it only tries to match the teacher’s final behavior.
  3. It can work with synthetic, partial, or filtered data, since it only needs text inputs to query the student and teacher, not labels or reward models.

Benchmarks for Nemotron-3 Nano-30B

Nemotron-3-Nano-30B-A3B is one of the RL-heavy models studied in the QAD research. The report compares accuracy on AA-LCR, AIME25, GPQA-D, LiveCodeBench v5, and SciCode across the BF16 baseline, NVFP4 QAT, and NVFP4 QAD checkpoints.

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf

What you need to know

  • Nemotron-3-Nano-30B-A3B-NVFP4 is a 30B-parameter hybrid Mamba2 Transformer MoE model that runs in 4-bit NVFP4 with a small number of BF16 layers retained for stability. It keeps about 3.5B parameters active per token and supports context windows of up to 1M tokens.
  • NVFP4 uses a block size of 16 with two-level scaling and cuts memory costs for weights and activations by roughly 1.8x compared to FP8.
  • QAD replaces the original task loss with a KL divergence against a frozen BF16 teacher, so the NVFP4 student matches the teacher’s output distribution without repeating the full SFT, RL, and model-merging pipeline.
  • With Quantization Aware Distillation, the NVFP4 version achieves up to an 80% throughput gain while retaining about 99.4% of BF16 accuracy.
  • Plain NVFP4 PTQ shows a noticeable accuracy loss on AA-LCR and AIME25 and degrades further on LiveCodeBench and SciCode, while NVFP4 QAD recovers performance to near BF16, shrinking the gap to a few percentage points at most.

Check out the Paper and the Model Weights linked above for more details.

