NVIDIA AI brings Nemotron-3 Nano-30B to NVFP4 using Quantization Aware Distillation for Efficient Inference

Tech | By Gavin Wallace | 02/02/2026 | 5 Mins Read

NVIDIA has released Nemotron-Nano-3-30B-A3B-NVFP4, a production checkpoint that runs a 30B-parameter reasoning model in 4-bit NVFP4 while keeping accuracy close to its BF16 baseline. The model combines a hybrid Mamba2 Mixture-of-Experts architecture with a Quantization Aware Distillation (QAD) recipe designed specifically for NVFP4 deployment. It is the ultra-efficient, NVFP4-precision version of Nemotron-3 Nano.

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

What is Nemotron-Nano-3-30B-A3B-NVFP4?

Nemotron-Nano-3-30B-A3B-NVFP4 is the quantized version of Nemotron-3-Nano-30B-A3B-BF16, which the NVIDIA team trained from the ground up as a unified reasoning and chat model. It is built on a hybrid Mamba2 Transformer MoE network (see the configuration sketch after the list):

  • 30B parameters in total
  • 52 layers deep
  • Alternating Mamba2 layers and MoE layers
  • Six grouped-query attention layers with two query groups
  • Each MoE layer has 128 routed experts plus a shared expert
  • 6 active experts per token, giving about 3.5B active parameters per token
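
To keep the numbers above in one place, here is a minimal configuration sketch. The class and field names are hypothetical; only the values quoted in the list come from the article.

```python
from dataclasses import dataclass

# Hypothetical summary of the architecture described above; field names are
# illustrative and do not come from the official model config.
@dataclass
class NemotronNano3Config:
    total_params: float = 30e9               # ~30B parameters in total
    active_params_per_token: float = 3.5e9   # ~3.5B active per token
    num_layers: int = 52                     # depth of the hybrid stack
    num_attention_layers: int = 6            # grouped-query attention layers
    num_query_groups: int = 2                # query groups for GQA
    num_routed_experts: int = 128            # experts available to the router
    num_active_experts: int = 6              # experts selected per token

    @property
    def active_fraction(self) -> float:
        """Share of total parameters that fire for each token."""
        return self.active_params_per_token / self.total_params


cfg = NemotronNano3Config()
print(f"Active fraction per token: {cfg.active_fraction:.1%}")  # ~11.7%
```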

The model is pre-trained on 25T tokens using a Warmup-Stable-Decay learning-rate schedule with a batch size of 3072, a peak learning rate of 1e-3, and a minimum of 1e-5.
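
For readers unfamiliar with the schedule, below is a minimal sketch of a Warmup-Stable-Decay curve using the reported peak (1e-3) and minimum (1e-5) learning rates. The warmup and decay fractions are assumptions for illustration, not values from the report.

```python
def wsd_lr(step: int, total_steps: int,
           peak_lr: float = 1e-3, min_lr: float = 1e-5,
           warmup_frac: float = 0.01, decay_frac: float = 0.2) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay.

    peak_lr / min_lr match the values quoted for Nemotron-3 Nano pre-training;
    warmup_frac and decay_frac are illustrative assumptions.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:                      # linear warmup from 0 to peak
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                        # stable plateau at peak
        return peak_lr
    # linear decay from peak down to min_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress


# Example: learning rate at a few points of a 100k-step run.
for s in (0, 500, 50_000, 90_000, 99_999):
    print(s, f"{wsd_lr(s, 100_000):.2e}")
```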

Three stages of post-training follow:

  1. Supervised fine-tuning on synthetic and curated data covering code, mathematics, science, tool calling, instruction following, and structured outputs.
  2. Reinforcement learning with synchronous GRPO across multi-step tool use, multi-turn chat, structured environments, and RLHF with a generative reward model.
  3. Post-training quantization to NVFP4, keeping selected layers in higher precision and using an FP8 KV cache, followed by QAD.

The NVFP4 checkpoint keeps some layers in BF16, including the Mamba and attention layers, quantizes the remaining layers to NVFP4, and stores the KV cache in FP8.
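
As a rough illustration of such a mixed-precision plan, here is a minimal sketch. The module-name patterns and selection rule are hypothetical and do not come from the released configuration.

```python
# Hypothetical precision map for a quantized checkpoint: sensitive layers stay
# in BF16, the rest drop to NVFP4, and the KV cache is stored in FP8.
SENSITIVE_PATTERNS = ("mamba", "attention")   # kept in BF16 for stability

def assign_precision(module_name: str) -> str:
    """Return the storage format to use for a given module (illustrative rule)."""
    if any(p in module_name.lower() for p in SENSITIVE_PATTERNS):
        return "bf16"
    return "nvfp4"

layers = ["layers.0.mamba_mixer", "layers.1.moe.experts.7.up_proj",
          "layers.2.attention.qkv_proj", "layers.3.moe.experts.42.down_proj"]
plan = {name: assign_precision(name) for name in layers}
plan["kv_cache"] = "fp8"
print(plan)
```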

The NVFP4 Format and Why It Matters

NVFP4 is a 4-bit floating-point format for both training and inference on the latest NVIDIA GPUs. Its main features are:

  • Compared to FP8, it delivers two to three times faster arithmetic throughput.
  • It reduces memory consumption for weights and activations by roughly 1.8x.
  • It extends MXFP4 by reducing the block size from 32 to 16 and introducing two-level scaling.

Two-level scaling uses an FP8 (E4M3) scale per block plus an FP32 scale per tensor. The dual scaling widens the dynamic range while keeping quantization error low.
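
To make the two-level scaling concrete, here is a minimal NumPy sketch that simulates quantizing a tensor with 16-element blocks, a per-block scale (bounded to the E4M3 range) and one FP32 scale per tensor. The rounding and the E4M3 handling of the block scales are simplified stand-ins for the real hardware format.

```python
import numpy as np

FP4_MAX = 6.0      # largest magnitude representable by E2M1 (FP4) values
E4M3_MAX = 448.0   # largest magnitude representable by an E4M3 (FP8) scale
# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Simulate NVFP4 two-level scaling: per-tensor FP32 scale + per-block scale."""
    x = x.reshape(-1, block).astype(np.float32)
    # Level 1: per-tensor FP32 scale maps block scales into the E4M3 range.
    tensor_scale = np.abs(x).max() / (FP4_MAX * E4M3_MAX)
    # Level 2: one scale per 16-element block (would be stored as E4M3).
    block_scale = np.abs(x).max(axis=1, keepdims=True) / (FP4_MAX * tensor_scale)
    block_scale = np.maximum(block_scale, 1e-12)  # avoid division by zero
    # Quantize each element to the nearest FP4 grid point.
    scaled = x / (block_scale * tensor_scale)
    idx = np.abs(scaled[..., None] - np.sign(scaled[..., None]) * FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    # Dequantize back to float for error measurement.
    return q * block_scale * tensor_scale

x = np.random.randn(4, 64).astype(np.float32)
x_hat = quantize_nvfp4(x).reshape(x.shape)
print("mean abs quantization error:", np.abs(x - x_hat).mean())
```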

Simple post-training quantization (PTQ) to NVFP4 already gives decent accuracy on benchmarks, but the research team notes that for smaller models and demanding pipelines it can cause non-negligible accuracy drops, which motivates a training-based recovery technique.

From QAT to QAD

Standard Quantization-Aware Training (QAT) inserts pseudo-quantization into the forward pass and reuses the original task loss, typically next-token cross-entropy. This works well for convolutional networks, but the research team lists two major issues for modern LLMs:

  • It is difficult to replicate complex multi-stage post-training pipelines that include SFT, RL, and model merging.
  • The original training data for open models is not always publicly available.

Quantization Aware Distillation (QAD) changes the training target rather than re-running the pipeline. The BF16 model serves as a frozen teacher and the NVFP4 model as the student, and training minimizes the KL divergence between their output distributions instead of the original supervised or RL objective.
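
A minimal PyTorch-style sketch of the objective as described: a frozen BF16 teacher, a fake-quantized student, and a KL-divergence loss on the output distributions. The models are assumed to return HuggingFace-style outputs with a `.logits` attribute, and the fake-quantization of the student is assumed to happen inside its layers; none of this is NVIDIA's actual training code.

```python
import torch
import torch.nn.functional as F

def qad_step(student, teacher, input_ids, optimizer, temperature: float = 1.0):
    """One Quantization Aware Distillation step: match the frozen BF16 teacher.

    `student` is assumed to run with fake-quantized (NVFP4-simulated) weights;
    `teacher` is the frozen BF16 checkpoint. Only text inputs are needed --
    no labels and no reward model.
    """
    with torch.no_grad():                       # teacher stays frozen
        teacher_logits = teacher(input_ids).logits

    student_logits = student(input_ids).logits  # forward through fake-quant layers

    # KL(teacher || student) over the vocabulary (batchmean reduction).
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()                             # gradients flow through the fake-quant ops
    optimizer.step()
    return loss.item()
```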

Three properties of QAD are highlighted by the research team:

  1. QAD aligns the quantized student with the high-precision teacher more closely than standard QAT does.
  2. QAD remains stable even when the teacher has gone through multiple stages such as supervised fine-tuning, reinforcement learning, and model merging, because it only tries to match the teacher’s final behavior.
  3. It can work with synthetic, partial, or filtered data, since it only needs text inputs to query the student and teacher, not labels or reward models.

Benchmarks for Nemotron-3 Nano-30B

Nemotron-3-Nano-30B-A3B is one of the RL-heavy models studied in the QAD research. The report compares accuracy on AA-LCR, AIME25, GPQA-D, LiveCodeBench v5, and SciCode across the BF16 baseline, NVFP4 QAT, and NVFP4 QAD checkpoints.

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf

What you need to know

  • Nemotron-3-Nano-30B-A3B-NVFP4 is a 30B-parameter hybrid Mamba2 Transformer MoE model that runs in 4-bit NVFP4 with a small number of BF16 layers retained for stability. It keeps about 3.5B parameters active per token and supports context windows of up to 1M tokens.
  • NVFP4 uses a block size of 16 with two-level scaling and cuts memory costs for weights and activations by roughly 1.8x compared to FP8.
  • QAD replaces the original task loss with a KL divergence against a frozen BF16 teacher, so the NVFP4 student matches the teacher’s output distribution without repeating the full SFT, RL, and model-merging pipeline.
  • With Quantization Aware Distillation, the NVFP4 version achieves up to an 80% throughput gain while retaining about 99.4% of BF16 accuracy.
  • Plain NVFP4 PTQ shows a noticeable accuracy loss on AA-LCR and AIME25 and degrades further on LiveCodeBench and SciCode, while NVFP4 QAD recovers performance to near BF16, shrinking the gap to a few percentage points at most.

Check out the Paper and the Model Weights linked above for more details.

