We’re excited to introduce Nemotron 3 Nano 4B, the latest and most compact member of the Nemotron 3 family. Leveraging a hybrid Mamba-Transformer architecture, this model is designed for efficiency and accuracy across a targeted set of capabilities, setting a new standard for lightweight small language models. The model is available on all NVIDIA GPU-enabled platforms and combines state-of-the-art instruction following and exceptional tool use with a minimal VRAM footprint.
With just 4 billion parameters, Nemotron 3 Nano 4B is compact enough to run at the edge on NVIDIA Jetson platforms (Jetson Thor/Jetson Orin Nano) as well as NVIDIA DGX Spark and NVIDIA RTX GPUs. This enables faster response times, enhanced data privacy, and versatile deployment while keeping inference costs low.
Nemotron 3 Nano 4B is our first model specifically optimized for on-device deployment and purpose-built to power local conversational agents and personas across GeForce RTX, Jetson, and Spark customer use cases. This model achieves state-of-the-art accuracy and efficiency in several dimensions key to production use at the edge:
- Instruction following (IFBench, IFEval): state-of-the-art in its size class
- Gaming agency/intelligence (Orak): state-of-the-art in its size class
- VRAM efficiency (peak memory use): lowest VRAM footprint in its size class under both high and low ISL/OSL settings (*1)
- Latency: lowest TTFT in its size class under high ISL settings (*1)
(*1) Efficiency benchmarks were measured on an RTX 4070 using Llama.cpp with Q4_K_M-quantized versions of each model.
Moreover, Nemotron 3 Nano 4B delivers excellent tool-use performance and is very competitive in hallucination avoidance. Together, these capabilities show the model’s strong suitability for edge use cases.
Nemotron 3 Nano 4B was pruned and distilled from Nemotron Nano 9B v2 using the Nemotron Elastic framework, allowing it to inherit the parent’s strong reasoning capabilities as a hybrid reasoning model. It was further post-trained with a new recipe derived from Nemotron 3 Post-training data, enabling the model to excel at task solving even without explicit thinking.
Finally, as an open-source model, it empowers the ecosystem to customize, fine-tune, and optimize it for domain-specific use cases.
For Orak, we evaluated the models on tactical games such as Super Mario, Darkest Dungeon, and Stardew Valley.
Training Recipe for Nemotron 3 Nano 4B
Compressing 9B → 4B with Nemotron Elastic
Nemotron 3 Nano 4B was derived from Nemotron Nano 9B v2 using Nemotron Elastic technology. Rather than training a 4B model from scratch, or performing separate stages of pruning, candidate search, and distillation as in existing LLM compression techniques, Nemotron Elastic uses structured pruning guided by a router that is jointly trained with the model, using an auxiliary loss on the student model size in addition to the standard knowledge distillation loss. This approach achieves an optimal student model at a fraction of the cost of pretraining from scratch or conventional compression.
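The post doesn’t spell out the exact loss formulation, but the idea of combining a knowledge distillation term with an auxiliary budget term can be sketched in a few lines. All function names and the penalty form below are illustrative assumptions, not the actual Nemotron Elastic implementation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits):
    """Forward KL between the teacher's and student's token distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def budget_loss(selected_params, target_params):
    """Penalize router configurations that exceed the parameter budget."""
    return max(0.0, selected_params / target_params - 1.0)

def elastic_objective(student_logits, teacher_logits,
                      selected_params, target_params, lam=1.0):
    """Joint objective: distillation quality plus a size-budget penalty."""
    return (kd_loss(student_logits, teacher_logits)
            + lam * budget_loss(selected_params, target_params))
```

In this toy form, the router is steered toward architectures that both match the teacher’s outputs and stay within the 4B budget.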
How the Router Decides What to Prune
Nemotron Elastic introduces an end-to-end trained router that performs neural architecture search over multiple compression axes, jointly with the knowledge distillation run. For Nano 4B, the framework was used in a single-budget configuration (targeting the 4B parameter count only), where the router’s role is to determine which axes to prune, and by how much, to reach the target budget.
The router was given four pruning axes to choose from:
- Mamba heads — reducing the number of SSM heads
- Hidden dimension (embedding dimension) — shrinking the model-wide representation width
- FFN channels — pruning intermediate neurons in MLP layers
- Depth (layers) — removing entire layers from the network
For each width axis, prior knowledge about component importance was provided to the router by sorting channels, heads, and neurons according to activation-based importance scores. For depth, a normalized MSE-based layer importance score was used: each layer was removed in turn, and the impact on the full model’s output logits was measured, giving a principled ordering of which layers matter most. More details can be found in the Nemotron Elastic paper.
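The depth-importance procedure can be illustrated with a toy sketch: treat the model as a stack of layer functions, ablate each one in turn, and normalize the resulting MSE against the full model’s outputs. The helpers below are hypothetical stand-ins, with plain lists of numbers in place of logits:

```python
def run(layers, x):
    # Apply a stack of elementwise layer functions in order.
    for f in layers:
        x = [f(v) for v in x]
    return x

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def layer_importance(layers, inputs):
    """Normalized MSE-based layer importance (toy sketch).

    Each layer is removed in turn, and the MSE between the full model's
    outputs and the ablated model's outputs is measured. Scores are
    normalized to sum to 1, giving an ordering of which layers matter most.
    """
    baseline = run(layers, inputs)
    raw = []
    for i in range(len(layers)):
        ablated = layers[:i] + layers[i + 1:]
        raw.append(mse(run(ablated, inputs), baseline))
    total = sum(raw) or 1.0
    return [r / total for r in raw]
```

A layer whose removal barely changes the outputs gets a score near zero and becomes a natural pruning candidate.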
Given the 4B target parameter budget, the router converged on the following pruning decisions:
| Axis | Nemotron Nano 9B v2 (Parent) | Nemotron 3 Nano 4B |
|---|---|---|
| Depth | 56 layers (27 Mamba, 4 attention, 25 MLP) | 42 layers (21 Mamba, 4 attention, 17 MLP) |
| Mamba heads | 128 | 96 |
| FFN intermediate dim | 15680 | 12544 |
| Embedding dim | 4480 | 3136 |
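As a rough back-of-envelope check (our own approximation, not a calculation from the paper), multiplying the depth ratio by the width ratios implied by the table lands in the neighborhood of the 4B target:

```python
# Per-axis retention ratios implied by the table above.
depth_ratio = 42 / 56          # 0.75
embed_ratio = 3136 / 4480      # 0.70
ffn_ratio = 12544 / 15680      # 0.80
heads_ratio = 96 / 128         # 0.75

# Crude approximation: width-dependent parameters scale roughly with the
# embedding ratio times a per-block width ratio; average the MLP and Mamba
# reductions and apply the depth ratio on top. Real counts depend on how
# each axis enters each block type.
width_ratio = embed_ratio * (ffn_ratio + heads_ratio) / 2
estimate = 9e9 * depth_ratio * width_ratio  # lands in the ~4B range
```

This ignores attention layers and tied embeddings, so treat it only as a sanity check that the chosen ratios are consistent with a 9B → 4B compression.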
Two-Stage Distillation for Accuracy Recovery
After the router determines the pruned architecture, the compressed model is retrained via knowledge distillation from the frozen 9B parent, using Nano v2’s pre-training and post-training data. This accuracy-recovery process runs in two stages:
- Stage 1 — Short-context distillation (8K sequence length): The 4B model is trained on 63B tokens with an 8K context window, using a data mix of roughly 70% post-training data and 30% pretraining data from the parent Nano v2 recipe. This stage is crucial for the initial recovery of model accuracy after compression.
- Stage 2 — Long-context extension (49K sequence length): To restore performance on more difficult tasks that require extended reasoning chains, the context is extended to 49K tokens. In this stage, the model is trained on 150B tokens.
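The two-stage schedule can be summarized as a config sketch. The field names, and the exact token counts behind “8K” and “49K”, are illustrative assumptions rather than the actual training configuration:

```python
# Hypothetical summary of the two distillation stages described above.
DISTILL_STAGES = [
    {
        "name": "short_context_distillation",
        "seq_len": 8_192,          # "8K"; exact value assumed
        "train_tokens": 63e9,
        "data_mix": {"post_training": 0.7, "pretraining": 0.3},
    },
    {
        "name": "long_context_extension",
        "seq_len": 49_152,         # "49K"; exact value assumed
        "train_tokens": 150e9,
    },
]
```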
Supervised Fine-Tuning
We conducted two stages of SFT with relevant subsets from the Nemotron-Post-Training-v3 collection using Megatron-LM. The first SFT stage trains the model on a mixture of reasoning and non-reasoning data spanning diverse domains such as math, coding, science, chat, instruction following, and agentic tasks. The second stage is a smaller-scale, focused training run to strengthen safety behaviors.
Multi-environment Reinforcement Learning
Once the model is bootstrapped with SFT, we switch to a three-stage RL pipeline using NeMo-RL to target our focus areas: instruction following and tool-calling/agentic behavior. In the first stage, we use single-turn instruction-following data. In the second stage, we use NeMo-Gym environments for single-turn and multi-turn instruction following as well as for structured outputs (JSON, XML). Finally, in the third stage, we use a preliminary version of Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 for multi-turn conversational tool-calling. A balanced 50-50 ratio of reasoning and non-reasoning data was used throughout the three RLVR stages, with the KL penalty progressively increased at each stage.
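The KL-penalized, verifiable-reward objective behind RLVR stages like these can be sketched in a few lines. The single-sample KL estimate and the beta values below are illustrative assumptions, not the actual training hyperparameters:

```python
def rlvr_objective(reward, logp_policy, logp_ref, beta):
    """Verifiable-reward RL objective with a KL penalty (sketch).

    reward: scalar verifiable reward for a sampled response
    logp_policy / logp_ref: summed log-probs of that response under the
        current policy and the frozen reference model
    beta: KL-penalty coefficient, increased at each RL stage
    """
    kl_estimate = logp_policy - logp_ref  # crude per-sample KL estimate
    return reward - beta * kl_estimate

# Progressively increasing KL penalty across the three stages
# (values are made up for illustration).
stage_betas = [0.001, 0.005, 0.01]
```

Raising beta at each stage trades exploration for stability: later stages keep the policy closer to the reference model while the verifiable reward refines the focus behaviors.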
Boosting Efficiency with Quantization
For edge devices, it is crucial to further reduce model size through quantization to improve efficiency and reduce VRAM usage. Nemotron 3 Nano 4B is released in FP8 and Q4_K_M GGUF versions so it runs efficiently on edge devices.
For the FP8 model, we applied Post-Training Quantization (PTQ) using the ModelOpt library. For the PTQ calibration dataset, we used a small subset of 1K samples from the post-training SFT dataset to estimate activation statistics and minimize quantization-related accuracy loss. To preserve accuracy while improving efficiency, we also applied a selective quantization strategy rather than quantizing the entire network. Comparing a set of quantization configurations showed that keeping the self-attention layers (4 out of 42 layers) and the 4 Mamba layers that precede them at BF16 provided a sweet spot in the trade-off between accuracy recovery and efficiency gain. The model weights, activations, and KV cache are quantized to FP8, while the Conv1D layers inside all Mamba blocks are kept in BF16. The FP8 model achieved 100% median accuracy recovery across target benchmarks compared to the BF16 model, and the FP8 quantized version delivers up to a 1.8X improvement in latency and throughput over the original BF16 version on DGX Spark and Jetson Thor.
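The selective-precision rule described above (keep self-attention layers and their immediately preceding Mamba layers in BF16, quantize everything else to FP8) can be sketched as a small helper. This is a hypothetical illustration of the layer-selection logic, not the ModelOpt API:

```python
def quant_plan(layer_types):
    """Assign a precision per layer (sketch of the selective strategy).

    layer_types: list of "mamba", "attention", or "mlp" in network order.
    Self-attention layers, and the Mamba layer immediately preceding each
    one, stay in BF16; all other layers are quantized to FP8.
    """
    plan = ["fp8"] * len(layer_types)
    for i, t in enumerate(layer_types):
        if t == "attention":
            plan[i] = "bf16"
            if i > 0 and layer_types[i - 1] == "mamba":
                plan[i - 1] = "bf16"
    return plan
```

For a 42-layer stack with 4 attention layers, this keeps 8 layers in BF16 and quantizes the remaining 34, matching the proportions described above.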
For Llama.cpp support, we use the widely adopted GGUF quantization method Q4_K_M, a 4-bit scheme that provides a good balance between efficiency and accuracy. The Q4_K_M GGUF version achieved 100% median accuracy recovery across the target benchmarks compared to the BF16 model.
This GGUF release is also well suited to Jetson deployments. On the Jetson Orin Nano 8GB, designed for small embedded devices, the Q4_K_M checkpoint running with Llama.cpp delivers 18 tokens/s, up to 2× higher throughput than Nemotron Nano 9B v2, highlighting Nemotron 3 Nano 4B’s efficiency for edge inference in embedded AI and robotics use cases.
Try It Now!
Nemotron 3 Nano 4B is available across a variety of inference engines, including Transformers, vLLM, TRT-LLM, and Llama.cpp, enabling support for a wide range of edge deployment scenarios.
To get started, visit the Hugging Face repositories below to download the model checkpoints. Usage examples for Hugging Face Transformers, vLLM, TRT-LLM, and Llama.cpp are available in the Model Card.
For Jetson, step-by-step instructions and ready-to-run commands are available on the Jetson AI Lab model page.
Also, check out the NVIDIA In-Game Inferencing (NVIGI) SDK to accelerate inference performance when running the model alongside heavy graphics workloads.


