Supercharge Edge AI With High‑Accuracy Reasoning Using NVIDIA Nemotron Nano 2 9B




AI agents have gone mainstream from edge to cloud, using sophisticated reasoning and iterative planning to autonomously solve complex, multi-step problems. To get the best performance out of these AI agents at the edge, developers need to ensure that the models powering them are not only accurate but also deliver high efficiency.

The newly released NVIDIA Nemotron Nano 2 9B brings these capabilities to the edge with leading accuracy and efficiency, thanks to a hybrid Transformer–Mamba architecture and a configurable thinking budget, so you can dial accuracy, throughput, and cost to match your real-world needs.

| You can try this model out now at build.nvidia.com



Highlights (TL;DR)

  • Model size: 9B Parameters
  • Architecture: Hybrid Transformer–Mamba (Mamba‑2 plus a small number of attention layers) for higher throughput at similar accuracy to Transformer‑only peers.
  • Throughput: Up to 6x higher token generation throughput than other leading models in its size class.
  • Cost: The thinking budget lets you control how many “thinking” tokens are used, reducing reasoning costs by up to 60%.
  • Target: Agents for customer service, support chatbots, analytics copilots, and edge/RTX deployments.
  • Availability: The model weights are available on Hugging Face, you can try the endpoint on build.nvidia.com, and the model will soon be available as an NVIDIA NIM for high throughput and low latency.
  • License: nvidia-open-model-license



What Is Nemotron Nano 2?

Nemotron Nano 2 is the latest “Nano” model in the NVIDIA Nemotron family of open models and is purpose-built for enterprise‑grade reasoning and agentic AI. It introduces a configurable thinking budget (you control how much internal reasoning the model does) and a hybrid Transformer–Mamba backbone to boost throughput while preserving accuracy, making it a great fit for PC/edge footprints and cost control.

NVIDIA is releasing the Nemotron family of models to support the open-source community with open weights, open datasets, and training techniques. We encourage developers to use different parts, or the whole, of Nemotron to improve their models for their specific use cases.

Like other models in the suite, Nemotron Nano 2 leads in accuracy for its size category across reasoning tasks such as math, coding, and science, while remaining an efficient model for agentic workflows by excelling at both instruction following and function calling.

Figure 1: Accuracy of Nemotron Nano 2 9B on various popular benchmarks

Alongside best-in-class accuracy, Nemotron Nano 2 also delivers strong performance thanks to the hybrid Transformer–Mamba architecture. This lets the model produce those critical thinking tokens at a pace well suited to low-latency environments. As shown in Figure 2, Nemotron Nano 2 has up to 6x higher throughput compared to the next best open model.

Figure 2: Comparison of throughput and accuracy of Nemotron Nano 2 9B and Qwen 3 8B

Beyond that, with a user-defined thinking budget, developers can right-size the amount of “thinking” the model does to save tokens while retaining high accuracy. This selective cutoff strategy reduces unnecessary token generation, lowering inference costs by up to 60% without significantly impacting accuracy.

Figure 3: Accuracy of the Nemotron Nano 2 9B model on popular benchmarks at various thinking budget thresholds



How We Built Nemotron Nano 2

Hybrid Architecture: Nemotron Nano 2 uses a hybrid Transformer–Mamba backbone built for reasoning‑heavy, long‑output workloads. Most layers are Mamba‑2 selective state‑space modules, which run in linear time and maintain constant memory per token. Because they don’t accumulate a growing KV cache, they handle long “thinking” traces efficiently, yielding higher tokens per second and lower memory use. Interleaved among them are a small number of attention “islands” that preserve the Transformer’s strength in content‑based global jumps, useful for linking distant facts or instructions. In practice, the hybrid keeps Transformer‑grade accuracy while leaning on Mamba for more throughput.
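
To make that intuition concrete, here is a toy back-of-the-envelope sketch. It is purely illustrative: the layer split, KV-head count, head dimension, and state size are assumptions for the sake of the example, not the actual Nemotron Nano 2 configuration.

def attention_kv_cache_gib(attn_layers, tokens, kv_heads=8, head_dim=128, bytes_per_value=2):
    # K and V vectors are cached for every generated token in every attention layer
    return attn_layers * tokens * kv_heads * head_dim * 2 * bytes_per_value / 2**30

def mamba_state_gib(mamba_layers, state_bytes_per_layer=4 * 2**20):
    # A fixed-size recurrent state per layer, independent of how many tokens were generated
    return mamba_layers * state_bytes_per_layer / 2**30

for tokens in (1_000, 16_000, 64_000):
    full_attn = attention_kv_cache_gib(attn_layers=56, tokens=tokens)
    hybrid = attention_kv_cache_gib(attn_layers=4, tokens=tokens) + mamba_state_gib(mamba_layers=52)
    print(f"{tokens:>6} generated tokens: all-attention ~{full_attn:.2f} GiB vs hybrid ~{hybrid:.2f} GiB")

The specific numbers are made up; the point is the shape of the curve: an attention-only cache grows linearly with the length of the reasoning trace, while the hybrid’s memory is dominated by a constant-size state.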

Post-Training Process: On the post-training side, the model undergoes supervised fine-tuning (SFT) on a balanced mixture of reasoning-on and reasoning-off data spanning mathematics, science, programming, tool use, general conversation, and safety. This process is conducted in multiple stages to strengthen performance in specific domains, such as improving tool-calling reliability and enhancing long-context comprehension. Following SFT, the model is further refined through focused reinforcement learning and preference-based optimization, ensuring alignment with desired behaviors and robustness across a wide range of tasks.

Model Compression and Distillation: Nemotron Nano 2 starts from a 12B hybrid Mamba-Transformer base model, NVIDIA-Nemotron-Nano-12B-v2-Base, which was post-trained and aligned for various reasoning and non-reasoning tasks. This post-trained 12B model sets the accuracy bar and serves as the teacher for the pruned and distilled Nano 2 (9B). The 12B parameter model consumes 22.9 GiB of memory for its weights alone (in bfloat16 precision), which exceeds the 22 GiB capacity of the NVIDIA A10G GPU. We therefore apply model compression in the form of pruning to the 12B parameter model to obtain the smaller 9B parameter model. Nemotron Nano 2 is designed to fit within the A10G’s memory limits while running 128k-context inference. For compression, we set the model’s budget to 19.66 GiB, leaving a 5% buffer for frameworks like vLLM and 1.3 GiB for a vision encoder. Nemotron Nano 2 is also designed to achieve significantly higher throughput than pure Transformer-based models in reasoning settings (e.g., ISL/OSL = 8k/16k) while retaining accuracy.
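
As a rough sanity check on those numbers, here is a small back-of-the-envelope calculation. It assumes 2 bytes per parameter for bfloat16 and GiB = 2^30 bytes; the parameter counts and buffer accounting are approximations, not official figures, so the results only roughly match the values quoted above.

BYTES_PER_PARAM_BF16 = 2
GIB = 2**30

# The "12B" teacher (roughly 12.3B parameters) in bfloat16 exceeds the A10G's 22 GiB
teacher_weights_gib = 12.3e9 * BYTES_PER_PARAM_BF16 / GIB
print(f"12B teacher weights: ~{teacher_weights_gib:.1f} GiB")   # ~22.9 GiB

# Compression budget: A10G capacity minus a 5% framework buffer and 1.3 GiB for a vision encoder
budget_gib = 22 * 0.95 - 1.3
print(f"compression target: ~{budget_gib:.1f} GiB")             # ~19.6 GiB (blog: 19.66 GiB)

# The pruned ~9B student fits with headroom left for 128k-context inference state
student_weights_gib = 9e9 * BYTES_PER_PARAM_BF16 / GIB
print(f"9B student weights: ~{student_weights_gib:.1f} GiB")    # ~16.8 GiB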

Figure 4: Model training flow for NVIDIA Nemotron Nano 9B V2

To produce the compressed model, we built on the Minitron model compression framework, extending its Neural Architecture Search (NAS) module to find the best architecture within our memory budget. This search involved combinatorial pruning across multiple axes: depth (reducing the original 62 layers to 56), embedding channels, FFN dimension, and Mamba heads. To make the search computationally feasible, we split it into two phases: (1) determine the optimal depth that avoids significant accuracy degradation (found to be 56 layers in this work), and (2) perform width pruning to find the best configuration at that depth. To recover performance lost during pruning, we retrained the selected candidate architecture using logit-based knowledge distillation, with the original 12B model serving as the teacher. This phase used a forward KL divergence loss to transfer knowledge, first with a short distillation run to pick the top-performing architecture, followed by a longer distillation run to create the final Nemotron Nano 2 model.
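
For reference, a minimal sketch of a logit-based distillation step with a forward KL loss is shown below. It is illustrative PyTorch under simplifying assumptions; the tensor shapes, temperature handling, and the commented-out training loop are ours, not the actual Nemotron training code.

import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over tokens."""
    # Flatten [batch, seq, vocab] -> [batch * seq, vocab] so "batchmean" averages per token
    student_logp = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    # F.kl_div(input=student log-probs, target=teacher log-probs, log_target=True)
    # computes the forward KL, KL(teacher || student)
    return F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean") * temperature**2

# Usage sketch: the frozen post-trained 12B model provides teacher logits,
# the pruned 9B candidate is the trainable student.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = forward_kl_distillation_loss(student_logits, teacher_logits)
# loss.backward()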

You can read about this in more detail in the technical report.



What Is a “Thinking Budget”?

The thinking budget lets you set a limit on the model’s internal reasoning. This is achieved by inserting the closing </think> tag once the budget is reached, after which the model won’t continue thinking.

We’ll look at an example of how you can create a client with this functionality below, and the thinking budget will be automatically included in the downloadable NIM.

This thinking budget allows developers to keep accuracy high and meet response‑time targets, which is particularly crucial for customer service, autonomous agent steps, and edge devices where every millisecond counts.

Where this is most useful:

  • Customer support/chatbots with strict SLAs
  • Edge agents on NVIDIA RTX/Jetson (limited memory/thermal)
  • Developer/analytics copilots doing multi‑hop tool use
  • RAG pipelines where you would like predictable step times

Because the model can behave differently as thinking budgets are varied across domains, you can use Figure 3 as a guideline for choosing a starting thinking budget for your domain; ultimately it will take some experimentation to arrive at the right budget for your task.



How To Use The Nemotron Nano 2 Model:

Much like other Nemotron reasoning models, this model has two thinking modes: Reasoning “ON”, which outputs a reasoning chain-of-thought wrapped in thinking tokens, and Reasoning “OFF”, which goes directly to the final response with no generated thinking tokens.

Reasoning is “ON” by default with this model.

  • When using Reasoning “ON”, it is recommended to use a temperature of 0.6 and a top_p of 0.95.
  • To use Reasoning “OFF”, simply provide /no_think in the system prompt.
  • When using Reasoning “OFF”, it is recommended to use a temperature of 0. (A minimal Reasoning “OFF” request is sketched right after the server setup below.)

Let’s start by spinning up a vLLM server for our model:

vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 --trust-remote-code --mamba_ssm_cache_dtype float32
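
Before adding the thinking budget logic, you can sanity-check the server and the Reasoning “OFF” mode described above with a plain OpenAI-compatible request. This is a minimal sketch, assuming the local vLLM endpoint started by the command above:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Reasoning "OFF": /no_think in the system prompt, greedy decoding (temperature 0)
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. /no_think"},
        {"role": "user", "content": "What is 2+2?"},
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)  # final answer only, no thinking trace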

Now that we have our server up and running, let’s set up a client that implements the thinking budget on the client side:

from typing import Any, Dict, List
import openai
from transformers import AutoTokenizer

class ThinkingBudgetClient:
    def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str):
        self.base_url = base_url
        self.api_key = api_key
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)

    def chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        max_thinking_budget: int = 512,
        max_tokens: int = 1024,
        **kwargs,
    ) -> Dict[str, Any]:
        assert (
            max_tokens > max_thinking_budget
        ), f"thinking budget must be smaller than maximum new tokens. Given {max_tokens=} and {max_thinking_budget=}"

        # 1. Run a chat completion capped at the thinking budget to get the reasoning content
        response = self.client.chat.completions.create(
            model=model, messages=messages, max_tokens=max_thinking_budget, **kwargs
        )
        content = response.choices[0].message.content

        reasoning_content = content
        if "</think>" not in reasoning_content:
            # Reasoning was cut off by the budget: close it so the model stops thinking
            reasoning_content = f"{reasoning_content}.\n</think>\n\n"
        reasoning_tokens_len = len(
            self.tokenizer.encode(reasoning_content, add_special_tokens=False)
        )
        remaining_tokens = max_tokens - reasoning_tokens_len
        assert (
            remaining_tokens > 0
        ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase the max_tokens or lower the max_thinking_budget."

        # 2. Append the (closed) reasoning to the messages and generate the final answer
        messages.append({"role": "assistant", "content": reasoning_content})
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            continue_final_message=True,
        )
        response = self.client.completions.create(
            model=model, prompt=prompt, max_tokens=max_tokens, **kwargs
        )

        response_data = {
            "reasoning_content": reasoning_content.strip().strip("</think>").strip(),
            "content": response.choices[0].text,
            "finish_reason": response.choices[0].finish_reason,
        }
        return response_data

Let’s call our vLLM backend through our thinking budget client. For this example, we’ll restrict the thinking budget to 32 tokens.

tokenizer_name_or_path = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
client = ThinkingBudgetClient(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    tokenizer_name_or_path=tokenizer_name_or_path,
)

result = client.chat_completion(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. /think"},
        {"role": "user", "content": "What is 2+2?"},
    ],
    max_thinking_budget=32,
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
)
print(result)

You should see output similar to the following:

{'reasoning_content': "Okay, the user asked, What is 2+2? Let me think. Well, 2 plus 2 equals 4. That's a basic.", 'content': '2 + 2 equals **4**.\n', 'finish_reason': 'stop'}



Get Started

To summarize, Nemotron Nano 2 9B offers leading accuracy among models in a similar parameter range while delivering up to 6x higher throughput compared to the next best open model. Enterprises can also save up to 60% in inference costs with the new thinking budget feature.

NVIDIA has also open-sourced additional technical artifacts (including post-training and pre-training datasets), which you can read about here.

You can get started with Nemotron Nano 9B V2 today by trying the hosted endpoint on build.nvidia.com or downloading the model weights from Hugging Face.

Coming soon, you’ll be able to download and deploy this model through NVIDIA NIM as well!


