Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator




It has become increasingly difficult to evaluate whether a model’s
reported improvements reflect real advances or variations in
evaluation conditions, dataset composition, or training data that
mirrors benchmark tasks. The NVIDIA Nemotron approach to openness
addresses this by publishing transparent and reproducible evaluation
recipes that make results independently verifiable.

NVIDIA released Nemotron 3 Nano 30B A3B with an explicitly open
evaluation approach to make that distinction clear. Alongside the model
card, we’re publishing the entire evaluation recipe used to generate the
results, built with the NVIDIA NeMo Evaluator library, so anyone can
rerun the evaluation pipeline, inspect the artifacts, and analyze the
outcomes independently.

We believe that open innovation is the foundation of AI progress. This
level of transparency matters because most model evaluations omit
critical details. Configs, prompts, harness versions, runtime settings,
and logs are sometimes missing or underspecified, and even small differences
in these parameters can materially change results. Without a complete
recipe, it’s nearly impossible to tell whether a model is genuinely
smarter or simply optimized for a benchmark.

This blog shows developers exactly how to reproduce the evaluation
behind Nemotron 3 Nano 30B A3B using fully open tools, configurations,
and artifacts. You’ll learn how the evaluation was run, why the
methodology matters, and how to execute the same end-to-end workflow
using the NeMo Evaluator library so you can verify results, compare
models consistently, and build transparent evaluation pipelines of your
own.



Building a consistent and transparent evaluation workflow with NeMo Evaluator



A single, consistent evaluation system

Developers and researchers need evaluation workflows they can rely on,
not one-off scripts that behave differently from model to model. NeMo
Evaluator provides a unified way to define benchmarks, prompts,
configuration, and runtime behavior once, then reuse that methodology
across models and releases. This avoids the common scenario where the
evaluation setup quietly changes between runs, making comparisons over
time difficult or misleading.



Methodology independent of inference setup

Model outputs can vary by inference backend and configuration, so
evaluation tools should never be tied to a single inference solution.
NeMo Evaluator avoids this by separating the evaluation pipeline from
the inference backend, allowing the same configuration to run against
hosted endpoints, local deployments, or third-party providers. This
separation enables meaningful comparisons even when you change
infrastructure or inference engines.



Built to scale beyond one-off experiments

Many evaluation pipelines work once and then break down as the scope
expands. NeMo Evaluator is designed to scale from quick,
single-benchmark validation to full model card suites and repeated
evaluations across multiple models. The launcher, artifact layout, and
configuration model support ongoing workflows, not just isolated
experiments, so teams can maintain consistent evaluation practices over
time.



Auditability with structured artifacts and logs

Transparent evaluation requires more than final scores. Each evaluation
run produces structured results and logs by default, making it easy to
inspect how scores were computed, debug unexpected behavior, and conduct
deeper analysis. Every component of the evaluation is captured and
reproducible.



A shared evaluation standard

By releasing Nemotron 3 Nano 30B A3B with its full evaluation recipe,
NVIDIA is providing a reference methodology that the community can run,
inspect, and build upon. Using the same configuration and tools brings
consistency to how benchmarks are chosen, executed, and interpreted,
enabling more reliable comparisons across models, providers, and
releases.



Open evaluation for Nemotron 3 Nano

Open evaluation means publishing not only the final results, but the
full methodology behind them, so benchmarks are run consistently and
results can be compared meaningfully over time. For Nemotron 3 Nano
30B A3B, this includes open‑source tooling, transparent configurations,
and reproducible artifacts that anyone can run end‑to‑end.



Open-source model evaluation tooling

NeMo Evaluator is an open-source library designed for robust,
reproducible, and scalable evaluation of generative models. Instead of
introducing yet another standalone benchmark runner, it acts as a
unifying orchestration layer that brings multiple evaluation harnesses
under a single, consistent interface.

Under this architecture, NeMo Evaluator integrates and coordinates
hundreds of benchmarks from many widely used evaluation harnesses,
including NeMo Skills for Nemotron instruction-following, tool use, and
agentic evaluations, as well as the LM Evaluation Harness for base model
and pre-training benchmarks, and many more (see the full benchmark
catalog). Each harness retains its native logic, datasets, and scoring
semantics, while NeMo Evaluator standardizes how they’re configured,
executed, and logged.

This provides two practical benefits: teams can run diverse benchmark
categories using a single configuration without rewriting custom
evaluation scripts, and results from different harnesses are stored and
inspected in a consistent, predictable way, even when the underlying
tasks differ. The same orchestration framework used internally by
NVIDIA’s Nemotron research and model‑evaluation teams is now available
to the community, enabling developers to run heterogeneous,
multi‑harness evaluations through a shared, auditable workflow.



Open configurations

We published the exact YAML configuration used for the Nemotron 3
Nano 30B A3B model card evaluation with NeMo Evaluator. This includes:

  • model inference and deployment settings
  • benchmark and task selection
  • benchmark-specific parameters such as sampling, repeats, and prompt
    templates
  • runtime controls including parallelism, timeouts, and retries
  • output paths and artifact layout

Using the same configuration means running the same evaluation
methodology.
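
If you do need to change any of these settings, the launcher’s -o
override syntax shown later in this tutorial lets you record each
deviation explicitly on the command line instead of editing the YAML
silently. A minimal sketch, using only the override paths that appear
elsewhere in this post:

# Sketch: run the published config unchanged except for two explicit,
# documented deviations expressed as CLI overrides. Other keys follow
# the same dotted-path convention as the YAML structure.
nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o target.api_endpoint.url=http://localhost:8000/v1/chat/completions \
  -o evaluation.nemo_evaluator_config.config.params.limit_samples=10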



Open logs and artifacts

Each evaluation run produces structured, inspectable outputs, including
per‑task results.json files, execution logs for debugging and
auditability, and artifacts organized by task for easy comparison. This
structure makes it possible to understand not only the final scores, but
also how those scores were produced, and to perform deeper analysis of
model behavior.



The reproducibility workflow

Reproducing the Nemotron 3 Nano 30B A3B model card results follows a
simple loop (a compact sketch follows the list):

  1. Start from the released model checkpoint or hosted endpoint
  2. Use the published NeMo Evaluator config
  3. Execute the evaluation with a single CLI command
  4. Inspect logs and artifacts, and compare results to the model card
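
As a compact sketch of that loop, using only commands covered in the
steps below (substitute your own API key, config path, and job id):

# 1. Point at the hosted endpoint (API key for build.nvidia.com).
export NGC_API_KEY="your-ngc-api-key"

# 2 + 3. Use the published NeMo Evaluator config and run it with one command.
nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml

# 4. Inspect logs and artifacts, then compare scores against the model card.
nemo-evaluator-launcher status <job_id>
ls results_nvidia_nemotron_3_nano_30b_a3b/artifacts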

The same workflow applies to any model you evaluate using NeMo
Evaluator. You can point the evaluation at a hosted endpoint or a local
deployment, including common inference providers such as
Hugging Face,
build.nvidia.com,
and
OpenRouter.
The key requirement is access to the model, either as weights you can
serve or as an endpoint you can call. For this tutorial, we use the
hosted endpoint on
build.nvidia.com.



Reproducing Nemotron 3 Nano benchmark results


This tutorial reproduces the evaluation results for NVIDIA Nemotron
3 Nano 30B A3B using NeMo Evaluator. The step-by-step tutorial,
including the published configs used for the model card evaluation, is
available on GitHub. Although we have focused this tutorial on
Nemotron 3 Nano 30B A3B, we also published recipes for the base model
evaluation.

This walkthrough runs the comprehensive evaluation suite from the
published model card configs for NVIDIA Nemotron 3 Nano 30B A3B, using
the following benchmarks:

Benchmark                             Accuracy   Category                Description
BFCL v4                               53.8       Function Calling        Berkeley Function Calling Leaderboard v4
LiveCodeBench (v6, 2024-08–2025-05)   68.3       Coding                  Real-world coding problems evaluation
MMLU-Pro                              78.3       Knowledge               Multi-task language understanding (10-choice)
GPQA                                  73.0       Science                 Graduate-level science questions
AIME 2025                             89.1       Mathematics             American Invitational Mathematics Exam
SciCode                               33.3       Scientific Coding       Scientific programming challenges
IFBench                               71.5       Instruction Following   Instruction following benchmark
HLE                                   10.6       Humanity’s Last Exam    Expert-level questions across domains

For Model Card details, see the NVIDIA Nemotron 3 Nano 30B A3B Model
Card. For a deep dive into the architecture, datasets, and benchmarks,
read the full Nemotron 3 Nano Technical Report.



1. Install NeMo Evaluator Launcher

pip install nemo-evaluator-launcher
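
For reproducibility, it also helps to install the launcher into a clean
virtual environment and record the installed version alongside your
results. A minimal sketch using standard Python tooling (the environment
and file names are placeholders):

# Create an isolated environment so the evaluation stack is fully pinned.
python3 -m venv nemo-eval-env
source nemo-eval-env/bin/activate

pip install nemo-evaluator-launcher

# Record the installed launcher version next to your results for provenance.
pip freeze | grep -i nemo-evaluator > launcher-version.txt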



2. Set required environment variables

# NVIDIA endpoint access
export NGC_API_KEY="your-ngc-api-key"

# Hugging Face access
export HF_TOKEN="your-huggingface-token"

# Required only for judge-based benchmarks such as HLE
export JUDGE_API_KEY="your-judge-api-key"

Optional but recommended for faster reruns:
export HF_HOME="/path/to/your/huggingface/cache"
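
Before starting a long run, a quick shell check that the required
variables are actually set can save a failed job later. A small sketch
(JUDGE_API_KEY is omitted because it is only needed for judge-based
benchmarks):

# Fail fast if a required credential is missing.
for var in NGC_API_KEY HF_TOKEN; do
  if [ -z "${!var}" ]; then
    echo "ERROR: $var is not set" >&2
    exit 1
  fi
done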



3. Model endpoint

The evaluation uses the NVIDIA API endpoint hosted on
build.nvidia.com:

target:
  api_endpoint:
    model_id: nvidia/nemotron-nano-3-30b-a3b
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

Evaluations can be run against common inference providers such as
Hugging Face,
build.nvidia.com,
or
OpenRouter,
or anywhere the model has an available endpoint.
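
Before launching the full suite, you can sanity-check the endpoint and
API key with a single OpenAI-compatible chat completion request. This is
only a connectivity check, not part of the evaluation; the prompt and
max_tokens value are arbitrary:

# Minimal connectivity check against the hosted endpoint used in the config above.
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/nemotron-nano-3-30b-a3b",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 16
      }'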

If you’re hosting the model locally or using a different endpoint:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
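
If you prefer to serve the checkpoint yourself, any OpenAI-compatible
server exposing /v1/chat/completions will work with the override above.
As one hedged example, a vLLM server on port 8000 matches the local URL
shown; the model identifier below reuses the API catalog id and may
differ from the actual Hugging Face repository name for the released
weights:

# Example only: serve the model with vLLM's OpenAI-compatible server on port
# 8000, matching the endpoint override above. Substitute the actual checkpoint
# path or Hugging Face repo id for the released weights.
vllm serve nvidia/nemotron-nano-3-30b-a3b --port 8000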



4. Run the full evaluation suite

Preview the run without executing using --dry-run:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  --dry-run

From the examples directory, run the evaluation using the YAML
configuration provided:

nemo-evaluator-launcher run \
  --config /path/to/examples/nemotron/local_nvidia_nemotron_3_nano_30b_a3b.yaml

Note that for quick testing, you can limit the number of samples by
setting limit_samples:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o evaluation.nemo_evaluator_config.config.params.limit_samples=10



5. Running an individual benchmark

You can run specific benchmarks using the -t flag (from the
examples/nemotron directory):

# Run only MMLU-Pro
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_mmlu_pro

# Run only coding benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_livecodebench

# Run multiple specific benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_gpqa -t ns_aime2025



6. Monitor execution and inspect results

# Check the status of a specific job
nemo-evaluator-launcher status <job_id>
# Stream logs for a specific job
nemo-evaluator-launcher logs <job_id>

Results are written to the defined output directory:

results_nvidia_nemotron_3_nano_30b_a3b/
├── artifacts/
│   └── <task_name>/
│       └── results.json
└── logs/
    └── stdout.log
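
To get a quick overview after a run, you can walk the artifacts
directory and inspect the per-task results.json files. The exact JSON
schema varies by harness, so the sketch below only locates and
pretty-prints the files rather than assuming specific keys:

# List every per-task results.json produced by the run (layout as shown above).
RUN_DIR=results_nvidia_nemotron_3_nano_30b_a3b
find "$RUN_DIR/artifacts" -name results.json

# Pretty-print one task's results for closer inspection; pick any file
# returned by the find command above.
python3 -m json.tool "$(find "$RUN_DIR/artifacts" -name results.json | head -n 1)"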



Interpreting results

When reproducing evaluations, you may observe small differences in final
scores across runs. This variance reflects the probabilistic nature of
LLMs rather than an issue with the evaluation pipeline. Modern
evaluation introduces several sources of non‑determinism: decoding
settings, repeated trials, judge‑based scoring, parallel execution, and
differences in serving infrastructure, all of which can lead to slight
fluctuations.

The aim of open evaluation is not to force bit-wise identical outputs,
but to deliver methodological consistency with clear provenance of
evaluation results. To ensure your evaluation aligns with the reference
standard, verify the following:

  • Configuration: use the published NeMo Evaluator YAML without
    modification, or document any changes explicitly
  • Benchmark selection: run the intended tasks, task versions, and
    prompt templates
  • Inference target: confirm you are evaluating the intended model and
    endpoint, including chat template behavior and reasoning settings when
    relevant
  • Execution settings: keep runtime parameters consistent, including
    repeats, parallelism, timeouts, and retry behavior
  • Outputs: confirm artifacts and logs are complete and follow the
    expected structure for every task

When these elements are consistent, your results represent a valid
reproduction of the methodology, even when individual runs differ
slightly. NeMo Evaluator simplifies this process, tying benchmark
definitions, prompts, runtime settings, and inference configuration into
a single auditable workflow to minimize inconsistencies.
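
One practical way to check the benchmark-selection and outputs items
above is to compare which per-task artifacts two runs produced. A small
sketch, assuming both runs used the artifact layout shown earlier and
using placeholder output directory names (results_run_a, results_run_b):

# List the tasks that produced a results.json in a given run directory.
list_tasks() { find "$1/artifacts" -name results.json | sed "s|^$1/artifacts/||" | sort; }

# Any difference means the benchmark selection or output structure diverged
# between the two runs and should be explained.
diff <(list_tasks results_run_a) <(list_tasks results_run_b)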



Conclusion: A more transparent standard for open models

The evaluation recipe released alongside Nemotron 3 Nano represents a
meaningful step toward a more transparent and reliable approach to
open-model evaluation. We’re moving away from evaluation as a
collection of bespoke, “black box” scripts, and toward a defined system
where benchmark selection, prompts, and execution semantics are encoded
into a transparent workflow.

For developers and researchers, this transparency changes what it means
to share results. A score is only as trustworthy as the methodology
behind it, and making that methodology public is what enables the
community to verify claims, compare models fairly, and continue building
on shared foundations. With open evaluation configurations, open
artifacts, and open tooling, Nemotron 3 Nano demonstrates what that
commitment to openness looks like in practice.

NeMo Evaluator supports this shift by providing a consistent
benchmarking methodology across models, releases, and inference
environments. The goal isn’t identical numbers on every run; it’s
confidence in an evaluation methodology that’s explicit, inspectable,
and repeatable. And for organizations that need automated or large‑scale
evaluation pipelines, a separate offering provides an enterprise‑ready
NeMo Evaluator microservice built on the same evaluation principles.

Use the published NeMo Evaluator
evaluation configuration
for an end-to-end walkthrough of the evaluation recipe.

Join the Community!

NeMo Evaluator is fully open source, and community input is essential to
shaping the future of open evaluation. If there’s a benchmark you’d like
us to support or an improvement you’d like to propose, open an issue or
contribute directly on GitHub. Your contributions help strengthen the
ecosystem and advance a shared, transparent standard for evaluating
generative models.


