After a decade working in analytics, I firmly believe that observability and evaluation are essential for any LLM application running in production. Monitoring and metrics aren't just nice-to-haves: they ensure your product is functioning as expected and that every new release is actually moving you in the right direction.
In this article, I would like to share my experience with the observability and evaluation features of the NeMo Agent Toolkit (NAT). If you haven't read my previous article on NAT, here's a quick refresher: NAT is Nvidia's framework for building production-ready LLM applications. Think of it as the glue that connects LLMs, tools, and workflows, while also offering deployment and observability options.
Using NAT, we built a Happiness Agent capable of answering nuanced questions about the World Happiness Report data and performing calculations based on real metrics. Our focus was on building agentic flows, integrating agents from other frameworks as tools (in our example, a LangGraph-based calculator agent), and deploying the application both as a REST API and as a user-friendly interface.
In this article, I'll dive into my favourite topics: observability and evaluations. After all, as the saying goes, you can't improve what you don't measure. So, without further ado, let's jump in.
Observability
Let's start with observability: the ability to trace what's happening inside your application, including all intermediate steps, tools used, timings, and token usage. The NeMo Agent Toolkit integrates with a wide range of observability tools such as Phoenix, W&B Weave, and Catalyst. You can always check the latest list of supported frameworks in the documentation.
For this article, we'll try Phoenix. Phoenix is an open-source platform for tracing and evaluating LLMs. Before we can start using it, we first need to install the plugin.
uv pip install arize-phoenix
uv pip install "nvidia-nat[phoenix]"
Next, we can launch the Phoenix server.
phoenix serve
Once it's running, the tracing service will be available at http://localhost:6006/v1/traces. At this point, you'll see only a default project, since we haven't sent any data yet.
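By the way, if you prefer working from a notebook or script, Phoenix can also be launched directly from Python. Here is a minimal sketch (assuming a recent arize-phoenix release; the default port is still 6006):

# Optional: launch Phoenix from Python instead of the CLI (arize-phoenix API)
import phoenix as px

session = px.launch_app()  # starts the Phoenix UI and trace collector
print(session.url)         # typically http://localhost:6006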
Now that the Phoenix server is running, let's see how we can start using it. Since NAT relies on YAML configuration, all we need to do is add a telemetry section to our config. You can find the config and the full agent implementation on GitHub. If you want to learn more about the NAT framework, check out my previous article.
general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: happiness_report
With this in place, we can run our agent.
export ANTHROPIC_API_KEY=
source .venv_nat_uv/bin/activate
cd happiness_v3
uv pip install -e .
cd ..
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "How much happier in percentages are people in Finland compared to the UK?"
Let's run a few more queries to see what kind of information Phoenix can track.
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people overall getting happier over time?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Is Switzerland in first place?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "What is the main contributor to happiness in the UK?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people in France happier than in Germany?"
After running these queries, you'll notice a new project in Phoenix (happiness_report, as we defined in the config) along with all the LLM calls we just made. This gives you a clear view of what's happening under the hood.

We can zoom in on one of the queries, like the one below.

This query takes quite some time (about 25 seconds) because it involves five tool calls, one for each year. If we expect a lot of similar questions about overall trends, it would make sense to give our agent a new tool that can calculate summary statistics all at once (a sketch of such a tool is shown below).
This is exactly where observability shines: by revealing bottlenecks and inefficiencies, it helps you reduce costs and deliver a smoother experience for users.
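To make this concrete, here is a minimal sketch of what such a summary-statistics helper could look like as plain Python. The function name, file path, and column names are my own assumptions; in practice it would be registered as a NAT tool alongside the existing country_stats function.

import pandas as pd

def happiness_trend_summary(csv_path: str = "world_happiness.csv") -> dict:
    """Summarise the happiness score across all years in a single call,
    instead of one tool call per year."""
    df = pd.read_csv(csv_path)  # assumed columns: 'year', 'happiness_score'
    yearly = df.groupby("year")["happiness_score"].mean()
    return {
        "mean_score_by_year": yearly.round(4).to_dict(),
        "overall_trend": "increasing" if yearly.iloc[-1] > yearly.iloc[0] else "decreasing",
        "lowest_year": int(yearly.idxmin()),
        "highest_year": int(yearly.idxmax()),
    }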
Evaluations
Observability is about tracing how your application works in production. This information is useful, but it is not enough to tell whether the quality of answers is good enough or whether a new version is performing better. To answer such questions, we need evaluations. Fortunately, the NeMo Agent Toolkit can help us with evals as well.
First, let's put together a small evaluation set. We need to specify just three fields: id, question, and answer.
[
  {
    "id": "1",
    "question": "In what country was the happiness score highest in 2021?",
    "answer": "Finland"
  },
  {
    "id": "2",
    "question": "What contributed most to the happiness score in 2024?",
    "answer": "Social Support"
  },
  {
    "id": "3",
    "question": "How UK's rank changed from 2019 to 2024?",
    "answer": "The UK's rank dropped from 13th in 2019 to 23rd in 2024."
  },
  {
    "id": "4",
    "question": "Are people in France happier than in Germany based on the latest report?",
    "answer": "No, Germany is at 22nd place in 2024 while France is at 33rd place."
  },
  {
    "id": "5",
    "question": "How much in percents are people in Poland happier in 2024 compared to 2019?",
    "answer": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024."
  }
]
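If you prefer to build this dataset programmatically, for example by sampling questions from real user logs, a minimal sketch could look like this (the output path matches the file_path used in the config below):

import json

# The three required fields per sample: id, question, answer
eval_samples = [
    {
        "id": "1",
        "question": "In what country was the happiness score highest in 2021?",
        "answer": "Finland",
    },
    # ... add the remaining samples here
]

with open("src/happiness_v3/data/evals.json", "w") as f:
    json.dump(eval_samples, f, indent=2)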
Next, we need to update our YAML config to define where to store evaluation results and where to find the evaluation dataset. I set up a dedicated eval_llm for evaluation purposes to keep the solution modular, and I'm using Sonnet 4.5 for it.
# Evaluation configuration
eval:
  general:
    output:
      dir: ./tmp/nat/happiness_v3/eval/evals/
      cleanup: false
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json
  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: eval_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: eval_llm
I've defined several evaluators here. We'll focus on Answer Accuracy and Response Groundedness from Ragas (an open-source framework for evaluating LLM workflows end-to-end), as well as trajectory evaluation. Let's break them down.
Answer Accuracy measures how well a model's response aligns with a reference ground truth. It uses two "LLM-as-a-Judge" prompts, each returning a rating of 0, 2, or 4. These ratings are then converted to a [0,1] scale and averaged. Higher scores indicate that the model's answer closely matches the reference.
- 0 → Response is inaccurate or off-topic,
- 2 → Response partially aligns,
- 4 → Response exactly aligns.
Response Groundedness evaluates whether a response is supported by the retrieved contexts. That is, whether each claim can be found (fully or partially) in the provided data. It works similarly to Answer Accuracy, using two distinct "LLM-as-a-Judge" prompts with ratings of 0, 1, or 2, which are then normalised to a [0,1] scale (a short sketch of this arithmetic follows the list below).
- 0 → Not grounded at all,
- 1 → Partially grounded,
- 2 → Fully grounded.
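To make the normalisation concrete, here is a small illustration of how the raw judge ratings map to the final [0,1] scores for both metrics. This is my own sketch of the arithmetic described above, not Ragas code.

def normalise(ratings: list[int], max_rating: int) -> float:
    """Rescale each judge rating to [0, 1] and average them."""
    return sum(r / max_rating for r in ratings) / len(ratings)

# Answer Accuracy: two judges return 0, 2, or 4
print(normalise([4, 2], max_rating=4))  # 0.75

# Response Groundedness: two judges return 0, 1, or 2
print(normalise([2, 1], max_rating=2))  # 0.75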
Trajectory Evaluation tracks the intermediate steps and tool calls executed by the LLM, helping to monitor the reasoning process. A judge LLM evaluates the trajectory produced by the workflow, taking into account the tools used during execution. It returns a floating-point score between 0 and 1, where 1 represents an ideal trajectory.
Let's run the evaluations to see how this works in practice.
nat eval --config_file src/happiness_v3/configs/config.yml
As a result of running the evaluations, we get several files in the output directory we specified earlier. One of the most useful is workflow_output.json. This file contains execution results for every sample in our evaluation set, including the original question, the answer generated by the LLM, the expected answer, and a detailed breakdown of all intermediate steps. It helps you trace how the system behaved in each case.
Here's a shortened example for the first sample.
{
  "id": 1,
  "question": "In what country was the happiness score highest in 2021?",
  "answer": "Finland",
  "generated_answer": "Finland had the highest happiness score in 2021 with a score of 7.821.",
  "intermediate_steps": [...],
  "expected_intermediate_steps": []
}
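To inspect this file quickly, for example to count how many intermediate steps each question required, a small script like the one below can help. It assumes the file is a JSON array of per-sample records like the one above, stored in the output directory from the eval config.

import json
from pathlib import Path

# Path assumes the output dir defined in the eval config above
output_file = Path("./tmp/nat/happiness_v3/eval/evals/workflow_output.json")
samples = json.loads(output_file.read_text())

for sample in samples:
    steps = len(sample["intermediate_steps"])
    print(f'{sample["id"]}: {steps} steps -> {sample["generated_answer"][:60]}')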
For Answer Accuracy and Response Groundedness, we achieved the highest possible scores (1.0 out of 1.0 on average), which is always nice to see. Here is the returned file.
{
  "average_score": 1.0,
  "eval_output_items": [
    {
      "id": 1,
      "score": 1.0,
      "reasoning": {
        "user_input": "In what country was the happiness score highest in 2021?",
        "reference": "Finland",
        "response": "Finland had the highest happiness score in 2021 with a score of 7.821.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 2,
      "score": 1.0,
      "reasoning": {
        "user_input": "What contributed most to the happiness score in 2024?",
        "reference": "Social Support",
        "response": "In 2024, **Social Support** contributed most to the happiness score, with an average impact of 1.333 points. This was followed very closely by **GDP per capita** with an impact of 1.319 points. These two factors were significantly more influential than other contributors like life expectancy (0.551), freedom (0.749), generosity (0.113), and perceptions of corruption (0.147). This means that having strong social connections and relationships, along with economic prosperity, were the primary drivers of happiness across countries in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 3,
      "score": 1.0,
      "reasoning": {
        "user_input": "How UK's rank changed from 2019 to 2024?",
        "reference": "The UK's rank dropped from 13th in 2019 to 23rd in 2024.",
        "response": "The UK's rank in the World Happiness Report changed from 13th place in 2019 to 23rd place in 2024, representing a decline of 10 positions. The happiness score also decreased from 7.1645 in 2019 to 6.7280 in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 4,
      "score": 1.0,
      "reasoning": {
        "user_input": "Are people in France happier than in Germany based on the latest report?",
        "reference": "No, Germany is at 22nd place in 2024 while France is at 33rd place.",
        "response": "Based on the latest 2024 World Happiness Report, people in Germany are happier than people in France. Germany has a happiness score of 6.753 (ranked 22nd globally), while France has a happiness score of 6.593 (ranked 33rd globally). The difference is 0.16 points in Germany's favor.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 5,
      "score": 1.0,
      "reasoning": {
        "user_input": "How much in percents are people in Poland happier in 2024 compared to 2019?",
        "reference": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024.",
        "response": "People in Poland are roughly 7.87% happier in 2024 compared to 2019. The happiness score increased from 6.1863 in 2019 to 6.6730 in 2024, representing an increase of 0.4867 points, or about 7.87%.",
        "retrieved_contexts": [...]
      }
    }
  ]
}
For trajectory evaluation, we achieved an average score of 0.95. To understand where the model fell short, let's look at one non-ideal example. For the fifth question, the judge correctly identified that the agent followed a suboptimal path: it took 8 steps to reach the final answer, although the same result could have been achieved in 4–5 steps. As a result, this trajectory received a score of 0.75 out of 1.0.
Let me evaluate this AI language model's performance step-by-step:
## Evaluation Criteria:
**i. Is the final answer helpful?**
Yes, the final answer is clear, accurate, and directly addresses the question.
It provides both the percentage increase (7.87%) and explains the underlying
data (happiness scores from 6.1863 to 6.6730). The answer is well-formatted
and easy to understand.
**ii. Does the AI language model use a logical sequence of tools to answer the question?**
Yes, the sequence is logical:
1. Query country statistics for Poland
2. Retrieve the information showing happiness scores for multiple years including
2019 and 2024
3. Use a calculator to compute the percentage increase
4. Formulate the final answer
This is a sensible approach to the problem.
**iii. Does the AI language model use the tools in a helpful way?**
Yes, the tools are used appropriately:
- The `country_stats` tool successfully retrieved the relevant happiness data
- The `calculator_agent` correctly computed the percentage increase using
the correct formula
- The Python evaluation tool performed the actual calculation accurately
**iv. Does the AI language model use too many steps to answer the question?**
This is where there is some inefficiency. The model uses 8 steps in total, which
includes some redundancy:
- Steps 4-7 appear to involve multiple calls to calculate the same percentage
(the calculator_agent is invoked, which then calls Claude Opus, which calls
evaluate_python, and returns through the chain)
- Step 7 seems to repeat what was already done in steps 4-6
While the answer is correct, there is unnecessary duplication. The calculation
could have been done more efficiently in 4-5 steps instead of 8.
**v. Are the right tools used to answer the question?**
Yes, the tools chosen are appropriate:
- `country_stats` was the right tool to get happiness data for Poland
- `calculator_agent` was appropriate for computing the percentage change
- The underlying `evaluate_python` tool correctly performed the mathematical
calculation
## Summary:
The model successfully answered the question with accurate data and correct
calculations. The logical flow was sound, and appropriate tools were chosen.
However, there was some inefficiency in the execution, with redundant steps
in the calculation phase.
Looking at the reasoning, this is a surprisingly comprehensive evaluation of the entire LLM workflow. What's especially valuable is that it works out of the box and doesn't require any ground-truth data. I'd definitely recommend using this evaluation in your applications.
Comparing different versions
Evaluations become especially powerful when you need to compare different versions of your application. Imagine a team focused on cost optimisation and considering a switch from the more expensive sonnet model to haiku. With NAT, changing the model takes less than a minute, but doing so without validating quality would be risky. This is exactly where evaluations shine.
For this comparison, we’ll also introduce one other observability tool: W&B Weave. It provides particularly handy visualisations and side-by-side comparisons across different versions of your workflow.
To get started, you'll need to sign up on the W&B website and obtain an API key. W&B is free to use for personal projects.
export WANDB_API_KEY=
Next, install the required packages and plugins.
uv pip install wandb weave
uv pip install "nvidia-nat[weave]"
We also need to update our YAML config. This includes adding Weave to the telemetry section and introducing a workflow alias so we can clearly distinguish between different versions of the application.
general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: happiness_report
      weave: # specified Weave
        _type: weave
        project: "nat-simple"

eval:
  general:
    workflow_alias: "nat-simple-sonnet-4-5" # added alias
    output:
      dir: ./.tmp/nat/happiness_v3/eval/evals/
      cleanup: false
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json
  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: chat_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: chat_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: chat_llm
For the haiku version, I created a separate config where both chat_llm and calculator_llm use haiku instead of sonnet.
Now we can run the evaluations for both versions.
nat eval --config_file src/happiness_v3/configs/config.yml
nat eval --config_file src/happiness_v3/configs/config_simple.yml
Once the evaluations are complete, we can head over to the W&B interface and explore a comprehensive comparison report. I really like the radar chart visualisation, as it makes trade-offs immediately obvious.


With sonnet, we observe higher token usage (and a higher cost per token) as well as slower response times (24.8 seconds compared to 16.9 seconds for haiku). However, despite the clear gains in speed and cost, I wouldn't recommend switching models. The drop in quality is too large: trajectory accuracy falls from 0.85 to 0.55, and answer accuracy drops from 0.95 to 0.45. In this case, evaluations helped us avoid breaking the user experience in the pursuit of cost optimisation.
Summary
In this article, we explored the NeMo Agent Toolkit's observability and evaluation capabilities.
- We worked with two observability tools (Phoenix and W&B Weave), both of which integrate seamlessly with NAT and allow us to log what's happening inside our system in production, as well as capture evaluation results.
- We also walked through how to configure evaluations in NAT and used W&B Weave to compare the performance of two different versions of the same application. This made it easy to reason about trade-offs between cost, latency, and answer quality.
The NeMo Agent Toolkit delivers solid, production-ready solutions for observability and evaluations, both foundational pieces of any serious LLM application. That said, the standout for me was W&B Weave, whose evaluation visualisations make comparing models and trade-offs remarkably straightforward.
Reference
This article is inspired by the "Nvidia's NeMo Agent Toolkit: Making Agents Reliable" short course from
