In the first post of this series (Agentic AI 101: Starting Your Journey Building AI Agents), we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools.
Of course, that first post only scratched the surface of this new area of the data industry. There is much more that can be done, and we are going to learn more along the way in this series.
So, it’s time to take one step further.
In this post, we will cover three topics:
- Guardrails: safety blocks that prevent a Large Language Model (LLM) from responding to certain topics.
- Agent Evaluation: Have you ever wondered how accurate an LLM’s responses are? I bet you have. So we will look at the main ways to measure that.
- Monitoring: We will also learn about the built-in monitoring app in Agno’s framework.
Let’s begin.
Guardrails
Our first topic is the simplest one, in my opinion. Guardrails are rules that keep an AI agent from responding to a given topic or list of topics.
I imagine there is a good chance you have asked ChatGPT or Gemini something and received a response like “I can’t talk about this topic” or “Please consult a specialist”, something like that. Usually, that happens with sensitive topics like health advice, psychological conditions, or financial advice.
Those blocks are safeguards to prevent people from hurting themselves, harming their health, or hurting their wallets. As we know, LLMs are trained on massive amounts of text, and therefore inherit plenty of bad content along with it, which could easily lead to bad advice in those areas. And I didn’t even mention hallucinations!
Think about how many stories there are of people who lost money by following investment tips from online forums. Or how many people took the wrong medicine because of something they read on the internet.
Well, I suppose you got the point. We must prevent our agents from talking about certain topics or taking certain actions. For that, we will use guardrails.
The best framework I found to impose those blocks is Guardrails AI [1]. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.
To get started quickly, first go to this link [2] and get an API key. Then install the package and run the guardrails configure command. It will ask you a few questions that you can answer with n (for No), and it will ask you to enter the API key you generated.
pip install guardrails-ai
guardrails configure
Once that is done, go to the Guardrails AI Hub [3] and select the guardrail you need. Every guardrail has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.
For this example, we are choosing one called Restrict to Topic [4], which, as its name says, lets the user talk only about the topics in a list. So, return to the terminal and install it using the command below.
guardrails hub install hub://tryolabs/restricttotopic
Next, let’s open our Python script and import some modules.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Import Guard and Validator
from guardrails import Guard
from guardrails.hub import RestrictToTopic
Next, we create the guard. We will restrict our agent to talk only about sports or the weather, and we are blocking it from talking about stocks.
# Setup Guard
guard = Guard().use(
    RestrictToTopic(
        valid_topics=["sports", "weather"],
        invalid_topics=["stocks"],
        disable_classifier=True,
        disable_llm=False,
        on_fail="filter"
    )
)
Now we will run the agent and the guard.
# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="An assistant agent",
    instructions=["Be succinct. Reply in maximum two sentences"],
    markdown=True
)
# Run the agent
response = agent.run("What is the ticker symbol for Apple?").content
# Validate the response with the guard
validation_step = guard.validate(response)

# Print the response only if it passed validation
if validation_step.validation_passed:
    print(response)
else:
    print("Validation Failed", validation_step.validation_summaries[0].failure_reason)
This is the response when we ask about a stock symbol.
Validation Failed Invalid topics found: ['stocks']
If I ask about a topic that is not on the valid_topics list, I will also see a block.
"What is the primary soda drink?"
Validation Failed No valid topic was found.
Finally, let’s ask about sports.
"Who's Michael Jordan?"
Michael Jordan is a former professional basketball player widely considered one of
the greatest of all time. He won six NBA championships with the Chicago Bulls.
And we got a response this time, since it is a valid topic.
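In practice, you probably don’t want to repeat the run-and-validate pair for every question. Below is a minimal sketch of how you could wrap both steps into one helper, reusing the agent and guard objects defined above; the safe_ask name and the fallback message are my own choices, not part of Guardrails AI.

# Helper that only returns answers approved by the guard
# (assumes the `agent` and `guard` objects created earlier in this section)
def safe_ask(question: str) -> str:
    answer = agent.run(question).content
    result = guard.validate(answer)
    if result.validation_passed:
        return answer
    # Fallback message for blocked or off-topic answers
    return "Sorry, I can only help with sports and weather questions."

print(safe_ask("Who is Michael Jordan?"))                 # passes the guard
print(safe_ask("What is the ticker symbol for Apple?"))   # blocked: stocks is an invalid topic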
Let’s move on to the evaluation of agents now.
Agent Evaluation
Since I began studying LLMs and Agentic AI, one of my main questions has been about model evaluation. Unlike traditional data science modeling, where you have structured metrics that are adequate for each case, for AI agents this is blurrier.
Fortunately, the developer community is pretty quick to find solutions for nearly everything, and they created this nice package for LLM evaluation: deepeval.
DeepEval [5] is a library created by Confident AI that gathers many methods to evaluate LLMs and AI agents. In this section, let’s learn a couple of the main methods, just so we can build some intuition on the topic, and also because the library is quite extensive.
The first evaluation is the most basic we can use, and it is called G-Eval. As AI tools like ChatGPT become more common in everyday tasks, we have to make sure they are giving helpful and accurate responses. That’s where G-Eval from the DeepEval Python package comes in.
G-Eval is like a smart reviewer that uses another AI model to evaluate how well a chatbot or AI assistant is performing. For example, my agent runs Gemini, and I am using OpenAI to judge it. This method takes a more advanced approach than a human one by asking an AI to “grade” another AI’s answers against criteria you define, such as coherence.
It is a nice way to test and improve generative AI systems in a more scalable way. Let’s quickly code an example. We will import the modules, create a prompt and a simple chat agent, and ask it for a description of the weather in NYC for the month of May.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Evaluation Modules
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
# Prompt
prompt = "Describe the weather in NYC for May"
# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="An assistant agent",
    instructions=["Be succinct"],
    markdown=True,
    monitoring=True
)
# Run agent
response = agent.run(prompt)
# Print response
print(response.content)
It responds with a brief description of NYC’s typical weather in May.
Nice. Seems pretty good to me.
But how can we put a number on it and show a potential manager or client how our agent is doing?
Here is how:
- Create a test case, passing the prompt and the response to the LLMTestCase class.
- Create a metric. We will use the GEval method and add a prompt for the model to check the response for coherence, and then give it my definition of what coherence means.
- Give the output as evaluation_params.
- Run the measure method and get the score and reason from it.
# Test Case
test_case = LLMTestCase(input=prompt, actual_output=response.content)

# Setup the Metric
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

# Run the metric
coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
The output looks like this.
0.9
The response directly addresses the prompt about NYC weather in May,
maintains logical consistency, flows naturally, and uses clear language.
However, it could be slightly more detailed.
0.9 seems pretty good, given that the default threshold is 0.5.
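If 0.5 feels too lenient for your use case, you can raise the bar when you create the metric. Here is a small sketch that reuses the test_case from above; the 0.8 threshold and the "Coherence (strict)" name are just illustrative choices of mine.

# Stricter version of the same metric: it only passes when the score is at least 0.8
strict_coherence = GEval(
    name="Coherence (strict)",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8
)

strict_coherence.measure(test_case)
print(strict_coherence.score, strict_coherence.is_successful())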
If you want to check the logs, use this next snippet.
# Check the logs
print(coherence_metric.verbose_logs)
Here’s the response.
Criteria:
Coherence. The agent can answer the prompt and the response makes sense.
Evaluation Steps:
[
"Assess whether the response directly addresses the prompt; if it aligns,
it scores higher on coherence.",
"Evaluate the logical flow of the response; responses that present ideas
in a clear, organized manner rank better in coherence.",
"Consider the relevance of examples or evidence provided; responses that
include pertinent information enhance their coherence.",
"Check for clarity and consistency in terminology; responses that maintain
clear language without contradictions achieve a higher coherence score."
]
Very nice. Now let us study another interesting use case: the evaluation of task completion for AI agents. Elaborating a bit more, how well does our agent do when it is asked to perform a task, and how much of it can it deliver?
First, we create a simple agent that can access Wikipedia and summarize the subject of the query.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.wikipedia import WikipediaTools
import os
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval import evaluate
# Prompt
prompt = "Search wikipedia for 'Time series evaluation' and summarize the three most important points"
# Create agent
agent = Agent(
    model=Gemini(id="gemini-2.0-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="You are a researcher specialized in searching Wikipedia.",
    tools=[WikipediaTools()],
    show_tool_calls=True,
    markdown=True,
    read_tool_call_history=True
)
# Run agent
response = agent.run(prompt)
# Print response
print(response.content)
The result looks very good. Let’s evaluate it using the TaskCompletionMetric class.
# Create a Metric
metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Test Case
test_case = LLMTestCase(
    input=prompt,
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
)

# Evaluate
evaluate(test_cases=[test_case], metrics=[metric])
Output, including the agent’s response.
======================================================================
Metrics Summary
- ✅ Task Completion (score: 1.0, threshold: 0.7, strict: False,
evaluation model: gpt-4o-mini,
reason: The system successfully searched for 'Time series analysis'
on Wikipedia and provided a clear summary of the three main points,
fully aligning with the user's goal., error: None)
For test case:
- input: Search wikipedia for 'Time series analysis' and summarize the three main points
- actual output: Here are the three main points about Time series analysis based on the
Wikipedia search:
1. **Definition:** A time series is a sequence of data points indexed in time order,
often taken at successive, equally spaced points in time.
2. **Applications:** Time series analysis is used in various fields like statistics,
signal processing, econometrics, weather forecasting, and more, wherever temporal
measurements are involved.
3. **Purpose:** Time series analysis involves methods for extracting meaningful
statistics and characteristics from time series data, and time series forecasting
uses models to predict future values based on past observations.
- expected output: None
- context: None
- retrieval context: None
======================================================================
Overall Metric Pass Rates
Task Completion: 100.00% pass rate
======================================================================
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
Our agent passed the test with honors: 100%!
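Since evaluate accepts a list of metrics, you can also grade the same run on more than one criterion at once. The sketch below reuses the test_case and metric objects from above and adds a hypothetical G-Eval relevance check of my own, just to show the pattern.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# A second, illustrative criterion: does the summary stay on topic?
relevance_metric = GEval(
    name="Relevance",
    criteria="The output only contains information relevant to time series analysis.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

# Score the same test case on task completion and relevance in a single run
evaluate(test_cases=[test_case], metrics=[metric, relevance_metric])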
You can learn much more about the DeepEval library at this link [8].
Finally, in the next section, we will learn about the capabilities of Agno’s framework for monitoring agents.
Agent Monitoring
As I told you in my previous post [9], I chose Agno to learn more about Agentic AI. Just to be clear, this is not a sponsored post. It is just that I think this is the best option for those starting their journey into this topic.
So, one of the cool things we can take advantage of in Agno’s framework is the app they make available for model monitoring.
Take this agent that can search the web and write Instagram posts, for example.
# Imports
import os
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from agno.tools.googlesearch import GoogleSearchTools
# Topic
topic = "Healthy Eating"
# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description=f"""You are a social media marketer specialized in creating engaging content.
    Search the web for 'trending topics about {topic}' and use them to create a post.""",
    tools=[FileTools(save_files=True),
           GoogleSearchTools()],
    expected_output="""A short post for Instagram and a prompt for an image related to the content of the post.
    Do not use emojis or special characters in the post. If you find an error in the character encoding, remove the character before saving the file.
    Use the template:
    - Post
    - Prompt for the image
    Save the post to a file named 'post.txt'.""",
    show_tool_calls=True,
    monitoring=True)

# Writing and saving a file
agent.print_response(f"""Write a short post for Instagram with tips and tricks that position me as
an expert in {topic}.""",
                     markdown=True)
To monitor its performance, follow these steps:
- Go to https://app.agno.com/settings and get an API key.
- Open a terminal and type ag setup.
- If it is the first time, it might ask for the API key. Copy and paste it into the terminal prompt.
- You will see the Dashboard tab open in your browser.
- If you want to monitor your agent, add the argument monitoring=True.
- Run your agent.
- Go to the Dashboard in your web browser.
- Click on Sessions. Since this is a single agent, you will see it under the Agents tab at the top of the page.
The cool features we can see there are:
- Info concerning the model
- The response
- Tools used
- Tokens consumed

Pretty neat, huh?
This is useful, for example, to know where the agent is spending more or fewer tokens, and where it is taking more time to perform a task.
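If you also want to check those numbers programmatically instead of in the dashboard, the run response carries usage data as well. The snippet below is a hedged sketch: I am assuming the RunResponse object exposes a metrics attribute, as in recent Agno versions, and the exact keys may differ between releases.

# Inspect usage locally (assumes RunResponse exposes a `metrics` attribute;
# keys such as token counts and timings may vary by Agno version)
run = agent.run(f"Write a short post for Instagram about {topic}.")
print(run.metrics)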
Well, let’s wrap up then.
Before You Go
We have learned a lot in this second round. In this post, we covered:
- Guardrails for AI are essential safety measures and ethical guidelines implemented to prevent unintended harmful outputs and ensure responsible AI behavior.
- Model evaluation, exemplified by GEval for broad assessment and TaskCompletion with DeepEval for agent output quality, is crucial for understanding AI capabilities and limitations.
- Model monitoring with Agno’s app, including tracking token usage and response time, is important for managing costs, ensuring performance, and identifying potential issues in deployed AI systems.
Contact & Follow Me
If you liked this content, you can find more of my work on my website.
GitHub Repository
https://github.com/gurezende/agno-ai-labs
References
[1] Guardrails AI. https://www.guardrailsai.com/docs/getting_started/guardrails_server
[2] Guardrails AI Auth Key. https://hub.guardrailsai.com/keys
[3] Guardrails AI Hub. https://hub.guardrailsai.com/
[4] Guardrails Restrict to Topic. https://hub.guardrailsai.com/validator/tryolabs/restricttotopic
[5] DeepEval. https://www.deepeval.com/docs/getting-started
[6] DataCamp – DeepEval Tutorial. https://www.datacamp.com/tutorial/deepeval
[7] DeepEval TaskCompletion. https://www.deepeval.com/docs/metrics-task-completion
[8] LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide. https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
[9] Agentic AI 101: Starting Your Journey Building AI Agents. https://towardsdatascience.com/agentic-ai-101-starting-your-journey-building-ai-agents/
