Why testing agents is so hard
Knowing whether your AI agent is performing as expected isn't easy. Even small tweaks to components like your prompt versions, agent orchestration, and models can have large and unexpected impacts.
The top challenges include:
Non-deterministic outputs
The underlying issue is that agents are non-deterministic. The same input goes in, and two different outputs can come out.
How do you test for an expected end result when you don't know what the expected end result will be? Simply put, testing for strictly defined outputs doesn't work.
Unstructured outputs
The second, and less discussed, challenge of testing agentic systems is that outputs are often unstructured. The foundation of agentic systems is, after all, a language model.
It is far easier to define a test for structured data. For instance, the id field should never be NULL or should always be an integer. How do you define the quality of a large field of text?
Cost and scale
LLM-as-judge is the most common methodology for evaluating the quality or reliability of AI agents. However, it's an expensive workload, and each user interaction (trace) can consist of hundreds of interactions (spans).
So we rethought our agent testing strategy. In this post we'll share our learnings, including a new key concept that has proven pivotal to ensuring reliability at scale.
Testing our agent
We have two agents in production that are leveraged by more than 30,000 users. The Troubleshooting Agent combs through hundreds of signals to determine the root cause of a data reliability incident, while the Monitoring Agent makes smart data quality monitoring recommendations.
For the Troubleshooting Agent we test three main dimensions: semantic distance, groundedness, and tool usage. Here is how we test for each.
Semantic distance
We leverage deterministic tests when appropriate, as they're clear, explainable, and cost-effective. For instance, it is relatively easy to deploy a test to ensure one of the subagent's outputs is in JSON format, that outputs don't exceed a certain length, or that the guardrails are being called as intended.
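For example, here is a minimal sketch of such deterministic checks (the function name, length budget, and `guardrail_called` flag are illustrative, not our actual harness):

```python
import json

MAX_OUTPUT_CHARS = 8_000  # hypothetical length budget


def check_subagent_output(raw_output: str, guardrail_called: bool) -> None:
    """Deterministic checks: valid JSON, bounded length, guardrail invoked."""
    parsed = json.loads(raw_output)             # must parse as JSON
    assert isinstance(parsed, dict)             # and be a JSON object
    assert len(raw_output) <= MAX_OUTPUT_CHARS  # must stay within the length budget
    assert guardrail_called                     # the guardrail hook must have fired
```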
However, there are times when deterministic tests won't get the job done. For instance, we explored embedding both expected and new outputs as vectors and using cosine similarity tests. We thought this would be a cheaper and faster way to evaluate semantic distance (is the meaning similar?) between observed and expected outputs.
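A rough sketch of that embedding approach (the 0.85 threshold and helper names are illustrative, and any embedding client could produce the vectors):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_semantically_close(expected_vec: np.ndarray,
                          observed_vec: np.ndarray,
                          threshold: float = 0.85) -> bool:
    # Treat outputs whose embeddings are close enough as "same meaning".
    return cosine_similarity(expected_vec, observed_vec) >= threshold
```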
However, we found there were too many cases in which the wording was similar but the meaning was different.
Instead, we now provide our LLM judge with the expected output from the current configuration and ask it to score the new output on a 0-1 scale.
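In practice that looks roughly like the sketch below; the judge prompt wording and the `call_judge` helper are illustrative, not our exact implementation:

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are grading an AI agent's output.

Expected output (current configuration):
{expected}

New output (proposed configuration):
{observed}

Score how close the new output is in meaning to the expected output on a
0-1 scale and briefly explain the score.
Respond as JSON: {{"score": <float between 0 and 1>, "explanation": "<string>"}}"""


def judge_semantic_distance(expected: str, observed: str,
                            call_judge: Callable[[str], str]) -> dict:
    """call_judge is any function that sends a prompt to the judge model and
    returns its raw text response; the parsed dict holds score + explanation."""
    prompt = JUDGE_PROMPT.format(expected=expected, observed=observed)
    return json.loads(call_judge(prompt))
```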
Groundedness
For groundedness, we check that the key context is present when it should be, but also that the agent will decline to answer when the key context is missing or the question is out of scope.
This is important because LLMs are eager to please and will hallucinate when they aren't grounded with good context.
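A minimal sketch of how such groundedness scenarios can be framed (the dataclass fields and helper are illustrative, not our exact schema); an LLM judge then scores whether the agent's answer matches the expected behavior and is supported by the retrieved context:

```python
from dataclasses import dataclass


@dataclass
class GroundednessCase:
    question: str
    retrieved_context: list[str]   # context actually handed to the agent
    required_context: list[str]    # context a good answer must be grounded in
    agent_answer: str


def expected_behavior(case: GroundednessCase) -> str:
    """If key context is missing, the right move is to decline, not to guess."""
    missing = [c for c in case.required_context
               if c not in case.retrieved_context]
    return "decline" if missing else "answer"
```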
Tool usage
For tool usage we have an LLM-as-judge evaluate whether the agent performed as expected for the pre-defined scenario (see the sketch after this list), meaning:
- No tool was expected and no tool was called
- A tool was expected and a permitted tool was used
- No required tools were omitted
- No non-permitted tools were used
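Here are those criteria expressed as plain set logic, as a sketch (parameter names are illustrative; in practice the judge applies these rules to the scenario's trace):

```python
def tool_usage_ok(expected_tools: set[str],
                  permitted_tools: set[str],
                  called_tools: set[str]) -> bool:
    """The four tool-usage criteria above, expressed as set logic."""
    if not expected_tools:
        # No tool was expected, so no tool should have been called.
        return not called_tools
    # Every required tool must have been called ...
    required_present = expected_tools <= called_tools
    # ... and nothing outside the permitted set may be used.
    only_permitted = called_tools <= permitted_tools
    return required_present and only_permitted
```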
The real magic isn't deploying these tests, but how these tests are applied. Here is our current setup, informed by some painful trial and error.
Agent testing best practices
It's important to remember that not only are your agents non-deterministic, your LLM evaluations are too! These best practices are mainly designed to combat those inherent shortcomings.
Soft failures
Hard thresholds can be noisy with non-deterministic tests, for obvious reasons. So we invented the concept of a "soft failure."
The evaluation comes back with a score between 0 and 1. Anything lower than .5 is a hard failure, while anything above .8 is a pass. Soft failures occur for scores between .5 and .8.
Changes can be merged with a soft failure. However, if a certain threshold of soft failures is exceeded, it constitutes a hard failure and the process is halted.
For our agent, it's currently configured so that if 33% of tests result in a soft failure, or if there are more than 2 soft failures total, it is considered a hard failure. This prevents the change from being merged.
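A minimal sketch of that policy, using the thresholds described above (the helper names are ours):

```python
from enum import Enum


class Result(Enum):
    PASS = "pass"
    SOFT_FAIL = "soft_fail"
    HARD_FAIL = "hard_fail"


def classify(score: float) -> Result:
    """Map a 0-1 judge score onto pass / soft failure / hard failure."""
    if score < 0.5:
        return Result.HARD_FAIL
    if score > 0.8:
        return Result.PASS
    return Result.SOFT_FAIL


def suite_passes(scores: list[float]) -> bool:
    """Block the merge on any hard failure, or when soft failures reach
    33% of tests or exceed 2 in total."""
    if not scores:
        return True
    results = [classify(s) for s in scores]
    if Result.HARD_FAIL in results:
        return False
    soft = results.count(Result.SOFT_FAIL)
    return soft <= 2 and soft / len(scores) < 0.33
```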
Re-evaluate soft failures
Soft failures can be a canary in a coal mine, or in some cases they can be nonsense. About 10% of soft failures are the result of hallucinations. In the case of a soft failure, the evaluations automatically re-run. If the resulting tests pass, we assume the original result was incorrect.
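Roughly, and reusing the `Result` type from the previous sketch (`run_eval` stands in for any callable that executes the evaluation once):

```python
from typing import Callable


def evaluate_with_rerun(run_eval: Callable[[], Result]) -> Result:
    """On a soft failure, re-run the evaluation once; if the rerun passes,
    treat the original soft failure as judge noise."""
    first = run_eval()
    if first is not Result.SOFT_FAIL:
        return first
    second = run_eval()
    return second if second is Result.PASS else first
```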
Explanations
When a test fails, it's essential to understand why it failed. We now ask every LLM judge to not only provide a score, but to explain it. It's imperfect, but it helps build trust in the evaluation and often speeds up debugging.
Removing flaky tests
You have to test your tests. Especially with LLM-as-judge evaluations, the way the prompt is built can have a big impact on the results. We run tests multiple times, and if the delta across the results is too large we'll revise the prompt or remove the flaky test.
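A simple way to flag flakiness, as a sketch: compare judge scores from repeated runs of the same test against the same build (the 0.2 spread threshold is illustrative):

```python
def is_flaky(run_scores: list[float], max_spread: float = 0.2) -> bool:
    """Flag a test whose judge scores swing too widely across identical runs;
    flaky tests get their prompt revised or are removed."""
    return max(run_scores) - min(run_scores) > max_spread
```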
Monitoring in production
Agent testing is new and difficult, but it's a walk in the park compared to monitoring agent behavior and outputs in production. Inputs are messier, there is no expected output to baseline against, and everything is at a much larger scale.
Not to mention the stakes are much higher! System reliability problems quickly become business problems.
This is our current focus. We're leveraging agent observability tools to tackle these challenges and will report new learnings in a future post.
The Troubleshooting Agent has been one of the most impactful features we've ever shipped. Developing reliable agents has been a career-defining journey, and we're excited to share it with you.
