Introduction & Context
A well-funded AI team demoed their multi-agent financial assistant to the executive committee. The system was impressive — routing queries intelligently, pulling relevant documents, generating articulate responses. Heads nodded. Budgets were approved. Then someone asked: “How will we know it’s ready for production?” The room went quiet.
This scene plays out often across the industry. We’ve become remarkably good at building sophisticated agent systems, but we haven’t developed the same rigor around proving they work. When I ask teams how they validate their agents before deployment, I typically hear some combination of “we tested it manually,” “the demo went well,” and “we’ll monitor it in production.” None of those are wrong, but none of them constitute a quality gate that governance can sign off on or that engineering can automate.
The Problem: Evaluating Non-deterministic Multi-Agent Systems
The challenge isn’t that teams don’t care about quality — they do. The challenge is that evaluating LLM-based systems is genuinely hard, and multi-agent architectures make it harder.
Traditional software testing assumes determinism. Given input X, we expect output Y, and we write an assertion to validate. But ask an LLM the same query twice and you may get different phrasings, different structures, sometimes different emphasis. Both responses may be correct. Or one may be subtly wrong in ways that aren’t obvious without domain expertise. The assertion-based mental model breaks down.
Now multiply this complexity across a multi-agent system. A router agent decides which specialist handles the query. That specialist might retrieve documents from a knowledge base. The retrieved context shapes the generated response. A failure anywhere in this chain degrades the output, but diagnosing where things went wrong requires evaluating each component.
I’ve observed that teams need answers to three distinct questions before they can confidently deploy:
- Is the router doing its job? When a user asks a simple query, does it go to the fast, low-cost agent? When they ask something complex, does it route to the agent with deeper capabilities? Getting this wrong has real consequences — either you’re wasting time and money on over-engineered responses, or you’re giving users shallow answers to questions that deserve depth.
- Are the responses actually good? This sounds obvious, but “good” has multiple dimensions. Is the information accurate? If the agent is doing analysis, is the reasoning sound? If it’s generating a report, is it complete? Different query types need different quality criteria.
- For agents using retrieval, is the RAG pipeline working? Did we pull the right documents? Did the agent actually use them, or did it hallucinate information that sounds plausible but isn’t grounded in the retrieved context?
Offline vs Online: A Brief Distinction
Before diving into the framework, I want to clarify what I mean by “offline evaluation,” since the terminology can be confusing.
Offline evaluation happens before deployment, against a curated dataset where you already know the expected outcomes. You’re testing in a controlled environment with no user impact. This is your quality gate — the checkpoint that determines whether a model version is ready for production.
Online evaluation happens after deployment, against live traffic. You’re monitoring real user interactions, sampling responses for quality checks, detecting drift. This is your safety net — the ongoing assurance that production behavior matches expectations.
Both matter, but they serve different purposes. This article focuses on offline evaluation because that’s where I see the biggest gap in current practice. Teams often jump straight to “we’ll monitor it in production” without establishing what “good” looks like beforehand. That’s backwards. You need offline evaluation to define your quality baseline before online evaluation can tell you whether you’re maintaining it.
Article Roadmap
Here, I present a framework I’ve developed and refined across multiple agent deployments. I’ll walk through a reference architecture that illustrates common evaluation challenges, then introduce what I call the Three Pillars of offline evaluation — routing, LLM-as-judge, and RAG evaluation. For each pillar, I’ll explain not only what to measure but why it matters and how to interpret the results. Finally, I’ll cover how to operationalize with automation (CI/CD) and connect it to governance requirements.
The System under Evaluation
Reference Architecture
To make this concrete, I’ll take an example that’s becoming more common in the current environment. A financial services company is modernizing the tools and services supporting its advisors, who serve end customers. One of the applications is a financial research assistant with capabilities to look up financial instruments, perform various analyses, and conduct detailed research.
This is architected as a multi-agent system, with different agents using different models based on task need and complexity. The router agent sits at the front, classifying incoming queries by complexity and directing them appropriately. Done well, this optimizes both cost and user experience. Done poorly, it creates frustrating mismatches — users waiting for simple answers, or getting superficial responses to complex questions.
Evaluation Challenges
This architecture is elegant in theory but creates evaluation challenges in practice. Different agents need different evaluation criteria, and this isn’t always obvious upfront.
- The simple agent must be fast and factually accurate, but no one expects it to provide deep reasoning.
- The analysis agent must exhibit sound logic, not just accurate facts.
- The research agent must be comprehensive — missing a major risk factor in an investment analysis is a failure even if everything else is correct.
- Then there’s the RAG dimension. For the agents that retrieve documents, you have a whole separate set of questions. Did we retrieve the right documents? Did the agent actually use them? Or did it ignore the retrieved context and generate something plausible-sounding but ungrounded?
Evaluating this system requires evaluating multiple components against different criteria. Let’s see how to approach this.
Three Pillars of Offline Evaluation
Framework Overview
Over the past two years, working across various agent implementations, I’ve converged on a framework with three evaluation pillars. Each addresses a distinct failure mode, and together they provide reasonable coverage of what can go wrong.

The pillars aren’t independent. Routing affects which agent handles the query, which affects whether RAG is involved, which affects what evaluation criteria apply. But separating them analytically helps you diagnose where problems originate rather than simply observing that something went wrong.
One important principle: not every evaluation runs on every query. Running comprehensive RAG evaluation on a simple price lookup is wasteful — there’s no RAG to evaluate. Running only factual accuracy checks on a complex research report misses whether the reasoning was sound or the coverage was complete.
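This selective principle can be wired up as a simple dispatch table. The sketch below is illustrative — the tier names and evaluation labels are assumptions matching this article's three-tier system, not a fixed API:

```python
# Sketch: map query complexity to the evaluations that apply.
# Tier names and evaluation labels are illustrative, not a fixed API.
EVALS_BY_COMPLEXITY = {
    "simple": ["factual_accuracy"],
    "medium": ["factual_accuracy", "reasoning_quality"],
    "high":   ["factual_accuracy", "reasoning_quality", "completeness"],
}

def evaluations_for(sample: dict) -> list[str]:
    """Select evaluations by complexity, adding RAG checks only when retrieval is expected."""
    evals = list(EVALS_BY_COMPLEXITY[sample["complexity"]])
    if sample.get("relevant_documents"):  # RAG metrics only make sense with retrieval
        evals.append("rag_metrics")
    return evals

print(evaluations_for({"complexity": "simple"}))
# ['factual_accuracy']
print(evaluations_for({"complexity": "high", "relevant_documents": ["MSFT_10K_2024"]}))
# ['factual_accuracy', 'reasoning_quality', 'completeness', 'rag_metrics']
```

The point is that the dataset metadata (complexity, expected documents) drives which evaluators run, so a price lookup never pays for a completeness judge.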
Pillar 1: Routing Evaluation
Routing evaluation answers what seems like a simple question: did the router pick the right agent? In practice, getting this right is trickier than it appears, and getting it wrong has cascading consequences.
I think about routing failures in two categories. Under-routing happens when a complex query goes to a simple agent. The user asks for a comparative analysis and gets back a superficial response that doesn’t address the nuances of their question. They’re frustrated, and rightfully so — the system had the capability to help them but didn’t deploy it.
Over-routing is the opposite: simple queries going to complex agents. The user asks for a stock price and waits fifteen seconds while the research agent spins up, retrieves documents it doesn’t need, and generates an elaborate response to a question that deserved three words. The answer might be fine, but you’ve wasted compute, money, and the user’s time.
In one engagement, we discovered that the router was over-routing about 40% of simple queries. The responses were good, so no one had complained, but the system was spending five times what it should have on those queries. Fixing the router’s classification logic cut costs significantly without any degradation in user-perceived quality.

For evaluation, I use two approaches depending on the situation. Deterministic evaluation: create a test dataset where each query is labeled with the expected agent, and measure what percentage the router gets right. This is fast, cheap, and gives a clear accuracy number.
LLM-based evaluation adds nuance for ambiguous cases. Some queries genuinely could go either way — “Tell me about Microsoft’s business” could be a simple overview or a deep analysis depending on what the user actually wants. When the router’s choice differs from your label, an LLM judge can assess whether the choice was reasonable even if it wasn’t what you expected. This is more expensive but helps you distinguish true errors from judgment calls.
The metrics I track include overall routing accuracy, which is the headline number, but also a confusion matrix showing which agents get confused with which. If the router consistently sends analysis queries to the research agent, that’s a specific calibration issue you can address. I also track over-routing and under-routing rates separately because they have different business impacts and different fixes.
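These routing metrics are straightforward to compute from a labeled dataset. A minimal sketch, assuming each sample records the expected and actual agent, with an assumed tier ordering to separate over- from under-routing:

```python
from collections import Counter

# Assumed complexity ordering of the agents; higher tier = more capable/expensive.
TIER = {"simple_agent": 0, "analysis_agent": 1, "research_agent": 2}

def routing_metrics(samples: list[dict]) -> dict:
    """Overall accuracy, separate over-/under-routing rates, and a confusion matrix
    keyed by (expected_agent, actual_agent). Sample schema is illustrative."""
    confusion = Counter()
    correct = over = under = 0
    for s in samples:
        exp, act = s["expected_agent"], s["actual_agent"]
        confusion[(exp, act)] += 1
        if exp == act:
            correct += 1
        elif TIER[act] > TIER[exp]:
            over += 1   # simple query sent to a heavier agent
        else:
            under += 1  # complex query sent to a lighter agent
    n = len(samples)
    return {"accuracy": correct / n, "over_routing": over / n,
            "under_routing": under / n, "confusion": dict(confusion)}

results = routing_metrics([
    {"expected_agent": "simple_agent",   "actual_agent": "simple_agent"},
    {"expected_agent": "simple_agent",   "actual_agent": "research_agent"},  # over-routed
    {"expected_agent": "analysis_agent", "actual_agent": "analysis_agent"},
    {"expected_agent": "research_agent", "actual_agent": "research_agent"},
])
print(results["accuracy"], results["over_routing"])  # 0.75 0.25
```

Splitting the error rate this way is what makes the 40%-over-routing problem from the earlier engagement visible even when overall accuracy looks acceptable.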
Pillar 2: LLM-as-Judge Evaluation
The challenge with evaluating LLM outputs is that they are not deterministic, so they can’t be matched against an expected answer. Valid responses vary in phrasing, structure, and emphasis. You need evaluation that understands semantic equivalence, assesses reasoning quality, and catches subtle factual errors. Human evaluation does this well but doesn’t scale. It is not feasible to have someone manually review thousands of test cases on every deployment.
LLM-as-judge addresses this by using a capable language model to evaluate other models’ outputs. You provide the judge with the query, the response, your evaluation criteria, and any ground truth you have, and it returns a structured assessment. The approach has been validated in research showing strong correlation with human judgments when the evaluation criteria are well-specified.
A couple of practical notes before diving into the dimensions. Your judge model should be at least as capable as the models you’re evaluating — I typically use Claude Sonnet or GPT-4 for judging. Using a weaker model as judge leads to unreliable assessments. Also, judge prompts need to be specific and structured. Vague instructions like “rate the quality” produce inconsistent results. Detailed rubrics with clear scoring criteria produce usable evaluations.
I evaluate three dimensions, applied selectively based on query complexity.

Factual accuracy is foundational. The judge extracts factual claims from the response and verifies each against your ground truth. For a financial query, this might mean checking that the P/E ratio cited is correct, that the revenue figure is accurate, that the growth rate matches reality. The output is an accuracy score plus a breakdown of which facts were correct, incorrect, or missing.
This applies to all queries regardless of complexity. Even simple lookups need factual verification — arguably especially simple lookups, since users trust straightforward factual responses and errors undermine that trust.
Reasoning quality matters for analytical responses. When the agent is comparing investment options or assessing risk, you need to evaluate not only whether the facts are right but whether the logic is sound. Does the conclusion follow from the premises? Are claims supported by evidence? Are assumptions made explicit? Does the response acknowledge uncertainty appropriately?
I only run reasoning evaluation on medium- and high-complexity queries. Simple factual lookups don’t involve reasoning — there’s nothing to evaluate. But for anything analytical, reasoning quality is often more important than factual accuracy. A response can cite correct numbers but draw invalid conclusions from them, and that’s a serious failure.
Completeness applies to comprehensive outputs like research reports. When a user asks for an investment analysis, they expect coverage of certain elements: financial performance, competitive position, risk factors, growth catalysts. Missing a major element is a failure even if everything included is accurate and well-reasoned.

I run completeness evaluation only on high-complexity queries where comprehensive coverage is expected. For simpler queries, completeness isn’t meaningful — you don’t expect a stock price lookup to cover risk factors.
The judge prompt structure matters more than people realize. I always include the original query (so the judge understands context), the response being evaluated, the ground truth or evaluation criteria, a specific rubric explaining how to score each dimension, and a required output format (I use JSON for parseability). Investing time in prompt engineering for your judges pays off in evaluation reliability.
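Putting those pieces together, a judge prompt assembly and response parser might look like the following sketch. The rubric wording, function names, and stubbed judge reply are all hypothetical; in practice the raw output comes from a call to your judge model:

```python
import json

# Hypothetical rubric text; tune the wording and scales for your own dimensions.
JUDGE_RUBRIC = (
    "Score factual_accuracy 0-100: fraction of claims that match the ground truth.\n"
    "Score reasoning_quality 0-100: conclusions follow from premises, claims cite evidence.\n"
    'Return ONLY JSON: {"factual_accuracy": int, "reasoning_quality": int, "errors": [str]}'
)

def build_judge_prompt(query: str, response: str, ground_truth: str) -> str:
    """Assemble the judge prompt: query for context, response under test,
    ground truth, rubric, and required output format."""
    return (
        f"Original query:\n{query}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Ground truth:\n{ground_truth}\n\n"
        f"Rubric:\n{JUDGE_RUBRIC}"
    )

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON; fail loudly so a malformed judgment never counts as a pass."""
    scores = json.loads(raw)
    assert {"factual_accuracy", "reasoning_quality"} <= scores.keys()
    return scores

# Stubbed judge reply standing in for the actual judge-model call.
stub = '{"factual_accuracy": 90, "reasoning_quality": 80, "errors": []}'
print(parse_judge_output(stub)["factual_accuracy"])  # 90
```

Requiring JSON and validating it on parse is what makes judge outputs machine-aggregatable rather than free-text commentary.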
Pillar 3: RAG Evaluation
RAG evaluation addresses a failure mode that’s invisible if you only look at final outputs: the system generating plausible-sounding responses that aren’t actually grounded in retrieved knowledge.
The RAG pipeline has two stages, and either can fail. Retrieval failure means the system didn’t pull the right documents — either it retrieved irrelevant content or it missed documents that were relevant. Generation failure means the system retrieved good documents but didn’t use them properly, either ignoring them entirely or hallucinating information not present in the context.
Standard response evaluation conflates these failures. If the final answer is wrong, you don’t know whether retrieval failed or generation failed. RAG-specific evaluation separates the concerns so you can diagnose and fix the actual problem.
I use the RAGAS (Retrieval Augmented Generation Assessment) framework for this, which provides standardized metrics that have become an industry standard. The metrics fall into two groups.

Retrieval quality metrics assess whether the right documents were retrieved. Context precision measures what fraction of retrieved documents were actually relevant — if you retrieved four documents and only two were useful, that’s 50% precision. You’re pulling in noise. Context recall measures what fraction of relevant documents were retrieved — if three documents were relevant and you only got two, that’s 67% recall. You’re missing information.
Generation quality metrics assess whether retrieved context was used properly. Faithfulness is the critical one: it measures whether claims in the response are supported by the retrieved context. If the response makes five claims and four are grounded in the retrieved documents, that’s 80% faithfulness. The fifth claim is either from the model’s parametric knowledge or hallucinated — either way, it’s not grounded in your retrieval, which is a problem if you’re relying on RAG for accuracy.
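The arithmetic behind these metrics is simple to state in code. This toy sketch reproduces the worked examples above; note that in RAGAS itself the claim extraction and support checks are performed by an LLM, not by set membership:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    return sum(doc in relevant for doc in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that were retrieved."""
    return sum(doc in retrieved for doc in relevant) / len(relevant)

def faithfulness(claims_supported: int, claims_total: int) -> float:
    """Fraction of response claims grounded in retrieved context.
    (RAGAS uses an LLM to extract claims and judge support; this is the ratio only.)"""
    return claims_supported / claims_total

print(context_precision(["d1", "d2", "d3", "d4"], {"d1", "d2"}))  # 0.5  -> pulling noise
print(round(context_recall(["d1", "d2"], {"d1", "d2", "d3"}), 2))  # 0.67 -> missing info
print(faithfulness(4, 5))                                          # 0.8
```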

I want to emphasize faithfulness because it’s the metric most directly tied to hallucination risk in RAG systems. A response can sound authoritative and be completely fabricated. Faithfulness evaluation catches this by checking whether each claim traces back to retrieved content.
In one project, we found that faithfulness scores varied dramatically by query type. For simple factual queries, faithfulness was above 90%. For complex analytical queries, it dropped to around 60% — the model was doing more “reasoning” that went beyond the retrieved context. That’s not necessarily wrong, but it meant users couldn’t trust that analytical conclusions were grounded in the source documents. We ended up adjusting the prompts to more explicitly constrain the model to retrieved information for certain query types.
Implementation & Integration
Pipeline Architecture
The evaluation pipeline has four stages: load the dataset, execute the agent on each sample, run the appropriate evaluations, and aggregate into a report.
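The four stages can be sketched as a small driver function. Everything here is illustrative: `agent` and the evaluators are callables supplied by your system, and the JSONL dataset format is an assumption:

```python
import json
import os
import tempfile
from statistics import mean

def run_offline_eval(dataset_path: str, agent, evaluators: dict) -> dict:
    """Four-stage pipeline: load samples, execute the agent on each one,
    run the evaluations, and aggregate per-metric means into a report."""
    with open(dataset_path) as f:
        samples = [json.loads(line) for line in f]               # stage 1: load
    scores = {name: [] for name in evaluators}
    for sample in samples:
        response = agent(sample["query"])                        # stage 2: execute
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(sample, response))      # stage 3: evaluate
    return {name: mean(vals) for name, vals in scores.items()}   # stage 4: aggregate

# Demo with a one-sample dataset, a stubbed agent, and a toy evaluator.
path = os.path.join(tempfile.gettempdir(), "eval_demo.jsonl")
with open(path, "w") as f:
    f.write(json.dumps({"query": "What is MSFT's P/E?"}) + "\n")

report = run_offline_eval(
    path,
    agent=lambda q: "MSFT P/E is approximately 35",
    evaluators={"factual_accuracy": lambda sample, resp: 1.0 if "35" in resp else 0.0},
)
print(report)  # {'factual_accuracy': 1.0}
```

In a real pipeline stage 3 would dispatch only the evaluations appropriate to each sample's complexity, and stage 4 would also emit per-sample traces rather than just means.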

We start with the sample dataset to be evaluated. Each sample needs the query itself, metadata indicating complexity level and expected agent, ground truth facts for accuracy evaluation, and, for RAG queries, the relevant documents that should be retrieved. Building this dataset is tedious work, but the quality of your evaluation depends entirely on the quality of your ground truth. See the example below:
```json
{
  "id": "eval_001",
  "query": "Compare Microsoft and Google's P/E ratios",
  "category": "comparison",
  "complexity": "medium",
  "expected_agent": "analysis_agent",
  "ground_truth_facts": [
    "Microsoft P/E is approximately 35",
    "Google P/E is approximately 25"
  ],
  "ground_truth_answer": "Microsoft trades at higher P/E (~35) than Google (~25)...",
  "relevant_documents": ["MSFT_10K_2024", "GOOGL_10K_2024"]
}
```
I recommend starting with at least 50 samples per complexity level, so 150 minimum for a three-tier system. More is better — 400 total gives you stronger statistical confidence in the metrics. Stratify across query categories so you’re not unintentionally over-indexing on one type.
For observability, I use Langfuse, which provides trace storage, score attachment, and dataset run tracking. Each evaluation sample creates a trace, and each evaluation metric attaches as a score to that trace. Over time, you build a history of evaluation runs you can compare across model versions, prompt changes, or architecture modifications. The ability to drill into specific failures and see the full trace is very helpful for troubleshooting.
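A quick sanity check against those targets is worth automating before any expensive evaluation runs. A minimal sketch, assuming the sample schema shown above; the threshold values are the article's suggestions:

```python
from collections import Counter

def check_dataset_balance(samples: list[dict], min_per_level: int = 50) -> dict:
    """Flag complexity tiers below the minimum sample count and query categories
    that dominate the dataset (here, more than half of all samples)."""
    by_complexity = Counter(s["complexity"] for s in samples)
    by_category = Counter(s["category"] for s in samples)
    thin = [lvl for lvl, n in by_complexity.items() if n < min_per_level]
    dominant = [cat for cat, n in by_category.items() if n > 0.5 * len(samples)]
    return {
        "per_complexity": dict(by_complexity),
        "under_sampled": thin,
        "over_indexed_categories": dominant,
    }

# Toy dataset: enough simple lookups, too few medium comparisons.
samples = (
    [{"complexity": "simple", "category": "lookup"}] * 60
    + [{"complexity": "medium", "category": "comparison"}] * 30
)
report = check_dataset_balance(samples)
print(report["under_sampled"])            # ['medium']
print(report["over_indexed_categories"])  # ['lookup']
```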
Automated (CI/CD) Quality Gates
Evaluation becomes truly powerful when it’s automated and blocking. Scheduled execution of the evaluation against a representative dataset subset is a good start. The run produces metrics. If metrics fall below defined thresholds, the downstream governance mechanism kicks in — whether that’s quality reviews, failed gate checks, etc.
The thresholds need to be calibrated to your use case and risk tolerance. For a financial application where accuracy is critical, I might set factual accuracy at 90% and faithfulness at 85%. For an internal productivity tool with lower stakes, 80% and 75% might be acceptable. The key is aligning the thresholds with governance and quality teams and applying them in a standard, repeatable way.
I also recommend a scheduled run of the evaluation against the full dataset, not just the subset used for PR checks. This catches drift in external dependencies — API changes, model updates, knowledge base modifications — that might not surface in the smaller PR dataset.
When evaluation fails, the pipeline should generate a failure report identifying which metrics missed their thresholds and which specific samples failed. This provides the necessary signals for teams to resolve the failures.
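A blocking gate is then a few lines of comparison logic at the end of the CI job. A sketch, using the high-stakes thresholds above as illustrative values to be calibrated with your governance and quality teams:

```python
# Illustrative thresholds for a high-stakes financial use case;
# calibrate these with governance and quality stakeholders.
THRESHOLDS = {"factual_accuracy": 0.90, "faithfulness": 0.85, "routing_accuracy": 0.95}

def quality_gate(metrics: dict) -> list[str]:
    """Return the list of threshold violations; an empty list means the gate passes.
    A missing metric counts as a failure rather than a silent pass."""
    return [
        f"{name}: {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]

failures = quality_gate(
    {"factual_accuracy": 0.93, "faithfulness": 0.82, "routing_accuracy": 0.97}
)
print(failures)  # ['faithfulness: 0.82 < 0.85']
# In CI, a non-empty list fails the job, e.g.: sys.exit(1 if failures else 0)
```

Treating a missing metric as a failure is a deliberate choice: a misconfigured evaluator should block deployment, not wave it through.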
Governance & Compliance
For enterprise deployments, evaluation encompasses engineering quality and organizational accountability. Governance teams need evidence that AI systems meet defined standards. Compliance teams need audit trails. Risk teams need visibility into failure modes.
Offline evaluation provides this evidence. Every run creates a record: which model version was evaluated, which dataset was used, what scores were achieved, whether thresholds were met. These records accumulate into an audit trail demonstrating systematic quality assurance over time.
I recommend defining acceptance criteria collaboratively with governance stakeholders before the first evaluation run. What factual accuracy threshold is appropriate for your use case? What faithfulness level is required? Getting alignment upfront prevents confusion and conflict when interpreting results.

The criteria should reflect actual risk. A system providing medical information needs higher accuracy thresholds than one summarizing meeting notes. A system making financial recommendations needs higher faithfulness thresholds than one drafting marketing copy. One size doesn’t fit all, and governance teams understand this when you frame it in terms of risk.
Finally, think about reporting for different audiences. Engineering wants detailed breakdowns by metric and query type. Governance wants summary pass/fail status with trend lines. Executives need a dashboard showing green/yellow/red status across systems. Langfuse and similar tools support these different views, but you need to configure them intentionally.
Conclusion
The gap between impressive demos and production-ready systems is bridged through rigorous, systematic evaluation. The framework presented here provides the structure to build governance tailored to your specific agents, use cases, and risk tolerance.
Key Takeaways
- Evaluation Requirements — Requirements vary depending on the application use case. A simple lookup needs factual accuracy checks. A complex analysis needs reasoning evaluation. A RAG-enabled response needs faithfulness verification. Applying the right evaluations to the right queries gives you signal without noise.
- Automation — Manual evaluation doesn’t scale and doesn’t catch regressions. Integrating evaluation into CI/CD pipelines, with explicit thresholds that block deployment, turns quality assurance from an ad hoc action into a repeatable practice.
- Governance — Evaluation records provide the audit trail that compliance needs and the evidence that leadership needs to approve production deployment. Building this connection early makes AI governance a partnership rather than an obstacle.
Where to Start
If you’re not doing systematic offline evaluation today, don’t try to implement everything at once.
- Start with routing accuracy and factual accuracy — these are the highest-signal metrics and the easiest to implement. Build a small evaluation dataset, perhaps 50–100 samples. Run it manually a few times to calibrate your expectations.
- Add reasoning evaluation for complex queries and RAG metrics for retrieval-enabled agents.
- Integrate into CI/CD. Define thresholds with your governance partners. Build, test, iterate.
The goal is to start laying the groundwork and building processes that produce evidence of quality against defined criteria. That’s the foundation for production readiness, stakeholder confidence, and responsible AI deployment.
This article turned out to be a lengthy one — thank you for sticking with it until the end. I hope you found it useful and will give these concepts a try. All the best, and happy building 🙂
