grow more complex, traditional logging and monitoring fall short. What teams really want is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.
Therefore, teams should adopt a suitable observability platform while they focus on the core task of building and improving the agents’ orchestration, and integrate their application with the observability platform with minimal overhead to their functional code. In this article, I’ll demonstrate how you can set up an open-source AI observability platform to accomplish the following using a minimal-code approach:
- LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination and more. Display scores across runs with detailed logs and analytics.
- Testing at scale: Set up datasets to store regression test cases for measuring accuracy against expected ground truth responses. Proactively detect LLM and agent drift.
- MELT data: Track metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage), logs (user interaction, tool execution, agent decision making) along with detailed traces – all without writing detailed telemetry and instrumentation code.
We will be using Langfuse for observability. It is open-source and framework-agnostic, and works with popular orchestration frameworks and LLM providers.
Multi-agent application
For this demonstration, I have attached the LangGraph code of a Customer Service application. The application accepts tickets from the user, classifies each ticket as Technical, Billing or Both using a Triage agent, then routes it to the Technical Support agent, the Billing Support agent, or both of them. A Finalizer agent then synthesizes the responses from both agents into a coherent, more readable format. The flowchart is as follows:
# --------------------------------------------------
# 0. Load .env
# --------------------------------------------------
from dotenv import load_dotenv
load_dotenv(override=True)
# --------------------------------------------------
# 1. Imports
# --------------------------------------------------
import os
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler
# --------------------------------------------------
# 2. Langfuse Client (WORKING CONFIG)
# --------------------------------------------------
langfuse = Langfuse(
host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
secret_key=os.environ["LANGFUSE_SECRET_KEY"]
)
langfuse_callback = CallbackHandler()
os.environ["LANGGRAPH_TRACING"] = "false"
# --------------------------------------------------
# 3. Azure OpenAI Setup
# --------------------------------------------------
llm = AzureChatOpenAI(
azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
temperature=0.2,
callbacks=[langfuse_callback], # 🔑 enables token usage
)
# --------------------------------------------------
# 4. Shared State
# --------------------------------------------------
class AgentState(TypedDict, total=False):
ticket: str
category: str
technical_response: str
billing_response: str
final_response: str
# --------------------------------------------------
# 5. Agent Definitions
# --------------------------------------------------
def triage_agent(state: dict) -> dict:
with langfuse.start_as_current_observation(
as_type="span",
name="triage_agent",
input={"ticket": state["ticket"]},
) as span:
span.update_trace(name="Customer Service Query - LangGraph Demo")
response = llm.invoke([
{
"role": "system",
"content": (
"Classify the query as one of: "
"Technical, Billing, Both. "
"Respond with only the label."
),
},
{"role": "user", "content": state["ticket"]},
])
raw = response.content.strip().lower()
if "each" in raw:
category = "Each"
elif "technical" in raw:
category = "Technical"
elif "billing" in raw:
category = "Billing"
else:
category = "Technical" # ✅ secure fallback
span.update(output={"raw": raw, "category": category})
return {"category": category}
def technical_support_agent(state: dict) -> dict:
with langfuse.start_as_current_observation(
as_type="span",
name="technical_support_agent",
input={
"ticket": state["ticket"],
"category": state.get("category"),
},
) as span:
response = llm.invoke([
{
"role": "system",
"content": (
"You are a technical support specialist. "
"Provide a clear, step-by-step solution."
),
},
{"role": "user", "content": state["ticket"]},
])
answer = response.content
span.update(output={"technical_response": answer})
return {"technical_response": answer}
def billing_support_agent(state: dict) -> dict:
with langfuse.start_as_current_observation(
as_type="span",
name="billing_support_agent",
input={
"ticket": state["ticket"],
"category": state.get("category"),
},
) as span:
response = llm.invoke([
{
"role": "system",
"content": (
"You are a billing support specialist. "
"Answer clearly about payments, invoices, or accounts."
),
},
{"role": "user", "content": state["ticket"]},
])
answer = response.content
span.update(output={"billing_response": answer})
return {"billing_response": answer}
def finalizer_agent(state: dict) -> dict:
with langfuse.start_as_current_observation(
as_type="span",
name="finalizer_agent",
input={
"ticket": state["ticket"],
"technical": state.get("technical_response"),
"billing": state.get("billing_response"),
},
) as span:
        # Collect whichever agent responses are present in the shared state
        parts = []
        if state.get("technical_response"):
            parts.append(f"Technical:\n{state['technical_response']}")
        if state.get("billing_response"):
            parts.append(f"Billing:\n{state['billing_response']}")
if not parts:
final = "Error: No agent responses available."
else:
response = llm.invoke([
{
"role": "system",
"content": (
"Combine the following agent responses into ONE clear, professional, "
"customer-facing answer. Do not mention agents or internal labels. "
f"Answer the user's query: '{state['ticket']}'."
),
},
{"role": "user", "content": "nn".join(parts)},
])
final = response.content
span.update(output={"final_response": final})
return {"final_response": final}
# --------------------------------------------------
# 6. LangGraph Construction
# --------------------------------------------------
builder = StateGraph(AgentState)
builder.add_node("triage", triage_agent)
builder.add_node("technical", technical_support_agent)
builder.add_node("billing", billing_support_agent)
builder.add_node("finalizer", finalizer_agent)
builder.set_entry_point("triage")
# Conditional routing
builder.add_conditional_edges(
"triage",
lambda state: state["category"],
{
"Technical": "technical",
"Billing": "billing",
"Each": "technical",
"__default__": "technical", # ✅ never dead-end
},
)
# Sequential resolution: continue to billing only when the category is "Both"
builder.add_conditional_edges(
    "technical",
    lambda state: "billing" if state.get("category") == "Both" else "finalizer",
    {
        "billing": "billing",
        "finalizer": "finalizer",
    },
)
builder.add_edge("billing", "finalizer")
builder.add_edge("finalizer", END)
graph = builder.compile()
# --------------------------------------------------
# 9. Main
# --------------------------------------------------
if __name__ == "__main__":
print("===============================================")
print(" Conditional Multi-Agent Support System (Ready)")
print("===============================================")
print("Enter 'exit' or 'quit' to stop this system.n")
while True:
# Get user input for the ticket
ticket = input("Enter your support query (ticket): ")
# Check for exit command
if ticket.lower() in ["exit", "quit"]:
print("nExiting the support system. Goodbye!")
break
if not ticket.strip():
print("Please enter a non-empty query.")
            continue
try:
# --- Run the graph with the user's ticket ---
result = graph.invoke(
{"ticket": ticket},
config={"callbacks": [langfuse_callback]},
)
# --- Print Results ---
category = result.get('category', 'N/A')
print(f"n✅ Triage Classification: **{category}**")
# Check which agents were executed based on the presence of a response
executed_agents = []
if result.get("technical_response"):
executed_agents.append("Technical")
if result.get("billing_response"):
executed_agents.append("Billing")
print(f"🛠️ Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")
print("n================ FINAL RESPONSE ================n")
print(result["final_response"])
print("n" + "="*60 + "n")
except Exception as e:
            # This is important for debugging: print the exception type and message
            print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
            print("\nPlease try another query.")
            print("\n" + "="*60 + "\n")
Observability Configuration
To set up Langfuse, go to https://cloud.langfuse.com/ and create an account with a billing tier (a Hobby tier with generous limits is available), then set up a Project. In the project settings, you can generate the public and secret keys which need to be provided at the start of the code. You also need to add the LLM connection, which will be used for the LLM-as-a-Judge evaluation.
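For reference, here is a minimal sketch of the .env file the code above expects. The Langfuse keys and the Azure deployment/version variables are read explicitly in the code; AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT are the defaults that langchain's AzureChatOpenAI picks up from the environment. All values below are placeholders:
# .env (placeholder values only)
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxx
AZURE_OPENAI_API_KEY=xxxxxxxx
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment-name
AZURE_OPENAI_API_VERSION=2025-01-01-preview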

LLM-as-a-Judge setup
This is the core of the performance evaluation setup for agents. Here you can configure various pre-built Evaluators from the Evaluator Library, which will score the responses on criteria such as Conciseness, Correctness, Hallucination, Answer Critic etc. These should suffice for most use cases; otherwise, Custom Evaluators can be set up as well. Here is a view of the Evaluator Library:

Select the evaluator, say Relevance, that you wish to use. You can choose to run it for new or existing traces or for Dataset runs. In addition, review the evaluation prompt to ensure it satisfies your evaluation objective. Most importantly, the query, generation and other variables need to be appropriately mapped to their source (usually, the Input and Output of the application trace). In our case, these will be the ticket entered by the user and the response generated by the finalizer agent, respectively. For Dataset runs, you can also compare the generated responses to the ground truth responses stored as expected outputs (explained in the following sections).
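For the mapping to resolve, the trace itself has to carry those values as its Input and Output. Below is a minimal sketch, not part of the demo code above, of how they could be set explicitly; it reuses the span.update_trace call that the triage agent already makes, this time with input and output arguments (record_trace_io is a hypothetical helper):
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* keys from the environment

def record_trace_io(ticket: str, final_response: str) -> None:
    # Hypothetical helper: attach the user's ticket and the synthesized answer
    # to the trace, so the evaluator's query/generation variables can be mapped
    # to the trace Input and Output respectively.
    with langfuse.start_as_current_observation(
        as_type="span", name="record_trace_io"
    ) as span:
        span.update_trace(
            input={"ticket": ticket},
            output={"final_response": final_response},
        )
Called from inside an agent node, this should update the currently active trace; called on its own, it simply creates a small trace of its own.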
Here is the configuration for the ‘’ evaluation I set up for new Dataset runs, along with the variable mapping. The evaluation prompt preview is also depicted. Most of the evaluators score within a range of 0 to 1:


For the customer service demo, I have configured 3 evaluators – which run for all new traces, and , which runs for Dataset runs only.

Datasets setup
Create a dataset to use as a test case repository. Here, you can store test cases with the input query and the ideal expected response. To populate the dataset, there are 3 choices: create one record at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs from application traces whose responses are judged to be of good quality by human experts.
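All three options can also be scripted with the Langfuse Python SDK. Here is a minimal sketch; the dataset name, the CSV column names, and the example tickets and responses are all illustrative, and the trace id in the last call is a placeholder you would copy from the Langfuse UI:
import csv
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* keys from the environment

# Create the test case repository
langfuse.create_dataset(name="Regression")

# Option 1: add a single record with the input query and the ideal expected response
langfuse.create_dataset_item(
    dataset_name="Regression",
    input={"ticket": "My invoice shows a duplicate charge this month."},
    expected_output="Apologies for the inconvenience. The duplicate charge will be reversed within 3-5 business days.",
)

# Option 2: bulk-load records from a CSV of queries and expected responses
with open("regression_cases.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        langfuse.create_dataset_item(
            dataset_name="Regression",
            input={"ticket": row["query"]},
            expected_output=row["expected_response"],
        )

# Option 3: add a reviewed trace's input/output and keep a link back to the source trace
langfuse.create_dataset_item(
    dataset_name="Regression",
    input={"ticket": "I cannot log in and I was billed twice."},
    expected_output="Reset your password via the link sent to your email; the duplicate charge has been refunded.",
    source_trace_id="replace-with-a-trace-id-from-the-langfuse-ui",
)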
Here is the dataset I have created for the demo. The records are a mix of technical, billing and ‘Both’ queries, and I have created all of them from application traces:

That’s it! The configuration is complete and we are ready to run observability.
Observability Results
The Langfuse Home page is a dashboard of several useful charts. It shows the count of execution traces, scores and averages at a glance, traces over time, model usage and cost, etc.

MELT data
The most useful observability data is available in the ‘Tracing’ option, which displays summarized and detailed views of all executions. Here is a view of the dashboard depicting the time, name, input, output and the crucial latency and token usage metrics. Note that for every agent execution of our application, there are 2 evaluation traces generated for the and evaluators we set up.


Let’s take a look at the details of one of the executions of the Customer Service application. On the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls along with the token usage. If our agents had tool calls or human-in-the-loop steps, they would have been depicted here as well. Note that the evaluation scores for and are also shown on top, which are 0.40 and 1 respectively for this run. Clicking on them shows the reasoning for the score and a link that takes us to the evaluator trace.
On the right, for every agent, LLM and tool call, we can see the input and the generated output. For instance, here we see that the query was categorized as ‘Both’, and accordingly the left chart shows that both the technical and billing support agents were called, which confirms our flow is working as expected.

At the top of the right-hand panel there is the ‘Add to datasets’ button. At any step of the tree, clicking this button opens a panel like the one depicted below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful feature for human experts to add frequently occurring user queries and good responses to the dataset during normal agent operations, thereby building a regression test repository with minimal effort. In the future, when there is a major upgrade or release of the application, the Regression dataset can be run and the generated outputs can be scored against the expected outputs (ground truth) recorded here, using the ‘’ evaluator we created during the LLM-as-a-Judge setup. This helps to detect LLM drift (or agent drift) early and take corrective steps.

Here is one of the evaluation traces () for this application trace. The evaluator provides the reasoning behind the score of 0.4 it assigned to this response.

Scores
The Scores option in Langfuse shows a list of all the evaluation runs from the various active evaluators along with their scores. More pertinent is the Analytics dashboard, where two scores can be selected and metrics such as mean and standard deviation, along with trend lines, can be viewed.


Regression testing
With Datasets, we are able to run regression testing using the test case repository of queries and expected outputs. We have stored 4 queries in our Regression dataset, with a mix of technical, billing and ‘Both’ queries.
For this, we can run the attached code, which fetches the relevant dataset and runs the experiment. All the test runs are logged along with the average scores. We can view the results of a specific test, with scores for each test case, in a single dashboard. And as needed, the detailed trace can be accessed to see the reasoning behind the score.
import os
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse
# Initialize client
from dotenv import load_dotenv
load_dotenv(override=True)
langfuse = Langfuse(
host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
secret_key=os.environ["LANGFUSE_SECRET_KEY"]
)
llm = AzureChatOpenAI(
azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
temperature=0.2,
)
# Define your task function
def my_task(*, item, **kwargs):
    # Run each dataset item's ticket through the LLM and return the answer
    query = item.input['ticket']
    response = llm.invoke([{"role": "user", "content": query}])
    return response.content.strip()
# Get dataset from Langfuse
dataset = langfuse.get_dataset("Regression")
# Run experiment directly on the dataset
result = dataset.run_experiment(
name="Production Model Test",
description="Monthly evaluation of our production model",
    task=my_task  # see above for the task definition
)
# Use format method to display results
print(result.format())


Key Takeaways
- AI observability doesn’t have to be code-heavy. Most evaluation, tracing, and regression testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
- Rich evaluation workflows can be defined declaratively. Capabilities such as LLM-as-a-Judge scoring (), variable mapping, and evaluation prompts are configured directly in the observability platform, without writing bespoke evaluation logic.
- Datasets and regression testing are configuration-first features. Test case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal additional code.
- Full MELT observability comes “out of the box.” Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
- Minimal instrumentation, maximum visibility. With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends, freeing developers to focus on agent logic rather than observability plumbing.
Conclusion
As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly become black boxes that are difficult to evaluate, debug, and improve.
An AI observability platform shifts this burden away from developers and application code. Using a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.
By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance—making AI systems easier to trust, evolve, and operate at scale.
Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which is tricky to get working in CrewAI. Read about how observability helped me fix this well-known issue with CrewAI’s hierarchical manager-worker process, by tracing agent responses at each step and refining them to get the orchestration to work as it should. Full analysis here:
All images and data used in this article are synthetically generated. Figures and code created by me.
