LLM Monitoring and Observability: Hands-on with Langfuse


You have built a fancy LLM application that responds to user queries about a selected domain. You have spent days organizing the entire pipeline, from refining your prompts to adding context retrieval, chains, and tools, and eventually presenting the output. However, after deployment, you realize that the application's responses seem to be missing the mark: either you aren't satisfied with the answers, or it is taking an exorbitant amount of time to reply. Whether the issue is rooted in your prompts, your retrieval, your API calls, or somewhere else, monitoring and observability can help you sort it out.

In this tutorial, we will start by learning the fundamentals of LLM monitoring and observability. Then, we will explore the open-source ecosystem, culminating in a discussion of Langfuse. Finally, we will implement monitoring and observability for a Python-based LLM application using Langfuse.

What Are Monitoring and Observability?

Monitoring and observability are crucial concepts in maintaining the health of IT systems. While the terms 'monitoring' and 'observability' are often used together, they represent slightly different concepts.

According to IBM's definition, monitoring is the process of collecting and analyzing system data to track performance over time. It relies on predefined metrics to detect anomalies or potential failures. Common examples include tracking a system's CPU and memory usage and alerting when certain thresholds are breached.

Observability provides a deeper understanding of the system's internal state based on its external outputs. It allows you to diagnose and understand why something is happening, not just that something is wrong. For instance, observability lets you trace inputs and outputs through various parts of the system to identify where a bottleneck is occurring.

The above definitions are also valid in the realm of LLM applications. It is through monitoring and observability that we can trace the internal states of an LLM application, such as how a user query is processed through various modules (e.g., retrieval, generation) and what the associated latencies and costs are.

Here are some key terms used in monitoring and observability:

Telemetry: Telemetry is a broad term which encompasses collecting data from your application while it is running and processing that data to understand the application's behavior.

Instrumentation: Instrumentation is the process of adding code to your application to collect telemetry data. For LLM applications, this means adding hooks at various key points to capture internal states, such as API calls to the LLM or the retriever's outputs.

Trace: A trace, a direct consequence of instrumentation, captures the detailed execution journey of a request through the entire application. This encompasses the input/output at each key point and the corresponding time taken at each point. Each trace is made up of a sequence of spans.

Observation: Each trace is made up of one or more observations, which can be of type Span, Event, or Generation.

Span: A span is a unit of work or operation, which describes the process being carried out at each key point.

Generation: A generation is a special type of span which tracks the input request sent to the LLM and its output response.

Logs: Logs are time-stamped records of events and interactions within the LLM application.

Metrics: Metrics are numerical measurements that provide aggregate insights into the LLM's behavior and performance, such as hallucination rate or answer relevancy.

A sample trace containing several spans and generations. Image source: Langfuse Tracing
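To see how these terms map onto actual code, here is a minimal, illustrative sketch using Langfuse's low-level Python SDK (v2). The names, placeholder keys, and outputs are assumptions for demonstration; the proper SDK setup is covered later in this tutorial.

from langfuse import Langfuse

# Placeholder credentials; real keys come from the Langfuse dashboard (shown later)
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",
)

# A trace captures one end-to-end request through the application
trace = langfuse.trace(name="user-question", input={"question": "What is RAG?"})

# A span records a unit of work, e.g., the retrieval step
span = trace.span(name="retrieval", input={"query": "What is RAG?"})
span.end(output={"documents": ["chunk 1 ...", "chunk 2 ..."]})

# A generation is a special span that records the LLM call itself
generation = trace.generation(
    name="llm-call",
    model="gpt-4o-mini",
    input=[{"role": "user", "content": "What is RAG?"}],
)
generation.end(output="RAG stands for retrieval-augmented generation ...")

# Send any buffered events before the script exits
langfuse.flush()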

Why Is LLM Monitoring and Observability Necessary?

As LLM applications become increasingly complex, LLM monitoring and observability play a vital role in optimizing application performance. Here are some reasons why they are important:

Reliability: LLM applications are critical to organizations; performance degradation can directly impact their businesses. Monitoring ensures that the application is performing within acceptable limits in terms of quality, latency, and uptime.

Debugging: A complex LLM application can be unpredictable; it can produce erroneous responses or encounter errors. Monitoring and observability can help identify problems in the application by sifting through the entire lifecycle of each request and pinpointing the root cause.

User Experience: Monitoring user experience and feedback is vital for LLM applications which directly interact with a client base. It allows organizations to enhance user experience by tracking user conversations and making informed decisions. Most importantly, it allows the collection of user feedback to improve the model and downstream processes.

Bias and Fairness: LLMs are trained on publicly available data and therefore sometimes internalize biases present in that data. This can cause them to produce offensive or harmful information. Observability can help mitigate such responses through proper corrective measures.

Cost Management: Monitoring can help you track and optimize costs incurred during regular operations, such as the LLM's per-token API costs. You can also set up alerts in case of overuse.

Tools for Monitoring and Observability

There are many amazing tools and libraries available for enabling monitoring and observability of LLM applications. Many of these tools are open source, offering free self-hosting on local infrastructure as well as enterprise-level deployment on their respective cloud servers. Each of these tools offers common features such as tracing, token counts, latencies, total requests, and time-based filtering. Beyond this, each solution has its own set of distinct features and strengths.

Here, we will name only a few open-source tools which offer free self-hosting.

Langfuse: A popular open-source LLM monitoring tool, which is both model and framework agnostic. It offers a wide range of monitoring options using client SDKs purpose-built for Python and JavaScript/TypeScript.

Arize Phoenix: Another popular tool which offers both self-hosting and Phoenix Cloud deployment. Phoenix offers SDKs for Python and JavaScript/TypeScript.

AgentOps: AgentOps is a well-known solution which tracks LLM outputs and retrievers, allows benchmarking, and helps ensure compliance. It offers integration with several LLM providers.

Grafana: A classic and widely used monitoring tool which can be combined with OpenTelemetry to provide detailed LLM tracing and monitoring.

Weave: Weights & Biases' Weave is another tracking and experimentation tool for LLM-based applications, which offers both self-managed and dedicated cloud environments. Client SDKs are available in Python and TypeScript.


Introducing Langfuse

Langfuse offers a wide range of features such as LLM observability, tracing, token and cost monitoring, prompt management, datasets, and LLM security. Moreover, Langfuse offers evaluation of LLM responses using various techniques such as LLM-as-a-Judge and user feedback. Furthermore, Langfuse offers an LLM playground to its premium users, which lets you tweak your prompts and parameters on the spot and watch how the LLM responds to those changes. We will discuss more details later in the tutorial.

Langfuse’s solution to LLM monitoring and observability consists of two parts: 

  • Langfuse SDKs
  • Langfuse Server

The Langfuse SDKs are the coding side of Langfuse, available for various platforms, which allow you to enable instrumentation in your application's code. They amount to nothing more than a few lines of code placed appropriately in your application's codebase.

The Langfuse server, on the other hand, is the UI-based dashboard, together with other underlying services, which is used to log, view, and persist all the traces and metrics. The Langfuse dashboard is accessible through any modern web browser.

Before setting up the dashboard, it is important to note that Langfuse offers three different ways of hosting the dashboard, which are:

  • Self-hosting (local)
  • Managed hosting (using Langfuse’s cloud infrastructure)
  • On-premises deployment

Managed and on-premises deployments are beyond the scope of this tutorial. You can visit Langfuse's official documentation for all the relevant information.

A self-hosted solution, as the name implies, lets you simply run an instance of Langfuse on your own machine (e.g., PC, laptop, virtual machine, or web service). However, there is a catch in this simplicity. The Langfuse server requires a persistent Postgres database server to continuously maintain its state and data. This means that along with the Langfuse server, we also need to set up a Postgres server. But don't worry, we have things under control. You can either use a Postgres server hosted on any cloud service (such as Azure or AWS), or you can easily self-host it, just like the Langfuse service. Capiche?

How is Langfuse's self-hosting achieved? Langfuse offers several ways to do this, such as using docker/docker-compose or Kubernetes, and/or deploying on cloud servers. For the moment, let's stick with docker commands.

Setting Up a Langfuse Server

Now, it's time to get hands-on experience with setting up a Langfuse dashboard for an LLM application and logging traces and metrics to it. When we say Langfuse server, we mean the Langfuse dashboard and the other services that allow traces to be logged, viewed, and persisted. This requires a fundamental understanding of docker and its associated concepts. You can go through this tutorial if you are not already familiar with docker.

Using docker-compose

The most convenient and fastest way to set up Langfuse on your own machine is to use a docker-compose file. This is just a two-step process, which involves cloning Langfuse onto your local machine and simply invoking docker-compose.

Step 1: Clone the Langfuse repository:

$ git clone https://github.com/langfuse/langfuse.git
$ cd langfuse

Step 2: Start all services

$ docker compose up

And that's it! Go to your web browser and open http://localhost:3000 to see the Langfuse UI running. Also note that docker-compose takes care of the Postgres server automatically.

From this point, we can safely move on to setting up the Python SDK and enabling instrumentation in our code.

Using docker

The docker setup of the Langfuse server is similar to the docker-compose implementation, with one obvious difference: we will set up both containers (Langfuse and Postgres) individually and connect them using an internal network. This can be helpful in scenarios where docker-compose is not the appropriate first choice, perhaps because you already have a Postgres server running, or you want to run each service individually for more control, such as hosting each service separately on Azure Web App Services due to resource limitations.

Step 1: Create a custom network

First, we need to create a custom bridge network, which will allow the two containers to communicate with each other privately.

$ docker network create langfuse-network

This command creates a network named langfuse-network. Feel free to change the name according to your preferences.

Step 2: Arrange a Postgres service

We will start by running the Postgres container, since the Langfuse service depends on it, using the following command:

$ docker run -d \
  --name postgres-db \
  --restart always \
  -p 5432:5432 \
  --network langfuse-network \
  -v database_data:/var/lib/postgresql/data \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  postgres:latest

Explanation:

This command runs the postgres:latest docker image as a container named postgres-db, on the network named langfuse-network, and exposes the service on port 5432 of your local machine. For persistence (i.e., to keep data intact for future use), it creates a docker volume named database_data and mounts it at /var/lib/postgresql/data inside the container. Moreover, it sets three crucial environment variables for the Postgres superuser: POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB.

Step 3: Arrange the Langfuse service

$ docker run -d \
  --name langfuse-server \
  --network langfuse-network \
  -p 3000:3000 \
  -e DATABASE_URL=postgresql://postgres:postgres@postgres-db:5432/postgres \
  -e NEXTAUTH_SECRET=mysecret \
  -e SALT=mysalt \
  -e ENCRYPTION_KEY=0000000000000000000000000000000000000000000000000000000000000000 \
  -e NEXTAUTH_URL=http://localhost:3000 \
  langfuse/langfuse:2

Explanation:

Likewise, this command runs the langfuse/langfuse:2 docker image in detached mode (-d) as a container named langfuse-server, on the same langfuse-network network, and exposes the service on port 3000. It also assigns values to the mandatory environment variables. NEXTAUTH_URL must point to the URL where the langfuse-server will be deployed.

ENCRYPTION_KEY must be 256 bits, i.e., 64 characters in hex format. You can generate it on Linux via:

$ openssl rand -hex 32

DATABASE_URL is an environment variable which defines the complete database path and credentials. The general format for a Postgres URL is:

postgresql://[POSTGRES_USER[:POSTGRES_PASSWORD]@][host][:port][/POSTGRES_DB]

Here, host is the hostname (i.e., the container name) or IP address of our PostgreSQL server.
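For instance, with the values used above, the URL can be assembled like this (a small illustrative Python snippet; the container name postgres-db acts as the host on the shared docker network):

user = "postgres"          # POSTGRES_USER
password = "postgres"      # POSTGRES_PASSWORD
host = "postgres-db"       # container name on langfuse-network
port = 5432
db = "postgres"            # POSTGRES_DB

database_url = f"postgresql://{user}:{password}@{host}:{port}/{db}"
print(database_url)        # postgresql://postgres:postgres@postgres-db:5432/postgres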

Finally, go to your web browser and open http://localhost:3000 to confirm that the Langfuse server is available.
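If you prefer to verify this from code rather than the browser, you can call the server's health endpoint. The snippet below assumes the /api/public/health path exposed by current Langfuse versions:

import requests

# Assumed health endpoint of the Langfuse server
resp = requests.get("http://localhost:3000/api/public/health", timeout=5)
print(resp.status_code, resp.text)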

Configuring Langfuse Dashboard

Once you have successfully set up the Langfuse server, it is time to configure the Langfuse dashboard before you can start tracing application data.

Go to http://localhost:3000 in your web browser, as explained in the previous section. You need to create a new organization, members, and a project under which you will trace and log all your metrics. Follow the process on the dashboard, which walks you through all the steps.

For instance, here we have set up an organization named datamonitor, added a member named data-user1 with the "Owner" role, and created a project named data-demo. This leads us to the following screen:

Setup screen of Langfuse dashboard (Screenshot by author)

This screen displays both the public and secret API keys, which will be used when setting up tracing with the SDKs; keep them saved for future use. With this step, we are finally done configuring the Langfuse server. The only task left is to start the instrumentation process on the code side of our application.
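Before moving on, you can optionally confirm from Python that these keys work against your local server. This is a minimal sketch using the Langfuse client's auth_check method (v2 SDK); the placeholder keys stand in for the ones shown on your settings screen:

from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",   # public key from the project settings screen
    secret_key="sk-lf-...",   # secret key from the project settings screen
    host="http://localhost:3000",
)

# Returns True if the host and key pair are valid
print(langfuse.auth_check())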

Enabling Langfuse Tracing using SDKs

Langfuse offers a straightforward way to enable tracing of LLM applications with minimal lines of code. As mentioned earlier, Langfuse offers tracing solutions for various languages, frameworks, and LLM providers, such as LangChain, LlamaIndex, OpenAI, and others. You can even enable Langfuse tracing in serverless functions such as AWS Lambda.

But before we trace our application, let's actually create a sample application using OpenAI's SDK. We will create a very simple chat completion application using OpenAI's gpt-4o-mini for demonstration purposes only.

First, install the required packages:

$ pip install openai python-dotenv

Then create a minimal chat completion script:

import os
import openai

from dotenv import load_dotenv
load_dotenv()

api_key = os.getenv('OPENAI_KEY', '')
client = openai.OpenAI(api_key=api_key)

country = 'Pakistan'
query = f"Name the capital of {country} in a single phrase only"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": query}],
    max_tokens=100,
)
print(response.choices[0].message.content)

Output:

Islamabad.

Let's now enable Langfuse tracing in the given code. You need to make only minor adjustments, starting with installing the langfuse package.

Install the required packages once more:

$ pip install langfuse openai --upgrade

The code with langfuse enabled looks like this:

import os
#import openai
from langfuse.openai import openai

from dotenv import load_dotenv
load_dotenv()

# Langfuse credentials must be available before the first traced call
LANGFUSE_SECRET_KEY = "sk-lf-..."
LANGFUSE_PUBLIC_KEY = "pk-lf-..."
LANGFUSE_HOST = "http://localhost:3000"

os.environ['LANGFUSE_SECRET_KEY'] = LANGFUSE_SECRET_KEY
os.environ['LANGFUSE_PUBLIC_KEY'] = LANGFUSE_PUBLIC_KEY
os.environ['LANGFUSE_HOST'] = LANGFUSE_HOST

api_key = os.getenv('OPENAI_KEY', '')
client = openai.OpenAI(api_key=api_key)

country = 'Pakistan'
query = f"Name the capital of {country} in a single phrase only"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": query}],
    max_tokens=100,
)
print(response.choices[0].message.content)

You see, we have simply replaced import openai with from langfuse.openai import openai to enable tracing.

If you now go to your Langfuse dashboard, you will observe traces of the OpenAI application.
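The drop-in integration also accepts a few optional, Langfuse-specific arguments on the completion call, which can make the resulting traces easier to navigate. The sketch below assumes the v2 OpenAI integration and the environment variables set above; the trace name, session, and user identifiers are arbitrary examples:

import os
from langfuse.openai import openai

client = openai.OpenAI(api_key=os.getenv('OPENAI_KEY', ''))

response = client.chat.completions.create(
    name="capital-lookup",             # trace name shown on the dashboard
    metadata={"country": "Pakistan"},  # arbitrary key-value metadata
    session_id="session-001",          # groups related traces into a session
    user_id="data-user1",              # attributes the trace to a user
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name the capital of Pakistan in a single phrase only"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# For short-lived scripts, flush buffered events before the process exits
openai.flush_langfuse()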

A Complete End-to-End Example

Now let's dive into enabling monitoring and observability on a complete LLM application. We will implement a RAG pipeline, which fetches relevant context from a vector database. We are going to use ChromaDB as the vector database.

We will use the LangChain framework to build our RAG-based application (refer to the figure above). You can learn LangChain through this tutorial on how to build LLM applications with LangChain.

If you want to learn the fundamentals of RAG, this tutorial is a good place to start. As for the vector database, refer to this tutorial on setting up ChromaDB.

Step 1: Installation and Setup

Install all required packages including langchain, chromadb and langfuse.

pip install -U langchain-community langchain-openai chromadb langfuse

Next, we import all the required packages and libraries:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langfuse.callback import CallbackHandler
from dotenv import load_dotenv

load_dotenv()

The load_dotenv function is used to load all environment variables saved in a .env file. Make sure that your OpenAI secret key is saved as OPENAI_API_KEY in the .env file.

Finally, we set up Langfuse's LangChain callback handler to enable tracing in our application.

langfuse_handler = CallbackHandler(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000"
)
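Optionally, you can verify that the handler can reach your local server with the given keys; the callback handler exposes an auth_check method for this in the v2 SDK:

# Returns True if the keys and host are valid
print(langfuse_handler.auth_check())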

Step 2: Set Up the Knowledge Base

To mimic a RAG system, we will:

  1. Scrape some insightful articles from the Confiz blog section using WebBaseLoader
  2. Break them into smaller chunks using RecursiveCharacterTextSplitter
  3. Convert them into vector embeddings using OpenAI's embeddings
  4. Ingest them into our Chroma vector database. This will serve as the knowledge base for our LLM to look up and answer user queries.

urls = [
    "https://www.confiz.com/blog/a-cios-guide-6-essential-insights-for-a-successful-generative-ai-launch/",
    "https://www.confiz.com/blog/ai-at-work-how-microsoft-365-copilot-chat-is-driving-transformation-at-scale/",
    "https://www.confiz.com/blog/setting-up-an-in-house-llm-platform-best-practices-for-optimal-performance/",
]

loader = WebBaseLoader(urls)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=20,
        length_function=len,
    )
chunks = text_splitter.split_documents(docs)

# Create the vector store
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="chroma_db",
    collection_name="confiz_blog" 
)
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3})

We have assumed a chunk size of 500 characters with an overlap of 20 characters in the RecursiveCharacterTextSplitter, which considers various separators before chunking at the given size. The vectordb object of ChromaDB is converted into a retriever object, allowing us to use it conveniently in the LangChain retrieval pipeline.
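Before wiring the retriever into a chain, it is worth doing a quick sanity check on the knowledge base. This short snippet assumes the docs, chunks, and retriever objects created above and uses a sample question of our own:

# How much content did we actually ingest?
print(f"Loaded {len(docs)} documents and produced {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # preview the first chunk

# Try the retriever on its own with a sample question
for doc in retriever.invoke("How should a CIO plan a generative AI launch?"):
    print(doc.metadata.get("source"), "->", doc.page_content[:80])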

Step 3: Set Up the RAG Pipeline

The next step is to set up the RAG chain, using the power of the LLM together with the knowledge base in the vector database to answer user queries. As before, we will use OpenAI's gpt-4o-mini as our base model.

model = ChatOpenAI(
        model_name="gpt-4o-mini",
    )

template = """
    You are an AI assistant providing helpful information based on the given context.
    Answer the question using only the provided context.
    Context:
    {context}
    Question:
    {question}
    Answer:
    """
    
prompt = PromptTemplate(
        template=template,
        input_variables=["context", "question"]
    )

qa_chain = RetrievalQA.from_chain_type(
        llm=model,
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
    )

We have used RetrievalQA, which implements an end-to-end pipeline comprising document retrieval and the LLM's question-answering capability.

Step 4: Run the RAG Pipeline

It is time to run our RAG pipeline. Let's concoct a few queries related to the articles ingested into ChromaDB and observe the LLM's responses in the Langfuse dashboard.

queries = [
    "What are the ways to deal with compliance and security issues in generative AI?",
    "What are the key considerations for a successful generative AI launch?",
    "What are the key benefits of Microsoft 365 Copilot Chat?",
    "What are the best practices for setting up an in-house LLM platform?",
    ]
for query in queries:
    response = qa_chain.invoke({"query": query}, config={"callbacks": [langfuse_handler]})
    print(response)
    print('-'*60)

As you may have noticed, the callbacks argument passed to qa_chain.invoke is what gives Langfuse the ability to capture traces of the entire RAG pipeline. Langfuse supports various frameworks and LLM libraries, which can be found here.
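If you want related traces grouped under a session or attributed to a user on the dashboard, the callback handler accepts optional identifiers at construction time (argument names as per the v2 LangChain integration; the values below are arbitrary examples):

langfuse_handler = CallbackHandler(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000",
    session_id="rag-demo-session",  # groups the traces of one run together
    user_id="data-user1",           # attributes the traces to a user
)

response = qa_chain.invoke(
    {"query": "What are the key benefits of Microsoft 365 Copilot Chat?"},
    config={"callbacks": [langfuse_handler]},
)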

Step 5: Observing the traces

Finally, it is time to open the Langfuse dashboard in the web browser and reap the fruits of our hard work. If you have followed this tutorial from the start, we created a project named data-demo under the organization named datamonitor. On the landing page of your Langfuse dashboard, you will find this project. Click on it and you will find a dashboard with various panels such as traces and model costs.

Langfuse Dashboard with traces and costs

As shown, you can adjust the time window and add filters according to your needs. The cool part is that you don't have to manually add the LLM's description and input/output token costs to enable cost tracking; Langfuse does it for you automatically. But that is not all; in the left bar, select Tracing > Traces to look at all the individual traces. Since we asked four queries, we will observe four different traces, each representing the complete pipeline for one query.

List of traces on dashboard

Each trace is distinguished by an ID and a timestamp, and includes the corresponding latency and total cost. The usage column shows the total input and output token usage for each trace.

If you click on any of these traces, Langfuse will show the full picture of the underlying processes, such as the inputs and outputs at each stage, covering everything from retrieval to the LLM call and generation. Insightful, isn't it?

Trace details

Evaluation Metrics

As a bonus, let's also add our own custom metrics for the LLM's responses to the same dashboard. On a self-hosted solution, such as the one we have implemented, this can be done by fetching all traces from the server, applying custom evaluation to those traces, and publishing the scores back to the dashboard.

The evaluation can be done by simply employing another LLM with suitable prompts. Alternatively, we can use evaluation frameworks, such as DeepEval or promptfoo, which also use LLMs under the hood. We will go with DeepEval, an open-source framework developed to evaluate LLM responses.

Let's walk through this process in the following steps:

Step 1: Installation and Setup

First, we install the deepeval framework:

$ pip install deepeval

Next, we make the necessary imports:

from langfuse import Langfuse
from datetime import datetime, timedelta
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from dotenv import load_dotenv

load_dotenv()

Step 2: Fetching the traces from the dashboard

The first step is to fetch all the traces, within a given time window, from the running Langfuse server into our Python code.

langfuse_handler = Langfuse(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000"
)

now = datetime.now()
five_am_today = datetime(now.year, now.month, now.day, 5, 0)
five_am_yesterday = five_am_today - timedelta(days=1)

traces_batch = langfuse_handler.fetch_traces(
    limit=5,
    from_timestamp=five_am_yesterday,
    to_timestamp=datetime.now()
).data

print(f"Traces in first batch: {len(traces_batch)}")

Note that we are using the same secret and public keys as before, since we are fetching the traces from our data-demo project. Also note that we are fetching traces from 5 a.m. yesterday until the current time.
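If your project has more traces than the limit allows, you can page through them. The sketch below assumes the page and limit parameters of the v2 fetch_traces API:

all_traces = []
page = 1
while True:
    batch = langfuse_handler.fetch_traces(
        page=page,
        limit=50,
        from_timestamp=five_am_yesterday,
        to_timestamp=datetime.now(),
    ).data
    if not batch:
        break
    all_traces.extend(batch)
    page += 1

print(f"Fetched {len(all_traces)} traces in total")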

Step 3: Applying Evaluation

Once we have the traces, we can apply various evaluation metrics such as bias, toxicity, hallucination, and relevance. For simplicity, let's stick to the AnswerRelevancyMetric only.

def calculate_relevance(trace):
    relevance_model = 'gpt-4o-mini'
    relevancy_metric = AnswerRelevancyMetric(
        threshold=0.7,
        model=relevance_model,
        include_reason=True
    )
    test_case = LLMTestCase(
        input=trace.input['query'],
        actual_output=trace.output['result']
    )
    relevancy_metric.measure(test_case)
    return {"score": relevancy_metric.score, "reason": relevancy_metric.reason}

# Do this for each trace
for trace in traces_batch:
    try:
        relevance_measure = calculate_relevance(trace)
        langfuse_handler.score(
            trace_id=trace.id,
            name="relevance",
            value=relevance_measure['score'],
            comment=relevance_measure['reason']
        )
    except Exception as e:
        print(e)
        continue

In the above code snippet, we have defined the calculate_relevance function to calculate the relevance of a given trace using DeepEval's standard metric. Then we loop over all the traces and calculate each trace's relevance score individually. The langfuse_handler object takes care of logging that score back to the dashboard against each trace ID.

Step 4: Observing the metrics

Now, if you look at the same dashboard as before, the 'Scores' panel has been populated as well.

You will notice that the relevance score has been added to the individual traces as well.

You can also view the feedback provided by DeepEval for each trace individually.

This example showcases a simple way of logging evaluation metrics to the dashboard. Of course, there is more to it in terms of metrics calculation and handling, but let's leave that for the future. Importantly, you may wonder what the most appropriate way is to log evaluation metrics for a running application. For the self-hosted solution, a simple answer is to run the evaluation script as a cron job at specific times. For the enterprise version, Langfuse offers live evaluation of LLM responses as they are populated on the dashboard.

Advanced Features

Langfuse offers many advanced features, such as:

Prompt Management

This allows management and versioning of prompts using the Langfuse dashboard UI. It lets users control evolving prompts as well as record all metrics against each version of a prompt. Moreover, it also supports a prompt playground to tweak prompts and model parameters and observe their effects on the overall LLM response, directly in the Langfuse UI.
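As a rough illustration of the workflow, prompts registered in Langfuse can be fetched and compiled at runtime. The snippet below is a sketch based on the v2 Python SDK; the prompt name and template are our own examples:

from langfuse import Langfuse

langfuse = Langfuse()  # assumes LANGFUSE_* environment variables are set

# Register a versioned prompt (this can also be done from the dashboard UI)
langfuse.create_prompt(
    name="qa-prompt",
    prompt="Answer the question using only the provided context.\nContext:\n{{context}}\nQuestion:\n{{question}}\nAnswer:",
    labels=["production"],
)

# Fetch the current production version and fill in its variables
prompt = langfuse.get_prompt("qa-prompt")
compiled = prompt.compile(context="...", question="What is RAG?")
print(compiled)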

Datasets

The Datasets feature allows users to create a benchmark dataset to measure the performance of the LLM application against different model parameters and tweaked prompts. As new edge cases are reported, they can be fed directly into the existing datasets.
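For a flavour of how this looks in code, here is a sketch using the v2 SDK's dataset methods; the dataset name and item are illustrative:

from langfuse import Langfuse

langfuse = Langfuse()  # assumes LANGFUSE_* environment variables are set

# Create a benchmark dataset and add a test item with an expected answer
langfuse.create_dataset(name="rag-benchmark")
langfuse.create_dataset_item(
    dataset_name="rag-benchmark",
    input={"question": "What are the key benefits of Microsoft 365 Copilot Chat?"},
    expected_output="A summary of the benefits covered in the blog post.",
)

dataset = langfuse.get_dataset("rag-benchmark")
print(f"{dataset.name} has {len(dataset.items)} item(s)")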

User Management

This feature allows organizations to track the costs and metrics associated with each user. It also means that organizations can trace the activity of each user, encouraging fair use of the LLM application.

Conclusion

In this tutorial, we explored LLM monitoring and observability and their related concepts. We implemented monitoring and observability using Langfuse, an open-source framework offering free and enterprise solutions. Choosing the self-hosted option, we set up the Langfuse dashboard using docker, along with a PostgreSQL server for persistence. We then enabled instrumentation in our sample LLM application using the Langfuse Python SDK. Finally, we observed all the traces in the dashboard and also performed evaluation on those traces using the DeepEval framework.

In a future tutorial, we may explore advanced features of the Langfuse framework or other open-source frameworks such as Arize Phoenix. We may also work on deploying the Langfuse dashboard on a cloud service such as Azure, AWS, or GCP.
