Tracking Large Language Models (LLMs) with MLflow: A Complete Guide


As Large Language Models (LLMs) grow in complexity and scale, tracking their performance, experiments, and deployments becomes increasingly difficult. That's where MLflow comes in, providing a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.

In this in-depth guide, we'll explore how to leverage MLflow for tracking, evaluating, and deploying LLMs. We'll cover everything from setting up your environment to advanced evaluation techniques, with plenty of code examples and best practices along the way.

Functionality of MLflow in Large Language Models (LLMs)

MLflow has become a pivotal tool in the machine learning and data science community, especially for managing the lifecycle of machine learning models. When it comes to Large Language Models (LLMs), MLflow offers a robust suite of tools that significantly streamline the process of developing, tracking, evaluating, and deploying these models. Here's an overview of how MLflow functions within the LLM space and the advantages it provides to engineers and data scientists.

Tracking and Managing LLM Interactions

MLflow's LLM tracking system is an enhancement of its existing tracking capabilities, tailored to the unique needs of LLMs. It allows for comprehensive tracking of model interactions, including the following key elements:

  • Parameters: Logging key-value pairs that detail the input parameters for the LLM, such as model-specific parameters like top_k and temperature. This provides context and configuration for each run, ensuring that all aspects of the model's configuration are captured.
  • Metrics: Quantitative measures that provide insights into the performance and accuracy of the LLM. These can be updated dynamically as the run progresses, offering real-time or post-process insights.
  • Predictions: Capturing the inputs sent to the LLM and the corresponding outputs, which are stored as artifacts in a structured format for easy retrieval and analysis.
  • Artifacts: Beyond predictions, MLflow can store various output files such as visualizations, serialized models, and structured data files, allowing for detailed documentation and analysis of the model's performance.

This structured approach ensures that all interactions with the LLM are meticulously recorded, providing comprehensive lineage and quality tracking for text-generating models.

Evaluation of LLMs

Evaluating LLMs presents unique challenges due to their generative nature and the lack of a single ground truth. MLflow simplifies this with specialized evaluation tools designed for LLMs. Key features include:

  • Versatile Model Evaluation: Supports evaluating various types of LLMs, whether it's an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model.
  • Comprehensive Metrics: Offers a range of metrics tailored for LLM evaluation, including both SaaS model-dependent metrics (e.g., answer relevance) and function-based metrics (e.g., ROUGE, Flesch-Kincaid).
  • Predefined Metric Collections: Depending on the use case, such as question-answering or text-summarization, MLflow provides predefined metrics to simplify the evaluation process.
  • Custom Metric Creation: Allows users to define and implement custom metrics to suit specific evaluation needs, enhancing the flexibility and depth of model evaluation.
  • Evaluation with Static Datasets: Enables evaluation of static datasets without specifying a model, which is useful for quick assessments without rerunning model inference.

Deployment and Integration

MLflow also supports seamless deployment and integration of LLMs:

  • MLflow Deployments Server: Acts as a unified interface for interacting with multiple LLM providers. It simplifies integrations, manages credentials securely, and offers a consistent API experience. This server supports a range of foundational models from popular SaaS vendors as well as self-hosted models.
  • Unified Endpoint: Facilitates easy switching between providers without code changes, minimizing downtime and enhancing flexibility.
  • Integrated Results View: Provides comprehensive evaluation results, which can be accessed directly in code or through the MLflow UI for detailed analysis.

MLflow's comprehensive suite of tools and integrations makes it an invaluable asset for engineers and data scientists working with advanced NLP models.

Setting Up Your Environment

Before we dive into tracking LLMs with MLflow, let's set up our development environment. We'll need to install MLflow and several other key libraries:

pip install "mlflow>=2.8.1"
pip install openai
pip install chromadb==0.4.15
pip install langchain==0.0.348
pip install tiktoken
pip install 'mlflow[genai]'
pip install databricks-sdk --upgrade

After installation, it's good practice to restart your Python environment to ensure all libraries are properly loaded. In a Jupyter notebook, you can use:

import mlflow
import chromadb
print(f"MLflow version: {mlflow.__version__}")
print(f"ChromaDB version: {chromadb.__version__}")

This will confirm the versions of the key libraries we'll be using.

Understanding MLflow’s LLM Tracking Capabilities

MLflow's LLM tracking system builds upon its existing tracking capabilities, adding features specifically designed for the unique aspects of LLMs. Let's break down the key components:

Runs and Experiments

In MLflow, a “run” represents a single execution of your model code, while an “experiment” is a group of related runs. For LLMs, a run might represent a single query or a batch of prompts processed by the model.
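As a minimal sketch (the experiment and run names here are just illustrations), grouping related prompt runs under one experiment looks like this:

import mlflow

# Group related LLM runs under a named experiment (created if it doesn't exist)
mlflow.set_experiment("llm-prompt-engineering")

# Each run captures one execution, e.g. a single prompt or a batch of prompts
with mlflow.start_run(run_name="baseline-prompt"):
    mlflow.log_param("prompt_template", "Summarize the following text: {text}")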

Key Tracking Components

  1. Parameters: These are input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log these using mlflow.log_param() or mlflow.log_params().
  2. Metrics: Quantitative measures of your LLM's performance, like accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track these.
  3. Predictions: For LLMs, it's crucial to log both the input prompts and the model's outputs. MLflow stores these as structured table artifacts using mlflow.log_table().
  4. Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to store these.

Let’s take a look at a basic example of logging an LLM run:

This example demonstrates logging parameters, metrics, and the input/output pair as a table artifact.

import mlflow
import openai

# Uses the legacy OpenAI completions API (openai<1.0)
def query_llm(prompt, max_tokens=100):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=max_tokens
    )
    return response.choices[0].text.strip()

with mlflow.start_run():
    prompt = "Explain the concept of machine learning in simple terms."

    # Log parameters
    mlflow.log_param("model", "text-davinci-002")
    mlflow.log_param("max_tokens", 100)

    # Query the LLM and log the result
    result = query_llm(prompt)
    mlflow.log_metric("response_length", len(result))

    # Log the prompt and response as a table artifact
    mlflow.log_table(
        data={"prompt": [prompt], "response": [result]},
        artifact_file="prompt_responses.json",
    )

    print(f"Response: {result}")

Deploying LLMs with MLflow

MLflow provides powerful capabilities for deploying LLMs, making it easier to serve your models in production environments. Let's explore how to deploy an LLM using MLflow's deployment features.

Creating an Endpoint

First, we’ll create an endpoint for our LLM using MLflow’s deployment client:

import mlflow
from mlflow.deployments import get_deploy_client
# Initialize the deployment client
client = get_deploy_client("databricks")
# Define the endpoint configuration
endpoint_name = "llm-endpoint"
endpoint_config = {
    "served_entities": [{
        "name": "gpt-model",
        "external_model": {
            "name": "gpt-3.5-turbo",
            "provider": "openai",
            "task": "llm/v1/completions",
            "openai_config": {
                "openai_api_type": "azure",
                "openai_api_key": "{{secrets/scope/openai_api_key}}",
                "openai_api_base": "{{secrets/scope/openai_api_base}}",
                "openai_deployment_name": "gpt-35-turbo",
                "openai_api_version": "2023-05-15",
            },
        },
    }],
}
# Create the endpoint
client.create_endpoint(name=endpoint_name, config=endpoint_config)

This code sets up an endpoint for a GPT-3.5-turbo model using Azure OpenAI. Note the use of Databricks secrets for secure API key management.

Testing the Endpoint

Once the endpoint is created, we can test it:

response = client.predict(
    endpoint=endpoint_name,
    inputs={"prompt": "Explain the concept of neural networks briefly.", "max_tokens": 100},
)
print(response)

This will send a prompt to our deployed model and return the generated response.

Evaluating LLMs with MLflow

Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including each built-in and custom metrics.

Preparing Your LLM for Evaluation

To evaluate your LLM with mlflow.evaluate(), your model needs to be in one of the following forms:

  1. An mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged MLflow model.
  2. A Python function that takes string inputs and outputs a single string.
  3. An MLflow Deployments endpoint URI.
  4. Set model=None and include model outputs in the evaluation data.

Let's take a look at an example using a logged MLflow model:

import mlflow
import openai
import pandas as pd

with mlflow.start_run():
    system_prompt = "Answer the following question concisely."
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Prepare evaluation data
    eval_data = pd.DataFrame({
        "question": ["What is machine learning?", "Explain neural networks."],
        "ground_truth": [
            "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
            "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information."
        ]
    })

    # Evaluate the model
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"Evaluation metrics: {results.metrics}")

This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow's built-in metrics for question-answering tasks.
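For the static-dataset case (option 4 in the list above), you can set model=None and point mlflow.evaluate() at a column of precomputed outputs. A minimal sketch, assuming the predictions were generated beforehand and that your MLflow version supports the predictions argument:

import mlflow
import pandas as pd

# Evaluation data that already contains the model's outputs
static_data = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "ground_truth": ["MLflow is an open-source platform for managing the machine learning lifecycle."],
    "predictions": ["MLflow is an open-source platform for the ML lifecycle."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        data=static_data,
        targets="ground_truth",
        predictions="predictions",  # column holding the precomputed outputs
        model_type="question-answering",
    )
    print(results.metrics)

This lets you re-score existing outputs without rerunning model inference.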

Custom Evaluation Metrics

MLflow lets you define custom metrics for LLM evaluation. Here's an example of creating a custom metric for evaluating the professionalism of responses:

from mlflow.metrics.genai import EvaluationExample, make_genai_metric
professionalism = make_genai_metric(
    name="professionalism",
    definition="Measure of formal and appropriate communication style.",
    grading_prompt=(
        "Score the professionalism of the response on a scale of 0-4:\n"
        "0: Extremely casual or inappropriate\n"
        "1: Casual but respectful\n"
        "2: Moderately formal\n"
        "3: Professional and appropriate\n"
        "4: Highly formal and expertly crafted"
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
            score=1,
            justification="The response is casual and uses informal language."
        ),
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
            score=4,
            justification="The response is formal, concise, and professionally worded."
        )
    ],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
# Use the custom metric in evaluation
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism]
)
print(f"Professionalism rating: {results.metrics['professionalism_mean']}")

This custom metric uses GPT-3.5-turbo to score the professionalism of responses, demonstrating how you can leverage LLMs themselves for evaluation.
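Beyond the aggregate values in results.metrics, mlflow.evaluate() also records a per-row results table with individual scores and justifications; a quick way to inspect it (the artifact name below is the default in recent MLflow versions, treat it as an assumption):

# Per-row evaluation results, including the judge's scores and justifications
eval_table = results.tables["eval_results_table"]
print(eval_table.head())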

Advanced LLM Evaluation Techniques

As LLMs become more sophisticated, so do the techniques for evaluating them. Let's explore some advanced evaluation methods using MLflow.

Retrieval-Augmented Generation (RAG) Evaluation

RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and generation components. Here's how you can set up a RAG system and evaluate it using MLflow:

import mlflow
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and preprocess documents
loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Evaluation function: returns the answer and the retrieved source passages
def evaluate_rag(question):
    result = qa_chain({"query": question})
    return result["result"], [doc.page_content for doc in result["source_documents"]]

# Prepare evaluation data
eval_questions = [
    "What is MLflow?",
    "How does MLflow handle experiment tracking?",
    "What are the main components of MLflow?"
]

# Evaluate using MLflow
with mlflow.start_run():
    source_counts = []
    for i, question in enumerate(eval_questions):
        answer, sources = evaluate_rag(question)
        source_counts.append(len(sources))

        mlflow.log_param(f"question_{i}", question)
        mlflow.log_metric(f"num_sources_{i}", len(sources))
        mlflow.log_text(answer, f"answer_{i}.txt")

        for j, source in enumerate(sources):
            mlflow.log_text(source, f"source_{i}_{j}.txt")

    # Log custom metrics
    mlflow.log_metric("avg_sources_per_question", sum(source_counts) / len(source_counts))

This example sets up a RAG system using LangChain and Chroma, then evaluates it by logging questions, answers, retrieved sources, and custom metrics to MLflow.

The way you chunk your documents can significantly impact RAG performance. MLflow can help you evaluate different chunking strategies, as sketched below:
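This is a minimal sketch of such a comparison; the chunk sizes, overlaps, and splitter choices are illustrative assumptions, and `documents` refers to the documents loaded in the RAG example above.

import mlflow
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Candidate chunking strategies to compare (illustrative values)
chunk_sizes = [500, 1000, 2000]
chunk_overlaps = [0, 100]
splitters = {
    "character": CharacterTextSplitter,
    "recursive": RecursiveCharacterTextSplitter,
}

for splitter_name, splitter_cls in splitters.items():
    for chunk_size in chunk_sizes:
        for chunk_overlap in chunk_overlaps:
            with mlflow.start_run(run_name=f"{splitter_name}-{chunk_size}-{chunk_overlap}"):
                mlflow.log_params({
                    "splitter": splitter_name,
                    "chunk_size": chunk_size,
                    "chunk_overlap": chunk_overlap,
                })

                # `documents` comes from the WebBaseLoader in the RAG example above
                splitter = splitter_cls(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
                chunks = splitter.split_documents(documents)

                # Simple proxy metrics; plug the chunks into the RAG chain and
                # mlflow.evaluate() for a deeper quality comparison
                mlflow.log_metric("num_chunks", len(chunks))
                if chunks:
                    avg_len = sum(len(c.page_content) for c in chunks) / len(chunks)
                    mlflow.log_metric("avg_chunk_length", avg_len)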

This script evaluates different combinations of chunk sizes, overlaps, and splitting methods, logging the results to MLflow for easy comparison.

MLflow provides various ways to visualize your LLM evaluation results. Here are some techniques:

You can create custom visualizations of your evaluation results using libraries like Matplotlib or Plotly, then log them as artifacts:
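A minimal sketch of one such helper, assuming the runs are grouped under a named experiment and that the metric you pass in was actually logged on those runs:

import matplotlib.pyplot as plt
import mlflow

def plot_metric_across_runs(experiment_name, metric_name, artifact_file="metric_comparison.png"):
    # Fetch all runs in the experiment as a pandas DataFrame
    runs = mlflow.search_runs(experiment_names=[experiment_name])
    metric_col = f"metrics.{metric_name}"
    runs = runs.dropna(subset=[metric_col])

    # Build a simple line plot of the metric over run start times
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(runs["start_time"], runs[metric_col], marker="o")
    ax.set_xlabel("Run start time")
    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} across runs")
    fig.autofmt_xdate()

    # Log the figure as an artifact of a new comparison run
    with mlflow.start_run(run_name="metric-comparison"):
        mlflow.log_figure(fig, artifact_file)
    plt.close(fig)

# Example usage; the experiment and metric names are placeholders
plot_metric_across_runs("llm-experiments", "response_length")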

This function creates a line plot comparing a selected metric across multiple runs and logs it as an artifact.
