Tracking Large Language Models (LLMs) with MLflow: A Complete Guide


As Large Language Models (LLMs) grow in complexity and scale, tracking their performance, experiments, and deployments becomes increasingly difficult. That's where MLflow comes in, providing a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.

In this in-depth guide, we'll explore how to leverage MLflow for tracking, evaluating, and deploying LLMs. We'll cover everything from setting up your environment to advanced evaluation techniques, with plenty of code examples and best practices along the way.

Functionality of MLflow in Large Language Models (LLMs)

MLflow has become a pivotal tool in the machine learning and data science community, especially for managing the lifecycle of machine learning models. When it comes to Large Language Models (LLMs), MLflow offers a robust suite of tools that significantly streamline the process of developing, tracking, evaluating, and deploying these models. Here's an overview of how MLflow functions within the LLM space and the advantages it provides to engineers and data scientists.

Tracking and Managing LLM Interactions

MLflow's LLM tracking system is an enhancement of its existing tracking capabilities, tailored to the unique needs of LLMs. It allows for comprehensive tracking of model interactions, including the following key elements:

  • Parameters: Logging key-value pairs that detail the input parameters for the LLM, such as model-specific parameters like top_k and temperature. This provides context and configuration for each run, ensuring that all aspects of the model's configuration are captured.
  • Metrics: Quantitative measures that provide insights into the performance and accuracy of the LLM. These can be updated dynamically as the run progresses, offering real-time or post-process insights.
  • Predictions: Capturing the inputs sent to the LLM and the corresponding outputs, which are stored as artifacts in a structured format for easy retrieval and analysis.
  • Artifacts: Beyond predictions, MLflow can store various output files such as visualizations, serialized models, and structured data files, allowing for detailed documentation and analysis of the model's performance.

This structured approach ensures that all interactions with the LLM are meticulously recorded, providing comprehensive lineage and quality tracking for text-generating models.

Evaluation of LLMs

Evaluating LLMs presents unique challenges due to their generative nature and the lack of a single ground truth. MLflow simplifies this with specialized evaluation tools designed for LLMs. Key features include:

  • Versatile Model Evaluation: Supports evaluating various types of LLMs, whether it's an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model.
  • Comprehensive Metrics: Offers a range of metrics tailored for LLM evaluation, including both SaaS model-dependent metrics (e.g., answer relevance) and function-based metrics (e.g., ROUGE, Flesch-Kincaid).
  • Predefined Metric Collections: Depending on the use case, such as question-answering or text-summarization, MLflow provides predefined metrics to simplify the evaluation process.
  • Custom Metric Creation: Allows users to define and implement custom metrics to suit specific evaluation needs, enhancing the flexibility and depth of model evaluation.
  • Evaluation with Static Datasets: Enables evaluation of static datasets without specifying a model, which is useful for quick assessments without rerunning model inference.

Deployment and Integration

MLflow also supports seamless deployment and integration of LLMs:

  • MLflow Deployments Server: Acts as a unified interface for interacting with multiple LLM providers. It simplifies integrations, manages credentials securely, and offers a consistent API experience. This server supports a range of foundational models from popular SaaS vendors as well as self-hosted models.
  • Unified Endpoint: Facilitates easy switching between providers without code changes, minimizing downtime and enhancing flexibility.
  • Integrated Results View: Provides comprehensive evaluation results, which can be accessed directly in code or through the MLflow UI for detailed analysis.

MLflow's comprehensive suite of tools and integrations makes it an invaluable asset for engineers and data scientists working with advanced NLP models.

Setting Up Your Environment

Before we dive into tracking LLMs with MLflow, let's set up our development environment. We'll need to install MLflow and several other key libraries:

pip install "mlflow>=2.8.1"
pip install openai
pip install chromadb==0.4.15
pip install langchain==0.0.348
pip install tiktoken
pip install 'mlflow[genai]'
pip install databricks-sdk --upgrade

After installation, it's good practice to restart your Python environment to ensure all libraries are properly loaded. In a Jupyter notebook, you can use:

import mlflow
import chromadb
print(f"MLflow version: {mlflow.__version__}")
print(f"ChromaDB version: {chromadb.__version__}")

This will confirm the versions of the key libraries we'll be using.

Understanding MLflow’s LLM Tracking Capabilities

MLflow's LLM tracking system builds upon its existing tracking capabilities, adding features specifically designed for the unique aspects of LLMs. Let's break down the key components:

Runs and Experiments

In MLflow, a “run” represents a single execution of your model code, while an “experiment” is a group of related runs. For LLMs, a run might represent a single query or a batch of prompts processed by the model.
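As a minimal sketch (the experiment and run names here are just illustrations), grouping related prompt runs under one experiment looks like this:

import mlflow

# Group related LLM runs under a named experiment (created if it doesn't exist)
mlflow.set_experiment("llm-prompt-engineering")

# Each run captures one execution, e.g. a single prompt or a batch of prompts
with mlflow.start_run(run_name="baseline-prompt"):
    mlflow.log_param("prompt_template", "Summarize the following text: {text}")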

Key Tracking Components

  1. Parameters: These are input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log these using mlflow.log_param() or mlflow.log_params().
  2. Metrics: Quantitative measures of your LLM's performance, like accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track these.
  3. Predictions: For LLMs, it's crucial to log both the input prompts and the model's outputs. MLflow stores these as structured table artifacts using mlflow.log_table().
  4. Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to store these.

Let’s take a look at a basic example of logging an LLM run:

This example demonstrates logging parameters, metrics, and the input/output pair as a table artifact.

import mlflow
import openai

# Uses the legacy OpenAI completions API (openai<1.0)
def query_llm(prompt, max_tokens=100):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=max_tokens
    )
    return response.choices[0].text.strip()

with mlflow.start_run():
    prompt = "Explain the concept of machine learning in simple terms."

    # Log parameters
    mlflow.log_param("model", "text-davinci-002")
    mlflow.log_param("max_tokens", 100)

    # Query the LLM and log the result
    result = query_llm(prompt)
    mlflow.log_metric("response_length", len(result))

    # Log the prompt and response as a table artifact
    mlflow.log_table(
        data={"prompt": [prompt], "response": [result]},
        artifact_file="prompt_responses.json",
    )

    print(f"Response: {result}")

Deploying LLMs with MLflow

MLflow provides powerful capabilities for deploying LLMs, making it easier to serve your models in production environments. Let's explore how to deploy an LLM using MLflow's deployment features.

Creating an Endpoint

First, we’ll create an endpoint for our LLM using MLflow’s deployment client:

import mlflow
from mlflow.deployments import get_deploy_client
# Initialize the deployment client
client = get_deploy_client("databricks")
# Define the endpoint configuration
endpoint_name = "llm-endpoint"
endpoint_config = {
    "served_entities": [{
        "name": "gpt-model",
        "external_model": {
            "name": "gpt-3.5-turbo",
            "provider": "openai",
            "task": "llm/v1/completions",
            "openai_config": {
                "openai_api_type": "azure",
                "openai_api_key": "{{secrets/scope/openai_api_key}}",
                "openai_api_base": "{{secrets/scope/openai_api_base}}",
                "openai_deployment_name": "gpt-35-turbo",
                "openai_api_version": "2023-05-15",
            },
        },
    }],
}
# Create the endpoint
client.create_endpoint(name=endpoint_name, config=endpoint_config)

This code sets up an endpoint for a GPT-3.5-turbo model using Azure OpenAI. Note the use of Databricks secrets for secure API key management.

Testing the Endpoint

Once the endpoint is created, we can test it:

response = client.predict(
    endpoint=endpoint_name,
    inputs={"prompt": "Explain the concept of neural networks briefly.", "max_tokens": 100},
)
print(response)

This will send a prompt to our deployed model and return the generated response.

Evaluating LLMs with MLflow

Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including each built-in and custom metrics.

Preparing Your LLM for Evaluation

To evaluate your LLM with mlflow.evaluate(), your model needs to be in one of the following forms:

  1. An mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged MLflow model.
  2. A Python function that takes string inputs and outputs a single string.
  3. An MLflow Deployments endpoint URI.
  4. Set model=None and include model outputs in the evaluation data.

Let's take a look at an example using a logged MLflow model:

import mlflow
import openai
import pandas as pd

with mlflow.start_run():
    system_prompt = "Answer the following question concisely."
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Prepare evaluation data
    eval_data = pd.DataFrame({
        "question": ["What is machine learning?", "Explain neural networks."],
        "ground_truth": [
            "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
            "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information."
        ]
    })

    # Evaluate the model
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"Evaluation metrics: {results.metrics}")

This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow's built-in metrics for question-answering tasks.
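For the static-dataset case (option 4 in the list above), you can set model=None and point mlflow.evaluate() at a column of precomputed outputs. A minimal sketch, assuming the predictions were generated beforehand and that your MLflow version supports the predictions argument:

import mlflow
import pandas as pd

# Evaluation data that already contains the model's outputs
static_data = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "ground_truth": ["MLflow is an open-source platform for managing the machine learning lifecycle."],
    "predictions": ["MLflow is an open-source platform for the ML lifecycle."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        data=static_data,
        targets="ground_truth",
        predictions="predictions",  # column holding the precomputed outputs
        model_type="question-answering",
    )
    print(results.metrics)

This lets you re-score existing outputs without rerunning model inference.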

Custom Evaluation Metrics

MLflow lets you define custom metrics for LLM evaluation. Here's an example of creating a custom metric for evaluating the professionalism of responses:

from mlflow.metrics.genai import EvaluationExample, make_genai_metric
professionalism = make_genai_metric(
    name="professionalism",
    definition="Measure of formal and appropriate communication style.",
    grading_prompt=(
        "Score the professionalism of the response on a scale of 0-4:\n"
        "0: Extremely casual or inappropriate\n"
        "1: Casual but respectful\n"
        "2: Moderately formal\n"
        "3: Professional and appropriate\n"
        "4: Highly formal and expertly crafted"
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
            score=1,
            justification="The response is casual and uses informal language."
        ),
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
            score=4,
            justification="The response is formal, concise, and professionally worded."
        )
    ],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
# Use the custom metric in evaluation
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism]
)
print(f"Professionalism rating: {results.metrics['professionalism_mean']}")

This custom metric uses GPT-3.5-turbo to score the professionalism of responses, demonstrating how you can leverage LLMs themselves for evaluation.
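Beyond the aggregate values in results.metrics, mlflow.evaluate() also records a per-row results table with individual scores and justifications; a quick way to inspect it (the artifact name below is the default in recent MLflow versions, treat it as an assumption):

# Per-row evaluation results, including the judge's scores and justifications
eval_table = results.tables["eval_results_table"]
print(eval_table.head())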

Advanced LLM Evaluation Techniques

As LLMs become more sophisticated, so do the techniques for evaluating them. Let's explore some advanced evaluation methods using MLflow.

Retrieval-Augmented Generation (RAG) Evaluation

RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and generation components. Here's how you can set up a RAG system and evaluate it using MLflow:

import mlflow
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and preprocess documents
loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Evaluation function: returns the answer and the retrieved source passages
def evaluate_rag(question):
    result = qa_chain({"query": question})
    return result["result"], [doc.page_content for doc in result["source_documents"]]

# Prepare evaluation data
eval_questions = [
    "What is MLflow?",
    "How does MLflow handle experiment tracking?",
    "What are the main components of MLflow?"
]

# Evaluate using MLflow
with mlflow.start_run():
    source_counts = []
    for i, question in enumerate(eval_questions):
        answer, sources = evaluate_rag(question)
        source_counts.append(len(sources))

        mlflow.log_param(f"question_{i}", question)
        mlflow.log_metric(f"num_sources_{i}", len(sources))
        mlflow.log_text(answer, f"answer_{i}.txt")

        for j, source in enumerate(sources):
            mlflow.log_text(source, f"source_{i}_{j}.txt")

    # Log custom metrics
    mlflow.log_metric("avg_sources_per_question", sum(source_counts) / len(source_counts))

This example sets up a RAG system using LangChain and Chroma, then evaluates it by logging questions, answers, retrieved sources, and custom metrics to MLflow.

The way you chunk your documents can significantly impact RAG performance. MLflow can help you evaluate different chunking strategies, as sketched below:
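This is a minimal sketch of such a comparison; the chunk sizes, overlaps, and splitter choices are illustrative assumptions, and `documents` refers to the documents loaded in the RAG example above.

import mlflow
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Candidate chunking strategies to compare (illustrative values)
chunk_sizes = [500, 1000, 2000]
chunk_overlaps = [0, 100]
splitters = {
    "character": CharacterTextSplitter,
    "recursive": RecursiveCharacterTextSplitter,
}

for splitter_name, splitter_cls in splitters.items():
    for chunk_size in chunk_sizes:
        for chunk_overlap in chunk_overlaps:
            with mlflow.start_run(run_name=f"{splitter_name}-{chunk_size}-{chunk_overlap}"):
                mlflow.log_params({
                    "splitter": splitter_name,
                    "chunk_size": chunk_size,
                    "chunk_overlap": chunk_overlap,
                })

                # `documents` comes from the WebBaseLoader in the RAG example above
                splitter = splitter_cls(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
                chunks = splitter.split_documents(documents)

                # Simple proxy metrics; plug the chunks into the RAG chain and
                # mlflow.evaluate() for a deeper quality comparison
                mlflow.log_metric("num_chunks", len(chunks))
                if chunks:
                    avg_len = sum(len(c.page_content) for c in chunks) / len(chunks)
                    mlflow.log_metric("avg_chunk_length", avg_len)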

This script evaluates different combinations of chunk sizes, overlaps, and splitting methods, logging the results to MLflow for easy comparison.

MLflow provides various ways to visualize your LLM evaluation results. Here are some techniques:

You can create custom visualizations of your evaluation results using libraries like Matplotlib or Plotly, then log them as artifacts:
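A minimal sketch of one such helper, assuming the runs are grouped under a named experiment and that the metric you pass in was actually logged on those runs:

import matplotlib.pyplot as plt
import mlflow

def plot_metric_across_runs(experiment_name, metric_name, artifact_file="metric_comparison.png"):
    # Fetch all runs in the experiment as a pandas DataFrame
    runs = mlflow.search_runs(experiment_names=[experiment_name])
    metric_col = f"metrics.{metric_name}"
    runs = runs.dropna(subset=[metric_col])

    # Build a simple line plot of the metric over run start times
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(runs["start_time"], runs[metric_col], marker="o")
    ax.set_xlabel("Run start time")
    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} across runs")
    fig.autofmt_xdate()

    # Log the figure as an artifact of a new comparison run
    with mlflow.start_run(run_name="metric-comparison"):
        mlflow.log_figure(fig, artifact_file)
    plt.close(fig)

# Example usage; the experiment and metric names are placeholders
plot_metric_across_runs("llm-experiments", "response_length")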

This function creates a line plot comparing a selected metric across multiple runs and logs it as an artifact.
