As Large Language Models (LLMs) grow in complexity and scale, tracking their performance, experiments, and deployments becomes increasingly difficult. This is where MLflow comes in: it provides a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.
In this in-depth guide, we'll explore how to leverage MLflow for tracking, evaluating, and deploying LLMs. We'll cover everything from setting up your environment to advanced evaluation techniques, with plenty of code examples and best practices along the way.
Setting Up Your Environment
Before we dive into tracking LLMs with MLflow, let's set up our development environment. We'll need to install MLflow and several other key libraries:
pip install "mlflow>=2.8.1"
pip install openai
pip install chromadb==0.4.15
pip install langchain==0.0.348
pip install tiktoken
pip install 'mlflow[genai]'
pip install databricks-sdk --upgrade
After installation, it's good practice to restart your Python environment to ensure all libraries are properly loaded. In a Jupyter notebook, you can use:
import mlflow
import chromadb

print(f"MLflow version: {mlflow.__version__}")
print(f"ChromaDB version: {chromadb.__version__}")
This will confirm the versions of the key libraries we'll be using.
Understanding MLflow’s LLM Tracking Capabilities
MLflow's LLM tracking system builds on its existing tracking capabilities, adding features specifically designed for the unique aspects of LLMs. Let's break down the key components:
Runs and Experiments
In MLflow, a “run” represents a single execution of your model code, while an “experiment” is a group of related runs. For LLMs, a run might represent a single query or a batch of prompts processed by the model.
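To make this concrete, here is a minimal sketch of grouping two runs under one experiment; the experiment name, run names, and the single logged parameter are placeholders chosen for illustration, not part of the original article.

import mlflow

# Group related runs under a named experiment (created if it doesn't already exist).
# "llm-prompt-experiments" is a placeholder name.
mlflow.set_experiment("llm-prompt-experiments")

for temperature in (0.2, 0.8):
    # Each run captures one configuration of the same prompt/model combination
    with mlflow.start_run(run_name=f"temperature-{temperature}"):
        mlflow.log_param("temperature", temperature)

Each iteration produces a separate run in the MLflow UI, so the two temperature settings can be compared side by side.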
Key Tracking Components
- Parameters: These are input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log these using mlflow.log_param() or mlflow.log_params().
- Metrics: Quantitative measures of your LLM's performance, like accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track these.
- Predictions: For LLMs, it's crucial to log both the input prompts and the model's outputs. MLflow stores these as artifacts in CSV format using mlflow.log_table().
- Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to store these (see the sketch after this list).
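Since the larger example below doesn't use mlflow.log_artifact(), here is a minimal sketch of artifact logging under stated assumptions: the file name and its contents are hypothetical placeholders.

import mlflow

with mlflow.start_run():
    # Write a hypothetical prompt template to disk...
    template = "Answer the following question concisely:\n{question}"
    with open("prompt_template.txt", "w") as f:
        f.write(template)

    # ...and attach it to the active run, where it appears under the run's artifacts
    mlflow.log_artifact("prompt_template.txt")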
Let’s take a look at a basic example of logging an LLM run:
The example demonstrates logging parameters, metrics, and the prompt/response pair as a table artifact.
import mlflow
import openai

def query_llm(prompt, max_tokens=100):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=max_tokens
    )
    return response.choices[0].text.strip()

with mlflow.start_run():
    prompt = "Explain the concept of machine learning in simple terms."

    # Log parameters
    mlflow.log_param("model", "text-davinci-002")
    mlflow.log_param("max_tokens", 100)

    # Query the LLM and log the result
    result = query_llm(prompt)
    mlflow.log_metric("response_length", len(result))

    # Log the prompt and response
    mlflow.log_table("prompt_responses", {"prompt": [prompt], "response": [result]})

    print(f"Response: {result}")
Deploying LLMs with MLflow
MLflow provides powerful capabilities for deploying LLMs, making it easier to serve your models in production environments. Let's explore how to deploy an LLM using MLflow's deployment features.
Creating an Endpoint
First, we’ll create an endpoint for our LLM using MLflow’s deployment client:
import mlflow
from mlflow.deployments import get_deploy_client

# Initialize the deployment client
client = get_deploy_client("databricks")

# Define the endpoint configuration
endpoint_name = "llm-endpoint"
endpoint_config = {
    "served_entities": [{
        "name": "gpt-model",
        "external_model": {
            "name": "gpt-3.5-turbo",
            "provider": "openai",
            "task": "llm/v1/completions",
            "openai_config": {
                "openai_api_type": "azure",
                "openai_api_key": "{{secrets/scope/openai_api_key}}",
                "openai_api_base": "{{secrets/scope/openai_api_base}}",
                "openai_deployment_name": "gpt-35-turbo",
                "openai_api_version": "2023-05-15",
            },
        },
    }],
}

# Create the endpoint
client.create_endpoint(name=endpoint_name, config=endpoint_config)
This code sets up an endpoint for a GPT-3.5-turbo model using Azure OpenAI. Note the use of Databricks secrets for secure API key management.
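If the referenced secrets don't exist yet, they can be created ahead of time. The sketch below is one way to do this with the Databricks SDK installed earlier; the scope name "scope" mirrors the placeholder in the configuration above, and the key values are assumptions you would replace with your own.

from databricks.sdk import WorkspaceClient

# Hedged sketch: create the secret scope and keys referenced above as
# {{secrets/scope/...}}. Scope, key names, and values are placeholders.
w = WorkspaceClient()
w.secrets.create_scope(scope="scope")
w.secrets.put_secret(scope="scope", key="openai_api_key", string_value="<your-azure-openai-key>")
w.secrets.put_secret(scope="scope", key="openai_api_base", string_value="https://<your-resource>.openai.azure.com/")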
Testing the Endpoint
Once the endpoint is created, we can test it:
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "prompt": "Explain the concept of neural networks briefly.",
        "max_tokens": 100,
    },
)
print(response)
This will send a prompt to our deployed model and return the generated response.
Evaluating LLMs with MLflow
Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including both built-in and custom metrics.
Preparing Your LLM for Evaluation
To evaluate your LLM with mlflow.evaluate(), your model must be in one of these forms:

- An mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged MLflow model.
- A Python function that takes string inputs and outputs a single string.
- An MLflow Deployments endpoint URI.
- model=None, with the model outputs included in the evaluation data.

Let's take a look at an example using a logged MLflow model:
import mlflow
import openai
import pandas as pd

with mlflow.start_run():
    system_prompt = "Answer the following question concisely."
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Prepare evaluation data
    eval_data = pd.DataFrame({
        "question": ["What is machine learning?", "Explain neural networks."],
        "ground_truth": [
            "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
            "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information."
        ]
    })

    # Evaluate the model
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

    print(f"Evaluation metrics: {results.metrics}")

This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow's built-in metrics for question-answering tasks.
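The last option in the list above (model=None) is useful when you already have model outputs on hand. Here is a minimal sketch of that pattern under stated assumptions: the "outputs" column name and its contents are illustrative placeholders, and the outputs are passed via the predictions argument instead of calling a model.

import mlflow
import pandas as pd

# Hedged sketch: score precomputed outputs without calling a model.
static_data = pd.DataFrame({
    "question": ["What is MLflow?"],
    "outputs": ["MLflow is an open-source platform for managing the machine learning lifecycle."],
    "ground_truth": ["MLflow is an open-source platform for the machine learning lifecycle."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        data=static_data,
        predictions="outputs",  # column holding the precomputed outputs
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)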
Custom Evaluation Metrics
MLflow lets you define custom metrics for LLM evaluation. Here's an example of creating a custom metric that evaluates the professionalism of responses:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism = make_genai_metric(
    name="professionalism",
    definition="Measure of formal and appropriate communication style.",
    grading_prompt=(
        "Score the professionalism of the answer on a scale of 0-4:\n"
        "0: Extremely casual or inappropriate\n"
        "1: Casual but respectful\n"
        "2: Moderately formal\n"
        "3: Professional and appropriate\n"
        "4: Highly formal and expertly crafted"
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
            score=1,
            justification="The response is casual and uses informal language."
        ),
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
            score=4,
            justification="The response is formal, concise, and professionally worded."
        )
    ],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

# Use the custom metric in evaluation
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism]
)

print(f"Professionalism score: {results.metrics['professionalism_mean']}")

This custom metric uses GPT-3.5-turbo to score the professionalism of responses, demonstrating how you can leverage LLMs themselves for evaluation.
Advanced LLM Evaluation Techniques
As LLMs become more sophisticated, so do the techniques for evaluating them. Let's explore some advanced evaluation methods using MLflow.
Retrieval-Augmented Generation (RAG) Evaluation
RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and the generation components. Here's how you can set up a RAG system and evaluate it using MLflow:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import mlflow

# Load and preprocess documents
loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Evaluation function
def evaluate_rag(question):
    result = qa_chain({"query": question})
    return result["result"], [doc.page_content for doc in result["source_documents"]]

# Prepare evaluation data
eval_questions = [
    "What is MLflow?",
    "How does MLflow handle experiment tracking?",
    "What are the main components of MLflow?"
]

# Evaluate using MLflow
with mlflow.start_run():
    source_counts = []
    for i, question in enumerate(eval_questions):
        answer, sources = evaluate_rag(question)
        source_counts.append(len(sources))

        # Use indexed keys so each question gets its own parameter and artifacts
        mlflow.log_param(f"question_{i}", question)
        mlflow.log_metric(f"num_sources_{i}", len(sources))
        mlflow.log_text(answer, f"answer_{i}.txt")
        for j, source in enumerate(sources):
            mlflow.log_text(source, f"source_{i}_{j}.txt")

    # Log custom metrics
    mlflow.log_metric("avg_sources_per_question", sum(source_counts) / len(eval_questions))
This example sets up a RAG system using LangChain and Chroma, then evaluates it by logging questions, answers, retrieved sources, and custom metrics to MLflow.
The way you chunk your documents can significantly impact RAG performance. MLflow can help you evaluate different chunking strategies:
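The original script is not reproduced here, so the following is a hedged sketch of what such a comparison could look like. It assumes the documents and eval_questions variables from the RAG example above are still in scope, and the specific chunk sizes, overlaps, splitter classes, and the answer-length proxy metric are illustrative choices rather than recommendations.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import mlflow

# Illustrative grid of chunking strategies (placeholder values)
chunk_sizes = [500, 1000]
chunk_overlaps = [0, 100]
splitters = {
    "character": CharacterTextSplitter,
    "recursive": RecursiveCharacterTextSplitter,
}

embeddings = OpenAIEmbeddings()
llm = OpenAI(temperature=0)

for splitter_name, splitter_cls in splitters.items():
    for chunk_size in chunk_sizes:
        for chunk_overlap in chunk_overlaps:
            run_name = f"{splitter_name}-{chunk_size}-{chunk_overlap}"
            with mlflow.start_run(run_name=run_name):
                # Log the chunking configuration for comparison in the UI
                mlflow.log_params({
                    "splitter": splitter_name,
                    "chunk_size": chunk_size,
                    "chunk_overlap": chunk_overlap,
                })

                # Re-split the previously loaded documents with this configuration
                splitter = splitter_cls(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
                texts = splitter.split_documents(documents)
                mlflow.log_metric("num_chunks", len(texts))

                # Rebuild the retriever and QA chain for this configuration
                vectorstore = Chroma.from_documents(texts, embeddings)
                qa_chain = RetrievalQA.from_chain_type(
                    llm=llm,
                    chain_type="stuff",
                    retriever=vectorstore.as_retriever(),
                )

                # Simple proxy metric: average answer length over the evaluation questions
                answer_lengths = [len(qa_chain({"query": q})["result"]) for q in eval_questions]
                mlflow.log_metric("avg_answer_length", sum(answer_lengths) / len(answer_lengths))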
This script evaluates different combinations of chunk sizes, overlaps, and splitting methods, logging the results to MLflow for easy comparison.
MLflow provides various ways to visualize your LLM evaluation results. Here are some techniques:
You can create custom visualizations of your evaluation results using libraries like Matplotlib or Plotly, then log them as artifacts:
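The original function is not shown here, so the following is a hedged sketch under stated assumptions: it takes an experiment ID and a metric name, pulls run data with mlflow.search_runs (which returns a DataFrame with metrics.<name> columns), draws a Matplotlib line plot, and logs the image to the active run. The function name and defaults are placeholders.

import matplotlib.pyplot as plt
import mlflow

def plot_metric_across_runs(experiment_id, metric_name, output_file="metric_comparison.png"):
    # Fetch all runs in the experiment as a DataFrame
    runs = mlflow.search_runs(experiment_ids=[experiment_id])
    metric_values = runs[f"metrics.{metric_name}"]

    # Line plot of the metric across runs
    plt.figure(figsize=(8, 4))
    plt.plot(range(len(metric_values)), metric_values, marker="o")
    plt.xlabel("Run index")
    plt.ylabel(metric_name)
    plt.title(f"{metric_name} across runs")
    plt.tight_layout()
    plt.savefig(output_file)
    plt.close()

    # Attach the plot image to the currently active run
    mlflow.log_artifact(output_file)

# Example usage (experiment ID and metric name are placeholders):
# with mlflow.start_run():
#     plot_metric_across_runs("0", "response_length")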
This function creates a line plot comparing a specific metric across multiple runs and logs it as an artifact.