Traceability & Reproducibility

Our motivation: Things can go wrong
Our solution: Traceability by design
Solution design for real-time inference model
Traceability on real-time inference model
Reproducibility: Roll-back

In the context of MLOps, traceability is the ability to trace the history of the data, the code used for training and prediction, the model artifacts, and the environment used in development and deployment. Reproducibility is the ability to reproduce the same results by tracking the history of the data and the code version.

Traceability allows us to debug the code easily when unexpected results are seen in production. Reproducibility allows us to generate the same model from the same data and code version, which guarantees consistency, so rolling back to a previous model version is always easy. Both are key aspects of MLOps that ensure the transparency and reliability of ML models.

In our MLOps maturity assessment, we also include traceability among our checklists. As we describe there, what we want to achieve is the following:

For any given machine learning model run/deployment in any environment, it is possible to look up unambiguously:

corresponding code/commit on git
infrastructure used for training and serving
environment used for training and serving
ML model artifacts
what data was used to train the model.
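
One way to make this checklist concrete is to treat it as a record that must be filled in completely for every run. This is a minimal sketch, not part of our solution: the `TraceRecord` class, its field names, the `missing_fields` helper, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class TraceRecord:
    """One entry per model run or deployment (hypothetical structure)."""
    git_commit: str          # corresponding code/commit on git
    infrastructure: str      # infrastructure used for training and serving
    environment: str         # environment used for training and serving
    model_artifact_uri: str  # ML model artifacts
    training_data_ref: str   # what data was used to train the model


def missing_fields(record: TraceRecord) -> List[str]:
    """Return the names of checklist items that are empty for this record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example values are made up for illustration.
complete = TraceRecord(
    git_commit="3f2a9c1",
    infrastructure="Standard_D4s_v5 cluster",
    environment="12.2.x-cpu-ml-scala2.12",
    model_artifact_uri="models:/amazon-recommender/3",
    training_data_ref="snapshot-2023-01-01",
)
incomplete = TraceRecord(
    git_commit="3f2a9c1",
    infrastructure="Standard_D4s_v5 cluster",
    environment="12.2.x-cpu-ml-scala2.12",
    model_artifact_uri="models:/amazon-recommender/3",
    training_data_ref="",
)
```

A run whose record has any missing field cannot be traced unambiguously, so a check like this could fail a pipeline early.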

Machine learning models running in production can fail in different ways. They can provide incorrect predictions or produce biased results. Often, such unexpected behaviour is difficult to detect, especially if the operation appears to be running successfully. In case of an incident, traceability allows us to find the root cause of the issue and take quick action. We can easily find which code version was responsible for training and prediction, and which data was used.

On the grocery web shop, the recommended products section might recommend cat food to customers who purchased dog toys.
Demand forecast algorithms might calculate high demand for ice cream during a winter storm, leading to empty shelves for winter essentials like bread and milk.
Personalised offers might recommend meat products to all customers, including those who have been vegan for years and have never purchased meat or dairy.

In our previous article, we described our MLOps toolkit and its essential components:

Our solution landscape consists of GitHub (version control), GitHub Actions (CI/CD), Databricks Workflows (Orchestrator), MLflow (Model registry), ACR (container registry), Databricks workflow (compute), AKS (serving).

Below, our solution is described step by step. The traceability components of each step are described in the following section.

1. We use GitHub as our code base, where we create and release new versions.

2. CI pipelines are triggered automatically when there is a PR to the main branches. The CD pipeline releases a new version and creates or updates the Databricks job for training and/or prediction. Azure Blob Storage is our data lake and is mounted to the Databricks workspace.

3. After training, model artifacts are saved in MLflow, and a new model version is registered.

4. The GitHub Actions CD pipeline retrieves the model artifact from MLflow and builds a Docker image with it. The Docker image is pushed to Azure Container Registry.

5. The last step is to deploy the container from ACR to AKS to run our application.

At each release, within the GitHub Actions pipeline, we export GIT_SHA as an environment variable and use jinja2 to update the Databricks job JSON definition, so that GIT_SHA becomes available to the Python jobs running on Databricks. This tells us which version of the code was used in that specific job.

We deploy training jobs to Databricks via a JSON configuration using GitHub Actions.

# example of GitHub Actions steps
steps:
  - name: Setup env vars
    id: setup_env_vars
    run: |
      echo "GIT_SHA=${{ github.sha }}" >> $GITHUB_ENV
      echo "DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN }}" >> $GITHUB_ENV
      echo "DATABRICKS_HOST=${{ secrets.DATABRICKS_HOST }}" >> $GITHUB_ENV

# example of job cluster definition in Databricks job json
"job_clusters": [
  {
    "job_cluster_key": "recommender_cluster",
    "new_cluster": {
      "spark_version": "12.2.x-cpu-ml-scala2.12",
      "node_type_id": "Standard_D4s_v5",
      "spark_conf": {
        "spark.speculation": true
      },
      "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE"
      },
      "autoscale": {
        "min_workers": 2,
        "max_workers": 4
      },
      "spark_env_vars": {
        "DATABRICKS_HOST": "{{ DATABRICKS_HOST }}",
        "GIT_SHA": "{{ GIT_SHA }}"
      }
    }
  }
]
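
The templating step described above can be sketched in a few lines. The version below is a stand-in that uses only the Python standard library instead of jinja2, so it is not our actual pipeline code; the `render_job_definition` function, the template fragment, and the example values are all assumptions. It substitutes only the names it is given and leaves any other `{{...}}` placeholder untouched, which matters because Databricks uses the same brace syntax for its own task parameter variables.

```python
import json
import re

# Hypothetical template fragment mirroring the job definition above.
JOB_TEMPLATE = """
{
  "spark_env_vars": {
    "DATABRICKS_HOST": "{{ DATABRICKS_HOST }}",
    "GIT_SHA": "{{ GIT_SHA }}"
  },
  "parameters": ["--job_id", "{{job_id}}"]
}
"""

# Matches {{ NAME }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_job_definition(template, values):
    """Fill in the known placeholders and return the parsed job definition."""

    def substitute(match):
        name = match.group(1)
        # Names we have no value for (e.g. {{job_id}}) are left as they are.
        return values.get(name, match.group(0))

    # Values are inserted verbatim, so they must not contain JSON special characters.
    return json.loads(PLACEHOLDER.sub(substitute, template))


job = render_job_definition(
    JOB_TEMPLATE,
    {
        "GIT_SHA": "3f2a9c1",  # made-up commit hash
        "DATABRICKS_HOST": "https://example.azuredatabricks.net",  # made-up host
    },
)
```

Parsing the rendered text with `json.loads` doubles as a cheap validity check before the definition is sent to Databricks.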

Each Databricks job and job run has a job_id and a run_id as unique identifiers. Both can be made available inside a Python script that runs as a Databricks job through task parameter variables such as {{job_id}} and {{run_id}}.

It is also possible to retrieve the run_id using dbutils. We don't use it, since we want the Python script to be runnable locally too.

{{job_id}} : The unique identifier assigned to a job

{{run_id}} : The unique identifier assigned to a job run

{{parent_run_id}} : The unique identifier assigned to the run of a job with multiple tasks.

{{task_key}} : The unique name assigned to a task that’s a part of a job with multiple tasks.

Databricks task parameters

"spark_python_task": {
  "python_file": "recommender/train_recommender.py",
  "parameters": [
    "--job_id",
    "{{job_id}}",
    "--run_id",
    "{{parent_run_id}}"
  ]
}

Example Python code to capture run_id and job_id from a Databricks job run:

import argparse
import os


def get_arguments():
    # Parse the Databricks identifiers passed in as task parameters.
    parser = argparse.ArgumentParser(description='reads default arguments')
    parser.add_argument('--run_id', metavar='run_id', type=str, help='Databricks run id')
    parser.add_argument('--job_id', metavar='job_id', type=str, help='Databricks job id')
    args = parser.parse_args()
    return args.run_id, args.job_id


run_id, job_id = get_arguments()
# GIT_SHA is set on the cluster through spark_env_vars in the job definition.
git_sha = os.environ['GIT_SHA']
project_name = 'amazon-recommender'
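
Since we want the same script to run locally, where there are no Databricks task parameters and GIT_SHA may not be set, the argument handling can fall back to placeholder values. This is a minimal sketch under that assumption: the `get_run_context` function and the "local" and "unknown" defaults are hypothetical and not taken from our code base.

```python
import argparse
import os


def get_run_context(argv=None, environ=None):
    """Return (run_id, job_id, git_sha), with placeholders for local runs."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="reads traceability arguments")
    parser.add_argument("--run_id", type=str, default="local", help="Databricks run id")
    parser.add_argument("--job_id", type=str, default="local", help="Databricks job id")
    args = parser.parse_args(argv)
    # On Databricks, GIT_SHA comes from spark_env_vars; locally it may be absent.
    git_sha = environ.get("GIT_SHA", "unknown")
    return args.run_id, args.job_id, git_sha


# Simulated local run: no command-line arguments and no GIT_SHA set.
local_context = get_run_context(argv=[], environ={})

# Simulated Databricks run: identifiers arrive as task parameters.
databricks_context = get_run_context(
    argv=["--job_id", "123", "--run_id", "456"],
    environ={"GIT_SHA": "3f2a9c1"},
)
```

Passing `argv` and `environ` explicitly also makes the function easy to unit test without touching the real process environment.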

When registering models to the MLflow Model Registry, we add the attributes GIT_SHA, DBR_RUN_ID, and DBR_JOB_ID as tags.

import mlflow

# AmazonProductRecommender, AmazonProductRecommenderWrapper and spark_df
# are defined elsewhere in the project.
mlflow.set_experiment(experiment_name='/Shared/Amazon_recommender')
with mlflow.start_run(run_name="amazon-recommender") as run:
    recom_model = AmazonProductRecommender(spark_df=spark_df).train()
    wrapped_model = AmazonProductRecommenderWrapper(recom_model)
    mlflow_run_id = run.info.run_id
    tags = {
        "GIT_SHA": git_sha,
        "MLFLOW_RUN_ID": mlflow_run_id,
        "DBR_JOB_ID": job_id,
        "DBR_RUN_ID": run_id,
    }
    # Tag the MLflow run, then log and register the model with the same tags.
    mlflow.set_tags(tags)
    mlflow.pyfunc.log_model("model", python_model=wrapped_model)
    model_uri = f"runs:/{mlflow_run_id}/model"
    mlflow.register_model(model_uri, project_name, tags=tags)

The model registered with tags looks like this:

The model can be downloaded from the MLflow registry by run id and copied into the Docker image at the docker build step.

Example code:

from mlflow.tracking.client import MlflowClient
from mlflow.store.artifact.models_artifact_repo import ModelsArtifactRepository

client = MlflowClient()
# Find the model version that was registered by the given Databricks run.
model_version = client.search_model_versions(
    f"name='{project_name}' and tag.DBR_RUN_ID = '{run_id}'"
)[0].version
# Download the artifacts of that version into download_path.
ModelsArtifactRepository(
    f"models:/{project_name}/{model_version}"
).download_artifacts(artifact_path="", dst_path=download_path)

The attributes saved to the model registry, git_sha and run_id, are also included in the Docker image as environment variables. For full traceability, each response body is saved to the logging system together with the git_sha and run_id available in the environment.

Example code: git_sha and run_id are passed as environment variables via deployment manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: "amazon-recommender"
spec:
  containers:
    - name: amazon-recommender-api
      image: /amazon-recommender:-
      env:
        - name: git_sha
          value:
        - name: run_id
          value:

Example code: Environment variables are then accessible in a FastAPI app

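import os
from typing import List
from fastapi import FastAPI, Query

app = FastAPI()

@app.get("/predict/{query}")
def read_item(basket: List[str] = Query(None)):
    # `model` is assumed to be loaded from the artifact copied into the image.
    return {
        "recommended_items": model.predict(basket),
        "run_id": os.environ["run_id"],
        "git_sha": os.environ["git_sha"],
    }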
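
Saving each response body to the logging system together with git_sha and run_id could look roughly like the following. This is a sketch using only the standard `logging` and `json` modules: the logger name, the JSON layout, the `build_logger` and `log_response` helpers, and the in-memory stream are assumptions, and a real deployment would write to whatever logging backend is in use.

```python
import io
import json
import logging
import os


def build_logger(stream):
    """Create a logger that writes one line per response to the given stream."""
    logger = logging.getLogger("recommender.responses")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(stream))
    logger.propagate = False
    return logger


def log_response(logger, response_body, environ=None):
    """Log the response body enriched with the traceability attributes."""
    environ = os.environ if environ is None else environ
    record = {
        "response": response_body,
        "git_sha": environ.get("git_sha", "unknown"),
        "run_id": environ.get("run_id", "unknown"),
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record


# Example with an in-memory stream and made-up values.
stream = io.StringIO()
response_logger = build_logger(stream)
log_response(
    response_logger,
    {"recommended_items": ["dog food"]},
    environ={"git_sha": "3f2a9c1", "run_id": "456"},
)
logged = json.loads(stream.getvalue())
```

With log lines in this shape, a suspicious prediction can be linked back to the exact training run and commit by filtering on `run_id` or `git_sha`.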

To re-deploy the previous version or a specific version of the API deployment, we run a CD roll-back pipeline manually with a specific image name.

Each Docker image saved to ACR (Azure Container Registry) is named with a unique identifier, which is the Databricks run_id of the corresponding training job run. If we would like to re-deploy a specific model, we manually trigger the deployment pipeline for roll-back (GitHub Actions) by providing the specific run_id. If no run_id is provided, it re-deploys the previous image.
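
The selection rule for the roll-back can be sketched as a small function. This is an illustration only, not the pipeline itself: the `select_rollback_image` function, the `REPOSITORY` constant, and the idea of passing the deployment history as an ordered list of run_id tags are assumptions.

```python
from typing import List, Optional

REPOSITORY = "amazon-recommender"  # hypothetical repository name in ACR


def select_rollback_image(history: List[str], run_id: Optional[str] = None) -> str:
    """Pick the image to re-deploy.

    `history` lists the run_id tags of deployed images, oldest first,
    so the last entry is the image that is currently running.
    """
    if run_id is not None:
        # A specific run_id was provided: re-deploy exactly that image.
        if run_id not in history:
            raise ValueError(f"no image tagged with run_id {run_id!r}")
        return f"{REPOSITORY}:{run_id}"
    # No run_id provided: fall back to the image deployed before the current one.
    if len(history) < 2:
        raise ValueError("no previous image to roll back to")
    return f"{REPOSITORY}:{history[-2]}"


# Made-up deployment history for illustration.
history = ["101", "102", "103"]
explicit = select_rollback_image(history, run_id="101")
previous = select_rollback_image(history)
```

Failing loudly on an unknown run_id or an empty history keeps a manual roll-back from silently deploying the wrong image.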
