In this fifth part of my series, I'll outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment.
AI/ML engineers would prefer to focus on model training and data engineering, but the reality is that we also need to understand the infrastructure and mechanics behind the scenes.
I hope to share some suggestions, not only for getting your training run going, but for streamlining the process in a cost-efficient manner on cloud resources such as Kubernetes.
I'll reference elements from my previous articles for getting the best model performance, so be sure to check out Part 1 and Part 2 on the data sets, as well as Part 3 and Part 4 on model evaluation.
Here are the learnings that I'll share with you, once we lay the groundwork on the infrastructure:
- Building your Docker container
- Executing your training run
- Deploying your model
Infrastructure overview
First, let me provide a brief description of the setup that I created, specifically around Kubernetes. Your setup may be entirely different, and that's just fine. I simply want to set the stage on the infrastructure so that the rest of the discussion makes sense.
Image management system
This is a server you deploy that provides a user interface for your subject matter experts to label and evaluate images for the image classification application. The server can run as a pod in your Kubernetes cluster, but you may find that a dedicated server with faster disk is a better fit.
Image files are stored in a directory structure like the following, which is self-documenting and easily modified.
Image_Library/
  - cats/
    - image1001.png
  - dogs/
    - image2001.png
Ideally, these files would reside on local server storage (instead of cloud or cluster storage) for better performance. The reason for this will become clear as we see what happens as the image library grows.
Cloud storage
Cloud storage allows for a virtually limitless and convenient way to share files between systems. In this case, the image library on your management system can access the same files as your Kubernetes cluster or Docker engine.
However, the downside of cloud storage is the latency to open a file. Your image library can have thousands and thousands of images, and the latency to read each file can have a significant impact on your training run time. Longer training runs mean more cost for using the expensive GPU processors!
The way that I found to speed things up is to create a tar file of your image library on your management system and copy it to cloud storage. Even better would be to create multiple tar files in parallel, each containing 10,000 to 20,000 images.
This way you only have network latency on a handful of files (which contain thousands of images once extracted) and you start your training run much sooner.
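As a rough illustration, here is a minimal sketch of creating those tar files in parallel with Python's tarfile module and a process pool. The paths, chunk size, and file naming are assumptions for illustration, not part of my actual tooling.
##### sample parallel tar creation in Python #####
import tarfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

IMAGE_LIBRARY = Path("/data/Image_Library")     # assumed library location
OUTPUT_DIR = Path("/cloud_storage/image_tars")  # assumed cloud storage mount
CHUNK_SIZE = 10_000                             # images per tar file

def create_tar(args):
    index, files = args
    tar_path = OUTPUT_DIR / f"image_library_{index:03d}.tar"
    with tarfile.open(tar_path, "w") as tar:
        for f in files:
            # Keep paths relative to the library root so the class
            # folder structure is preserved on extraction
            tar.add(str(f), arcname=str(f.relative_to(IMAGE_LIBRARY)))
    return tar_path

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    all_files = sorted(IMAGE_LIBRARY.rglob("*.png"))
    chunks = [all_files[i:i + CHUNK_SIZE] for i in range(0, len(all_files), CHUNK_SIZE)]
    with ProcessPoolExecutor() as pool:
        for tar_path in pool.map(create_tar, enumerate(chunks)):
            print(f"Created {tar_path}")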
Kubernetes or Docker engine
A Kubernetes cluster, with proper configuration, will allow you to dynamically scale nodes up and down, so you can perform your model training on GPU hardware as needed. Kubernetes is a rather heavy setup, and there are other container engines that will work.
The technology options are always changing!
The main idea is that you want to spin up the resources you need, for only as long as you need them, then scale down to reduce the time (and therefore cost) of running expensive GPU resources.
Once your GPU node is started and your Docker container is running, you can extract the tar files above to local storage, such as an emptyDir volume, on your node. The node typically has high-speed SSD disk, ideal for this type of workload. There is one caveat: the storage capacity on your node must be able to handle your image library.
Assuming we're good there, let's talk about building your Docker container so that you can train your model on your image library.
Building your Docker container
Being able to execute a training run in a consistent manner lends itself perfectly to building a Docker container. You can "pin" the versions of libraries so you know exactly how your scripts will run every time. You can version control your containers as well, and revert to a known good image in a pinch. What is really nice about Docker is that you can run the container just about anywhere.
The tradeoff when running in a container, especially with an image classification model, is the speed of file storage. You can attach any number of volumes to your container, but they are typically network attached, so there is latency on each file read. This may not be an issue if you have a small number of files. But when dealing with hundreds of thousands of files like image data, that latency adds up!
This is why the tar file method outlined above can be so helpful.
Also, keep in mind that Docker containers can be terminated unexpectedly, so you should be sure to store important information outside the container, on cloud storage or in a database. I'll show you how below.
Dockerfile
Knowing that you'll need to run on GPU hardware (here I'll assume Nvidia), be sure to select the right base image for your Dockerfile, such as nvidia/cuda with the "devel" flavor, which includes the CUDA toolkit and development libraries needed for GPU support.
Next, you'll add the script files to your container, along with a "batch" script to coordinate the execution. Here is an example Dockerfile, and then I'll describe what each of the scripts will be doing.
##### Dockerfile #####
FROM nvidia/cuda:12.8.0-devel-ubuntu24.04
# Install system software
RUN apt-get -y update && apt-get -y upgrade
RUN apt-get install -y python3-pip python3-dev
# Setup python
WORKDIR /app
COPY requirements.txt .
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install -r requirements.txt
# Python and batch scripts
COPY ExtractImageLibrary.py .
COPY Training.py .
COPY Evaluation.py .
COPY ScorePerformance.py .
COPY ExportModel.py .
COPY BulkIdentification.py .
COPY BatchControl.sh .
# Allow for interactive shell
CMD tail -f /dev/null
Dockerfiles are declarative, almost like a cookbook for building a small server; you know exactly what you'll get each time. Python libraries benefit from this declarative approach, too. Here is a sample requirements.txt file that loads the TensorFlow libraries with CUDA support for GPU acceleration.
##### requirements.txt #####
numpy==1.26.3
pandas==2.1.4
scipy==1.11.4
keras==2.15.0
tensorflow[and-cuda]
Extract Image Library script
In Kubernetes, the Docker container can access local, high-speed storage on the physical node. This can be achieved via the emptyDir volume type. As mentioned before, this will only work if the local storage on your node can handle the size of your library.
##### sample 25GB emptyDir volume in Kubernetes #####
containers:
  - name: training-container
    volumeMounts:
      - name: image-library
        mountPath: /mnt/image-library
volumes:
  - name: image-library
    emptyDir:
      sizeLimit: 25Gi
You will also want another volume mount to your cloud storage where you have the tar files. What this looks like will depend on your provider, or whether you are using a persistent volume claim, so I won't go into detail here.
Now you can extract the tar files (ideally in parallel for an added performance boost) to the local mount point.
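Here is a minimal sketch of that extraction step, assuming the tar files sit on a cloud storage mount and the emptyDir volume is mounted at /mnt/image-library as in the sample above; both paths are illustrative.
##### sample parallel tar extraction in Python #####
import tarfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

CLOUD_TAR_DIR = Path("/mnt/cloud-storage/image_tars")  # assumed cloud storage mount
LOCAL_IMAGE_DIR = Path("/mnt/image-library")           # emptyDir mount from above

def extract_tar(tar_path):
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(path=LOCAL_IMAGE_DIR)
    return tar_path

if __name__ == "__main__":
    tar_files = sorted(CLOUD_TAR_DIR.glob("*.tar"))
    with ProcessPoolExecutor() as pool:
        for done in pool.map(extract_tar, tar_files):
            print(f"Extracted {done}")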
Training script
As AI/ML engineers, model training is where we want to spend most of our time.
This is where the magic happens!
With your image library now extracted, we can create our train-validation-test sets, load a pre-trained model or build a new one, fit the model, and save the results.
One key technique that has served me well is to load the most recently trained model as my base. I discuss this in more detail in Part 4 under "Fine tuning"; it results in faster training time and significantly improved model performance.
Be sure to take advantage of the local storage to checkpoint your model during training, since the models are quite large and you are paying for the GPU even while it sits idle writing to disk.
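To make the flow concrete, here is a heavily simplified Keras sketch of what Training.py might look like: building the datasets from the class-per-folder layout, loading the most recent model as a base when one exists, and checkpointing to fast local disk. All paths, the backbone choice, and the hyperparameters are assumptions for illustration.
##### simplified Training.py sketch #####
import os
import tensorflow as tf

IMAGE_DIR = "/mnt/image-library"                           # extracted image library
CHECKPOINT_PATH = "/mnt/image-library/checkpoints/model.keras"
PREVIOUS_MODEL = "/mnt/cloud-storage/models/latest.keras"  # assumed location of last trained model
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "32"))
NUM_EPOCHS = int(os.environ.get("NUM_EPOCHS", "10"))

# Build train/validation sets from the class-per-folder layout
train_ds = tf.keras.utils.image_dataset_from_directory(
    IMAGE_DIR, validation_split=0.2, subset="training", seed=42,
    image_size=(224, 224), batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    IMAGE_DIR, validation_split=0.2, subset="validation", seed=42,
    image_size=(224, 224), batch_size=BATCH_SIZE)

# Load the most recently trained model as the base (see Part 4),
# falling back to a fresh pre-trained backbone if none exists
if os.path.exists(PREVIOUS_MODEL):
    model = tf.keras.models.load_model(PREVIOUS_MODEL)
else:
    backbone = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")
    outputs = tf.keras.layers.Dense(len(train_ds.class_names), activation="softmax")(backbone.output)
    model = tf.keras.Model(backbone.input, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Checkpoint to fast local disk; copy the final model to cloud storage afterwards
os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
os.makedirs("/app/output", exist_ok=True)
checkpoint = tf.keras.callbacks.ModelCheckpoint(CHECKPOINT_PATH, save_best_only=True)
model.fit(train_ds, validation_data=val_ds, epochs=NUM_EPOCHS, callbacks=[checkpoint])
model.save("/app/output/final_model.keras")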
This of course raises a concern about what happens if the Docker container dies partway through the training. The risk is (hopefully) low from a cloud provider, and you may not want an incomplete training run anyway. But if that does happen, you will at least want to understand why, and that is where saving the main log file to cloud storage (described below) or to a package like MLflow comes in handy.
Evaluation script
After your training run has completed and you have taken proper precautions to save your work, it's time to see how well it performed.
Normally this evaluation script will pick up the model that just finished training. But you may decide to point it at a previous model version through an interactive session. This is why I keep the script stand-alone.
Since it is a separate script, it will need to read the finished model from disk (ideally local disk for speed). I like having two separate scripts (training and evaluation), but you may find it better to combine them to avoid reloading the model.
Now that the model is loaded, the evaluation script should generate predictions on every image in the training, validation, test, and benchmark sets. I save the results as a huge matrix with the softmax confidence score for every class label. So, if there are 1,000 classes and 100,000 images, that's a table with 100 million scores!
I save these results to files that are then used in the score generation step next.
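A minimal sketch of that prediction step might look like the following, assuming the trained model was saved by the training script and each split lives in its own folder; file names and paths are illustrative.
##### simplified Evaluation.py sketch #####
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("/app/output/final_model.keras")

def score_dataset(name, directory):
    ds = tf.keras.utils.image_dataset_from_directory(
        directory, image_size=(224, 224), batch_size=64, shuffle=False)
    scores = model.predict(ds)                          # (num_images, num_classes) softmax matrix
    labels = np.concatenate([y.numpy() for _, y in ds]) # ground truth class indices
    np.savez_compressed(f"/app/output/scores_{name}.npz",
                        scores=scores, labels=labels,
                        class_names=np.array(ds.class_names))

# Assumes each split lives in its own folder
for split, folder in [("train", "/mnt/image-library/train"),
                      ("validation", "/mnt/image-library/validation"),
                      ("test", "/mnt/image-library/test"),
                      ("benchmark", "/mnt/image-library/benchmark")]:
    score_dataset(split, folder)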
Score generation script
Taking the matrix of scores produced by the evaluation script above, we can now create various metrics of model performance. Again, this process could be combined with the evaluation script, but my preference is for independent scripts. For example, I may want to regenerate scores on previous training runs. See what works for you.
Here are some of the functions that produce useful insights like F1, log loss, AUC-ROC, and Matthews correlation coefficient.
from sklearn.metrics import average_precision_score, classification_report
from sklearn.metrics import log_loss, matthews_corrcoef, roc_auc_score
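For example, applied to the score matrix saved by the evaluation step, these functions might be used as follows (the file name follows the earlier sketch and is an assumption):
##### sample score calculations #####
import numpy as np
from sklearn.metrics import classification_report, log_loss, matthews_corrcoef, roc_auc_score

data = np.load("/app/output/scores_test.npz")
scores, y_true = data["scores"], data["labels"]
y_pred = scores.argmax(axis=1)
class_ids = np.arange(scores.shape[1])

print(classification_report(y_true, y_pred))   # per-class precision, recall, F1
print("Log loss:", log_loss(y_true, scores, labels=class_ids))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, scores, multi_class="ovr", labels=class_ids))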
Beyond these basic statistical analyses for each dataset (train, validation, test, and benchmark), it is also useful to identify the following (a short sketch of a few of these appears after the list):
- Which ground truth labels get the most errors?
- Which predicted labels get the most incorrect guesses?
- Which ground-truth-to-predicted label pairs occur most often? In other words, which classes are easily confused?
- What is the accuracy when applying a minimum softmax confidence score threshold?
- What is the error rate above that softmax threshold?
- For the "difficult" benchmark sets, do you get a sufficiently high score?
- For the "out-of-scope" benchmark sets, do you get a sufficiently low score?
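Here is a short sketch of a few of these analyses, again working from the saved score matrix; the 0.9 confidence threshold is just an illustrative value.
##### sample error and threshold analysis #####
import numpy as np
from collections import Counter

data = np.load("/app/output/scores_test.npz")
scores, y_true = data["scores"], data["labels"]
y_pred = scores.argmax(axis=1)
confidence = scores.max(axis=1)

# Ground truth labels with the most errors, and the most confused (truth, predicted) pairs
errors = y_true != y_pred
print("Most-missed ground truth labels:", Counter(y_true[errors]).most_common(5))
print("Most confused label pairs:", Counter(zip(y_true[errors], y_pred[errors])).most_common(5))

# Accuracy and error rate above a minimum softmax confidence threshold
threshold = 0.9
above = confidence >= threshold
print(f"Images above threshold: {above.mean():.1%}")
print(f"Accuracy above threshold: {(y_pred[above] == y_true[above]).mean():.1%}")
print(f"Error rate above threshold: {(y_pred[above] != y_true[above]).mean():.1%}")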
As you can see, there are multiple calculations, and it is hard to come up with a single evaluation to decide if the trained model is good enough to be moved to production.
In fact, for an image classification model, it is helpful to manually review the images that the model got wrong, as well as those that received a low softmax confidence score. Use the scores from this script to create a list of images to manually review, and then get a feel for how well the model really performs.
Check out Part 3 for a more in-depth discussion on evaluation and scoring.
Export script
All of the heavy lifting is done by this point. Since your Docker container will be shut down soon, now is the time to copy the model artifacts to cloud storage and prepare them for being put to use.
The example Python code snippet below is geared toward Keras and TensorFlow. It will take the trained model and export it as a TensorFlow SavedModel. Later, I'll show how this is used by TensorFlow Serving in the Deploy section below.
# Increment current version of model and create a new directory
next_version_dir, version_number = create_new_version_folder()
# Copy model artifacts to the new directory
copy_model_artifacts(next_version_dir)
# Create the directory to save the model export
saved_model_dir = os.path.join(next_version_dir, str(version_number))
# Save the model export to be used with TensorFlow Serving
tf.keras.backend.set_learning_phase(0)
model = tf.keras.models.load_model(keras_model_file)
tf.saved_model.save(model, export_dir=saved_model_dir)
This script also copies the other training run artifacts, such as the model evaluation results, score summaries, and log files generated from model training. Don't forget your label map so you can give human-readable names to your classes!
Bulk identification script
Your training run is complete, your model has been scored, and a new version is exported and ready to be served. Now is the time to use this new model to assist you in identifying unlabeled images.
As I described in Part 4, you may have a collection of "unknowns": really good pictures, but no idea what they are. Let your new model provide a best guess on these and record the results to a file or a database. Now you can create filters based on closest match and by high/low scores. This allows your subject matter experts to leverage these filters to find new image classes, add to existing classes, or to remove images that have very low scores and are no good.
By the way, I put this step inside the GPU container since you may have thousands of "unknown" images to process, and the accelerated hardware will make light work of it. However, if you are not in a rush, you could perform this step on a separate CPU node and shut down your GPU node sooner to save cost. This would especially make sense if your "unknowns" folder is on slower cloud storage.
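A bare-bones sketch of BulkIdentification.py might look like this, recording the best-guess class and confidence for each unlabeled image to a CSV on cloud storage; the folder locations and the reuse of the saved class names are assumptions.
##### simplified BulkIdentification.py sketch #####
import csv
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("/app/output/final_model.keras")
class_names = np.load("/app/output/scores_test.npz")["class_names"]

# Unlabeled images, loaded in a stable order so file paths line up with predictions
unknown_ds = tf.keras.utils.image_dataset_from_directory(
    "/mnt/cloud-storage/unknowns", labels=None,
    image_size=(224, 224), batch_size=64, shuffle=False)

scores = model.predict(unknown_ds)
with open("/cloud_storage/bulk_identification.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "best_guess", "confidence"])
    for path, row in zip(unknown_ds.file_paths, scores):
        writer.writerow([path, class_names[row.argmax()], f"{row.max():.4f}"])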
Batch script
All of the scripts described above perform a specific task: extracting your image library, executing model training, performing evaluation and scoring, exporting the model artifacts for deployment, and perhaps even bulk identification.
One script to rule them all
To coordinate the entire show, this batch script gives you the entry point for your container and an easy way to trigger everything. Be sure to produce a log file in case you need to analyze any failures along the way. Also, be sure to write the log to your cloud storage in case the container dies unexpectedly.
#!/bin/bash
# Main batch control script
# Redirect standard output and standard error to a log file
exec > /cloud_storage/batch-logfile.txt 2>&1
python3 /app/ExtractImageLibrary.py
python3 /app/Training.py
python3 /app/Evaluation.py
python3 /app/ScorePerformance.py
python3 /app/ExportModel.py
python3 /app/BulkIdentification.py
Executing your training run
So, now it's time to put everything in motion…
Start your engines!
Let's go through the steps to prepare your image library, fire up your Docker container to train your model, and then examine the results.
Image library ‘tar’ files
Your image management system should now create a tar file backup of your data. Since tar is a single-threaded function, you will get a significant speed improvement by creating multiple tar files in parallel, each with a portion of your data.
Now these tar files can be copied to your shared cloud storage for the next step.
Start Docker container
All of the hard work you put into creating your container (described above) will be put to the test. If you are running Kubernetes, you can create a Job that will execute the BatchControl.sh script.
Inside the Kubernetes Job definition, you can pass environment variables to adjust the execution of your scripts. For example, the batch size and number of epochs are set here and then pulled into your Python scripts, so you can alter the behavior without changing your code.
##### sample Job in Kubernetes #####
containers:
  - name: training-job
    env:
      - name: BATCH_SIZE
        value: "50"
      - name: NUM_EPOCHS
        value: "30"
    command: ["/app/BatchControl.sh"]
Once the Job has completed, be sure to confirm that the GPU node properly scales back down to zero according to your scaling configuration in Kubernetes. You don't want to be saddled with an enormous bill over a simple configuration error.
Manually review results
With the training run complete, you should now have model artifacts saved and can examine the performance. Look through the metrics, such as F1 and log loss, and the benchmark accuracy for high softmax confidence scores.
As mentioned earlier, the reports only tell part of the story. It is worth the time and effort to manually review the images that the model got wrong or where it produced a low confidence score.
Don't forget about the bulk identification results. Be sure to leverage these to locate new images to fill out your data set, or to find new classes.
Deploying your model
Once you have reviewed your model performance and are satisfied with the results, it's time to update your TensorFlow Serving container to put the new model into production.
TensorFlow Serving is available as a Docker container and provides a very quick and convenient way to serve your model. This container can listen and respond to API calls for your model.
Let's say your new model is version 7, and your Export script (see above) has saved the model to your cloud share under models/007. You can start the TensorFlow Serving container with that volume mount. In this example, the shareName points to the folder for version 007.
##### sample TensorFlow pod in Kubernetes #####
containers:
  - name: tensorflow-serving
    image: bitnami/tensorflow-serving:2.18.0
    ports:
      - containerPort: 8501
    env:
      - name: TENSORFLOW_SERVING_MODEL_NAME
        value: "image_application"
    volumeMounts:
      - name: models-subfolder
        mountPath: "/bitnami/model-data"
volumes:
  - name: models-subfolder
    azureFile:
      shareName: "image_application/models/007"
A subtle note here: the export script should create a sub-folder named 007 (same as the base folder) containing the saved model export. This may seem a little confusing, but TensorFlow Serving will mount the share folder as /bitnami/model-data and detect the numbered sub-folder inside it for the version to serve. This will allow you to query the API for the model version as well as for identification.
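Once the pod is running, a quick way to sanity-check it is to hit the TensorFlow Serving REST API, first for the model status (which reports the served version) and then for a prediction. The sketch below assumes a Kubernetes service named tensorflow-serving and uses the requests library for brevity.
##### sample TensorFlow Serving API calls #####
import json
import numpy as np
import requests
import tensorflow as tf

BASE_URL = "http://tensorflow-serving:8501/v1/models/image_application"

# Model status, including the version currently being served (007)
print(requests.get(BASE_URL).json())

# Identification: send one resized image for prediction
image = tf.keras.utils.load_img("example.png", target_size=(224, 224))
payload = {"instances": [np.array(image).tolist()]}
response = requests.post(f"{BASE_URL}:predict", data=json.dumps(payload))
print(response.json()["predictions"])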
Conclusion
As I mentioned at the start of this article, this setup has worked for my situation. This is certainly not the only way to approach this challenge, and I invite you to customize your own solution.
I wanted to share my hard-fought learnings as I embraced cloud services in Kubernetes, with the desire to keep costs under control. Of course, doing all this while maintaining a high level of model performance is an added challenge, but one that you can achieve.
I hope I have provided enough information here to help you with your own endeavors. Happy learnings!