Maximizing the Utility of Scarce AI Resources: A Kubernetes Approach

Optimizing the usage of limited AI training accelerators

Photo by Roman Derrick Okello on Unsplash

In the ever-evolving landscape of AI development, nothing rings truer than the old saying (attributed to Heraclitus), "the only constant in life is change". In the case of AI, not only is change constant, but the pace of change keeps increasing. Staying relevant in these unique and exciting times amounts to an unprecedented test of the ability of AI teams to consistently adapt and adjust their development processes. AI development teams that fail to adapt, or are slow to adapt, risk quickly becoming obsolete.

One of the more difficult developments of the past few years in AI has been the increasing difficulty of acquiring the hardware required to train AI models. Whether it is due to an ongoing crisis in the global supply chain or a significant increase in the demand for AI chips, getting your hands on the GPUs (or alternative training accelerators) that you need for AI development has become much harder. This is evidenced by the long wait times for new GPU orders and by the fact that cloud service providers (CSPs), which once offered virtually infinite capacity of GPU machines, now struggle to keep up with the demand.

The changing times are forcing AI development teams that may have once relied on limitless capacity of AI accelerators to adapt to a world with reduced accessibility and, in some cases, higher costs. Development processes that once took for granted the ability to spin up a new GPU machine at will must be modified to meet the demands of a world of scarce AI resources that are often shared by multiple projects and/or teams. Those that fail to adapt risk annihilation.

In this post we will demonstrate the use of Kubernetes in the orchestration of AI-model training workloads in a world of scarce AI resources. We will begin by specifying the goals we wish to achieve. We will then describe why Kubernetes is an appropriate tool for addressing this challenge. Last, we will provide a simple demonstration of how Kubernetes can be used to maximize the use of a scarce AI compute resource. In subsequent posts, we plan to enhance the Kubernetes-based solution and show how to apply it to a cloud-based training environment.

Disclaimers

While this post does not assume prior experience with Kubernetes, some basic familiarity would certainly be helpful. This post should not, in any way, be viewed as a Kubernetes tutorial. To learn about Kubernetes, we refer the reader to the many great online resources on the topic. Here we will discuss just a few properties of Kubernetes as they pertain to the topic of maximizing and prioritizing resource utilization.

There are many alternative tools and techniques to the method we put forth here, each with their own pros and cons. Our intention in this post is purely educational; please do not view any of the choices we make as an endorsement.

Lastly, the Kubernetes platform remains under constant development, as do many of the frameworks and tools in the field of AI development. Please take into account the possibility that some of the statements, examples, and/or external links in this post may be outdated by the time you read it, and be sure to consider the most up-to-date solutions available before making your own design decisions.

To simplify our discussion, let's assume that we have a single worker node at our disposal for training our models. This could be a local machine with a GPU or a reserved compute-accelerated instance in the cloud, such as a p5.48xlarge instance in AWS or a TPU node in GCP. In our example below we will refer to this node as "my precious". Typically, we will have spent a lot of money on this machine. We will further assume that we have multiple training workloads all competing for our single compute resource, where each workload could take anywhere from a few minutes to a few days. Naturally, we would like to maximize the utility of our compute resource by ensuring that it is in constant use and that the most important jobs get prioritized. What we need is some form of priority queue and an associated priority-based scheduling algorithm. Let's try to be a bit more specific about the behaviors that we desire.

Scheduling Requirements

  1. Maximize Utilization: We would like our resource to be in constant use. Specifically, as soon as it completes one workload, it will promptly (and automatically) start working on a new one.
  2. Queue Pending Workloads: We require the existence of a queue of training workloads that are waiting to be processed by our unique resource. We also require associated APIs for creating and submitting new jobs to the queue, as well as for monitoring and managing the state of the queue.
  3. Support Prioritization: We would like each training job to have an associated priority such that workloads with higher priority will be run before workloads with lower priority.
  4. Preemption: Moreover, in the case that an urgent job is submitted to the queue while our resource is working on a lower priority job, we would like the running job to be preempted and replaced by the urgent job. The preempted job should be returned to the queue.

One approach to developing a solution that satisfies these requirements could be to take an existing API for submitting jobs to a training resource and wrap it with a custom implementation of a priority queue with the desired properties. At a minimum, this approach would require a data structure for storing a list of pending jobs, a dedicated process for choosing and submitting jobs from the queue to the training resource, and some sort of mechanism for identifying when a job has been completed and the resource has become available.

An alternative approach, and the one we take in this post, is to leverage an existing solution for priority-based scheduling that fulfills our requirements and to align our training development workflow to its use. The default scheduler that comes with Kubernetes is one example of such a solution. In the sections that follow we will demonstrate how it can be used to address the problem of optimizing the use of scarce AI training resources.

In this section we will get a bit philosophical about the application of Kubernetes to the orchestration of ML training workloads. If you have no patience for such discussions (totally fair) and want to get straight to the practical examples, please feel free to skip to the next section.

Kubernetes is (yet another) one of those software/technological solutions that tend to elicit strong reactions in many developers. There are some who swear by it and use it extensively, and others who find it overbearing, clumsy, and unnecessary (e.g., see here for some of the arguments for and against using Kubernetes). As with many other heated debates, it is the author's opinion that the truth lies somewhere in between: there are situations where Kubernetes provides an ideal framework that can significantly increase productivity, and other situations where its use borders on an insult to the software development profession. The big question is, where on the spectrum does ML development lie? Is Kubernetes the appropriate framework for training ML models? Although a cursory online search might give the impression that the general consensus is an emphatic "yes", we will make some arguments for why that may not be the case. But first, we need to be clear about what we mean by "ML training orchestration using Kubernetes".

While there are many online resources that address the topic of ML using Kubernetes, it is important to be aware that they are not always referring to the same mode of use. Some resources (e.g., here) use Kubernetes only for deploying a cluster; once the cluster is up and running, they start the training job outside the context of Kubernetes. Others (e.g., here) use Kubernetes to define a pipeline in which a dedicated module starts up a training job (and associated resources) using a completely different system. In contrast to these two examples, many other resources define the training workload as a Kubernetes Job artifact that runs on a Kubernetes Node. However, they too vary greatly in the particular attributes on which they focus. Some (e.g., here) emphasize the auto-scaling properties and others (e.g., here) the Multi-Instance GPU (MIG) support. They also vary greatly in the details of implementation, such as the precise artifact (Job extension) for representing a training job (e.g., ElasticJob, TrainingWorkload, JobSet, VolcanoJob, etc.). In the context of this post, we too will assume that the training workload is defined as a Kubernetes Job. However, in order to simplify the discussion, we will stick to the core Kubernetes objects and leave the discussion of Kubernetes extensions for ML to a future post.

Arguments Against Kubernetes for ML

Here are some arguments that could be made against the use of Kubernetes for training ML models.

  1. Complexity: Even its greatest proponents have to admit that Kubernetes can be hard. Using Kubernetes effectively requires a high level of expertise, has a steep learning curve, and, realistically speaking, typically requires a dedicated DevOps team. Designing a training solution based on Kubernetes increases dependencies on dedicated experts and, by extension, increases the risk that things could go wrong and that development could be delayed. Many alternative ML training solutions enable a greater level of developer independence and freedom and entail a reduced risk of bugs in the development process.
  2. Fixed Resource Requirements: One of the most touted properties of Kubernetes is its scalability: its ability to automatically and seamlessly scale its pool of compute resources up and down according to the number of jobs, the number of clients (in the case of a service application), resource capacity, etc. However, one could argue that in the case of an ML training workload, where the number of required resources is (usually) fixed throughout training, auto-scaling is unnecessary.
  3. Fixed Instance Type: Because Kubernetes orchestrates containerized applications, it enables a great deal of flexibility when it comes to the types of machines in its node pool. However, when it comes to ML, we typically require very specific machinery with dedicated accelerators (such as GPUs). Moreover, our workloads are often tuned to run optimally on one very specific instance type.
  4. Monolithic Application Architecture: It is common practice in the development of modern applications to break them down into small elements called microservices. Kubernetes is often seen as a key component in this design. ML training applications tend to be quite monolithic in their design and, one could argue, do not lend themselves naturally to a microservice architecture.
  5. Resource Overhead: The dedicated processes that are required to run Kubernetes consume some system resources on each of the nodes in its pool. Consequently, they may incur a certain performance penalty on our training jobs. Given the expense of the resources required for training, we may prefer to avoid this.

Granted, we have taken a very one-sided view in the Kubernetes-for-ML debate. Based solely on the arguments above, you might conclude that we would need a darn good reason for choosing Kubernetes as a framework for ML training. It is our opinion that the challenge put forth in this post, i.e., the desire to maximize the utility of scarce AI compute resources, is exactly the type of justification that warrants the use of Kubernetes despite the arguments made above. As we will demonstrate, the default scheduler that is built in to Kubernetes, combined with its support for priority and preemption, makes it a front-runner for fulfilling the requirements stated above.

In this section we will share a brief example that demonstrates the priority-scheduling support that is built in to Kubernetes. For the purposes of our demonstration, we will use Minikube (version v1.32.0). Minikube is a tool that enables you to run a Kubernetes cluster in a local environment and is an ideal playground for experimenting with Kubernetes. Please see the official documentation on installing and getting started with Minikube.

Cluster Creation

Let's start by creating a two-node cluster using the minikube start command:

minikube start --nodes 2

The result is a local Kubernetes cluster consisting of a master ("control-plane") node named minikube, and a single worker node, named minikube-m02, which will simulate our single AI resource. Let's apply the label my-precious to identify it as a unique resource type:

kubectl label nodes minikube-m02 node-type=my-precious
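
If you would like to verify that the label was applied from the command line, you can list the nodes together with their labels:

kubectl get nodes --show-labels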

We can use the Minikube dashboard to visualize the results. In a separate shell, run the command below and open the generated browser link.

minikube dashboard

If you press the Nodes tab on the left-hand pane, you should see a summary of our cluster's nodes:

Nodes List in the Minikube Dashboard (Captured by Author)

PriorityClass Definitions

Next, we define two PriorityClasses, low-priority and high-priority, as in the priorities.yaml file displayed below. New jobs will be assigned the low-priority class by default.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 0
globalDefault: true

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false

To apply our new classes to our cluster, we run:

kubectl apply -f priorities.yaml
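
To verify that the two new classes were registered (alongside Kubernetes' built-in system-cluster-critical and system-node-critical classes), you can run:

kubectl get priorityclasses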

Create a Job

We define a simple job using the job.yaml file displayed in the code block below. For the purpose of our demonstration, we define a Kubernetes Job that does nothing more than sleep for 100 seconds. We use busybox as its Docker image. In practice, this would be replaced with a training script and an appropriate ML Docker image. We define the job to run on our special instance, my-precious, using the nodeSelector field, and specify the resource requirements so that only a single instance of the job can run on the instance at a time. The priority of the job defaults to low-priority, as defined above.

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    spec:
      containers:
        - name: test
          image: busybox
          command: # a simple sleep command
            - sleep
            - '100'
          resources: # require all available resources
            limits:
              cpu: "2"
            requests:
              cpu: "2"
      nodeSelector: # specify our unique resource
        node-type: my-precious
      restartPolicy: Never

We submit the job with the following command:

kubectl apply -f job.yaml
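
The status of the job, and the node its pod was assigned to, can also be checked from the command line:

kubectl get jobs
kubectl get pods -o wide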

Create a Queue of Jobs

To demonstrate the manner in which Kubernetes queues jobs for processing, we create three identical copies of the job defined above, named test1, test2, and test3. We group the three jobs in a single file, jobs.yaml, and submit them for processing:

kubectl apply -f jobs.yaml
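
For reference, one simple way to produce jobs.yaml is to copy job.yaml three times and rename each copy. The short shell loop below is one such sketch; it assumes the job.yaml file from the previous section:

for i in 1 2 3; do
  sed "s/name: test$/name: test$i/" job.yaml  # rename the job (and its container) to test1, test2, test3
  echo "---"                                  # YAML document separator
done > jobs.yaml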

The image below captures the Workload Status of our cluster in the Minikube dashboard shortly after the submission. You can see that my-precious has begun processing test1, while the other jobs are pending as they wait their turn.

Cluster Workload Status (Captured by Author)

Once test1 is completed, processing of test2 begins:

Cluster Workload Status — Automated Scheduling (Captured by Author)

As long as no other jobs with higher priority are submitted, our jobs will be processed one at a time until they are all completed.
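
If you prefer the command line to the dashboard, the queue can be monitored with kubectl as well, for example by listing the jobs and watching the pods as they move from Pending to Running to Completed:

kubectl get jobs
kubectl get pods --watch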

Job Preemption

We now demonstrate Kubernetes' built-in support for job preemption by showing what happens when we submit a fourth job, this time with the high-priority setting:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-p1
spec:
  template:
    spec:
      containers:
        - name: test-p1
          image: busybox
          command:
            - sleep
            - '100'
          resources:
            limits:
              cpu: "2"
            requests:
              cpu: "2"
      restartPolicy: Never
      priorityClassName: high-priority # high priority job
      nodeSelector:
        node-type: my-precious

The impact on the Workload Status is displayed in the image below:

Cluster Workload Status — Preemption (Captured by Author)

The test2 job has been preempted: its processing has been stopped and it has been returned to the pending state. In its stead, my-precious has begun processing the higher priority test-p1 job. Only once test-p1 is completed will processing of the lower priority jobs resume. (In the case where the preempted job is an ML training workload, we would program it to resume from the most recently saved model checkpoint.)
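
The same behavior can be observed from the command line by listing the pods and inspecting recent cluster events (the exact event messages may vary across Kubernetes versions):

kubectl get pods
kubectl get events --sort-by=.lastTimestamp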

The image below displays the Workload Status once all of the jobs have been completed.

Cluster Workload Status — Completion (Captured by Author)

The solution we demonstrated for priority-based scheduling and preemption relied only on core components of Kubernetes. In practice, you may choose to take advantage of enhancements to the basic functionality introduced by extensions such as Kueue and/or dedicated, ML-specific features offered by platforms built on top of Kubernetes, such as Run:AI or Volcano. But keep in mind that to fulfill the basic requirements for maximizing the utility of a scarce AI compute resource, all we need is core Kubernetes.

The reduced availability of dedicated AI silicon has forced ML teams to adjust their development processes. Unlike in the past, when developers could spin up new AI resources at will, they now face limitations on AI compute capacity. This necessitates the procurement of AI instances through means such as purchasing dedicated units and/or reserving cloud instances. Moreover, developers must come to terms with the likelihood of needing to share these resources with other users and projects. To ensure that the scarce AI compute power is applied with maximum utility, dedicated scheduling algorithms must be defined that minimize idle time and prioritize critical workloads. In this post we have demonstrated how the Kubernetes scheduler can be used to accomplish these goals. As emphasized above, this is just one of many approaches to addressing the challenge of maximizing the utility of scarce AI resources. Naturally, the approach you choose, and the details of your implementation, will depend on the specific needs of your AI development.
