Training AI Models on CPU


Revisiting CPU for ML in an Era of GPU Scarcity


The recent successes in AI are often attributed to the emergence and evolution of the GPU. The GPU's architecture, which typically includes hundreds of multi-processors, high-bandwidth memory, dedicated tensor cores, and more, is especially well-suited to the intensive demands of AI/ML workloads. Unfortunately, the rapid growth in AI development has led to a surge in the demand for GPUs, making them difficult to acquire. Consequently, ML developers are increasingly exploring alternative hardware options for training and running their models. In previous posts, we discussed the possibility of training on dedicated AI ASICs such as Google Cloud TPU, Habana Gaudi, and AWS Trainium. While these options offer significant cost-saving opportunities, they do not suit all ML models and can, just like the GPU, also suffer from limited availability. In this post we return to the good old-fashioned CPU and revisit its relevance to ML applications. Although CPUs are generally less suited to ML workloads than GPUs, they are much easier to acquire. The ability to run (at least some of) our workloads on CPU could have significant implications on development productivity.

In previous posts (e.g., here) we emphasized the importance of analyzing and optimizing the runtime performance of AI/ML workloads as a means of accelerating development and minimizing costs. While this is crucial regardless of the compute engine used, the profiling tools and optimization techniques can vary greatly between platforms. In this post, we will discuss some of the performance optimization options that pertain to CPU. Our focus will be on Intel® Xeon® CPU processors (with Intel® AVX-512) and on the PyTorch (version 2.4) framework (although similar techniques can be applied to other CPUs and frameworks, as well). More specifically, we will run our experiments on an Amazon EC2 c7i instance with an AWS Deep Learning AMI. Please do not view our choice of cloud platform, CPU version, ML framework, or any other tool or library we mention, as an endorsement over their alternatives.

Our goal will be to demonstrate that although ML development on CPU may not be our first choice, there are ways to "soften the blow" and, in some cases, perhaps even make it a viable alternative.

Disclaimers

Our intention in this post is to demonstrate just a few of the ML optimization opportunities available on CPU. Contrary to most of the online tutorials on the topic of ML optimization on CPU, we will focus on a training workload rather than an inference workload. There are a number of optimization tools focused specifically on inference that we will not cover (e.g., see here and here).

Please do not view this post as a replacement for the official documentation on any of the tools or techniques that we mention. Keep in mind that, given the rapid pace of AI/ML development, some of the content, libraries, and/or instructions that we mention may become outdated by the time you read this. Please be sure to refer to the most up-to-date documentation available.

Importantly, the impact of the optimizations that we discuss on runtime performance is likely to vary greatly based on the model and the details of the environment (e.g., see the high degree of variance between models on the official PyTorch TorchInductor CPU Inference Performance Dashboard). The comparative performance numbers we share are specific to the toy model and runtime environment that we use. Be sure to reevaluate all of the proposed optimizations on your own model and runtime environment.

Lastly, our focus will be solely on throughput performance (as measured in samples per second), not on training convergence. Nevertheless, it should be noted that some optimization techniques (e.g., batch size tuning, mixed precision, and more) could have a negative effect on the convergence of certain models. In some cases, this can be overcome through appropriate hyperparameter tuning.

We will run our experiments on a simple image classification model with a ResNet-50 backbone (from Deep Residual Learning for Image Recognition). We will train the model on a fake dataset. The full training script appears in the code block below (loosely based on this example):

import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import time

# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        # CrossEntropyLoss expects class-index targets of type int64
        label = torch.tensor(data=index % 10, dtype=torch.int64)
        return rand_image, label

train_set = FakeDataset()

batch_size = 128
num_workers = 0

train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=num_workers
)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()

t0 = time.perf_counter()
summ = 0
count = 0

for idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    batch_time = time.perf_counter() - t0
    if idx > 10:  # skip first steps
        summ += batch_time
        count += 1
    t0 = time.perf_counter()
    if idx > 100:
        break

print(f'average step time: {summ/count}')
print(f'throughput: {count*batch_size/summ}')

Running this script on a c7i.2xlarge (with 8 vCPUs) and the CPU version of PyTorch 2.4 results in a throughput of 9.12 samples per second. For the sake of comparison, we note that the throughput of the same (unoptimized) script on an Amazon EC2 g5.2xlarge instance (with 1 GPU and 8 vCPUs) is 340 samples per second. Taking into account the comparative costs of the two instance types ($0.357 per hour for a c7i.2xlarge and $1.212 for a g5.2xlarge, as of the time of this writing), we find that training on the GPU instance gives roughly eleven(!!) times better price performance. Based on these results, the preference for using GPUs to train ML models is very well founded. Let's assess some of the possibilities for reducing this gap.
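
To make the price-performance comparison concrete, the snippet below reproduces the calculation from the throughput and On-Demand prices quoted above (the numbers are copied from the text, not measured by the snippet):

# back-of-the-envelope price-performance comparison using the figures above
cpu_throughput = 9.12    # samples per second on c7i.2xlarge
gpu_throughput = 340.0   # samples per second on g5.2xlarge
cpu_price = 0.357        # dollars per hour (On-Demand)
gpu_price = 1.212        # dollars per hour (On-Demand)

cpu_samples_per_dollar = cpu_throughput * 3600 / cpu_price
gpu_samples_per_dollar = gpu_throughput * 3600 / gpu_price
print(f'GPU price-performance advantage: '
      f'{gpu_samples_per_dollar / cpu_samples_per_dollar:.1f}x')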

In the sections that follow, we will explore some basic methods for increasing the runtime performance of our training workload. Although you may recognize some of these from our post on GPU optimization, it is important to highlight a significant difference between training optimization on CPU and GPU platforms. On GPU platforms much of our effort was dedicated to maximizing the parallelization between (the training data preprocessing on) the CPU and (the model training on) the GPU. On CPU platforms all of the processing occurs on the CPU, and our goal will be to allocate its resources most effectively.

Batch Size

Increasing the training batch size can potentially increase performance by reducing the frequency of the model parameter updates. (On GPUs it has the added benefit of reducing the overhead of CPU-to-GPU transactions such as kernel loading.) However, while on GPU we aimed for a batch size that would maximize the utilization of the GPU memory, the same strategy might hurt performance on CPU. For reasons beyond the scope of this post, CPU memory behavior is more complicated, and the best approach for finding the optimal batch size may be trial and error. Keep in mind that changing the batch size could affect training convergence.
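
As a rough illustration of such a trial-and-error search, the sketch below sweeps over a few batch sizes. Note that measure_throughput is a hypothetical helper that wraps the timed training loop from the script above and returns samples per second:

# hypothetical sweep over a few batch sizes;
# measure_throughput is assumed to wrap the timed training loop above
for bs in [32, 64, 128, 256]:
    loader = DataLoader(dataset=train_set, batch_size=bs)
    throughput = measure_throughput(model, loader, bs)
    print(f'batch size {bs}: {throughput:.2f} samples/sec')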

The table below summarizes the throughput of our training workload for a few (arbitrary) choices of batch size:

Training Throughput as Function of Batch Size (by Author)

Contrary to our findings on GPU, on the c7i.2xlarge instance type our model appears to prefer lower batch sizes.

Multi-process Data Loading

A common technique on GPUs is to assign multiple processes to the data loader so as to reduce the likelihood of starvation of the GPU. On GPU platforms, a general rule of thumb is to set the number of workers according to the number of CPU cores. However, on CPU platforms, where the model training uses the same resources as the data loader, this approach could backfire. Once again, the best approach for choosing the optimal number of workers may be trial and error. The table below shows the average throughput for different choices of num_workers:

Training Throughput as Function of the Number of Data Loading Workers (by Author)
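
For reference, this is how a non-zero worker count would be configured in our script. The value of 2 is purely illustrative and should be tuned (by trial and error) for your own workload, keeping in mind that each worker is a separate process competing with the training threads for the same CPU cores:

# example: two data-loading worker processes, leaving the remaining cores for training;
# persistent_workers keeps the worker processes alive between passes over the dataset
train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=2,
    persistent_workers=True
)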

Mixed Precision

Another popular technique is to use lower precision floating point datatypes such as torch.float16 or torch.bfloat16, with the dynamic range of torch.bfloat16 generally considered to be more amenable to ML training. Naturally, reducing the datatype precision can have adverse effects on convergence and should be done carefully. PyTorch comes with torch.amp, an automatic mixed precision package for optimizing the use of these datatypes. Intel® AVX-512 includes support for the bfloat16 datatype. The modified training step appears below:

for idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    with torch.amp.autocast('cpu', dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()

The throughput following this optimization is 24.34 samples per second, an increase of 86%!!
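
Before relying on bfloat16, it may be worth verifying that your PyTorch build detects the expected instruction set; a quick check (assuming a recent PyTorch 2.x build, which exposes this query) is shown below:

import torch

# reports the highest vector instruction set detected by this PyTorch build
# (e.g., 'AVX512' would be expected on the c7i instance used here)
print(torch.backends.cpu.get_cpu_capability())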

Channels Last Memory Format

Channels last memory format is a beta-level optimization (at the time of this writing), pertaining primarily to vision models, that supports storing four-dimensional (NCHW) tensors in memory such that the channels are the last dimension. This results in all of the data of each pixel being stored together. Considered to be more "friendly to Intel platforms", this memory format is reported to boost the performance of a ResNet-50 on an Intel® Xeon® CPU. The adjusted training step appears below:

for idx, (data, target) in enumerate(train_loader):
    data = data.to(memory_format=torch.channels_last)
    optimizer.zero_grad()
    with torch.amp.autocast('cpu', dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()

The resulting throughput is 37.93 samples per second — a further 56% improvement and a total of 415% compared to our baseline experiment. We are on a roll!!
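
Note that the training step above converts only the input tensors. PyTorch's channels-last tutorial also converts the model weights to the same memory format; whether this provides an additional benefit will depend on the model, so be sure to measure it on your own workload:

# optionally store the model weights in channels-last format as well
model = model.to(memory_format=torch.channels_last)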

Torch Compilation

In a previous post we covered the virtues of PyTorch's support for graph compilation and its potential impact on runtime performance. Contrary to the default eager execution mode in which each operation is run independently (a.k.a., "eagerly"), the compile API converts the model into an intermediate computation graph which is then JIT-compiled into low-level machine code in a manner that is optimal for the underlying training engine. The API supports compilation via different backend libraries and with multiple configuration options. Here we will limit our evaluation to the default (TorchInductor) backend and the ipex backend from the Intel® Extension for PyTorch, a library with dedicated optimizations for Intel hardware. Please see the documentation for appropriate installation and usage instructions. The updated model definition appears below:

import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()
backend='inductor' # optionally change to 'ipex'
model = torch.compile(model, backend=backend)

In the case of our toy model, the impact of torch compilation is only apparent when the "channels last" optimization is disabled (an increase of ~27% for each of the backends). When "channels last" is applied, the performance actually drops. Consequently, we exclude torch compilation from our subsequent experiments.

Beyond the model-level optimizations discussed so far, there are a number of opportunities for optimizing the use of the underlying CPU resources. These include aligning memory management and thread allocation with the structure of the underlying CPU hardware. Memory management can be improved through the use of advanced memory allocators (such as Jemalloc and TCMalloc) and/or by reducing memory accesses that are slower (i.e., across NUMA nodes). Thread allocation can be improved through appropriate configuration of the OpenMP threading library and/or use of Intel's OpenMP library.
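
If you prefer to experiment manually, PyTorch also exposes direct control over its intra-op and inter-op thread pools. A minimal sketch appears below; the thread counts are placeholders that should be matched to your own core count, and the calls must be made before any parallel work is executed:

import torch

# limit intra-op parallelism, e.g., to the number of physical cores
torch.set_num_threads(4)
# limit inter-op parallelism; must be called before any inter-op parallel work runs
torch.set_num_interop_threads(1)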

Generally speaking, these kinds of optimizations require a deep understanding of the CPU architecture and the components of its supporting SW stack. To simplify matters, PyTorch offers the torch.backends.xeon.run_cpu script for automatically configuring the memory and threading libraries so as to optimize runtime performance. The command below results in the use of the dedicated memory and threading libraries. We will return to the topic of NUMA nodes when we discuss the option of distributed training.

We verify appropriate installation of TCMalloc (conda install conda-forge::gperftools) and Intel's OpenMP library (pip install intel-openmp), and run the following command:

python -m torch.backends.xeon.run_cpu train.py

The use of the run_cpu script further boosts our runtime performance to 39.05 samples per second. Note that the run_cpu script includes many controls for further tuning performance. Be sure to check out the documentation in order to maximize its use.

The Intel® Extension for PyTorch includes additional opportunities for training optimization via its ipex.optimize function. Here we demonstrate its default use. Please see the documentation to learn of its full capabilities.

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
model, optimizer = ipex.optimize(
    model,
    optimizer=optimizer,
    dtype=torch.bfloat16
)

Combined with the memory and thread optimizations discussed above, the resultant throughput is 40.73 samples per second. (Note that a similar result is reached when disabling the "channels last" configuration.)

Distributed Training Across NUMA Nodes

Intel® Xeon® processors are designed with Non-Uniform Memory Access (NUMA), in which the CPU memory is divided into groups, a.k.a., NUMA nodes, and each of the CPU cores is assigned to one node. Although any CPU core can access the memory of any NUMA node, access to its own node (i.e., its local memory) is much faster. This gives rise to the notion of distributing training across NUMA nodes, where the CPU cores assigned to each NUMA node act as a single process in a distributed process group and data distribution across nodes is managed by Intel® oneCCL, Intel's dedicated collective communications library.

We can run data distributed training across NUMA nodes easily using the ipexrun utility. In the following code block (loosely based on this example) we adapt our script to run data distributed training (according to the usage detailed here):

import os, time
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torchvision
import oneccl_bindings_for_pytorch as torch_ccl
import intel_extension_for_pytorch as ipex

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = os.environ.get("PMI_RANK", "0")
os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")
dist.init_process_group(backend="ccl", init_method="env://")
rank = os.environ["RANK"]
world_size = os.environ["WORLD_SIZE"]

batch_size = 128
num_workers = 0

# define dataset and dataloader
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 10, dtype=torch.int64)
        return rand_image, label

train_dataset = FakeDataset()
dist_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    sampler=dist_sampler
)

# define model artifacts
model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
model, optimizer = ipex.optimize(
    model,
    optimizer=optimizer,
    dtype=torch.bfloat16
)

# configure DDP
model = torch.nn.parallel.DistributedDataParallel(model)

# run training loop

# destroy the process group
dist.destroy_process_group()

Unfortunately, as of the time of this writing, the Amazon EC2 c7i instance family does not include a multi-NUMA instance type. To test our distributed training script, we revert to an Amazon EC2 c6i.32xlarge instance with 64 vCPUs and 2 NUMA nodes. We verify the installation of Intel® oneCCL Bindings for PyTorch and run the following command (as documented here):

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh

# This example command utilizes all of the NUMA sockets of the processor, taking each socket as a rank.
ipexrun cpu --nnodes 1 --omp_runtime intel train.py

The following table compares the performance results on the c6i.32xlarge instance with and without distributed training:

Distributed Training Across NUMA Nodes (by Author)

In our experiment, data distribution did not boost the runtime performance. Please see the ipexrun documentation for additional performance tuning options.

In previous posts (e.g., here) we discussed the PyTorch/XLA library and its use of XLA compilation to enable PyTorch-based training on XLA devices such as TPU, GPU, and CPU. Similar to torch compilation, XLA uses graph compilation to generate machine code that is optimized for the target device. With the establishment of the OpenXLA Project, one of the stated goals was to support high performance across all hardware backends, including CPU (see the CPU RFC here). The code block below demonstrates the adjustments to our original (unoptimized) script required to train using PyTorch/XLA:

import torch
import torchvision
import time
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()

model = torchvision.models.resnet50().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()

for idx, (data, target) in enumerate(train_loader):
    data = data.to(device)
    target = target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    xm.mark_step()

Unfortunately, (as of the time of this writing) the XLA results on our toy model appear far inferior to the (unoptimized) results we saw above (by as much as 7X). We expect this to improve as PyTorch/XLA's CPU support matures.

We summarize the results of a subset of our experiments in the table below. For the sake of comparison, we add the throughput of training our model on an Amazon EC2 g5.2xlarge GPU instance following the optimization steps discussed in this post. The samples per dollar were calculated based on the Amazon EC2 On-Demand pricing page ($0.357 per hour for a c7i.2xlarge and $1.212 for a g5.2xlarge, as of the time of this writing).

Performance Optimization Results (by Author)

Although we succeeded in boosting the training performance of our toy model on the CPU instance by a considerable margin (446%), it remains inferior to the (optimized) performance on the GPU instance. Based on our results, training on GPU would be ~6.7 times cheaper. It is likely that with additional performance tuning and/or additional optimization strategies, we could further close the gap. Once again, we emphasize that the comparative performance results we have reached are unique to this model and runtime environment.

Amazon EC2 Spot Instance Discounts

The increased availability of cloud-based CPU instance types (compared to GPU instance types) may imply greater opportunity for obtaining compute power at discounted rates, e.g., through Spot Instance utilization. Amazon EC2 Spot Instances are instances from surplus cloud service capacity that are offered at a discount of as much as 90% off the On-Demand pricing. In exchange for the discounted price, AWS maintains the right to preempt the instance with little to no warning. Given the high demand for GPUs, you may find CPU spot instances easier to get hold of than their GPU counterparts. At the time of this writing, the c7i.2xlarge Spot Instance price is $0.1291, which would improve our samples per dollar result to 1135.76 and further reduce the gap between the optimized GPU and CPU price performance (to 2.43X).

While the runtime performance results of the optimized CPU training of our toy model (and our chosen environment) were lower than the GPU results, it is likely that the same optimization steps applied to other model architectures (e.g., ones that include components that are not supported by the GPU) may result in the CPU performance matching or beating that of the GPU. And even in cases where the performance gap is not bridged, there may very well be cases where the shortage of GPU compute capacity would justify running some of our ML workloads on CPU.

Given the ubiquity of the CPU, the ability to use it effectively for training and/or running ML workloads could have huge implications on development productivity and on end-product deployment strategy. While the nature of the CPU architecture is less amenable to many ML applications when compared to the GPU, there are many tools and techniques available for boosting its performance — a select few of which we have discussed and demonstrated in this post.

In this post we focused on optimizing training on CPU. Please be sure to check out our many other posts on Medium covering a wide variety of topics pertaining to performance analysis and optimization of machine learning workloads.
