A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline


Diagnosing performance bottlenecks in the data input pipeline of a machine learning model running on a GPU can be particularly frustrating. In most workloads, the host (CPU) and the device (GPU) work in tandem: the CPU is responsible for preparing and feeding data, while the GPU handles the heavy lifting, such as executing the model, performing backpropagation during training, and updating weights.

Ideally, we would like the GPU, the most expensive component of our AI/ML infrastructure, to be highly utilized. This results in faster development cycles, lower training costs, and reduced latency in deployment. To achieve this, the GPU must be continuously fed with input data. In particular, we would like to prevent the onset of “GPU starvation”, a situation in which our most expensive resource lies idle while it waits for input data. Unfortunately, GPU starvation resulting from bottlenecks in the data input pipeline is quite common and can dramatically reduce system efficiency. As such, it’s important for AI/ML developers to have reliable tools and techniques for diagnosing and addressing such issues.

This post, the eighth in our series on the topic of PyTorch Model Performance Analysis and Optimization, introduces a simple caching strategy for identifying bottlenecks in the data input pipeline. As in earlier posts, we aim to reinforce two key ideas:

  1. AI/ML developers must take responsibility for the runtime performance of their models.
  2. You don’t need to be a CUDA or systems expert to implement significant performance optimizations.

We’ll start by outlining some of the common causes of GPU starvation. Then we’ll introduce our caching-based strategy for identifying and analyzing input pipeline performance issues. We’ll close by reviewing a set of practical tips, tricks, and techniques (TTTs) for overcoming performance bottlenecks in the data input pipeline.

To facilitate our discussion, we’ll define a toy PyTorch model and an associated data input pipeline. The code that we will share is intended for demonstrative purposes; please don’t rely on its correctness or optimality. Moreover, please don’t interpret our mention of any tool or technique as an endorsement of its use.

A Toy PyTorch Model

We define a simple PyTorch-based image classification model:

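The following is a minimal sketch of such a model. The architecture, as well as the constants img_size, num_classes, and input_img_size used by the dataset below, are illustrative assumptions rather than a prescribed design.

import torch
import torch.nn as nn

input_img_size = [1024, 1024]  # size of the raw random image (see the dataset below)
img_size = 256                 # size of the cropped model input (assumed value)
num_classes = 10               # number of output classes (assumed value)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # a small convolutional feature extractor followed by a linear classifier
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)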

We define a synthetic dataset with a number of transformations, intentionally designed to include a severe input pipeline bottleneck. For more details on the dataset definition, please see this post.

import numpy as np
from PIL import Image
from torchvision.datasets.vision import VisionDataset
import torchvision.transforms as T

class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.size = 10000

    def __getitem__(self, index):
        # create a random 1024x1024 image
        img = Image.fromarray(np.random.randint(
            low=0,
            high=256,
            size=(input_img_size[0], input_img_size[1], 3),
            dtype=np.uint8
        ))
        # create a random label
        target = np.random.randint(low=0, high=num_classes,
                                   dtype=np.uint8).item()
        # Apply transformations
        img = self.transform(img)
        return img, target

    def __len__(self):
        return self.size

class RandomMask(torch.nn.Module):
    def __init__(self, ratio=0.25):
        super().__init__()
        self.ratio=ratio

    def dilate_mask(self, mask):
        # perform 4 neighbor dilation on mask
        from scipy.signal import convolve2d
        dilated = convolve2d(mask, [[0, 1, 0],
                                    [1, 1, 1],
                                    [0, 1, 0]], mode='same').astype(bool)
        return dilated

    def forward(self, img):
        mask = np.random.uniform(size=(img_size, img_size)) < self.ratio
        dilated_mask = torch.unsqueeze(torch.tensor(self.dilate_mask(mask)),0)
        dilated_mask = dilated_mask.expand(3,-1,-1)
        img[dilated_mask] = 0.
        return img

class ConvertColor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.A=torch.tensor(
            [[0.299, 0.587, 0.114],
             [-0.16874, -0.33126, 0.5],
             [0.5, -0.41869, -0.08131]]
        )
        self.b=torch.tensor([0.,128.,128.])

    def forward(self, img):
        img = img.to(dtype=torch.get_default_dtype())
        img = torch.matmul(self.A,img.view([3,-1])).view(img.shape)
        img = img + self.b[:,None,None]
        return img

class Scale(object):
    def __call__(self, img):
        return img.to(dtype=torch.get_default_dtype()).div(255)

transform = T.Compose(
    [T.PILToTensor(),
     T.RandomCrop(img_size),
     RandomMask(),
     ConvertColor(),
     Scale()])

train_set = FakeDataset(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           num_workers=4, pin_memory=True)

Next, we define the model, loss function, optimizer, training step, and training loop, which we wrap with a PyTorch Profiler context manager to capture performance data.

from statistics import mean, variance
from time import time

device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def train_step(model, criterion, optimizer, inputs, labels):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


model.train()

t0 = time()
times = []

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=2, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        # copy data to device
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)

        # run train step
        train_step(model, criterion, optimizer, inputs, labels)
        prof.step()
        times.append(time()-t0)
        t0 = time()
        if step >= 100:
            break

print(f'average time: {mean(times[1:])}, variance: {variance(times[1:])}')

For our experiments, we use an Amazon EC2 g5.xlarge instance (containing an NVIDIA A10G GPU and 4 vCPUs) running a PyTorch (2.6) Deep Learning AMI (DLAMI). Running our toy script in this environment results in an average throughput of 0.89 steps per second, an underwhelming GPU utilization of 22%, and the following profiling trace:

Profiling Trace of GPU Starvation (by Author)

As discussed in detail in a previous post, the profiling trace shows a clear pattern of GPU starvation, where the GPU spends most of its time waiting for data from the PyTorch DataLoader. This indicates a performance bottleneck in the data input pipeline that prevents input batches from being prepared quickly enough to keep the GPU fully occupied. Importantly, input pipeline performance issues can stem from a variety of sources. In the case of our toy example, the cause of the bottleneck is not apparent from the trace captured above.

A brief note for readers/developers who (despite all of our lecturing) remain averse to using PyTorch Profiler: the data caching-based technique we discuss below offers an alternative way of identifying GPU starvation, so don’t despair.

GPU Starvation — Finding the Root Cause

In this section, we briefly review common causes of performance bottlenecks in the data input pipeline.

Recall that in a typical model execution flow:

  1. Raw data is loaded or streamed from storage (e.g., local RAM or disk, a remote network file system, or a cloud-based object store such as Amazon S3 or Google Cloud Storage).
  2. It's then preprocessed on the CPU.
  3. Finally, the processed data is copied to the GPU for inference or training.

Correspondingly, bottlenecks can emerge at each of the following stages:

  1. Slow data retrieval: Multiple factors can limit how quickly raw data can be retrieved by the CPU, including the choice of storage backend (e.g., cloud storage vs. local SSD), the available network bandwidth, the data format, and more.
  2. CPU resource exhaustion or misuse: Preprocessing tasks, such as data augmentation, image transformations, or decompression, can be CPU-intensive. When the number or complexity of these operations exceeds the available CPU capacity, or when CPU resources are managed inefficiently (e.g., a suboptimal choice of the number of workers), a bottleneck can occur. It’s worth noting that CPUs are also responsible for other model-related tasks, such as loading GPU kernels, memory management, metric reporting, and more.
  3. Host-to-device transfer bottlenecks: Once data is processed, it must be transferred to the GPU. This can become a bottleneck if data batches are large relative to the CPU-GPU memory bandwidth, or if the memory copying is performed inefficiently (e.g., individual samples are copied rather than full batches).

The Limitation of Performance Profilers

A common way to identify data pipeline bottlenecks is by using a performance profiler. In part 4 of this series, Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard, we demonstrated how to do this using PyTorch’s built-in profiler. However, given that the input data pipeline runs on the CPU, any Python profiler could be used.

The problem with this approach is that we typically use multiple worker processes for data loading, making performance profiling particularly complex. In our previous post, we overcame this by running the data loading and the model execution in a single process (i.e., we set the num_workers argument of the DataLoader constructor to zero). However, this is a highly intrusive configuration change that can have a significant impact on the overall performance of our model.
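For reference, such a single-process configuration would look something like the following (mirroring the DataLoader definition from our toy script):

# single-process data loading for profiling purposes only; note that this
# changes the very runtime behavior we are trying to measure
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           num_workers=0, pin_memory=True)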

The caching-based method we present in this post aims to pinpoint the source of the performance bottleneck in a far less intrusive manner. Specifically, it enables us to measure the model’s performance without altering the multi-worker data-loading behavior.

Bottleneck Detection via Caching

In this section, we propose a multi-step approach for analyzing the performance of the data input pipeline. We’ll demonstrate how this method can be applied to our toy training workload to identify the causes of the GPU starvation.

Step 1: Cache a Batch on the Device

We start by creating a single input batch, copying it to the GPU, and then measuring the runtime performance of the model when iterating over just that batch. This provides a theoretical upper bound on the model’s throughput, i.e., the maximum throughput achievable when the GPU is never data-starved.

In the following code block, we modify the training loop of our toy script so that it runs on a single batch that is cached on the GPU:

data = next(iter(train_loader))
inputs = data[0].to(device=device, non_blocking=True)
labels = data[1].to(device=device, non_blocking=True)
t0 = time()
times = []
for step in range(100):
    train_step(model, criterion, optimizer, inputs, labels)
    times.append(time()-t0)
    t0 = time()

The resultant average throughput is 3.45 steps per second, nearly four times higher than our baseline result. Not only does this confirm a significant data pipeline bottleneck, it also quantifies its impact.

Bonus Tip: Profile and Optimize with Device-Cached Data
Running a profiler on a single batch cached on the GPU isolates the model execution from the input pipeline. This helps you identify inefficiencies in the model’s raw compute path. Ideally, GPU utilization here should approach 100%. In our case, utilization is around 95%, which is acceptable.
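A minimal sketch of this bonus tip, reusing the profiler configuration from our baseline script (the trace directory name is arbitrary):

# profile the model in isolation by iterating over a single device-cached batch
data = next(iter(train_loader))
inputs = data[0].to(device=device, non_blocking=True)
labels = data[1].to(device=device, non_blocking=True)

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=2, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof_cached'),
    with_stack=True
) as prof:
    for step in range(100):
        train_step(model, criterion, optimizer, inputs, labels)
        prof.step()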

Step 2: Cache a Batch on the Host (CPU)

Next, we cache a single input batch on the host (CPU) instead of the device. Now, each step includes both a memory copy from the CPU to the GPU and the model execution.

Since PyTorch’s memory pinning allows for asynchronous data transfers, we expect the host-to-device memory copy of one batch to overlap with the model execution on the previous batch. Consequently, we expect the throughput to be in the same ballpark as in the device-cached case. If not, this would be a clear indication of a bottleneck in the host-to-device memory copy.

The following block of code contains our application of this step to our toy model:

data = next(iter(train_loader))
t0 = time()
times = []
for step in range(100):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True)
    train_step(model, criterion, optimizer, inputs, labels)
    times.append(time()-t0)
    t0 = time()

The resultant throughput following this change is 3.33 steps per second, a minor drop from the previous result, indicating that the host-to-device transfer is not a bottleneck. We need to keep searching for the source of our performance bottleneck.

Steps 3 and on: Cache at Intermediate Stages in the Data Pipeline

We continue our search by “climbing” up the data input pipeline, caching at various intermediate points to pinpoint the bottleneck. The precise application of this process will vary based on the details of the pipeline. Suppose the pipeline can be broken into stages. If caching the output of an earlier stage yields significantly worse throughput than caching the output of the stage that follows it, we can deduce that the processing performed by that following stage is what’s slowing us down.

Step 3a: Cache a Single Processed Sample
In the code block below, we modify our dataset to cache one fully processed sample. This simulates a pipeline that includes the data collation and the CPU-to-GPU data copy.

class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.size = 10000
        self.cache = None

    def __getitem__(self, index):
        if self.cache is None:
            # create a random 1024x1024 image
            img = Image.fromarray(np.random.randint(
                low=0,
                high=256,
                size=(input_img_size[0], input_img_size[1], 3),
                dtype=np.uint8
            ))
            # create a random label
            target = np.random.randint(low=0, high=num_classes,
                                       dtype=np.uint8).item()
            # Apply transformations
            img = self.transform(img)
            self.cache = img, target
        return self.cache

The resultant throughput is 3.23 steps per second, still far higher than our baseline of 0.89. We still haven’t found the culprit.

Step 3b: Cache Raw Data (Before Transformation)
Next, we modify the dataset so as to cache the raw data (e.g., unprocessed image files). The input data pipeline now includes the data transformations, data collation, and the CPU-to-GPU data copy.

class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.size = 10000
        self.cache = None

    def __getitem__(self, index):
        if self.cache is None:
            # create a random 1024x1024 image
            img = Image.fromarray(np.random.randint(
                low=0,
                high=256,
                size=(input_img_size[0], input_img_size[1], 3),
                dtype=np.uint8
            ))
            # create a random label
            target = np.random.randint(low=0, high=num_classes,
                                       dtype=np.uint8).item()
            self.cache = img, target
        # Apply transformations
        img = self.transform(self.cache[0])
        return img, self.cache[1]

This time, the throughput drops sharply, all the way down to 1.72 steps per second. We have found our first culprit: the data transformation function.

Interim Results

Here’s a summary of the experiments thus far:

Caching Experiment Results (by Author)

The results point to a significant slowdown introduced by the data transformation step. The gap between the raw-data caching result and the baseline also suggests that raw data loading may be another culprit. Let’s begin with the data processing bottleneck.

Optimizing the Data Transformation

We now proceed with our newfound discovery of a performance bottleneck in the data processing function. The next logical step would be to break the function into individual components and apply our caching technique to each one in order to derive more insight into the precise sources of our GPU starvation. For the sake of brevity, we will skip ahead and apply the data processing optimizations discussed in our previous post, Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard. Please see there for details.
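For illustration only (the exact changes are described in the earlier post), one such optimization is to keep the RandomMask transform entirely in PyTorch by replacing the SciPy-based dilation with a conv2d:

import torch
import torch.nn.functional as F

class FastRandomMask(torch.nn.Module):
    def __init__(self, ratio=0.25):
        super().__init__()
        self.ratio = ratio
        # 4-neighbor dilation kernel, shaped (out_channels, in_channels, H, W)
        self.kernel = torch.tensor([[[[0., 1., 0.],
                                      [1., 1., 1.],
                                      [0., 1., 0.]]]])

    def forward(self, img):
        # random mask over the spatial dimensions of the (C, H, W) image tensor
        mask = (torch.rand(1, 1, img.shape[1], img.shape[2]) < self.ratio).float()
        # dilate the mask by convolving with the 4-neighbor kernel
        dilated = F.conv2d(mask, self.kernel, padding=1)[0] > 0.
        img[dilated.expand(img.shape[0], -1, -1)] = 0.
        return img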

Following the data transformation optimizations, the throughput of the cached raw data experiment shoots up to 3.23 steps per second. We have eliminated the bottleneck in the data processing function.

However, our new baseline throughput (without caching) becomes 1.28 steps per second, indicating that a bottleneck remains in the raw data loading. This is similar to the end result we reached in our previous post.

Throughput Following Transform Optimization (by Author)

Optimizing Raw Data Loading

To resolve the remaining bottleneck, we simulate the optimization demonstrated in part 5 of this series, How to Optimize Your DL Data-Input Pipeline with a Custom PyTorch Operator. We do this by reducing the size of our initial random image from 1024×1024 to 256×256, as sketched below. Following this change, the end-to-end (un-cached) throughput increases to 3.23 steps per second. Problem solved!
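In terms of our toy script, the simulation amounts to shrinking the random source image generated by FakeDataset (via the input_img_size global assumed in the model definition above):

# simulate faster raw-data retrieval by generating smaller source images
input_img_size = [256, 256]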

Important Caveats

We conclude with a few important notes and caveats.

  1. A drop in throughput resulting from the inclusion of a certain data-processing step in the pipeline does not necessarily mean that that specific step requires optimization. It’s entirely possible that another step has pushed CPU utilization near its limit, and the new step merely tipped it over.
  2. If your input data varies in size, the throughput measured on a single cached data sample or batch of samples may not reflect real-world performance.
  3. The same caveat applies if the AI model includes dynamic, data-dependent features, e.g., if components of the model graph depend on the input data.

Tips, Tricks, and Techniques for Addressing Bottlenecks in the Data Input Pipeline

We conclude this post with a list of tips, tricks, and techniques for optimizing the data input pipeline of PyTorch-based AI models. This list is by no means exhaustive; numerous additional optimizations exist depending on your specific use case and infrastructure. We divide the optimizations into three categories:

  • Optimizing Raw Data Access/Retrieval
  • Optimizing Data Processing
  • Optimizing Host-to-Device Data Transfer

Optimizing Raw Data Access/Retrieval

Efficient data loading starts with fast and reliable access to raw data. The following tips may help:

  • Choose an instance type with sufficient network ingress bandwidth.
  • Use a fast and cost-effective data storage solution. Local SSDs are fast but expensive. Cloud-based solutions like S3 offer scalability, but may introduce latency.
  • Maximize storage network egress. Consider partitioning datasets in S3 or tuning parallel downloads to reduce throttling.
  • Consider raw data compression. Compressing files can reduce transfer time, but watch out for the increased CPU cost of decompression.
  • Group small samples into larger files. This can reduce the overhead associated with opening and closing many files.
  • Use optimized data transfer tools. For example, s5cmd can significantly outperform the AWS CLI for bulk S3 downloads.
  • Tune data retrieval parameters. Adjusting chunk size or concurrency settings can greatly impact read performance.

Addressing Data Processing Bottlenecks

  • Tune the number of data-loading workers and the prefetch factor (see the sketch following this list).
  • Whenever possible, offload data processing to the data preparation phase.
  • Choose an instance type with an optimal CPU/GPU compute ratio.
  • Optimize the order of transformations. For example, applying a crop before blurring will be faster than blurring the full-sized image and only then cropping.
  • Leverage Python acceleration libraries. For example, Numba and JAX can speed up pure Python operations via JIT compilation.
  • Create custom PyTorch CPU operators where appropriate (e.g., see here).
  • Consider adding auxiliary CPUs (data servers) (e.g., see here).
  • Move GPU-friendly transforms to the GPU graph. Some transforms (e.g., normalization) can be performed post-loading on the GPU for better overlap.
  • Tune OS-level thread and memory configurations.
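A minimal sketch of the first tip above; the specific values are illustrative starting points, not recommendations:

# tune DataLoader parallelism and prefetching
train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    num_workers=8,       # often tuned to the number of available vCPUs
    prefetch_factor=4,   # number of batches prefetched per worker
    pin_memory=True
)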

Optimizing the Host to Device Data Copy

  • Use memory pinning and non-blocking data copies to prefetch data directly onto the GPU (see the sketch following this list). Also see the dedicated CudaDataPrefetcher offered by TorchTNT.
  • Postpone int8-to-float32 datatype conversions to the GPU to reduce the memory copy payload by a factor of four.
  • If your model uses lower-precision floats (e.g., fp16/bfloat16), cast the floats on the CPU to cut the payload in half.
  • Postpone unpacking of one-hot vectors to the GPU, i.e., keep them as label ids until the last possible moment.
  • If you have many binary values, consider using bitmasks to compress the payload. For example, if you have 8 binary maps, consider compressing them into a single uint8.
  • If your input data is sparse, consider using sparse data representations.
  • Avoid unnecessary padding. While zero-padding is a popular technique for dealing with variable-sized input samples, it can significantly increase the size of the memory copy. Consider alternative options (e.g., see here).
  • Make sure you aren’t copying data that you don’t actually need on the GPU!
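A minimal sketch combining the first two tips above, assuming a (hypothetical) pipeline in which the DataLoader yields pinned uint8 image tensors and integer labels:

# copy the compact uint8 payload asynchronously and convert to float on the GPU
for imgs_uint8, labels in train_loader:
    imgs_uint8 = imgs_uint8.to(device=device, non_blocking=True)  # 1 byte per element
    labels = labels.to(device=device, non_blocking=True)
    inputs = imgs_uint8.to(torch.float32).div_(255)  # datatype conversion on the GPU
    train_step(model, criterion, optimizer, inputs, labels)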

Summary

While GPUs are considered essential for modern-day AI/ML development, they come at a steep price. Once you’ve decided to make the necessary investment into their acquisition, you will want to ensure that they are being used as much as possible. The last thing you want is for your GPU to sit idle, waiting for input data due to a preventable bottleneck elsewhere in the pipeline.

Unfortunately, such inefficiencies are all too common. In this post, we introduced a simple technique for diagnosing these issues by iteratively caching data at different stages of the input pipeline. By isolating the runtime impact of each pipeline component, this method helps identify specific bottlenecks, whether in raw data loading, preprocessing, or host-to-device transfer.

Of course, the precise implementation will vary across projects and pipelines, but we hope this strategy provides a useful framework for diagnosing and resolving performance issues in your own AI/ML workflows.
