Optimizing Data Transfer in AI/ML Workloads


In a typical training workload, a deep learning model is executed on a dedicated GPU accelerator using input data batches it receives from a CPU host. Ideally, the GPU — the more expensive resource — should be maximally utilized, with minimal periods of idle time. In particular, this means that each time it completes its execution on a batch, the next batch should be “ripe and ready” for processing. When this doesn’t happen, the GPU idles while waiting for input data — a common performance bottleneck known as GPU starvation.

In previous posts (e.g., see A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline), we discussed common causes of this issue, including: inefficient storage retrieval, CPU resource exhaustion, and host-to-device transfer bottlenecks. In this post, we zoom in on data-transfer bottlenecks and revisit their identification and resolution — this time with the help of NVIDIA Nsight™ Systems (nsys), a performance profiler designed for analyzing the system-wide activity of workloads running on NVIDIA GPUs.

NVIDIA Nsight vs. PyTorch Profiler

Readers familiar with our work may be surprised by the mention of the NVIDIA Nsight profiler rather than PyTorch Profiler. In our previous posts we have advocated strongly for using PyTorch Profiler in AI/ML model development as a tool for identifying and optimizing runtime performance. Time and again, we have demonstrated its application to a wide range of performance issues. Its use doesn’t require any special installations and it can be run without special OS permissions. The NVIDIA Nsight profiler, on the other hand, requires a dedicated system setup (or a dedicated NVIDIA container) and — for some of its features — elevated permissions, making it less accessible and more complicated to use than PyTorch Profiler.

The two profilers differ in their focus: PyTorch Profiler is tightly coupled with PyTorch and heavily focused on how models use the PyTorch software stack and supporting libraries. The NVIDIA Nsight profiler is a system-level profiler; it doesn’t know the details of the model being run or which framework is being used, but rather how the components of the entire system are being used and utilized. While PyTorch Profiler excels at tracing the low-level operations of a PyTorch model execution, nsys provides a detailed view of the activities of the entire system (GPU hardware, CUDA streams, OS interrupts, network, PCIe, etc.). For many performance issues PyTorch Profiler is sufficient for identifying and solving the source of the bottleneck, but some situations call for the nsys profiler, the “big guns”, to derive deeper insights into the inner workings of the underlying system.

In this post we intend to demonstrate some of the unique capabilities of the nsys profiler and their application to the common data-transfer bottleneck.

Outline

To facilitate our discussion we will define a toy ML workload with a data-transfer performance bottleneck and proceed to introduce a number of successive optimizations in an attempt to solve it. Throughout the process, we will use the nsys profiler to analyze the system performance and assess the impact of our code modifications.

Setup

We will run our experiments on an Amazon EC2 g6e.2xlarge instance with an NVIDIA L40S GPU running an AWS Deep Learning (Ubuntu 24.04) AMI with PyTorch (2.8). To install the nsys CLI profiler (version 2025.6.1) we follow the official NVIDIA guidelines:

wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb
sudo apt install ./NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb

The NVIDIA Tools Extension (NVTX) library allows us to annotate our code with human-readable labels to increase the readability and comprehension of the performance trace. While PyTorch offers built-in NVTX support via its torch.cuda.nvtx APIs, we will use the standalone nvtx package (version 0.2.14), which supports color-coding the trace timeline for better visual analysis:

pip install nvtx
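
For illustration, this is the annotation pattern we will rely on throughout the post (a minimal sketch of our own; the labels and colors here are arbitrary):

import nvtx

# Each annotated range appears as a labeled, color-coded block on the NVTX
# row of the Nsight Systems timeline.
with nvtx.annotate("data prep", color="red"):
    ...  # CPU-side work goes here

with nvtx.annotate("train step", color="green"):
    ...  # GPU-bound work goes here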

Disclaimers

The code we will share is intended for demonstration purposes; please do not rely on its correctness or optimality. Please do not interpret our use of any library, tool, or platform as an endorsement of its use. The impact of the optimizations we will cover can vary greatly based on the details of the model and the runtime environment. Please be sure to assess their effect on your own use case before adopting them.

Many thanks to Yitzhak Levi and Gilad Wasserman for their contributions to this post.

A Toy PyTorch Model

We introduce a training script intentionally designed to contain a bottleneck in the data-input pipeline.

In the code block below we define a simple image classification model with a ResNet-18 backbone.

import time, torch, torchvision

DEVICE = "cuda"
model = torchvision.models.resnet18().to(DEVICE).train()
optimizer = torch.optim.Adam(model.parameters())

Next, we define a synthetic dataset which we will use to train our toy model.

from torch.utils.data import Dataset, DataLoader

WARMUP_STEPS = 10
PROFILE_STEPS = 3
COOLDOWN_STEPS = 1
TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
BATCH_SIZE = 64
TOTAL_SAMPLES = TOTAL_STEPS * BATCH_SIZE
IMG_SIZE = 512

# A synthetic Dataset with random images and labels
class FakeDataset(Dataset):

    def __len__(self):
        return TOTAL_SAMPLES

    def __getitem__(self, index):
        img = torch.randn((3, IMG_SIZE, IMG_SIZE))
        label = torch.tensor(index % 10)
        return img, label

train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE
)

Lastly, we define a standard training loop programmed to run the nsys profiler for three steps using the torch.cuda.profiler.start and stop commands — intended for use in conjunction with the nsys CLI. We highlight the components of the training step using the nvtx.annotate utility. Please refer to the official documentation for more details on profiling PyTorch with nsys.

import nvtx
from torch.cuda import profiler

def copy_data(batch):
    data, targets = batch
    data_gpu = data.to(DEVICE)
    targets_gpu = targets.to(DEVICE)
    return data_gpu, targets_gpu


def compute_step(model, batch, optimizer):
    data, targets = batch
    output = model(data)
    loss = torch.nn.functional.cross_entropy(output, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss


data_iter = iter(train_loader)

for i in range(TOTAL_STEPS):

    if i == WARMUP_STEPS:
        # start nsys profiler
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        # stop nsys profiler
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()

    with nvtx.annotate(f"Batch {i}", color="blue"):
        with nvtx.annotate("get batch", color="red"):
            batch = next(data_iter)
        with nvtx.annotate("copy batch", color="yellow"):
            batch = copy_data(batch)
        with nvtx.annotate("Compute", color="green"):
            compute_step(model, batch, optimizer)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

We run our script using the --capture-range=cudaProfilerApi option to start and stop the profiler programmatically. Please see the official documentation for full details on profiling from the nsys CLI.

nsys profile \
  --capture-range=cudaProfilerApi \
  --trace=cuda,nvtx,osrt \
  --output=baseline \
  python train.py

This results in a baseline.nsys-rep trace file that we copy over to our development machine for analysis.

In order to draw a comparison to PyTorch Profiler, we define an alternative training loop programmed with PyTorch Profiler and annotated with the torch.profiler.record_function utility:

from torch.profiler import (
    profile, record_function, schedule, tensorboard_trace_handler
)

with profile(
    schedule=schedule(wait=0, warmup=WARMUP_STEPS, 
                      active=PROFILE_STEPS, repeat=1),
    on_trace_ready=tensorboard_trace_handler('./baseline'),
    record_shapes=True,
    with_stack=True
) as prof:
    for i in range(TOTAL_STEPS):
        with record_function("get batch"):
            batch = next(data_iter)
        with record_function("copy batch"):
            batch = copy_data(batch)
        with record_function("compute"):
            compute_step(model, batch, optimizer)
        prof.step()

The throughput of our baseline experiment is 2.97 steps per second. In the following sections we will use the profile traces to identify performance bottlenecks in our training step and try to improve on this result.

Baseline Performance Analysis

To analyze the resulting nsys trace file, we open it in the Nsight Systems GUI application. In the image below we zoom in on the timeline of two of the training steps captured by the profiler:

Baseline Nsight Systems Profiler Trace (by Author)

The trace contains a wealth of information, only a subset of which we will touch on in this post. Please see the nsys documentation for additional functionalities and features.

The timeline is divided into two parts: the CUDA section, which reports GPU activity, and the threads section, which reports CPU activity. The CUDA section makes a clear distinction between the GPU kernel (compute) activity (90.9%) and memory activity (9.1%). The top bars in each section report the utilization of each of the resources, and both sections include an NVTX row with the colored annotations we included in our training step. We note the following observations:

  1. The GPU is idle for roughly 50% of each training step. This can be seen from the portion of time taken by each batch (in blue) in the GPU NVTX bar and the large blocks of whitespace in between them.
  2. The GPU activity for each batch starts immediately after the “get batch” activity has completed on the CPU. It begins with the host-to-device memory copy, marked in light green, and continues with the kernel computations, marked in light blue.
  3. Once the CPU has launched the GPU memory and compute commands for batch N, it proceeds to the next batch in the training loop — resulting in a partial overlap of batch N+1 on the CPU with batch N on the GPU.
  4. The vast majority of the CPU thread’s time is spent on the “get batch” activity. This constitutes the primary bottleneck in our baseline experiment.

The profiling trace points to a clear culprit — the dataloader. By default, PyTorch performs single-process data loading — a single CPU process is used to load the next data input batch, copy it to the GPU, and launch the compute kernels — all in a sequential manner. This typically results in severe under-utilization of the CPU resources by: 1) limiting dataloading to just a single process, and 2) making the loading of the next batch contingent on the completion of the CPU processing (i.e., kernel launching) of the previous batch. Our irresponsible use of the CPU resources has resulted in our GPU being starved for input data.

The same conclusion could have been reached using the PyTorch Profiler trace shown below:

Baseline PyTorch Profiler Trace (by Author)

Here too, we can see long periods of GPU underutilization that are caused by the long “get batch” blocks on the CPU side.

Optimization 1: Multi-Process Data Loading

The first step is to modify the data input pipeline to use multi-process data loading. We set the number of workers to match the 8 vCPUs available on our Amazon EC2 g6e.2xlarge instance. In a real-world scenario, this value should be tuned for optimal throughput:

NUM_WORKERS = 8

train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS
)
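
The best value for num_workers depends on the available CPU budget and the per-sample processing cost. A crude way to tune it, sketched below using the components defined earlier in our script (our own illustration, not part of the original experiment), is to time a short run for each candidate value and keep the fastest:

import time
import torch
from torch.utils.data import DataLoader

def measure_throughput(num_workers, steps=10):
    # reuses FakeDataset, BATCH_SIZE, model, optimizer, copy_data, and
    # compute_step from the training script above
    loader = DataLoader(FakeDataset(), batch_size=BATCH_SIZE,
                        num_workers=num_workers)
    data_iter = iter(loader)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        batch = copy_data(next(data_iter))
        compute_step(model, batch, optimizer)
    torch.cuda.synchronize()
    return steps / (time.perf_counter() - start)

for num_workers in (0, 2, 4, 8):
    print(f"{num_workers} workers: {measure_throughput(num_workers):.2f} steps/sec")

In our experiments we simply proceed with 8 workers, matching the available vCPUs.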

Following this change our throughput jumps to 4.81 steps per second — a 62% improvement over our baseline result. The corresponding nsys profiler trace is shown below:

Multiproc Dataloading Nsight Systems Profiler Timeline (by Author)

Note that the red “get batch” segment has become only a tiny sliver of each step in the NVTX bar. Instead, the yellow “copy batch” block now takes center stage. As a result of our use of multi-process dataloading, there is now always a new batch ready for processing — but can we do better?

Taking a closer look at the GPU section we see that there is still a significant portion (~290 milliseconds) of idle time in between the memory operation and the kernel compute. This idle time is perfectly aligned with a “munmap” operation in the OS runtime bar. The “munmap” block is a CPU-side memory cleanup operation performed just after the CUDA memory copy is complete. It occurs at the tail end of the long yellow “copy batch” operation. The compute kernels are launched onto the GPU only after the memory cleanup has completed. This is a clear pattern of synchronous host-to-device memory copy: the CPU cannot proceed with kernel launching until the data copy operation has fully completed, and the GPU stays idle until the CPU launches the kernels.

The PyTorch Profiler trace shows the same GPU idle time but it doesn’t provide the same “munmap” hint. This is our first example of the benefit of the system-wide visibility of the nsys profiler.

Multiproc Dataloading PyTorch Profiler Trace (by Author)
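
To better understand the cost of the synchronous copy from pageable memory, the following standalone sketch (our own addition, not part of the training script) measures how long the CPU thread is blocked by a host-to-device copy of a single batch from pageable versus pinned memory. The pinned variant previews the fix applied in the next section:

import time
import torch

batch_pageable = torch.randn(64, 3, 512, 512)             # regular (pageable) host tensor
batch_pinned = torch.randn(64, 3, 512, 512).pin_memory()  # page-locked host tensor

def cpu_blocking_time(tensor, non_blocking):
    # measures how long the .to() call blocks the calling CPU thread
    torch.cuda.synchronize()
    start = time.perf_counter()
    tensor.to("cuda", non_blocking=non_blocking)
    return time.perf_counter() - start

print(f"pageable, blocking copy:   {cpu_blocking_time(batch_pageable, False) * 1000:.1f} ms")
print(f"pinned, non-blocking copy: {cpu_blocking_time(batch_pinned, True) * 1000:.1f} ms")
torch.cuda.synchronize()  # wait for the in-flight asynchronous copy to complete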

With our finding of the data-copy performance bottleneck in hand, we proceed to our next optimization.

Optimization 2: Asynchronous Data Transfer

The solution to the bottleneck we have found is to program our training step to load data asynchronously. This allows the CPU to launch the compute kernels immediately after sending the memory copy command — without waiting for the memory copy to complete. This way the GPU can begin processing the kernels as soon as the CUDA memory copy is finished. Enabling asynchronous data copy requires two changes: first, we must program the dataloader to use pinned memory (instead of pageable memory), and second, we must pass the non_blocking=True argument to the to() operations:

NUM_WORKERS = 8
ASYNC_DATATRANSFER = True


train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    pin_memory=ASYNC_DATATRANSFER
)

def copy_data(batch):
    data, targets = batch
    data_gpu = data.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
    targets_gpu = targets.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
    return data_gpu, targets_gpu

Using asynchronous dataloading results in a throughput of 5.91 steps per second — an additional 23% improvement and a 99% improvement overall. The resulting profiling trace is shown below:

Async Dataloading Nsight Systems Profiler Timeline (by Author)

We now see all of the CPU operations bunched together at the beginning of the trace. We have removed all performance obstacles on the CPU side, allowing it to freely load the data and kernels to the GPU. In the GPU section, we see continuous activity without any idle time. We do, however, see a clear separation between CUDA memory activities (in light green) and CUDA kernel activities (in light blue). PyTorch Profiler, in contrast, doesn’t make this distinction clear. This is another advantage of the hardware-centric profiler and, in the case of our toy experiment, is what informs the next steps of our optimization.

Async Dataloading PyTorch Profiler Trace (by Author)

Optimization 3: Pipelining With CUDA Streams

Our final optimizations derive from the fact that modern GPUs, such as the NVIDIA L40S, use independent engines for copying memory (the DMA engines) and for executing compute kernels (the SMs). We can take advantage of this by parallelizing the distinct memory and kernel activities we saw in the nsys profiler trace. We will program this through the use of CUDA streams.

In a previous post, we expanded on the opportunity for optimizing AI/ML workloads using CUDA Streams. Here, we apply a similar pipelining strategy: we define two distinct “copy” and “compute” CUDA streams and program the “copy” stream to copy batch N+1 at the same time that the “compute” stream is processing batch N:

# define two CUDA streams
compute_stream = torch.cuda.Stream()
copy_stream = torch.cuda.Stream()


# extract first batch
next_batch = next(data_iter)
with torch.cuda.stream(copy_stream):
    next_batch = copy_data(next_batch)

for i in range(TOTAL_STEPS):

    if i == WARMUP_STEPS:
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()

    with nvtx.annotate(f"Batch {i}", color="blue"):
        # wait for copy stream to finish copy of batch N
        compute_stream.wait_stream(copy_stream)
        batch = next_batch

        # prefetch and copy batch N+1 on the copy stream
        try:
            with nvtx.annotate("get batch", color="red"):
                next_batch = next(data_iter)
            with torch.cuda.stream(copy_stream):
                with nvtx.annotate("copy batch", color="yellow"):
                    next_batch = copy_data(next_batch)
        except StopIteration:
            # reached end of dataset
            next_batch = None

        # execute model on batch N on the compute stream
        with torch.cuda.stream(compute_stream):
            with nvtx.annotate("Compute", color="green"):
                compute_step(model, batch, optimizer)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

This optimization results in a throughput of 6.44 steps per second — a 9% improvement over our previous experiment. We note that the impact of this optimization is capped by the duration of the longer of the two operation types. In our previous profile trace, the memory block took 15.5 milliseconds and the kernel block took 155 milliseconds. In the current profile trace, the entire GPU step takes 155 milliseconds, which means that the memory copy time is completely hidden by the kernel compute time and that our optimization achieves the maximum possible result.
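
As a quick sanity check, we can back this out of the durations reported by the profiler (a small worked calculation of our own):

# With perfect copy/compute overlap, the step time is bound by the longer
# of the two GPU activities measured in the previous trace.
copy_ms = 15.5      # host-to-device memory copy per batch
compute_ms = 155.0  # forward/backward/optimizer kernels per batch

step_ms = max(copy_ms, compute_ms)
print(f"Theoretical max throughput: {1000 / step_ms:.2f} steps/sec")  # ~6.45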

The use of CUDA streams and its impact on GPU utilization can be seen in the traces of both profilers:

Pipelined Nsight Systems Profiler Timeline (by Author)
Pipelined PyTorch Profiler Trace (by Author)

Optimization 4: Prefetching to CUDA

For our final step, we move the data copying from the main training loop to the data loading stage: rather than explicitly calling the copy function inside the training loop, we assume that the batches returned from the data iterator are already placed on the GPU.

In the code block below, we wrap our dataloader with a CUDA-prefetching iterator class. Note that this is a simplified implementation intended for demonstration purposes. More work may be required for more complex scenarios (e.g., DDP training). Alternatively, you may consider a third-party implementation such as torchtnt.utils.data.data_prefetcher.CudaDataPrefetcher:

class DataPrefetcher:
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self.preload()

    def preload(self):
        try:
            data, targets = next(self.loader)

            with torch.cuda.stream(self.stream):
                with nvtx.annotate("copy batch", color="yellow"):
                    next_data = data.to(DEVICE, non_blocking=True)
                    next_targets = targets.to(DEVICE, non_blocking=True)
            self.next_batch = (next_data, next_targets)        
        except StopIteration:
            self.next_batch = (None, None)

    def __iter__(self):
        return self

    def __next__(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data, targets = self.next_batch
        self.preload()
        return data, targets


data_iter = DataPrefetcher(train_loader)

for i in range(TOTAL_STEPS):
    if i == WARMUP_STEPS:
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()

    with nvtx.annotate(f"Batch {i}", color="blue"):
        with nvtx.annotate("get batch", color="red"):
            batch = next(data_iter)
        with nvtx.annotate("Compute", color="green"):
            loss = compute_step(model, batch, optimizer)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

This optimization results in a throughput of 6.44 steps per second — the same as our previous experiment. This shouldn’t surprise us since we have already seen that the throughput is bound by the 155 millisecond GPU compute and our optimization has not done anything to reduce the kernel compute time.

More generally, despite the removal of the copy call from the main loop, you will have a hard time finding a situation where this change has a meaningful impact on performance, since the copy is already performed asynchronously. However, given the minimal changes to the training loop, you may find this solution cleaner and/or more applicable for use with high-level libraries that don’t enable fine-grained control of the training loop.
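
For reference, a torchtnt-based variant might be wired in roughly as follows. This is a hedged sketch: the constructor arguments reflect our reading of the torchtnt API and should be verified against the torchtnt documentation before use:

import torch
from torchtnt.utils.data.data_prefetcher import CudaDataPrefetcher

# Wrap the existing DataLoader so that batches arrive already placed on the GPU.
# The argument names below are our assumption -- please check the torchtnt docs.
prefetching_loader = CudaDataPrefetcher(
    data_iterable=train_loader,
    device=torch.device("cuda"),
    num_prefetch_batches=1,
)
data_iter = iter(prefetching_loader)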

Unsurprisingly, the profile traces for this experiment appear nearly identical to the previous ones. The main difference is the position of the yellow “copy batch” block in the NVTX row of the CPU section.

Data Prefetching Nsight Systems Profiler Timeline (by Author)
Data Prefetching PyTorch Profiler Trace (by Author)

Results

The table below summarizes the results of our experiments:

Experiment Results (by Author)

The optimizations, which were driven by the use of the Nsight Systems profiler, resulted in an overall improvement of roughly 117% (from 2.97 to 6.44 steps per second) in runtime performance.

Summary

GPU starvation is a common performance bottleneck that can have a devastating impact on the efficiency and costs of AI/ML workloads. In this post, we demonstrated how to use the Nsight Systems profiler to study the causes of a performance bottleneck and take informed steps towards its resolution. Along the way, we emphasized the unique capabilities of the Nsight Systems profiler compared to the built-in, framework-centric PyTorch Profiler — specifically its deep system-level visibility.

Our focus in this post has been on the host-to-device data copy that typically occurs at the beginning of the training step. However, data-transfer bottlenecks can appear at different stages of training. In a sequel to this post we intend to repeat our nsys profiling analysis on data copies going in the opposite direction — from the device to the host. Stay tuned!
