Optimizing Data Transfer in Distributed AI/ML Training Workloads


This post is part of a series on optimizing data transfer using the NVIDIA Nsight™ Systems (nsys) profiler. Part one focused on CPU-to-GPU data copies, and part two on GPU-to-CPU copies. In this post, we turn our attention to data transfer between GPUs.

Nowadays, it is quite common for AI/ML training, particularly of huge models, to be distributed across multiple GPUs. While there are various schemes for performing such distribution, what all of them have in common is their reliance on the constant transfer of information, such as gradients, weights, statistics, and/or metrics, between the GPUs throughout training. As with the other types of data transfer we analyzed in our previous posts, here too, a poor implementation could easily result in under-utilization of compute resources and the unjustified inflation of training costs. Optimizing GPU-to-GPU communication is an active area of research and innovation involving both hardware and software development.

In this post, we'll focus on the most common form of distributed training: data-distributed training. In data-distributed training, identical copies of the ML model are maintained on each GPU. Each input batch is evenly divided among the GPUs, each of which executes a training step to calculate its local gradients. The local gradients are then shared and averaged across the GPUs, resulting in an identical gradient update to each of the model copies. Using the NVIDIA Nsight™ Systems (nsys) profiler, we'll analyze the effect of the GPU-to-GPU transfer of the gradients on the runtime performance of training a toy model and assess a few techniques for reducing its overhead.

Disclaimers

The code we'll share in this post is intended for demonstration purposes; please do not rely on its accuracy or optimality. Please do not interpret our mention of any tool, framework, library, service, or platform as an endorsement of its use.

Thanks to Yitzhak Levi for his contributions to this post.

Instance Selection for Distributed Training

In a previous post, Instance Selection for Deep Learning, we discussed the importance of selecting an instance type that is best suited to your AI/ML workload and the potential impact of that selection on the success of your project. When selecting an instance type for a workload that includes a lot of GPU-to-GPU traffic, you'll want to pay attention to how such communication is carried out, including: the instance topology, GPU interconnects, maximal throughput, and latency.

In this post, we'll limit our discussion to distribution between GPUs on a single instance. We'll experiment with two instance types: an Amazon EC2 g6e.48xlarge with 8 NVIDIA L40S GPUs and an Amazon EC2 p4d.24xlarge with 8 NVIDIA A100 GPUs. Each will run an AWS Deep Learning (Ubuntu 24.04) AMI with PyTorch (2.8), the nsys-cli profiler (version 2025.6.1), and the NVIDIA Tools Extension (NVTX) library.

One of the primary differences between the two instance types is how the GPUs are connected: On the g6e.48xlarge, communication between the GPUs is over PCI Express (PCIe), whereas the p4d.24xlarge includes NVIDIA NVLink™, dedicated hardware for enabling high-throughput GPU-to-GPU communication. Communication over the PCIe bus is significantly slower than NVLink. While this may be sufficient for workloads with a low communication-to-compute ratio, it could be a performance killer for workloads with high communication rates.

To discover the topology of the instance types, we run the following command:

nvidia-smi topo -m

On the g6e.48xlarge, we get the following results:

GPU Topology of g6e.48xlarge (by Author)

GPUs on the same NUMA node are connected by a “NODE” link and GPUs on different NUMA nodes by a “SYS” link. Both link types traverse the PCIe bus as well as another interconnect; neither is a direct connection (a.k.a. a “PIX” link). We'll see later on how this impacts throughput performance.

On the p4d.24xlarge, every pair of GPUs is linked by a dedicated NVLink connection:

GPU Topology of p4d.24xlarge (by Author)

A Toy Model

To facilitate our discussion, we construct a toy data-distributed training experiment.

We choose a model with a relatively high communication-to-compute ratio, a Vision Transformer (ViT)-L/32 image-classification model with roughly 306 million parameters, and a synthetic dataset that we'll use to train it:

import os, time, torch, nvtx
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import Dataset, DataLoader
from torchvision.models import vit_l_32

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))
BATCH_SIZE = 32
IMG_SIZE = 224
WARMUP_STEPS = 10
PROFILE_STEPS = 3
COOLDOWN_STEPS = 1
TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
N_WORKERS = 8

def get_model():
    return vit_l_32(weights=None)

# A synthetic dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return TOTAL_STEPS * BATCH_SIZE * WORLD_SIZE

    def __getitem__(self, index):
        img = torch.randn((3, IMG_SIZE, IMG_SIZE))
        label = torch.randint(0, 1000, (1,)).item()
        return img, label

def get_data_iter(rank, world_size):
    dataset = FakeDataset()

    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True)

    train_loader = DataLoader(dataset, batch_size=BATCH_SIZE,
                              sampler=sampler, num_workers=N_WORKERS, 
                              pin_memory=True)

    return iter(train_loader)

We define a utility function that we'll use to set up PyTorch DistributedDataParallel (DDP) training. We configure the PyTorch distributed communication package to use the NVIDIA Collective Communications Library (NCCL) as its communication backend and wrap the model in a DistributedDataParallel container.

def configure_ddp(model, rank):
    dist.init_process_group("nccl")
    model = DDP(model,
                device_ids=[rank],
                bucket_cap_mb=2000)
    return model

Note that we naively configure the DDP bucket capacity to 2 GB. DDP groups the model's gradients into buckets and performs gradient reduction of each bucket in a separate call. The implication of setting the bucket capacity to 2 GB is that all of the ~306 million (FP32) gradients will fit in a single bucket (306 million x 4 bytes per gradient = ~1.22 GB) and reduction will only occur once all gradients have been calculated.
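As a quick sanity check of this arithmetic, we can count the model parameters and derive the size of the FP32 gradient payload directly (a minimal standalone sketch; the exact count may vary slightly across torchvision versions):

from torchvision.models import vit_l_32

model = vit_l_32(weights=None)
n_params = sum(p.numel() for p in model.parameters())
grad_bytes = n_params * 4  # FP32 gradients are 4 bytes each

print(f"parameters: {n_params / 1e6:.0f}M")            # ~306M
print(f"gradient payload: {grad_bytes / 1e9:.2f} GB")  # ~1.22 GB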

As in our previous posts, we schedule nsys profiler programmatically and wrap different portions of the training step with color-coded NVTX annotations:

def train(use_ddp=False):
    # detect the env vars set by torchrun
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = get_model().to(local_rank)
    criterion = nn.CrossEntropyLoss().to(local_rank)

    if use_ddp:
        model = configure_ddp(model, rank)

    optimizer = optim.SGD(model.parameters())

    data_iter = get_data_iter(rank, WORLD_SIZE)

    model.train()

    for i in range(TOTAL_STEPS):

        # Schedule Profiling
        if i == WARMUP_STEPS:
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            torch.cuda.profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            torch.cuda.synchronize()
            torch.cuda.profiler.stop()
            end_time = time.perf_counter()

        with nvtx.annotate(f"Batch {i}", color="blue"):

            with nvtx.annotate("get batch", color="red"):
                data, target = next(data_iter)
                data = data.to(local_rank, non_blocking=True)
                target = target.to(local_rank, non_blocking=True)
            with nvtx.annotate("forward", color="green"):
                output = model(data)
                loss = criterion(output, target)

            with nvtx.annotate("backward", color="purple"):
                optimizer.zero_grad()
                loss.backward()

            with nvtx.annotate("optimizer step", color="yellow"):
                optimizer.step()

    if use_ddp:
        dist.destroy_process_group()

    if rank == 0:
        total_time = end_time - start_time
        print(f"Throughput: {PROFILE_STEPS/total_time:.2f} steps/sec")


if __name__ == "__main__":
    # enable ddp if run with torchrun
    train(use_ddp="RANK" in os.environ)

We set the NCCL_DEBUG environment variable for visibility into how NCCL sets up the transport links between the GPUs.

export NCCL_DEBUG=INFO
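If the full INFO output proves too noisy, NCCL also supports narrowing the log to specific subsystems via NCCL_DEBUG_SUBSYS (the subsystem list below is just one reasonable choice):

export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV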

Single-GPU Performance

We start our experimentation by running our script on a single GPU without DDP:

nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=stop \
  --trace=cuda,nvtx,osrt,nccl \
  --output=baseline \
  python train.py

Note the inclusion of nccl in the --trace list; this will be critical for analyzing the GPU-to-GPU communication in the multi-GPU experiments.

The throughput of the baseline experiment is 8.91 steps per second on the L40S GPU and 5.17 steps per second on the A100: The newer NVIDIA Ada Lovelace architecture performs better than the NVIDIA Ampere architecture on our toy model. In the image below, we show the timeline view for the L40S experiment.

Single L40S GPU Nsight Systems Trace (by Author)

In this post we'll focus on the CUDA portion of the timeline. In the NVTX row, we see the recurring red (CPU-to-GPU copy), green (forward), purple (backward), and yellow (optimizer update) pattern that makes up our train step. Note, however, that the purple block appears as only a tiny blip while, in practice, the backward pass fills the entire gap between the forward and optimizer blocks: The NVTX library appears to have captured only the initial launch of the autograd graph.

We’ll use this trace as a comparative baseline for our next experiments.

DDP with 1 GPU

We assess the impact of the DDP wrapper by running torchrun on a single GPU:

torchrun --nproc_per_node=1 \
    --no-python \
    nsys profile \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    --trace=cuda,nvtx,nccl,osrt \
    --output=ddp-1gpu \
    python train.py

The resulting throughput drops to 8.40 steps per second on the L40S GPU and 5.04 steps per second on the A100. Even in the absence of any cross-GPU communication, DDP introduces overhead that can decrease throughput by ~3–7%. The essential lesson from this is to always make sure that single-GPU training is run without DDP.

The nsys trace of the L40S experiment confirms the presence of the DDP overhead:

Single L40S GPU With DDP – Profiler Trace (by Author)

The main change from the training step in the previous trace is a large chunk of device-to-device memory copies at the end of the backward block and just before the optimizer block (highlighted in the trace above). This is DDP in action: Even in the absence of a cross-GPU gradient reduction, DDP prepares the local gradients in a dedicated memory block for reduction and then copies the results back into the grad property of each parameter in preparation for the gradient update. (Note that the copies are between two memory locations on the same GPU, not between two different GPUs.)

DDP With Multiple GPUs

Next, we assess the impact of gradient sharing between 2, 4, and 8 GPUs:

torchrun --nproc_per_node=8 \
    --no-python \
    nsys profile \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    --trace=cuda,nvtx,nccl,osrt \
    --output=ddp-8gpu_%q{RANK} \
    python train.py

The table below captures the impact on the training throughput on the g6e.48xlarge and the p4d.24xlarge:

Throughput Performance of DDP Training (by Author)

DDP training performance plummets on the g6e.48xlarge instance: The 8-GPU throughput is more than 6 times lower than the single-GPU result. The NCCL logs include multiple lines describing the communication paths between the GPUs, e.g.:

NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct

This indicates that the data transfer between each pair of GPUs passes through CPU shared memory, which greatly limits the communication bandwidth.
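One quick way to confirm the absence of direct peer-to-peer access from Python is to query PyTorch's CUDA utilities (a minimal sketch; it checks P2P capability only, not the actual transport NCCL ends up choosing):

import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and not torch.cuda.can_device_access_peer(i, j):
            print(f"no direct P2P access between GPU {i} and GPU {j}")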

On the p4d.24xlarge, in contrast, where the NVLink connections allow for direct peer-to-peer (P2P) communication, the 8-GPU throughput is just 8% lower than the baseline result:

NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read

Although each individual L40S outperforms the A100 on our toy model by 72%, the inclusion of NVLink makes the p4d.24xlarge a better choice than the g6e.48xlarge for running DDP over 8 GPUs.

The data transfer bottleneck on the g6e.48xlarge can be easily seen in the nsys trace. Here we use the “multiple view” option to display the activity of all of the DDP processes in a single timeline:

8-GPU DDP on g6e.48xlarge – Profiler Trace (by Author)

The gradient reduction occurs in the all-reduce call on the NCCL row. This comes just after the local gradient calculation has completed and just before the memory operation discussed above that copies the reduced gradients back to the grad field of each parameter. On the g6e.48xlarge, the NCCL operation dominates the backward pass, accounting for about 84% of the overall step time (587 out of 701 milliseconds).
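A rough back-of-the-envelope calculation puts this in perspective. Assuming the standard ring all-reduce cost model, in which each GPU transfers roughly 2*(N-1)/N times the payload, the 587 ms spent reducing the ~1.22 GB gradient bucket corresponds to only a few GB/s of effective bandwidth:

payload_gb = 1.22         # size of the single FP32 gradient bucket
world_size = 8
allreduce_time_s = 0.587  # NCCL all-reduce duration observed in the trace

# ring all-reduce moves ~2*(N-1)/N of the payload through each link
traffic_gb = payload_gb * 2 * (world_size - 1) / world_size
print(f"effective bus bandwidth: {traffic_gb / allreduce_time_s:.1f} GB/s")

The result is well below the bandwidth available over NVLink on the p4d.24xlarge, consistent with the shared-memory transport reported in the NCCL logs.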

On the p4d.24xlarge, the NCCL call takes up a much smaller portion of the training step:

8-GPU DDP on p4d.24xlarge – Profiler Trace (by Writer)

In the next sections we'll discuss a few DDP optimization techniques and assess their impact on runtime performance. We'll limit our scope to PyTorch-level optimizations, leaving other methods (e.g., NCCL/OS/network tuning) for another post.

A general approach for dealing with communication bottlenecks is to reduce the frequency of communication. In DDP workloads this can be achieved by increasing the data batch size (i.e., increasing the compute per step) or, if the memory capacity forbids this, by applying gradient accumulation: instead of applying the gradient reduction and update every step, accumulate the local gradients for N steps and apply the reduction every Nth step, as sketched below. Both techniques increase the effective batch size, i.e., the number of overall samples between gradient updates, which might impact model convergence and may require tuning of the optimization parameters. In our discussion we assume the effective batch size is fixed.
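Here is a minimal sketch of the gradient-accumulation option, reusing the names from our training script; it relies on DDP's no_sync() context manager to skip the gradient all-reduce on all but the last micro-step of each accumulation window (ACCUM_STEPS is an illustrative value, and this loop was not part of our experiments):

import contextlib

ACCUM_STEPS = 4  # illustrative accumulation window

for i in range(TOTAL_STEPS):
    data, target = next(data_iter)
    data = data.to(local_rank, non_blocking=True)
    target = target.to(local_rank, non_blocking=True)

    # suppress DDP's all-reduce on all but the final micro-step
    sync_ctx = contextlib.nullcontext() if (i + 1) % ACCUM_STEPS == 0 \
        else model.no_sync()
    with sync_ctx:
        loss = criterion(model(data), target) / ACCUM_STEPS
        loss.backward()

    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()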

We'll cover four techniques. The first two focus on reducing DDP overhead. The final two directly address the data transfer bottleneck.

Optimization 1: Static Graph Declaration

Our first change is to pass static_graph=True to the DDP container. Generally speaking, models might include parameters for which gradients are not calculated at every step, known as “unused parameters” in the DDP documentation. This is common in models with conditional logic. DDP includes dedicated mechanisms for identifying and handling unused parameters. In the case of our toy model, all of the gradients are calculated in each step; our graph is “static”. Explicitly declaring our graph as static reduces the overhead associated with handling a dynamic gradient set.

def configure_ddp(model, rank):
    dist.init_process_group("nccl")
    model = DDP(model,
                device_ids=[rank],
                static_graph=True,
                bucket_cap_mb=2000)
    return model

In the case of our toy model, the impact of this change is negligible on both the g6e.48xlarge and the p4d.24xlarge. Without further delay, we proceed to the next optimization.

Optimization 2: Increase Memory Efficiency

Our second technique addresses the large chunk of memory copying we identified above. Instead of copying the gradients to and from the NCCL communication memory blocks, we can explicitly set the parameter gradients to point directly to the NCCL communication buffers. Consequently, the same memory is used to store the local gradients, to perform the gradient reduction, and to apply the gradient updates. This is configured by passing gradient_as_bucket_view=True to the DDP container:

def configure_ddp(model, rank):
    dist.init_process_group("nccl")
    model = DDP(model,
                device_ids=[rank],
                static_graph=True,
                gradient_as_bucket_view=True,
                bucket_cap_mb=2000)
    return model

In the trace below, captured on the p4d.24xlarge, we no longer see the block of memory copies between the all-reduce and the (yellow) optimizer steps:

Memory-Optimized DDP on p4d.24xlarge – Profiler Trace (by Author)

In the case of our toy example, this optimization boosts performance by a modest 1%.

Optimization 3: Gradient Compression

A common approach for addressing communication bottlenecks is to apply compression algorithms to reduce the size of the payload. PyTorch DDP provides a number of dedicated communication hooks that automate compression of the gradients before NCCL reduction and decompression afterward. In the code block below, we apply bfloat16 gradient compression:

import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as ddp_hks

def configure_ddp(model, rank):
    dist.init_process_group("nccl")
    model = DDP(model,
                device_ids=[rank],
                static_graph=True,
                gradient_as_bucket_view=True,
                bucket_cap_mb=2000)

    model.register_comm_hook(state=None, hook=ddp_hks.bf16_compress_hook)
    return model
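For reference, a DDP communication hook is simply a callable that receives the gradient bucket and returns a future holding the reduced tensor. The sketch below mirrors, in simplified form, what the built-in bf16_compress_hook does; it is not the exact library implementation:

import torch
import torch.distributed as dist

def bf16_hook_sketch(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()

    # cast the flattened gradient bucket to bfloat16 and pre-divide by the world size
    compressed = bucket.buffer().to(torch.bfloat16).div_(world_size)

    # launch the all-reduce asynchronously on the compressed tensor
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # copy the reduced result back into the FP32 bucket buffer
        buffer = bucket.buffer()
        buffer.copy_(fut.value()[0])
        return buffer

    return fut.then(decompress)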

On the heavily bottlenecked g6e.48xlarge instance, bfloat16 compression leads to a substantial 65% speed-up! The nsys trace shows the reduced-size NCCL call as well as the newly introduced compression operations:

BF16-Compressed DDP on g6e.48xlarge – Profiler Trace (by Author)

On the p4d.24xlarge, the overhead of the compression operations outweighs the gains in communication speed, resulting in an overall reduction in throughput:

BF16-Compressed DDP on p4d.24xlarge – Profiler Trace (by Author)

PyTorch offers a more aggressive compression algorithm than bfloat16, called PowerSGD. Below we present a naive usage; in practice this would require a fair amount of tuning. Please see the documentation for details:

from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook 

def configure_ddp(model, rank):
    dist.init_process_group("nccl")
    model = DDP(model,
                device_ids=[rank],
                static_graph=True,
                gradient_as_bucket_view=True,
                bucket_cap_mb=2000)

    state = powerSGD_hook.PowerSGDState(
        process_group=None
    )

    model.register_comm_hook(
        state,
        hook=powerSGD_hook.powerSGD_hook
    )
    return model

PowerSGD has a dramatic impact on the g6e.48xlarge instance, increasing throughput all the way back up to 7.5 steps per second, more than five times faster than the unoptimized DDP result! Note the reduction in NCCL kernel size in the resulting trace:

PowerSGD-Compressed DDP on g6e.48xlarge – Profiler Trace (by Author)

It is important to note that these compression algorithms are lossy and should be used with caution, as they may impact model convergence.
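If you do experiment with PowerSGD, its main tuning knobs are exposed on the PowerSGDState constructor; the values below are illustrative starting points rather than tuned settings:

state = powerSGD_hook.PowerSGDState(
    process_group=None,
    matrix_approximation_rank=2,  # higher rank = better fidelity, more compute
    start_powerSGD_iter=1000,     # run vanilla all-reduce for the first steps
    use_error_feedback=True,      # compensate for compression error over time
    warm_start=True,              # reuse low-rank factors across iterations
)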

Optimization 4: Parallelize Gradient Reduction

DDP groups parameters into multiple buckets and triggers gradient reduction of each bucket independently, as soon as the bucket is filled. This allows gradient reduction of filled buckets to run while gradient calculation (of other buckets) is still ongoing. The degree of overlap depends on the number of DDP buckets, which we control via the bucket_cap_mb setting of the DDP container. Recall that in our initial implementation, we explicitly set this to 2000 (2 GB), which (given the size of our model) translated to a single DDP bucket. The table below shows the throughput for different values of bucket_cap_mb. The optimal setting will vary based on the details of the model and runtime environment.

DDP Throughput vs. Bucket Capacity (by Author)

Note the significant 12% improvement when using multiple buckets on the g6e.48xlarge with BF16 compression. PowerSGD, on the other hand, works best when applied to a single bucket.
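For example, a configuration with a smaller bucket capacity looks as follows (the 100 MB value is illustrative; as the table shows, the optimal value should be determined empirically):

def configure_ddp(model, rank):
    dist.init_process_group("nccl")
    model = DDP(model,
                device_ids=[rank],
                static_graph=True,
                gradient_as_bucket_view=True,
                bucket_cap_mb=100)  # illustrative; tune per model and instance
    return model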

In the image below, captured on a p4d.24xlarge with a reduced bucket_cap_mb setting, we can see the impact of this optimization on the profiler trace:

DDP With Multiple Buckets on p4d.24xlarge – Profiler Trace (by Author)

Instead of the single NCCL all-reduce operation, we now have 11 smaller blocks running in parallel with the local gradient computation.

Results

We summarize the results of our experiments in the following table:

Experiment Results (by Author)

On the p4d.24xlarge, where the data is transferred over NVLink, the overall impact was a modest 4% speed-up. But on the g6e.48xlarge, the gains were significant and mostly due to gradient compression: an 86% boost for the BF16 scheme and an over 5X improvement for (the naive implementation of) PowerSGD.

Importantly, these results can vary considerably based on the model architecture and runtime environment.

Summary

This concludes our three-part series on identifying and solving data transfer bottlenecks with the NVIDIA Nsight™ Systems (nsys) profiler. In each of the posts, we demonstrated the tendency of data transfers to introduce performance bottlenecks and resource under-utilization. In each case, we used the nsys profiler to identify the bottlenecks and measure the impact of different optimization techniques.

The speed-ups we achieved in each of the use cases that we studied reinforce the importance of integrating the regular use of tools such as the nsys profiler into the AI/ML development workflow and highlight the opportunity, even for non-CUDA specialists, to achieve meaningful performance gains and AI/ML cost reductions.
