This post is part of a series on distributed AI across multiple GPUs.
Introduction
Before diving into advanced parallelism techniques, we need to understand the key technologies that enable GPUs to communicate with one another.
But why do GPUs need to communicate in the first place? When training AI models across multiple GPUs, each GPU processes different data batches, but they all have to stay synchronized by sharing gradients during backpropagation or exchanging model weights. The specifics of what gets communicated and when depend on your parallelism strategy, which we'll explore in depth in the upcoming blog posts. For now, just know that modern AI training is communication-intensive, making efficient GPU-to-GPU data transfer critical for performance.
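To make this concrete, here is a minimal sketch of the gradient synchronization step described above, written with PyTorch's `torch.distributed` API. The model is a placeholder, and the process group is assumed to be initialized by a launcher such as `torchrun`:

```python
# Sketch: averaging gradients across GPUs after backpropagation.
# Assumes torch.distributed has already been initialized (e.g. by torchrun).
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average every parameter's gradient across all participating GPUs."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across all GPUs...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so each GPU uses the same averaged gradient for its update.
            param.grad /= world_size
```

In practice, frameworks such as PyTorch DDP do this for you (and overlap it with computation), but the underlying communication pattern is the same.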
The Communication Stack
PCIe
PCIe (Peripheral Component Interconnect Express) connects expansion cards like GPUs to the motherboard using independent point-to-point serial lanes. Here’s what different PCIe generations offer for a GPU using 16 lanes:
- Gen4 x16: ~32 GB/s per direction
- Gen5 x16: ~64 GB/s per direction
- Gen6 x16: ~128 GB/s per direction (16 lanes × ~8 GB/s per lane = 128 GB/s)
High-end server CPUs typically offer 128 PCIe lanes, and modern GPUs need 16 lanes each for optimal bandwidth. That's why you often see 8 GPUs per server (128 = 16 x 8). Power consumption and physical space in the server chassis also make it impractical to exceed 8 GPUs in a single node.
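The arithmetic behind these numbers is simple; here is a rough back-of-the-envelope sketch (the per-lane rates are approximate effective throughputs, not exact spec values):

```python
# Rough PCIe math: approximate effective GB/s per lane, per direction.
PER_LANE_GBPS = {"Gen4": 2, "Gen5": 4, "Gen6": 8}
LANES_PER_GPU = 16
CPU_PCIE_LANES = 128  # typical for a high-end server CPU

for gen, per_lane in PER_LANE_GBPS.items():
    print(f"PCIe {gen} x{LANES_PER_GPU}: ~{per_lane * LANES_PER_GPU} GB/s")

# 128 lanes / 16 lanes per GPU -> the familiar 8-GPU server
print(f"GPUs per CPU at x16 each: {CPU_PCIE_LANES // LANES_PER_GPU}")
```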
NVLink
NVLink enables direct GPU-to-GPU communication within the same server (node), bypassing the CPU entirely. This NVIDIA-proprietary interconnect creates a direct memory-to-memory pathway between GPUs with enormous bandwidth:
- NVLink 3 (A100): ~600 GB/s per GPU
- NVLink 4 (H100): ~900 GB/s per GPU
- NVLink 5 (Blackwell): Up to 1.8 TB/s per GPU
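If you want to see what your own hardware delivers, here is a rough sketch of measuring GPU-to-GPU copy bandwidth with PyTorch. Whether the copy actually travels over NVLink or falls back to PCIe depends on your machine's topology, so treat the result as an observed number, not a spec benchmark:

```python
# Sketch: time a large device-to-device copy and report effective bandwidth.
import time
import torch

def p2p_bandwidth_gbps(src: int = 0, dst: int = 1, size_mb: int = 1024) -> float:
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = x.to(f"cuda:{dst}")            # device-to-device copy
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (size_mb / 1024) / elapsed  # GB transferred / seconds

if torch.cuda.device_count() >= 2:
    print(f"GPU0 -> GPU1: ~{p2p_bandwidth_gbps():.1f} GB/s")
```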
Note: NVLink for CPU-GPU communication
Certain CPU architectures support NVLink as a PCIe replacement, dramatically accelerating CPU-GPU communication by removing the PCIe bottleneck in data transfers, such as moving training batches from CPU to GPU. This CPU-GPU NVLink capability makes CPU offloading (a technique that saves VRAM by storing data in system RAM instead) practical for real-world AI applications. Since scaling RAM is often more cost-effective than scaling VRAM, this approach offers significant economic benefits.
CPUs with NVLink support include IBM POWER8, POWER9, and NVIDIA Grace.
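As a sketch of what CPU offloading looks like in practice (regardless of whether the CPU-GPU link is NVLink or PCIe), you can park tensors in pinned host RAM and move them to the GPU only when needed:

```python
# Sketch: keep a large buffer in pinned host RAM, stage it onto the GPU on demand.
import torch

# Park the data in host RAM instead of VRAM; pinned memory allows fast,
# asynchronous host-to-device copies.
offloaded = torch.randn(4096, 4096, pin_memory=True)

# Bring it onto the GPU right before use; non_blocking=True lets the copy
# overlap with other GPU work because the source is pinned.
on_gpu = offloaded.to("cuda", non_blocking=True)
result = on_gpu @ on_gpu  # placeholder computation

# Copy the result back to host RAM so the VRAM can be reused.
offloaded.copy_(result.cpu())
```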
However, there's a catch. In a server with 8x H100s, each GPU needs to communicate with 7 others, splitting that 900 GB/s into seven point-to-point connections of about 128 GB/s each. That's where NVSwitch comes in.
NVSwitch
NVSwitch acts as a central hub for GPU communication, dynamically routing (switching, if you will) data between GPUs as needed. With NVSwitch, every Hopper GPU can communicate with every other Hopper GPU at the full 900 GB/s concurrently, i.e. peak bandwidth doesn't depend on how many GPUs are communicating. That's what makes NVSwitch "non-blocking". Each GPU connects to several NVSwitch chips via multiple NVLink connections, ensuring maximum bandwidth.
While NVSwitch began as an intra-node solution, it has since been extended to interconnect multiple nodes, creating GPU clusters that support up to 256 GPUs with all-to-all communication at near-local NVLink speeds.
The generations of NVSwitch are:
- First generation: Supports up to 16 GPUs per server (compatible with Tesla V100)
- Second generation: Also supports up to 16 GPUs, with improved bandwidth and lower latency
- Third generation: Designed for H100 GPUs, supports up to 256 GPUs
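Before moving on to inter-node networking, here is a quick way to check whether the GPUs in your node can reach each other directly (peer-to-peer over NVLink/NVSwitch or PCIe) using PyTorch's CUDA utilities; for the full link topology, `nvidia-smi topo -m` shows which GPU pairs are connected via NVLink:

```python
# Sketch: check direct peer-to-peer reachability between all GPU pairs in this node.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'yes' if ok else 'no'}")
```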
InfiniBand
InfiniBand handles inter-node communication. While much slower (and cheaper) than NVSwitch, it's commonly used in datacenters to scale to hundreds of GPUs. Modern InfiniBand supports NVIDIA GPUDirect® RDMA (Remote Direct Memory Access), letting network adapters access GPU memory directly without CPU involvement (no expensive copying to host RAM).
Current InfiniBand speeds include:
- HDR: ~25 GB/s per port
- NDR: ~50 GB/s per port
- XDR: ~100 GB/s per port
These speeds are significantly slower than intra-node NVLink due to network protocol overhead and the need for two PCIe traversals (one on the sender and one on the receiver).
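In practice you rarely program InfiniBand (or NVLink) directly: communication libraries such as NCCL pick the fastest available transport automatically, i.e. NVLink/NVSwitch inside a node and InfiniBand (with GPUDirect RDMA when available) across nodes. Here is a minimal multi-node setup sketch with PyTorch, assuming the rank and rendezvous information come from a launcher such as `torchrun`:

```python
# Sketch: initialize a multi-node process group; NCCL chooses NVLink intra-node
# and InfiniBand inter-node on its own.
import os
import torch
import torch.distributed as dist

def init_distributed() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)            # one GPU per process
    dist.init_process_group(backend="nccl")      # reads RANK / WORLD_SIZE / MASTER_ADDR
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
```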
Key Design Principles
Understanding Linear Scaling
Linear scaling is the holy grail of distributed computing. In simple terms, it means doubling your GPUs should double your throughput and halve your training time. This happens when communication overhead is minimal compared to computation time, allowing each GPU to operate at full capacity. However, perfect linear scaling is rare in AI workloads because communication requirements grow with the number of devices, and it's often impossible to achieve perfect compute-communication overlap (explained next).
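A toy way to quantify this: if only the exposed (non-overlapped) communication time adds overhead, scaling efficiency is roughly the ratio of compute time to total step time. The numbers below are illustrative placeholders, not measurements:

```python
# Sketch: scaling efficiency = useful compute time / total step time.
def scaling_efficiency(compute_s: float, exposed_comm_s: float) -> float:
    """1.0 means perfect linear scaling; lower values mean time lost to communication."""
    return compute_s / (compute_s + exposed_comm_s)

print(scaling_efficiency(compute_s=1.0, exposed_comm_s=0.0))   # 1.00 -> ideal
print(scaling_efficiency(compute_s=1.0, exposed_comm_s=0.25))  # 0.80 -> 20% lost to comm
```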
The Importance of Compute-Communication Overlap
When a GPU sits idle waiting for data to be transferred before it can be processed, you're wasting resources. Communication operations should overlap with computation as much as possible. When that's impossible, we call that communication an "exposed operation".
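Here is a minimal sketch of the idea using a non-blocking collective in PyTorch (`async_op=True`): the all-reduce of one gradient bucket is launched and left in flight while the GPU keeps computing, which is the same pattern DDP implements with gradient bucketing. The tensors here are placeholders:

```python
# Sketch: overlap an all-reduce with independent computation.
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket: torch.Tensor,
                    activations: torch.Tensor,
                    weight: torch.Tensor) -> torch.Tensor:
    # Launch the communication without waiting for it to finish...
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)
    # ...and keep the GPU busy with work that doesn't depend on the result.
    out = activations @ weight
    # Block only at the point where the reduced gradients are actually needed.
    handle.wait()
    return out
```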
Intra-Node vs. Inter-Node: The Performance Cliff
Modern server-grade motherboards support up to 8 GPUs. Within this range, you can often achieve near-linear scaling thanks to high-bandwidth, low-latency intra-node communication.
Once you scale beyond 8 GPUs and start using multiple nodes connected via InfiniBand, you'll see a significant performance drop. Inter-node communication is much slower than intra-node NVLink, introducing network protocol overhead, higher latency, and bandwidth limitations. As you add more GPUs, each GPU must coordinate with more peers, spending more time idle waiting for data transfers to finish.
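A rough cost model shows why the cliff appears. A bandwidth-optimal ring all-reduce moves about 2(N-1)/N times the message size per GPU, and its time is set by the slowest link in the ring. The bandwidth figures below are rough assumptions (~450 GB/s usable NVLink bandwidth per direction, ~50 GB/s per InfiniBand NDR port), and the gradient size is illustrative:

```python
# Sketch: estimated ring all-reduce time as a function of the slowest link bandwidth.
def allreduce_seconds(size_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * size_gb / link_gb_per_s

grads_gb = 10  # e.g. fp16 gradients of a ~5B-parameter model (illustrative)
print(f"8 GPUs over NVLink:      {allreduce_seconds(grads_gb, 8, 450):.3f} s")
print(f"16 GPUs over InfiniBand: {allreduce_seconds(grads_gb, 16, 50):.3f} s")
```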
Conclusion
Follow me on X (@l_cesconetto) for more free AI content.
Congratulations on making it to the end! In this post you learned about:
- The fundamentals of CPU-GPU and GPU-GPU communication: PCIe, NVLink, NVSwitch, and InfiniBand
- Key design principles for distributed GPU computing

You're now able to make far more informed decisions when designing your AI workloads.
In the next blog post, we'll dive into our first parallelism technique: Distributed Data Parallelism.
