AI on Multiple GPUs: Understanding the Host and Device Paradigm

This post is part of a series about distributed AI across multiple GPUs:

  • Part 1: Understanding the Host and Device Paradigm (this text)
  • Part 2: Point-to-Point and Collective Operations
  • Part 3: How GPUs Communicate
  • Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP)
  • Part 5: ZeRO
  • Part 6: Tensor Parallelism 

Introduction

This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It is a high-level introduction designed to help you build a mental model of the host-device paradigm. We will focus specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.

The Big Picture: The Host and The Device

The first concept to understand is the relationship between the Host and the Device.

  • The Host: This is your CPU. It runs the operating system and executes your Python script line by line. The Host is the commander; it is in charge of the overall logic and tells the Device what to do.
  • The Device: This is your GPU. It is a powerful but specialized coprocessor designed for massively parallel computations. The Device is the accelerator; it does nothing until the Host gives it a task.

Your program always starts on the CPU. When you want the GPU to perform a task, like multiplying two large matrices, the CPU sends the instructions and the data over to the GPU.
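
A minimal sketch of this handoff in PyTorch (assuming a CUDA-capable GPU is available):

import torch

device = torch.device("cuda")   # the Device (GPU)

# Created on the Host (CPU), then the data is copied to the Device
a = torch.randn(4096, 4096).to(device)
b = torch.randn(4096, 4096).to(device)

# The Host enqueues the matrix multiplication; the Device executes it
c = a @ b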

The CPU-GPU Interaction

The Host talks to the Device through a queuing system.

  1. CPU Initiates Commands: Your script, running on the CPU, encounters a line of code intended for the GPU (e.g., tensor.to('cuda')).
  2. Commands are Queued: The CPU doesn't wait. It simply places this command onto a special to-do list for the GPU called a CUDA Stream (more on this in the next section).
  3. Asynchronous Execution: The CPU does not wait for the GPU to finish the operation; it moves on to the next line of your script. This is known as asynchronous execution, and it is key to achieving high performance. While the GPU is busy crunching numbers, the CPU can work on other tasks, like preparing the next batch of data (see the sketch after this list).
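
You can observe this asynchrony by timing a kernel launch from the CPU side. The following is a sketch, assuming a CUDA GPU; the exact numbers will vary:

import time
import torch

device = torch.device("cuda")
x = torch.randn(8192, 8192, device=device)

start = time.perf_counter()
y = x @ x                      # enqueued on the GPU; the CPU does not wait
launch_time = time.perf_counter() - start

torch.cuda.synchronize()       # now the CPU blocks until the GPU finishes
total_time = time.perf_counter() - start

print(f"launch: {launch_time:.6f}s, total: {total_time:.6f}s")
# launch is typically microseconds; total includes the actual matmul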

CUDA Streams

A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. However, operations on different streams can execute concurrently: the GPU can juggle multiple independent workloads at the same time.

By default, every PyTorch GPU operation is enqueued on the currently active stream (usually the default stream, which is created automatically). This is simple and predictable: every operation waits for the previous one to complete before starting. For most code, you never notice this. But it leaves performance on the table when you have work that could overlap.
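
You can inspect which stream PyTorch is currently using; a small sketch, assuming a CUDA device:

import torch

# By default, work goes onto the automatically created default stream
print(torch.cuda.current_stream())   # the currently active stream
print(torch.cuda.default_stream())   # the default stream for the device

s = torch.cuda.Stream()              # a new, independent stream
with torch.cuda.stream(s):
    print(torch.cuda.current_stream() == s)  # True inside the context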

Multiple Streams: Concurrency

The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can concurrently copy batch N+1 from CPU RAM to GPU VRAM:

Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (data):    ────[copy batch 1]────[copy batch 2]───

This pipelining is possible because compute and data transfers run on separate hardware units inside the GPU (compute cores and copy engines), enabling true parallelism. In PyTorch, you create streams and schedule work onto them with context managers:

import torch

# Assumes `model`, `current_batch` (already on the GPU), and
# `next_batch_cpu` (a pinned CPU tensor) are defined elsewhere.
compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()

with torch.cuda.stream(transfer_stream):
    # Enqueue the copy of the next batch on transfer_stream
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

with torch.cuda.stream(compute_stream):
    # This forward pass runs concurrently with the transfer above
    output = model(current_batch)

Note the non_blocking=True flag on .to(). Without it, the transfer blocks the CPU thread even when you intend it to run asynchronously. For the copy to be truly asynchronous, the source CPU tensor also needs to live in pinned (page-locked) memory.
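
A small sketch of preparing a pinned CPU tensor for asynchronous transfer (the tensor shape is illustrative):

import torch

# Pin the CPU tensor so the GPU's copy engine can read it directly
next_batch_cpu = torch.randn(256, 3, 224, 224).pin_memory()

# Now the copy can overlap with ongoing GPU compute
next_batch = next_batch_cpu.to('cuda', non_blocking=True)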

Synchronization Between Streams

Since streams are independent, you need to explicitly signal when one depends on another. The bluntest tool is:

torch.cuda.synchronize()  # waits for ALL streams on the device to complete

A more surgical approach uses CUDA Events. An event marks a specific point in a stream, and another stream can wait on it without halting the CPU thread:

import torch

# Assumes `model`, `next_batch_cpu`, `compute_stream`, and `transfer_stream`
# from the previous example.
event = torch.cuda.Event()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # mark this point in transfer_stream: the transfer is enqueued

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # don't start until the transfer completes
    output = model(next_batch)

This is more efficient than stream.synchronize() because it only stalls the dependent stream on the GPU side; the CPU thread stays free to keep queuing work.

For day-to-day PyTorch training code you won't need to manage streams manually. But features like DataLoader(pin_memory=True) and prefetching rely heavily on this mechanism under the hood. Understanding streams helps you recognize why those settings exist and gives you the tools to diagnose subtle performance bottlenecks when they appear.
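
For reference, a typical training-loop pattern that benefits from pinned memory and non-blocking copies might look like this sketch (the dataset and model are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 32), torch.randn(1000, 1))
loader = DataLoader(dataset, batch_size=64, pin_memory=True, num_workers=2)

model = torch.nn.Linear(32, 1).to('cuda')

for inputs, targets in loader:
    # pin_memory=True lets these copies overlap with ongoing GPU work
    inputs = inputs.to('cuda', non_blocking=True)
    targets = targets.to('cuda', non_blocking=True)
    output = model(inputs)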

PyTorch Tensors

PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.

When you create a PyTorch tensor, it has two parts: metadata (like its shape and data type) and the actual numerical data. So when you run something like t = torch.randn(100, 100, device=device), the tensor's metadata is stored in the host's RAM, while its data is stored in the GPU's VRAM.

This distinction matters. When you run print(t.shape), the CPU can access this information immediately because the metadata is already in its own RAM. But what happens if you run print(t), which requires the actual data living in VRAM?

Host-Device Synchronization

Accessing GPU data from the CPU can trigger a Host-Device Synchronization, a common performance bottleneck. This happens whenever the CPU needs a result from the GPU that isn't yet available in the CPU's RAM.

For example, consider the line print(gpu_tensor), which prints a tensor that is still being computed by the GPU. The CPU cannot print the tensor's values until the GPU has finished all of the calculations that produce the result. When the script reaches this line, the CPU is forced to block, i.e. it stops and waits for the GPU to finish. Only after the GPU completes its work and copies the data from its VRAM to the CPU's RAM can the CPU continue.
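
A short sketch of operations that do and don't synchronize (assuming a CUDA GPU):

import torch

x = torch.randn(4096, 4096, device='cuda')
y = x @ x              # enqueued; returns immediately

print(y.shape)         # metadata only: no synchronization needed
print(y)               # forces a sync: waits for the matmul, copies data to RAM
loss = y.sum().item()  # .item() also syncs: the Python float must exist on the CPU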

As another example, what is the difference between torch.randn(100, 100).to(device) and torch.randn(100, 100, device=device)? The first is less efficient because it creates the data on the CPU and then transfers it to the GPU. The second is more efficient because it creates the tensor directly on the GPU; the CPU only sends the creation command.
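
Side by side, as a small sketch:

import torch

device = torch.device('cuda')

# Less efficient: allocate and fill on the CPU, then copy over PCIe to VRAM
t_slow = torch.randn(100, 100).to(device)

# More efficient: the CPU only enqueues a creation command; the GPU
# allocates and fills the tensor directly in VRAM
t_fast = torch.randn(100, 100, device=device)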

These synchronization points can severely impact performance. Effective GPU programming involves minimizing them to make sure both the Host and the Device stay as busy as possible. After all, you want your GPUs running at full tilt.

Image by author: generated with ChatGPT

Scaling Up: Distributed Computing and Ranks

Training large models, such as Large Language Models (LLMs), often requires more compute power than a single GPU can offer. Coordinating work across multiple GPUs brings you into the world of distributed computing.

In this context, a new and important concept emerges: the Rank.

  • Each rank is a CPU process that gets assigned a single device (GPU) and a unique ID. If you launch a training script across two GPUs, you create two processes: one with rank=0 and another with rank=1.

This means you are launching two separate instances of your Python script. On a single machine with multiple GPUs (a single node), these processes run on the same CPU but remain independent, without sharing memory or state. Rank 0 commands its assigned GPU (cuda:0), while Rank 1 commands the other GPU (cuda:1). Although both ranks run the same code, you can use a variable that holds the rank ID to assign different tasks to each GPU, such as having each one process a different portion of the data (we'll see examples of this in the next post of this series).
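
As a preview, here is a minimal sketch of how each process discovers its rank and picks its GPU when launched with torchrun (the file name train.py is illustrative):

# train.py -- launch with: torchrun --nproc_per_node=2 train.py
import os
import torch
import torch.distributed as dist

# torchrun sets these environment variables for every process it spawns
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

dist.init_process_group(backend="nccl")   # join the process group
torch.cuda.set_device(local_rank)         # rank 0 -> cuda:0, rank 1 -> cuda:1

print(f"Rank {rank} is using GPU cuda:{local_rank}")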

Conclusion

Congratulations on reading all the way to the end! In this post, you learned about:

  • The Host/Device relationship
  • Asynchronous execution
  • CUDA Streams and the way they allow concurrent GPU work
  • Host-Device synchronization

In the next blog post, we will dive deeper into Point-to-Point and Collective Operations, which enable multiple GPUs to coordinate complex workflows such as distributed neural network training.
