You must be familiar with this message 🤬:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.93 GiB total capacity; 6.00 GiB already allocated; 14.88 MiB free; 6.00 GiB reserved in total by PyTorch)
While it's easy to see that the GPU memory is full, understanding why and how to fix it can be more challenging. In this tutorial, we'll go step by step through how to visualize and understand GPU memory usage in PyTorch during training. We'll also see how to estimate memory requirements and optimize GPU memory usage.
🔎 The PyTorch visualizer
PyTorch provides a handy tool for visualizing GPU memory usage:
import torch
from torch import nn

# Start recording memory snapshot history
torch.cuda.memory._record_memory_history(max_entries=100000)

model = nn.Linear(10_000, 50_000, device="cuda")
for _ in range(3):
    inputs = torch.randn(5_000, 10_000, device="cuda")
    outputs = model(inputs)

# Dump memory snapshot history to a file and stop recording
torch.cuda.memory._dump_snapshot("profile.pkl")
torch.cuda.memory._record_memory_history(enabled=None)
Running this code generates a profile.pkl file that contains a history of GPU memory usage during execution. You can visualize this history at: https://pytorch.org/memory_viz.
By dragging and dropping your profile.pkl file, you will see a graph like this:

Let’s break down this graph into key parts:

1. Model Creation: Memory increases by 2 GB, corresponding to the model's size in float32: 50,000 × 10,000 weights × 4 bytes = 2 GB. This memory (in blue) persists throughout execution.
2. Input Tensor Creation (1st Loop): Memory increases by 200 MB, matching the input tensor size: 5,000 × 10,000 × 4 bytes = 200 MB.
3. Forward Pass (1st Loop): Memory increases by 1 GB for the output tensor: 5,000 × 50,000 × 4 bytes = 1 GB.
4. Input Tensor Creation (2nd Loop): Memory increases by 200 MB for a new input tensor. At this point, you might expect the input tensor from step 2 to be freed. It isn't: the model retains its activations, so even though the tensor is no longer assigned to the variable inputs, it is still referenced by the model's forward-pass computation. The model keeps these activations because they are required for backpropagation. Try with torch.no_grad() to see the difference (see the sketch after this list).
5. Forward Pass (2nd Loop): Memory increases by 1 GB for the new output tensor, calculated as in step 3.
6. Release of the 1st Loop's Activation: After the second loop's forward pass, the input tensor from the first loop (step 2) can be freed. The model's activations, which held the first input tensor, are overwritten by those of the second loop. Once the second forward pass completes, the first tensor is no longer referenced and its memory can be released.
7. Update of outputs: The output tensor from step 3 is reassigned to the variable outputs. The previous tensor is no longer referenced and is deleted, freeing its memory.
8. Input Tensor Creation (3rd Loop): Same as step 4.
9. Forward Pass (3rd Loop): Same as step 5.
10. Release of the 2nd Loop's Activation: The input tensor from step 4 is freed.
11. Update of outputs Again: The output tensor from step 5 is reassigned to the variable outputs, freeing the previous tensor.
12. End of Code Execution: All memory is released.
📊 Visualizing Memory During Training
The previous example was simplified. In real scenarios, we often train complex models rather than a single linear layer. Moreover, the earlier example didn't include a training process. Here, we'll examine how GPU memory behaves during a complete training loop for a real large language model (LLM).
import torch
from transformers import AutoModelForCausalLM

# Start recording memory snapshot history
torch.cuda.memory._record_memory_history(max_entries=100000)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for _ in range(3):
    inputs = torch.randint(0, 100, (16, 256), device="cuda")  # Dummy input
    loss = torch.mean(model(inputs).logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Dump memory snapshot history to a file and stop recording
torch.cuda.memory._dump_snapshot("profile.pkl")
torch.cuda.memory._record_memory_history(enabled=None)
💡 Tip: When profiling, limit the number of steps. Every GPU memory event is recorded, and the file can become very large. For example, the code above generates an 8 MB file.
Here's the memory profile for this example:

This graph is more complex than the previous example, but we can still break it down step by step. Notice the three spikes, each corresponding to an iteration of the training loop. Let's simplify the graph to make it easier to interpret:

1. Model Initialization (model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")): The first step is loading the model onto the GPU. The model parameters (in blue) occupy memory and remain there until training ends.
2. Forward Pass (model(inputs)): During the forward pass, the activations (the intermediate outputs of each layer) are computed and stored in memory for backpropagation. These activations, represented in orange, grow layer by layer until the final layer. The loss is calculated at the peak of the orange zone.
3. Backward Pass (loss.backward()): The gradients (in yellow) are computed and stored during this phase. At the same time, the activations are discarded because they are no longer needed, so the orange zone shrinks. The yellow zone represents the memory used for gradient computation.
4. Optimizer Step (optimizer.step()): The gradients are used to update the model's parameters. The first time, the optimizer itself is initialized (green zone); this initialization is done only once. After that, the optimizer uses the gradients to update the model's parameters. To perform the update, it temporarily stores intermediate values (red zone). After the update, both the gradients (yellow) and the intermediate optimizer values (red) are discarded, freeing memory.
At this point, one training iteration is complete. The process repeats for the remaining iterations, producing the three memory spikes visible in the graph.
Training profiles like this typically follow a consistent pattern, which makes them useful for estimating GPU memory requirements for a given model and training loop.
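As a complement to the visual profile, you can cross-check the peak against PyTorch's own allocator statistics. The following is a small sketch of mine (not part of the original example) that reports the peak allocated memory for a single training iteration of the same setup:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Reset the allocator's peak counter, then run one training iteration
torch.cuda.reset_peak_memory_stats()
inputs = torch.randint(0, 100, (16, 256), device="cuda")  # Dummy input
loss = torch.mean(model(inputs).logits)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"Peak allocated memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")

The reported peak should roughly match the highest point of the corresponding spike in the memory profile.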
📐 Estimating Memory Requirements
From the above section, estimating GPU memory requirements seems straightforward. The total memory needed should correspond to the highest peak in the memory profile, which occurs during the forward pass. In that case, the memory requirement is (blue + green + orange):

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \text{Activations}$$
Is it that simple? Actually, there's a trap. The profile can look different depending on the training setup. For example, reducing the batch size from 16 to 2 changes the picture:
- inputs = torch.randint(0, 100, (16, 256), device="cuda") # Dummy input
+ inputs = torch.randint(0, 100, (2, 256), device="cuda") # Dummy input

Now, the highest peaks occur during the optimizer step rather than the forward pass. In this case, the memory requirement becomes (blue + green + yellow + red):

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \text{Gradients} + \text{Optimizer Intermediates}$$

To generalize the memory estimation, we need to account for all possible peaks, regardless of whether they occur during the forward pass or the optimizer step:

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \max(\text{Gradients} + \text{Optimizer Intermediates},\ \text{Activations})$$

Now that we have the equation, let's look at how to estimate each component.
Model parameters
The model parameters are the easiest to estimate:

$$\text{Model Memory} = N \times P$$

Where:
- $N$ is the number of parameters.
- $P$ is the precision (in bytes, e.g., 4 for float32).

For example, a model with 1.5 billion parameters and a precision of 4 bytes requires:

$$\text{Model Memory} = 1.5 \times 10^{9} \times 4 \ \text{bytes} = 6 \ \text{GB}$$

This matches the example above: Qwen2.5-1.5B in float32 occupies roughly 6 GB of GPU memory (the blue zone in the profile).
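If you prefer to measure rather than estimate, a quick sketch (mine, not from the original post) is to sum the sizes of the parameter tensors directly:

from transformers import AutoModelForCausalLM

# Load the same model as above (float32 by default) and sum its parameter sizes
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
num_params = sum(p.numel() for p in model.parameters())
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{num_params:,} parameters, {param_bytes / 1024**3:.2f} GiB")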
Optimizer State
The memory required for the optimizer state depends on the optimizer type and the model parameters. For example, the AdamW optimizer stores two moments (first and second) per parameter, which makes the optimizer state size:

$$\text{Optimizer State} = 2 \times N \times P$$
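You can verify this empirically once the optimizer has taken a step, since AdamW allocates its state lazily. A minimal sketch of mine, reusing the setup from the training example:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# AdamW allocates its state on the first step, so run one tiny iteration first
inputs = torch.randint(0, 100, (2, 16), device="cuda")
torch.mean(model(inputs).logits).backward()
optimizer.step()

# Sum the sizes of all tensors stored in the optimizer state (exp_avg, exp_avg_sq, ...)
state_bytes = sum(
    t.numel() * t.element_size()
    for param_state in optimizer.state.values()
    for t in param_state.values()
    if torch.is_tensor(t)
)
print(f"Optimizer state: {state_bytes / 1024**3:.2f} GiB")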
Activations
The memory required for activations is harder to estimate because it includes all the intermediate values computed during the forward pass. To measure activation memory, we can use a forward hook that records the size of each module's output:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")

activation_sizes = []

def forward_hook(module, input, output):
    """
    Hook to calculate activation size for each module.
    """
    if isinstance(output, torch.Tensor):
        activation_sizes.append(output.numel() * output.element_size())
    elif isinstance(output, (tuple, list)):
        for tensor in output:
            if isinstance(tensor, torch.Tensor):
                activation_sizes.append(tensor.numel() * tensor.element_size())

# Register a forward hook on every submodule
hooks = []
for submodule in model.modules():
    hooks.append(submodule.register_forward_hook(forward_hook))

# Run a forward pass with a dummy single-token input
dummy_input = torch.zeros((1, 1), dtype=torch.int64, device="cuda")
model.eval()
with torch.no_grad():
    model(dummy_input)

# Remove the hooks
for hook in hooks:
    hook.remove()

print(sum(activation_sizes))
For the Qwen2.5-1.5B model, this gives 5,065,216 bytes of activations per input token (the hook already multiplies element counts by their element size). To estimate the total activation memory for an input tensor, use:

$$\text{Activation Memory} = A \times B \times T$$

Where:
- $A$ is the activation memory per token, as measured above.
- $B$ is the batch size.
- $T$ is the sequence length.
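As a worked example under this formula, the training setup above (batch size 16, sequence length 256) gives roughly 5,065,216 × 16 × 256 ≈ 20.7 GB of activations.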
However, using this method directly isn't always practical. Ideally, we'd like a heuristic for estimating activation memory without having to run the model. Moreover, we can intuitively see that larger models have more activations. This leads to the question: is there a connection between the number of model parameters and the number of activations?
Not directly, because the number of activations per token depends on the model architecture. However, LLMs tend to have similar structures. By analyzing different models, we observe a roughly linear relationship between the number of parameters and the number of activations:

This linear relationship lets us estimate the activations with a simple heuristic of the form $A \approx k \cdot N$, where $k$ is the slope of the fitted line shown above. Although this is an approximation, it provides a practical way to estimate activation memory without performing detailed calculations for every model.
Gradients
Gradients are easier to estimate. The memory required for the gradients is the same as for the model parameters:

$$\text{Gradient Memory} = N \times P$$
Optimizer Intermediates
When updating the model parameters, the optimizer stores intermediate values. The memory required for these values is the same as for the model parameters:

$$\text{Optimizer Intermediates} = N \times P$$
Total Memory
To summarize, the total memory required to train a model is:

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \max(\text{Gradients} + \text{Optimizer Intermediates},\ \text{Activations})$$

with the following components:
- Model Memory: $N \times P$
- Optimizer State: $2 \times N \times P$
- Gradients: $N \times P$
- Optimizer Intermediates: $N \times P$
- Activations: $A \times B \times T$, with $A$ estimated from $N$ using the heuristic above
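If you want a quick back-of-the-envelope version of this estimate in plain Python, here is a minimal sketch (mine; the function name and arguments are illustrative):

def estimate_training_memory_gib(n_params, precision_bytes, act_bytes_per_token, batch_size, seq_len):
    """Estimate training memory (in GiB) from the components listed above."""
    model_mem = n_params * precision_bytes
    optimizer_state = 2 * n_params * precision_bytes
    gradients = n_params * precision_bytes
    optimizer_intermediates = n_params * precision_bytes
    activations = act_bytes_per_token * batch_size * seq_len
    total = model_mem + optimizer_state + max(gradients + optimizer_intermediates, activations)
    return total / 1024**3

# Example: Qwen2.5-1.5B in float32, batch size 16, sequence length 256,
# with the 5,065,216 bytes of activations per token measured earlier
print(f"{estimate_training_memory_gib(1.5e9, 4, 5_065_216, 16, 256):.1f} GiB")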
To make this calculation easier, I created a small tool for you:
🚀 Next steps
Your initial motivation to understand memory usage was probably driven by the fact that, at some point, you ran out of it. Did this blog give you a direct solution to fix that? Probably not. However, now that you have a better understanding of how memory usage works and how to profile it, you're better equipped to find ways to reduce it.
For a specific list of tips on optimizing memory usage in TRL, you can check the Reducing Memory Usage section of the documentation. These tips, though, are not limited to TRL and can be applied to any PyTorch-based training process.
🤝 Acknowledgements
Thanks to Kashif Rasul for his valuable feedback and suggestions on this blog post.
