You must be familiar with this message 🤬:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.93 GiB total capacity; 6.00 GiB already allocated; 14.88 MiB free; 6.00 GiB reserved in total by PyTorch)
While it's easy to see that the GPU memory is full, understanding why and how to fix it can be more challenging. In this tutorial, we'll go step by step through how to visualize and understand GPU memory usage in PyTorch during training. We'll also see how to estimate memory requirements and optimize GPU memory usage.
🔎 The PyTorch visualizer
PyTorch provides a handy tool for visualizing GPU memory usage:
import torch
from torch import nn

# Start recording memory snapshot history
torch.cuda.memory._record_memory_history(max_entries=100000)

model = nn.Linear(10_000, 50_000, device="cuda")
for _ in range(3):
    inputs = torch.randn(5_000, 10_000, device="cuda")
    outputs = model(inputs)

# Dump memory snapshot history to a file and stop recording
torch.cuda.memory._dump_snapshot("profile.pkl")
torch.cuda.memory._record_memory_history(enabled=None)
Running this code generates a profile.pkl file that contains a history of GPU memory usage during execution. You can visualize this history at: https://pytorch.org/memory_viz.
By dragging and dropping your profile.pkl file, you will see a graph like this:

Let’s break down this graph into key parts:

1. Model Creation: Memory increases by 2 GB, corresponding to the model's size in float32: 50,000 × 10,000 weights × 4 bytes = 2 GB. This memory (in blue) persists throughout execution.
2. Input Tensor Creation (1st Loop): Memory increases by 200 MB, matching the input tensor size: 5,000 × 10,000 × 4 bytes = 200 MB.
3. Forward Pass (1st Loop): Memory increases by 1 GB for the output tensor: 5,000 × 50,000 × 4 bytes = 1 GB.
4. Input Tensor Creation (2nd Loop): Memory increases by 200 MB for a new input tensor. At this point, you might expect the input tensor from step 2 to be freed. It isn't: the model retains its activations, so even though the tensor is no longer assigned to the variable inputs, it is still referenced by the model's forward-pass computation. The model keeps these activations because they are required for backpropagation. Try with torch.no_grad() to see the difference (see the sketch after this list).
5. Forward Pass (2nd Loop): Memory increases by 1 GB for the new output tensor, calculated as in step 3.
6. Release of the 1st Loop's Activation: After the second loop's forward pass, the input tensor from the first loop (step 2) can be freed. The model's activations, which held the first input tensor, are overwritten by those of the second loop. Once the second forward pass completes, the first tensor is no longer referenced and its memory can be released.
7. Update of outputs: The output tensor from step 3 is reassigned to the variable outputs. The previous tensor is no longer referenced and is deleted, freeing its memory.
8. Input Tensor Creation (3rd Loop): Same as step 4.
9. Forward Pass (3rd Loop): Same as step 5.
10. Release of the 2nd Loop's Activation: The input tensor from step 4 is freed.
11. Update of outputs Again: The output tensor from step 5 is reassigned to the variable outputs, freeing the previous tensor.
12. End of Code Execution: All memory is released.
📊 Visualizing Memory During Training
The previous example was simplified. In real scenarios, we often train complex models rather than a single linear layer. Moreover, the earlier example didn't include a training process. Here, we'll examine how GPU memory behaves during a complete training loop for a real large language model (LLM).
import torch
from transformers import AutoModelForCausalLM

# Start recording memory snapshot history
torch.cuda.memory._record_memory_history(max_entries=100000)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for _ in range(3):
    inputs = torch.randint(0, 100, (16, 256), device="cuda")  # Dummy input
    loss = torch.mean(model(inputs).logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Dump memory snapshot history to a file and stop recording
torch.cuda.memory._dump_snapshot("profile.pkl")
torch.cuda.memory._record_memory_history(enabled=None)
💡 Tip: When profiling, limit the number of steps. Every GPU memory event is recorded, and the file can become very large. For example, the code above generates an 8 MB file.
Here's the memory profile for this example:

This graph is more complex than the previous example, but we can still break it down step by step. Notice the three spikes, each corresponding to an iteration of the training loop. Let's simplify the graph to make it easier to interpret:

1. Model Initialization (model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")): The first step is loading the model onto the GPU. The model parameters (in blue) occupy memory and remain there until training ends.
2. Forward Pass (model(inputs)): During the forward pass, the activations (the intermediate outputs of each layer) are computed and stored in memory for backpropagation. These activations, represented in orange, grow layer by layer until the final layer. The loss is calculated at the peak of the orange zone.
3. Backward Pass (loss.backward()): The gradients (in yellow) are computed and stored during this phase. At the same time, the activations are discarded because they are no longer needed, so the orange zone shrinks. The yellow zone represents the memory used for gradient computation.
4. Optimizer Step (optimizer.step()): The gradients are used to update the model's parameters. The first time, the optimizer itself is initialized (green zone); this initialization is done only once. After that, the optimizer uses the gradients to update the model's parameters. To perform the update, it temporarily stores intermediate values (red zone). After the update, both the gradients (yellow) and the intermediate optimizer values (red) are discarded, freeing memory.
At this point, one training iteration is complete. The process repeats for the remaining iterations, producing the three memory spikes visible in the graph.
Training profiles like this typically follow a consistent pattern, which makes them useful for estimating GPU memory requirements for a given model and training loop.
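As a complement to the visual profile, you can cross-check the peak against PyTorch's own allocator statistics. The following is a small sketch of mine (not part of the original example) that reports the peak allocated memory for a single training iteration of the same setup:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Reset the allocator's peak counter, then run one training iteration
torch.cuda.reset_peak_memory_stats()
inputs = torch.randint(0, 100, (16, 256), device="cuda")  # Dummy input
loss = torch.mean(model(inputs).logits)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"Peak allocated memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")

The reported peak should roughly match the highest point of the corresponding spike in the memory profile.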
📐 Estimating Memory Requirements
From the above section, estimating GPU memory requirements seems straightforward. The total memory needed should correspond to the highest peak in the memory profile, which occurs during the forward pass. In that case, the memory requirement is (blue + green + orange):

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \text{Activations}$$
Is it that simple? Actually, there's a trap. The profile can look different depending on the training setup. For example, reducing the batch size from 16 to 2 changes the picture:
- inputs = torch.randint(0, 100, (16, 256), device="cuda") # Dummy input
+ inputs = torch.randint(0, 100, (2, 256), device="cuda") # Dummy input

Now, the highest peaks occur during the optimizer step rather than the forward pass. In this case, the memory requirement becomes (blue + green + yellow + red):

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \text{Gradients} + \text{Optimizer Intermediates}$$

To generalize the memory estimation, we need to account for all possible peaks, regardless of whether they occur during the forward pass or the optimizer step:

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \max(\text{Gradients} + \text{Optimizer Intermediates},\ \text{Activations})$$

Now that we have the equation, let's look at how to estimate each component.
Model parameters
The model parameters are the easiest to estimate:

$$\text{Model Memory} = N \times P$$

Where:
- $N$ is the number of parameters.
- $P$ is the precision (in bytes, e.g., 4 for float32).

For example, a model with 1.5 billion parameters and a precision of 4 bytes requires:

$$\text{Model Memory} = 1.5 \times 10^{9} \times 4 \ \text{bytes} = 6 \ \text{GB}$$

This matches the example above: Qwen2.5-1.5B in float32 occupies roughly 6 GB of GPU memory (the blue zone in the profile).
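If you prefer to measure rather than estimate, a quick sketch (mine, not from the original post) is to sum the sizes of the parameter tensors directly:

from transformers import AutoModelForCausalLM

# Load the same model as above (float32 by default) and sum its parameter sizes
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
num_params = sum(p.numel() for p in model.parameters())
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{num_params:,} parameters, {param_bytes / 1024**3:.2f} GiB")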
Optimizer State
The memory required for the optimizer state depends on the optimizer type and the model parameters. For example, the AdamW optimizer stores two moments (first and second) per parameter, which makes the optimizer state size:

$$\text{Optimizer State} = 2 \times N \times P$$
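You can verify this empirically once the optimizer has taken a step, since AdamW allocates its state lazily. A minimal sketch of mine, reusing the setup from the training example:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# AdamW allocates its state on the first step, so run one tiny iteration first
inputs = torch.randint(0, 100, (2, 16), device="cuda")
torch.mean(model(inputs).logits).backward()
optimizer.step()

# Sum the sizes of all tensors stored in the optimizer state (exp_avg, exp_avg_sq, ...)
state_bytes = sum(
    t.numel() * t.element_size()
    for param_state in optimizer.state.values()
    for t in param_state.values()
    if torch.is_tensor(t)
)
print(f"Optimizer state: {state_bytes / 1024**3:.2f} GiB")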
Activations
The memory required for activations is harder to estimate because it includes all the intermediate values computed during the forward pass. To measure activation memory, we can use a forward hook that records the size of each module's output:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")

activation_sizes = []

def forward_hook(module, input, output):
    """
    Hook to calculate activation size for each module.
    """
    if isinstance(output, torch.Tensor):
        activation_sizes.append(output.numel() * output.element_size())
    elif isinstance(output, (tuple, list)):
        for tensor in output:
            if isinstance(tensor, torch.Tensor):
                activation_sizes.append(tensor.numel() * tensor.element_size())

# Register a forward hook on every submodule
hooks = []
for submodule in model.modules():
    hooks.append(submodule.register_forward_hook(forward_hook))

# Run a forward pass with a dummy single-token input
dummy_input = torch.zeros((1, 1), dtype=torch.int64, device="cuda")
model.eval()
with torch.no_grad():
    model(dummy_input)

# Remove the hooks
for hook in hooks:
    hook.remove()

print(sum(activation_sizes))
For the Qwen2.5-1.5B model, this gives 5,065,216 bytes of activations per input token (the hook already multiplies element counts by their element size). To estimate the total activation memory for an input tensor, use:

$$\text{Activation Memory} = A \times B \times T$$

Where:
- $A$ is the activation memory per token, as measured above.
- $B$ is the batch size.
- $T$ is the sequence length.
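As a worked example under this formula, the training setup above (batch size 16, sequence length 256) gives roughly 5,065,216 × 16 × 256 ≈ 20.7 GB of activations.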
However, using this method directly isn't always practical. Ideally, we'd like a heuristic for estimating activation memory without having to run the model. Moreover, we can intuitively see that larger models have more activations. This leads to the question: is there a connection between the number of model parameters and the number of activations?
Not directly, because the number of activations per token depends on the model architecture. However, LLMs tend to have similar structures. By analyzing different models, we observe a roughly linear relationship between the number of parameters and the number of activations:

This linear relationship lets us estimate the activations with a simple heuristic of the form $A \approx k \cdot N$, where $k$ is the slope of the fitted line shown above. Although this is an approximation, it provides a practical way to estimate activation memory without performing detailed calculations for every model.
Gradients
Gradients are easier to estimate. The memory required for the gradients is the same as for the model parameters:

$$\text{Gradient Memory} = N \times P$$
Optimizer Intermediates
When updating the model parameters, the optimizer stores intermediate values. The memory required for these values is the same as for the model parameters:

$$\text{Optimizer Intermediates} = N \times P$$
Total Memory
To summarize, the total memory required to train a model is:

$$\text{Total Memory} = \text{Model Memory} + \text{Optimizer State} + \max(\text{Gradients} + \text{Optimizer Intermediates},\ \text{Activations})$$

with the following components:
- Model Memory: $N \times P$
- Optimizer State: $2 \times N \times P$
- Gradients: $N \times P$
- Optimizer Intermediates: $N \times P$
- Activations: $A \times B \times T$, with $A$ estimated from $N$ using the heuristic above
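If you want a quick back-of-the-envelope version of this estimate in plain Python, here is a minimal sketch (mine; the function name and arguments are illustrative):

def estimate_training_memory_gib(n_params, precision_bytes, act_bytes_per_token, batch_size, seq_len):
    """Estimate training memory (in GiB) from the components listed above."""
    model_mem = n_params * precision_bytes
    optimizer_state = 2 * n_params * precision_bytes
    gradients = n_params * precision_bytes
    optimizer_intermediates = n_params * precision_bytes
    activations = act_bytes_per_token * batch_size * seq_len
    total = model_mem + optimizer_state + max(gradients + optimizer_intermediates, activations)
    return total / 1024**3

# Example: Qwen2.5-1.5B in float32, batch size 16, sequence length 256,
# with the 5,065,216 bytes of activations per token measured earlier
print(f"{estimate_training_memory_gib(1.5e9, 4, 5_065_216, 16, 256):.1f} GiB")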
To make this calculation easier, I created a small tool for you:
🚀 Next steps
Your initial motivation to understand memory usage was probably driven by the fact that, at some point, you ran out of it. Did this blog give you a direct solution to fix that? Probably not. However, now that you have a better understanding of how memory usage works and how to profile it, you're better equipped to find ways to reduce it.
For a specific list of tips on optimizing memory usage in TRL, you can check the Reducing Memory Usage section of the documentation. These tips, though, are not limited to TRL and can be applied to any PyTorch-based training process.
🤝 Acknowledgements
Thanks to Kashif Rasul for his valuable feedback and suggestions on this blog post.
