How 🤗 Accelerate runs very large models thanks to PyTorch

By Sylvain Gugger

Meta AI and BigScience recently open-sourced very large language models which won’t fit into the memory (RAM or GPU) of most consumer hardware. At Hugging Face, part of our mission is to make even those large models accessible, so we developed tools to allow you to run those models even if you don’t own a supercomputer. All the examples picked in this blog post run on a free Colab instance (with limited RAM and disk space); if you have access to more disk space, don’t hesitate to pick larger checkpoints.

Here is how we can run OPT-6.7B:

import torch
from transformers import pipeline

checkpoint = "facebook/opt-6.7b"
generator = pipeline("text-generation", model=checkpoint, device_map="auto", torch_dtype=torch.float16)

generator("Increasingly large language models are opensourced so Hugging Face has")

We’ll explain what each of those arguments does in a moment, but first consider the traditional model loading pipeline in PyTorch (sketched in code right after this list). It usually consists of:

  1. Create the model
  2. Load its weights in memory (in an object usually called state_dict)
  3. Load those weights inside the created model
  4. Move the model onto the device for inference
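To make those four steps concrete, here is a minimal sketch using a small stand-in model (the architecture and file name are illustrative only, not from the original example):

import torch
from torch import nn

# 1. Create the model (a tiny stand-in; imagine a multi-billion-parameter network instead)
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
torch.save(model.state_dict(), "checkpoint.pth")  # pretend this checkpoint was saved earlier

# 2. Load its weights in memory (the state_dict)
state_dict = torch.load("checkpoint.pth")
# 3. Load those weights inside the created model
model.load_state_dict(state_dict)
# 4. Move the model onto the device for inference
model.to("cuda" if torch.cuda.is_available() else "cpu")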

While this has worked pretty well in past years, very large models make this approach challenging. Here, the model picked has 6.7 billion parameters. In the default precision, just step 1 (creating the model) takes roughly 26.8GB of RAM (1 parameter in float32 takes 4 bytes in memory). That can’t even fit in the RAM you get on Colab.

Then step 2 will load a second copy of the model in memory (so another 26.8GB of RAM in default precision). If you were trying to load the biggest models, for instance BLOOM or OPT-176B (which both have 176 billion parameters), like this, you would need 1.4 terabytes of CPU RAM. That is a bit excessive! And all of this just to move the model onto one (or several) GPU(s) at step 4.

Clearly we need something smarter. In this blog post, we’ll explain how Accelerate leverages PyTorch features to load and run inference with very large models, even if they don’t fit in RAM or on one GPU. In a nutshell, it changes the process above like this (a condensed code sketch follows the list):

  1. Create an empty (i.e., without weights) model
  2. Decide where each layer is going to go (when several devices are available)
  3. Load part of its weights in memory
  4. Load those weights inside the empty model
  5. Move the weights onto the device for inference
  6. Repeat from step 3 for the next weights until all the weights are loaded
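Put end to end, and under the assumption that the sharded checkpoint has already been downloaded or cloned into a local folder (here called "opt-6.7b", an arbitrary name for this sketch), the new process could look like this with Accelerate’s helpers:

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Step 1: create an empty (weightless) model on the meta device
config = AutoConfig.from_pretrained("facebook/opt-6.7b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Steps 2-6: compute a device map, then load the checkpoint shard by shard,
# placing each weight on its assigned device as it is read
model = load_checkpoint_and_dispatch(
    model,
    "opt-6.7b",  # local folder containing the sharded checkpoint (assumption)
    device_map="auto",
    no_split_module_classes=["OPTDecoderLayer"],
    dtype=torch.float16,
)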



Creating an empty model

PyTorch 1.9 introduced a new kind of device called the meta device. This allows us to create tensors without any data attached to them: a tensor on the meta device only needs a shape. As long as you are on the meta device, you can thus create arbitrarily large tensors without having to worry about CPU (or GPU) RAM.

For instance, the following code will crash on Colab:

import torch

large_tensor = torch.randn(100000, 100000)

as this large tensor requires 4 * 10**10 bytes (the default precision is FP32, so each element of the tensor takes 4 bytes), thus 40GB of RAM. The same on the meta device works just fine however:

import torch

large_tensor = torch.randn(100000, 100000, device="meta")

If you try to display this tensor, here is what PyTorch will print:

tensor(..., device='meta', size=(100000, 100000))

As we said before, there is no data associated with this tensor, only a shape.

You can instantiate a model directly on the meta device:

large_model = torch.nn.Linear(100000, 100000, device="meta")

But for an existing model, this syntax would require you to rewrite all your modeling code so that each submodule accepts and passes along a device keyword argument. Since this was impractical for the 150 models of the Transformers library, we developed a context manager that will instantiate an empty model for you.

Here is how you can instantiate an empty version of BLOOM:

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

This works on any model, but you get back a shell you can’t use directly: some operations are implemented for the meta device, but not all yet. Here for instance, you can use the large_model defined above with an input, but not the BLOOM model. Even when using it, the output will be a tensor on the meta device, so you will get the shape of the result, but nothing more.
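For instance, here is a quick sketch of calling a meta-device linear layer (the same as large_model above) on a meta input; the batch size is arbitrary:

import torch

large_model = torch.nn.Linear(100000, 100000, device="meta")
x = torch.randn(2, 100000, device="meta")  # inputs must also live on the meta device
y = large_model(x)
print(y.shape)  # torch.Size([2, 100000]) -- we get the shape of the result, but no values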

As further work on this, the PyTorch team is working on a new class, FakeTensor, which is a bit like tensors on the meta device, but with the device information (on top of shape and dtype).

Since we know the shape of every weight, we can nonetheless know how much memory they will all consume once we load the pretrained tensors fully. Therefore, we can make a decision on how to split our model across CPUs and GPUs.
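As a rough sketch of that reasoning (not Accelerate’s actual implementation), we can reuse the empty BLOOM model created above and sum the sizes of its parameters, since each meta tensor knows its shape and dtype:

# Each meta tensor knows its numel and element size, so the total footprint is easy to estimate
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Approximate memory needed for the weights: {total_bytes / 1024**3:.1f} GiB")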



Computing a device map

Before we start loading the pretrained weights, we will need to know where we want to put them. This way we can free the CPU RAM each time we have put a weight in its right place. This can be done with the empty model on the meta device, since we only need to know the shape of each tensor and its dtype to compute how much space it will take in memory.

Accelerate provides a function to automatically determine a device map from an empty model. It will try to maximize the use of all available GPUs, then CPU RAM, and finally flag the weights that don’t fit for disk offload. Let’s have a look using OPT-13b.

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(model)

This will return a dictionary mapping modules or weights to a device. On a machine with one Titan RTX for instance, we get the following:

{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10.self_attn': 0,
 'model.decoder.layers.10.activation_fn': 0,
 'model.decoder.layers.10.self_attn_layer_norm': 0,
 'model.decoder.layers.10.fc1': 'cpu',
 'model.decoder.layers.10.fc2': 'cpu',
 'model.decoder.layers.10.final_layer_norm': 'cpu',
 'model.decoder.layers.11': 'cpu',
 ...
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18.self_attn': 'cpu',
 'model.decoder.layers.18.activation_fn': 'cpu',
 'model.decoder.layers.18.self_attn_layer_norm': 'cpu',
 'model.decoder.layers.18.fc1': 'disk',
 'model.decoder.layers.18.fc2': 'disk',
 'model.decoder.layers.18.final_layer_norm': 'disk',
 'model.decoder.layers.19': 'disk',
 ...
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

Accelerate evaluated that the embeddings and the decoder up until the 9th block could all fit on the GPU (device 0), then part of the 10th block needs to be on the CPU, as well as the following weights until the 17th layer. Then the 18th layer is split between the CPU and the disk, and the following layers must all be offloaded to disk.

Actually using this device map later on won’t work, because the layers composing this model have residual connections (where the input of the block is added to the output of the block), so all of a given layer should be on the same device. We can indicate this to Accelerate by passing a list of module names that shouldn’t be split with the no_split_module_classes keyword argument:

device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"])

This will then return:

{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.9': 0,
 'model.decoder.layers.10': 'cpu',
 'model.decoder.layers.11': 'cpu',
 ...
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18': 'disk',
 ...
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

Now, each layer is always on the same device.

In Transformers, when using device_map in the from_pretrained() method or in a pipeline, those classes of blocks to leave on the same device are automatically provided, so you don’t need to worry about them. Note that you have the following options for device_map (only relevant when you have more than one GPU):

  • "auto" or "balanced": Speed up will split the weights in order that each GPU is used equally;
  • "balanced_low_0": Speed up will split the weights in order that each GPU is used equally except the primary one, where it can attempt to have as little weights as possible (useful when you need to work with the outputs of the model on one GPU, for example when using the generate function);
  • "sequential": Speed up will fill the GPUs so as (so the last ones won’t be used in any respect).

You can also pass your own device_map as long as it follows the format we saw before (a dictionary mapping layer/module names to devices).
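For example, a hand-written device map for OPT-13b on a hypothetical two-GPU machine could look like this (the exact split is illustrative, not a recommendation):

import torch
from transformers import AutoModelForCausalLM

my_device_map = {
    "model.decoder.embed_tokens": 0,
    "model.decoder.embed_positions": 0,
    "model.decoder.final_layer_norm": 0,
    **{f"model.decoder.layers.{i}": 0 for i in range(0, 20)},   # first half on GPU 0
    **{f"model.decoder.layers.{i}": 1 for i in range(20, 40)},  # second half on GPU 1
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-13b", device_map=my_device_map, torch_dtype=torch.float16
)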

Finally, note that the results of the device_map you receive depend on the selected dtype (as different types of floats take a different amount of space). Providing dtype="float16" will give us different results:

device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"], dtype="float16")

In this precision, we can fit the model up to layer 21 on the GPU:



{'model.decoder.embed_tokens': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 ...
 'model.decoder.layers.21': 0,
 'model.decoder.layers.22': 'cpu',
 ...
 'model.decoder.layers.37': 'cpu',
 'model.decoder.layers.38': 'disk',
 'model.decoder.layers.39': 'disk',
 'lm_head': 'disk'}

Now that we know where each weight is supposed to go, we can progressively load the pretrained weights inside the model.



Sharding state dicts

Traditionally, PyTorch models are saved in a whole file containing a map from parameter name to weight. This map is often called a state_dict. Here is an excerpt from the PyTorch documentation on saving and loading:


# Save the model weights
torch.save(my_model.state_dict(), 'model_weights.pth')

# Reload them in a new instance of the model
new_model = ModelClass()
new_model.load_state_dict(torch.load('model_weights.pth'))

This works pretty well for models with less than 1 billion parameters, but for larger models, this is very taxing in RAM. The BLOOM model has 176 billion parameters; even with the weights saved in bfloat16 to save space, it still represents 352GB as a whole. While the supercomputer that trained this model might have this amount of memory available, requiring this for inference is unrealistic.

This is why large models on the Hugging Face Hub are not saved and shared with one big file containing all the weights, but with several of them. If you go to the BLOOM model page for instance, you will see there are 72 files named pytorch_model_xxxxx-of-00072.bin, which each contain part of the model weights. Using this format, we can load one part of the state dict in memory, put the weights inside the model, move them to the right device, then discard this state dict part before going to the next. Instead of requiring enough RAM to accommodate the whole model, we only need enough RAM to load the biggest checkpoint part, which we call a shard, so 7.19GB in the case of BLOOM.

We call checkpoints saved in several files, like BLOOM’s, sharded checkpoints, and we have standardized their format as such:

  • One file (called pytorch_model.bin.index.json) contains some metadata and a map from parameter name to file name, indicating where to find each weight
  • All the other files are standard PyTorch state dicts; they just contain a part of the model instead of the whole one (a rough sketch of the shard-by-shard loop is shown right after this list). You can have a look at the content of the index file here.
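To make the idea concrete, here is a rough sketch (not Accelerate’s actual code) of the shard-by-shard loop, assuming the BLOOM repo has been cloned into a local "bloom" folder; in practice you would combine it with the empty-model and device-map machinery above rather than build the full model on CPU:

import json
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bloom")
model = AutoModelForCausalLM.from_config(config)  # illustration only: a real BLOOM won't fit in CPU RAM

# The index file maps each parameter name to the shard file that contains it
with open("bloom/pytorch_model.bin.index.json") as f:
    index = json.load(f)

for shard_file in sorted(set(index["weight_map"].values())):
    state_dict = torch.load(f"bloom/{shard_file}", map_location="cpu")  # load one shard only
    model.load_state_dict(state_dict, strict=False)                     # put those weights inside the model
    del state_dict                                                      # free that RAM before the next shard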

To load such a sharded checkpoint into a model, we just need to loop over the various shards. Accelerate provides a function called load_checkpoint_in_model that will do this for you if you have cloned one of the repos of the Hub, or you can directly use the from_pretrained method of Transformers, which will handle the downloading and caching for you:

import torch
from transformers import AutoModelForCausalLM


checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.float16)
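If you have already cloned the repo and built the empty model yourself, a sketch with load_checkpoint_in_model could look like this (the local "opt-13b" folder and the "offload" folder are assumptions for this sketch):

from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_in_model
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer"], dtype="float16")
# Loop over the shards of the local clone, placing each weight according to the device map
load_checkpoint_in_model(model, "opt-13b", device_map=device_map, offload_folder="offload")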

If the device map computed automatically requires some weights to be offloaded to disk because you don’t have enough GPU and CPU RAM, you will get an error indicating you need to pass a folder where the weights that should be stored on disk will be offloaded:

ValueError: The current `device_map` had weights offloaded to the disk. Please provide an
`offload_folder` for them.

Adding this argument should resolve the error:

import torch
from transformers import AutoModelForCausalLM


checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", torch_dtype=torch.float16
)

Note that if you are trying to load a very large model that requires some disk offload on top of CPU offload, you might run out of RAM when the last shards of the checkpoint are loaded, since the part of the model staying on the CPU takes up space. If that is the case, use the option offload_state_dict=True to temporarily offload the part of the model staying on the CPU while the weights are being loaded, and reload it in RAM once all the weights have been processed:

import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", offload_state_dict = True, torch_dtype=torch.float16
)

This will fit in Colab, but will be so close to using all the available RAM that it will go out of RAM when you try to generate a prediction. To get a model we can use, we need to offload one more layer to disk. We can do so by taking the device_map computed in the previous section, adapting it a bit, then passing it to the from_pretrained call:

import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
device_map["model.decoder.layers.37"] = "disk"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map=device_map, offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float16
)



Running a model split on several devices

One last part we haven’t touched on yet is how Accelerate enables your model to run with its weights spread across several GPUs, CPU RAM, and the disk folder. This is done very simply using hooks.

Hooks are a PyTorch API that adds functions executed just before each forward is called.

We couldn’t use this directly since they only support models with regular arguments and no keyword arguments in their forward pass, but we took the same idea. Once the model is loaded, the dispatch_model function will add hooks to every module and submodule that are executed before and after each forward pass. They will:

  • make sure all the inputs of the module are on the same device as the weights (a simplified sketch of this follows the list);
  • if the weights have been offloaded to the CPU, move them to GPU 0 before the forward pass and back to the CPU just after;
  • if the weights have been offloaded to disk, load them in RAM then onto GPU 0 before the forward pass and free this memory just after.
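Here is a heavily simplified sketch of the first bullet point using PyTorch’s forward pre-hooks; Accelerate’s real hooks also handle keyword arguments, weight offloading, and much more:

import torch
from torch import nn

def move_inputs_to_weight_device(module, inputs):
    # Executed just before module.forward: put the positional inputs on the module's device
    device = next(module.parameters()).device
    return tuple(x.to(device) if torch.is_tensor(x) else x for x in inputs)

layer = nn.Linear(16, 16)
if torch.cuda.is_available():
    layer.to("cuda")
layer.register_forward_pre_hook(move_inputs_to_weight_device)

out = layer(torch.randn(2, 16))  # a CPU input is moved to the layer's device automatically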

The whole process is summarized in the following video:

This way, your model can be loaded and run even if you don’t have enough GPU RAM and CPU RAM. The only thing you need is disk space (and a lot of patience!). While this solution is pretty naive if you have multiple GPUs (there is no clever pipeline parallelism involved, just using the GPUs sequentially), it still yields pretty decent results for BLOOM. And it lets you run the model on smaller setups (albeit more slowly).

To learn more about Accelerate big model inference, see the documentation.


