Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel

Sourab Mangrulkar, Sylvain Gugger


In this post we will look at how we can leverage the Accelerate library for training large models, which enables users to leverage the latest features of PyTorch FullyShardedDataParallel (FSDP).



Motivation 🤗

With the ever increasing scale, size and parameters of Machine Learning (ML) models, ML practitioners are finding it difficult to train or even load such large models on their hardware. On the one hand, it has been found that large models learn quickly (are data and compute efficient) and are significantly more performant compared to smaller models [1]; on the other hand, it becomes prohibitive to train such models on most of the available hardware.

Distributed training is the key to enabling the training of such large ML models. There have been major recent advances in the field of Distributed Training at Scale. A few of the most notable advances are given below:

  1. Data Parallelism using ZeRO – Zero Redundancy Optimizer [2]
    1. Stage 1: Shards optimizer states across data parallel workers/GPUs
    2. Stage 2: Shards optimizer states + gradients across data parallel workers/GPUs
    3. Stage 3: Shards optimizer states + gradients + model parameters across data parallel workers/GPUs
    4. CPU Offload: Offloads the gradients + optimizer states to CPU, building on top of ZeRO Stage 2 [3]
  2. Tensor Parallelism [4]: Form of model parallelism wherein the parameters of individual layers with a huge number of parameters are sharded across accelerators/GPUs in a clever manner to achieve parallel computation while avoiding expensive communication synchronization overheads.
  3. Pipeline Parallelism [5]: Form of model parallelism wherein different layers of the model are put on different accelerators/GPUs and pipelining is employed to keep all of the accelerators running simultaneously. Here, for instance, the second accelerator/GPU computes on the first micro-batch while the first accelerator/GPU computes on the second micro-batch.
  4. 3D parallelism [3]: Employs Data Parallelism using ZeRO + Tensor Parallelism + Pipeline Parallelism to train humongous models in the order of 100s of billions of parameters. For instance, the BigScience 176B parameter language model employs this [6].

In this post we will look at Data Parallelism using ZeRO and, more specifically, the latest PyTorch feature FullyShardedDataParallel (FSDP). DeepSpeed and FairScale have implemented the core ideas of the ZeRO paper. These have already been integrated in the transformers Trainer and are accompanied by a great blog post, Fit More and Train Faster With ZeRO via DeepSpeed and FairScale [10]. PyTorch recently upstreamed the FairScale FSDP into PyTorch Distributed with additional optimizations.



Accelerate 🚀: Leverage PyTorch FSDP without any code changes

We will look at the task of Causal Language Modelling using the GPT-2 Large (762M) and XL (1.5B) model variants.

Below is the code for pre-training the GPT-2 model. It is similar to the official causal language modeling example here, with the addition of two arguments, n_train (2000) and n_val (500), to prevent preprocessing/training on the entire data in order to perform quick proof-of-concept benchmarks.

run_clm_no_trainer.py
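
Under the hood, the truncation these two arguments perform amounts to something like the following sketch using the datasets library (a hypothetical illustration; the actual script wires the values through argparse):

```python
from datasets import load_dataset

# Hypothetical sketch of what the added n_train / n_val arguments do: keep only a
# small slice of each split so that preprocessing and training finish quickly.
n_train, n_val = 2000, 500
raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
raw_datasets["train"] = raw_datasets["train"].select(range(n_train))
raw_datasets["validation"] = raw_datasets["validation"].select(range(n_val))
```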

Sample FSDP config after running the command accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
fsdp_config:
  min_num_params: 2000
  offload_params: false
  sharding_strategy: 1
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
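
For reference, the integer value of sharding_strategy maps onto PyTorch's ShardingStrategy enum; the quick check below assumes PyTorch 1.12, where the enum is publicly exposed:

```python
from torch.distributed.fsdp import ShardingStrategy

# 1 -> FULL_SHARD  (ZeRO Stage 3: shard optimizer states, gradients and parameters)
# 2 -> SHARD_GRAD_OP (ZeRO Stage 2: shard optimizer states and gradients)
print(ShardingStrategy(1))  # ShardingStrategy.FULL_SHARD
print(ShardingStrategy(2))  # ShardingStrategy.SHARD_GRAD_OP
```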



Multi-GPU FSDP

Here, we experiment on the Single-Node Multi-GPU setting. We compare the performance of Distributed Data Parallel (DDP) and FSDP in various configurations. First, the GPT-2 Large (762M) model is used, for which DDP works with certain batch sizes without throwing Out Of Memory (OOM) errors. Next, the GPT-2 XL (1.5B) model is used, for which DDP fails with an OOM error even with a batch size of 1. We observe that FSDP enables larger batch sizes for the GPT-2 Large model, and it enables training the GPT-2 XL model with a decent batch size, unlike DDP.

Hardware setup: 2X24GB NVIDIA Titan RTX GPUs.

Command for training GPT-2 Large Model (762M parameters):

export BS=

time accelerate launch run_clm_no_trainer.py \
--model_name_or_path gpt2-large \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--num_train_epochs 1 \
--block_size 12

Sample FSDP Run:

[Screenshot: Sample FSDP training run logs]

| Method | Batch Size Max ($BS) | Approx Train Time (minutes) | Notes |
| --- | --- | --- | --- |
| DDP (Distributed Data Parallel) | 7 | 15 | |
| DDP + FP16 | 7 | 8 | |
| FSDP with SHARD_GRAD_OP | 11 | 11 | |
| FSDP with min_num_params = 1M + FULL_SHARD | 15 | 12 | |
| FSDP with min_num_params = 2K + FULL_SHARD | 15 | 13 | |
| FSDP with min_num_params = 1M + FULL_SHARD + Offload to CPU | 20 | 23 | |
| FSDP with min_num_params = 2K + FULL_SHARD + Offload to CPU | 22 | 24 | |

Table 1: Benchmarking FSDP on GPT-2 Large (762M) model

With respect to DDP, from Table 1 we can observe that FSDP enables larger batch sizes, up to 2X-3X larger without and with the CPU offload setting, respectively. In terms of train time, DDP with mixed precision is the fastest, followed by FSDP using ZeRO Stage 2 and Stage 3, respectively. As the task of causal language modelling always has a fixed context sequence length (--block_size), the train time speedup with FSDP wasn't that great. For applications with dynamic batching, FSDP, which enables larger batch sizes, will likely have a considerable speed up in terms of train time. FSDP mixed precision support currently has a few issues with transformers. Once this is supported, the training time speed up will improve considerably.



CPU Offloading to enable training humongous models that won’t fit the GPU memory

Command for training GPT-2 XL Model (1.5B parameters):

export BS=

time accelerate launch run_clm_no_trainer.py \
--model_name_or_path gpt2-xl \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--num_train_epochs 1 \
--block_size 12

| Method | Batch Size Max ($BS) | Num GPUs | Approx Train Time (Hours) | Notes |
| --- | --- | --- | --- | --- |
| DDP | 1 | 1 | NA | OOM Error: RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 23.65 GiB total capacity; 22.27 GiB already allocated; 20.31 MiB free; 22.76 GiB reserved in total by PyTorch) |
| DDP | 1 | 2 | NA | OOM Error (same as above) |
| DDP + FP16 | 1 | 1 | NA | OOM Error (same as above) |
| FSDP with min_num_params = 2K | 5 | 2 | 0.6 | |
| FSDP with min_num_params = 2K + Offload to CPU | 10 | 1 | 3 | |
| FSDP with min_num_params = 2K + Offload to CPU | 14 | 2 | 1.16 | |

Table 2: Benchmarking FSDP on GPT-2 XL (1.5B) model

From Table 2, we can observe that DDP (with and without fp16) isn't even able to run with a batch size of 1 and results in a CUDA OOM error. FSDP with ZeRO Stage 3 is able to run on 2 GPUs with a batch size of 5 (effective batch size = 10 (5 X 2)). FSDP with CPU offload can further increase the max batch size to 14 per GPU when using 2 GPUs. FSDP with CPU offload enables training the GPT-2 1.5B model on a single GPU with a batch size of 10. This enables ML practitioners with minimal compute resources to train such large models, thereby democratizing large model training.



Capabilities and limitations of the FSDP Integration

Let's dive into the current support that Accelerate provides for the FSDP integration and the known limitations.

Required PyTorch version for FSDP support: PyTorch Nightly (or 1.12.0 if you are reading this after its release), as model saving with FSDP activated is only available with recent fixes.

Configuration through CLI:

  1. Sharding Strategy: [1] FULL_SHARD, [2] SHARD_GRAD_OP
  2. Min Num Params: FSDP's minimum number of parameters for Default Auto Wrapping.
  3. Offload Params: Decides whether to offload parameters and gradients to CPU.

For more control, users can leverage the FullyShardedDataParallelPlugin, wherein they can specify auto_wrap_policy, backward_prefetch and ignored_modules.

After creating an instance of this class, users can pass it when creating the Accelerator object.
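
A minimal sketch of that flow is shown below; the exact plugin field names and defaults may differ across Accelerate versions, so treat it as illustrative rather than canonical:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import BackwardPrefetch

# Illustrative plugin configuration; field availability may vary by Accelerate version.
fsdp_plugin = FullyShardedDataParallelPlugin(
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch next parameters during backward
    ignored_modules=None,                             # optionally exclude submodules from sharding
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```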

For more information on these options, please refer to the PyTorch FullyShardedDataParallel code.

Next, we will look at the importance of the min_num_params config. Below is an excerpt from [8] detailing the importance of the FSDP Auto Wrap Policy.

[Figure: Importance of FSDP Auto Wrap Policy (Source: link)]

When using the default_auto_wrap_policy, a layer is wrapped in an FSDP module if the number of parameters in that layer is more than min_num_params. The code for finetuning the BERT-Large (330M) model on the GLUE MRPC task is the official complete NLP example outlining how to properly use the FSDP feature, with the addition of utilities for tracking peak memory usage.

fsdp_with_peak_mem_tracking.py

We leverage the tracking functionality support in Accelerate to log the train and evaluation peak memory usage along with the evaluation metrics. Below is a snapshot of the plots from the wandb run.

[Screenshot: Weights & Biases run with peak memory usage and evaluation metric plots]

We can observe that DDP takes twice as much memory as FSDP with auto wrap. FSDP without auto wrap takes more memory than FSDP with auto wrap, but considerably less than DDP. FSDP with auto wrap with min_num_params=2k takes marginally less memory compared to the setting with min_num_params=1M. This highlights the importance of the FSDP Auto Wrap Policy, and users should play around with min_num_params to find the setting which considerably saves memory and doesn't result in a lot of communication overhead. The PyTorch team is working on an auto tuning tool for this config, as mentioned in [8].
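
For context, here is roughly what this size-based auto wrapping corresponds to in plain PyTorch, following the pattern from [8] (a sketch assuming PyTorch 1.11 naming, where the policy is called default_auto_wrap_policy; later releases rename it to size_based_auto_wrap_policy, and Accelerate performs this wrapping for you):

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import default_auto_wrap_policy
from transformers import AutoModelForSequenceClassification

# Assumes the script was started via `accelerate launch` or `torchrun`, so that a
# distributed process group is already initialized and each rank owns one GPU.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
model = model.to(torch.cuda.current_device())

# Wrap every submodule holding more than 2000 parameters in its own FSDP unit,
# mirroring min_num_params = 2K from the experiments above.
my_auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=2000)
model = FSDP(model, fsdp_auto_wrap_policy=my_auto_wrap_policy)
```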



A few caveats to be aware of

  • PyTorch FSDP auto wraps sub-modules, flattens the parameters and shards the parameters in place. Due to this, any optimizer created before model wrapping gets broken and occupies more memory. Hence, it is highly recommended and efficient to prepare the model before creating the optimizer. Accelerate will automatically wrap the model and create an optimizer for you in the case of a single model, with a warning message.

    FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer

However, below is the recommended way to prepare the model and optimizer while using FSDP:

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+ model = accelerator.prepare(model)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)

- model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(model,
-        optimizer, train_dataloader, eval_dataloader, lr_scheduler
-    )

+ optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+         optimizer, train_dataloader, eval_dataloader, lr_scheduler
+        )
  • In the case of a single model, if you have created the optimizer with multiple parameter groups and called prepare with them together, then the parameter groups will be lost and the following warning is displayed:

    FSDP Warning: When using FSDP, several parameter groups will be conflated into a single one due to nested module wrapping and parameter flattening.

This is because parameter groups created before wrapping will have no meaning post wrapping, due to the parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers). For instance, below are the named parameters of the FSDP model on GPU 0 (when using 2 GPUs; there are around 55M (110M/2) params in 1D arrays, as this rank has the first shard of the parameters). Here, if one has applied no weight decay for the [bias, LayerNorm.weight] named parameters of the unwrapped BERT-Base model, it can't be applied to the FSDP wrapped model below, as there are no named parameters with either of those strings and the parameters of those layers are concatenated with parameters of various other layers. More details are mentioned in this issue (The original model parameters' .grads are not set, meaning that they cannot be optimized separately (which is why we cannot support multiple parameter groups)).

```
{
'_fsdp_wrapped_module.flat_param': torch.Size([494209]),

'_fsdp_wrapped_module._fpw_module.bert.embeddings.word_embeddings._fsdp_wrapped_module.flat_param': torch.Size([11720448]),

'_fsdp_wrapped_module._fpw_module.bert.encoder._fsdp_wrapped_module.flat_param': torch.Size([42527232])
}
```
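
One hypothetical way to see this in practice is to run the usual no-weight-decay name filter over the FSDP-prepared model from the earlier snippet; after wrapping, it matches nothing:

```python
# Hypothetical check on the FSDP-prepared `model` from the snippet above: every
# parameter name is now a flattened `flat_param`, so the bias / LayerNorm.weight
# filter that is normally used to build two parameter groups finds no matches.
no_decay = ["bias", "LayerNorm.weight"]
matching = [n for n, _ in model.named_parameters() if any(nd in n for nd in no_decay)]
print(matching)  # [] once the model is wrapped by FSDP
```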
  • In the case of multiple models, it is necessary to prepare the models before creating the optimizers, else it will throw an error.

  • Mixed precision is currently not supported with FSDP, as we wait for PyTorch to fix support for it.



How it works 📝

[Figure: FSDP Workflow (Source: link)]

The above workflow gives an overview of what happens behind the scenes when FSDP is activated. Let's first understand how DDP works and how FSDP improves upon it. In DDP, each worker/accelerator/GPU has a replica of the entire model parameters, gradients and optimizer states. Each worker gets a different batch of data; it goes through the forward pass, a loss is computed, followed by the backward pass to generate gradients. Now, an all-reduce operation is performed wherein each worker gets the gradients from the remaining workers and averaging is done. In this way, each worker now has the same global gradients, which are used by the optimizer to update the model parameters. We can see that having full replicas consumes a lot of redundant memory on each GPU, which limits the batch size as well as the size of the models.

FSDP precisely addresses this by sharding the optimizer states, gradients and model parameters across the data parallel workers. It further facilitates CPU offloading of all those tensors, thereby enabling the loading of large models which won't fit the available GPU memory. Similar to DDP, each worker gets a different batch of data. During the forward pass, if CPU offload is enabled, the parameters of the local shard are first copied to the GPU/accelerator. Then, each worker performs an all-gather operation for a given FSDP wrapped module/layer(s) to get all the needed parameters; computation is performed, followed by releasing/emptying the parameter shards of other workers. This continues for all the FSDP modules. The loss gets computed after the forward pass, and during the backward pass, again an all-gather operation is performed to get all the needed parameters for a given FSDP module, computation is performed to get local gradients, followed by releasing the shards of other workers. Now, the local gradients are averaged and sharded to each relevant worker using a reduce-scatter operation. This allows each worker to update the parameters of its local shard. If CPU offload is activated, the gradients are passed to the CPU for updating the parameters directly on the CPU.
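
To make the memory argument concrete, below is a rough back-of-the-envelope calculation for the GPT-2 XL (1.5B) model on the 2-GPU setup used earlier (a sketch that only counts fp32 parameters, gradients and Adam optimizer states, ignoring activations and other overheads):

```python
# Back-of-the-envelope per-GPU memory for parameters + gradients + Adam states (fp32).
# Activations, buffers and fragmentation are ignored; this only illustrates why
# sharding helps, it does not reproduce the exact numbers from the tables above.
def per_gpu_memory_gb(num_params: float, num_gpus: int, sharded: bool) -> float:
    bytes_per_param = 4 + 4 + 8  # fp32 param + fp32 grad + Adam exp_avg / exp_avg_sq
    total = num_params * bytes_per_param
    if sharded:                  # FULL_SHARD / ZeRO Stage 3: everything is split
        total /= num_gpus
    return total / 1024**3

gpt2_xl_params = 1.5e9
print(f"DDP : ~{per_gpu_memory_gb(gpt2_xl_params, 2, sharded=False):.0f} GB per GPU")
print(f"FSDP: ~{per_gpu_memory_gb(gpt2_xl_params, 2, sharded=True):.0f} GB per GPU")
```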

Please refer to [7, 8, 9] for all the in-depth details on the workings of PyTorch FSDP and the extensive experimentation carried out using this feature.



Issues

If you encounter any issues with the integration part of PyTorch FSDP, please open an issue in accelerate.

But if you have problems with PyTorch FSDP configuration and deployment, you need to ask the experts in their domains; therefore, please open a PyTorch issue instead.



References

[1] Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

[2] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

[3] DeepSpeed: Extreme-scale model training for everyone – Microsoft Research

[4] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

[5] Introducing GPipe, an Open Source Library for Efficiently Training Large-scale Neural Network Models

[6] Which hardware do you need to train a 176B parameters model?

[7] Introducing PyTorch Fully Sharded Data Parallel (FSDP) API | PyTorch

[8] Getting Started with Fully Sharded Data Parallel (FSDP) — PyTorch Tutorials 1.11.0+cu102 documentation

[9] Training a 1 Trillion Parameter Model With PyTorch Fully Sharded Data Parallel on AWS | by PyTorch | PyTorch | Mar, 2022 | Medium

[10] Fit More and Train Faster With ZeRO via DeepSpeed and FairScale


