Introducing 🤗 Accelerate

By Sylvain Gugger

Run your raw PyTorch training scripts on any kind of device.

Most high-level libraries built on top of PyTorch provide support for distributed training and mixed precision, but the abstractions they introduce require users to learn a new API if they want to customize the underlying training loop. 🤗 Accelerate was created for PyTorch users who like to have full control over their training loops but are reluctant to write (and maintain) the boilerplate code needed to use distributed training (for multi-GPU on one or several nodes, TPUs, …) or mixed precision training. Plans forward include support for FairScale, DeepSpeed, AWS SageMaker specific data-parallelism and model parallelism.

It provides two things: a simple and consistent API that abstracts that boilerplate code, and a launcher command to easily run those scripts on various setups.



Easy integration!

Let’s first look at an example:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = "cpu"
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optim = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optim, data = accelerator.prepare(model, optim, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optim.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optim.step()

By adding just five lines of code to any standard PyTorch training script, you can now run that script on any kind of distributed setting, with or without mixed precision. 🤗 Accelerate even handles the device placement for you, so you can simplify the training loop above even further:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = "cpu"

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optim = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optim, data = accelerator.prepare(model, optim, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optim.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optim.step()

In contrast, here are the changes needed to make this code run with distributed training:

+ import os
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from torch.utils.data import DistributedSampler
+ from torch.nn.parallel import DistributedDataParallel

+ local_rank = int(os.environ.get("LOCAL_RANK", -1))
+ torch.distributed.init_process_group(backend="nccl")
- device = "cpu"
+ device = torch.device("cuda", local_rank)

  model = torch.nn.Transformer().to(device)
+ model = DistributedDataParallel(model)  
  optim = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
+ sampler = DistributedSampler(dataset)
- data = torch.utils.data.DataLoader(dataset, shuffle=True)
+ data = torch.utils.data.DataLoader(dataset, sampler=sampler)

  model.train()
  for epoch in range(10):
+     sampler.set_epoch(epoch)  
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optim.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

          loss.backward()

          optim.step()

These changes will make your training script work for multiple GPUs, but your script will then stop working on CPU or a single GPU (unless you start adding if statements everywhere). Even more annoying, if you want to test your script on TPUs you need to change different lines of code. Same for mixed precision training. The promise of 🤗 Accelerate is:

  • to keep the changes to your training loop to the bare minimum so you have to learn as little as possible.
  • to have the same functions work for any distributed setup, so you only have to learn one API.



How does it work?

To see how the library works in practice, let’s look at each line of code we need to add to a training loop.

accelerator = Accelerator()

On top of giving you the main object you will use, this line will detect from the environment the type of distributed training run and perform the necessary initialization. You can force training on CPU or mixed precision training by passing cpu=True or fp16=True to this init. Both of these options can also be set using the launcher for your script.
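As a rough illustration of what "detecting the setup from the environment" means (a toy sketch, not 🤗 Accelerate's actual implementation), one could inspect the variables a distributed launcher such as torch.distributed.launch sets for each process:

```python
def detect_setup(env):
    """Toy sketch: infer the kind of training run from environment
    variables like WORLD_SIZE and LOCAL_RANK set by a launcher."""
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", -1))
    if world_size > 1:
        return {"type": "multi_gpu", "num_processes": world_size, "local_rank": local_rank}
    return {"type": "single_process", "num_processes": 1, "local_rank": local_rank}

# A launcher spawning 2 processes would set something like this for process 0:
info = detect_setup({"WORLD_SIZE": "2", "LOCAL_RANK": "0"})
assert info == {"type": "multi_gpu", "num_processes": 2, "local_rank": 0}
```

In the real library this detection also covers TPUs and mixed precision; the point is only that no information has to be passed explicitly in your script.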

model, optim, data = accelerator.prepare(model, optim, data)

This is the main bulk of the API and will prepare the three main types of objects: models (torch.nn.Module), optimizers (torch.optim.Optimizer) and dataloaders (torch.utils.data.DataLoader).



Model

Model preparation includes wrapping it in the proper container (for instance DistributedDataParallel) and putting it on the proper device. As with regular distributed training, you will need to unwrap your model for saving, or to access its specific methods, which can be done with accelerator.unwrap_model(model).



Optimizer

The optimizer is also wrapped in a special container that will perform the necessary operations in the step to make mixed precision work. It will also properly handle device placement of the state dict if it is non-empty or loaded from a checkpoint.
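To give an idea of what "handling device placement of the state dict" involves (a hypothetical sketch, not the library's code), every tensor leaf in the nested optimizer state has to be sent to the right device when loading from a checkpoint. A tiny stand-in class replaces torch.Tensor so the sketch runs on its own:

```python
def move_to_device(obj, device):
    """Recursively send every tensor-like leaf of a nested structure
    (dicts and lists) to `device`; leave other values untouched."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, list):
        return [move_to_device(v, device) for v in obj]
    return obj

class FakeTensor:
    """Minimal stand-in for torch.Tensor, just enough for the sketch."""
    def __init__(self, device="cpu"):
        self.device = device
    def to(self, device):
        return FakeTensor(device)

# Shape of an Adam-style optimizer state dict, with fake tensors:
state = {"state": {0: {"exp_avg": FakeTensor()}}, "param_groups": [{"lr": 1e-3}]}
moved = move_to_device(state, "cuda:0")
assert moved["state"][0]["exp_avg"].device == "cuda:0"
assert moved["param_groups"][0]["lr"] == 1e-3  # non-tensor values are preserved
```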



DataLoader

This is where most of the magic is hidden. As you have seen in the code example, the library does not rely on a DistributedSampler: it will actually work with whatever sampler you pass to your dataloader (if you ever had to write a distributed version of your custom sampler, there is no need for that anymore!). The dataloader is wrapped in a container that will only grab the indices relevant to the current process in the sampler (or skip the batches for the other processes if you use an IterableDataset) and put the batches on the proper device.
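Conceptually (a simplified toy sketch, not the actual wrapper), each process can iterate over the full sequence of batches produced by your sampler and keep only every num_processes-th one, so together the processes cover the dataset without overlap:

```python
def shard_batches(batches, process_index, num_processes):
    """Toy sketch of per-process sharding: process i keeps batches
    i, i + num_processes, i + 2 * num_processes, ..."""
    return [b for j, b in enumerate(batches) if j % num_processes == process_index]

batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
# With 2 processes, each one sees half the batches, with no overlap:
assert shard_batches(batches, 0, 2) == [[0, 1], [4, 5]]
assert shard_batches(batches, 1, 2) == [[2, 3], [6, 7]]
```

Because the sharding happens on the batches your own sampler produced, any custom sampling logic keeps working unchanged.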

For this to work, 🤗 Accelerate provides a utility function that will synchronize the random number generators on each of the processes run during distributed training. By default, it only synchronizes the generator of your sampler, so your data augmentation will be different on each process, but the random shuffling will be the same. You can of course use this utility to synchronize more RNGs if you need it.
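To see why this matters, here is a toy sketch using Python's random module in place of torch generators: if every process seeds its sampler generator identically at each epoch, the shuffles agree across processes, while a separate, rank-seeded generator keeps data augmentation different per process:

```python
import random

def epoch_shuffle(indices, epoch, seed=42):
    """Shuffle with a generator every process seeds identically per epoch."""
    g = random.Random(seed + epoch)
    out = list(indices)
    g.shuffle(out)
    return out

# Two "processes" compute the same shuffled order for the same epoch:
order_p0 = epoch_shuffle(range(8), epoch=0)
order_p1 = epoch_shuffle(range(8), epoch=0)
assert order_p0 == order_p1

# Meanwhile each process keeps its own augmentation RNG, seeded by its rank,
# so random augmentations differ across processes:
aug_p0 = random.Random(0).random()
aug_p1 = random.Random(1).random()
assert aug_p0 != aug_p1
```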

accelerator.backward(loss)

This last line adds the necessary steps for the backward pass (mostly for mixed precision, but other integrations will require some custom behavior here).



What about evaluation?

Evaluation can either be run normally on all processes, or if you only want it to run on the main process, you can use this handy test:

if accelerator.is_main_process:
    # run evaluation on the main process only

But you can also very easily run distributed evaluation using 🤗 Accelerate; here is what you would need to add to your evaluation loop:

+ eval_dataloader = accelerator.prepare(eval_dataloader)
  predictions, labels = [], []
  for source, targets in eval_dataloader:
      with torch.no_grad():
          output = model(source)

-     predictions.append(output.cpu().numpy())
-     labels.append(targets.cpu().numpy())
+     predictions.append(accelerator.gather(output).cpu().numpy())
+     labels.append(accelerator.gather(targets).cpu().numpy())

  predictions = np.concatenate(predictions)
  labels = np.concatenate(labels)

+ predictions = predictions[:len(eval_dataloader.dataset)]
+ labels = labels[:len(eval_dataloader.dataset)]

  metric_compute(predictions, labels)

As for training, you need to add one line to prepare your evaluation dataloader. Then you can just use accelerator.gather to gather the tensors of predictions and labels across processes. The last two lines added truncate the predictions and labels to the number of examples in your dataset, because the prepared evaluation dataloader will return a few more elements to make sure batches all have the same size on each process.
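To make the padding-then-truncation concrete, here is a toy sketch in plain Python (hypothetical names, not the library's internals): with 10 examples, a per-process batch size of 4 and 2 processes, the prepared dataloader cycles back to the first examples so both processes always see full batches, and gathering therefore returns a few extra predictions that must be dropped:

```python
def padded_indices(n, batch_size, num_processes):
    """Toy sketch: pad the dataset indices so they divide evenly into
    equal-sized batches across all processes."""
    total = batch_size * num_processes
    padded = list(range(n))
    while len(padded) % total != 0:
        padded.append(padded[len(padded) - n])  # cycle back to the start
    return padded

# 10 examples, batch size 4, 2 processes -> padded to 16 entries:
indices = padded_indices(10, batch_size=4, num_processes=2)
assert len(indices) == 16

# After gathering, predictions for the padded duplicates are dropped:
gathered_predictions = [i * 10 for i in indices]  # fake "model outputs"
predictions = gathered_predictions[:10]           # the truncation step
assert predictions == [i * 10 for i in range(10)]
```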



One launcher to rule them all

Scripts using 🤗 Accelerate will be completely compatible with your traditional launchers, such as torch.distributed.launch. But remembering all their arguments is a bit annoying, and once you have set up your instance with 4 GPUs, you will run most of your trainings using all of them. 🤗 Accelerate comes with a handy CLI that works in two steps:

accelerate config

This will trigger a little questionnaire about your setup, which will create a config file you can edit, with all the defaults for your training commands. Then

accelerate launch path_to_script.py --args_to_the_script

will launch your training script using those defaults. The only thing you have to do is provide all the arguments needed by your training script.

To make this launcher even more awesome, you can use it to spawn an AWS instance using SageMaker. Check out this guide to discover how!



How to get involved?

To get started, just pip install accelerate or see the documentation for more install options.

🤗 Accelerate is a fully open-source project: you can find it on GitHub, have a look at its documentation or skim through our basic examples. Please let us know if you have any issue or feature you would like the library to support. For all questions, the forums are the place to check!

For more complex examples in situation, you can look at the official Transformers examples. Each folder contains a run_task_no_trainer.py script that leverages the 🤗 Accelerate library!


