For all their amazing performance, state-of-the-art deep learning models often take a long time to train. In order to speed up training jobs, engineering teams rely on distributed training, a divide-and-conquer technique where clustered servers each keep a copy of the model, train it on a subset of the training set, and exchange results to converge to a final model.
Graphics Processing Units (GPUs) have long been the de facto choice for training deep learning models. However, the rise of transfer learning is changing the game. Models are now rarely trained from scratch on huge datasets. Instead, they are frequently fine-tuned on specific (and smaller) datasets, in order to build specialized models that are more accurate than the base model for particular tasks. As these training jobs are much shorter, using a CPU-based cluster can prove to be an interesting option that keeps both training time and cost under control.
What this post is about
In this post, you will learn how to speed up PyTorch training jobs by distributing them on a cluster of Intel Xeon Scalable CPU servers, powered by the Ice Lake architecture and running performance-optimized software libraries. We will build the cluster from scratch using virtual machines, and you should be able to easily replicate the demo on your own infrastructure, either in the cloud or on premise.
Running a text classification job, we will fine-tune a BERT model on the MRPC dataset (one of the tasks included in the GLUE benchmark). The MRPC dataset contains 5,800 sentence pairs extracted from news sources, with a label telling us whether the two sentences in each pair are semantically equivalent. We picked this dataset for its reasonable training time, and trying other GLUE tasks is just a parameter away.
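If you would like a quick look at the data before training, here is a minimal sketch using the datasets library (the same library the training script relies on); the field names are those of the GLUE MRPC configuration.

from datasets import load_dataset

# Download the MRPC subset of GLUE (train/validation/test splits)
mrpc = load_dataset("glue", "mrpc")
print(mrpc)

# Each example holds two sentences and a label (1 = equivalent, 0 = not equivalent)
sample = mrpc["train"][0]
print(sample["sentence1"])
print(sample["sentence2"])
print(sample["label"])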
Once the cluster is up and running, we will run a baseline job on a single server. Then, we will scale it to 2 servers and 4 servers and measure the speed-up.
Along the way, we will cover the following topics:
- Listing the required infrastructure and software building blocks,
- Setting up our cluster,
- Installing dependencies,
- Running a single-node job,
- Running a distributed job.
Let’s get to work!
Using Intel servers
For best performance, we will use Intel servers based on the Ice Lake architecture, which supports hardware features such as Intel AVX-512 and Intel Vector Neural Network Instructions (VNNI). These features accelerate operations typically found in deep learning training and inference. You can learn more about them in this presentation (PDF).
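To check that a given machine exposes these instruction sets, you can inspect the CPU flags reported by Linux; a minimal sketch (flag names may vary slightly across kernel versions):

# avx512f indicates AVX-512 support, avx512_vnni indicates VNNI support
lscpu | tr ' ' '\n' | grep -E '^avx512(f|_vnni)$' | sort -u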
All three major cloud providers (AWS, Azure and Google Cloud) offer virtual machines powered by Intel Ice Lake CPUs.
Of course, you can also use your own servers. If they are based on the Cascade Lake architecture (Ice Lake's predecessor), they are good to go, as Cascade Lake also includes AVX-512 and VNNI.
Using Intel performance libraries
To leverage AVX-512 and VNNI in PyTorch, Intel has designed the Intel extension for PyTorch. This software library provides out-of-the-box speedups for training and inference, so we should definitely install it.
When it comes to distributed training, the main performance bottleneck is often networking. Indeed, the different nodes in the cluster need to periodically exchange model state information to stay in sync. As transformers are large models with billions of parameters (sometimes many more), the volume of data is significant, and things only get worse as the number of nodes increases. Thus, it is important to use a communication library optimized for deep learning.
In fact, PyTorch includes the torch.distributed package, which supports different communication backends. Here, we will use the Intel oneAPI Collective Communications Library (oneCCL), an efficient implementation of communication patterns used in deep learning (all-reduce, etc.). You can learn about the performance of oneCCL versus other backends in this PyTorch blog post.
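To make this concrete, here is a minimal sketch (not taken from the post's training script) of how a PyTorch program selects the oneCCL backend: importing the bindings registers a "ccl" backend, which torch.distributed then initializes like any other. The run_glue.py script we patch later follows the same pattern, driven by environment variables set by the MPI launcher.

import os
import torch
import torch.distributed as dist
import torch_ccl  # importing the oneCCL bindings registers the "ccl" backend

# Rank and world size are typically provided by the launcher (mpirun in our case)
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# MASTER_ADDR and MASTER_PORT must be set so that the ranks can find each other
dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

# A toy all-reduce: every rank contributes a tensor, all ranks receive the sum
t = torch.ones(1) * rank
dist.all_reduce(t)
print(f"rank {rank}/{world_size}: {t.item()}")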
Now that we are clear on the building blocks, let's talk about the overall setup of our training cluster.
Setting up our cluster
In this demo, I'm using Amazon EC2 instances running Amazon Linux 2 (c6i.16xlarge, 64 vCPUs, 128GB RAM, 25Gbit/s networking). Setup will be different in other environments, but the steps should be very similar.
Please keep in mind that you will need 4 identical instances, so you may want to plan for some sort of automation to avoid running the same setup 4 times. Here, I will set up one instance manually, create a new Amazon Machine Image (AMI) from this instance, and use this AMI to launch three identical instances.
From a networking perspective, we will need the following setup:
- Open port 22 for ssh access on all instances, for setup and debugging.
- Configure password-less ssh between the master instance (the one you will launch training from) and all other instances (master included).
- Open all TCP ports on all instances for oneCCL communication inside the cluster. Please make sure NOT to open these ports to the external world. AWS provides a convenient way to do this by only allowing connections from instances running a particular security group. Here is how my setup looks, and a sketch of equivalent CLI commands follows the screenshot.

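If you prefer the command line to the console, here is a hedged sketch of equivalent commands; the security group ID and CIDR are placeholders, and the password-less ssh part relies on all four instances being launched from the same AMI, as described above.

# Placeholder: replace with your own security group ID
SG_ID=sg-0123456789abcdef0

# Allow ssh (port 22) from your own IP range only (use your real CIDR here)
aws ec2 authorize-security-group-ingress --group-id $SG_ID \
    --protocol tcp --port 22 --cidr 203.0.113.0/24

# Allow all TCP ports, but only between instances carrying this same security group
aws ec2 authorize-security-group-ingress --group-id $SG_ID \
    --protocol tcp --port 0-65535 --source-group $SG_ID

# On the instance that will become the AMI: create an ssh key and authorize it locally.
# Since every node is launched from this AMI, the master can then ssh to all of them
# (itself included) without a password.
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys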
Now, let's provision the first instance manually. I first create the instance itself, attach the security group above, and add 128GB of storage. To optimize costs, I have launched it as a spot instance.
Once the instance is up, I connect to it with ssh in order to install dependencies.
Installing dependencies
Here are the steps we will follow:
- Install Intel toolkits,
- Install the Anaconda distribution,
- Create a new conda environment,
- Install PyTorch and the Intel extension for PyTorch,
- Compile and install oneCCL,
- Install the transformers library.
It looks like a lot, but there is nothing complicated. Here we go!
Installing Intel toolkits
First, we download and install the Intel oneAPI base toolkit as well as the AI toolkit. You can learn about them on the Intel website.
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18236/l_BaseKit_p_2021.4.0.3422_offline.sh
sudo bash l_BaseKit_p_2021.4.0.3422_offline.sh
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18235/l_AIKit_p_2021.4.0.1460_offline.sh
sudo bash l_AIKit_p_2021.4.0.1460_offline.sh
Installing Anaconda
Then, we download and install the Anaconda distribution.
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
sh Anaconda3-2021.05-Linux-x86_64.sh
Creating a new conda environment
We log out and log back in to refresh paths. Then, we create a new conda environment to keep things neat and tidy.
yes | conda create -n transformer python=3.7.9 -c anaconda
eval "$(conda shell.bash hook)"
conda activate transformer
yes | conda install pip cmake
Installing PyTorch and the Intel extension for PyTorch
Next, we install PyTorch 1.9 and the Intel extension for PyTorch. Versions must match.
yes | conda install pytorch==1.9.0 cpuonly -c pytorch
pip install torch_ipex==1.9.0 -f https://software.intel.com/ipex-whl-stable
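As an optional sanity check (a sketch), you can confirm the PyTorch version and print its build configuration, which lists the CPU optimizations compiled in:

python -c "import torch; print(torch.__version__); print(torch.__config__.show())"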
Compiling and installing oneCCL
Then, we install some native dependencies required to compile oneCCL.
sudo yum -y update
sudo yum install -y git cmake3 gcc gcc-c++
Next, we clone the oneCCL repository, build the library and install it. Again, versions must match.
source /opt/intel/oneapi/mkl/latest/env/vars.sh
git clone https://github.com/intel/torch-ccl.git
cd torch-ccl
git checkout ccl_torch1.9
git submodule sync
git submodule update --init --recursive
python setup.py install
cd ..
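Before moving on, it is worth checking that the freshly built bindings import cleanly alongside PyTorch (a minimal sketch):

python -c "import torch; import torch_ccl; print('oneCCL bindings OK, torch', torch.__version__)"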
Installing the transformers library
Next, we install the transformers library and dependencies required to run GLUE tasks.
pip install transformers datasets
yes | conda install scipy scikit-learn
Finally, we clone a fork of the transformers repository containing the example we will run.
git clone https://github.com/kding1/transformers.git
cd transformers
git checkout dist-sigopt
We’re done! Let’s run a single-node job.
Launching a single-node job
To get a baseline, let's launch a single-node job running the run_glue.py script in transformers/examples/pytorch/text-classification. This should work on any of the instances, and it's a good sanity check before proceeding to distributed training.
python run_glue.py \
--model_name_or_path bert-base-cased --task_name mrpc \
--do_train --do_eval --max_seq_length 128 \
--per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 \
--output_dir /tmp/mrpc/ --overwrite_output_dir True

This job takes 7 minutes and 46 seconds. Now, let's set up distributed jobs with oneCCL and speed things up!
Setting up a distributed job with oneCCL
Three steps are required to run a distributed training job:
- List the nodes of the training cluster,
- Define environment variables,
- Modify the training script.
Listing the nodes of the training cluster
On the master instance, in transformers/examples/pytorch/text-classification, we create a text file named hostfile. This file stores the names of the nodes in the cluster (IP addresses would work too). The first line should point to the master instance.
Here’s my file:
ip-172-31-28-17.ec2.internal
ip-172-31-30-87.ec2.internal
ip-172-31-29-11.ec2.internal
ip-172-31-20-77.ec2.internal
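Since mpirun will ssh into every node listed in this file, a quick loop confirming password-less access saves painful debugging later; a sketch, run from the master instance:

# Each node should print its hostname without prompting for a password
for host in $(cat hostfile); do
    ssh -o BatchMode=yes $host hostname
done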
Defining environment variables
Next, we need to set some environment variables on the master node, most notably its IP address. You can find more information on oneCCL variables in the documentation.
for nic in eth0 eib0 hib0 enp94s0f0; do
master_addr=$(ifconfig $nic 2>/dev/null | grep netmask | awk '{print $2}'| cut -f2 -d:)
if [ "$master_addr" ]; then
break
fi
done
export MASTER_ADDR=$master_addr
source /home/ec2-user/anaconda3/envs/transformer/lib/python3.7/site-packages/torch_ccl-1.3.0+43f48a1-py3.7-linux-x86_64.egg/torch_ccl/env/setvars.sh
export LD_LIBRARY_PATH=/home/ec2-user/anaconda3/envs/transformer/lib/python3.7/site-packages/torch_ccl-1.3.0+43f48a1-py3.7-linux-x86_64.egg/:$LD_LIBRARY_PATH
export LD_PRELOAD="${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libiomp5.so"
export CCL_WORKER_COUNT=4
export CCL_WORKER_AFFINITY="0,1,2,3,32,33,34,35"
export CCL_ATL_TRANSPORT=ofi
export ATL_PROGRESS_MODE=0
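Optionally, a quick sanity check (a sketch) that the variables are set in the current shell before launching anything:

echo "MASTER_ADDR=${MASTER_ADDR}"
env | grep '^CCL_'
echo "LD_PRELOAD=${LD_PRELOAD}"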
Modifying the training script
The following changes have already been applied to our training script (run_glue.py) in order to enable distributed training. You would need to apply similar changes when using your own training code.
- Import the torch_ccl package,
- Receive the address of the master node and the local rank of the node in the cluster.
+import torch_ccl
+
import datasets
import numpy as np
from datasets import load_dataset, load_metric
@@ -47,7 +49,7 @@ from transformers.utils.versions import require_version
# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
-check_min_version("4.13.0.dev0")
+# check_min_version("4.13.0.dev0")
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt")
@@ -191,6 +193,17 @@ def main():
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
+ # add local rank for cpu-dist
+ sys.argv.append("--local_rank")
+ sys.argv.append(str(os.environ.get("PMI_RANK", -1)))
+
+ # ccl specific environment variables
+ if "ccl" in sys.argv:
+ os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "127.0.0.1")
+ os.environ["MASTER_PORT"] = "29500"
+ os.environ["RANK"] = str(os.environ.get("PMI_RANK", -1))
+ os.environ["WORLD_SIZE"] = str(os.environ.get("PMI_SIZE", -1))
+
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
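The PMI_RANK and PMI_SIZE variables read above are injected by the Intel MPI launcher into every process it starts. If you want to see them for yourself before running a real job, here is a hypothetical one-liner reusing the same hostfile:

# Each rank prints its hostname and the per-process variables set by the MPI launcher
mpirun -f hostfile -np 2 -ppn 1 bash -c 'echo "$(hostname) PMI_RANK=$PMI_RANK PMI_SIZE=$PMI_SIZE"'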
Setup is now complete. Let’s scale our training job to 2 nodes and 4 nodes.
Running a distributed job with oneCCL
On the master node, I use mpirun to launch a 2-node job: -np (number of processes) is set to 2 and -ppn (processes per node) is set to 1. Hence, the first two nodes in hostfile will be selected.
mpirun -f hostfile -np 2 -ppn 1 -genv I_MPI_PIN_DOMAIN=[0xfffffff0] \
-genv OMP_NUM_THREADS=28 python run_glue.py \
--model_name_or_path distilbert-base-uncased --task_name mrpc \
--do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 \
--learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/mrpc/ \
--overwrite_output_dir True --xpu_backend ccl --no_cuda True
Within seconds, a job starts on the first two nodes. The job completes in 4 minutes and 39 seconds, a 1.7x speedup.

Setting -np to 4 and launching a new job, I now see one process running on each node of the cluster.

Training completes in 2 minutes and 36 seconds, a 3x speedup.
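For reference, the 4-node launch is the same command as before with only -np changed; a sketch under the same assumptions:

mpirun -f hostfile -np 4 -ppn 1 -genv I_MPI_PIN_DOMAIN=[0xfffffff0] \
-genv OMP_NUM_THREADS=28 python run_glue.py \
--model_name_or_path distilbert-base-uncased --task_name mrpc \
--do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 \
--learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/mrpc/ \
--overwrite_output_dir True --xpu_backend ccl --no_cuda True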
One last thing. Changing --task_name to qqp, I also ran the Quora Question Pairs GLUE task, which is based on a much larger dataset (over 400,000 training samples). The fine-tuning times were:
- Single-node: 11 hours 22 minutes,
- 2 nodes: 6 hours and 38 minutes (1.71x),
- 4 nodes: 3 hours and 51 minutes (2.95x).
It looks like the speedup is pretty consistent. Feel free to keep experimenting with different learning rates, batch sizes and oneCCL settings. I'm sure you can go even faster!
Conclusion
In this post, you have learned how to build a distributed training cluster based on Intel CPUs and performance libraries, and how to use this cluster to speed up fine-tuning jobs. Indeed, transfer learning is putting CPU training back into the game, and you should definitely consider it when designing and building your next deep learning workflows.
Thanks for reading this long post. I hope you found it informative. Feedback and questions are welcome at julsimon@huggingface.co. Until next time, keep learning!
Julien
