Training large language models in PyTorch requires more than a simple training loop. The training is usually distributed across multiple devices, with many optimization techniques for stable and efficient training. The Hugging Face 🤗 Accelerate library was created to support distributed training across GPUs and TPUs with very easy integration into the training loops. 🤗 Transformers also supports distributed training through the Trainer API, which provides feature-complete training in PyTorch, without even needing to implement a training loop.
Another popular tool among researchers to pre-train large transformer models is Megatron-LM, a powerful framework developed by the Applied Deep Learning Research team at NVIDIA. Unlike Accelerate and the Trainer, using Megatron-LM is not straightforward and can be a little overwhelming for beginners. But it is highly optimized for training on GPUs and can give some speedups. In this blogpost, you will learn how to train a language model on NVIDIA GPUs in Megatron-LM, and use it with transformers.
We will try to break down the different steps for training a GPT2 model in this framework, which include:
- Environment setup
- Data preprocessing
- Training
- Model conversion to 🤗 Transformers
Why Megatron-LM?
Before getting into the training details, let's first understand what makes this framework more efficient than others. This section is inspired by this great blog about BLOOM training with Megatron-DeepSpeed, please refer to it for more details as this blog is intended to give a gentle introduction to Megatron-LM.
DataLoader
Megatron-LM comes with an efficient DataLoader where the data is tokenized and shuffled before the training. It also splits the data into numbered sequences with indexes that are stored such that they need to be computed only once. To build the index, the number of epochs is computed based on the training parameters and an ordering is created and then shuffled. This is unlike most cases where we iterate through the entire dataset until it is exhausted and then repeat for the second epoch. This smoothes the learning curve and saves time during the training.
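To make the idea concrete, here is a minimal conceptual sketch in Python of what such a precomputed index could look like. This is not the actual Megatron-LM implementation, just an illustration of building the shuffled ordering for all epochs once and caching it:

import numpy as np

# Conceptual sketch (not Megatron-LM code): build the ordering for all epochs
# up front, shuffle it once, and cache it so it never has to be recomputed.
def build_sample_index(num_samples, train_samples, seed=1234, cache_path="sample_index.npy"):
    num_epochs = -(-train_samples // num_samples)  # ceiling division
    order = np.tile(np.arange(num_samples), num_epochs)
    np.random.RandomState(seed).shuffle(order)
    np.save(cache_path, order)  # stored once, simply reloaded on the next run
    return order[:train_samples]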
Fused CUDA Kernels
When a computation is run on the GPU, the needed data is fetched from memory, then the computation is run and the result is saved back into memory. In simple terms, the idea of fused kernels is that similar operations, usually performed individually by PyTorch, are combined into a single hardware operation. So they reduce the number of memory movements done in multiple discrete computations by merging them into one. The figure below illustrates the idea of Kernel Fusion. It is inspired by this paper, which discusses the concept in detail.
When f, g and h are fused in one kernel, the intermediary results x' and y' of f and g are stored in the GPU registers and immediately used by h. But without fusion, x' and y' would need to be copied to memory and then loaded by h. Therefore, Kernel Fusion gives a significant speed up to the computations.
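As a rough illustration of the concept (this is generic PyTorch, not Megatron-LM's hand-written kernels), PyTorch's JIT can fuse a chain of elementwise operations into a single kernel on the GPU:

import torch

def f_g_h_eager(x):
    # three separate elementwise kernels: each one reads from and writes to GPU memory
    y = x * 2.0           # f
    z = torch.sin(y)      # g
    return z + 1.0        # h

@torch.jit.script
def f_g_h_fused(x):
    # the TorchScript fuser can compile this chain into one kernel, keeping the
    # intermediary results in registers instead of round-tripping through memory
    return torch.sin(x * 2.0) + 1.0

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
print(torch.allclose(f_g_h_eager(x), f_g_h_fused(x)))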
Megatron-LM also uses a Fused implementation of AdamW from Apex which is faster than the PyTorch implementation.
While one can customize the DataLoader like Megatron-LM and use Apex's Fused optimizer with transformers, it is not a beginner-friendly undertaking to build custom Fused CUDA Kernels.
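For reference, here is a small sketch of what swapping in Apex's fused optimizer could look like. It assumes Apex is installed (e.g. inside the NGC container mentioned below), and the resulting optimizer could then be handed to the 🤗 Trainer through its optimizers argument:

import torch
from apex.optimizers import FusedAdam  # requires NVIDIA Apex

model = torch.nn.Linear(768, 768).cuda()

# plain PyTorch AdamW: the parameter update is made of several separate kernels
adamw = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

# Apex's FusedAdam runs in AdamW mode by default (adam_w_mode=True) and fuses
# the update step into fewer CUDA kernels
fused_adamw = FusedAdam(model.parameters(), lr=5e-4, weight_decay=0.1)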
Now that you are familiar with the framework and what makes it advantageous, let's get into the training details!
How to train with Megatron-LM?
Setup
The easiest way to setup the environment is to pull an NVIDIA PyTorch Container that comes with all the required installations from NGC. See the documentation for more details. If you don't want to use this container, you will need to install the latest PyTorch, CUDA, NCCL, and NVIDIA APEX releases and the nltk library.
After having installed Docker, you can run the container with the following command (xx.xx denotes your Docker version), and then clone the Megatron-LM repository inside it:
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:xx.xx-py3
git clone https://github.com/NVIDIA/Megatron-LM
You also need to add the vocabulary file vocab.json and merges table merges.txt of your tokenizer inside the Megatron-LM folder of your container. These files can be found in the model's repository with the weights, see this repository for GPT2. You can also train your own tokenizer using transformers. You can check out the CodeParrot project for a practical example.
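As a sketch of what that could look like (the vocabulary size and output folder below are arbitrary choices for illustration), you can retrain the GPT2 tokenizer on your own data and produce the vocab.json and merges.txt files:

from datasets import load_dataset
from transformers import AutoTokenizer

# stream the corpus and feed it to the tokenizer trainer in batches
raw_data = load_dataset("codeparrot/codeparrot-clean-train", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    batch = []
    for sample in raw_data:
        batch.append(sample["content"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# start from the GPT2 tokenizer and retrain its BPE vocabulary on the new data
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32768)
new_tokenizer.save_pretrained("custom-tokenizer")  # writes vocab.json and merges.txt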
Now if you want to copy this data from outside the container, you can use the following commands:
sudo docker cp vocab.json CONTAINER_ID:/workspace/Megatron-LM
sudo docker cp merges.txt CONTAINER_ID:/workspace/Megatron-LM
Data preprocessing
In the rest of this tutorial we will be using the CodeParrot model and data as an example.
The training data requires some preprocessing. First, you need to convert it into a loose json format, with one json containing a text sample per line. If you're using 🤗 Datasets, here is an example of how to do that (always inside the Megatron-LM folder):
from datasets import load_dataset
train_data = load_dataset('codeparrot/codeparrot-clean-train', split='train')
train_data.to_json("codeparrot_data.json", lines=True)
The data is then tokenized, shuffled and processed into a binary format for training using the following command:
pip install nltk
python tools/preprocess_data.py \
       --input codeparrot_data.json \
       --output-prefix codeparrot \
       --vocab vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file merges.txt \
       --json-keys content \
       --workers 32 \
       --chunk-size 25 \
       --append-eod
The workers and chunk_size options refer to the number of workers used in the preprocessing and the chunk size of data assigned to each one. dataset-impl refers to the implementation mode of the indexed datasets from ['lazy', 'cached', 'mmap'].
This outputs two files, codeparrot_content_document.idx and codeparrot_content_document.bin, which are used in the training.
Training
You can configure the model architecture and training parameters as shown below, or put it in a bash script that you will run. This command runs the pretraining on 8 GPUs for a 110M parameter CodeParrot model. Note that the data is partitioned by default into a 969:30:1 ratio for training/validation/test sets.
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT_PATH=/workspace/Megatron-LM/experiments/codeparrot-small
VOCAB_FILE=vocab.json
MERGE_FILE=merges.txt
DATA_PATH=codeparrot_content_document
GPT_ARGS="--num-layers 12
--hidden-size 768
--num-attention-heads 12
--seq-length 1024
--max-position-embeddings 1024
--micro-batch-size 12
--global-batch-size 192
--lr 0.0005
--train-iters 150000
--lr-decay-iters 150000
--lr-decay-style cosine
--lr-warmup-iters 2000
--weight-decay .1
--adam-beta2 .999
--fp16
--log-interval 10
--save-interval 2000
--eval-interval 200
--eval-iters 10
"
TENSORBOARD_ARGS="--tensorboard-dir experiments/tensorboard"
python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
        pretrain_gpt.py \
        --tensor-model-parallel-size 1 \
        --pipeline-model-parallel-size 1 \
        $GPT_ARGS \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --save $CHECKPOINT_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        $TENSORBOARD_ARGS
With this setting, the training takes roughly 12 hours.
This setup uses Data Parallelism, but it is also possible to use Model Parallelism for very large models that don't fit in one GPU. The first option consists of Tensor Parallelism, which splits the execution of a single transformer module over multiple GPUs; you will need to change the tensor-model-parallel-size parameter to the desired number of GPUs. The second option is Pipeline Parallelism, where the transformer modules are split into equally sized stages. The parameter pipeline-model-parallel-size determines the number of stages to split the model into. For more details please refer to this blog.
Converting the model to 🤗 Transformers
After training we want to use the model in transformers, e.g. for evaluation or to deploy it to production. You can convert it to a transformers model following this tutorial. For instance, after the training is done you can copy the weights of the last iteration 150k and convert the model_optim_rng.pt file to a pytorch_model.bin file that is supported by transformers with the following commands:
mkdir -p nvidia/megatron-codeparrot-small
sudo docker cp CONTAINER_ID:/workspace/Megatron-LM/experiments/codeparrot-small/iter_0150000/mp_rank_00/model_optim_rng.pt nvidia/megatron-codeparrot-small
git clone https://github.com/huggingface/transformers.git
git clone https://github.com/NVIDIA/Megatron-LM.git
export PYTHONPATH=Megatron-LM
python transformers/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py nvidia/megatron-codeparrot-small/model_optim_rng.pt
Be careful, you will need to replace the generated vocabulary file and merges table after the conversion with the original ones we introduced earlier, if you plan to load the tokenizer from there.
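One way to do that (a minimal sketch, assuming you kept the original vocab.json and merges.txt in the working directory) is to rebuild the tokenizer from those files and save it next to the converted weights:

from transformers import GPT2Tokenizer

# rebuild the tokenizer from the original files used for preprocessing and save
# it in the converted model folder so AutoTokenizer picks up the right files
tokenizer = GPT2Tokenizer(vocab_file="vocab.json", merges_file="merges.txt")
tokenizer.save_pretrained("nvidia/megatron-codeparrot-small")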
Do not forget to push your model to the hub and share it with the community, it only takes three lines of code 🤗:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("nvidia/megatron-codeparrot-small")
model.push_to_hub("codeparrot-small")
You can also use it to easily generate text:
from transformers import pipeline
pipe = pipeline("text-generation", model="your_username/codeparrot-small")
outputs = pipe("def hello_world():")
print(outputs[0]["generated_text"])
def hello_world():
print("Hello World!")
Transformers also handles big model inference efficiently. If you trained a very large model (e.g. using Model Parallelism), you can easily use it for inference with the following command:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("your_username/codeparrot-large", device_map="auto")
This will use the Accelerate library behind the scenes to automatically dispatch the model weights across the devices you have available (GPUs, CPU RAM).
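If you are curious how the weights were split across your devices, you can inspect the device map that Accelerate produced for the model loaded above (a small illustrative snippet):

# maps each submodule of the model loaded with device_map="auto" above
# to the device it was dispatched to (GPU index, "cpu" or "disk")
print(model.hf_device_map)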
Disclaimer: We have shown that anyone can use Megatron-LM to train language models. The question is when to use it. This framework obviously adds some time overhead because of the extra preprocessing and conversion steps. So it is important that you decide which framework is more appropriate for your case and model size. We recommend trying it for pre-training models or extended fine-tuning, but probably not for shorter fine-tuning of medium-sized models. The Trainer API and 🤗 Accelerate library are also very handy for model training, they are device-agnostic and give significant flexibility to the users.
Congratulations 🎉 now you know how to train a GPT2 model in Megatron-LM and make it supported by transformers!
