TPU training is a useful skill to have: TPU pods are high-performance and very scalable, making it easy to train models at any scale from a few tens of millions of parameters up to truly enormous sizes: Google’s PaLM model (over 500 billion parameters!) was trained entirely on TPU pods.
We’ve previously written a tutorial and a Colab example showing small-scale TPU training with TensorFlow and introducing the core concepts you need to understand to get your model working on TPU. This time, we’re going to step that up another level and train a masked language model from scratch using TensorFlow and TPU, including every step from training your tokenizer and preparing your dataset through to the final model training and uploading. This is the kind of task that you’ll probably want a dedicated TPU node (or VM) for, rather than just Colab, and so that’s where we’ll focus.
As in our Colab example, we’re taking advantage of TensorFlow’s very clean TPU support via XLA and TPUStrategy. We’ll also be taking advantage of the fact that the majority of the TensorFlow models in 🤗 Transformers are fully XLA-compatible, so surprisingly little work is needed to get them to run on TPU.
Unlike our Colab example, however, this example is designed to be scalable and much closer to a realistic training run. Although we only use a BERT-sized model by default, the code could be expanded to a much larger model and a much more powerful TPU pod slice by changing a few configuration options.
Motivation
Why are we writing this guide now? After all, 🤗 Transformers has had support for TensorFlow for several years now. But getting those models to train on TPUs has been a major pain point for the community. This is because:
- Many models weren’t XLA-compatible
- Data collators didn’t use native TF operations
We believe XLA is the future: It’s the core compiler for JAX, it has first-class support in TensorFlow, and you can even use it from PyTorch. As such, we’ve made a major push to make our codebase XLA-compatible and to remove any other roadblocks standing in the way of XLA and TPU compatibility. This means users should be able to train most of our TensorFlow models on TPUs without hassle.
There’s also another important reason to care about TPU training right now: Recent major advances in LLMs and generative AI have created huge public interest in model training, and so it’s become incredibly hard for most people to get access to state-of-the-art GPUs. Knowing how to train on TPU gives you another path to access ultra-high-performance compute hardware, which is much more dignified than losing a bidding war for the last H100 on eBay and then ugly-crying at your desk. You deserve better. And speaking from experience: Once you get comfortable with training on TPU, you may not want to go back.
What to expect
We’re going to train a RoBERTa (base model) from scratch on the WikiText dataset (v1). In addition to training the model, we’re also going to train the tokenizer, tokenize the data and upload it to Google Cloud Storage in TFRecord format, where it’ll be accessible for TPU training. You can find all the code in this directory. If you’re a certain kind of person, you can skip the rest of this blog post and just jump straight to the code. If you stick around, though, we’ll take a deeper look at some of the key ideas in the codebase.
Many of the ideas here were also mentioned in our Colab example, but we wanted to show users a full end-to-end example that puts it all together and shows it in action, rather than just covering concepts at a high level. The following diagram gives you a pictorial overview of the steps involved in training a language model with 🤗 Transformers using TensorFlow and TPUs:

Getting the data and training a tokenizer
As mentioned, we used the WikiText dataset (v1). You can head over to the dataset page on the Hugging Face Hub to explore the dataset.

Since the dataset is already available on the Hub in a compatible format, we can easily load and interact with it using 🤗 datasets. However, for this example, since we’re also training a tokenizer from scratch, here’s what we did:
- Loaded the train split of the WikiText using 🤗 datasets.
- Leveraged 🤗 tokenizers to train a Unigram model.
- Uploaded the trained tokenizer to the Hub.
You can find the tokenizer training code here and the tokenizer here. The script also allows you to run it with any compatible dataset from the Hub.
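If you’re curious what that training step looks like in code, here’s a minimal sketch using 🤗 datasets and 🤗 tokenizers. The dataset config, vocabulary size and special tokens below are illustrative assumptions; the linked script is the actual reference:
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Load the train split (the exact WikiText config here is an assumption).
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Set up a Unigram tokenizer and its trainer.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.UnigramTrainer(
    vocab_size=16000,  # illustrative value
    special_tokens=["<unk>", "<pad>", "<mask>"],
    unk_token="<unk>",
)

# Stream the raw text to the trainer in batches.
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
tokenizer.save("unigram-tokenizer.json")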
💡 It’s easy to use 🤗 datasets to host your text datasets. Refer to this guide to learn more.
Tokenizing the data and creating TFRecords
Once the tokenizer is trained, we can apply it to all the dataset splits (train, validation, and test in this case) and create TFRecord shards out of them. Having the data splits spread across multiple TFRecord shards helps with massively parallel processing, as opposed to having each split in a single TFRecord file.
We tokenize the samples individually. We then take a batch of samples, concatenate them together, and split them into several chunks of a fixed size (128 in our case). We follow this strategy rather than tokenizing a batch of samples with a fixed length to avoid aggressively discarding text content (because of truncation).
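Concretely, the concatenate-then-chunk step can be sketched as a batched Dataset.map function along the lines of the snippet below. This is a simplified illustration, assuming tokenizer is the trained tokenizer loaded via AutoTokenizer and wikitext is the full DatasetDict; the linked script is the actual reference:
max_seq_length = 128

def group_texts(examples):
    # Tokenize each sample individually, with no truncation or padding.
    tokenized = tokenizer(examples["text"])
    # Concatenate everything in the batch, then slice it into fixed-size chunks,
    # dropping the leftover tokens that don't fill a full chunk.
    concatenated = {k: sum(tokenized[k], []) for k in tokenized.keys()}
    total_length = (len(concatenated["input_ids"]) // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

# Applies to every split in the DatasetDict (train, validation, test).
tokenized_splits = wikitext.map(group_texts, batched=True, remove_columns=["text"])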
We then take these tokenized samples in batches and serialize those batches as multiple TFRecord shards, where the total dataset length and individual shard size determine the number of shards. Finally, these shards are pushed to a Google Cloud Storage (GCS) bucket.
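Continuing from the sketch above, a simplified version of the serialization step could look like the following. The feature layout, shard size and bucket name are illustrative assumptions rather than the exact values used in the linked script:
import tensorflow as tf

def serialize_example(input_ids):
    # Each example is a single fixed-length sequence of token IDs.
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

records_per_shard = 100_000  # illustrative shard size
examples = tokenized_splits["train"]["input_ids"]
num_shards = -(-len(examples) // records_per_shard)  # ceiling division

for shard in range(num_shards):
    chunk = examples[shard * records_per_shard : (shard + 1) * records_per_shard]
    # TFRecordWriter can write directly to a gs:// path (hypothetical bucket name here).
    filename = f"gs://my-tfrecord-bucket/train/train-{shard:05d}.tfrecord"
    with tf.io.TFRecordWriter(filename) as writer:
        for input_ids in chunk:
            writer.write(serialize_example(input_ids))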
If you’re using a TPU node for training, then the data needs to be streamed from a GCS bucket since the node host memory is very small. But for TPU VMs, we can use datasets locally or even attach persistent storage to those VMs. Since TPU nodes are still quite heavily used, we based our example on using a GCS bucket for data storage.
You can see all of this in code in this script. For convenience, we have also hosted the resulting TFRecord shards in this repository on the Hub.
Training a model on data in GCS
If you’re familiar with using 🤗 Transformers, then you already know the modeling code:
from transformers import AutoConfig, AutoTokenizer, TFAutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("tf-tpu/unigram-tokenizer-wikitext")
config = AutoConfig.from_pretrained("roberta-base")
config.vocab_size = tokenizer.vocab_size
model = TFAutoModelForMaskedLM.from_config(config)
But since we’re in TPU territory, we need to perform this initialization under a strategy scope so that it can be distributed across the TPU workers with data-parallel training:
import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(...)
strategy = tf.distribute.TPUStrategy(tpu)
with strategy.scope():
    tokenizer = AutoTokenizer.from_pretrained("tf-tpu/unigram-tokenizer-wikitext")
    config = AutoConfig.from_pretrained("roberta-base")
    config.vocab_size = tokenizer.vocab_size
    model = TFAutoModelForMaskedLM.from_config(config)
Similarly, the optimizer also needs to be initialized under the same strategy scope with which the model is going to be compiled. Going over the full training code isn’t something we want to do in this post, so we welcome you to read it here. Instead, let’s discuss another key point: a TensorFlow-native data collator, DataCollatorForLanguageModeling.
DataCollatorForLanguageModeling is responsible for masking randomly chosen tokens from the input sequence and preparing the labels. By default, we return the results from these collators as NumPy arrays. However, many collators also support returning these values as TensorFlow tensors if we specify return_tensors="tf". This was crucial for our data pipeline to be compatible with TPU training.
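Setting up the collator for TensorFlow looks roughly like this (the masking probability shown is the library’s usual default rather than a value taken from our script):
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,               # masked language modeling
    mlm_probability=0.15,   # illustrative; this is the default masking rate
    return_tensors="tf",    # return TensorFlow tensors instead of NumPy arrays
)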
Thankfully, TensorFlow provides seamless support for reading files from a GCS bucket:
training_records = tf.io.gfile.glob(os.path.join(args.train_dataset, "*.tfrecord"))
If args.dataset contains the gs:// identifier, TensorFlow will understand that it needs to look into a GCS bucket. Loading locally is as easy as removing the gs:// identifier. For the rest of the data pipeline-related code, you can refer to this section in the training script.
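To give a flavour of that pipeline, here’s a minimal sketch of streaming the shards with tf.data. It assumes the feature layout from the serialization sketch above, and per_replica_batch_size is a hypothetical argument; the collator is then applied to each batch, and the linked training script is the actual reference for how that’s wired in:
import os
import tensorflow as tf

max_seq_length = 128  # must match the chunk size used when writing the shards

def decode_fn(example):
    # Parse a single serialized tf.train.Example back into a fixed-length tensor.
    features = {
        "input_ids": tf.io.FixedLenFeature(dtype=tf.int64, shape=(max_seq_length,)),
    }
    return tf.io.parse_single_example(example, features)

training_records = tf.io.gfile.glob(os.path.join(args.train_dataset, "*.tfrecord"))
train_dataset = (
    tf.data.TFRecordDataset(training_records, num_parallel_reads=tf.data.AUTOTUNE)
    .map(decode_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(per_replica_batch_size * strategy.num_replicas_in_sync, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)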
Once the datasets have been prepared, the model and the optimizer have been initialized, and the model has been compiled, we can do the community’s favorite: model.fit(). For training, we didn’t do extensive hyperparameter tuning. We just trained it for longer with a learning rate of 1e-4. We also leveraged the PushToHubCallback for model checkpointing and syncing the checkpoints with the Hub. You can find the hyperparameter details and a trained model here: https://huggingface.co/tf-tpu/roberta-base-epochs-500-no-wd.
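Condensed, the compile-and-fit step looks something like the sketch below. The optimizer class, eval_dataset and num_epochs are illustrative assumptions, and the 1e-4 learning rate is the value mentioned above; consult the linked script and model card for the exact settings:
from transformers.keras_callbacks import PushToHubCallback

with strategy.scope():
    # The optimizer must be created under the same strategy scope as the model.
    optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-4)
    # With no explicit loss, Transformers models fall back to their internal (MLM) loss.
    model.compile(optimizer=optimizer)

callbacks = [PushToHubCallback(output_dir="./model-checkpoints", tokenizer=tokenizer)]
model.fit(
    train_dataset,
    validation_data=eval_dataset,
    epochs=num_epochs,
    callbacks=callbacks,
)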
Once the model is trained, running inference with it is as easy as:
from transformers import pipeline
model_id = "tf-tpu/roberta-base-epochs-500-no-wd"
unmasker = pipeline("fill-mask", model=model_id, framework="tf")
unmasker("Goal of my life is to [MASK].")
[{'score': 0.1003185287117958,
'token': 52,
'token_str': 'be',
'sequence': 'Goal of my life is to be.'},
{'score': 0.032648514956235886,
'token': 5,
'token_str': '',
'sequence': 'Goal of my life is to .'},
{'score': 0.02152673341333866,
'token': 138,
'token_str': 'work',
'sequence': 'Goal of my life is to work.'},
{'score': 0.019547373056411743,
'token': 984,
'token_str': 'act',
'sequence': 'Goal of my life is to act.'},
{'score': 0.01939118467271328,
'token': 73,
'token_str': 'have',
'sequence': 'Goal of my life is to have.'}]
Conclusion
If there’s one thing we want to emphasize with this example, it’s that TPU training is powerful, scalable and easy. In fact, if you’re already using Transformers models with TF/Keras and streaming data from tf.data, you might be shocked at how little work it takes to move your whole training pipeline to TPU. TPUs have a reputation as somewhat arcane, high-end, complex hardware, but they’re quite approachable, and instantiating a large pod slice is definitely easier than keeping multiple GPU servers in sync!
Diversifying the hardware that state-of-the-art models are trained on is going to be critical in the 2020s, especially if the ongoing GPU shortage continues. We hope that this guide gives you the tools you need to power cutting-edge training runs no matter what circumstances you face.
As the great poet GPT-4 once said:
If you can keep your head when all around you
Are losing theirs to GPU droughts,
And trust your code, while others doubt you,
To train on TPUs, no second thoughts;
If you can learn from errors, and proceed,
And optimize your aim to reach the sky,
Yours is the path to AI mastery,
And you’ll prevail, my friend, as time goes by.
Sure, it’s shamelessly ripping off Rudyard Kipling and it has no idea how to pronounce “drought”, but we hope you feel inspired regardless.
