Hugging Face’s TensorFlow Philosophy

By Matthew Carrigan

Introduction

Despite increasing competition from PyTorch and JAX, TensorFlow remains the most-used deep learning framework. It also differs from those other two libraries in some very important ways. In particular, it's quite tightly integrated with its high-level API Keras, and its data loading library tf.data.

There is a tendency among PyTorch engineers (picture me staring darkly across the open-plan office here) to see this as a problem to be overcome; their goal is to figure out how to make TensorFlow get out of their way so they can use the low-level training and data-loading code they're used to. This is entirely the wrong way to approach TensorFlow! Keras is a great high-level API. If you push it out of the way in any project bigger than a couple of modules, you'll end up reproducing most of its functionality yourself when you realize you need it.

As refined, respected and highly attractive TensorFlow engineers, we want to use the incredible power and flexibility of cutting-edge models, but we want to handle them with the tools and APIs we're familiar with. This blogpost is about the choices we make at Hugging Face to enable that, and what to expect from the framework as a TensorFlow programmer.



Interlude: 30 Seconds to 🤗

Experienced users should feel free to skim or skip this section, but if this is your first encounter with Hugging Face and transformers, I should start by giving you an overview of the core idea of the library: You just ask for a pretrained model by name, and you get it in one line of code. The easiest way is to use the TFAutoModel class:

from transformers import TFAutoModel

model = TFAutoModel.from_pretrained("bert-base-cased")

This one line will instantiate the model architecture and load the weights, giving you an exact replica of the original, famous BERT model. This model won't do much by itself, though – it lacks an output head and a loss function. In effect, it is the "stem" of a neural net that stops right after the last hidden layer. So how do you put an output head on it? Simple, just use a different AutoModel class. Here we load the Vision Transformer (ViT) model and add an image classification head:

from transformers import TFAutoModelForImageClassification

model_name = "google/vit-base-patch16-224"
model = TFAutoModelForImageClassification.from_pretrained(model_name)

Now our model has an output head and, optionally, a loss function appropriate for its new task. If the new output head differs from the original model, its weights will be randomly initialized; all other weights will be loaded from the original model. But why do we do this? Why would we use the stem of an existing model, instead of just making the model we need from scratch?

It turns out that big models pretrained on lots of data are much, much better starting points for almost any ML problem than the standard approach of simply randomly initializing your weights. This is called transfer learning, and if you think about it, it makes sense – solving a textual task well requires some knowledge of language, and solving a visual task well requires some knowledge of images and space. The reason ML is so data-hungry without transfer learning is simply that this basic domain knowledge has to be relearned from scratch for every problem, which requires a huge volume of training examples. With transfer learning, however, a problem can often be solved with a thousand training examples that might have required a million without it, and often with a higher final accuracy. For more on this topic, check out the relevant sections of the Hugging Face Course!

When using transfer learning, however, it's very important that you process inputs to the model the same way they were processed during training. This ensures that the model has to relearn as little as possible when we transfer its knowledge to a new problem. In transformers, this preprocessing is usually handled with tokenizers. Tokenizers can be loaded in the same way as models, using the AutoTokenizer class. Make sure you load the tokenizer that matches the model you want to use!

from transformers import TFAutoModel, AutoTokenizer

# Make sure the tokenizer matches the model!
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModel.from_pretrained("bert-base-cased")

# Tokenize the strings as NumPy arrays, padded to equal length
test_strings = ["This is a sentence!", "This is another one!"]
tokenized_inputs = tokenizer(test_strings, return_tensors="np", padding=True)

# The tokenizer output can be passed straight to the model
outputs = model(tokenized_inputs)
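What does the tokenizer actually hand back? A dict of arrays, where padding=True pads the shorter sequence and an attention_mask records which positions are real tokens. A pure-NumPy illustration (the ids here are made up, not real BERT vocabulary ids):

```python
import numpy as np

# Two "tokenized" sequences of different lengths, padded out to length 6.
# We use 0 as the padding id here; real tokenizers define their own.
input_ids = np.array([[7, 12, 9, 4, 3, 2],
                      [7, 15, 2, 0, 0, 0]])

# 1 marks a real token, 0 marks padding the model should ignore.
attention_mask = (input_ids != 0).astype(np.int64)
print(attention_mask.tolist())  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```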

This is just a taste of the library, of course – if you want more, you can check out our notebooks, or our code examples. There are also several other examples of the library in action at keras.io!

At this point, you now understand some of the basic concepts and classes in transformers. Everything I've written above is framework-agnostic (with the exception of the "TF" in TFAutoModel), but when you want to actually train and serve your model, that's when things start to diverge between the frameworks. And that brings us to the main focus of this article: As a TensorFlow engineer, what should you expect from transformers?



Philosophy #1: All TensorFlow models should be Keras Model objects, and all TensorFlow layers should be Keras Layer objects.

This almost goes without saying for a TensorFlow library, but it's worth emphasizing regardless. From the user's perspective, the most important effect of this choice is that you can call Keras methods like fit(), compile() and predict() directly on our models.

For example, assuming your data is already prepared and tokenized, getting predictions from a sequence classification model with TensorFlow is as simple as:

model = TFAutoModelForSequenceClassification.from_pretrained(my_model)
model.predict(my_data)
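Note that predict() on a classification model returns logits rather than probabilities; turning them into class predictions is a one-line argmax. A sketch with made-up logits (in transformers, the logits live in the .logits field of the returned output object):

```python
import numpy as np

# Made-up logits shaped (batch_size, num_labels), standing in for
# the .logits field of a sequence classification model's output.
logits = np.array([[ 2.1, -1.0],
                   [-0.3,  0.8]])

# Argmax over the label axis gives the predicted class per example.
predicted_classes = logits.argmax(axis=-1)
print(predicted_classes.tolist())  # [0, 1]
```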

And if you want to train that model instead, it's just:

model.fit(my_data, my_labels)

However, this convenience doesn't mean you're limited to tasks that we support out of the box. Keras models can be composed as layers in other models, so if you have a giant galaxy-brained idea that involves splicing together five different models, there's nothing stopping you, except possibly your limited GPU memory. Maybe you want to merge a pretrained language model with a pretrained vision transformer to create a hybrid, like DeepMind's recent Flamingo, or you want to create the next viral text-to-image sensation like DALL-E Mini (Craiyon)? Here's an example of a hybrid model using Keras subclassing:

import tensorflow as tf
from transformers import TFAutoModel

class HybridVisionLanguageModel(tf.keras.Model):
  def __init__(self):
    super().__init__()
    self.language = TFAutoModel.from_pretrained("gpt2")
    self.vision = TFAutoModel.from_pretrained("google/vit-base-patch16-224")

  def call(self, inputs):
    # You have to write the forward pass yourself this time,
    # combining the two pretrained models however you like!
    ...



Philosophy #2: Loss functions are provided by default, but can be easily changed.

In Keras, the standard way to train a model is to create it, then compile() it with an optimizer and loss function, and finally fit() it. It's very easy to load a model with transformers, but setting the loss function can be tricky – even for standard language model training, your loss function can be surprisingly non-obvious, and some hybrid models have extremely complex losses.

Our solution to this is simple: If you compile() without a loss argument, we'll give you the one you probably wanted. Specifically, we'll give you one that matches both your base model and output type – if you compile() a BERT-based masked language model without a loss, we'll give you a masked language modelling loss that handles padding and masking correctly, and will only compute losses on corrupted tokens, exactly matching the original BERT training process. If for some reason you really, really don't want your model to be compiled with any loss at all, simply specify loss=None when compiling.

model = TFAutoModelForQuestionAnswering.from_pretrained("bert-base-cased")
model.compile(optimizer="adam")  # No loss argument - the default loss is used!
model.fit(my_data, my_labels)
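To make the "only compute losses on corrupted tokens" idea concrete, here's a rough pure-NumPy sketch of what a masked language modelling loss does: cross-entropy over the vocabulary, computed only at the corrupted positions (the -100 label convention for ignored positions is the one transformers uses):

```python
import numpy as np

def masked_lm_loss(logits, labels):
    # Softmax over the vocabulary axis (numerically stabilized).
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)

    # Positions labelled -100 (uncorrupted tokens, padding) are ignored.
    mask = labels != -100
    # Probability assigned to the correct token at each masked position.
    token_probs = probs[mask, labels[mask]]
    return -np.log(token_probs).mean()

# Batch of 1, two positions, vocabulary of 3.
# Only the first position was corrupted, so only it contributes.
logits = np.array([[[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]])
labels = np.array([[0, -100]])
print(round(float(masked_lm_loss(logits, labels)), 4))  # 0.0134
```

The real loss is of course implemented in TensorFlow rather than NumPy, but the masking logic is the important part.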

But also, and very importantly, we want to get out of your way as soon as you want to do something more complex. If you specify a loss argument to compile(), the model will use that instead of the default loss. And, of course, if you make your own subclassed model like the HybridVisionLanguageModel above, you have complete control over every aspect of the model's functionality via the call() and train_step() methods you write.



Philosophy #3: Labels are flexible

One source of confusion in the past was where exactly labels should be passed to the model. The standard way to pass labels to a Keras model is as a separate argument, or as part of an (inputs, labels) tuple:

model.fit(inputs, labels)

In the past, we instead asked users to pass labels in the input dict when using the default loss. The reason for this was that the code for computing the loss for that particular model was contained in the call() forward pass method. This worked, but it was definitely non-standard for Keras models, and caused several issues, including incompatibilities with standard Keras metrics, not to mention some user confusion. Thankfully, this is no longer necessary. We now recommend that labels are passed in the normal Keras way, although the old method still works for backward compatibility. In general, a lot of things that used to be fiddly should now "just work" for our TensorFlow models – give them a try!



Philosophy #4: You shouldn't have to write your own data pipeline, especially for common tasks

In addition to transformers, a huge open repository of pretrained models, there is also 🤗 datasets, a huge open repository of datasets – text, vision, audio and more. These datasets convert easily to TensorFlow Tensors and NumPy arrays, making it easy to use them as training data. Here's a quick example showing us tokenizing a dataset and converting it to NumPy. As always, make sure your tokenizer matches the model you want to train with, or things will get very weird!

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

dataset = load_dataset("glue", "cola")  # Simple text classification dataset
dataset = dataset["train"]  # Just take the training split for now

# Tokenize the whole dataset up front, as padded NumPy arrays
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
labels = np.array(dataset["label"])  # Labels are already an array of 0s and 1s

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
# Low learning rates usually work better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5))

model.fit(tokenized_data, labels)

This approach is great when it works, but for larger datasets you might find it starting to become a problem. Why? Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn't handle "jagged" arrays, every tokenized sample would have to be padded to the length of the longest sample in the whole dataset. That's going to make your array even bigger, and all those padding tokens will slow down training too!
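The scale of the problem is easy to see with a toy example. Here's a rough NumPy sketch with hypothetical sequence lengths, comparing whole-dataset padding against padding each batch only to its own longest sample:

```python
import numpy as np

# Hypothetical tokenized lengths: three short samples and one long outlier.
lengths = np.array([12, 18, 25, 512])

# Padding the whole dataset: every sample is padded to the global max.
whole_dataset_padding = int((lengths.max() - lengths).sum())

# Padding per batch (batch size 2): pad only to each batch's own max.
batches = [lengths[:2], lengths[2:]]
per_batch_padding = sum(int((b.max() - b).sum()) for b in batches)

print(whole_dataset_padding, per_batch_padding)  # 1481 493
```

One long outlier forces every other sample in the dataset to be padded out to its length; per-batch padding confines that cost to the batches the outlier actually appears in.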

As a TensorFlow engineer, this is normally where you'd turn to tf.data to build a pipeline that streams the data from storage rather than loading it all into memory. That's a hassle, though, so we've got you covered. First, let's use the map() method to add the tokenizer columns to the dataset. Remember that our datasets are disc-backed by default – they won't load into memory until you convert them into arrays!

def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["sentence"])

dataset = dataset.map(tokenize_dataset)

Now our dataset has the columns we want, but how do we train on it? Simple – wrap it in a tf.data.Dataset and all our problems are solved – data is loaded on the fly, and padding is applied only to batches rather than the whole dataset, which means we need far fewer padding tokens:

tf_dataset = model.prepare_tf_dataset(
    dataset,
    batch_size=16,
    shuffle=True
)

model.fit(tf_dataset)

Why is prepare_tf_dataset() a method on your model? Simple: Because your model knows which columns are valid as inputs, and it automatically filters out columns in the dataset that aren't valid input names! If you'd rather have more precise control over the tf.data.Dataset being created, you can use the lower-level Dataset.to_tf_dataset() instead.



Philosophy #5: XLA is great!

XLA is the just-in-time compiler shared by TensorFlow and JAX. It converts linear algebra code into more optimized versions that run faster and use less memory. It's really cool and we try to make sure we support it as much as possible. It's extremely important for allowing models to run on TPU, but it offers speed boosts for GPU and even CPU as well! To use it, simply compile() your model with the jit_compile=True argument (this works for all Keras models, not just Hugging Face ones):

model.compile(optimizer="adam", jit_compile=True)

We've made a number of major improvements in this area recently. Most importantly, we've updated our generate() code to use XLA – this is the function that iteratively generates text output from language models. This has resulted in massive performance improvements – our legacy TF code was much slower than PyTorch, but the new code is much faster than it, and similar to JAX in speed! For more information, please see our blogpost about XLA generation.

XLA is useful for things besides generation too, though! We've also made a number of fixes to ensure that you can train your models with XLA, and as a result our TF models have reached JAX-like speeds for tasks like language model training.

It's important to be clear about the main limitation of XLA, though: XLA expects input shapes to be static. This means that if your task involves variable sequence lengths, you will need to run a new XLA compilation for every different input shape you pass to your model, which can really negate the performance benefits! You can see some examples of how we deal with this in our TensorFlow notebooks and in the XLA generation blogpost above.
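One common mitigation is to pad every batch to a fixed length, or to a multiple of some bucket size (the tokenizer's pad_to_multiple_of argument), so only a handful of distinct shapes ever reach the compiled function. The round-up arithmetic is just:

```python
def padded_length(seq_len: int, multiple: int = 64) -> int:
    # Round seq_len up to the nearest multiple, mirroring what
    # pad_to_multiple_of=64 does to a batch's sequence dimension.
    return -(-seq_len // multiple) * multiple

# Lengths 1-64 all compile to one shape, 65-128 to a second, and so on.
print(padded_length(57), padded_length(64), padded_length(130))  # 64 64 192
```

With bucketing like this, XLA recompiles once per bucket instead of once per distinct sequence length, at the cost of a few extra padding tokens per batch.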



Philosophy #6: Deployment is just as important as training

TensorFlow has a rich ecosystem, particularly around model deployment, that the other, more research-focused frameworks lack. We're actively working on letting you use those tools to deploy your whole model for inference. We're particularly interested in supporting TF Serving and TFX. If this is interesting to you, please check out our blogpost on deploying models with TF Serving!

One major obstacle in deploying NLP models, however, is that inputs still need to be tokenized, which means it isn't enough to just deploy your model. A dependency on tokenizers can be annoying in a lot of deployment scenarios, so we're working to make it possible to embed tokenization into the model itself, allowing you to deploy a single model artifact that handles the whole pipeline from input strings to output predictions. Right now, we only support the most common models like BERT, but this is an active area of work! If you want to try it, you can use a code snippet like this:




import tensorflow as tf
from transformers import TFAutoModel, TFBertTokenizer
# Note: TFBertTokenizer runs in-graph, and requires the
# tensorflow_text package to be installed.


class EndToEndModel(tf.keras.Model):
    def __init__(self, checkpoint):
        super().__init__()
        self.tokenizer = TFBertTokenizer.from_pretrained(checkpoint)
        self.model = TFAutoModel.from_pretrained(checkpoint)

    def call(self, inputs):
        # Tokenize inside the model, so raw strings are valid inputs
        tokenized = self.tokenizer(inputs)
        return self.model(**tokenized)

model = EndToEndModel(checkpoint="bert-base-cased")

test_inputs = [
    "This is a test sentence!",
    "This is another one!",
]
model.predict(test_inputs)  # Pass strings straight to our model!



Conclusion: We're an open-source project, and that means community is everything

Made a cool model? Share it! Once you've made an account and set your credentials, it's as easy as:

model_name = "google/vit-base-patch16-224"
model = TFAutoModelForImageClassification.from_pretrained(model_name)

model.fit(my_data, my_labels)

model.push_to_hub("my-new-model")

You can also use the PushToHubCallback to upload checkpoints regularly during a longer training run! Either way, you'll get a model page and an autogenerated model card, and most importantly of all, anyone else can use your model to get predictions, or as a starting point for further training, using exactly the same API as they use to load any existing model:

model_name = "your-username/my-new-model"
model = TFAutoModelForImageClassification.from_pretrained(model_name)

I think the fact that there's no distinction between big famous foundation models and models fine-tuned by a single user exemplifies the core belief at Hugging Face – the power of users to build great things. Machine learning was never meant to be a trickle of results from closed models held at a rarefied few companies; it should be a collection of open tools, artifacts, practices and knowledge that is constantly being expanded, tested, critiqued and built upon – a bazaar, not a cathedral. If you hit upon a new idea, a new method, or you train a new model with great results, let everyone know!

And, in a similar vein, are there things you're missing? Bugs? Annoyances? Things that should be intuitive but aren't? Let us know! If you're willing to get a (metaphorical) shovel and start fixing it, that's even better, but don't be shy to speak up even if you don't have the time or skillset to improve the codebase yourself. Often, the core maintainers can miss problems because users don't raise them, so don't assume we must be aware of something! If it's bothering you, please ask on the forums, or if you're pretty sure it's a bug or a missing important feature, file an issue.

A lot of these things are small details, sure, but to coin a (rather clunky) phrase, great software is made from thousands of small commits. It's through the constant collective effort of users and maintainers that open-source software improves. Machine learning is going to be a major societal issue in the 2020s, and the strength of open-source software and communities will determine whether it becomes an open and democratic force open to critique and re-evaluation, or whether it is dominated by giant black-box models whose owners will not allow outsiders, even those whom the models make decisions about, to see their precious proprietary weights. So don't be shy – if something's wrong, if you have an idea for how it could be done better, if you want to contribute but don't know where, then let us know!

(And if you can make a meme to troll the PyTorch team with after your cool new feature is merged, all the better.)




