Faster TensorFlow models in Hugging Face Transformers

Julien Plu

In the last few months, the Hugging Face team has been working hard on improving Transformers’ TensorFlow models to make them more robust and faster. The recent improvements are mainly focused on two aspects:

  1. Computational performance: BERT, RoBERTa, ELECTRA and MPNet have been improved in order to have a much faster computation time. This gain of computational performance is noticeable for all the computational aspects: graph/eager mode, TF Serving and for CPU/GPU/TPU devices.
  2. TensorFlow Serving: each of these TensorFlow models can be deployed with TensorFlow Serving to take advantage of this gain of computational performance for inference.



Computational Performance

To show the computational performance improvements, we have done a thorough benchmark where we compare BERT’s performance with TensorFlow Serving of v4.2.0 to the official implementation from Google. The benchmark has been run on a GPU V100 using a sequence length of 128 (times are in milliseconds):

Batch size   Google implementation   v4.2.0 implementation   Relative difference Google/v4.2.0 implem
1            6.7                     6.26                    6.79%
2            9.4                     8.68                    7.96%
4            14.4                    13.1                    9.45%
8            24                      21.5                    10.99%
16           46.6                    42.3                    9.67%
32           83.9                    80.4                    4.26%
64           171.5                   156                     9.47%
128          338.5                   309                     9.11%
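As a sanity check on the table, the relative-difference column appears to be the absolute gap between the two timings divided by their mean, expressed in percent. This is my reading of the numbers, not a formula stated in the post:

```python
def relative_difference(a, b):
    # Absolute gap between the two timings divided by their mean, in percent.
    return abs(a - b) / ((a + b) / 2) * 100

print(round(relative_difference(6.7, 6.26), 2))   # batch size 1 -> 6.79
print(round(relative_difference(24.0, 21.5), 2))  # batch size 8 -> 10.99
```

The same formula reproduces the other rows of the table as well.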

The current implementation of BERT in v4.2.0 is faster than the Google implementation by up to ~10%. In addition, it is twice as fast as the implementation in the 4.1.1 release.



TensorFlow Serving

The previous section demonstrates that the brand new BERT model got a dramatic increase in computational performance in the last version of Transformers. In this section, we will show you step-by-step how to deploy a BERT model with TensorFlow Serving to benefit from the increase in computational performance in a production environment.



What is TensorFlow Serving?

TensorFlow Serving belongs to the set of tools provided by TensorFlow Extended (TFX) that makes the task of deploying a model to a server easier than ever. TensorFlow Serving provides two APIs, one that can be called upon using HTTP requests and another one using gRPC to run inference on the server.



What is a SavedModel?

A SavedModel contains a standalone TensorFlow model, including its weights and its architecture. It does not require the original source of the model to be run, which makes it useful for sharing or deploying with any backend that supports reading a SavedModel, such as Java, Go, C++ or JavaScript among others. The internal structure of a SavedModel is represented as such:

savedmodel
    /assets
        -> here, the assets needed by the model (if any)
    /variables
        -> here, the model checkpoints that contain the weights
    saved_model.pb -> protobuf file representing the model graph
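The layout above can be sketched with the standard library alone, using empty placeholders standing in for the real assets, checkpoints and graph protobuf:

```python
from pathlib import Path
import tempfile

# Recreate the SavedModel skeleton described above with empty placeholders.
root = Path(tempfile.mkdtemp()) / "savedmodel"
(root / "assets").mkdir(parents=True)
(root / "variables").mkdir()
(root / "saved_model.pb").touch()

# List the top-level entries of the SavedModel directory.
layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
print(layout)  # -> ['assets', 'saved_model.pb', 'variables']
```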



How to install TensorFlow Serving?

There are three ways to install and use TensorFlow Serving:

  • through a Docker container,
  • through an apt package,
  • or using pip.

To make things easier and compatible with all existing operating systems, we will use Docker in this tutorial.



How to create a SavedModel?

SavedModel is the format expected by TensorFlow Serving. Since Transformers v4.2.0, creating a SavedModel has three additional features:

  1. The sequence length can be modified freely between runs.
  2. All model inputs are available for inference.
  3. hidden states or attentions are now grouped into a single output when returning them with output_hidden_states=True or output_attentions=True.

Below, you can find the input and output representations of a TFBertForSequenceClassification model saved as a TensorFlow SavedModel (the representation shown here is what TensorFlow’s saved_model_cli tool displays):

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_input_ids:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict

To directly pass inputs_embeds (the token embeddings) instead of input_ids (the token IDs) as input, we need to subclass the model to have a new serving signature. The following snippet of code shows how to do so:

from transformers import TFBertForSequenceClassification
import tensorflow as tf


class MyOwnModel(TFBertForSequenceClassification):
    # Decorate the serving method with the new input_signature.
    # An input_signature represents the name, the data type and the shape
    # of an expected input.
    @tf.function(input_signature=[{
        "inputs_embeds": tf.TensorSpec((None, None, 768), tf.float32, name="inputs_embeds"),
        "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        # Call the model to process the inputs.
        output = self.call(inputs)

        # Return the formatted output.
        return self.serving_output(output)


# Instantiate the model with the new serving method.
model = MyOwnModel.from_pretrained("bert-base-cased")
# Save it with saved_model=True to get a SavedModel version along with the h5 weights.
model.save_pretrained("my_model", saved_model=True)

The serving method has to be overridden with the new input_signature argument of the tf.function decorator. See the official documentation to learn more about the input_signature argument. The serving method is used to define how a SavedModel will behave when deployed with TensorFlow Serving. Now the SavedModel looks as expected; see the new inputs_embeds input:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['inputs_embeds'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 768)
      name: serving_default_inputs_embeds:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict



How to deploy and use a SavedModel?

Let’s see step by step how to deploy and use a BERT model for sentiment classification.



Step 1

Create a SavedModel. To create a SavedModel, the Transformers library lets you load a PyTorch model called nateraw/bert-base-uncased-imdb trained on the IMDB dataset and convert it to a TensorFlow Keras model for you:

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)

model.save_pretrained("my_model", saved_model=True)



Step 2

Create a Docker container with the SavedModel and run it. First, pull the TensorFlow Serving Docker image for CPU (for GPU replace serving by serving:latest-gpu):

docker pull tensorflow/serving

Next, run a serving image as a daemon named serving_base:

docker run -d --name serving_base tensorflow/serving

Copy the newly created SavedModel into the serving_base container’s models folder:

docker cp my_model/saved_model serving_base:/models/bert

Commit the container that serves the model, setting MODEL_NAME to match the model’s name (here bert); this name corresponds to the name we want to give to our SavedModel:

docker commit --change "ENV MODEL_NAME bert" serving_base my_bert_model

And kill the serving_base container we ran as a daemon, because we don’t need it anymore:

docker kill serving_base

Finally, run the image to serve our SavedModel as a daemon, mapping port 8501 (REST API) and port 8500 (gRPC API) in the container to the host, and naming the container bert:

docker run -d -p 8501:8501 -p 8500:8500 --name bert my_bert_model



Step 3

Query the model through the REST API:

from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np

sentence = "I love the new TensorFlow update in transformers."

# Load the corresponding tokenizer
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")

# Load the model config, in order to map the predicted label id to its name
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")

# Tokenize the sentence
batch = tokenizer(sentence)

# Convert the batch into a proper dict
batch = dict(batch)

# Put the example into a list of size 1, which corresponds to the batch size
batch = [batch]

# The REST API needs a JSON that contains the key "instances" to declare the examples to process
input_data = {"instances": batch}

# Query the REST API, the path corresponds to http://host:port/v1/models/model_name:method
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))

# Parse the JSON result. The results are under the root key "predictions";
# as there is only one example, take the first element of the list
result = json.loads(r.text)["predictions"][0]

# The returned results are logits, so take their absolute values
abs_scores = np.abs(result)

# Take the argmax, which corresponds to the index of the predicted label id
label_id = np.argmax(abs_scores)

# Print the proper label, here POSITIVE
print(config.id2label[label_id])
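Without a running server, the payload construction and label selection above can be dry-run with the standard library alone. The token IDs and logits below are made up for illustration:

```python
import json

# Hypothetical tokenizer output standing in for real BERT token IDs.
batch = [{
    "input_ids": [101, 1045, 2293, 102],
    "attention_mask": [1, 1, 1, 1],
    "token_type_ids": [0, 0, 0, 0],
}]

# The REST API expects a JSON body with an "instances" key.
payload = json.dumps({"instances": batch})

# Hypothetical logits as TensorFlow Serving would return them for one example.
predictions = [-1.83, 2.41]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Mirror np.argmax(np.abs(predictions)) with plain Python.
label_id = max(range(len(predictions)), key=lambda i: abs(predictions[i]))
print(id2label[label_id])  # -> POSITIVE
```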

This should return POSITIVE. It is also possible to go through the gRPC (google Remote Procedure Call) API to get the same result:

from transformers import BertTokenizerFast, BertConfig
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import grpc

sentence = "I love the new TensorFlow update in transformers."
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")

# Tokenize the sentence, this time with TensorFlow tensors as output,
# already batched to a size of 1
batch = tokenizer(sentence, return_tensors="tf")

# Create a channel connected to the gRPC port of the container
channel = grpc.insecure_channel("localhost:8500")

# Create a stub made for prediction
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create a gRPC request made for prediction
request = predict_pb2.PredictRequest()

# Set the name of the model, for this use case it is "bert"
request.model_spec.name = "bert"

# Set which signature is used to format the gRPC query, here the default one
request.model_spec.signature_name = "serving_default"

# Set the input_ids input from the input_ids given by the tokenizer
request.inputs["input_ids"].CopyFrom(tf.make_tensor_proto(batch["input_ids"]))

# Same with the attention mask
request.inputs["attention_mask"].CopyFrom(tf.make_tensor_proto(batch["attention_mask"]))

# Same with the token type ids
request.inputs["token_type_ids"].CopyFrom(tf.make_tensor_proto(batch["token_type_ids"]))

# Send the gRPC request to the server
result = stub.Predict(request)

# The output is a protobuf where the logits are stored under the key "logits"
# as a flat list of floats, accessible through .float_val
output = result.outputs["logits"].float_val

# Print the proper label, here POSITIVE
print(config.id2label[np.argmax(np.abs(output))])



Conclusion

Thanks to the latest updates applied to the TensorFlow models in transformers, one can now easily deploy their models in production using TensorFlow Serving. One of the next steps we are thinking about is to directly integrate the preprocessing part inside the SavedModel to make things even easier.


