Over the past few months, the Hugging Face team and external contributors
added a lot of vision models in TensorFlow to Transformers. This
list is growing steadily and already includes state-of-the-art
pre-trained models like Vision Transformer,
Masked Autoencoders,
RegNet,
ConvNeXt,
and many others!
When it comes to deploying TensorFlow models, you have a variety of
options. Depending on your use case, you may want to expose your model
as an endpoint or package it in an application itself. TensorFlow
provides tools that cater to each of these different scenarios.
In this post, you will see how to deploy a Vision Transformer (ViT) model (for image classification)
locally using TensorFlow Serving
(TF Serving). This will allow developers to expose the model either as a
REST or gRPC endpoint. Moreover, TF Serving supports many
deployment-specific features off-the-shelf such as model warmup,
server-side batching, etc.
To get the complete working code shown throughout this post, refer to
the Colab Notebook shown at the beginning.
Saving the Model
All TensorFlow models in 🤗 Transformers have a method named
save_pretrained(). With it, you can serialize the model weights in
the h5 format as well as in the standalone SavedModel format.
TF Serving needs a model to be present in the SavedModel format. So, let's first
load a Vision Transformer model and save it:
from transformers import TFViTForImageClassification
temp_model_dir = "vit"
ckpt = "google/vit-base-patch16-224"
model = TFViTForImageClassification.from_pretrained(ckpt)
model.save_pretrained(temp_model_dir, saved_model=True)
By default, save_pretrained() will first create a version directory
inside the path we provide to it. So, the path ultimately becomes:
{temp_model_dir}/saved_model/{version}.
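If you want, you can quickly verify the layout on disk (a minimal sketch; the exact file names can vary across Transformers versions):
import os

print(os.listdir(temp_model_dir))  # config, h5 weights, and the saved_model/ directory
print(os.listdir(os.path.join(temp_model_dir, "saved_model")))  # ['1']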
We can inspect the serving signature of the SavedModel like so:
saved_model_cli show --dir {temp_model_dir}/saved_model/1 --tag_set serve --signature_def serving_default
This should output:
The given SavedModel SignatureDef contains the following input(s):
  inputs['pixel_values'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, -1, -1)
      name: serving_default_pixel_values:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1000)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
As can be noticed, the model accepts a single 4-d input (namely
pixel_values) with the following axes: (batch_size, num_channels, height, width). For this model, the acceptable height
and width are set to 224, and the number of channels is 3. You can verify
this by inspecting the config attribute of the model (model.config).
The model yields a 1000-d vector of logits.
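For example, here is a minimal check against the config (these are standard ViTConfig fields):
print(model.config.image_size)    # 224
print(model.config.num_channels)  # 3
print(model.config.num_labels)    # 1000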
Model Surgery
Usually, every ML model has certain preprocessing and postprocessing
steps. The ViT model is no exception. The major preprocessing
steps include:
- Scaling the image pixel values to the [0, 1] range.
- Normalizing the scaled pixel values to [-1, 1].
- Resizing the image so that it has a spatial resolution of (224, 224).
You can verify these by investigating the image processor associated
with the model:
from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained(ckpt)
print(processor)
This should print:
ViTImageProcessor {
  "do_normalize": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}
Since this is an image classification model pre-trained on the
ImageNet-1k dataset, the model
outputs need to be mapped to the ImageNet-1k classes as the
post-processing step.
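The mapping from class indices to ImageNet-1k labels already ships with the model configuration, which is what the post-processing step later in this post relies on. A quick, illustrative peek:
print(len(model.config.id2label))               # 1000 classes
print(list(model.config.id2label.items())[:3])  # a few (index, label) pairs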
To reduce the developers' cognitive load and training-serving skew,
it's often a good idea to ship a model that has most of the
preprocessing and postprocessing steps built in. Therefore, you should
serialize the model as a SavedModel such that the above-mentioned
processing ops get embedded into its computation graph.
Preprocessing
For preprocessing, image normalization is one of the most essential
components:
import tensorflow as tf


def normalize_img(
    img, mean=processor.image_mean, std=processor.image_std
):
    # Scale to the [0, 1] range, then normalize with the processor's mean/std.
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    return (img - mean) / std
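As a quick sanity check (a sketch, not part of the original workflow), a pixel value of 0 should map to -1.0 and 255 to 1.0 under the mean and standard deviation of 0.5 used by this processor:
dummy = tf.constant([[[0.0, 127.5, 255.0]]])
print(normalize_img(dummy))  # approximately [[[-1., 0., 1.]]]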
You also need to resize the image and transpose it so that it has leading
channel dimensions, following the channels-first format of 🤗
Transformers. The code snippet below shows all the preprocessing steps:
CONCRETE_INPUT = "pixel_values"
SIZE = processor.size["height"]


def normalize_img(
    img, mean=processor.image_mean, std=processor.image_std
):
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    return (img - mean) / std


def preprocess(string_input):
    decoded_input = tf.io.decode_base64(string_input)
    decoded = tf.io.decode_jpeg(decoded_input, channels=3)
    resized = tf.image.resize(decoded, size=(SIZE, SIZE))
    normalized = normalize_img(resized)
    normalized = tf.transpose(
        normalized, (2, 0, 1)
    )
    return normalized


@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def preprocess_fn(string_input):
    decoded_images = tf.map_fn(
        preprocess, string_input, dtype=tf.float32, back_prop=False
    )
    return {CONCRETE_INPUT: decoded_images}
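To convince yourself the function behaves as expected, here is a small smoke test (a sketch; it uses the same COCO image that is used later in this post for querying the endpoints, and urlsafe_b64encode because tf.io.decode_base64 expects web-safe base64):
import base64

image_path = tf.keras.utils.get_file(
    "image.jpg", "http://images.cocodataset.org/val2017/000000039769.jpg"
)
image_bytes = tf.io.read_file(image_path)
b64str = base64.urlsafe_b64encode(image_bytes.numpy()).decode("utf-8")

batch = preprocess_fn(tf.constant([b64str]))
print(batch["pixel_values"].shape)  # (1, 3, 224, 224)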
Note on making the model accept string inputs:
When dealing with images via REST or gRPC requests, the size of the
request payload can easily spiral up depending on the resolution of the
images being passed. This is why it is a good practice to compress them
reliably and then prepare the request payload.
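For a rough idea of why this matters, compare the compressed, base64-encoded image from the sketch above with the same image shipped as a raw float32 tensor (illustrative numbers; they depend on the image):
import numpy as np

print(f"base64 JPEG payload: ~{len(b64str) / 1024:.0f} KiB")
print(f"float32 (3, 224, 224) tensor: ~{np.prod((3, 224, 224)) * 4 / 1024:.0f} KiB")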
Postprocessing and Model Export
You are now equipped with the preprocessing operations that you can inject
into the model's existing computation graph. In this section, you will also
inject the post-processing operations into the graph and export the
model!
def model_exporter(model: tf.keras.Model):
    m_call = tf.function(model.call).get_concrete_function(
        tf.TensorSpec(
            shape=[None, 3, SIZE, SIZE], dtype=tf.float32, name=CONCRETE_INPUT
        )
    )

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def serving_fn(string_input):
        labels = tf.constant(list(model.config.id2label.values()), dtype=tf.string)
        images = preprocess_fn(string_input)
        predictions = m_call(**images)
        indices = tf.argmax(predictions.logits, axis=1)
        pred_source = tf.gather(params=labels, indices=indices)
        probs = tf.nn.softmax(predictions.logits, axis=1)
        pred_confidence = tf.reduce_max(probs, axis=1)
        return {"label": pred_source, "confidence": pred_confidence}

    return serving_fn
You can first derive the concrete function
from the model's forward pass method (call()) so the model is nicely compiled
into a graph. After that, you can apply the following steps in order:
- Pass the inputs through the preprocessing operations.
- Pass the preprocessed inputs through the derived concrete function.
- Post-process the outputs and return them in a nicely formatted dictionary.
Now it's time to export the model!
import os
import tempfile

MODEL_DIR = tempfile.gettempdir()
VERSION = 1

tf.saved_model.save(
    model,
    os.path.join(MODEL_DIR, str(VERSION)),
    signatures={"serving_default": model_exporter(model)},
)
os.environ["MODEL_DIR"] = MODEL_DIR
After exporting, let’s inspect the model signatures again:
saved_model_cli show --dir {MODEL_DIR}/1 --tag_set serve --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['string_input'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_string_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['confidence'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: StatefulPartitionedCall:0
  outputs['label'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
You can notice that the model's signature has now changed. Specifically,
the input type is now a string and the model returns two things: a
confidence score and the string label.
Provided you've already installed TF Serving (covered in the Colab
Notebook), you're now ready to deploy this model!
Deployment with TensorFlow Serving
It just takes a single command to do this:
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=vit \
  --model_base_path=$MODEL_DIR >server.log 2>&1
From the above command, the important parameters are:
- rest_api_port denotes the port number that TF Serving will use to deploy the REST endpoint of your model. By default, TF Serving uses port 8500 for the gRPC endpoint.
- model_name specifies the model name (can be anything) that will be used for calling the APIs.
- model_base_path denotes the base model path that TF Serving will use to load the latest version of the model.
(The complete list of supported parameters is
here.)
And voilà! Within minutes, you should be up and running with a deployed
model that has two endpoints: REST and gRPC.
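If you want to confirm the model loaded correctly, TF Serving also exposes a Model Status API on the same REST port (a minimal check, assuming the command above is running):
import requests

print(requests.get("http://localhost:8501/v1/models/vit").json())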
Querying the REST Endpoint
Recall that you exported the model such that it accepts string inputs
encoded in the base64 format. So, to craft the
request payload you can do something like this:
import base64
import json

image_path = tf.keras.utils.get_file(
    "image.jpg", "http://images.cocodataset.org/val2017/000000039769.jpg"
)
bytes_inputs = tf.io.read_file(image_path)
b64str = base64.urlsafe_b64encode(bytes_inputs.numpy()).decode("utf-8")
data = json.dumps({"signature_name": "serving_default", "instances": [b64str]})
TF Serving's request payload format specification for the REST endpoint
is available here.
Within instances, you can pass multiple encoded images, as shown below. These kinds
of endpoints are meant to be consumed for online prediction scenarios.
For inputs having more than a single data point, you would want to
enable batching
to get performance optimization benefits.
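For example, sending two copies of the same encoded image in one request would look like this (data_batch is just an illustrative name):
data_batch = json.dumps(
    {"signature_name": "serving_default", "instances": [b64str, b64str]}
)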
Now you can call the API:
import requests

headers = {"content-type": "application/json"}
json_response = requests.post(
    "http://localhost:8501/v1/models/vit:predict", data=data, headers=headers
)
print(json.loads(json_response.text))
The REST API endpoint is
http://localhost:8501/v1/models/vit:predict, following the specification from
here. By default,
this always picks up the latest version of the model. But if you wanted a
specific version you can do: http://localhost:8501/v1/models/vit/versions/1:predict.
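Since the request uses the row-oriented instances format and the model has two named outputs, the response should look like {"predictions": [{"label": ..., "confidence": ...}]}. Assuming that shape, you can pull out the fields like so:
pred = json.loads(json_response.text)["predictions"][0]
print(pred["label"], pred["confidence"])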
Querying the gRPC Endpoint
While REST is quite popular in the API world, many applications often
benefit from gRPC. This post
does a good job comparing the two ways of deployment. gRPC is usually
preferred for low-latency, highly scalable, and distributed systems.
There are a couple of steps involved. First, you need to open a communication
channel:
import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
Then, create the request payload:
request = predict_pb2.PredictRequest()
request.model_spec.name = "vit"
request.model_spec.signature_name = "serving_default"
request.inputs[serving_input].CopyFrom(tf.make_tensor_proto([b64str]))
You can determine the serving_input key programmatically like so:
loaded = tf.saved_model.load(f"{MODEL_DIR}/{VERSION}")
serving_input = list(
    loaded.signatures["serving_default"].structured_input_signature[1].keys()
)[0]
print("Serving function input:", serving_input)
Now, you can get some predictions:
grpc_predictions = stub.Predict(request, 10.0)
print(grpc_predictions)
outputs {
  key: "confidence"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
    }
    float_val: 0.8966591954231262
  }
}
outputs {
  key: "label"
  value {
    dtype: DT_STRING
    tensor_shape {
      dim {
        size: 1
      }
    }
    string_val: "Egyptian cat"
  }
}
model_spec {
  name: "vit"
  version {
    value: 1
  }
  signature_name: "serving_default"
}
You can also fetch the key-values of interest from the above results like so:
grpc_predictions.outputs["label"].string_val, grpc_predictions.outputs[
    "confidence"
].float_val
Wrapping Up
In this post, we learned how to deploy a TensorFlow vision model from
Transformers with TF Serving. While local deployments are great for
weekend projects, we would want to be able to scale these deployments to
serve many users. In the next series of posts, you'll see how to scale up
these deployments with Kubernetes and Vertex AI.
