In the previous post, we showed how to deploy a Vision Transformer (ViT) model from 🤗 Transformers locally with TensorFlow Serving. We covered topics like embedding preprocessing and postprocessing operations inside the Vision Transformer model, handling gRPC requests, and more!
While local deployments are a wonderful head start to building something useful, in real-life projects you'd need to perform deployments that can serve many users. In this post, you'll learn how to scale the local deployment from the previous post with Docker and Kubernetes. We therefore assume some familiarity with Docker and Kubernetes.
This post builds on top of the previous post, so we highly recommend reading it first. You can find all the code discussed throughout this post in this repository.
Why go with Docker and Kubernetes?
The basic workflow of scaling up a deployment like ours includes the following steps:
- Containerizing the application logic: The application logic involves a served model that can handle requests and return predictions. For containerization, Docker is the industry-standard go-to.
- Deploying the Docker container: You have various options here. The most widely used option is deploying the Docker container on a Kubernetes cluster. Kubernetes provides numerous deployment-friendly features (e.g. autoscaling and security). You can use a solution like Minikube to manage Kubernetes clusters locally or a fully managed solution like Elastic Kubernetes Service (EKS).
You might be wondering why use an explicit setup like this in the age of SageMaker and Vertex AI, which provide ML deployment-specific features right off the bat. It's fair to think about it.
The above workflow is widely adopted in the industry, and many organizations benefit from it. It has been battle-tested for many years. It also gives you more granular control over your deployments while abstracting away the non-trivial bits.
This post uses Google Kubernetes Engine (GKE) to provision and manage a Kubernetes cluster. We assume you already have a billing-enabled GCP project if you're using GKE. Also, note that you'd need to configure the gcloud utility to perform the deployment on GKE. But the concepts discussed in this post equally apply should you decide to use Minikube.
Note: The code snippets shown in this post can be executed on a Unix terminal as long as you have configured the gcloud utility along with Docker and kubectl. More instructions are available in the accompanying repository.
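Before moving on, it can help to confirm that the required tooling is installed and that gcloud is authenticated against the right project; the following optional checks only print version and configuration information:
# Verify the local tooling (optional)
$ docker --version
$ kubectl version --client
$ gcloud --version
# Verify the active gcloud account and project (optional)
$ gcloud auth list
$ gcloud config get-value project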
Containerization with Docker
The serving model can handle raw image inputs as bytes and is capable of preprocessing and postprocessing.
In this section, you'll see how to containerize that model using the base TensorFlow Serving image. TensorFlow Serving consumes models in the SavedModel format. Recall how you obtained such a SavedModel in the previous post. We assume that you have the SavedModel compressed in tar.gz format. You can fetch it from here just in case. Then the SavedModel should be placed in the special directory structure of <MODEL_NAME>/<VERSION>. This is how TensorFlow Serving simultaneously manages multiple deployments of different model versions.
Preparing the Docker image
The shell script below places the SavedModel in hf-vit/1 under the parent directory models. You'll copy everything inside it when preparing the Docker image. There is only one model in this example, but this is a more generalizable approach.
$ MODEL_TAR=model.tar.gz
$ MODEL_NAME=hf-vit
$ MODEL_VERSION=1
$ MODEL_PATH=models/$MODEL_NAME/$MODEL_VERSION
$ mkdir -p $MODEL_PATH
$ tar -xvf $MODEL_TAR --directory $MODEL_PATH
Below, we show how the models directory is structured in our case:
$ find /models
/models
/models/hf-vit
/models/hf-vit/1
/models/hf-vit/1/keras_metadata.pb
/models/hf-vit/1/variables
/models/hf-vit/1/variables/variables.index
/models/hf-vit/1/variables/variables.data-00000-of-00001
/models/hf-vit/1/assets
/models/hf-vit/1/saved_model.pb
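If you'd like to double-check what the extracted SavedModel expects, the saved_model_cli utility that ships with TensorFlow can print its serving signature. This step is optional, and the exact input and output names depend on how the model was exported in the previous post:
# Inspect the serving signature of the extracted SavedModel (optional)
$ saved_model_cli show --dir models/hf-vit/1 \
    --tag_set serve --signature_def serving_default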
The custom TensorFlow Serving image should be built on top of the base one. There are various approaches for this, but you'll do it by running a Docker container as illustrated in the official documentation. We start by running the tensorflow/serving image in background mode, and then the entire models directory is copied to the running container as shown below.
$ docker run -d --name serving_base tensorflow/serving
$ docker cp models/ serving_base:/models/
We used the official Docker image of TensorFlow Serving as the base, but you can use ones that you have built from source as well.
Note: TensorFlow Serving benefits from hardware optimizations that leverage instruction sets such as AVX512. These instruction sets can speed up deep learning model inference. So, if you know the hardware on which the model will be deployed, it's often beneficial to obtain an optimized build of the TensorFlow Serving image and use it throughout.
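As a rough sketch of what such an optimized build can look like, the TensorFlow Serving repository ships a Dockerfile.devel that accepts compiler options through a build argument. The build-arg name and flags below follow the project's build-from-source instructions; treat them as an assumption and verify them against the version you use:
# Sketch: build an optimized TensorFlow Serving development image (verify flags for your version)
$ git clone https://github.com/tensorflow/serving
$ cd serving
$ docker build --pull -t $USER/tensorflow-serving-devel \
    --build-arg TF_SERVING_BUILD_OPTIONS="--copt=-mavx512f --copt=-O3" \
    -f tensorflow_serving/tools/docker/Dockerfile.devel .
# The TensorFlow Serving documentation describes how to derive the final, lightweight serving image from this development image.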
Now that the running container has all the required files in the appropriate directory structure, we need to create a new Docker image that includes these changes. This can be done with the docker commit command below, which gives you a new Docker image named $NEW_IMAGE.
One important thing to note is that you need to set the MODEL_NAME environment variable to the model name, which is hf-vit in this case. This tells TensorFlow Serving what model to deploy.
$ NEW_IMAGE=tfserving:$MODEL_NAME
$ docker commit \
    --change "ENV MODEL_NAME $MODEL_NAME" \
    serving_base $NEW_IMAGE
Running the Docker image locally
Lastly, you can run the newly built Docker image locally to see if it works fine. Below you see the output of the docker run command. Since the output is verbose, we trimmed it down to focus on the important bits. Also, it's worth noting that it opens up ports 8500 and 8501 for the gRPC and HTTP/REST endpoints, respectively.
$ docker run -p 8500:8500 -p 8501:8501 -t $NEW_IMAGE &
---------OUTPUT---------
(Re-)adding model: hf-vit
Successfully reserved resources to load servable {name: hf-vit version: 1}
Approving load for servable version {name: hf-vit version: 1}
Loading servable version {name: hf-vit version: 1}
Reading SavedModel from: /models/hf-vit/1
Reading SavedModel debug info (if present) from: /models/hf-vit/1
Successfully loaded servable version {name: hf-vit version: 1}
Running gRPC ModelServer at 0.0.0.0:8500 ...
Exporting HTTP/REST API at:localhost:8501 ...
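With the container running, you can quickly verify the REST endpoint locally. TensorFlow Serving exposes a model status route, which should report version 1 of hf-vit as AVAILABLE:
# Query the model status endpoint of the locally running container
$ curl http://localhost:8501/v1/models/hf-vit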
Pushing the Docker image
The final step here is to push the Docker image to an image repository. You'll use Google Container Registry (GCR) for this purpose. The following lines of code can do this for you:
$ GCP_PROJECT_ID=
$ GCP_IMAGE=gcr.io/$GCP_PROJECT_ID/$NEW_IMAGE
$ gcloud auth configure-docker
$ docker tag $NEW_IMAGE $GCP_IMAGE
$ docker push $GCP_IMAGE
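Optionally, you can confirm that the push succeeded by listing the images and tags under your project's registry:
# List the pushed images and their tags (optional)
$ gcloud container images list --repository=gcr.io/$GCP_PROJECT_ID
$ gcloud container images list-tags gcr.io/$GCP_PROJECT_ID/tfserving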
Since we're using GCR, you need to prefix the Docker image tag (note the other formats too) with gcr.io/. With the Docker image prepared and pushed to GCR, you can now proceed to deploy it on a Kubernetes cluster.
Deploying on a Kubernetes cluster
Deployment on a Kubernetes cluster requires the following:
- Provisioning a Kubernetes cluster, done with Google Kubernetes Engine (GKE) in this post. However, you're welcome to use other platforms and tools like EKS or Minikube.
- Connecting to the Kubernetes cluster to perform a deployment.
- Writing YAML manifests.
- Performing the deployment with the manifests using a utility tool, kubectl.
Let’s go over each of those steps.
Provisioning a Kubernetes cluster on GKE
You can use a shell script like the following for this (available here):
$ GKE_CLUSTER_NAME=tfs-cluster
$ GKE_CLUSTER_ZONE=us-central1-a
$ NUM_NODES=2
$ MACHINE_TYPE=n1-standard-8
$ gcloud container clusters create $GKE_CLUSTER_NAME \
    --zone=$GKE_CLUSTER_ZONE \
    --machine-type=$MACHINE_TYPE \
    --num-nodes=$NUM_NODES
GCP offers a variety of machine types so you can configure the deployment the way you want. We encourage you to refer to the documentation to learn more about them.
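For instance, you can browse the machine types available in the chosen zone directly from the command line:
# List the machine types available in the cluster zone (optional)
$ gcloud compute machine-types list --zones=$GKE_CLUSTER_ZONE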
Once the cluster is provisioned, you need to connect to it to perform the deployment. Since GKE is used here, you also need to authenticate yourself. You can use a shell script like the following to do both:
$ GCP_PROJECT_ID=
$ export USE_GKE_GCLOUD_AUTH_PLUGIN=True
$ gcloud container clusters get-credentials $GKE_CLUSTER_NAME \
    --zone $GKE_CLUSTER_ZONE \
    --project $GCP_PROJECT_ID
The gcloud container clusters get-credentials command takes care of both connecting to the cluster and authentication. Once this is done, you're ready to write the manifests.
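Before writing the manifests, a quick way to confirm that kubectl is now pointing at the freshly provisioned cluster is to list its nodes; you should see the two n1-standard-8 nodes created above:
# Confirm that kubectl is connected to the GKE cluster
$ kubectl get nodes
$ kubectl cluster-info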
Writing Kubernetes manifests
Kubernetes manifests are written in YAML files. While it's possible to use a single manifest file to perform the deployment, creating separate manifest files is often useful for separation of concerns. It's common to use three manifest files for achieving this:
- deployment.yaml defines the desired state of the Deployment by providing the name of the Docker image, additional arguments when running the Docker image, the ports to open for external access, and the resource limits.
- service.yaml defines connections between external clients and Pods inside the Kubernetes cluster.
- hpa.yaml defines rules to scale up and down the number of Pods constituting the Deployment, such as the percentage of CPU utilization.
You can find the relevant manifests for this post here.
Below, we present a pictorial overview of how these manifests are consumed.
Next, we go through the important parts of each of these manifests.
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tfs-server
  name: tfs-server
...
spec:
  containers:
  - image: gcr.io/$GCP_PROJECT_ID/tfserving-hf-vit:latest
    name: tfs-k8s
    imagePullPolicy: Always
    args: ["--tensorflow_inter_op_parallelism=2",
           "--tensorflow_intra_op_parallelism=8"]
    ports:
    - containerPort: 8500
      name: grpc
    - containerPort: 8501
      name: restapi
    resources:
      limits:
        cpu: 800m
      requests:
        cpu: 800m
...
You can configure names like tfs-server and tfs-k8s any way you want. Under containers, you specify the Docker image URI the Deployment will use. The current resource utilization gets monitored by setting the allowed bounds of the resources for the container. This lets the Horizontal Pod Autoscaler (discussed later) decide to scale the number of containers up or down. requests.cpu is the minimal amount of CPU resources needed to make the container work correctly, as set by operators. Here, 800m means 800 millicores, i.e. 80% of a single CPU core. The HPA monitors the average CPU utilization relative to the sum of requests.cpu across all Pods to make scaling decisions.
Besides Kubernetes-specific configuration, you can specify TensorFlow Serving-specific options in args. In this case, you have two:
- tensorflow_inter_op_parallelism, which sets the number of threads to run in parallel for executing independent operations. The recommended value for this is 2.
- tensorflow_intra_op_parallelism, which sets the number of threads to run in parallel for executing individual operations. The recommended value is the number of physical cores the deployment CPU has (see the snippet after this list for one way to check this).
You can learn more about these options (and others) and tips on tuning them for deployment from here and here.
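If you're unsure how many cores the deployment machines expose, one simple way to check (shown below as an illustration on a Linux node or VM) is to inspect the CPU topology with standard utilities:
# Inspect the CPU topology of the machine the model will run on
$ nproc
$ lscpu | grep -E "Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)"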
service.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tfs-server
  name: tfs-server
spec:
  ports:
  - port: 8500
    protocol: TCP
    targetPort: 8500
    name: tf-serving-grpc
  - port: 8501
    protocol: TCP
    targetPort: 8501
    name: tf-serving-restapi
  selector:
    app: tfs-server
  type: LoadBalancer
We made the service type ‘LoadBalancer’ so the endpoints are exposed externally to the Kubernetes cluster. It selects the ‘tfs-server’ Deployment to make connections with external clients via the specified ports. We open two ports, 8500 and 8501, for gRPC and HTTP/REST connections, respectively.
hpa.yaml:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tfs-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tfs-server
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80
HPA stands for Horizontal Pod Autoscaler. It sets criteria to decide when to scale the number of Pods in the target Deployment. You can learn more about the autoscaling algorithm internally used by Kubernetes here.
Here you specify how Kubernetes should handle autoscaling. In particular, you define the replica bounds within which it should perform autoscaling (minReplicas and maxReplicas) and the target CPU utilization. targetCPUUtilizationPercentage is an important metric for autoscaling. The following quote aptly summarizes what it means (taken from here):
The CPU utilization is the average CPU usage of all Pods in a deployment across the last minute divided by the requested CPU of this deployment. If the mean of the Pods' CPU utilization is higher than the target you defined, your replicas will be adjusted.
Recall specifying resources in the deployment manifest. By specifying the resources, the Kubernetes control plane starts monitoring the metrics, so that targetCPUUtilizationPercentage works. Otherwise, the HPA doesn't know the current status of the Deployment.
You can experiment and set these to the numbers you need based on your requirements. Note, however, that autoscaling will be contingent on the quota you have available on GCP, since GKE internally uses Google Compute Engine to manage these resources.
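Once the manifests have been applied (covered in the next section), you can observe the autoscaler's view of the Deployment at any time; the commands below report the current and target CPU utilization along with the replica count:
# Watch the HPA metrics and replica count after the deployment is live
$ kubectl get hpa tfs-server --watch
$ kubectl describe hpa tfs-server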
Performing the deployment
Once the manifests are ready, you can apply them to the currently connected Kubernetes cluster with the kubectl apply command.
$ kubectl apply -f deployment.yaml
$ kubectl apply -f service.yaml
$ kubectl apply -f hpa.yaml
While using kubectl is fine for applying each of the manifests to perform the deployment, it can quickly become cumbersome if you have many different manifests. This is where a utility like Kustomize can be helpful. You simply define another specification named kustomization.yaml like so:
commonLabels:
  app: tfs-server
resources:
- deployment.yaml
- hpa.yaml
- service.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
Then it’s only a one-liner to perform the actual deployment:
$ kustomize build . | kubectl apply -f -
Complete instructions can be found
here.
Once the deployment has been performed, we can retrieve the endpoint IP like so:
$ kubectl rollout status deployment/tfs-server
$ kubectl get svc tfs-server --watch
---------OUTPUT---------
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
tfs-server LoadBalancer xxxxxxxxxx xxxxxxxxxx 8500:30869/TCP,8501:31469/TCP xxx
Note down the external IP when it becomes available.
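If you prefer to capture the external IP programmatically instead of copying it from the table, one option is to query the Service's load-balancer status with a jsonpath expression:
# Store the external IP of the LoadBalancer Service in a shell variable
$ EXTERNAL_IP=$(kubectl get svc tfs-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ echo $EXTERNAL_IP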
And that sums up all the steps you need to deploy your model on Kubernetes! Kubernetes elegantly provides abstractions for complex bits like autoscaling and cluster management while letting you focus on the crucial aspects you should care about when deploying a model. These include resource utilization, security (which we didn't cover here), and performance north stars like latency.
Testing the endpoint
Given that you have an external IP for the endpoint, you can use the following listing to test it:
import tensorflow as tf
import json
import base64
import requests  # needed for the HTTP request below

image_path = tf.keras.utils.get_file(
    "image.jpg", "http://images.cocodataset.org/val2017/000000039769.jpg"
)
bytes_inputs = tf.io.read_file(image_path)
b64str = base64.urlsafe_b64encode(bytes_inputs.numpy()).decode("utf-8")
data = json.dumps(
    {"signature_name": "serving_default", "instances": [b64str]}
)
# Replace <EXTERNAL_IP> with the external IP noted above.
json_response = requests.post(
    "http://<EXTERNAL_IP>:8501/v1/models/hf-vit:predict",
    headers={"content-type": "application/json"},
    data=data,
)
print(json.loads(json_response.text))
---------OUTPUT---------
{'predictions': [{'label': 'Egyptian cat', 'confidence': 0.896659195}]}
If you're interested in knowing how this deployment would perform with more traffic, we recommend checking out this article. Refer to the corresponding repository to learn more about running load tests with Locust and visualizing the results.
Notes on different TF Serving configurations
TensorFlow Serving provides various options to tailor the deployment based on your application use case. Below, we briefly discuss some of them.
enable_batching enables the batch inference capability that collects incoming requests within a certain time window, collates them into a batch, performs batch inference, and returns the result of each request to the appropriate client. TensorFlow Serving provides a rich set of configurable options (such as max_batch_size and num_batch_threads) to tailor your deployment needs. You can learn more about them here. Batching is particularly useful for applications where you don't need predictions from a model immediately. In those cases, you'd typically gather multiple samples, batch them, and send the batches for prediction. Lucky for us, TensorFlow Serving can configure all of this automatically once we enable its batching capabilities.
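As a sketch of what enabling batching can look like for a local run of the image we built earlier, TensorFlow Serving accepts a batching parameters file in protobuf text format alongside the --enable_batching flag; the values below are purely illustrative and should be tuned for your workload:
# Write a batching config in protobuf text format (illustrative values)
$ cat > batching_parameters.txt <<EOF
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 100 }
EOF
# Mount the config into the container and pass the batching flags as extra arguments
$ docker run -p 8500:8500 -p 8501:8501 \
    -v $(pwd)/batching_parameters.txt:/config/batching_parameters.txt \
    -t $NEW_IMAGE \
    --enable_batching=true \
    --batching_parameters_file=/config/batching_parameters.txt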
enable_model_warmup warms up some of the TensorFlow components that are lazily instantiated, using dummy input data. This way, you can ensure everything is appropriately loaded up and that there will be no lags during actual serving time.
Conclusion
In this post and the associated repository, you learned about deploying the Vision Transformer model from 🤗 Transformers on a Kubernetes cluster. If you're doing this for the first time, the steps may seem a bit daunting, but once you get the hang of them, they'll soon become an integral part of your toolbox. If you were already familiar with this workflow, we hope this post was still useful for you.
We applied the same deployment workflow for an ONNX-optimized version of the same Vision Transformer model. For more details, check out this link. ONNX-optimized models are especially useful if you're using x86 CPUs for deployment.
In the next post, we'll show you how to perform these deployments with significantly less code with Vertex AI: more like model.deploy(autoscaling_config=...) and boom! We hope you're just as excited as we are.
Acknowledgement
Thanks to the ML Developer Relations Program team at Google, which provided us with GCP credits for conducting the experiments.

