In the previous posts, we showed how to deploy a Vision Transformer (ViT) model from 🤗 Transformers locally and on a Kubernetes cluster. This post will show you how to deploy the same model on the Vertex AI platform. You'll achieve the same level of scalability as the Kubernetes-based deployment, but with significantly less code.
This post builds on top of the two previous posts linked above. You're advised to check them out if you haven't already. You can find a fully worked-out example in the Colab Notebook linked at the beginning of the post.
What is Vertex AI?
According to Google Cloud:
Vertex AI provides tools to support your entire ML workflow, across different model types and varying levels of ML expertise.
When it comes to model deployment, Vertex AI provides a few important features with a unified API design:
- Authentication
- Autoscaling based on traffic
- Model versioning
- Traffic splitting between different versions of a model
- Rate limiting
- Model monitoring and logging
- Support for online and batch predictions
For TensorFlow models, it offers various off-the-shelf utilities, which you'll get to in this post. But it also has similar support for other frameworks like PyTorch and scikit-learn.
To use Vertex AI, you'll need a billing-enabled Google Cloud Platform (GCP) project and the following services enabled:
Revisiting the Serving Model
You'll use the same ViT B/16 model implemented in TensorFlow as you did in the last two posts. You serialized the model with the corresponding pre-processing and post-processing operations embedded to reduce training-serving skew. Please refer to the first post, which discusses this in detail. The signature of the final serialized SavedModel looks like:
The given SavedModel SignatureDef contains the following input(s):
  inputs['string_input'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_string_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['confidence'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: StatefulPartitionedCall:0
  outputs['label'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
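If you want to double-check this signature yourself, you can load the SavedModel and print its serving signature. Here is a minimal sketch, assuming the SavedModel lives in a local directory called vit (a placeholder path):

import tensorflow as tf

# Load the serialized model and inspect its default serving signature.
# "vit" is a placeholder for wherever your SavedModel is stored.
loaded = tf.saved_model.load("vit")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)
print(serving_fn.structured_outputs)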
The model will accept base64 encoded strings of images, perform pre-processing, run inference, and finally perform the post-processing steps. The strings are base64 encoded to prevent any modifications during network transmission. Pre-processing includes resizing the input image to 224×224 resolution, standardizing it to the [-1, 1] range, and transposing it to the channels_first memory layout. Post-processing includes mapping the predicted logits to string labels. A rough sketch of these steps follows.
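To make those embedded steps concrete, here is a rough sketch of what the pre-processing and post-processing amount to. This is not the exact code from the earlier posts; the helper names and the label list are placeholders:

import tensorflow as tf

def preprocess(image_bytes):
    # `image_bytes` are the raw JPEG bytes the model sees once the serving
    # stack has decoded the base64 payload.
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, (224, 224))
    image = tf.cast(image, tf.float32) / 127.5 - 1.0  # scale to [-1, 1]
    return tf.transpose(image, perm=[2, 0, 1])        # channels_first layout

def postprocess(logits, string_labels):
    # `string_labels` is the id-to-label mapping from the model config.
    return tf.gather(string_labels, tf.argmax(logits, axis=-1))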
To perform a deployment on Vertex AI, you need to keep the model artifacts in a Google Cloud Storage (GCS) bucket. The accompanying Colab Notebook shows how to create a GCS bucket and save the model artifacts into it; a minimal sketch of that step is shown below.
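If you prefer not to follow the notebook, one way to get the artifacts into the bucket is to copy the SavedModel directory to a gs:// path with TensorFlow's file utilities. A minimal sketch, where the bucket name and local directory are placeholders:

import os
import tensorflow as tf

GCS_BUCKET = "gs://hf-tf-vision"  # placeholder: your bucket
LOCAL_MODEL_DIR = "vit"           # placeholder: local SavedModel directory from the previous posts

# Recursively copy every file of the SavedModel directory into the bucket.
for root, _, files in os.walk(LOCAL_MODEL_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        tf.io.gfile.copy(local_path, f"{GCS_BUCKET}/{local_path}", overwrite=True)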
Deployment workflow with Vertex AI
The figure below gives a pictorial workflow of deploying an already
trained TensorFlow model on Vertex AI.
Let’s now discuss what the Vertex AI Model Registry and Endpoint are.
Vertex AI Model Registry
Vertex AI Model Registry is a fully managed machine learning model registry. There are a couple of things to note about what fully managed means here. First, you don't have to worry about how and where models are stored. Second, it manages different versions of the same model.
These features are important for machine learning in production. Building a model registry that guarantees high availability and security is nontrivial. Also, there are often situations where you want to roll back the current model to a past version, since we cannot control the inside of a black-box machine learning model. Vertex AI Model Registry allows us to achieve these without much difficulty.
The currently supported model types include SavedModel from TensorFlow, scikit-learn, and XGBoost.
Vertex AI Endpoint
From the user's perspective, Vertex AI Endpoint simply provides an endpoint to receive requests and send responses back. However, it has a lot of things under the hood that machine learning operators can configure. Here are some of the configurations that you can choose:
- Version of a model
- Specification of a VM in terms of CPU, memory, and accelerators
- Min/Max number of compute nodes
- Traffic split percentage
- Model monitoring window length and its objectives
- Prediction requests sampling rate
Performing the Deployment
The google-cloud-aiplatform
Python SDK provides easy APIs to manage the lifecycle of a deployment on Vertex AI. It is divided into four steps:
- uploading a model
- creating an endpoint
- deploying the model to the endpoint
- making prediction requests.
Throughout these steps, you'll need the ModelServiceClient, EndpointServiceClient, and PredictionServiceClient modules from the google-cloud-aiplatform Python SDK to interact with Vertex AI, as shown in the sketch below.
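For context, the three clients can be instantiated roughly like this. This is a sketch; PROJECT_ID and REGION are placeholders you'd set for your own project:

from google.cloud import aiplatform

REGION = "us-central1"            # placeholder
PROJECT_ID = "your-gcp-project"   # placeholder
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

# All three clients talk to the regional Vertex AI API endpoint.
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

model_service_client = aiplatform.gapic.ModelServiceClient(client_options=client_options)
endpoint_service_client = aiplatform.gapic.EndpointServiceClient(client_options=client_options)
prediction_service_client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)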
1. The first step in the workflow is to upload the SavedModel to Vertex AI's model registry:
tf28_gpu_model_dict = {
    "display_name": "ViT Base TF2.8 GPU model",
    "artifact_uri": f"{GCS_BUCKET}/{LOCAL_MODEL_DIR}",
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-8:latest",
    },
}
tf28_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf28_gpu_model_dict)
    .result(timeout=180)
    .model
)
Let’s unpack the code piece by piece:
- GCS_BUCKET denotes the path of your GCS bucket where the model artifacts are located (e.g., gs://hf-tf-vision).
- In container_spec, you provide the URI of a Docker image that will be used to serve predictions. Vertex AI provides pre-built images to serve TensorFlow models, but you can also use your custom Docker images when using a different framework (an example).
- model_service_client is a ModelServiceClient object that exposes the methods to upload a model to the Vertex AI Model Registry.
- PARENT is set to f"projects/{PROJECT_ID}/locations/{REGION}", which lets Vertex AI determine where the model is going to be scoped inside GCP.
2. Then you need to create a Vertex AI Endpoint:
tf28_gpu_endpoint_dict = {
    "display_name": "ViT Base TF2.8 GPU endpoint",
}
tf28_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT, endpoint=tf28_gpu_endpoint_dict
    )
    .result(timeout=300)
    .name
)
Here you're using an endpoint_service_client, which is an EndpointServiceClient object. It lets you create and configure your Vertex AI Endpoint.
3. Now you're down to performing the actual deployment!
tf28_gpu_deployed_model_dict = {
    "model": tf28_gpu_model,
    "display_name": "ViT Base TF2.8 GPU deployed model",
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 1,
        "machine_spec": {
            "machine_type": DEPLOY_COMPUTE,
            "accelerator_type": DEPLOY_GPU,
            "accelerator_count": 1,
        },
    },
}
Here, you're chaining together the model you uploaded to the Vertex AI Model Registry and the Endpoint you created in the above steps. You're first defining the configurations of the deployment under tf28_gpu_deployed_model_dict.
Under dedicated_resources you're configuring:
- min_replica_count and max_replica_count, which handle the autoscaling aspects of your deployment.
- machine_spec lets you define the configurations of the deployment hardware:
  - machine_type is the base machine type that will be used to run the Docker image. The underlying autoscaler will scale this machine as per the traffic load. You can choose one from the supported machine types.
  - accelerator_type is the hardware accelerator that will be used to perform inference.
  - accelerator_count denotes the number of hardware accelerators to attach to each replica.
- Note that providing an accelerator is not a requirement to deploy models on Vertex AI (see the sketch after this list).
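For instance, a CPU-only variant of the deployment spec could look like the sketch below, assuming the model was uploaded with the corresponding CPU serving image; the model name and machine type here are illustrative placeholders:

tf28_cpu_deployed_model_dict = {
    "model": tf28_cpu_model,  # placeholder: a model uploaded with a tf2-cpu serving image
    "display_name": "ViT Base TF2.8 CPU deployed model",
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 1,
        "machine_spec": {
            # No accelerator_type / accelerator_count: inference runs on CPU only.
            "machine_type": "n1-standard-8",
        },
    },
}

For the rest of this post, though, you'll stick with the GPU spec defined above.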
Next, you deploy the endpoint using the above specifications:
tf28_gpu_deployed_model = endpoint_service_client.deploy_model(
    endpoint=tf28_gpu_endpoint,
    deployed_model=tf28_gpu_deployed_model_dict,
    traffic_split={"0": 100},
).result()
Notice how you're defining the traffic split for the model. If you had multiple versions of the model, you could have defined a dictionary where the keys would denote the deployed model versions and the values would denote the percentage of traffic each model is supposed to serve, as sketched below.
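For illustration, a deployment call splitting traffic between a new model and one that is already deployed could look like this. The existing deployed model ID below is a placeholder; in traffic_split, the key "0" refers to the model being deployed in the current request:

existing_deployed_model_id = "1234567890"  # placeholder for an already deployed model

tf28_gpu_deployed_model = endpoint_service_client.deploy_model(
    endpoint=tf28_gpu_endpoint,
    deployed_model=tf28_gpu_deployed_model_dict,
    # Send 80% of the traffic to the new model and 20% to the existing one.
    traffic_split={"0": 80, existing_deployed_model_id: 20},
).result()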
With a Model Registry and a dedicated interface to manage Endpoints, Vertex AI lets you easily control the important aspects of the deployment.
It takes about 15 to 30 minutes for Vertex AI to scope the deployment. Once it's done, you should be able to see it on the console.
Performing Predictions
If your deployment was successful, you can test the deployed Endpoint by making a prediction request.
First, prepare a base64 encoded image string:
import base64
import tensorflow as tf

image_path = tf.keras.utils.get_file(
    "image.jpg", "http://images.cocodataset.org/val2017/000000039769.jpg"
)

image_bytes = tf.io.read_file(image_path)
b64str = base64.b64encode(image_bytes.numpy()).decode("utf-8")
4. The following utility first prepares a list of instances (only one instance in this case) and then uses a prediction service client (of type PredictionServiceClient). serving_input is the name of the input signature key of the served model. In this case, serving_input is string_input, which you can verify from the SavedModel signature output shown above.
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def predict_image(image, endpoint, serving_input):
    # The format of each instance should conform to
    # the deployed model's prediction input schema.
    instances_list = [{serving_input: {"b64": image}}]
    instances = [json_format.ParseDict(s, Value()) for s in instances_list]

    print(
        prediction_service_client.predict(
            endpoint=endpoint,
            instances=instances,
        )
    )


predict_image(b64str, tf28_gpu_endpoint, serving_input)
For TensorFlow models deployed on Vertex AI, the request payload needs to be formatted in a certain way. For models like ViT that deal with binary data like images, they need to be base64 encoded. According to the official guide, the request payload for each instance needs to be like so:
{serving_input: {"b64": base64.b64encode(jpeg_data).decode()}}
The predict_image() utility prepares the request payload conforming
to this specification.
If everything goes well with the deployment, when you call predict_image(), you should get an output like so:
predictions {
  struct_value {
    fields {
      key: "confidence"
      value {
        number_value: 0.896659553
      }
    }
    fields {
      key: "label"
      value {
        string_value: "Egyptian cat"
      }
    }
  }
}
deployed_model_id: "5163311002082607104"
model: "projects/29880397572/locations/us-central1/models/7235960789184544768"
model_display_name: "ViT Base TF2.8 GPU model"
Note, however, that this is not the only way to obtain predictions using a Vertex AI Endpoint. If you head over to the Endpoint console and select your endpoint, it will show you two different ways to obtain predictions:
You can also avoid cURL requests and obtain predictions programmatically without using the Vertex AI SDK. Refer to this notebook to learn more.
Now that you've learned how to use Vertex AI to deploy a TensorFlow model, let's discuss some helpful features provided by Vertex AI. These help you get deeper insights into your deployment.
Monitoring with Vertex AI
Vertex AI also lets you monitor your model without any configuration. From the Endpoint console, you can get details about the performance of the Endpoint and the utilization of the allocated resources.
As seen in the above chart, for a brief period of time, the accelerator duty cycle (utilization) was about 100%, which is a sight for sore eyes. For the rest of the time, there weren't any requests to process, hence things were idle.
This kind of monitoring helps you quickly flag the currently deployed Endpoint and make adjustments as necessary. You can also request monitoring of model explanations. Refer here to learn more.
Local Load Testing
We conducted a local load test with Locust to better understand the limits of the Endpoint. The table below summarizes the request statistics:
Among all the different statistics shown in the table, Average (ms) refers to the average latency of the Endpoint. Locust fired off about 17230 requests, and the reported average latency is 646 milliseconds, which is impressive. In practice, you'd want to simulate more realistic traffic by conducting the load test in a distributed manner. Refer here to learn more.
This directory has all the information needed to know how we conducted the load test; a rough sketch of the Locust user follows.
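For reference, a Locust user for this kind of test could look roughly like the sketch below. The host, endpoint resource name, access token handling, and test image are placeholders rather than the exact script we used:

import base64

from locust import HttpUser, task, between


class VertexEndpointUser(HttpUser):
    # Regional Vertex AI API host (placeholder region).
    host = "https://us-central1-aiplatform.googleapis.com"
    wait_time = between(1, 2)

    def on_start(self):
        # Placeholders: an OAuth access token (e.g. from
        # `gcloud auth print-access-token`) and a local test image.
        self.token = "REPLACE_WITH_ACCESS_TOKEN"
        with open("image.jpg", "rb") as f:
            self.b64str = base64.b64encode(f.read()).decode("utf-8")

    @task
    def predict(self):
        # Placeholder endpoint resource name:
        # projects/<PROJECT_ID>/locations/us-central1/endpoints/<ENDPOINT_ID>
        endpoint = "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID"
        self.client.post(
            f"/v1/{endpoint}:predict",
            json={"instances": [{"string_input": {"b64": self.b64str}}]},
            headers={"Authorization": f"Bearer {self.token}"},
        )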
Pricing
You can use the GCP cost estimator to estimate the cost of usage, and the exact hourly pricing table can be found here.
It's worth noting that you are only charged when the node is processing the actual prediction requests, and you need to calculate the price with and without GPUs.
For Vertex Prediction with a custom-trained model, we can choose N1 machine types from n1-standard-2 to n1-highcpu-32. You used n1-standard-8 for this post, which comes with 8 vCPUs and 30 GB of RAM.
| Machine Type | Hourly Pricing (USD) |
|---|---|
| n1-standard-8 (8vCPU, 30GB) | $ 0.4372 |
Also, when you attach accelerators to the compute node, you will be charged extra depending on the type of accelerator you use. We used NVIDIA_TESLA_T4 for this blog post, but almost all modern accelerators, including TPUs, are supported. You can find further information here.
| Accelerator Type | Hourly Pricing (USD) |
|---|---|
| NVIDIA_TESLA_T4 | $ 0.4024 |
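Putting the two tables together, a back-of-the-envelope estimate for the configuration used in this post looks like the following; the number of node-hours is just an illustrative assumption:

machine_hourly_usd = 0.4372      # n1-standard-8
accelerator_hourly_usd = 0.4024  # NVIDIA_TESLA_T4

hourly_cost = machine_hourly_usd + accelerator_hourly_usd  # 0.8396 USD per node-hour

node_hours = 24  # hypothetical: one node actively serving predictions for a day
print(f"Estimated cost: ${hourly_cost * node_hours:.2f}")  # ~$20.15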
Call to Action
The collection of TensorFlow vision models in 🤗 Transformers is growing. It now supports state-of-the-art semantic segmentation with SegFormer. We encourage you to extend the deployment workflow you learned in this post to semantic segmentation models like SegFormer.
Conclusion
In this post, you learned how to deploy a Vision Transformer model with the Vertex AI platform using the easy APIs it provides. You also learned how Vertex AI's features benefit the model deployment process by enabling you to focus on declarative configurations and removing the complex parts. Vertex AI also supports deployment of PyTorch models via custom prediction routes. Refer here for more details.
The series first introduced you to TensorFlow Serving for locally deploying a vision model from 🤗 Transformers. In the second post, you learned how to scale that local deployment with Docker and Kubernetes. We hope this series on the online deployment of TensorFlow vision models was helpful for taking your ML toolbox to the next level. We can't wait to see what you build with these tools.
Acknowledgements
Thanks to the ML Developer Relations Program team at Google, which provided us with GCP credits for conducting the experiments. Parts of the deployment code were adapted from this notebook in the official GitHub repository of Vertex AI code samples.






