My Journey to a serverless transformers pipeline on Google Cloud

A guest blog post by community member Maxence Dominici

This post will discuss my journey to deploy the transformers sentiment-analysis pipeline on Google Cloud. We’ll start with a quick introduction to transformers, then move to the technical part of the implementation. Finally, we’ll summarize this implementation and review what we have achieved.



The Goal

I wanted to create a micro-service that automatically detects whether a customer review left in Discord is positive or negative. This would allow me to treat the comment accordingly and improve the customer experience. For instance, if the review was negative, I could create a feature which would contact the customer, apologize for the poor quality of service, and inform him/her that our support team will contact him/her as soon as possible to help and hopefully fix the issue. Since I don’t plan to get more than 2,000 requests per month, I didn’t impose any performance constraints regarding time or scalability.



The Transformers library

I was a bit confused at the beginning when I downloaded the .h5 file. I assumed it would be compatible with tensorflow.keras.models.load_model, but this wasn’t the case. After a few minutes of research I was able to figure out that the file was a weights checkpoint rather than a Keras model.
After that, I tried out the API that Hugging Face offers and read a bit more about the pipeline feature they provide. Since the results of the API & the pipeline were great, I decided that I could serve the model through the pipeline on my own server.

Below is the official example from the Transformers GitHub page.

from transformers import pipeline


classifier = pipeline('sentiment-analysis')
classifier("We are very happy to include pipeline into the transformers repository.")
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]



Deploy transformers to Google Cloud

GCP was chosen because it’s the cloud environment I’m using in my personal organization.



Step 1 – Research

I already knew that I could use an API service like Flask to serve a transformers model. I searched in the Google Cloud AI documentation and found a service to host TensorFlow models named AI Platform Prediction. I also found App Engine and Cloud Run there, but I was concerned about the memory usage for App Engine and was not very familiar with Docker.



Step 2 – Test on AI-Platform Prediction

Since the model is not a “pure TensorFlow” saved model but a checkpoint, and I could not turn it into a “pure TensorFlow” model, I figured out that the example on this page would not work.
From there I saw that I could write some custom code, allowing me to load the pipeline instead of having to handle the model, which seemed easier. I also learned that I could define a pre-prediction & post-prediction action, which could be useful in the future for pre- or post-processing the data for customers’ needs.
I followed Google’s guide but encountered an issue as the service is still in beta and not everything is stable. This issue is detailed here.
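The custom-code route mentioned above would have looked roughly like the sketch below. The class shape (from_path / predict) follows AI Platform Prediction’s custom prediction routine interface; the class name and the pipeline-loading details are my assumptions, since I never got this working due to the beta issues.

```python
class ReviewClassifierPredictor:
    """Sketch of an AI Platform custom prediction routine wrapping the pipeline."""

    def __init__(self, classifier):
        self._classifier = classifier

    def predict(self, instances, **kwargs):
        # A pre-prediction hook could clean the incoming reviews here.
        results = [self._classifier(text)[0] for text in instances]
        # A post-prediction hook could reshape the results for customers here.
        return results

    @classmethod
    def from_path(cls, model_dir):
        # Deferred import: only needed when the model is actually loaded.
        from transformers import pipeline
        return cls(pipeline("sentiment-analysis", model=model_dir, tokenizer=model_dir))
```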



Step 3 – Test on App Engine

I moved to Google’s App Engine as it’s a service that I’m familiar with, but encountered an installation issue with TensorFlow due to a missing system dependency file. I then tried with PyTorch, which worked with an F4_1G instance, but it couldn’t handle more than 2 requests on the same instance, which isn’t great performance-wise.



Step 4 – Test on Cloud Run

Lastly, I moved to Cloud Run with a Docker image. I followed this guide to get an idea of how it works. In Cloud Run, I could configure higher memory and more vCPUs to perform the prediction with PyTorch. I ditched TensorFlow as PyTorch seems to load the model faster.



Implementation of the serverless pipeline

The final solution consists of four different components:

  • main.py handling the request to the pipeline
  • Dockerfile used to create the image that will be deployed on Cloud Run.
  • Model folder containing pytorch_model.bin, config.json and vocab.txt.
  • requirements.txt for installing the dependencies

The content of main.py is really simple. The idea is to receive a GET request containing two fields. First the review that needs to be analysed, second the API key to “protect” the service. The second parameter is optional; I used it to avoid setting up the OAuth2 of Cloud Run. After these arguments are provided, we load the pipeline which is built based on the model distilbert-base-uncased-finetuned-sst-2-english (provided above). In the end, the best match is returned to the client.

import os
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

model_path = "./model"

@app.route("/")
def classify_review():
    review = request.args.get('review')
    api_key = request.args.get('api_key')
    if review is None or api_key != "MyCustomerApiKey":
        return jsonify(code=403, message="bad request")
    classify = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
    return classify(review)[0]


if __name__ == '__main__':
    app.run(debug=False, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

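Once deployed, a request to this endpoint can be built as a plain GET URL with the two query parameters. A small sketch of how a client would assemble it (the host name below is a made-up illustration, not a real endpoint):

```python
from urllib.parse import urlencode

def build_request_url(base_url, review, api_key):
    """Build the GET request URL expected by main.py (review + api_key)."""
    return f"{base_url}/?{urlencode({'review': review, 'api_key': api_key})}"

# Hypothetical Cloud Run host, for illustration only:
url = build_request_url(
    "https://ai-customer-review-xyz.a.run.app",
    "The support team was fantastic!",
    "MyCustomerApiKey",
)
print(url)
```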
Then the Dockerfile, which will be used to create a Docker image of the service. We specify that our service runs with python:3.7, plus that we need to install our requirements. Then we use gunicorn to handle our process on port 5000.


FROM python:3.7

ENV PYTHONUNBUFFERED True

COPY requirements.txt /
RUN pip install -r requirements.txt

COPY . /app

EXPOSE 5000
ENV PORT 5000
WORKDIR /app

CMD exec gunicorn --bind :$PORT main:app --workers 1 --threads 1 --timeout 0

It is important to note the arguments --workers 1 --threads 1, which mean that I want to execute my app on only one worker (= 1 process) with a single thread. This is because I don’t want to have 2 instances up at once, since that would increase the billing. One of the downsides is that it will take more time to process if the service receives two requests at once. After that, I set the limit to 1 thread due to the memory usage needed for loading the model into the pipeline. If I were using 4 threads, I might have 4 GB / 4 = 1 GB only to perform the full process, which is not enough and would lead to a memory error.
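The memory reasoning above can be sketched as a quick back-of-the-envelope check, following the article’s simplifying assumption that each concurrent thread needs its own full memory budget for loading the pipeline:

```python
def memory_per_thread_gb(instance_memory_gb, threads):
    """Approximate per-thread memory budget, assuming each concurrent
    request loading the pipeline needs its own share of instance memory."""
    return instance_memory_gb / threads

# With 4 threads only 1 GB would be left per pipeline load:
print(memory_per_thread_gb(4, 4))  # 1.0
# With a single thread the full 4 GB is available:
print(memory_per_thread_gb(4, 1))  # 4.0
```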

Finally, the requirements.txt file:

Flask==1.1.2
torch==1.7.1
transformers~=4.2.0
gunicorn>=20.0.0



Deployment instructions

First, you will need to meet some requirements, such as having a project on Google Cloud, enabling billing and installing the gcloud CLI. You can find more details about it in Google’s guide – Before you begin.

Second, we need to build the Docker image and deploy it to Cloud Run by selecting the correct project (replace PROJECT-ID) and setting the name of the instance, such as ai-customer-review. You can find more information about the deployment in Google’s guide – Deploying to.

gcloud builds submit --tag gcr.io/PROJECT-ID/ai-customer-review
gcloud run deploy --image gcr.io/PROJECT-ID/ai-customer-review --platform managed

After a few minutes, you may also have to upgrade the memory allocated to your Cloud Run instance from 256 MB to 4 GB. To do so, head over to the Cloud Run Console of your project.
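Alternatively, the memory limit can likely be raised from the gcloud CLI instead of the console (service name assumed from the earlier deploy step):

```shell
# Raise the memory limit of the already-deployed Cloud Run service
gcloud run services update ai-customer-review --memory 4Gi --platform managed
```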

There you should find your instance; click on it.


After that, you will see a blue button labelled “Edit and deploy new revision” at the top of the screen. Click on it and you will be prompted with many configuration fields. At the bottom you should find a “Capacity” section where you can specify the memory.




Performance


Handling a request takes less than five seconds from the moment you send the request, including loading the model into the pipeline and the prediction. The cold start might take up an additional 10 seconds, more or less.

We can improve the request handling performance by warming the model, meaning loading it on start-up instead of on each request (as a global variable, for instance); by doing so, we save time and memory usage.
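A minimal sketch of that warming idea, keeping the loader generic so the pattern is visible; in the real main.py the loader would be something like lambda: pipeline("sentiment-analysis", model=model_path, tokenizer=model_path):

```python
# Warm-model sketch: build the classifier once and reuse it for every
# request, instead of rebuilding it inside the request handler.
_classifier = None

def get_classifier(loader):
    """Return the cached classifier, calling `loader` only on first use."""
    global _classifier
    if _classifier is None:
        _classifier = loader()
    return _classifier
```

With this, only the cold start pays the model-loading cost; every subsequent request reuses the already-loaded pipeline.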



Costs

I simulated the cost based on the Cloud Run instance configuration with the Google pricing simulator.
Estimate of the monthly cost

For my micro-service, I’m planning to get close to 1,000 requests per month, optimistically; 500 is more likely for my usage. That is why I considered 2,000 requests as an upper bound when designing my micro-service.
Due to that low number of requests, I didn’t bother much about scalability, but might come back to it if my billing increases.

Nevertheless, it is important to emphasize that you will pay for the storage of each gigabyte of your build image. It’s roughly €0.10 per GB per month, which is fine if you don’t keep all your versions on the cloud, since my version is slightly above 1 GB (PyTorch accounts for 700 MB & the model for 250 MB).



Conclusion

By using Transformers’ sentiment-analysis pipeline, I saved a non-negligible amount of time. Instead of training/fine-tuning a model, I could find one ready to be used in production and start the deployment in my system. I might fine-tune it in the future, but as shown in my tests, the accuracy is already amazing!
I would have liked a “pure TensorFlow” model, or at least a way to load it in TensorFlow without Transformers dependencies, to use the AI Platform. It would also be great to have a lite version.


