Speed up BERT inference with Hugging Face Transformers and AWS Inferentia



Philipp Schmid

notebook: sagemaker/18_inferentia_inference

The adoption of BERT and Transformers continues to grow. Transformer-based models are not only achieving state-of-the-art performance in Natural Language Processing but also in Computer Vision, Speech, and Time-Series. 💬 🖼 🎤 ⏳

Companies are now slowly moving from the experimentation and research phase to the production phase in order to use transformer models for large-scale workloads. But by default, BERT and its friends are relatively slow, big, and complex models compared to traditional Machine Learning algorithms. Accelerating Transformers and BERT is, and will continue to be, an interesting challenge to solve in the future.

AWS's answer to this challenge was to design a custom machine learning chip optimized for inference workloads, called AWS Inferentia. AWS says that AWS Inferentia "delivers up to 80% lower cost per inference and up to 2.3X higher throughput than comparable current generation GPU-based Amazon EC2 instances."

The real value of AWS Inferentia instances compared to GPUs comes through the multiple Neuron Cores available on each device. A Neuron Core is the custom accelerator inside AWS Inferentia. Each Inferentia chip comes with 4x Neuron Cores. This allows you to either load 1 model on each core (for high throughput) or 1 model across all cores (for lower latency).
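In this tutorial we take the first route: each HTTP worker gets pinned to a single Neuron core via the NEURON_RT_NUM_CORES environment variable (see the inference.py script in step 2). The snippet below is only a sketch of the two options; the --neuroncore-pipeline-cores compiler flag is an assumption based on the Neuron SDK documentation and is not used later in this tutorial.

import os

# Option A: 1 model per Neuron core (high throughput):
# pin each worker/process to a single core and load one model copy per worker
os.environ["NEURON_RT_NUM_CORES"] = "1"

# Option B: 1 model across all 4 cores (lower latency):
# assumption -- pipeline the model over the cores at compile time via neuron-cc compiler args,
# e.g. torch.neuron.trace(model, example_inputs, compiler_args=["--neuroncore-pipeline-cores", "4"])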



Tutorial

In this end-to-end tutorial, you will learn how to speed up BERT inference for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia.

You can find the notebook here: sagemaker/18_inferentia_inference

You will learn how to:

1. Convert your Hugging Face Transformer to AWS Neuron
2. Create a custom inference.py script for text-classification
3. Create and upload the neuron model and inference script to Amazon S3
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
5. Run and evaluate Inference performance of BERT on Inferentia

Let’s start! 🚀


Note: If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.



1. Convert your Hugging Face Transformer to AWS Neuron

We are going to use the AWS Neuron SDK for AWS Inferentia. The Neuron SDK includes a deep learning compiler, runtime, and tools for converting and compiling PyTorch and TensorFlow models to Neuron-compatible models, which can be run on EC2 Inf1 instances.

As a first step, we need to install the Neuron SDK and the required packages.

Tip: If you are using Amazon SageMaker Notebook Instances or Studio, you can go with the conda_python3 conda kernel.


# Set Pip repository to point to the Neuron repository
!pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com


# Install Neuron PyTorch, the Neuron compiler, sagemaker and transformers
!pip install torch-neuron==1.9.1.* "neuron-cc[tensorflow]" "sagemaker>=2.79.0" transformers==4.12.3 --upgrade

After we have installed the Neuron SDK, we can load and convert our model. Neuron models are converted using torch_neuron with its trace method, similar to torchscript. You can find more information in our documentation.

To be able to convert our model, we first need to select the model we want to use for our text classification pipeline from hf.co/models. For this example, let's go with distilbert-base-uncased-finetuned-sst-2-english, but this can be easily adjusted for other BERT-like models.

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

At the time of writing, the AWS Neuron SDK does not support dynamic shapes, which means that the input size needs to be static for compiling and inference.

In simpler terms, this means that when the model is compiled with, e.g., an input of batch size 1 and sequence length of 16, the model can only run inference on inputs with that same shape.
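As a minimal illustration of this constraint (using the model chosen above), padding and truncating every input to the traced sequence length guarantees that the shape always matches what the compiled model expects:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# a short and a long input both end up with the exact same (1, 128) shape
short_input = tokenizer("hello", max_length=128, padding="max_length", truncation=True, return_tensors="pt")
long_input = tokenizer("a much longer input " * 50, max_length=128, padding="max_length", truncation=True, return_tensors="pt")

assert short_input["input_ids"].shape == (1, 128)
assert long_input["input_ids"].shape == (1, 128)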

When using a t2.medium instance, the compilation takes around 3 minutes.

import os
import tensorflow  # to work around a protobuf version conflict issue
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)

# create dummy input for max length 128
dummy_input = "dummy input which will be padded later"
max_length = 128
embeddings = tokenizer(dummy_input, max_length=max_length, padding="max_length", return_tensors="pt")
neuron_inputs = tuple(embeddings.values())

# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(model, neuron_inputs)
model.config.update({"traced_sequence_length": max_length})

# save tokenizer, neuron model and config for later use
save_dir = "tmp"
os.makedirs("tmp", exist_ok=True)
model_neuron.save(os.path.join(save_dir, "neuron_model.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)



2. Create a custom inference.py script for text-classification

The Hugging Face Inference Toolkit supports zero-code deployments on top of the pipeline feature from 🤗 Transformers. This enables users to deploy Hugging Face transformers without an inference script [Example].

Currently, this feature is not supported with AWS Inferentia, which means we need to provide an inference.py script for running inference.

If you would be interested in support for zero-code deployments for Inferentia, let us know on the forum.


To use the inference script, we need to create an inference.py script. In our example, we are going to overwrite the model_fn to load our Neuron model and the predict_fn to create a text-classification pipeline.

If you want to know more about the inference.py script, check out this example. It explains, amongst other things, what model_fn and predict_fn are.
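As a rough orientation, this is the contract the toolkit expects from inference.py: model_fn is called once when the endpoint starts, and its return value is passed to predict_fn for every request. The skeleton below is only a sketch; the real script for this tutorial follows in the next cell.

def model_fn(model_dir):
    # called once at startup: load and return everything needed for inference
    # (model_dir contains the unpacked model.tar.gz)
    ...

def predict_fn(data, loaded):
    # called for every request: `data` is the deserialized request body,
    # `loaded` is whatever model_fn returned; return a JSON-serializable result
    ...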

!mkdir code

We are using NEURON_RT_NUM_CORES=1 to make sure that each HTTP worker uses 1 Neuron core to maximize throughput.

%%writefile code/inference.py

import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch.neuron


# To use one neuron core per HTTP worker
os.environ["NEURON_RT_NUM_CORES"] = "1"

# saved weights name
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"

def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config

def predict_fn(data, model_tokenizer_model_config):
    # destruct model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # create embeddings for the inputs
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for the neuron model
    neuron_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return a dictionary, which will be JSON serializable
    return [{"label": model_config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]



3. Create and upload the neuron model and inference script to Amazon S3

Before we can deploy our Neuron model to Amazon SageMaker, we need to create a model.tar.gz archive with all our model artifacts saved into tmp/, e.g. neuron_model.pt, and upload this to Amazon S3.
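For reference, the archive we are about to build will roughly look like this (the exact tokenizer files depend on the model you picked):

model.tar.gz
├── neuron_model.pt         # traced Neuron model from step 1
├── config.json             # model config, including traced_sequence_length
├── tokenizer files         # e.g. tokenizer.json, tokenizer_config.json, vocab.txt
└── code/
    └── inference.py        # custom inference script from step 2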

To do this, we need to set up our permissions.

import sagemaker
import boto3
sess = sagemaker.Session()


# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Next, we create our model.tar.gz. The inference.py script will be placed into a code/ folder.


# copy inference.py into the code/ directory of the model directory
!cp -r code/ tmp/code/

# create a model.tar.gz archive with all the model artifacts and the inference.py script
%cd tmp
!tar zcvf model.tar.gz *
%cd ..

Now we can upload our model.tar.gz to our session S3 bucket with sagemaker.

from sagemaker.s3 import S3Uploader


# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/{model_id}"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz", desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")



4. Deploy a Real-time Inference Endpoint on Amazon SageMaker

After we have uploaded our model.tar.gz to Amazon S3, we can create a custom HuggingFaceModel. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.

from sagemaker.huggingface.model import HuggingFaceModel


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,       # path to your model and script
   role=role,                     # iam role with permissions to create an Endpoint
   transformers_version="4.12",   # transformers version used
   pytorch_version="1.9",         # pytorch version used
   py_version='py37',             # python version used
)

# Let SageMaker know that we've already compiled the model via neuron-cc
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf1.xlarge" # AWS Inferentia instance
)



5. Run and evaluate Inference performance of BERT on Inferentia

The .deploy() returns a HuggingFacePredictor object which can be used to request inference.

data = {
  "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
res

We managed to deploy our Neuron-compiled BERT to AWS Inferentia on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will loop and send 10,000 synchronous requests to our endpoint.


# send 10,000 requests
for i in range(10000):
    resp = predictor.predict(
        data={"inputs": "it 's a captivating and sometimes affecting journey ."}
    )
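As a quick client-side sanity check (a minimal sketch), you can also time a handful of requests from the notebook. Note that this measures end-to-end request time including network and serialization overhead, so it will be higher than the ModelLatency metric we look at next.

import time

latencies = []
for _ in range(100):
    start = time.perf_counter()
    predictor.predict(data={"inputs": "it 's a captivating and sometimes affecting journey ."})
    latencies.append(time.perf_counter() - start)

print(f"average client-side latency: {sum(latencies) / len(latencies) * 1000:.2f} ms")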

Let's inspect the performance in CloudWatch.

print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}")
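If you prefer to pull the numbers programmatically instead of using the console link above, a minimal sketch with boto3 looks like this (note that the ModelLatency metric is reported in microseconds):

import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch", region_name=sess.boto_region_name)

# fetch the average ModelLatency of the endpoint over the last 15 minutes
stats = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": predictor.endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=15),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(f"{point['Timestamp']}: {point['Average'] / 1000:.2f} ms")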

The average latency for our BERT model is 5-6 ms for a sequence length of 128.

Figure 1. Model Latency



Delete model and endpoint

To clean up, we can delete the model and endpoint.

predictor.delete_model()
predictor.delete_endpoint()



Conclusion

We successfully managed to compile a vanilla Hugging Face Transformers model to an AWS Inferentia-compatible Neuron model. After that, we deployed our Neuron model to Amazon SageMaker using the new Hugging Face Inference DLC. We achieved 5-6 ms latency per Neuron core, which is faster than a CPU in terms of latency, and achieves a higher throughput than GPUs since we ran 4 models in parallel.

If you or your company are currently using a BERT-like Transformer for encoder tasks (text-classification, token-classification, question-answering etc.), and the latency meets your requirements, you should switch to AWS Inferentia. This will not only save costs, but can also increase efficiency and performance for your models.

We are planning on doing a more detailed case study on the cost-performance of transformers in the future, so stay tuned!

Also, if you want to learn more about accelerating transformers, you should check out Hugging Face Optimum.


Thanks for reading! If you have any questions, feel free to contact me on Github or on the forum. You can also connect with me on Twitter or LinkedIn.




