notebook: sagemaker/18_inferentia_inference
The adoption of BERT and Transformers continues to grow. Transformer-based models are now not only achieving state-of-the-art performance in Natural Language Processing but also in Computer Vision, Speech, and Time-Series. 💬 🖼 🎤 ⏳
Companies are now slowly moving from the experimentation and research phase to the production phase in order to use transformer models for large-scale workloads. But by default, BERT and its friends are relatively slow, big, and complex models compared to traditional Machine Learning algorithms. Accelerating Transformers and BERT is, and will remain, an interesting challenge to solve in the future.
AWS's approach to solving this challenge was to design a custom machine learning chip optimized for inference workloads, called AWS Inferentia. AWS says that AWS Inferentia "delivers up to 80% lower cost per inference and up to 2.3X higher throughput than comparable current generation GPU-based Amazon EC2 instances."
The real value of AWS Inferentia instances compared to GPUs comes from the multiple Neuron Cores available on each device. A Neuron Core is the custom accelerator inside AWS Inferentia. Each Inferentia chip comes with 4x Neuron Cores. This lets you either load 1 model onto each core (for maximum throughput) or 1 model across all cores (for lower latency).
Tutorial
In this end-to-end tutorial, you will learn how to speed up BERT inference for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia.
You can find the notebook here: sagemaker/18_inferentia_inference
You will learn how to:
1. Convert your Hugging Face Transformer to AWS Neuron
2. Create a custom inference.py script for text-classification
3. Create and upload the neuron model and inference script to Amazon S3
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
5. Run and evaluate Inference performance of BERT on Inferentia
Let’s start! 🚀
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
1. Convert your Hugging Face Transformer to AWS Neuron
We are going to use the AWS Neuron SDK for AWS Inferentia. The Neuron SDK includes a deep learning compiler, runtime, and tools for converting and compiling PyTorch and TensorFlow models to Neuron-compatible models, which can be run on EC2 Inf1 instances.
As a first step, we need to install the Neuron SDK and the required packages.
Tip: If you are using Amazon SageMaker Notebook Instances or Studio, you can go with the conda_python3 conda kernel.
!pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
!pip install "torch-neuron==1.9.1.*" "neuron-cc[tensorflow]" "sagemaker>=2.79.0" "transformers==4.12.3" --upgrade
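If the installation succeeded, you can run a quick, optional sanity check that the Neuron extension is importable (you may need to restart the kernel first):
# optional sanity check: the Neuron extension should now be importable
import torch
import torch.neuron
print(torch.__version__)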
After we have installed the Neuron SDK we can load and convert our model. Neuron models are converted using torch_neuron with its trace method, similar to TorchScript. You can find more information in our documentation.
To be able to convert our model, we first need to select the model we want to use for our text classification pipeline from hf.co/models. For this example, let's go with distilbert-base-uncased-finetuned-sst-2-english, but this can easily be adjusted to other BERT-like models.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
At the time of writing, the AWS Neuron SDK does not support dynamic shapes, which means that the input size needs to be static for compiling and inference.
In simpler terms, this means that when the model is compiled with, for example, an input of batch size 1 and sequence length 16, the model can only run inference on inputs with that same shape.
When using a t2.medium instance, the compilation takes around 3 minutes.
import os
import tensorflow
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)

# create dummy input for max length 128
dummy_input = "dummy input which will be padded later"
max_length = 128
embeddings = tokenizer(dummy_input, max_length=max_length, padding="max_length", return_tensors="pt")
neuron_inputs = tuple(embeddings.values())

# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(model, neuron_inputs)
model.config.update({"traced_sequence_length": max_length})

# save tokenizer, neuron model and config for later use
save_dir = "tmp"
os.makedirs("tmp", exist_ok=True)
model_neuron.save(os.path.join(save_dir, "neuron_model.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)
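Note that we compile the model for a single Neuron core here, since we will later run one copy of the model per core for maximum throughput. If you wanted to optimize for latency instead, you could pipeline a single model across all 4 cores. A rough sketch, reusing model and neuron_inputs from above (the --neuroncore-pipeline-cores compiler flag and its behavior may change between Neuron SDK releases):
# sketch: compile one model across all 4 Neuron Cores for lower latency
# instead of one model per core
model_neuron_pipelined = torch.neuron.trace(
    model,
    neuron_inputs,
    compiler_args=["--neuroncore-pipeline-cores", "4"],
)
model_neuron_pipelined.save(os.path.join(save_dir, "neuron_model_pipelined.pt"))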
2. Create a custom inference.py script for text-classification
The Hugging Face Inference Toolkit supports zero-code deployments on top of the pipeline feature from 🤗 Transformers. This enables users to deploy Hugging Face transformers without an inference script [Example].
Currently, this feature is not supported with AWS Inferentia, which means we need to provide an inference.py script for running inference.
If you would be interested in support for zero-code deployments for Inferentia, let us know on the forum.
To use the inference script, we need to create an inference.py script. In our example, we are going to overwrite the model_fn to load our neuron model and the predict_fn to create a text-classification pipeline.
If you want to know more about the inference.py script, you should check out this example. It explains amongst other things what model_fn and predict_fn are.
!mkdir code
We are using NEURON_RT_NUM_CORES=1 to make sure that each HTTP worker uses 1 Neuron core to maximize throughput.
%%writefile code/inference.py

import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch.neuron

# to use one neuron core per worker
os.environ["NEURON_RT_NUM_CORES"] = "1"

# saved weights name
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"


def model_fn(model_dir):
    # load tokenizer, neuron model and config from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config


def predict_fn(data, model_tokenizer_model_config):
    # destruct model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # create embeddings for the inputs, padded to the traced sequence length
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for the neuron model
    neuron_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return a JSON-serializable list of label/score dictionaries
    return [{"label": model_config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]
3. Create and upload the neuron model and inference script to Amazon S3
Before we can deploy our neuron model to Amazon SageMaker, we need to create a model.tar.gz archive with all our model artifacts saved into tmp/, e.g. neuron_model.pt, and upload this to Amazon S3.
To do this we need to set up our permissions.
import sagemaker
import boto3

sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
Next, we create our model.tar.gz. The inference.py script will be placed into a code/ folder.
!cp -r code/ tmp/code/
%cd tmp
!tar zcvf model.tar.gz *
%cd ..
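To double-check what went into the archive, you can list its contents; you should see neuron_model.pt, the tokenizer files, config.json, and code/inference.py:
# list the contents of the archive we just created
!tar -tzf tmp/model.tar.gz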
Now we can upload our model.tar.gz to our session S3 bucket with sagemaker.
from sagemaker.s3 import S3Uploader
s3_model_path = f"s3://{sess.default_bucket()}/{model_id}"
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz",desired_s3_uri=s3_model_path)
print(f"model artifcats uploaded to {s3_model_uri}")
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
After we have uploaded our model.tar.gz to Amazon S3, we can create a custom HuggingFaceModel. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_model_uri,        # path to your model and script
    role=role,                      # iam role with permissions to create an Endpoint
    transformers_version="4.12",    # transformers version used
    pytorch_version="1.9",          # pytorch version used
    py_version='py37',              # python version used
)

# let SageMaker know that we have already compiled the model via neuron-cc
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,       # number of instances
    instance_type="ml.inf1.xlarge"  # AWS Inferentia instance
)
5. Run and evaluate Inference performance of BERT on Inferentia
The .deploy() returns a HuggingFacePredictor object which can be used to request inference.
data = {
"inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}
res = predictor.predict(data=data)
res
We managed to deploy our neuron compiled BERT to AWS Inferentia on Amazon SageMaker. Now, let’s test its performance. As a dummy load test, we are going to loop and send 10,000 synchronous requests to our endpoint.
for i in range(10000):
    resp = predictor.predict(
        data={"inputs": "it 's a captivating and sometimes affecting journey ."}
    )
Let's inspect the performance in CloudWatch.
print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}")
The average latency for our BERT model is 5-6 ms for a sequence length of 128.
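The ModelLatency metric in CloudWatch is measured on the SageMaker side and does not include the network round-trip from your client. If you also want a rough client-side number, which includes network and serialization overhead, you can time a few requests yourself. A minimal sketch using only the standard library:
import time

# time 100 synchronous requests from the client side (includes network overhead)
payload = {"inputs": "it 's a captivating and sometimes affecting journey ."}
latencies = []
for _ in range(100):
    start = time.perf_counter()
    predictor.predict(data=payload)
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"average client-side latency: {sum(latencies) / len(latencies):.2f} ms")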
Delete model and endpoint
To clean up, we can delete the model and endpoint.
predictor.delete_model()
predictor.delete_endpoint()
Conclusion
We successfully managed to compile a vanilla Hugging Face Transformers model to an AWS Inferentia-compatible Neuron model. After that we deployed our Neuron model to Amazon SageMaker using the new Hugging Face Inference DLC. We managed to achieve 5-6 ms latency per Neuron core, which is faster than a CPU in terms of latency, and achieves a higher throughput than GPUs since we ran 4 models in parallel.
If you or your company are currently using a BERT-like Transformer for encoder tasks (text-classification, token-classification, question-answering, etc.), and the latency meets your requirements, you should switch to AWS Inferentia. This will not only save costs, but can also increase the efficiency and performance of your models.
We are planning to do a more detailed case study on the cost-performance of transformers in the future, so stay tuned!
Also, if you want to learn more about accelerating transformers, you should check out Hugging Face Optimum.
Thanks for reading! If you have any questions, feel free to contact me through Github or on the forum. You can also connect with me on Twitter or LinkedIn.
