
Deploying LLMs On Amazon SageMaker With DJL Serving


Deploy BART on Amazon SageMaker Real-Time Inference

Image from Unsplash

Large Language Models (LLMs) and Generative AI continue to take over the Machine Learning and general tech space in 2023. With the LLM expansion has come an influx of new models that continue to improve at a stunning rate.

While the accuracy and performance of these models are incredible, they come with their own set of challenges when it comes to hosting. Without model hosting, it is difficult to realize the value that these LLMs provide in real-world applications. What are the specific challenges with LLM hosting and performance tuning?

  • How can we load these larger models, which are scaling up to hundreds of GBs in size?
  • How can we properly apply model partitioning techniques to efficiently utilize hardware without compromising on model accuracy?
  • How can we fit these models on a single GPU, or spread them across multiple?

These are all difficult questions that are addressed and abstracted away by a model server known as DJL Serving. DJL Serving is a high performance universal solution that integrates directly with various model partitioning frameworks such as HuggingFace Accelerate, DeepSpeed, and FasterTransformer. With DJL Serving you can configure your serving stack to utilize these partitioning frameworks to optimize inference at scale across multiple GPUs for these larger models.

In today's article in particular we explore one of the smaller language models, BART, for Feature Extraction. We will showcase how you can use DJL Serving to configure your serving stack and host a HuggingFace model of your choice. This example can serve as a template to build upon and utilize the aforementioned model partitioning frameworks. We will then take our DJL specific code and integrate it with SageMaker to create a Real-Time Endpoint that you can use for inference.

NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. The article also assumes an intermediate understanding of SageMaker Deployment; I'd suggest following this article for understanding Deployment/Inference more in depth.

What Is a Model Server? What Model Servers Does Amazon SageMaker Support?

Model Servers, at a very basic level, are "inference as a service". We need an easy way to expose our models via an API, and these model servers handle the grunt work behind the scenes. They load and unload our model artifacts and provide the runtime environment for the ML models that you are hosting. Model servers can also be tuned depending on what they expose to the user. For example, TensorFlow Serving gives you the choice of gRPC vs REST for your API calls.

Amazon SageMaker integrates with a wide range of these different model servers, which are also exposed via the various Deep Learning Containers that AWS provides. Some of these model servers include TensorFlow Serving, TorchServe, Multi Model Server (MMS), NVIDIA Triton Inference Server, and DJL Serving.

For this specific example we will utilize DJL Serving, as it is tailored for Large Language Model hosting with the different model partitioning frameworks it has enabled. That doesn't mean the server is limited to LLMs; you can also utilize it for other models as long as you properly configure the environment to install and load any other dependencies.

At a very high level, the main difference between model servers is the way you bake and shape the artifacts that you provide to the server, along with whatever model frameworks and environments each one supports.

DJL Serving vs JumpStart

In my previous article we explored how we could deploy Cohere's Language Models via SageMaker JumpStart. Why not use SageMaker JumpStart in this case? At the moment not all LLMs are supported by SageMaker JumpStart. In the case that there is a particular LLM that JumpStart does not support, it makes sense to use DJL Serving.

The other major use case for DJL Serving is customization and performance optimization. With JumpStart you are constrained to the model offering and whatever limitations exist with the container that has already been pre-baked for you. With DJL there is more code work at the container level, but you can apply the performance optimization techniques of your choice with the various partitioning frameworks that exist.

DJL Serving Setup

For this code example we will be utilizing an ml.c5.9xlarge SageMaker Classic Notebook Instance with a conda_amazonei_pytorch_latest_p37 kernel for development.

Before we get to the DJL Serving setup, we can quickly explore the BART model itself. This model can be found on the HuggingFace Model Hub and can be utilized for a variety of tasks such as Feature Extraction and Summarization. The following code snippet shows how you can utilize the BART Tokenizer and Model for a sample inference locally.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModel.from_pretrained("facebook/bart-large")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
last_hidden_states

Now we can map this model to DJL Serving with a few specific files. First we define a serving.properties file, which essentially defines the configuration for your model deployment. In this case we specify a few parameters.

  • Engine: We are utilizing Python for the DJL Engine; the other options here include DeepSpeed, FasterTransformer, and Accelerate.
  • Model_ID: On the HuggingFace Hub each model has a model_id that can be used as an identifier; we can feed this into our model script for model loading.
  • Task: For HuggingFace specific models you can include a task, as many of these models support various language tasks; in this case we specify Feature Extraction.
engine=Python
option.model_id=facebook/bart-large
option.task=feature-extraction

Other configurations you can specify for DJL include the tensor parallel degree and the minimum and maximum number of workers on a per model basis. For an extensive list of properties you can configure, please refer to the following documentation.
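As a rough sketch (property names taken from the DJL Serving documentation; the exact set available depends on your container version), a serving.properties that requests tensor parallelism and bounds the worker pool might look like the following:

engine=DeepSpeed
option.model_id=facebook/bart-large
option.task=feature-extraction
# split the model across 2 GPUs (assumes a multi-GPU instance)
option.tensor_parallel_degree=2
# bound the number of workers DJL Serving spins up for this model
minWorkers=1
maxWorkers=2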

The next files we provide are our actual model artifact and a requirements.txt for any additional libraries you will utilize in your inference script.

numpy

In this case we have no model artifacts, as we will directly load the model from the HuggingFace Hub in our inference script.

In our inference script (model.py) we can create a class that captures both model loading and inference.

import logging
from transformers import AutoTokenizer, AutoModel
from djl_python import Input, Output

class BartModel(object):
    """
    Deploying BART with DJL Serving
    """

    def __init__(self):
        self.initialized = False

Our initialize method will parse our serving.properties file and load the BART Model and Tokenizer from the HuggingFace Model Hub. The properties object essentially contains everything you have defined in the serving.properties file.

    def initialize(self, properties: dict):
        """
        Initialize model: pull the model_id and task from serving.properties
        and load the tokenizer and model from the HuggingFace Hub.
        """
        logging.info(properties)

        self.model_name = properties.get("model_id")
        self.task = properties.get("task")
        self.model = AutoModel.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.initialized = True

We then define an inference method which accepts a string input and tokenizes the text for the BART model inference, mirroring the local inference example above.

    def inference(self, inputs):
        """
        Custom service entry point function.

        :param inputs: the Input object holding the text for the BART model to infer upon
        :return: the Output object to be sent back
        """

        # sample input: "This is the sample text that I am passing in"

        try:
            data = inputs.get_as_string()
            inputs = self.tokenizer(data, return_tensors="pt")
            preds = self.model(**inputs)
            # convert to a JSON serializable object
            res = preds.last_hidden_state.detach().cpu().numpy().tolist()
            outputs = Output()
            outputs.add_as_json(res)
        except Exception as e:
            logging.exception("inference failed")
            # error handling
            outputs = Output().error(str(e))

        return outputs
We then instantiate this class and tie everything together in the "handle" method. By default for DJL Serving, this is the method that the handler looks for in the inference script.

_service = BartModel()


def handle(inputs: Input):
    """
    Default handler function
    """
    if not _service.initialized:
        # stateful model
        _service.initialize(inputs.get_properties())

    if inputs.is_empty():
        return None

    return _service.inference(inputs)

We now have all of the necessary artifacts on the DJL Serving side and can configure these files to fit the SageMaker constructs to create a Real-Time Endpoint.

SageMaker Endpoint Creation & Inference

For creating a SageMaker Endpoint the process is very similar to that of other model servers such as MMS. We need two artifacts to create a SageMaker Model Entity:

  • model.tar.gz: This will contain our DJL specific files, organized in the format the model server expects (see the layout sketched after this list).
  • Container Image: SageMaker Inference always expects a container; in this case we use the DJL DeepSpeed image provided and maintained by AWS.
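The tarball layout below is a minimal sketch of what we package in this example (there are no serialized model weights, since the model is pulled from the HuggingFace Hub at load time):

model.tar.gz
├── serving.properties   # DJL Serving configuration (engine, model_id, task)
├── model.py             # inference script with the handle() entry point
└── requirements.txt     # extra pip dependencies installed at container startup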

We can create our model tarball, upload it to S3, and then retrieve our image to get the artifacts ready for inference.

import subprocess
import sagemaker, boto3
from sagemaker import image_uris

# session setup (assumes the default SageMaker bucket and execution region)
region = boto3.Session().region_name
bucket = sagemaker.Session().default_bucket()
s3 = boto3.resource("s3")

# retrieve the DJL DeepSpeed image
img_uri = image_uris.retrieve(framework="djl-deepspeed",
                              region=region, version="0.21.0")

# create model tarball
bashCommand = "tar -cvpzf model.tar.gz model.py requirements.txt serving.properties"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

# upload tar.gz to the S3 bucket
model_artifacts = f"s3://{bucket}/model.tar.gz"
response = s3.meta.client.upload_file('model.tar.gz', bucket, 'model.tar.gz')

We can then utilize the Boto3 SDK to conduct our Model, Endpoint Configuration, and Endpoint creation. The only change from the usual three API calls is that in the Endpoint Configuration call we set the Model Download Timeout and Container Health Check Timeout parameters to higher values, as we are dealing with a larger model in this case. We also utilize a g5 family instance for the additional GPU compute power. For most LLMs, GPUs are necessary to be able to host models at this size and scale.

from time import gmtime, strftime

role = sagemaker.get_execution_role()
client = boto3.client(service_name="sagemaker")

model_name = "djl-bart" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)
create_model_response = client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": img_uri, "ModelDataUrl": model_artifacts},
)
print("Model Arn: " + create_model_response["ModelArn"])

endpoint_config_name = "djl-bart" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

production_variants = [
    {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InitialInstanceCount": 1,
        "InstanceType": "ml.g5.12xlarge",
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    }
]

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants,
}

endpoint_config_response = client.create_endpoint_config(**endpoint_config)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

endpoint_name = "djl-bart" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])
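Endpoint creation takes several minutes for a model of this size; a simple way to block until the endpoint is usable is the built-in boto3 waiter, sketched below (the polling settings are arbitrary).

# wait until the endpoint reaches the InService state before invoking it
waiter = client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=endpoint_name,
    WaiterConfig={"Delay": 30, "MaxAttempts": 60},  # poll every 30s, up to 30 minutes
)
print("Endpoint is InService")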

Once the endpoint has been created, we can perform a sample inference utilizing the invoke_endpoint API call, and you should see the array of last hidden states returned.

import json

runtime = boto3.client(service_name="sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/plain",
    Body="I feel my dog is absolutely cute!")
result = json.loads(response['Body'].read().decode())
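Since the endpoint returns the model's last hidden states as nested lists, a quick sanity check is to load the response back into a numpy array, as in the sketch below (the hidden size of 1024 is a property of bart-large; the sequence length depends on your input text):

import numpy as np

embeddings = np.array(result)
# expected shape: (1, sequence_length, 1024) for facebook/bart-large
print(embeddings.shape)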

Additional Resources & Conclusion

You can find the code for the entire example at the link above. LLM hosting is still a growing space with many challenges that DJL Serving can help simplify. Paired with the hardware and optimizations SageMaker provides, this can help enhance your inference performance for LLMs.

As always, feel free to leave any feedback or questions around the article. Thank you for reading, and stay tuned for more content in the LLM space.
