
Deploying Cohere Language Models On Amazon SageMaker


Scale and Host LLMs on AWS

Image from Unsplash by Sigmund

Large Language Models (LLMs) and Generative AI are accelerating Machine Learning growth across various industries. With LLMs the scope for Machine Learning has increased to incredible heights, but this has also been accompanied by a new set of challenges.

The size of LLMs leads to difficult problems in both the Training and Hosting portions of the ML lifecycle. Specifically for Hosting LLMs there are a myriad of challenges to think about. How can we fit a model into a single GPU for inference? How can we apply model compression and partitioning techniques without compromising accuracy? How can we improve inference latency and throughput for these LLMs?

Addressing many of these questions requires advanced ML Engineering, where we have to orchestrate model hosting on a platform that can apply compression and parallelization techniques at the container and hardware level. There are solutions such as DJL Serving that provide containers tuned for LLM hosting, but we will not explore them in this article.

In this article, we will explore SageMaker JumpStart Foundational Models. With Foundational Models we don't worry about containers or model parallelization and compression techniques, but focus purely on directly deploying a pre-trained model with the hardware of your choice. Specifically, in this article we will explore a popular LLM provider called Cohere and how we can host one of their popular language models on SageMaker for Inference.

NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. The article also assumes an intermediate understanding of SageMaker Deployment; I would suggest following this article for understanding Deployment/Inference more in depth. Specifically, for SageMaker JumpStart, I would reference the following blog.

What Is SageMaker JumpStart? What Are Foundational Models?

SageMaker JumpStart is, in essence, SageMaker's Model Zoo. There are a number of different pre-trained models that are already containerized and can be deployed via the SageMaker Python SDK. The main value here is that customers don't have to worry about tuning or configuring a container to host a particular model; that heavy lift is taken care of.

Specifically for LLMs, JumpStart Foundational Models were launched with popular language models from a variety of providers such as Stability AI and, in this case, Cohere. You can view a full list of the available Foundational Models on the SageMaker Console.

SageMaker JumpStart Foundational Models (Screenshot by Author)

These Foundational Models are also exposed via the AWS Marketplace, where you can subscribe to specific models that may not be accessible by default. In the case of Cohere's Medium model that we will be working with, this should be accessible via JumpStart without any subscription, but in case you do run into any issues you can request access at the following link.

Cohere Medium Language Model Deployment

For this example we will specifically explore how we can deploy Cohere's GPT Medium Language Model via SageMaker JumpStart. Before we start, we install the cohere-sagemaker SDK. This SDK further simplifies the deployment process as it builds a wrapper around the usual SageMaker Inference constructs (SageMaker Model, SageMaker Endpoint Configuration, and SageMaker Endpoint).

!pip install cohere-sagemaker --quiet

From this SDK we import the Client object that will help us create our endpoint and also perform inference.

from cohere_sagemaker import Client
import boto3

If we go to the Marketplace link we see that this model is accessible via a Model Package. Thus, for the next step we provide the Model Package ARN for the Cohere Medium model. Note that this specific model is currently only available in the us-east-1 and eu-west-1 regions.

# Currently us-east-1 and eu-west-1 only supported
model_package_map = {
"us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/cohere-gpt-medium-v1-5-15e34931a06235b7bac32dca396a970a",
"eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/cohere-gpt-medium-v1-5-15e34931a06235b7bac32dca396a970a",
}

region = boto3.Session().region_name
if region not in model_package_map.keys():
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

Now that we have our model package we can instantiate our Client object and create our endpoint. With JumpStart we have to provide our Model Package details, Instance Type and Count, as well as the Endpoint Name.

# instantiate client
co = Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-gpt-medium",
                   instance_type="ml.g5.xlarge", n_instances=1)

For language models such as Cohere, we mostly recommend GPU-based instances such as the g5 family, the p3/p2 family, or the g4dn instance class. These instances have enough compute and memory to be able to handle the size of these models. For further guidance you can also follow the Marketplace recommendation for the instance to use for the specific model you select.
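Endpoint creation can take several minutes. If you want to confirm that the endpoint has reached the InService state before sending traffic (the create_endpoint call may already wait for this, depending on your SDK version), a minimal check with boto3 looks like the following; the endpoint name matches the one we passed above.

import boto3

# Describe the endpoint we just created and print its current status;
# we expect "InService" once deployment has finished.
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(EndpointName="cohere-gpt-medium")["EndpointStatus"]
print(status)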

Next we perform a sample inference with the generate API call, which will generate text for the prompt we provide our endpoint with. This generate API call serves as a Cohere wrapper around the invoke_endpoint API call we traditionally see with SageMaker endpoints.

prompt = "Write a LinkedIn post about starting a profession in tech:"

# API Call
response = co.generate(prompt=prompt, max_tokens=100, temperature=0, return_likelihoods='GENERATION')
print(response.generations[0].text)
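To make the wrapper concrete, below is a rough sketch of what the same request could look like through the raw invoke_endpoint API with boto3. Note that the exact JSON payload schema the Cohere container expects is an assumption here; the generate call above handles that serialization for us, which is exactly why the wrapper is convenient.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# NOTE: the payload keys mirror the generate() arguments, but the exact schema
# the Cohere container expects is an assumption in this sketch.
payload = {"prompt": prompt, "max_tokens": 100, "temperature": 0}

response = runtime.invoke_endpoint(
    EndpointName="cohere-gpt-medium",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))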

Sample Inference (Screenshot by Author)

Parameter Tuning

For a detailed understanding of the different LLM parameters you can tune, I would reference Cohere's official article here. We primarily focus on tuning two different parameters that we saw in our generate API call.

  1. Max Tokens: Max tokens, as the name indicates, is the limit on the number of tokens our LLM can generate. What an LLM defines as a token varies; it can be a character, word, phrase, or more. Cohere utilizes byte-pair encoding for their tokens. To fully understand how their models define tokens please refer to the following documentation. In essence we can iterate on this parameter to find an optimal value, as we don't want a value that is too small (it won't properly answer our prompt) or one that is too large to the point where the response doesn't make much sense. Cohere's generation models support up to 2048 tokens.
  2. Temperature: The temperature parameter helps control the "creativity" of the model. For example, when one word is generated, there is a list of words with varying probabilities for the next word. When the temperature parameter is lower, the model tends to pick the word with the highest probability. When we increase the temperature, the responses tend to get a considerable amount of variety as the model starts choosing words with lower probabilities. This parameter ranges from 0 to 5 for this model.

First we can explore iterating on the max_tokens size. We create an array of 5 arbitrary token sizes and loop through them for inference while keeping the temperature constant.

token_range = [100, 200, 300, 400, 500]

for token in token_range:
    response = co.generate(prompt=prompt, max_tokens=token, temperature=0.9, return_likelihoods='GENERATION')
    print("-----------------------------------")
    print(response.generations[0].text)
    print("-----------------------------------")

As expected, we can see the difference in the length of each of the responses.

Token Size 200 (Screenshot by Author)
Token Size 300 (Screenshot by Author)
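If you would rather quantify the difference than eyeball the screenshots, a small variation of the loop above records a rough word count per setting (word count is only a proxy for token count, and the results dictionary here is just for illustration):

# Compare approximate response lengths across max_tokens settings.
results = {}
for token in token_range:
    response = co.generate(prompt=prompt, max_tokens=token, temperature=0.9,
                           return_likelihoods='GENERATION')
    results[token] = len(response.generations[0].text.split())

for token, word_count in results.items():
    print(f"max_tokens={token}: ~{word_count} words generated")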

We can also test the temperature parameter by iterating through values from 0 to 5.

for i in range(6):  # temperature values 0 through 5
    response = co.generate(prompt=prompt, max_tokens=100, temperature=i, return_likelihoods='GENERATION')
    print("-----------------------------------")
    print(response.generations[0].text)
    print("-----------------------------------")

We can see that at a value of 1 we get a very realistic output that makes sense for the most part.

Temperature 1 (Screenshot by Author)

At a temperature of 5 we see an output that makes some sense, but deviates heavily from the topic due to the word selection.

Temperature 5 (Screenshot by Author)

If you would like to test all the different combinations of these parameters to find your optimal configuration, you can also run the following code block.

import itertools

# Create array of all combinations of both params
temperature = [0, 1, 2, 3, 4, 5]
params = [token_range, temperature]
param_combos = list(itertools.product(*params))

for param in param_combos:
    response = co.generate(prompt=prompt, max_tokens=param[0],
                           temperature=param[1], return_likelihoods='GENERATION')
    print(f"max_tokens={param[0]}, temperature={param[1]}")
    print(response.generations[0].text)
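Once you are done experimenting, remember to tear down the endpoint so you are not billed for an idle GPU instance. Depending on the version, the cohere-sagemaker Client may expose its own cleanup helper; a minimal sketch using plain boto3 (looking up the endpoint config name rather than assuming how the SDK named it) is shown below.

import boto3

# Delete the endpoint and its endpoint configuration to stop incurring charges.
sm_client = boto3.client("sagemaker")
endpoint_name = "cohere-gpt-medium"
config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=config_name)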

Additional Resources & Conclusion

The code for the entire example can be found at the link above (stay tuned for more LLM and JumpStart examples). With SageMaker JumpStart's Foundational Models it becomes easy to host LLMs via an API call without doing the grunt work of containerizing and model serving. I hope this article was a useful introduction to LLMs with Amazon SageMaker; feel free to leave any feedback or questions as always.
