We’re excited to announce the general availability of Hugging Face Text Generation Inference (TGI) on AWS Inferentia2 and Amazon SageMaker.
Text Generation Inference (TGI) is a purpose-built solution for deploying and serving Large Language Models (LLMs) for production workloads at scale. TGI enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, and more. Text Generation Inference is used in production by companies such as Grammarly, Uber, Deutsche Telekom, and many more.
The integration of TGI into Amazon SageMaker, together with AWS Inferentia2, presents a powerful solution and viable alternative to GPUs for building production LLM applications. The seamless integration ensures easy deployment and maintenance of models, making LLMs more accessible and scalable for a wide variety of production use cases.
With the new TGI for AWS Inferentia2 on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low-latency LLM experiences like HuggingChat, OpenAssistant, and Serverless Endpoints for LLMs on the Hugging Face Hub.
Deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker
This tutorial shows how easy it is to deploy a state-of-the-art LLM, such as Zephyr 7B, on AWS Inferentia2 using Amazon SageMaker. Zephyr is a 7B fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO), as described in detail in the technical report. The model is released under the Apache 2.0 license, ensuring wide accessibility and use.
We will show you how to:
- Setup development environment
- Retrieve the TGI Neuronx Image
- Deploy Zephyr 7B to Amazon SageMaker
- Run inference and chat with the model
Let’s start.
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Zephyr to Amazon SageMaker. We need to make sure that we have an AWS account configured and the sagemaker Python SDK installed.
!pip install transformers "sagemaker>=2.206.0" --upgrade --quiet
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Retrieve TGI Neuronx Image
The new Hugging Face TGI Neuronx DLCs can be used to run inference on AWS Inferentia2. You can use the get_huggingface_llm_image_uri method of the sagemaker SDK to retrieve the appropriate Hugging Face TGI Neuronx DLC URI based on your desired backend, session, region, and version. You can find all available versions here.
Note: At the time of writing this blog post, the latest version of the Hugging Face LLM DLC is not yet available via the get_huggingface_llm_image_uri method. We are going to use the raw container URI instead.
from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
    version="0.0.20"
)

print(f"llm image uri: {llm_image}")
3. Deploy Zephyr 7B to Amazon SageMaker
Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can check the full list of supported models (text-generation) here.
Compiling LLMs for Inferentia2
At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify the sequence length and batch size ahead of time.
To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache, which contains pre-compiled configurations for the most popular LLMs. A cached configuration is defined by the model architecture (Mistral), model size (7B), neuron version (2.16), number of Inferentia cores (2), batch size (2), and sequence length (2048).
This means we don't need to compile the model ourselves; we can use the pre-compiled model from the cache. Examples of this are mistralai/Mistral-7B-v0.1 and HuggingFaceH4/zephyr-7b-beta. You can find compiled/cached configurations on the Hugging Face Hub. If your desired configuration is not yet cached, you can compile it yourself using the Optimum CLI or open a request on the Cache repository.
For this post, we re-compiled HuggingFaceH4/zephyr-7b-beta using the following commands and parameters on an inf2.8xlarge instance, and pushed it to the Hub at aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2:
# compile zephyr-7b-beta for 2 neuron cores, batch size 4, sequence length 2048
optimum-cli export neuron -m HuggingFaceH4/zephyr-7b-beta --batch_size 4 --sequence_length 2048 --num_cores 2 --auto_cast_type bf16 ./zephyr-7b-beta-neuron
# upload the compiled artifacts to the Hugging Face Hub, excluding compiler checkpoints
huggingface-cli upload aws-neuron/zephyr-7b-seqlen-2048-bs-4 ./zephyr-7b-beta-neuron ./ --exclude "checkpoint/**"
# push the tokenizer to the same repository
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta').push_to_hub('aws-neuron/zephyr-7b-seqlen-2048-bs-4')"
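If you prefer to stay in Python rather than the CLI, optimum-neuron exposes a NeuronModelForCausalLM class for the same export. The following is a minimal sketch, assuming the same compilation parameters as above and that optimum-neuron is installed on an Inferentia2 instance:

# Sketch: equivalent export from Python with optimum-neuron
from optimum.neuron import NeuronModelForCausalLM

# same settings as the optimum-cli command above
compiler_args = {"num_cores": 2, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 4, "sequence_length": 2048}

# export=True triggers the neuron compilation of the checkpoint
model = NeuronModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    export=True,
    **compiler_args,
    **input_shapes,
)
model.save_pretrained("./zephyr-7b-beta-neuron")

The saved directory can then be uploaded with huggingface-cli upload exactly as shown above.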
If you try to compile an LLM with a configuration that is not yet cached, it can take up to 45 minutes.
Deploying TGI Neuronx Endpoint
Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. We need to make sure the following additional parameters are defined:
- HF_NUM_CORES: Number of Neuron Cores used for the compilation.
- HF_BATCH_SIZE: The batch size that was used to compile the model.
- HF_SEQUENCE_LENGTH: The sequence length that was used to compile the model.
- HF_AUTO_CAST_TYPE: The auto cast type that was used to compile the model.
We still need to define the traditional TGI parameters:
- HF_MODEL_ID: The Hugging Face model ID.
- HF_TOKEN: The Hugging Face API token to access gated models.
- MAX_BATCH_SIZE: The maximum batch size that the model can handle, equal to the batch size used for compilation.
- MAX_INPUT_LENGTH: The maximum input length that the model can handle.
- MAX_TOTAL_TOKENS: The maximum total tokens the model can generate, equal to the sequence length used for compilation.
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.8xlarge"
health_check_timeout = 1800  # additional time to load the model

# Define Model and Endpoint configuration parameters
config = {
    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
    "HF_NUM_CORES": "2",            # number of neuron cores used for compilation
    "HF_BATCH_SIZE": "4",           # batch size used for compilation
    "HF_SEQUENCE_LENGTH": "2048",   # sequence length used for compilation
    "HF_AUTO_CAST_TYPE": "bf16",    # dtype used for compilation
    "MAX_BATCH_SIZE": "4",          # max batch size for the model
    "MAX_INPUT_LENGTH": "1512",     # max length of the input text
    "MAX_TOTAL_TOKENS": "2048",     # max length of generated text
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)
After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.inf2.8xlarge instance type.
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.
4. Run inference and chat with the model
After our endpoint is deployed, we can run inference on it using the predict method of the predictor. We can provide different parameters to influence the generation by adding them to the parameters attribute of the payload. You can find the supported parameters here, or in the open API specification of TGI in the swagger documentation.
HuggingFaceH4/zephyr-7b-beta is a conversational chat model, meaning we can chat with it using a prompt structure like the following:
<|system|>\nYou are a friendly.\n<|user|>\nInstruction\n<|assistant|>\n
Manually preparing the prompt is error-prone, so we can use the apply_chat_template method of the tokenizer to help with it. It expects a list of messages in the well-known OpenAI format and converts it into the correct format for the model. Let's see if Zephyr knows some facts about AWS.
from transformers import AutoTokenizer

# load the tokenizer from the compiled model repository
tokenizer = AutoTokenizer.from_pretrained("aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2")

# Prompt to generate
messages = [
    {"role": "system", "content": "You are the AWS expert"},
    {"role": "user", "content": "Can you tell me an interesting fact about AWS?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generation arguments
payload = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "return_full_text": False,
    "stop": ["</s>"]
}
chat = llm.predict({"inputs": prompt, "parameters": payload})

print(chat[0]["generated_text"][len(prompt):])
Awesome, we’ve successfully deployed Zephyr to Amazon SageMaker on Inferentia2 and chatted with it.
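If you later want to call the endpoint from another application without the sagemaker SDK, a minimal sketch using boto3 and the SageMaker runtime looks like the following (the endpoint name is available as llm.endpoint_name; prompt and payload are the same as above):

import json
import boto3

# SageMaker runtime client for invoking deployed endpoints
smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint(
    EndpointName=llm.endpoint_name,      # name of the deployed endpoint
    ContentType="application/json",
    Body=json.dumps({"inputs": prompt, "parameters": payload}),
)
print(json.loads(response["Body"].read().decode("utf-8")))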
5. Clean up
To clean up, we can delete the model and endpoint.
llm.delete_model()
llm.delete_endpoint()
Conclusion
The integration of Hugging Face Text Generation Inference (TGI) with AWS Inferentia2 and Amazon SageMaker provides a cost-effective alternative for deploying Large Language Models (LLMs).
We’re actively working on supporting more models, streamlining the compilation process, and refining the caching system.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
