We are excited to announce that the new Hugging Face Embedding Container for Amazon SageMaker is now generally available (GA). AWS customers can now efficiently deploy embedding models on SageMaker to build Generative AI applications, including Retrieval-Augmented Generation (RAG) applications.
In this blog post we will show you how to deploy open embedding models, like Snowflake/snowflake-arctic-embed-l, BAAI/bge-large-en-v1.5 or sentence-transformers/all-MiniLM-L6-v2, to Amazon SageMaker for inference using the new Hugging Face Embedding Container. We will deploy Snowflake/snowflake-arctic-embed-m-v1.5, one of the best open embedding models for retrieval – you can check its rankings on the MTEB Leaderboard.
The example covers:
1. Setup development environment
2. Retrieve the new Hugging Face Embedding Container
3. Deploy Snowflake Arctic to Amazon SageMaker
4. Run and evaluate Inference performance
5. Delete model and endpoint
What’s the Hugging Face Embedding Container?
The Hugging Face Embedding Container is a new purpose-built inference container that makes it easy to deploy embedding models in a secure and managed environment. The DLC is powered by Text Embeddings Inference (TEI), a blazing-fast and memory-efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI implements many features such as:
- No model graph compilation step
- Small docker images and fast boot times
- Token based dynamic batching
- Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
- Safetensors weight loading
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
TEI supports the most popular embedding model architectures; see the TEI documentation for the full list of supported models.
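If you are unsure whether a model on the Hub uses a supported architecture, one quick way to check is to load its config with transformers and inspect the architectures field. This is an illustrative check, assuming transformers is installed; it is not required for the deployment below:

```python
# Illustrative check: inspect which architecture a Hub model uses
# before deploying it with TEI (only downloads the model's config.json).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Snowflake/snowflake-arctic-embed-m-v1.5")
print(config.architectures)  # e.g. ['BertModel'] for BERT-based embedding models
```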
Let's get started!
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Snowflake Arctic to Amazon SageMaker. We need to make sure we have an AWS account configured and the sagemaker Python SDK installed.
!pip install "sagemaker>=2.221.1" --upgrade --quiet
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if no bucket name is given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Retrieve the new Hugging Face Embedding Container
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel class via the image_uri parameter. To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK, which returns the URI for the desired Hugging Face Embedding Container. Note that TEI ships two different images, one for GPU and one for CPU, so we create a helper function to retrieve the correct image URI based on the instance type.
from sagemaker.huggingface import get_huggingface_llm_image_uri
def get_image_uri(instance_type):
    key = "huggingface-tei" if instance_type.startswith("ml.g") or instance_type.startswith("ml.p") else "huggingface-tei-cpu"
    return get_huggingface_llm_image_uri(key, version="1.2.3")
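To sanity check the helper, you can print the image URIs it resolves for a GPU and a CPU instance type (purely illustrative; the exact URIs depend on your region):

```python
# Illustrative: the helper picks the GPU image for ml.g*/ml.p* instances
# and the CPU image for everything else.
print(get_image_uri("ml.g5.xlarge"))    # resolves the huggingface-tei (GPU) image
print(get_image_uri("ml.c6i.2xlarge"))  # resolves the huggingface-tei-cpu (CPU) image
```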
3. Deploy Snowflake Arctic to Amazon SageMaker
To deploy Snowflake/snowflake-arctic-embed-m-v1.5 to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the HF_MODEL_ID, instance_type, etc. We will use a c6i.2xlarge instance type, which has 4 Intel Ice Lake vCPUs, 8GB of memory and costs around $0.204 per hour.
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.c6i.2xlarge"

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "Snowflake/snowflake-arctic-embed-m-v1.5",  # model_id from hf.co/models
}

# create HuggingFaceModel with the Hugging Face Embedding Container image uri
emb_model = HuggingFaceModel(
    role=role,
    image_uri=get_image_uri(instance_type),
    env=config
)
Once we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.c6i.2xlarge instance type.
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)
SageMaker will now create our endpoint and deploy the model to it. This may take ~5 minutes.
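As a side note, if your notebook session restarts while the endpoint is still running, you can re-attach to it with a HuggingFacePredictor instead of redeploying. A minimal sketch, where the endpoint name is a placeholder you would replace with your own (e.g. the value of emb.endpoint_name):

```python
# Minimal sketch: re-attach to an already running endpoint after a kernel restart.
from sagemaker.huggingface import HuggingFacePredictor

emb = HuggingFacePredictor(
    endpoint_name="<your-endpoint-name>",  # placeholder for your endpoint name
    sagemaker_session=sess,
)
```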
4. Run and evaluate Inference performance
After our endpoint is deployed, we can run inference on it using the predict method from the predictor.
data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)

print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")
Awesome! Now that we can generate embeddings, let's test the performance of our model.
We will send 3,900 requests to our endpoint using threading with 10 concurrent threads, and measure the average latency and throughput of the endpoint. Each request uses an input of 256 tokens, for a total of ~1 million tokens. We chose 256 tokens as the input length to strike a balance between shorter and longer inputs.
Note: When running the load test, the requests are sent from Europe, while the endpoint is deployed in us-east-1. This adds network latency overhead to the requests.
import threading
import time

number_of_threads = 10
number_of_requests = int(3900 // number_of_threads)
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")

def send_requests():
    for _ in range(number_of_requests):
        # input with ~256 tokens
        emb.predict(data={"inputs": "Hugging Face is a company and a popular platform in the field of natural language processing (NLP) and machine learning. They are known for their contributions to the development of state-of-the-art models for various NLP tasks and for providing a platform that facilitates the sharing and usage of pre-trained models. One of the key offerings from Hugging Face is the Transformers library, which is an open-source library for working with a wide range of pre-trained transformer models, including those for text generation, translation, summarization, question answering, and more. The library is widely used in the research and development of NLP applications and is supported by a large and active community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers to integrate and use these models in their own projects. The company has played a significant role in advancing the field of NLP and making cutting-edge models more accessible to the broader community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers and ma"})

# create and start all threads, then wait for them to finish
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
start = time.time()
[t.start() for t in threads]
[t.join() for t in threads]
print(f"total time: {round(time.time() - start)} seconds")
Sending 3,900 requests, or embedding 1 million tokens, took around 841 seconds. This means we can run around ~5 requests per second, keeping in mind that this includes the network latency from Europe to us-east-1. When we inspect the latency of the endpoint through CloudWatch, we can see that the latency of our embedding model is 2s at 10 concurrent requests. This is very impressive for a small CPU instance, which costs ~$150 per month. You can deploy the model to a GPU instance to get faster inference times.
Note: We ran the same test on an ml.g5.xlarge instance with 1x NVIDIA A10G GPU. Embedding 1 million tokens took around 30 seconds, which means we can run around ~130 requests per second. The latency of the endpoint is 4ms at 10 concurrent requests. The ml.g5.xlarge costs around $1.408 per hour on Amazon SageMaker.
GPU instances are much faster than CPU instances, but they are also more expensive. If you need to bulk process embeddings, you can use a GPU instance; if you want to run a small endpoint with low costs, you can use a CPU instance. We plan to work on a dedicated benchmark for the Hugging Face Embedding Container in the future.
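To put that trade-off into numbers, here is a quick back-of-the-envelope calculation based on the measurements above and the approximate hourly prices mentioned in this post (~841 seconds at ~$0.204/h on the CPU instance, ~30 seconds at ~$1.408/h on the GPU instance):

```python
# Back-of-the-envelope throughput and cost per ~1M embedded tokens,
# using the numbers measured in this post.
requests = 3900

for name, seconds, price_per_hour in [
    ("ml.c6i.2xlarge (CPU)", 841, 0.204),
    ("ml.g5.xlarge (GPU)", 30, 1.408),
]:
    throughput = requests / seconds          # requests per second
    cost = price_per_hour * seconds / 3600   # cost to embed ~1M tokens
    print(f"{name}: {throughput:.1f} req/s, ~${cost:.4f} per ~1M tokens")
```

Under these assumptions, the GPU instance is not only faster but also cheaper per embedded token for bulk workloads, while the CPU instance has the lower hourly baseline cost.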
print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{emb.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{emb.endpoint_name}")
5. Delete model and endpoint
To clean up, we can delete the model and endpoint.
emb.delete_model()
emb.delete_endpoint()
Conclusion
The new Hugging Face Embedding Container enables you to easily deploy open embedding models such as Snowflake/snowflake-arctic-embed-l to Amazon SageMaker for inference. We walked through setting up the development environment, retrieving the container, deploying the model, and evaluating its inference performance.
With this new container, customers can now easily deploy high-performance embedding models, enabling the creation of sophisticated Generative AI applications with improved efficiency. We are excited to see what you build with the new Hugging Face Embedding Container for Amazon SageMaker. If you have any questions or feedback, please let us know.

