AWS Inferentia2 is the latest AWS machine learning chip, available through Amazon EC2 Inf2 instances on Amazon Web Services. Designed from the ground up for AI workloads, Inf2 instances offer great performance and price/performance for production workloads.
For over a year now, we have been working with the product and engineering teams at AWS to make the performance and cost-efficiency of AWS Trainium and Inferentia chips available to Hugging Face users. Our open-source library optimum-neuron makes it easy to train and deploy Hugging Face models on these accelerators. You can read more about our work accelerating transformers, large language models and text-generation-inference (TGI).
Today, we are making the power of Inferentia2 directly and widely available to Hugging Face Hub users.
Enabling over 100,000 models on AWS Inferentia2 with Amazon SageMaker
A few months ago, we introduced a new way to deploy Large Language Models (LLMs) on SageMaker, with a new Inferentia/Trainium option for supported models like Meta Llama 3. You can deploy a Llama 3 model on Inferentia2 instances on SageMaker to serve inference at scale, and benefit from SageMaker's complete set of fully managed features for building and fine-tuning models, MLOps, and governance.
Today, we are expanding support for this deployment experience to over 100,000 public models available on Hugging Face, including 14 new model architectures (albert, bert, camembert, convbert, deberta, deberta-v2, distilbert, electra, roberta, mobilebert, mpnet, vit, xlm, xlm-roberta) and 6 new machine learning tasks (text-classification, text-generation, token-classification, fill-mask, question-answering, feature-extraction).
With a few simple lines of code, AWS customers can easily deploy these models on Inferentia2 instances in Amazon SageMaker, as in the sketch below.
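Here is a minimal sketch of such a deployment using the SageMaker Python SDK. The model ID, task, framework versions, and instance size are illustrative assumptions; the exact configuration for a given model is shown in its deployment snippet on the Hub.

```python
# Minimal sketch: deploy a supported Hub model on an Inferentia2 (Inf2) instance
# with Amazon SageMaker. Model ID, task, versions, and instance type are
# illustrative; use the values from the model's deployment snippet on the Hub.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hub model configuration (additional Neuron compilation settings may be
# required depending on the model; see the snippet on the model card)
hub = {
    "HF_MODEL_ID": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.36.2",  # example versions; pick a combination
    pytorch_version="1.13.1",       # listed for the Neuron inference containers
    py_version="py310",
)

# Deploy the model on an Inferentia2 instance
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)

print(predictor.predict({"inputs": "I love using Inferentia2 with SageMaker!"}))
```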
Hugging Face Inference Endpoints introduces support for AWS Inferentia2
The easiest option to deploy models from the Hub is Hugging Face Inference Endpoints. Today, we are happy to introduce new Inferentia2 instances for Hugging Face Inference Endpoints. So now, when you find a model on Hugging Face you are interested in, you can deploy it on Inferentia2 in just a few clicks. All you need to do is select the model you want to deploy, select the new Inf2 instance option under the Amazon Web Services instance configuration, and you are off to the races.
For supported models like Llama 3, you can select between 2 flavors:
- Inf2-small, with 2 cores and 32 GB memory ($0.75/hour), perfect for Llama 3 8B
- Inf2-xlarge, with 24 cores and 384 GB memory ($12/hour), perfect for Llama 3 70B
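If you prefer to script the deployment instead of using the UI, the huggingface_hub library provides a create_inference_endpoint helper. The sketch below is illustrative only: the accelerator, instance_type, and instance_size identifiers used for Inf2 are assumptions, so verify them against the options shown in the Inference Endpoints UI for your account.

```python
# Illustrative sketch: create an Inferentia2-backed Inference Endpoint from code.
# The "neuron" / "inf2" / "x1" identifiers are assumptions; check the values
# exposed in the Inference Endpoints UI before relying on them.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama-3-8b-inf2",                                 # endpoint name
    repository="meta-llama/Meta-Llama-3-8B-Instruct",  # model to deploy
    framework="pytorch",
    task="text-generation",
    vendor="aws",
    region="us-east-1",
    type="protected",
    accelerator="neuron",  # assumption
    instance_type="inf2",  # assumption
    instance_size="x1",    # assumption
)

endpoint.wait()      # block until the endpoint is running
print(endpoint.url)  # URL to send inference requests to
```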
Hugging Face Inference Endpoints are billed by the second of capacity used, with costs scaling up with replica autoscaling, and down to zero with scale to zero, both automated and enabled with easy-to-use settings.
Inference Endpoints uses Text Generation Inference for Neuron (TGI) to run Llama 3 on AWS Inferentia. TGI is a purpose-built solution for deploying and serving Large Language Models (LLMs) for production workloads at scale, supporting continuous batching, streaming, and much more. In addition, LLMs deployed with Text Generation Inference are compatible with the OpenAI SDK Messages API, so if you already have Gen AI applications integrated with LLMs, you don't need to change your application code; you just need to point it to your new endpoint deployed with Hugging Face Inference Endpoints.
After you deploy your endpoint on Inferentia2, you can send requests using the Widget provided in the UI or the OpenAI SDK.
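For example, here is a minimal sketch of querying the endpoint with the OpenAI Python SDK; the endpoint URL and token are placeholders for your own deployment.

```python
# Minimal sketch: query an Inference Endpoint running TGI via the OpenAI SDK.
# Replace base_url and api_key with your endpoint URL and Hugging Face token.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1/",  # placeholder
    api_key="hf_xxx",                                                    # your HF token
)

chat_completion = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; this name is not used for routing
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is Inferentia2 a good fit for LLM inference?"},
    ],
    stream=True,
    max_tokens=256,
)

# Print the streamed tokens as they arrive
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```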
What's Next
We are working hard to expand the range of models that can be deployed on AWS Inferentia2 with Hugging Face Inference Endpoints. Next, we want to add support for Diffusion and Embedding models, so you can generate images and build semantic search and recommendation systems leveraging the acceleration of AWS Inferentia2 and the ease of use of Hugging Face Inference Endpoints.
In addition, we continue our work to improve performance for Text Generation Inference (TGI) on Neuronx, ensuring faster and more efficient LLM deployments on AWS Inferentia2 in our open source libraries. Stay tuned for these updates as we continue to enhance our capabilities and optimize your deployment experience!


