Why we’re switching to Hugging Face Inference Endpoints, and maybe you should too

Matthew Upson


Hugging Face recently launched Inference Endpoints, which, as they put it, “solves transformers in production”. Inference Endpoints is a managed service that allows you to:

  • Deploy (almost) any model on Hugging Face Hub
  • To any cloud (AWS and Azure, with GCP on the way)
  • On a variety of instance types (including GPU)
We’re switching a few of our Machine Learning (ML) models that do inference on a CPU over to this new service. This blog is about why, and why you might also want to consider it.
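To give a flavour of what that means in practice, once a model is deployed, getting predictions is a single authenticated HTTP call. Here is a minimal sketch in Python; the endpoint URL and token are placeholders:

import os
import requests

# Placeholders: use your own endpoint URL (shown in the Inference Endpoints GUI)
# and a Hugging Face access token with permission to call the endpoint.
ENDPOINT_URL = "https://xxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud"
HF_TOKEN = os.environ["HF_TOKEN"]

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Hugging Face Inference Endpoints are great!"},
)
response.raise_for_status()

# For a text classification model this returns a list of label/score dicts.
print(response.json())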



What were we doing?

The models that we have converted to Inference Endpoints were previously managed internally, running on AWS Elastic Container Service (ECS) backed by AWS Fargate. This gives you a serverless cluster that can run container-based tasks. Our process was, roughly:

  • Train and evaluate the model
  • Wrap the model in a bespoke API and build a container image
  • Push the image to AWS Elastic Container Registry (ECR)
  • Deploy the container as an ECS task running on Fargate

Now, you can reasonably argue that ECS was not the best approach to serving ML models, but it has served us well up to now, and it also allowed ML models to sit alongside other container-based services, which reduced cognitive load.
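For a sense of what managing this ourselves involved, here is a heavily simplified, illustrative sketch (not our actual service; the model id is a placeholder) of the kind of serving code we had to write and maintain:

# app.py: a minimal, illustrative inference API of the kind we ran on ECS.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Placeholder model id; our real service loaded our own fine-tuned model.
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

class Request(BaseModel):
    text: str

@app.post("/predict")
def predict(request: Request):
    # Returns a list of {"label": ..., "score": ...} dicts.
    return classifier(request.text)

Everything around this (the Dockerfile, the ECR push, the ECS task definition, and monitoring) was ours to look after.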



What do we do now?

With Inference Endpoints, our flow looks like this:

  • Push the model to the Hugging Face Hub
  • Create an Inference Endpoint from the GUI (or the API)
  • Call the endpoint from our application

So this is significantly simpler. We could also use another managed service such as SageMaker, Seldon, or BentoML, but since we’re already uploading our models to the Hugging Face Hub to act as a model registry, and we’re pretty invested in Hugging Face’s other tools (like transformers and AutoTrain), using Inference Endpoints makes a lot of sense for us.
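As a reminder, getting a fine-tuned transformers model onto the Hub (so it can act as our model registry) only takes a couple of lines; the local path and repository name below are placeholders:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a fine-tuned model from a local directory (path is a placeholder).
model = AutoModelForSequenceClassification.from_pretrained("./my-finetuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")

# Push both to the Hugging Face Hub; requires `huggingface-cli login` (or an HF token).
model.push_to_hub("my-org/my-text-classifier", private=True)
tokenizer.push_to_hub("my-org/my-text-classifier", private=True)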



What about Latency and Stability?

Before switching to Inference Endpoints, we tested the different CPU endpoint types using ab (Apache Bench).

We didn’t test ECS as extensively, but we know that a large container had a latency of about ~200 ms when called from an instance in the same region. The tests we ran against Inference Endpoints were based on a text classification model fine-tuned on RoBERTa, with the following test parameters (a rough Python equivalent of the test is sketched after the list):

  • Requester region: eu-east-1
  • Requester instance size: t3-medium
  • Inference endpoint region: eu-east-1
  • Endpoint Replicas: 1
  • Concurrent connections: 1
  • Requests: 1000 (1000 requests in 1–2 minutes even from a single connection would represent very heavy use for this particular application)
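We used ab for the actual runs, but a rough Python equivalent of the setup above (1000 sequential requests over a single connection; URL and token are placeholders) would be:

import os
import statistics
import time
import requests

ENDPOINT_URL = "https://xxxxxxxx.aws.endpoints.huggingface.cloud"  # placeholder

session = requests.Session()  # a single connection, reused for every request
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}", "Content-Type": "application/json"}
payload = {"inputs": "An example sentence to classify."}

latencies = []
for _ in range(1000):
    start = time.perf_counter()
    session.post(ENDPOINT_URL, headers=headers, json=payload).raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"mean: {statistics.mean(latencies):.0f} ms ± {statistics.stdev(latencies):.0f} ms")
print(f"max: {max(latencies):.0f} ms, total request time: {sum(latencies) / 1000:.0f} s")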

The following table shows latency (ms ± standard deviation) and the time to complete the test (in seconds) for four Intel Ice Lake equipped CPU endpoints.

size   | vCPU (cores) | Memory (GB) | ECS (ms) | 🤗 (ms)
-------|--------------|-------------|----------|------------------
small  | 1            | 2           | –        | ~296
medium | 2            | 4           | –        | 156 ± 51 (158 s)
large  | 4            | 8           | ~200     | 80 ± 30 (80 s)
xlarge | 8            | 16          | –        | 43 ± 31 (43 s)

What we see from these results is pretty encouraging. The application that will consume these endpoints serves requests in real time, so we need latency to be as low as possible. We can see that the vanilla Hugging Face container was more than twice as fast as our bespoke container running on ECS; the slowest response we received from the large Inference Endpoint was just 108 ms.



What about the price?

So how much does this all cost? The table below shows a monthly price comparison (USD) between what we were doing previously (ECS + Fargate) and Inference Endpoints.

size   | vCPU | Memory (GB) | ECS ($/month) | 🤗 ($/month) | % diff
-------|------|-------------|---------------|--------------|-------
small  | 1    | 2           | $33.18        | $43.80       | 24%
medium | 2    | 4           | $60.38        | $87.61       | 31%
large  | 4    | 8           | $114.78       | $175.22      | 34%
xlarge | 8    | 16          | $223.59       | $350.44      | 50%

There are a few things to say about this. Firstly, we want a managed solution for deployment: we don’t have a dedicated MLOps team (yet), so we’re looking for a solution that helps us minimize the time we spend on deploying models, even if it costs slightly more than handling the deployments ourselves.

Inference Endpoints are more expensive than what we were doing before, with an increased cost of between 24% and 50%. At the scale we’re currently operating, this extra cost (a difference of ~$60 a month for a large CPU instance) is nothing compared with the time and cognitive load we’re saving by not having to worry about APIs and containers. If we were deploying hundreds of ML microservices we would probably want to re-evaluate, but that would be true of many approaches to hosting.



Some notes and caveats:

  • You can find pricing for Inference Endpoints here, but a different number is displayed when you deploy a new endpoint from the GUI. I’ve used the latter, which is higher.
  • The values I present in the table for ECS + Fargate are an underestimate, but probably not by much. I took them from the Fargate pricing page, and they include just the cost of hosting the instance. I’m not including data ingress/egress (probably the biggest item being downloading the model from the Hugging Face Hub), nor have I included the costs associated with ECR.



Other considerations



Deployment Options

Currently you can deploy an Inference Endpoint from the GUI or using a RESTful API. You can also make use of our command line tool hugie (which will be the topic of a future blog) to launch Inference Endpoints in a single line of code by passing a configuration; it’s really this easy:

hugie endpoint create example/development.json

For me, what’s missing is a custom Terraform provider. It’s all well and good deploying an Inference Endpoint from a GitHub Action using hugie, as we do, but it would be better if we could use the awesome state machine that is Terraform to keep track of them. I’m pretty sure that somebody (if not Hugging Face) will write one soon enough; if not, we will.
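For completeness, creating an endpoint programmatically through the RESTful API is a single POST to the Inference Endpoints API. The payload below is a sketch based on my reading of the API, so the field names and values (instance type, region, and so on) should be checked against the Inference Endpoints API documentation rather than taken as gospel:

import os
import requests

NAMESPACE = "my-org"  # your Hugging Face user or organisation (placeholder)
API_URL = f"https://api.endpoints.huggingface.cloud/v2/endpoint/{NAMESPACE}"

# Illustrative payload: verify field names and allowed values against the
# Inference Endpoints API docs before relying on this.
payload = {
    "name": "my-text-classifier",
    "type": "protected",
    "provider": {"vendor": "aws", "region": "us-east-1"},
    "compute": {
        "accelerator": "cpu",
        "instanceType": "intel-icl",
        "instanceSize": "medium",
        "scaling": {"minReplica": 1, "maxReplica": 1},
    },
    "model": {
        "repository": "my-org/my-text-classifier",
        "revision": "main",
        "framework": "pytorch",
        "task": "text-classification",
        "image": {"huggingface": {}},
    },
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
    json=payload,
)
response.raise_for_status()
print(response.json())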



Hosting multiple models on a single endpoint

Philipp Schmid posted a really nice blog about how to write a custom Endpoint Handler class to allow you to host multiple models on a single endpoint, potentially saving you quite a bit of money. His blog was about GPU inference, and the only real limitation is how many models you can fit into the GPU memory. I assume this would also work for CPU instances, though I’ve not tried it yet.
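Following the pattern in Philipp’s post, the idea is to add a handler.py with an EndpointHandler class to the model repository and route each request to one of several loaded models. Something along these lines (the second model id and the "model" request field are my own illustrative conventions, not part of the API):

# handler.py, placed in the model repository on the Hub.
from typing import Any, Dict, List
from transformers import pipeline

class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local copy of this repository on the endpoint.
        self.classifier = pipeline("text-classification", model=path)
        # A second model loaded from the Hub (placeholder repository id).
        self.sentiment = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data["inputs"]
        # Route on a custom request field; "model" is our own convention.
        if data.get("model") == "sentiment":
            return self.sentiment(inputs)
        return self.classifier(inputs)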



To conclude…

We find Hugging Face Inference Endpoints to be a very simple and convenient way to deploy transformer (and sklearn) models into an endpoint so that they can be consumed by an application. Whilst they cost slightly more than the ECS approach we were using before, it’s well worth it, because it saves us time thinking about deployment and lets us focus on the thing we want to: building NLP solutions for our clients to help solve their problems.

If you’re interested in Hugging Face Inference Endpoints for your company, please contact us here – our team will be in touch to discuss your requirements!

This article was originally published on February 15, 2023 on Medium.


