An Overview of Inference Solutions on Hugging Face

Julien Simon

Every day, developers and organizations are adopting models hosted on Hugging Face to turn ideas into proof-of-concept demos, and demos into production-grade applications. For instance, Transformer models have become a popular architecture for a wide range of machine learning (ML) applications, including natural language processing, computer vision, speech, and more. Recently, diffusers have become a popular architecture for text-to-image or image-to-image generation. Other architectures are popular for other tasks, and we host all of them on the HF Hub!

At Hugging Face, we're obsessed with simplifying ML development and operations without compromising on state-of-the-art quality. In this respect, the ability to test and deploy the latest models with minimal friction is critical, all along the lifecycle of an ML project. Optimizing the cost-performance ratio is equally important, and we'd like to thank our friends at Intel for sponsoring our free CPU-based inference solutions. This is another major step in our partnership. It's also great news for our user community, who can now benefit from the speedup delivered by the Intel Xeon Ice Lake architecture at zero cost.

Now, let’s review your inference options with Hugging Face.



Free Inference Widget

One of my favorite features on the Hugging Face Hub is the Inference Widget. Located on the model page, the Inference Widget lets you upload sample data and predict it in a single click.

Here’s a sentence similarity example with the sentence-transformers/all-MiniLM-L6-v2 model:



It's the best way to quickly get a feel for what a model does, its output, and how it performs on a few samples from your dataset. The model is loaded on demand on our servers and unloaded when it's not needed anymore. You don't have to write any code and the feature is free. What's not to like?



Free Inference API

The Inference API is what powers the Inference Widget under the hood. With a simple HTTP request, you can load any Hub model and predict your data with it in seconds. The model URL and a valid Hub token are all you need.

Here’s how I can load and predict with the xlm-roberta-base model in a single line:

curl https://api-inference.huggingface.co/models/xlm-roberta-base \
    -X POST \
    -d '{"inputs": "The answer to the universe is <mask>."}' \
    -H "Authorization: Bearer HF_TOKEN"

The Inference API is the simplest way to build a prediction service that you can immediately call from your application during development and tests. No need for a bespoke API or a model server. In addition, you can instantly switch from one model to the next and compare their performance in your application. And guess what? The Inference API is free to use.

As rate limiting is enforced, we don't recommend using the Inference API for production. Instead, you should consider Inference Endpoints.



Production with Inference Endpoints

Once you're happy with the performance of your ML model, it's time to deploy it for production. Unfortunately, when leaving the sandbox, everything becomes a concern: security, scaling, monitoring, etc. This is where a lot of ML projects stumble and sometimes fall.
We built Inference Endpoints to solve this problem.

In just a few clicks, Inference Endpoints let you deploy any Hub model on secure and scalable infrastructure, hosted in the AWS or Azure region of your choice. Additional settings include CPU and GPU hosting, built-in auto-scaling, and more. This makes finding the appropriate cost/performance ratio easy, with pricing starting as low as $0.06 per hour.

Inference Endpoints support three security levels:

  • Public: the endpoint runs in a public Hugging Face subnet, and anyone on the Internet can access it without any authentication.

  • Protected: the endpoint runs in a public Hugging Face subnet, and anyone on the Internet with the appropriate Hugging Face token can access it (see the example after this list).

  • Private: the endpoint runs in a private Hugging Face subnet and is not accessible on the Internet. It's only available through a private connection in your AWS or Azure account. This will satisfy the strictest compliance requirements.
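
For example, here's a minimal sketch of how a client could call a Protected endpoint with a token. The endpoint URL below is a placeholder; use the URL displayed on your endpoint's page, and assume your token is stored in the HF_TOKEN environment variable:

    import os
    import requests

    # Placeholder URL: replace with the URL shown for your own endpoint.
    ENDPOINT_URL = "https://my-endpoint.us-east-1.aws.endpoints.huggingface.cloud"
    headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

    # A Protected endpoint rejects requests that don't carry a valid token.
    response = requests.post(
        ENDPOINT_URL,
        headers=headers,
        json={"inputs": "The answer to the universe is <mask>."},
    )
    print(response.json())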



To learn more about Inference Endpoints, please read this tutorial and the documentation.



Spaces

Finally, Spaces is another production-ready option to deploy your model for inference on top of a simple UI framework (Gradio, for instance), and we also support hardware upgrades like advanced Intel CPUs and NVIDIA GPUs. There's no better way to demo your models!
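
As an illustration, here's a minimal sketch of what the app.py of a Gradio Space could look like, assuming a text-classification model such as distilbert-base-uncased-finetuned-sst-2-english:

    # app.py - a minimal Gradio demo for a Space (illustrative sketch)
    import gradio as gr
    from transformers import pipeline

    # Any text-classification model from the Hub would work here.
    classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )

    def predict(text):
        # Map each predicted label to its score for the Label output component.
        return {r["label"]: r["score"] for r in classifier(text)}

    demo = gr.Interface(fn=predict, inputs="text", outputs="label")
    demo.launch()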



To learn more about Spaces, please take a look at the documentation and don't hesitate to browse posts or ask questions in our forum.



Getting started

It couldn't be simpler. Just log in to the Hugging Face Hub and browse our models. Once you've found one that you like, you can try the Inference Widget directly on the page. Clicking on the "Deploy" button, you'll get auto-generated code to deploy the model on the free Inference API for evaluation, and a direct link to deploy it to production with Inference Endpoints or Spaces.

Please give it a try and let us know what you think. We'd love to read your feedback on the Hugging Face forum.

Thanks for reading!


