Update (November 2024): This integration is no longer available. Please switch to the Hugging Face Inference API, Inference Endpoints, or other deployment options for your AI model needs.
Today, we’re thrilled to announce the launch of Deploy on Cloudflare Workers AI, a new integration on the Hugging Face Hub. Deploy on Cloudflare Workers AI makes it easy to use open models as a serverless API, powered by state-of-the-art GPUs deployed in Cloudflare edge data centers. Starting today, we are integrating some of the most popular open models on Hugging Face into Cloudflare Workers AI, powered by our production solutions, like Text Generation Inference.
With Deploy on Cloudflare Workers AI, developers can build robust Generative AI applications without managing GPU infrastructure and servers, and at a very low operating cost: you only pay for the compute you use, not for idle capacity.
Generative AI for Developers
This new experience expands on the strategic partnership we announced last year to simplify the access and deployment of open Generative AI models. One of the main problems developers and organizations face is the scarcity of GPU availability and the fixed costs of deploying servers to start building. Deploy on Cloudflare Workers AI offers an easy, low-cost solution to these challenges, providing serverless access to popular Hugging Face models with a pay-per-request pricing model.
Let’s take a look at a concrete example. Imagine you develop a RAG application that gets ~1000 requests per day, with an input of 1k tokens and an output of 100 tokens, using Meta Llama 2 7B. The LLM inference production costs would amount to about $1 a day.
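To make the arithmetic behind that estimate explicit, here is a back-of-the-envelope sketch. The per-token price below is a hypothetical placeholder chosen purely to illustrate the calculation, not Cloudflare's actual rate; consult the Workers AI pricing page for current numbers.

```python
# Back-of-the-envelope daily cost estimate for serverless LLM inference.
# NOTE: the price below is a hypothetical placeholder for illustration,
# not Cloudflare's actual Workers AI pricing.
REQUESTS_PER_DAY = 1000
INPUT_TOKENS = 1000    # ~1k input tokens per request
OUTPUT_TOKENS = 100    # ~100 output tokens per request

# Hypothetical blended price per million tokens (input and output combined).
PRICE_PER_MILLION_TOKENS = 0.90  # USD, placeholder

tokens_per_day = REQUESTS_PER_DAY * (INPUT_TOKENS + OUTPUT_TOKENS)
daily_cost = tokens_per_day / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"{tokens_per_day:,} tokens/day -> ${daily_cost:.2f}/day")
```

At ~1.1M tokens a day, any blended rate in the neighborhood of $0.90 per million tokens lands at roughly $1/day, which is where the estimate above comes from.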
“We’re excited to bring this integration to life so quickly. Putting the power of Cloudflare’s global network of serverless GPUs into the hands of developers, paired with the most popular open source models on Hugging Face, will open the doors to lots of exciting innovation by our community around the world,” said John Graham-Cumming, CTO, Cloudflare
How it works
Using Hugging Face models on Cloudflare Workers AI is super easy. Below, you will find step-by-step instructions on how to use Hermes 2 Pro on Mistral 7B, the latest model from Nous Research.
You can find all available models in this Cloudflare Collection.
Note: You need access to a Cloudflare account and API token.
You can find the Deploy on Cloudflare option on all available model pages, including models like Llama, Gemma, or Mistral.
Open the “Deploy” menu and select “Cloudflare Workers AI” – this will open an interface that includes instructions on how to use this model and send requests.
Note: If the model you want to use does not have a “Cloudflare Workers AI” option, it is currently not supported. We are working on extending the availability of models together with Cloudflare. You can reach out to us at api-enterprise@huggingface.co with your request.
The integration can currently be used via two options: using the Workers AI REST API or directly in Workers with the Cloudflare AI SDK. Choose your preferred option and copy the code into your environment. When using the REST API, you need to make sure the ACCOUNT_ID and API_TOKEN variables are defined.
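As a sketch of the REST API option, the snippet below builds the HTTP request in Python. The endpoint pattern and model identifier (`@hf/nousresearch/hermes-2-pro-mistral-7b`) are assumptions based on Cloudflare's documented URL scheme at the time; copy the exact snippet from the model page, and substitute your own ACCOUNT_ID and API_TOKEN placeholders.

```python
import json
import urllib.request

# Placeholders -- substitute your own Cloudflare credentials.
ACCOUNT_ID = "your-cloudflare-account-id"
API_TOKEN = "your-cloudflare-api-token"

# Assumed model identifier and endpoint pattern; copy the exact values
# from the "Cloudflare Workers AI" deploy snippet on the model page.
MODEL = "@hf/nousresearch/hermes-2-pro-mistral-7b"
API_URL = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"

def build_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a Workers AI inference request."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("What is the capital of France?")
# Sending the request requires valid credentials:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```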
That’s it! You can now start sending requests to Hugging Face models hosted on Cloudflare Workers AI. Make sure to use the correct prompt & template expected by the model.
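For example, Hermes 2 Pro is trained on the ChatML prompt format. The sketch below renders a conversation by hand under that assumption; in practice, prefer the chat template shipped with the model's tokenizer on the Hub, which is the authoritative source for the expected format.

```python
def to_chatml(messages):
    """Render a list of {"role": ..., "content": ...} messages in the
    ChatML format (assumed here for Hermes 2 Pro) and append the
    generation prompt that opens the assistant's turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

Sending a raw, correctly templated prompt like this avoids the degraded outputs you get when a chat-tuned model receives untemplated text.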
We’re just getting started
We are excited to collaborate with Cloudflare to make AI more accessible to developers. We will work with the Cloudflare team to make more models and experiences available to you!