Welcome to Inference Providers on the Hub 🔥



Today, we’re launching the integration of four awesome serverless Inference Providers – fal, Replicate, SambaNova, and Together AI – directly on the Hub’s model pages. They are also seamlessly integrated into our client SDKs (for JS and Python), making it easier than ever to explore serverless inference on a wide selection of models running on your favorite providers.

We’ve been hosting a serverless Inference API on the Hub for a very long time (we launched v1 in the summer of 2020 – wow, time flies 🤯). While this has enabled easy exploration and prototyping, we’ve since refined our core value proposition towards collaboration, storage, versioning, and distribution of huge datasets and models with the community. At the same time, serverless providers have flourished, and the time was right for Hugging Face to offer easy and unified access to serverless inference through a set of great providers.

Just as we work with great partners like AWS, Nvidia, and others for dedicated deployment options via the model pages’ Deploy button, it was natural to partner with the next generation of serverless inference providers for model-centric, serverless inference.

Here’s what this allows, taking the timely example of deepseek-ai/DeepSeek-R1, a model which has achieved mainstream fame over the past few days 🔥:

Rodrigo Liang, Co-Founder & CEO at SambaNova: “We’re excited to be partnering with Hugging Face to speed up its Inference API. Hugging Face developers now have access to much faster inference speeds on a wide range of the best open source models.”

Zeke Sikelianos, Founding Designer at Replicate: “Hugging Face is the de facto home of open-source model weights, and has been a key player in making AI more accessible to the world. We use Hugging Face internally at Replicate as our weights registry of choice, and we’re honored to be among the first inference providers to be featured in this launch.”

This is just the beginning, and we’ll build on top of this with the community in the coming weeks!



How it works



In the website UI

  1. In your user account settings, you are able to:
  • set your own API keys for the providers you’ve signed up with. If you don’t, you can still use them – your requests will be routed through HF.
  • order providers by preference. This applies to the widget and code snippets in the model pages.


  2. As we mentioned, there are two modes when calling Inference APIs (see the sketch just below this list):
  • custom key (calls go directly to the inference provider, using your own API key for that provider); or
  • routed by HF (in that case, you don’t need a token from the provider, and the charges are applied directly to your HF account rather than the provider’s account)
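Here’s a minimal sketch of the two modes using the Python client shown later in this post – the only difference is which key you pass (the placeholder keys are just illustrations):

from huggingface_hub import InferenceClient

# Routed by HF: pass your Hugging Face token; charges land on your HF account.
routed_client = InferenceClient(
    provider="together",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",  # Hugging Face token
)

# Custom key: pass your own Together AI API key; calls go directly to the provider.
direct_client = InferenceClient(
    provider="together",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",  # Together AI API key
)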


  3. Model pages showcase third-party inference providers (those that are compatible with the current model, sorted by user preference).




From the client SDKs



from Python, using huggingface_hub

The following example shows how to use DeepSeek-R1 with Together AI as the inference provider. You can use a Hugging Face token for automatic routing through Hugging Face, or your own Together AI API key if you have one.

Install huggingface_hub v0.28.0 or later (release notes).

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="together",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1", 
    messages=messages, 
    max_tokens=500
)

print(completion.choices[0].message)

Note: You can also use the OpenAI client library to call the Inference Providers; see here for an example with the DeepSeek model.
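As a minimal sketch of that approach – assuming you route through Hugging Face with an HF token and use the router base URL described in the HTTP section below (the exact model identifier may vary per provider):

from openai import OpenAI

# Point the OpenAI client at the Hugging Face router for the chosen provider.
client = OpenAI(
    base_url="https://router.huggingface.co/together/v1",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",  # Hugging Face token
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
)

print(completion.choices[0].message)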

And here’s how to generate an image from a text prompt using FLUX.1-dev running on fal.ai:

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fal-ai",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)


image = client.text_to_image(
    "Labrador within the sort of Vermeer",
    model="black-forest-labs/FLUX.1-dev"
)
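The call returns a PIL image object; as a quick usage note, you could save it to disk (the filename is just an example):

# Save the generated image locally (hypothetical filename).
image.save("labrador_vermeer.png")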

To switch to a different provider, you can simply change the provider name – everything else stays the same:

from huggingface_hub import InferenceClient

client = InferenceClient(
-	provider="fal-ai",
+	provider="replicate",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)



from JS using @huggingface/inference

import { HfInference } from "@huggingface/inference";

const client = new HfInference("xxxxxxxxxxxxxxxxxxxxxxxx");

const chatCompletion = await client.chatCompletion({
    model: "deepseek-ai/DeepSeek-R1",
    messages: [
        {
            role: "user",
            content: "What is the capital of France?"
        }
    ],
    provider: "together",
    max_tokens: 500
});

console.log(chatCompletion.choices[0].message);



From HTTP calls

We expose the routing proxy directly under the huggingface.co domain so you can call it directly; this is very useful for OpenAI-compatible APIs, for instance. You can just swap in the URL as a base URL: https://router.huggingface.co/{:provider}.

Here’s how you can call Llama-3.3-70B-Instruct using SambaNova as the inference provider via cURL:

curl 'https://router.huggingface.co/sambanova/v1/chat/completions' \
-H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxx' \
-H 'Content-Type: application/json' \
--data '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
    "max_tokens": 500,
    "stream": false
}'



Billing

For direct requests, i.e. when you use a key from an inference provider, you are billed by the corresponding provider. For instance, if you use a Together AI key, you’re billed on your Together AI account.

For routed requests, i.e. when you authenticate via the Hub, you’ll only pay the standard provider API rates. There’s no additional markup from us; we just pass through the provider costs directly. (In the future, we may establish revenue-sharing agreements with our provider partners.)

Important Note ‼️ PRO users get $2 worth of Inference credits every month. You can use them across providers. 🔥

Subscribe to the Hugging Face PRO plan to get access to Inference credits, ZeroGPU, Spaces Dev Mode, 20x higher limits, and more.

We also provide free inference with a small quota for our signed-in free users, but please upgrade to PRO if you can!



Feedback and next steps

We’d love to get your feedback! Here’s a Hub discussion you can use: https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/49


