Serverless Inference with Hugging Face and NVIDIA NIM

Philipp Schmid, Jeff Boudier

Update: This service has been deprecated and is no longer available as of April 10, 2025. As an alternative, please consider Inference Providers.

Today, we’re thrilled to announce the launch of Hugging Face NVIDIA NIM API (serverless), a new service on the Hugging Face Hub available to Enterprise Hub organizations. This new service makes it easy to use open models with the NVIDIA DGX Cloud accelerated compute platform for inference serving. We built this solution so that Enterprise Hub users can easily access the latest NVIDIA AI technology in a serverless way to run inference on popular Generative AI models, including Llama and Mistral, using standardized APIs and a few lines of code within the Hugging Face Hub.




Serverless Inference powered by NVIDIA NIM

This new experience builds on our collaboration with NVIDIA to simplify the access and use of open Generative AI models on NVIDIA accelerated computing. One of the main challenges developers and organizations face is the upfront cost of infrastructure and the complexity of optimizing inference workloads for LLMs. With Hugging Face NVIDIA NIM API (serverless), we offer a simple solution to these challenges, providing fast access to state-of-the-art open Generative AI models optimized for NVIDIA infrastructure through an easy API for running inference. The pay-as-you-go pricing model ensures that you only pay for the request time you use, making it a cost-effective choice for businesses of all sizes.

NVIDIA NIM API (serverless) complements Train on DGX Cloud, an AI training service already available on Hugging Face.



How it works

Running serverless inference with Hugging Face models has never been easier. Here’s a step-by-step guide to get you started:

Note: You need access to an organization with a Hugging Face Enterprise Hub subscription to run inference.

Before you begin, make sure you meet the following requirements:

  1. You are a member of an Enterprise Hub organization.
  2. You have created a fine-grained token for your organization. Follow the steps below to create your token.



Create a Nice-Grained Token

Fine-grained tokens allow users to create tokens with specific permissions for precise access control to resources and namespaces. First, go to Hugging Face Access Tokens, click “Create new token”, and select “fine-grained”.


Enter a “Token name”, select your Enterprise organization in “Org permissions” as the scope, and then click “Create token”. You don’t need to select any additional scopes.


Now, make sure to save this token value so you can authenticate your requests later.
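Rather than pasting the token directly into scripts, you may prefer to keep it in an environment variable and check it programmatically. Below is a minimal, optional sketch using huggingface_hub; the HF_TOKEN variable name is just a convention chosen for this example, not something the service requires.

import os

from huggingface_hub import whoami

# Read the fine-grained token from an environment variable instead of
# hard-coding it, e.g. after running: export HF_TOKEN="hf_..."
hf_token = os.environ["HF_TOKEN"]

# Check which identity and organizations the token resolves to.
info = whoami(token=hf_token)
print(info["name"])
print([org["name"] for org in info.get("orgs", [])])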



Find your NIM

You can find “NVIDIA NIM API (serverless)” on the model page of supported Generative AI models. You can find all supported models in this NVIDIA NIM Collection and in the Pricing section below.

We will use meta-llama/Meta-Llama-3-8B-Instruct. Go to the meta-llama/Meta-Llama-3-8B-Instruct model card, open the “Deploy” menu, and select “NVIDIA NIM API (serverless)” – this will open an interface with pre-generated code snippets for Python, JavaScript, or curl.




Send your requests

NVIDIA NIM API (serverless) is standardized on the OpenAI API. This means you can use the openai SDK for inference. Replace YOUR_FINE_GRAINED_TOKEN_HERE with your fine-grained token and you are ready to run inference.

from openai import OpenAI

# Point the OpenAI client at the Hugging Face NVIDIA NIM API (serverless) endpoint.
client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE"
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 500"}
    ],
    stream=True,
    max_tokens=1024
)


# Stream the generated tokens as they arrive.
for message in chat_completion:
    print(message.choices[0].delta.content or "", end="")

Congrats! 🎉 You can now start building your Generative AI applications using open models. 🔥

NVIDIA NIM API (serverless) currently only supports the chat.completions.create and models.list APIs. We are working on extending this while adding more models. models.list can be used to check which models are currently available for inference.

models = client.models.list()
for m in models.data:
    print(m.id)
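If you prefer to receive the full answer in one response instead of streaming it, you can omit stream (or set it to False). Here is a minimal sketch reusing the client created above; the prompt is just an illustrative example.

# Non-streaming variant: the complete response is returned in a single object.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=50
)
print(response.choices[0].message.content)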



Supported Models and Pricing

Usage of Hugging Face NVIDIA NIM API (serverless) is billed based on the compute time spent per request. We exclusively use NVIDIA H100 Tensor Core GPUs, which are priced at $8.25 per hour. To make per-request pricing easier to understand, we can convert this to a per-second rate.

$8.25 per hour = $0.0023 per second (rounded to 4 decimal places)

The total cost for a request will depend on the model size, the number of GPUs required, and the time taken to process the request. Here’s a breakdown of our current model offerings, their GPU requirements, typical response times, and estimated cost per request:

| Model ID | Number of NVIDIA H100 GPUs | Typical Response Time (500 input tokens, 100 output tokens) | Estimated Cost per Request |
|---|---|---|---|
| meta-llama/Meta-Llama-3-8B-Instruct | 1 | 1 second | $0.0023 |
| meta-llama/Meta-Llama-3-70B-Instruct | 4 | 2 seconds | $0.0184 |
| meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 | 8 | 5 seconds | $0.0917 |
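To make the arithmetic behind these estimates explicit, here is a minimal sketch of the calculation (number of GPUs × request duration in seconds × per-second GPU price); small differences from the table come from rounding the per-second rate.

# Per-request cost ≈ GPUs used × request duration (seconds) × per-second GPU price.
H100_PRICE_PER_HOUR = 8.25
H100_PRICE_PER_SECOND = H100_PRICE_PER_HOUR / 3600  # ≈ $0.0023

def estimate_cost(num_gpus: int, response_time_s: float) -> float:
    """Estimated cost of a single request in USD."""
    return num_gpus * response_time_s * H100_PRICE_PER_SECOND

print(f"Llama-3-8B:   ${estimate_cost(1, 1):.4f}")  # ≈ $0.0023
print(f"Llama-3-70B:  ${estimate_cost(4, 2):.4f}")  # ≈ $0.0183
print(f"Llama-3-405B: ${estimate_cost(8, 5):.4f}")  # ≈ $0.0917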

Usage fees accrue to your Enterprise Hub Organization’s current monthly billing cycle. You can check your current and past usage at any time in the billing settings of your Enterprise Hub Organization.




Accelerating AI Inference with NVIDIA TensorRT-LLM

We’re excited to continue our collaboration with NVIDIA to push the boundaries of AI inference performance and accessibility. A key focus of our ongoing efforts is the integration of the NVIDIA TensorRT-LLM library into Hugging Face’s Text Generation Inference (TGI) framework.

We’ll be sharing more details, benchmarks, and best practices for using TGI with NVIDIA TensorRT-LLM in the near future. Stay tuned for more exciting developments as we continue to expand our collaboration with NVIDIA and bring more powerful AI capabilities to developers and organizations worldwide!


