Inference for PROs


Today, we're introducing Inference for PRO users, a community offering that gives you access to APIs of curated endpoints for some of the most exciting models available, as well as improved rate limits for the usage of the free Inference API. Use the following page to subscribe to PRO.

Hugging Face PRO users now have access to exclusive API endpoints for a curated list of powerful models that benefit from ultra-fast inference powered by text-generation-inference. This is a benefit on top of the free Inference API, which is available to all Hugging Face users to facilitate testing and prototyping on 200,000+ models. PRO users enjoy higher rate limits on these models, as well as exclusive access to some of the best models available today.






Supported Models

In addition to thousands of public models available in the Hub, PRO users get free access and higher rate limits to the following state-of-the-art models:

| Model | Size | Context Length | Use |
| --- | --- | --- | --- |
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Mixtral 8x7B Instruct | 45B MOE | 32k tokens | Performance comparable to top proprietary models |
| Nous Hermes 2 Mixtral 8x7B DPO | 45B MOE | 32k tokens | Further trained over Mixtral 8x7B MoE |
| Zephyr 7B β | 7B | 4k tokens | One of the best chat models at the 7B size |
| Llama 2 Chat | 7B, 13B | 4k tokens | One of the best conversational models |
| Mistral 7B Instruct v0.2 | 7B | 4k tokens | One of the best chat models at the 7B size |
| Code Llama Base | 7B and 13B | 4k tokens | Autocomplete and infill code |
| Code Llama Instruct | 34B | 16k tokens | Conversational code assistant |
| Stable Diffusion XL | 3B UNet | - | Generate images |
| Bark | 0.9B | - | Text to audio generation |

Inference for PROs makes it easy to experiment and prototype with new models without having to deploy them on your own infrastructure. It gives PRO users access to ready-to-use HTTP endpoints for all the models listed above. It's not meant to be used for heavy production applications; for that, we recommend using Inference Endpoints. Inference for PROs also makes it possible to use applications that depend on an LLM endpoint, such as a VS Code extension for code completion, or your own version of Hugging Chat.



Getting started with Inference For PROs

Using Inference for PROs is as simple as sending a POST request to the API endpoint for the model you want to run. You will also need to get a PRO account authentication token from your token settings page and use it in the request. For example, to generate text using Meta Llama 3 8B Instruct in a terminal session, you'd do something like:

curl https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8b-Instruct \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, "}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"

Which will print something like this:

[
  {
    "generated_text": "In a surprising turn of events, 2021 has brought us not one, but TWO seasons of our beloved TV show, \"Stranger Things.\""
  }
]

You can also use many of the familiar transformers generation parameters, like temperature or max_new_tokens:

curl https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8b-Instruct \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, ", "parameters": {"temperature": 0.7, "max_new_tokens": 100}}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"

For more details on the generation parameters, please take a look at Controlling Text Generation below.

To send your requests in Python, you can take advantage of InferenceClient, a convenient utility available in the huggingface_hub Python library:

pip install huggingface_hub

InferenceClient is a helpful wrapper that lets you easily make calls to the Inference API and Inference Endpoints:

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3-8b-Instruct", token=YOUR_TOKEN)

output = client.text_generation("Can you please let us know more details about your ")
print(output)

If you don't want to pass the token explicitly every time you instantiate the client, you can use notebook_login() (in Jupyter notebooks), huggingface-cli login (in the terminal), or login(token=YOUR_TOKEN) (everywhere else) to log in a single time. The token will then be automatically used from there.
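As a minimal sketch of that one-time flow (reusing the YOUR_TOKEN placeholder from the snippets above), the login helper from huggingface_hub can be used like this:

from huggingface_hub import InferenceClient, login

# Log in once; the token is stored locally and reused automatically afterwards.
login(token=YOUR_TOKEN)

# Subsequent clients pick up the stored token, so no explicit token argument is needed.
client = InferenceClient(model="meta-llama/Meta-Llama-3-8b-Instruct")
print(client.text_generation("In a surprising turn of events, "))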

In addition to Python, you can also use JavaScript to integrate inference calls inside your JS or Node apps. Take a look at huggingface.js to get started!



Applications



Chat with Llama 2 and Code Llama 34B

Models prepared to follow chat conversations are trained with very particular and specific chat templates that depend on the model used. You have to be careful about the format the model expects and replicate it in your queries.

The following example was taken from our Llama 2 blog post, which describes in full detail how to query the model for conversation:

prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do? [/INST]
"""

client = InferenceClient(model="meta-llama/Llama-2-13b-chat-hf", token=YOUR_TOKEN)
response = client.text_generation(prompt, max_new_tokens=200)
print(response)

This example shows the structure of the first message in a multi-turn conversation. Note how the <<SYS>> delimiter is used to provide the system prompt, which tells the model how we expect it to behave. Then our query is inserted between [INST] delimiters.

If we want to continue the conversation, we have to append the model response to the sequence and issue a new follow-up instruction afterwards. This is the general structure of the prompt template we need to use for Llama 2:

[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
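If you prefer to assemble this template programmatically, a small helper along the following lines can build the prompt from a list of turns. This is just a sketch: format_llama2_prompt is a hypothetical function, not part of any library, and it simply follows the template shown above.

def format_llama2_prompt(system_prompt, turns):
    """Build a Llama 2 chat prompt from a system prompt and a list of
    (user_msg, model_answer) pairs; the last pair may use None as the answer."""
    prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for i, (user_msg, model_answer) in enumerate(turns):
        if i > 0:
            prompt += "[INST] "
        prompt += f"{user_msg} [/INST]"
        if model_answer is not None:
            prompt += f" {model_answer} </s><s>"
    return prompt

# Reproduces the structure of the first message shown earlier.
prompt = format_llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    [("There's a llama in my garden 😱 What should I do?", None)],
)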

This same format can be used with Code Llama Instruct to engage in technical conversations with a code-savvy assistant!

Please refer to our Llama 2 blog post for more details.



Code infilling with Code Llama

Code models like Code Llama can be used for code completion using the same generation strategy we used in the previous examples: you provide a starting string that may contain code or comments, and the model will try to continue the sequence with plausible content. Code models can also be used for infilling, a more specialized task where you provide prefix and suffix sequences, and the model will predict what should go in between. This is great for applications such as IDE extensions. Let's see an example using Code Llama:

client = InferenceClient(model="codellama/CodeLlama-13b-hf", token=YOUR_TOKEN)

prompt_prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
prompt_suffix = "\n    return result"

prompt = f"<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>"

infilled = client.text_generation(prompt, max_new_tokens=150)
infilled = infilled.rstrip(" <EOT>")
print(f"{prompt_prefix}{infilled}{prompt_suffix}")

def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.

    Args:
        s (str): The string to remove non-ASCII characters from.

    Returns:
        str: The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result

As you can see, the format used for infilling follows this pattern:

prompt = f"<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>"

For more details on how this task works, please take a look at https://huggingface.co/blog/codellama#code-completion.



Stable Diffusion XL

SDXL is also available for PRO users. The response returned by the endpoint consists of a byte stream representing the generated image. If you use InferenceClient, it will automatically decode it to a PIL image for you:

sdxl = InferenceClient(model="stabilityai/stable-diffusion-xl-base-1.0", token=YOUR_TOKEN)
image = sdxl.text_to_image(
    "Dark gothic city in a misty night, lit by street lamps. A person in a cape is walking away from us",
    guidance_scale=9,
)

SDXL example generation

For more details on how to control generation, please take a look at this section.



Messages API

All text generation models now support the Messages API, so they are compatible with OpenAI client libraries, including LangChain and LlamaIndex. The following snippet shows how to use the official openai client library with Llama 3.1 70B:

from openai import OpenAI
import huggingface_hub


client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=huggingface_hub.get_token(),
)
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)


for message in chat_completion:
    print(message.choices[0].delta.content, end="")
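If you would rather stay within huggingface_hub, recent versions of InferenceClient also expose a chat_completion method that accepts the same message format. A minimal sketch, assuming a recent huggingface_hub release:

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3.1-70B-Instruct", token=YOUR_TOKEN)

# Same Messages API payload as above, consumed through huggingface_hub instead of openai.
for chunk in client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    max_tokens=500,
    stream=True,
):
    print(chunk.choices[0].delta.content, end="")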

For more details about the use of the Messages API, please check this post.



Generation Parameters



Controlling Text Generation

Text generation is a rich topic, and there exist several generation strategies for different purposes. We recommend this excellent overview of the subject. Many generation algorithms are supported by the text generation endpoints, and they can be configured using the following parameters:

  • do_sample: If set to False (the default), the generation method will be greedy search, which selects the most probable continuation sequence after the prompt you provide. Greedy search is deterministic, so the same results will always be returned from the same input. When do_sample is True, tokens will be sampled from a probability distribution and will therefore vary across invocations.
  • temperature: Controls the amount of variation we desire from the generation. A temperature of 0 is equivalent to greedy search. If we set a value for temperature, then do_sample will automatically be enabled. The same thing happens for top_k and top_p. When doing code-related tasks, we want less variability and hence recommend a low temperature. For other tasks, such as open-ended text generation, we recommend a higher one.
  • top_k: Enables "Top-K" sampling: the model will choose from the K most probable tokens that may occur after the input sequence. Typical values are between 10 and 50.
  • top_p: Enables "nucleus" sampling: the model will choose from as many tokens as necessary to cover a particular probability mass. If top_p is 0.9, the 90% most probable tokens will be considered for sampling, and the trailing 10% will be ignored.
  • repetition_penalty: Tries to avoid repeated words in the generated sequence.
  • seed: Random seed that you can use in combination with sampling, for reproducibility purposes.

In addition to the sampling parameters above, you can also control general aspects of the generation with the following (a combined example follows the list):

  • max_new_tokens: Maximum number of new tokens to generate. The default is 20; feel free to increase it if you want longer sequences.
  • return_full_text: Whether to include the input sequence in the output returned by the endpoint. The default used by InferenceClient is False, but the endpoint itself uses True by default.
  • stop_sequences: A list of sequences that will cause generation to stop when encountered in the output.
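Putting several of these together with InferenceClient, a sampled request might look like the sketch below (the parameter names follow the text_generation arguments listed above; the values are illustrative):

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3-8b-Instruct", token=YOUR_TOKEN)

output = client.text_generation(
    "In a surprising turn of events, ",
    do_sample=True,          # sample instead of greedy search
    temperature=0.7,         # moderate variability for open-ended text
    top_p=0.9,               # nucleus sampling over the top 90% probability mass
    repetition_penalty=1.1,  # discourage repeated words
    seed=42,                 # reproducible sampling
    max_new_tokens=100,      # allow a longer continuation than the default 20
    stop_sequences=["\n\n"], # stop at the first blank line
)
print(output)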



Controlling Image Generation

If you want finer-grained control over images generated with the SDXL endpoint, you can use the following parameters (an example follows the list):

  • negative_prompt: A text describing content that you want the model to steer away from.
  • guidance_scale: How closely you want the model to match the prompt. Lower numbers are less accurate; very high numbers might decrease image quality or generate artifacts.
  • width and height: The desired image dimensions. SDXL works best for sizes between 768 and 1024.
  • num_inference_steps: The number of denoising steps to run. Larger numbers may produce better quality but will be slower. Typical values are between 20 and 50 steps.
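For instance, combining these options with the sdxl client from the earlier snippet might look like this (a sketch; the values are illustrative, and the keyword arguments map onto InferenceClient.text_to_image):

image = sdxl.text_to_image(
    "Dark gothic city in a misty night, lit by street lamps. A person in a cape is walking away from us",
    negative_prompt="blurry, low quality, watermark",  # content to steer away from
    guidance_scale=9,         # stick closely to the prompt
    width=1024,               # SDXL works best between 768 and 1024
    height=1024,
    num_inference_steps=30,   # more steps: potentially better quality, slower generation
)
image.save("gothic_city.png")  # the client returns a PIL image, so it can be saved directly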

For additional details on text-to-image generation, we recommend you check the diffusers library documentation.



Caching

If you run the same generation multiple times, you'll see that the result returned by the API is the same (even if you are using sampling instead of greedy decoding). This is because recent results are cached. To force a different response each time, we can use an HTTP header to tell the server to run a new generation on every request: x-use-cache: 0.

If you are using InferenceClient, you can simply append it to the headers client property:

client = InferenceClient(model="meta-llama/Meta-Llama-3-8b-Instruct", token=YOUR_TOKEN)
client.headers["x-use-cache"] = "0"

output = client.text_generation("In a surprising turn of events, ", do_sample=True)
print(output)



Streaming

Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.


To stream tokens with InferenceClient, simply pass stream=True and iterate over the response.

for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)
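If you also need token-level metadata while streaming, text_generation accepts details=True; each streamed item then carries the token text, and the final one includes generation details. A minimal sketch under that assumption:

for response in client.text_generation(
    "How do you make cheese?", max_new_tokens=12, stream=True, details=True
):
    # Each streamed item exposes the generated token; only the final item
    # carries the generation details (e.g. the finish reason).
    print(response.token.text, end="")
    if response.details is not None:
        print(f"\n[finish reason: {response.details.finish_reason}]")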

To use the generate_stream endpoint with curl, you can add the -N/--no-buffer flag, which disables curl's default buffering and shows data as it arrives from the server.

curl -N https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8b-Instruct \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, ", "parameters": {"temperature": 0.7, "max_new_tokens": 100}}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"



Subscribe to PRO

You can sign up today for a PRO subscription here. Benefit from higher rate limits, custom accelerated endpoints for the latest models, and early access to features. If you've built some exciting projects with the Inference API or are looking for a model not available in Inference for PROs, please use this discussion. Enterprise users also benefit from the PRO Inference API on top of other features, such as SSO.



FAQ

Does this affect the free Inference API?

No. We still expose thousands of models through free APIs that let people prototype and explore model capabilities quickly.

Does this affect Enterprise users?

Users with an Enterprise subscription also benefit from the accelerated inference API for curated models.

Can I use my own models with the PRO Inference API?

The free Inference API already supports a wide variety of small and medium models from a range of libraries (such as diffusers, transformers, and sentence transformers). If you have a custom model or custom inference logic, we recommend using Inference Endpoints.


