Today, we’re introducing Inference for PRO users – a community offering that gives you access to APIs of curated endpoints for some of the most exciting models available, as well as improved rate limits for the usage of the free Inference API. Use the following page to subscribe to PRO.
Hugging Face PRO users now have access to exclusive API endpoints for a curated list of powerful models that benefit from ultra-fast inference powered by text-generation-inference. This is a benefit on top of the free Inference API, which is available to all Hugging Face users to facilitate testing and prototyping on 200,000+ models. PRO users enjoy higher rate limits on these models, as well as exclusive access to some of the best models available today.
Supported Models
In addition to thousands of public models available in the Hub, PRO users get free access and higher rate limits to the following state-of-the-art models:
| Model | Size | Context Length | Use |
|---|---|---|---|
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Mixtral 8x7B Instruct | 45B MoE | 32k tokens | Performance comparable to top proprietary models |
| Nous Hermes 2 Mixtral 8x7B DPO | 45B MoE | 32k tokens | Further trained over Mixtral 8x7B MoE |
| Zephyr 7B β | 7B | 4k tokens | One of the best chat models at the 7B size |
| Llama 2 Chat | 7B, 13B | 4k tokens | One of the best conversational models |
| Mistral 7B Instruct v0.2 | 7B | 4k tokens | One of the best chat models at the 7B size |
| Code Llama Base | 7B and 13B | 4k tokens | Autocomplete and infill code |
| Code Llama Instruct | 34B | 16k tokens | Conversational code assistant |
| Stable Diffusion XL | 3B UNet | – | Generate images |
| Bark | 0.9B | – | Text to audio generation |
Inference for PROs makes it easy to experiment and prototype with new models without having to deploy them on your own infrastructure. It gives PRO users access to ready-to-use HTTP endpoints for all the models listed above. It’s not meant to be used for heavy production applications – for that, we recommend using Inference Endpoints. Inference for PROs also allows using applications that depend upon an LLM endpoint, such as using a VS Code extension for code completion, or having your own version of Hugging Chat.
Getting started with Inference for PROs
Using Inference for PROs is as simple as sending a POST request to the API endpoint for the model you want to run. You'll also need to get a PRO account authentication token from your token settings page and use it in the request. For example, to generate text using Meta Llama 3 8B Instruct in a terminal session, you'd do something like:
curl https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, "}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"
Which might print something like this:
[
{
"generated_text": "In a surprising turn of events, 2021 has brought us not one, but TWO seasons of our beloved TV show, "Stranger Things.""
}
]
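The same request can also be sent from Python with the requests library, before reaching for any Hugging Face-specific tooling. The following is a minimal sketch; replace the YOUR_TOKEN placeholder with your PRO account token.

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
headers = {"Authorization": "Bearer YOUR_TOKEN"}  # your PRO account token

# Same payload as the curl example above
response = requests.post(API_URL, headers=headers, json={"inputs": "In a surprising turn of events, "})
print(response.json())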
You can also use many of the familiar transformers generation parameters, like temperature or max_new_tokens:
curl https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, ", "parameters": {"temperature": 0.7, "max_new_tokens": 100}}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"
For more details on the generation parameters, please take a look at Controlling Text Generation below.
To send your requests in Python, you can take advantage of InferenceClient, a convenient utility available in the huggingface_hub Python library:
pip install huggingface_hub
InferenceClient is a helpful wrapper that allows you to make calls to the Inference API and Inference Endpoints easily:
from huggingface_hub import InferenceClient
client = InferenceClient(model="meta-llama/Meta-Llama-3-8b-Instruct", token=YOUR_TOKEN)
output = client.text_generation("Are you able to please tell us more details about your ")
print(output)
If you don't want to pass the token explicitly every time you instantiate the client, you can use notebook_login() (in Jupyter notebooks), huggingface-cli login (in the terminal), or login(token=YOUR_TOKEN) (everywhere else) to log in a single time. The token will then be automatically used from there.
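For instance, a one-time login in a script could look like the sketch below; after that, InferenceClient picks up the stored token on its own.

from huggingface_hub import login, InferenceClient

login(token=YOUR_TOKEN)  # stores the token locally for subsequent calls
client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct")  # no token argument needed
print(client.text_generation("In a surprising turn of events, "))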
In addition to Python, you can also use JavaScript to integrate inference calls inside your JS or node apps. Take a look at huggingface.js to get started!
Applications
Chat with Llama 2 and Code Llama 34B
Models prepared to follow chat conversations are trained with very specific chat templates that depend on the model used. You need to be careful about the format the model expects and replicate it in your queries.
The following example was taken from our Llama 2 blog post, which describes in full detail how to query the model for conversation:
prompt = """[INST] <>
You're a helpful, respectful and honest assistant. At all times answer as helpfully as possible, while being secure. Your answers mustn't include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please be certain that your responses are socially unbiased and positive in nature.
If an issue doesn't make any sense, or shouldn't be factually coherent, explain why as a substitute of answering something not correct. If you happen to do not know the reply to an issue, please don't share false information.
<>
There is a llama in my garden 😱 What should I do? [/INST]
"""
client = InferenceClient(model="codellama/CodeLlama-13b-hf", token=YOUR_TOKEN)
response = client.text_generation(prompt, max_new_tokens=200)
print(response)
This example shows the structure of the first message in a multi-turn conversation. Note how the <<SYS>> delimiter is used to provide the system prompt, which tells the model how we expect it to behave. Then our query is inserted between [INST] delimiters.
If we want to continue the conversation, we have to append the model response to the sequence and issue a new follow-up instruction afterwards. This is the general structure of the prompt template we need to use for Llama 2:
[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
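To make the bookkeeping concrete, here is a small hypothetical helper (our own sketch, not part of any library) that assembles a multi-turn prompt in this format from a system prompt, previous (user, assistant) exchanges, and the next user message:

def build_llama2_prompt(system_prompt, turns, next_user_msg):
    # turns is a list of (user_msg, model_answer) pairs from earlier in the conversation
    prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for user_msg, model_answer in turns:
        prompt += f"{user_msg} [/INST] {model_answer} </s><s>[INST] "
    return prompt + f"{next_user_msg} [/INST]"

Each new model response gets appended to turns, and the helper produces the next prompt to send.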
This same format can be used with Code Llama Instruct to engage in technical conversations with a code-savvy assistant!
Please refer to our Llama 2 blog post for more details.
Code infilling with Code Llama
Code models like Code Llama can be used for code completion using the same generation strategy we used in the previous examples: you provide a starting string that may contain code or comments, and the model will try to continue the sequence with plausible content. Code models can also be used for infilling, a more specialized task where you provide prefix and suffix sequences, and the model will predict what should go in between. This is great for applications such as IDE extensions. Let's see an example using Code Llama:
client = InferenceClient(model="codellama/CodeLlama-13b-hf", token=YOUR_TOKEN)
prompt_prefix = 'def remove_non_ascii(s: str) -> str:n """ '
prompt_suffix = "n return result"
prompt = f" {prompt_prefix} {prompt_suffix} "
infilled = client.text_generation(prompt, max_new_tokens=150)
infilled = infilled.rstrip(" " )
print(f"{prompt_prefix}{infilled}{prompt_suffix}")
def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.

    Args:
        s (str): The string to remove non-ASCII characters from.

    Returns:
        str: The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result
As you can see, the format used for infilling follows this pattern:
prompt = f"<PRE> {prompt_prefix} <SUF>{prompt_suffix} <MID>"
For more details on how this task works, please take a look at https://huggingface.co/blog/codellama#code-completion.
Stable Diffusion XL
SDXL is also available for PRO users. The response returned by the endpoint consists of a byte stream representing the generated image. If you use InferenceClient, it will automatically decode it to a PIL image for you:
sdxl = InferenceClient(model="stabilityai/stable-diffusion-xl-base-1.0", token=YOUR_TOKEN)
image = sdxl.text_to_image(
"Dark gothic city in a misty night, lit by street lamps. A person in a cape is walking away from us",
guidance_scale=9,
)
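Because the result is a regular PIL image, you can work with it directly; for example (the file name below is arbitrary):

image.save("gothic_city.png")  # write the generated image to disk
print(image.size)              # e.g. (1024, 1024)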
For more details on how to control generation, please take a look at this section.
Messages API
All text generation models now support the Messages API, so they are compatible with OpenAI client libraries, including LangChain and LlamaIndex. The following snippet shows how to use the official openai client library with Llama 3.1 70B:
from openai import OpenAI
import huggingface_hub

client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=huggingface_hub.get_token(),
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500,
)

for message in chat_completion:
    print(message.choices[0].delta.content, end="")
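The same endpoint also works without streaming, in which case the complete answer is available on the returned object. A brief sketch, reusing the OpenAI client created above:

# Non-streaming variant: the full reply arrives in a single response
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Is Rust better than Python?"}],
    max_tokens=500,
)
print(chat_completion.choices[0].message.content)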
For more details about the use of the Messages API, please check this post.
Generation Parameters
Controlling Text Generation
Text generation is a rich topic, and there exist several generation strategies for different purposes. We recommend this excellent overview on the subject. Many generation algorithms are supported by the text generation endpoints, and they can be configured using the following parameters:
- do_sample: If set to False (the default), the generation method will be greedy search, which selects the most probable continuation sequence after the prompt you provide. Greedy search is deterministic, so the same results will always be returned from the same input. When do_sample is True, tokens will be sampled from a probability distribution and will therefore vary across invocations.
- temperature: Controls the amount of variation we desire from the generation. A temperature of 0 is equivalent to greedy search. If we set a value for temperature, then do_sample will automatically be enabled. The same applies to top_k and top_p. When doing code-related tasks, we want less variability and hence recommend a low temperature. For other tasks, such as open-ended text generation, we recommend a higher one.
- top_k: Enables "Top-K" sampling: the model will choose from the K most probable tokens that may occur after the input sequence. Typical values are between 10 and 50.
- top_p: Enables "nucleus" sampling: the model will choose from as many tokens as necessary to cover a particular probability mass. If top_p is 0.9, the 90% most probable tokens will be considered for sampling, and the trailing 10% will be ignored.
- repetition_penalty: Tries to avoid repeated words in the generated sequence.
- seed: Random seed that you can use in conjunction with sampling, for reproducibility purposes.
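As a combined illustration, several of these parameters can be passed directly to text_generation on the InferenceClient used earlier; the specific values below are only examples.

client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct", token=YOUR_TOKEN)
output = client.text_generation(
    "In a surprising turn of events, ",
    do_sample=True,          # sample instead of greedy search
    temperature=0.7,         # moderate variability
    top_k=50,                # restrict sampling to the 50 most probable tokens
    top_p=0.9,               # nucleus sampling over 90% of the probability mass
    repetition_penalty=1.2,  # discourage repeated words
    seed=42,                 # reproducible sampling
)
print(output)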
In addition to the sampling parameters above, you can also control general aspects of the generation with the following:
- max_new_tokens: Maximum number of new tokens to generate. The default is 20; feel free to increase it if you want longer sequences.
- return_full_text: Whether to include the input sequence in the output returned by the endpoint. The default used by InferenceClient is False, but the endpoint itself uses True by default.
- stop_sequences: A list of sequences that will cause generation to stop when encountered in the output.
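These can be combined with the sampling parameters in a single call; for example (a sketch under the same assumptions as above):

output = client.text_generation(
    "In a surprising turn of events, ",
    max_new_tokens=100,       # allow a longer continuation than the default of 20
    return_full_text=True,    # include the prompt in the returned string
    stop_sequences=["\n\n"],  # stop at the first blank line
)
print(output)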
Controlling Image Generation
If you want finer-grained control over images generated with the SDXL endpoint, you can use the following parameters:
- negative_prompt: A text describing content that you want the model to steer away from.
- guidance_scale: How closely you want the model to match the prompt. Lower numbers are less accurate; very high numbers might decrease image quality or generate artifacts.
- width and height: The desired image dimensions. SDXL works best for sizes between 768 and 1024.
- num_inference_steps: The number of denoising steps to run. Larger numbers may produce better quality but will be slower. Typical values are between 20 and 50 steps.
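For example, reusing the sdxl client from above (the values and negative prompt are illustrative only):

image = sdxl.text_to_image(
    "Dark gothic city in a misty night, lit by street lamps. A person in a cape is walking away from us",
    negative_prompt="blurry, low quality, washed out",  # content to steer away from
    guidance_scale=9,
    width=1024,
    height=768,
    num_inference_steps=30,
)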
For additional details on text-to-image generation, we recommend you check the diffusers library documentation.
Caching
If you run the same generation multiple times, you'll see that the result returned by the API is the same (even if you are using sampling instead of greedy decoding). This is because recent results are cached. To force a different response each time, we can use an HTTP header to tell the server to run a new generation each time: x-use-cache: 0.
If you are using InferenceClient, you can simply append it to the client's headers property:
client = InferenceClient(model="meta-llama/Meta-Llama-3-8b-Instruct", token=YOUR_TOKEN)
client.headers["x-use-cache"] = "0"
output = client.text_generation("In a surprising turn of events, ", do_sample=True)
print(output)
Streaming
Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an important aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.

To stream tokens with InferenceClient, simply pass stream=True and iterate over the response.
for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)
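If you also want token-level metadata (such as log probabilities or special-token flags), you can additionally pass details=True; each streamed item then exposes the token object. A brief sketch:

for chunk in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True, details=True):
    print(chunk.token.text, end="")  # chunk also carries metadata such as chunk.token.logprob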
To use the generate_stream endpoint with curl, you can add the -N/--no-buffer flag, which disables curl's default buffering and shows data as it arrives from the server.
curl -N https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct \
    -X POST \
    -d '{"inputs": "In a surprising turn of events, ", "parameters": {"temperature": 0.7, "max_new_tokens": 100}}' \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR_TOKEN>"
Subscribe to PRO
You can sign up today for a PRO subscription here. Benefit from higher rate limits, custom accelerated endpoints for the latest models, and early access to features. If you've built some exciting projects with the Inference API or are looking for a model not available in Inference for PROs, please use this discussion. Enterprise users also benefit from PRO Inference API on top of other features, such as SSO.
FAQ
Does this affect the free Inference API?
No. We still expose thousands of models through free APIs that allow people to prototype and explore model capabilities quickly.
Does this affect Enterprise users?
Users with an Enterprise subscription also benefit from the accelerated inference API for curated models.
Can I use my own models with the PRO Inference API?
The free Inference API already supports a wide variety of small and medium models from a range of libraries (such as diffusers, transformers, and sentence transformers). If you have a custom model or custom inference logic, we recommend using Inference Endpoints.


