We’re excited to introduce the Messages API to provide OpenAI compatibility with Text Generation Inference (TGI) and Inference Endpoints.
Starting with version 1.4.0, TGI offers an API compatible with the OpenAI Chat Completion API. The new Messages API allows customers and users to transition seamlessly from OpenAI models to open LLMs. The API can be directly used with OpenAI’s client libraries or third-party tools, like LangChain or LlamaIndex.
“The new Messages API with OpenAI compatibility makes it easy for Ryght’s real-time GenAI orchestration platform to switch LLM use cases from OpenAI to open models. Our migration from GPT4 to Mixtral/Llama2 on Inference Endpoints is effortless, and now we have a simplified workflow with more control over our AI solutions.” – Johnny Crupi, CTO at Ryght
The new Messages API is also now available in Inference Endpoints, on both dedicated and serverless flavors. To get you started quickly, we’ve included detailed examples of how to:
- Create an Inference Endpoint
- Use Inference Endpoints with OpenAI client libraries
- Integrate with LangChain and LlamaIndex
Limitations: The Messages API does not currently support function calling and will only work for LLMs with a chat_template defined in their tokenizer configuration, as is the case for Mixtral 8x7B Instruct.
Create an Inference Endpoint
Inference Endpoints offers a secure, production solution to easily deploy any machine learning model from the Hub on dedicated infrastructure managed by Hugging Face.
In this example, we’ll deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using Text Generation Inference.
We can deploy the model in just a few clicks from the UI, or take advantage of the huggingface_hub Python library to programmatically create and manage Inference Endpoints. We demonstrate the use of the Hub library here.
In our API call shown below, we need to specify the endpoint name and model repository, along with the task of text-generation. In this example we use a protected type, so access to the deployed endpoint will require a valid Hugging Face token. We also need to configure the hardware requirements like vendor, region, accelerator, instance type, and size. You can check out the list of available resource options using this API call, and view recommended configurations for select models in our catalog here.
Note: You may need to request a quota upgrade by sending an email to api-enterprise@huggingface.co
from huggingface_hub import create_inference_endpoint
endpoint = create_inference_endpoint(
    "nous-hermes-2-mixtral-8x7b-demo",
    repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="nvidia-a100",
    instance_size="x2",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "4096",
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_TOTAL_TOKENS": "32000",
            "MAX_BATCH_TOTAL_TOKENS": "1024000",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",
    },
)
endpoint.wait()
print(endpoint.status)
It can take a few minutes for our deployment to spin up. We can use the .wait() utility to block the running thread until the endpoint reaches a final “running” state. Once running, we can confirm its status and take it for a spin via the UI Playground:
Great, we now have a working endpoint!
💡 When deploying with huggingface_hub, your endpoint will scale to zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out the Hub Python Library documentation to see all the functionality available for managing your endpoint lifecycle.
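For instance, here is a minimal sketch of how you might manage the endpoint’s compute programmatically with the Hub library; get_inference_endpoint, scale_to_zero, resume, and wait are existing huggingface_hub helpers, but check the documentation of your installed version for the exact behavior.
from huggingface_hub import get_inference_endpoint

# Retrieve a handle to the endpoint by the name we chose at creation time
endpoint = get_inference_endpoint("nous-hermes-2-mixtral-8x7b-demo")

# Manually scale down to zero replicas instead of waiting for the idle timeout
endpoint.scale_to_zero()
print(endpoint.status)

# Bring it back up later and block until it is running again
endpoint.resume()
endpoint.wait()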
Using Inference Endpoints with OpenAI client libraries
Messages support in TGI makes Inference Endpoints directly compatible with the OpenAI Chat Completion API. This means that any existing scripts that use OpenAI models via the OpenAI client libraries can be directly swapped out to use any open LLM running on a TGI endpoint!
With this seamless transition, you can immediately take advantage of the many benefits offered by open models:
- Complete control and transparency over models and data
- No more worrying about rate limits
- The ability to fully customize systems based on your specific needs
Let’s see how.
With the Python client
The example below shows how to make this transition using the OpenAI Python Library. Simply replace the <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_TOKEN> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be gathered from the Inference Endpoints UI, or from the endpoint object we created above with endpoint.url.
We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.
from openai import OpenAI
client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",  # replace with your endpoint URL
    api_key="<HF_API_TOKEN>",  # replace with a valid Hugging Face token
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500,
)

for message in chat_completion:
    print(message.choices[0].delta.content, end="")
Behind the scenes, TGI’s Messages API automatically converts the list of messages into the model’s required instruction format using its chat template.
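To see what this conversion looks like, you can render the same list of messages yourself with the model’s tokenizer; below is a minimal sketch using transformers, for illustration only, since the exact prompt string depends on the model’s chat_template.
from transformers import AutoTokenizer

# Load the tokenizer of the deployed model (its chat_template drives the formatting)
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]

# Produce the single prompt string that the model ultimately sees
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)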
💡 Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p.
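As a quick illustration, here is a sketch of a non-streaming request that sets a few of these parameters with the same client as above; the values are arbitrary and only for demonstration.
# Non-streaming request exercising some of the supported parameters
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "In one sentence, why is open-source software important?"},
    ],
    max_tokens=200,
    temperature=0.7,
    top_p=0.95,
    seed=42,
)
print(chat_completion.choices[0].message.content)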
With the JavaScript client
Here’s the same streaming example as above, but using the OpenAI JavaScript/TypeScript Library.
import OpenAI from "openai";
const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/",  // replace with your endpoint URL
  apiKey: "<HF_API_TOKEN>",  // replace with a valid Hugging Face token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();
Integrate with LangChain and LlamaIndex
Now, let’s see how to use this newly created endpoint with your preferred RAG framework.
How to use with LangChain
To use it in LangChain, simply create an instance of ChatOpenAI and pass your <ENDPOINT_URL> and <HF_API_TOKEN> as follows:
from langchain_community.chat_models.openai import ChatOpenAI
llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key="<HF_API_TOKEN>",
    openai_api_base="<ENDPOINT_URL>" + "/v1/",
)
llm.invoke("Why is open-source software important?")
We’re able to directly leverage the same ChatOpenAI class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.
Let’s now use the LLM declared this way in a simple RAG pipeline to answer a question over the contents of a HF blog post.
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Load and split the blog post into chunks
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Index the chunks and build a retriever
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke("According to this article which open-source model is the best for an agent behaviour?")
{
    "context": [...],
    "question": "According to this article which open-source model is the best for an agent behaviour?",
    "answer": " According to the article, Mixtral-8x7B is the best open-source model for agent behavior, as it performs well and even beats GPT-3.5. The authors recommend fine-tuning Mixtral for agents to potentially surpass the next challenger, GPT-4.",
}
How to use with LlamaIndex
Similarly, you can also use a TGI endpoint in LlamaIndex. We’ll use the OpenAILike class, and instantiate it by configuring some additional arguments (i.e. is_local, is_function_calling_model, is_chat_model, context_window). Note that the context window argument should match the value previously set for MAX_TOTAL_TOKENS of your endpoint.
from llama_index.llms import OpenAILike
llm = OpenAILike(
    model="tgi",
    api_key="<HF_API_TOKEN>",
    api_base="<ENDPOINT_URL>" + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=32000,
)

llm.complete("Why is open-source software important?")
We can now use it in a similar RAG pipeline. Keep in mind that the previous choice of MAX_INPUT_LENGTH in your Inference Endpoint will directly influence the number of retrieved chunks (similarity_top_k) the model can process.
from llama_index import (
    ServiceContext,
    VectorStoreIndex,
)
from llama_index import download_loader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)

query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query(
    "According to this article which open-source model is the best for an agent behaviour?"
)
According to the article, Mixtral-8x7B is the best performing open-source model for an agent behaviour [5]. It even beats GPT-3.5 on this task. However, it is worth noting that Mixtral's performance could be further improved with proper fine-tuning for function calling and task planning skills [5].
Cleaning up
After you are done with your endpoint, you can either pause or delete it. This step can be completed via the UI, or programmatically as follows.
# pause our running endpoint to stop compute charges while keeping its configuration
endpoint.pause()

# or delete the endpoint entirely
endpoint.delete()
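If you only paused the endpoint, here is a minimal sketch of how you might retrieve it by name later and bring it back online with the Hub library (the name is the one chosen at creation time).
from huggingface_hub import get_inference_endpoint

# Look up the paused endpoint by name, resume it, and wait until it is running
endpoint = get_inference_endpoint("nous-hermes-2-mixtral-8x7b-demo")
endpoint.resume()
endpoint.wait()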
Conclusion
The new Messages API in Text Generation Inference provides a smooth transition path from OpenAI models to open LLMs. We can’t wait to see what use cases you will power with open LLMs running on TGI!
See this notebook for a runnable version of the code outlined in the post.

