Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Morgan Funtowicz, Hugo Larcher

Since its initial release in 2022, Text Generation Inference (TGI) has provided Hugging Face and the AI community with a performance-focused solution to easily deploy large language models (LLMs). TGI initially offered an almost no-code solution to load models from the Hugging Face Hub and deploy them in production on NVIDIA GPUs. Over time, support expanded to include AMD Instinct GPUs, Intel GPUs, AWS Trainium/Inferentia, Google TPU, and Intel Gaudi.
Over the years, multiple inference solutions have emerged, including vLLM, SGLang, llama.cpp, TensorRT-LLM, etc., fragmenting the overall ecosystem. Different models, hardware, and use cases may require a specific backend to achieve optimal performance. However, configuring each backend correctly, managing licenses, and integrating them into existing infrastructure can be difficult for users.

To address this, we’re excited to introduce the concept of TGI Backends. This new architecture provides the flexibility to integrate with any of the solutions above, with TGI acting as a single, unified frontend layer. This change makes it easier for the community to get the best performance for their production workloads, switching backends based on their modeling, hardware, and performance requirements.

The Hugging Face team is excited to contribute to and collaborate with the teams that build vLLM, llama.cpp, and TensorRT-LLM, as well as the teams at AWS, Google, NVIDIA, AMD, and Intel, to offer a robust and consistent user experience for TGI users, whichever backend and hardware they wish to use.

TGI multi-backend stack



TGI Backend: under the hood

TGI consists of multiple components, primarily written in Rust and Python. Rust powers the HTTP and scheduling layers, while Python remains the go-to for modeling.

Long story short: Rust allows us to improve the overall robustness of the serving layer through static analysis and compiler-enforced memory safety, which makes it easier to scale to multiple cores with the same safety guarantees. Leveraging Rust’s strong type system for the HTTP layer and scheduler makes it possible to avoid memory issues while maximizing concurrency, bypassing the Global Interpreter Lock (GIL) found in Python-based environments.
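To make that concrete, here is a minimal, self-contained sketch (not TGI code; all names are illustrative) of the kind of shared-state concurrency Rust offers without a GIL: several worker threads drain a shared work queue, and the compiler enforces that the queue can only be touched through the mutex.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // A toy "scheduler queue" of request sizes, shared across worker threads.
    let queue = Arc::new(Mutex::new(vec![128u32, 256, 512, 1024]));

    let handles: Vec<_> = (0..4)
        .map(|worker_id| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so workers only serialize on the pop, not on the work itself.
                let item = queue.lock().unwrap().pop();
                match item {
                    Some(tokens) => println!("worker {worker_id} scheduling a {tokens}-token request"),
                    None => break,
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```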

Speaking of Rust… surprise, this is the TGI starting point for integrating a new backend – 🤗

Earlier this year, the TGI team worked on exposing the foundational knobs to disentangle how the actual HTTP server and the scheduler were coupled together.
This work introduced the new Rust trait Backend to interface with current inference engines and those to come.

Having this new Backend interface (or trait, in Rust terms) paves the way for modularity and makes it possible to route incoming requests towards different modeling and execution engines.
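To give a feel for what such an interface enables, here is a simplified, hypothetical sketch of a Backend-like trait. The real trait in TGI’s router has different, richer signatures (streaming responses, error types), and names such as GenerateRequest, Token, and EchoBackend below are purely illustrative.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// A generation request as a scheduler might hand it to a backend.
struct GenerateRequest {
    prompt: String,
    max_new_tokens: u32,
}

/// One streamed token produced by the underlying engine.
struct Token {
    text: String,
}

/// The kind of contract a pluggable backend could expose: the HTTP frontend
/// stays the same, only the type implementing the trait changes
/// (TensorRT-LLM, vLLM, llama.cpp, ...).
trait Backend: Send + Sync {
    /// Schedule a request and return a stream (here: a channel) of tokens.
    fn schedule(&self, request: GenerateRequest) -> Receiver<Token>;

    /// Report whether the engine behind this backend is healthy.
    fn health(&self) -> bool;
}

/// A toy backend that just echoes the prompt back, word by word.
struct EchoBackend;

impl Backend for EchoBackend {
    fn schedule(&self, request: GenerateRequest) -> Receiver<Token> {
        let (tx, rx) = channel();
        thread::spawn(move || {
            for word in request
                .prompt
                .split_whitespace()
                .take(request.max_new_tokens as usize)
            {
                let _ = tx.send(Token { text: word.to_string() });
            }
        });
        rx
    }

    fn health(&self) -> bool {
        true
    }
}

fn main() {
    // The router would hold a `Box<dyn Backend>` and route requests to it,
    // regardless of which engine sits behind the trait.
    let backend: Box<dyn Backend> = Box::new(EchoBackend);
    let stream = backend.schedule(GenerateRequest {
        prompt: "hello multi-backend TGI".to_string(),
        max_new_tokens: 8,
    });
    for token in stream {
        println!("{}", token.text);
    }
    assert!(backend.health());
}
```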



Looking forward: 2025

The new multi-backend capability of TGI opens up many impactful roadmap opportunities. As we look forward to 2025, we’re excited to share some of the TGI developments we’re most enthusiastic about:

  • NVIDIA TensorRT-LLM backend: We’re collaborating with the NVIDIA TensorRT-LLM team to bring all the optimized NVIDIA GPU + TensorRT performance to the community. This work will be covered more extensively in an upcoming blog post. It ties closely into our mission to empower AI builders through the open-source availability of both optimum-nvidia, to quantize/build/evaluate TensorRT-compatible artifacts, and TGI+TRT-LLM, to easily deploy, execute, and scale deployments on NVIDIA GPUs.
  • Llama.cpp backend: We’re collaborating with the llama.cpp team to extend support for server production use cases. The llama.cpp backend for TGI will provide a strong CPU-based option for anyone willing to deploy on Intel, AMD, or ARM CPU servers.
  • vLLM backend: We’re contributing to the vLLM project and aim to integrate vLLM as a TGI backend in Q1 ’25.
  • AWS Neuron backend: We’re working with the Neuron teams at AWS to enable Inferentia 2 and Trainium 2 support natively in TGI.
  • Google TPU backend: We’re working with the Google Jetstream & TPU teams to offer the best performance through TGI.

We’re confident TGI Backends will help simplify LLM deployments, bringing versatility and performance to all TGI users.
You will soon be able to use TGI Backends directly within Inference Endpoints. Customers will be able to easily deploy models with TGI Backends on various hardware, with top-tier performance and reliability out of the box.

Stay tuned for the next blog post, where we’ll dig into the technical details and performance benchmarks of upcoming backends!


