🚀 Accelerating LLM Inference with TGI on Intel Gaudi




We’re excited to announce the native integration of Intel Gaudi hardware support directly into Text Generation Inference (TGI), our production-ready serving solution for Large Language Models (LLMs). This integration brings the power of Intel’s specialized AI accelerators to our high-performance inference stack, enabling more deployment options for the open-source AI community 🎉



✨ What’s New?

We have fully integrated Gaudi support into TGI’s main codebase in PR #3091. Previously, we maintained a separate fork for Gaudi devices at tgi-gaudi. This was cumbersome for users and prevented us from supporting the newest TGI features at launch. Now, using the new TGI multi-backend architecture, we support Gaudi directly in TGI – no more fiddling with a custom repository 🙌

This integration supports Intel’s full line of Gaudi hardware, including the Gaudi 1, Gaudi 2, and Gaudi 3 generations.

You can find more information on Gaudi hardware on Intel’s Gaudi product page.



🌟 Why This Matters

The Gaudi backend for TGI provides several key advantages:

  • Hardware Diversity 🔄: More options for deploying LLMs in production beyond traditional GPUs
  • Cost Efficiency 💰: Gaudi hardware often provides compelling price-performance for specific workloads
  • Production-Ready ⚙️: All of the robustness of TGI (dynamic batching, streamed responses, etc.) now available on Gaudi
  • Model Support 🤖: Run popular models like Llama 3.1, Mixtral, Mistral, and more on Gaudi hardware
  • Advanced Features 🔥: Support for multi-card inference (sharding), vision-language models, and FP8 precision



🚦 Getting Started with TGI on Gaudi

The easiest way to run TGI on Gaudi is to use our official Docker image. You need to run the image on a machine with Gaudi hardware. Here’s a basic example to get you started:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
 -p 8080:80 \
 -v $volume:/data \
 -e HF_TOKEN=$hf_token \
 -e HABANA_VISIBLE_DEVICES=all \
 ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
 --model-id $model
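
Larger models can be sharded across multiple Gaudi cards using TGI’s standard sharding flags. Here is a minimal sketch, assuming a machine with 8 Gaudi cards and the same volume and token variables as above (the 70B model choice is illustrative):

# Shard a large model (e.g. Llama 3.1 70B) across 8 Gaudi cards
model=meta-llama/Meta-Llama-3.1-70B-Instruct

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
 -p 8080:80 \
 -v $volume:/data \
 -e HF_TOKEN=$hf_token \
 -e HABANA_VISIBLE_DEVICES=all \
 ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
 --model-id $model \
 --sharded true --num-shard 8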

Once the server is running, you can send inference requests:

curl 127.0.0.1:8080/generate \
 -X POST \
 -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
 -H 'Content-Type: application/json'
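
If you want tokens as they are generated rather than a single final response, TGI also exposes a streaming endpoint that returns server-sent events. A minimal sketch (curl’s -N flag disables output buffering so events print as they arrive):

curl -N 127.0.0.1:8080/generate_stream \
 -X POST \
 -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
 -H 'Content-Type: application/json'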

For comprehensive documentation on using TGI with Gaudi, including how-to guides and advanced configurations, refer to the new dedicated Gaudi backend documentation.



🎉 Top Features

We have optimized the following models for both single- and multi-card configurations. This means these models run as fast as possible on Intel Gaudi. We have specifically optimized the modeling code to target Intel Gaudi hardware, ensuring we deliver the best performance and fully utilize Gaudi’s capabilities:

  • Llama 3.1 (8B and 70B)
  • Llama 3.3 (70B)
  • Llama 3.2 Vision (11B)
  • Mistral (7B)
  • Mixtral (8x7B)
  • CodeLlama (13B)
  • Falcon (180B)
  • Qwen2 (72B)
  • Starcoder and Starcoder2
  • Gemma (7B)
  • Llava-v1.6-Mistral-7B
  • Phi-2

🏃‍♂️ We also offer many advanced features on Gaudi hardware, such as FP8 quantization thanks to Intel Neural Compressor (INC), enabling even greater performance optimizations.
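
As a rough illustration, FP8 inference in the Habana/INC stack is typically driven by a quantization config passed through the QUANT_CONFIG environment variable. The config path below is hypothetical; see the Gaudi backend documentation for the exact, supported procedure:

# QUANT_CONFIG points to an INC quantization config; the path below is
# illustrative – generate a real measurement/quantization config first
# (see the Gaudi backend docs)
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
 -p 8080:80 \
 -v $volume:/data \
 -e HF_TOKEN=$hf_token \
 -e HABANA_VISIBLE_DEVICES=all \
 -e QUANT_CONFIG=/data/maxabs_quant.json \
 ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
 --model-id $model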

✨ Coming soon! We’re excited to expand our model lineup with cutting-edge additions, including DeepSeek-R1/V3, Qwen-VL, and more, to power your AI applications! 🚀



💪 Getting Involved

We invite the community to try out TGI on Gaudi hardware and provide feedback. The full documentation is available in the TGI Gaudi backend documentation. 📚 If you’re interested in contributing, check out our contribution guidelines or open an issue with your feedback on GitHub. 🤝 By bringing Intel Gaudi support directly into TGI, we’re continuing our mission to provide flexible, efficient, and production-ready tools for deploying LLMs. We’re excited to see what you’ll build with this new capability! 🎉


