
Retrieval-augmented generation (RAG) enhances text generation with a large language model by incorporating fresh domain knowledge stored in an external datastore. Separating your organization's data from the knowledge learned by language models during training is essential to balance performance, accuracy, and security and privacy goals.
In this post, you'll learn how Intel can help you develop and deploy RAG applications as part of OPEA, the Open Platform for Enterprise AI. You will also discover how Intel Gaudi 2 AI accelerators and Xeon CPUs can significantly enhance enterprise performance through a real-world RAG use case.
Before diving into the details, let's look at the hardware first. Intel Gaudi 2 is purpose-built to accelerate deep learning training and inference in the data center and cloud. It is publicly available on the Intel Developer Cloud (IDC) and for on-premises implementations. IDC is the easiest way to get started with Gaudi 2. If you don't have an account yet, please register for one, subscribe to "Premium," and then apply for access.
On the software side, we will build our application with LangChain, an open-source framework designed to simplify the creation of AI applications with LLMs. It provides template-based solutions, allowing developers to build RAG applications with custom embeddings, vector databases, and LLMs. The LangChain documentation provides more information. Intel has been actively contributing multiple optimizations to LangChain, enabling developers to deploy GenAI applications efficiently on Intel platforms.
In LangChain, we will use the rag-redis template to create our RAG application, with the BAAI/bge-base-en-v1.5 embedding model and Redis as the default vector database. The diagram below shows the high-level architecture.

The embedding model will run on an Intel Granite Rapids CPU. The Intel Granite Rapids architecture is optimized to deliver the lowest total cost of ownership (TCO) for high-core performance-sensitive workloads and general-purpose compute workloads. Granite Rapids (GNR) also supports the AMX-FP16 instruction set, resulting in a 2-3x performance increase for mixed AI workloads.
The LLM will run on an Intel Gaudi 2 accelerator. For Hugging Face models, the Optimum Habana library is the interface between the Hugging Face Transformers and Diffusers libraries and Gaudi. It offers tools for easy model loading, training, and inference on single- and multi-card settings for various downstream tasks.
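For illustration, here is a minimal inference sketch with Optimum Habana. It is not part of the RAG pipeline built below (which serves the LLM through TGI instead) and assumes the Gaudi software stack and the optimum-habana package are installed; the model name is only an example.

import torch
import habana_frameworks.torch.core as htcore  # registers the HPU device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch Transformers with Gaudi-optimized code paths
adapt_transformers_to_gaudi()

model_id = "Intel/neural-chat-7b-v3-3"  # example model, swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval().to("hpu")

inputs = tokenizer("What is retrieval-augmented generation?", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))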
We provide a Dockerfile to streamline the setup of the LangChain development environment. Once you have launched the Docker container, you can start building the vector database, the RAG pipeline, and the LangChain application within the Docker environment. For a detailed step-by-step walkthrough, follow the ChatQnA example.
To populate the vector database, we use public financial documents from Nike. Here is the sample code.
import os

# HuggingFaceEmbeddings and the Redis vector store ship with LangChain's
# community integrations; pdf_loader, text_splitter, and the EMBED_MODEL,
# INDEX_NAME, INDEX_SCHEMA, and REDIS_URL constants are defined in the
# rag-redis template.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Redis

# Ingest PDF files that contain Edgar 10k filings data for Nike.
company_name = "Nike"
data_path = "data"
doc_path = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]
content = pdf_loader(doc_path)
chunks = text_splitter.split_text(content)

# Create vectorstore
embedder = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
_ = Redis.from_texts(
    texts=[f"Company: {company_name}. " + chunk for chunk in chunks],
    embedding=embedder,
    index_name=INDEX_NAME,
    index_schema=INDEX_SCHEMA,
    redis_url=REDIS_URL,
)
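As a quick sanity check (not part of the original template code), you can query the freshly built index and verify that relevant chunks come back; the question below is just an example.

# Hypothetical sanity check: reconnect to the index and run a similarity search.
results = Redis.from_existing_index(
    embedding=embedder, index_name=INDEX_NAME, schema=INDEX_SCHEMA, redis_url=REDIS_URL
).similarity_search("What was Nike's total revenue?", k=2)
for doc in results:
    print(doc.page_content[:200])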
In LangChain, we use the Chain API to connect the prompt, the vector database, and the embedding model. The complete code is available in the repository.
# The imports below come from LangChain's core and community packages; Query is
# the pydantic input type defined by the rag-redis template.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.vectorstores import Redis
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Embedding model running on the Xeon CPU
embedder = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

# Redis vector database
vectorstore = Redis.from_existing_index(
    embedding=embedder, index_name=INDEX_NAME, schema=INDEX_SCHEMA, redis_url=REDIS_URL
)

# Retriever (maximal marginal relevance search)
retriever = vectorstore.as_retriever(search_type="mmr")

# Prompt template
template = """…"""
prompt = ChatPromptTemplate.from_template(template)

# Hugging Face LLM running on Gaudi 2 behind the TGI endpoint
model = HuggingFaceEndpoint(endpoint_url=TGI_LLM_ENDPOINT, …)

# RAG chain
chain = (
    RunnableParallel({"context": retriever, "query": RunnablePassthrough()}) | prompt | model | StrOutputParser()
).with_types(input_type=Query)
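With the chain assembled, a call like the following (illustrative, not from the original code) retrieves relevant chunks and returns a grounded answer:

# Illustrative invocation; the chain takes the user's question as input.
answer = chain.invoke("What was Nike's revenue in its latest 10-K filing?")
print(answer)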
We'll run our chat model on Gaudi 2 with the Hugging Face Text Generation Inference (TGI) server. This combination enables high-performance text generation for popular open-source LLMs, such as MPT, Llama, and Mistral, on Gaudi 2 hardware.
No setup is required. We can use a pre-built Docker image and pass the model name (e.g., Intel NeuralChat).
model=Intel/neural-chat-7b-v3-3
volume=$PWD/data
docker run -p 8080:80 -v $volume:/data --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model
The service uses a single Gaudi accelerator by default. Multiple accelerators may be required to run a larger model (e.g., 70B). In that case, please add the appropriate parameters, e.g. --sharded true and --num_shard 8. For gated models such as Llama or StarCoder, you will also need to specify -e HUGGING_FACE_HUB_TOKEN= using your Hugging Face token.
Once the container runs, we check that the service works by sending a request to the TGI endpoint.
curl localhost:8080/generate -X POST \
    -d '{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"max_new_tokens":128, "do_sample": true}}' \
    -H 'Content-Type: application/json'
If you see a generated response, the LLM is running correctly and you can now enjoy high-performance inference on Gaudi 2!
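If you prefer Python, an equivalent check with the requests library looks like this (a sketch assuming TGI is listening on localhost:8080):

import requests

payload = {
    "inputs": "Which NFL team won the Super Bowl in the 2010 season?",
    "parameters": {"max_new_tokens": 128, "do_sample": True},
}
# TGI's /generate route returns a JSON object with a "generated_text" field.
response = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(response.json()["generated_text"])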
The TGI Gaudi container uses the bfloat16 data type by default. For higher throughput, you may want to enable FP8 quantization. According to our test results, FP8 quantization should yield a 1.8x throughput gain compared to BF16. FP8 instructions are available in the README file.
Lastly, you can enable content moderation with the Meta Llama Guard model. The README file provides instructions for deploying Llama Guard on TGI Gaudi.
We use the following instructions to launch the RAG application backend service. The server.py script defines the service endpoints using FastAPI.
docker exec -it qna-rag-redis-server bash
nohup python app/server.py &
By default, the TGI Gaudi endpoint is expected to run on localhost at port 8080 (i.e. http://127.0.0.1:8080). If it is running at a different address or port, please set the TGI_ENDPOINT environment variable accordingly.
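Once the backend is up, a quick smoke test along these lines can confirm it answers RAG queries. This is only a sketch: the route, port, and payload shape below are hypothetical and depend on how server.py registers the chain, so adjust them to the ChatQnA example you deployed.

import requests

# Hypothetical smoke test; replace the route and port with the ones defined in server.py.
resp = requests.post(
    "http://127.0.0.1:8000/v1/rag/chat",  # placeholder endpoint
    json={"query": "How did Nike's revenue evolve year over year?"},
    timeout=120,
)
print(resp.status_code, resp.text[:500])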
We use the instructions below to install the frontend GUI components.
sudo apt-get install npm &&
npm install -g n &&
n stable &&
hash -r &&
npm install -g npm@latest
Then, we update the DOC_BASE_URL environment variable in the .env file by replacing the localhost IP address (127.0.0.1) with the actual IP address of the server where the GUI runs.
We run the following command to install the required dependencies:
npm install
Finally, we start the GUI server with the following command:
nohup npm run dev &
This will run the frontend service and launch the application.

We ran extensive experiments with different models and configurations. The two figures below show the relative end-to-end throughput and performance per dollar comparison for the Llama2-70B model with 16 concurrent users on four Intel Gaudi 2 and four Nvidia H100 platforms.


In both cases, the same Intel Granite Rapids CPU platform is used for the vector database and embedding model. For the performance per dollar comparison, we use publicly available pricing to compute an average training performance per dollar, the same as the one reported by the MosaicML team in January 2024.
As you can see, the H100-based system delivers 1.13x more throughput but only 0.44x performance per dollar compared to Gaudi 2. These comparisons may vary based on customer-specific discounts from different cloud providers. Detailed benchmark configurations are listed at the end of the post.
The deployment example above successfully demonstrates a RAG-based chatbot on Intel platforms. Moreover, as Intel keeps releasing ready-to-go GenAI examples, developers benefit from validated tools that simplify the creation and deployment process. These examples offer versatility and ease of customization, making them ideal for a wide range of applications on Intel platforms.
When running enterprise AI applications, the total cost of ownership is more favorable with systems based on Intel Granite Rapids CPUs and Gaudi 2 accelerators. Further improvements can be achieved with FP8 optimization.
The following developer resources should help you kickstart your GenAI projects confidently.
If you have questions or feedback, we'd love to answer them on the Hugging Face forum. Thanks for reading!
Acknowledgements:
We would like to thank Chaitanya Khened, Suyue Chen, Mikolaj Zyczynski, Wenjiao Yue, Wenxin Zhang, Letong Han, Sihan Chen, Hanwen Cheng, Yuan Wu, and Yi Wang for their outstanding contributions to building enterprise-grade RAG systems on Intel Gaudi 2.
Benchmark configurations
- Gaudi 2 configuration: HLS-Gaudi2 with eight Habana Gaudi 2 HL-225H mezzanine cards, two Intel® Xeon® Platinum 8380 CPUs @ 2.30 GHz, and 1TB of system memory; OS: Ubuntu 22.04.03, 5.15.0 kernel
- H100 SXM configuration: Lambda Labs instance gpu_8x_h100_sxm5; 8x H100 SXM, two Intel® Xeon® Platinum 8480 CPUs @ 2 GHz, and 1.8TB of system memory; OS: Ubuntu 20.04.6 LTS, 5.15.0 kernel
- Intel Xeon: pre-production Granite Rapids platform with 2S x 120C @ 1.9 GHz and 8800 MCR DIMMs with 1.5TB of system memory; OS: CentOS 9, 6.2.0 kernel
- Llama2 70B is deployed to 4 cards (queries normalized to 8 cards). BF16 for Gaudi 2 and FP16 for H100.
- Embedding model: BAAI/bge-base v1.5. Tested with: TGI-gaudi 1.2.1, TGI-GPU 1.4.5, Python 3.11.7, LangChain 0.1.11, sentence-transformers 2.5.1, langchain benchmarks 0.0.10, redis 5.0.2, cuda 12.2.r12.2/compiler.32965470_0, TEI 1.2.0
- RAG queries: max input length 1024, max output length 128. Test dataset: langsmith Q&A. Number of concurrent clients: 16
- TGI parameters for Gaudi 2 (70B): batch_bucket_size=22, prefill_batch_bucket_size=4, max_batch_prefill_tokens=5102, max_batch_total_tokens=32256, max_waiting_tokens=5, streaming=false
- TGI parameters for H100 (70B): batch_bucket_size=8, prefill_batch_bucket_size=4, max_batch_prefill_tokens=4096, max_batch_total_tokens=131072, max_waiting_tokens=20, max_batch_size=128, streaming=false
- TCO reference: https://www.databricks.com/blog/llm-training-and-inference-intel-gaudi2-ai-accelerators
