Embedding models are useful for many applications such as retrieval, reranking, clustering, and classification. The research community has seen significant advances in embedding models in recent years, leading to substantial improvements in all applications built on semantic representations. Models such as BGE, GTE, and E5 sit at the top of the MTEB benchmark and in some cases outperform proprietary embedding services. Hugging Face's Model Hub hosts a range of model sizes, from lightweight models (100-350M parameters) to 7B models (such as Salesforce/SFR-Embedding-Mistral). The lightweight, encoder-based models are ideal candidates for optimization and deployment on CPU backends running semantic search applications, such as Retrieval Augmented Generation (RAG).
In this blog, we show how to unlock significant performance gains on Xeon-based CPUs and how easy it is to integrate the optimized models into existing RAG pipelines using fastRAG.
Information Retrieval with Embedding Models
Embedding models encode textual data into dense vectors that capture semantic and contextual meaning. This enables accurate information retrieval, since relationships between words and documents are represented in context. Semantic similarity is typically measured as the cosine similarity between embedding vectors.
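As a quick illustration (our own sketch, not code from the models discussed here), cosine similarity between two embedding vectors can be computed as follows:

import torch
import torch.nn.functional as F

# Two illustrative embedding vectors (in practice, produced by an embedding model)
query_vec = torch.randn(384)
doc_vec = torch.randn(384)

# Cosine similarity: dot product of the L2-normalized vectors, in [-1, 1]
similarity = F.cosine_similarity(query_vec.unsqueeze(0), doc_vec.unsqueeze(0)).item()
print(f"cosine similarity: {similarity:.3f}")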
Should dense vectors always be used for information retrieval? The two dominant approaches involve trade-offs:
- Sparse retrieval matches n-grams, phrases, or metadata to search large collections efficiently and at scale. However, it may miss relevant documents due to lexical gaps between the query and the document.
- Semantic retrieval encodes text into dense vectors, capturing context and meaning better than bag-of-words representations. It can retrieve semantically related documents despite lexical mismatches. However, it is more computationally intensive, has higher latency, and requires sophisticated encoding models compared to lexical matching methods such as BM25.
Embedding models and RAG
Embedding models serve multiple important purposes in RAG applications:
- Offline Process: Encoding documents into dense vectors during indexing/updating of the retrieval document store (index).
- Query Encoding: At query time, they encode the input query into a dense vector representation for retrieval.
- Reranking: After initial retrieval, they can rerank the retrieved documents by encoding them into dense vectors and comparing them against the query vector. This makes it possible to rerank documents that did not initially have dense representations.
Optimizing the embedding model component of a RAG pipeline is therefore highly desirable for efficiency, specifically:
- Document Indexing/Updating: Higher throughput allows encoding and indexing large document collections more rapidly during initial setup or periodic updates.
- Query Encoding: Lower query encoding latency is critical for responsive real-time retrieval. Higher throughput supports encoding many concurrent queries efficiently, enabling scalability.
- Reranking Retrieved Documents: After initial retrieval, embedding models have to quickly encode the retrieved candidates for reranking. Lower latency allows rapid reranking of documents for time-sensitive applications. Higher throughput supports reranking larger candidate sets in parallel for more comprehensive reranking.
Optimizing Embedding Models with Optimum Intel and IPEX
Optimum Intel is an open-source library that accelerates end-to-end pipelines built with Hugging Face libraries on Intel hardware. Optimum Intel includes several techniques to accelerate models, such as low-bit quantization, model weight pruning, distillation, and an accelerated runtime.
The runtime and optimizations included in Optimum Intel take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI), and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs to accelerate models. Specifically, AMX provides built-in BFloat16 (bf16) and int8 GEMM accelerators in every core to accelerate deep learning training and inference workloads. AMX-accelerated inference was introduced in PyTorch 2.0 and the Intel Extension for PyTorch (IPEX), along with other optimizations for various common operators.
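To check whether your CPU exposes these instruction sets, a simple Linux-only sanity check (our own snippet, not part of Optimum Intel) is to look at the flags reported in /proc/cpuinfo:

# Linux-only: check for the instruction sets mentioned above
# (flag names as reported by the kernel on 4th gen Xeon)
with open("/proc/cpuinfo") as f:
    flags = f.read()

for feature in ["avx512f", "avx512_vnni", "amx_bf16", "amx_int8", "amx_tile"]:
    print(f"{feature}: {'supported' if feature in flags else 'not found'}")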
Optimizing pre-trained models can be done easily with Optimum Intel; many simple examples can be found here.
Example: Optimizing BGE Embedding Models
In this blog, we focus on the embedding models recently released by researchers at the Beijing Academy of Artificial Intelligence (BAAI), as their models show competitive results on the widely adopted MTEB leaderboard.
BGE Technical Details
Bi-encoder models are Transformer-based encoders trained so that a similarity metric, such as cosine similarity, is maximized between the vector representations of two semantically similar texts. For example, popular embedding models use BERT as the pre-trained base model and fine-tune it for embedding documents. The vector representing the encoded text is derived from the model outputs; for example, it can be the [CLS] token vector or the mean of all token vectors.
Unlike more complex embedding architectures, bi-encoders encode each document on its own, so they lack contextual interaction between encoded entities such as query-document or document-document pairs. However, state-of-the-art bi-encoder embedding models offer competitive quality and are extremely fast thanks to their simple architecture.
We focus on three BGE models: small, base, and large, with 45M, 110M, and 355M parameters, producing embedding vectors of size 384, 768, and 1024, respectively.
Note that the optimization process showcased below is generic and can be applied to other embedding models (bi-encoders, cross-encoders, and the like).
Step-by-step: Optimization by Quantization
We present a step-by-step guide for improving the performance of embedding models, focusing on reducing latency (with a batch size of 1) and increasing throughput (measured in documents encoded per second). The recipe uses optimum-intel and the Intel Neural Compressor to quantize the model, and IPEX for an optimized runtime on Intel hardware.
Step 1: Installing Packages
To install optimum-intel and intel-extension-for-transformers, run the following command:
pip install -U optimum[neural-compressor] intel-extension-for-transformers
Step 2: Post-training Static Quantization
Post-training static quantization requires a calibration set to determine the dynamic range of weights and activations. Calibration is done by running a representative set of data samples through the model, collecting statistics, and then quantizing the model based on the gathered information to minimize the accuracy loss.
The following snippet shows the quantization code:
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

def quantize(model_name: str, output_path: str, calibration_set: "datasets.Dataset"):
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Tokenize the calibration samples to a fixed length of 512 tokens
    def preprocess_function(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=512, truncation=True)

    vectorized_ds = calibration_set.map(preprocess_function, num_proc=10)
    vectorized_ds = vectorized_ds.remove_columns(["text"])

    # Static post-training quantization with the IPEX backend
    quantizer = INCQuantizer.from_pretrained(model)
    quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", domain="nlp")
    quantizer.quantize(
        quantization_config=quantization_config,
        calibration_dataset=vectorized_ds,
        save_directory=output_path,
        batch_size=1,
    )
    tokenizer.save_pretrained(output_path)
In our calibration process we use a subset of the qasper dataset.
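For reference, a call to the quantize function might look like the sketch below; selecting the abstract field and the first 100 training examples is our own illustrative choice, not necessarily the exact calibration setup used for the published models:

from datasets import load_dataset

# Build a calibration set with a single "text" column, as expected by quantize()
calibration_set = load_dataset("allenai/qasper", split="train[:100]")
calibration_set = calibration_set.map(
    lambda example: {"text": example["abstract"]},
    remove_columns=calibration_set.column_names,
)

quantize("BAAI/bge-small-en-v1.5", "bge-small-int8-static", calibration_set)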
Step 3: Loading and running inference
A quantized model can be loaded by simply running:
from optimum.intel import IPEXModel
model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
Encoding sentences into vectors works just as it does with the Transformers library:
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
sentences = ["Paris is the capital of France.", "There is a blue house on Oxford Street."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    # Take the [CLS] token vector as the sentence embedding
    embeddings = outputs[0][:, 0]
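If you plan to compare the embeddings with cosine similarity (as is typical for BGE models), a common follow-up step, added here for illustration, is to L2-normalize the vectors so that a simple dot product yields the cosine score:

import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = embeddings @ embeddings.T  # pairwise similarity matrix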
We provide additional important details on how to configure the CPU backend (correct machine setup) in the evaluation section below.
Model Evaluation with MTEB
Quantizing the models' weights to a lower precision introduces accuracy loss, as we lose precision when moving from fp32 to int8. Therefore, we validate the accuracy of the optimized models by comparing them to the original models on two MTEB tasks:
- Retrieval – where a corpus is encoded and ranked lists are created by searching the index given a query.
- Reranking – reranking the retrieval results for better relevance given a query.
The table below shows the average accuracy (over multiple datasets) per task type (MAP for Reranking, NDCG@10 for Retrieval), where int8 is our quantized model and fp32 is the original model (results taken from the official MTEB leaderboard). The quantized models show less than a 1% drop relative to the original models on the Reranking task and less than 1.55% on the Retrieval task.
Table: average MAP for the Reranking task and average NDCG@10 for the Retrieval task, comparing each quantized (int8) BGE model to its original fp32 counterpart.
Speed and Latency
We compare the performance of our models against two other common ways of running the models:
- Using PyTorch and Hugging Face's Transformers library with bf16.
- Using the Intel Extension for PyTorch (IPEX) runtime with bf16 and tracing the model with TorchScript.
Experimental setup notes:
- Hardware (CPU): 4th gen Intel Xeon 8480+ with 2 sockets, 56 cores per socket.
- The PyTorch model was evaluated with 56 cores on 1 CPU socket.
- IPEX/Optimum setups were evaluated with ipexrun, 1 CPU socket, and core counts ranging from 22 to 56.
- TCMalloc was installed and defined as an environment variable in all runs.
How did we run the evaluation?
We created a script that generates random examples using the model's vocabulary. We loaded the original and optimized models and measured how long it takes them to encode these examples in the two scenarios mentioned above: latency when encoding with batch size 1, and throughput when encoding batched examples.
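The snippets below all refer to an inputs batch; a minimal sketch of how such random examples can be generated from the model vocabulary (our illustration of the procedure described above, not the exact benchmarking script) is:

import torch
from transformers import AutoConfig

config = AutoConfig.from_pretrained("BAAI/bge-small-en-v1.5")

batch_size = 1    # 1 for the latency scenario, larger for the throughput scenario
seq_length = 256  # document length in tokens
# Random token IDs drawn from the model's vocabulary, used as input_ids
inputs = torch.randint(config.vocab_size, size=[batch_size, seq_length])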
- Baseline PyTorch and Hugging Face:
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")

@torch.inference_mode()
def encode_text():
    # `inputs` is the batch of random token IDs described above
    outputs = model(inputs)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    encode_text()
- IPEX TorchScript and bf16:
import torch
from transformers import AutoModel
import intel_extension_for_pytorch as ipex

model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
model = ipex.optimize(model, dtype=torch.bfloat16)

# Trace and freeze the model using a random example batch
vocab_size = model.config.vocab_size
batch_size = 1
seq_length = 512
d = torch.randint(vocab_size, size=[batch_size, seq_length])
model = torch.jit.trace(model, (d,), check_trace=False, strict=False)
model = torch.jit.freeze(model)

@torch.inference_mode()
def encode_text():
    # `inputs` is the batch of random token IDs described above
    outputs = model(inputs)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    encode_text()
- Optimum Intel with IPEX and the int8 model:
import torch
from optimum.intel import IPEXModel

model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

@torch.inference_mode()
def encode_text():
    # `inputs` is the batch of random token IDs described above
    outputs = model(inputs)

encode_text()
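To give an idea of how the latency and throughput numbers below can be collected, here is a simplified timing loop around encode_text(); it is a sketch of the measurement approach, not the exact benchmarking script we used:

import time

# Warm-up runs, so one-time costs (e.g., TorchScript compilation) are excluded
for _ in range(10):
    encode_text()

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    encode_text()
elapsed = time.perf_counter() - start

batch_size = inputs.shape[0]
print(f"latency: {1000 * elapsed / n_iters:.2f} ms per batch")
print(f"throughput: {n_iters * batch_size / elapsed:.1f} documents/sec")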
Latency Performance
In this evaluation, we aim to measure how quickly the models respond, which mirrors the query-encoding step of a RAG pipeline. We set the batch size to 1 and measure the latency for various document lengths.
We can see that the quantized model has the best latency overall: under 10 ms for the small and base models and under 20 ms for the large model. Compared to the original model, the quantized model delivers up to a 4.5x latency speedup.

Figure 1. Latency for BGE models.
Throughput Performance
In our throughput evaluation, we aim to find the peak encoding performance in terms of documents per second. We set the text length to 256 tokens, a good estimate of the average document length in a RAG pipeline, and evaluate with different batch sizes (4, 8, 16, 32, 64, 128, 256).
The results show that the quantized models reach higher throughput than the other models and peak at batch size 128. Overall, for all model sizes, the quantized model delivers up to a 4x improvement over the baseline bf16 model across batch sizes.

Figure 2. Throughput for BGE small.

Figure 3. Throughput for BGE base.

Figure 4. Throughput for BGE large.
Optimized Embedding Models with fastRAG
As an example, we will demonstrate how to integrate the optimized retrieval/reranking models into fastRAG (they can also be easily integrated into other RAG frameworks such as LangChain and LlamaIndex).
fastRAG is a research framework, developed by Intel Labs, for efficient and optimized retrieval augmented generation pipelines, incorporating state-of-the-art LLMs and information retrieval. fastRAG is fully compatible with Haystack and includes novel and efficient RAG modules designed for deployment on Intel hardware.
To get started with fastRAG, see the installation instructions here and follow the getting-started guide.
We integrated the optimized bi-encoder embedding models into two modules:
- QuantizedBiEncoderRetriever – for indexing and retrieving documents from a dense index.
- QuantizedBiEncoderRanker – for reranking a list of documents using the embedding model as part of a more elaborate retrieval pipeline.
Fast indexing using the optimized Retriever
Let's create a dense index using a dense retriever backed by an optimized embedding model.
First, create a document store:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_gpu=False, use_bm25=False, embedding_dim=384, return_embedding=True)
Then, add some documents to it:
from haystack.schema import Document

examples = [
    "There is a blue house on Oxford Street.",
    "Paris is the capital of France.",
    "The first commit in fastRAG was in 2022"
]

documents = []
for i, d in enumerate(examples):
    documents.append(Document(content=d, id=i))

document_store.write_documents(documents)
Load a Retriever with an optimized bi-encoder embedding model, and encode all the documents in the document store:
from fastrag.retrievers import QuantizedBiEncoderRetriever
model_id = "Intel/bge-small-en-v1.5-rag-int8-static"
retriever = QuantizedBiEncoderRetriever(document_store=document_store, embedding_model=model_id)
document_store.update_embeddings(retriever=retriever)
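With the embeddings in place, the retriever can be queried directly; the retrieve call below follows the standard Haystack retriever API and is our own usage illustration:

# Retrieve the documents most similar to the query from the dense index
results = retriever.retrieve(query="What is the capital of France?", top_k=2)
for doc in results:
    print(doc.content)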
Reranking using the Optimized Ranker
Below is an example of loading an optimized model into a ranker node that encodes and re-ranks all the documents retrieved from the index, given a query:
from haystack import Pipeline
from fastrag.rankers import QuantizedBiEncoderRanker
ranker = QuantizedBiEncoderRanker("Intel/bge-large-en-v1.5-rag-int8-static")
p = Pipeline()
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
results = p.run(query="What is the capital of France?")
print(results)
Done! The created pipeline can be used to retrieve documents from the document store and rank the retrieved documents with (another) embedding model to re-order them.
A more complete example is provided in this notebook.
For more RAG-related methods, models, and examples, we invite readers to explore the fastRAG/examples notebooks.

