In natural language processing (NLP) and information retrieval, the ability to efficiently and accurately retrieve relevant information is paramount. As the field continues to evolve, new techniques and methodologies are being developed to improve the performance of retrieval systems, particularly within the context of Retrieval Augmented Generation (RAG). One such technique, known as two-stage retrieval with rerankers, has emerged as a robust solution to the inherent limitations of traditional retrieval methods.
In this blog post, we'll delve into the intricacies of two-stage retrieval and rerankers, exploring their underlying principles, implementation strategies, and the advantages they provide in improving the accuracy and efficiency of RAG systems. We'll also provide practical examples and code snippets to illustrate the concepts and facilitate a deeper understanding of this technique.
Understanding Retrieval Augmented Generation (RAG)
Before diving into the specifics of two-stage retrieval and rerankers, let's briefly revisit the concept of Retrieval Augmented Generation (RAG). RAG is a technique that extends the knowledge and capabilities of large language models (LLMs) by giving them access to external information sources, such as databases or document collections. For more background, see the article "A Deep Dive into Retrieval Augmented Generation in LLM".
A typical RAG process involves the following steps (a minimal sketch in code follows the list):
- Query: A user poses a question or provides an instruction to the system.
- Retrieval: The system queries a vector database or document collection to find information relevant to the user's query.
- Augmentation: The retrieved information is combined with the user’s original query or instruction.
- Generation: The language model processes the augmented input and generates a response, leveraging the external information to improve the accuracy and comprehensiveness of its output.
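To make the flow concrete, here is a minimal sketch of that loop. The retrieve() helper, the prompt format, and the model objects are placeholders rather than any particular library's API:
def rag_answer(query, retrieve, llm, tokenizer, top_k=3):
    # Retrieval: fetch the top-k relevant documents (retrieve() is a placeholder helper)
    docs = retrieve(query, top_k)
    # Augmentation: combine the retrieved context with the original query
    prompt = "Context:\n" + "\n".join(docs) + "\n\nQuestion: " + query
    # Generation: the language model answers using the augmented input
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = llm.generate(**inputs, max_length=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)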
While RAG has proven to be a powerful technique, it is not without challenges. One of the key issues lies in the retrieval stage, where traditional retrieval methods may fail to identify the most relevant documents, resulting in suboptimal or inaccurate responses from the language model.
The Need for Two-Stage Retrieval and Rerankers
Traditional retrieval methods, such as those based on keyword matching or vector space models, often struggle to capture the nuanced semantic relationships between queries and documents. This limitation can result in retrieving documents that are only superficially relevant, or in missing crucial information that could significantly improve the quality of the generated response.
To address this challenge, researchers and practitioners have turned to two-stage retrieval with rerankers. This approach involves a two-step process:
- Initial Retrieval: In the first stage, a relatively large set of potentially relevant documents is retrieved using a fast and efficient retrieval method, such as a vector space model or a keyword-based search.
- Reranking: In the second stage, a more sophisticated reranking model is employed to reorder the initially retrieved documents based on their relevance to the query, effectively bringing the most relevant documents to the top of the list.
The reranking model, often a neural network or a transformer-based architecture, is specifically trained to assess the relevance of a document to a given query. By leveraging advanced natural language understanding capabilities, the reranker can capture the semantic nuances and contextual relationships between the query and the documents, resulting in a more accurate and relevant ranking.
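For instance, a cross-encoder from the sentence-transformers library can play the role of such a reranking model. This is a minimal sketch; the checkpoint name is a publicly available MS MARCO cross-encoder, and the candidate documents are assumed to come from a first-stage retriever:
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly, capturing fine-grained interactions
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do rerankers improve retrieval for RAG?"
candidates = ["first candidate passage...", "second candidate passage..."]  # from the first-stage retriever

# Score every (query, document) pair, then sort candidates from most to least relevant
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]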
Advantages of Two-Stage Retrieval and Rerankers
The adoption of two-stage retrieval with rerankers offers several significant advantages in the context of RAG systems:
- Improved Accuracy: By reranking the initially retrieved documents and promoting the most relevant ones to the top, the system can provide more accurate and precise information to the language model, resulting in higher-quality generated responses.
- Mitigated Out-of-Domain Issues: Embedding models used for traditional retrieval are often trained on general-purpose text corpora, which may not adequately capture domain-specific language and semantics. Reranking models, on the other hand, can be trained on domain-specific data, mitigating the "out-of-domain" problem and improving the relevance of retrieved documents within specialized domains.
- Scalability: The two-stage approach allows for efficient scaling by using fast and lightweight retrieval methods in the initial stage, while reserving the more computationally intensive reranking process for a smaller subset of documents.
- Flexibility: Reranking models can be swapped or updated independently of the initial retrieval method, providing flexibility and adaptability as the needs of the system evolve.
ColBERT: Efficient and Effective Late Interaction
One of the standout models in the realm of reranking is ColBERT (Contextualized Late Interaction over BERT). ColBERT is a retrieval and reranking model that leverages the deep language understanding capabilities of BERT while introducing a novel interaction mechanism known as "late interaction," described in the paper "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT."
The late interaction mechanism in ColBERT allows for efficient and precise retrieval by processing queries and documents separately until the final stages of the retrieval process. Specifically, ColBERT independently encodes the query and the document using BERT, and then employs a lightweight yet powerful interaction step that models their fine-grained similarity. By delaying but retaining this fine-grained interaction, ColBERT can leverage the expressiveness of deep language models while simultaneously gaining the ability to pre-compute document representations offline, considerably speeding up query processing.
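That interaction step is a simple "MaxSim" operation: for each query token embedding, take its maximum similarity against all document token embeddings, then sum these per-token maxima into a relevance score. A minimal PyTorch sketch of the scoring, with random vectors standing in for real BERT token embeddings:
import torch

def colbert_score(query_embs, doc_embs):
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim), both L2-normalized
    # Late interaction (MaxSim): each query token matches its best document token,
    # and the per-token maxima are summed into a single relevance score.
    similarity = query_embs @ doc_embs.T          # (num_query_tokens, num_doc_tokens)
    return similarity.max(dim=1).values.sum().item()

# Toy example with random embeddings in place of real BERT outputs
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(100, 128), dim=-1)
print(colbert_score(q, d))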
ColBERT’s late interaction architecture offers several advantages, including improved computational efficiency, scalability with document collection size, and practical applicability for real-world scenarios. Moreover, ColBERT has been further enhanced with techniques like denoised supervision and residual compression (in ColBERTv2), which refine the training process and reduce the model’s space footprint while maintaining high retrieval effectiveness.
The snippet below shows how one might configure and use the jina-colbert-v1-en model for indexing a collection of documents, leveraging its ability to handle long contexts efficiently.
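A minimal sketch, assuming the RAGatouille library (pip install ragatouille) as the interface to the checkpoint; the document list and index name are placeholders:
from ragatouille import RAGPretrainedModel

# Load the jina-colbert-v1-en checkpoint (supports contexts up to 8192 tokens)
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v1-en")

# Index a small set of documents; RAGatouille handles splitting and encoding
documents = ["First document text...", "Second document text..."]  # placeholder documents
RAG.index(
    collection=documents,
    index_name="my_colbert_index",   # placeholder index name
    max_document_length=8192,        # take advantage of the long-context support
    split_documents=True,
)

# Query the index; returns the top-k most relevant passages
results = RAG.search("What is late interaction in ColBERT?", k=3)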
Implementing Two-Stage Retrieval with Rerankers
Now that we have an understanding of the principles behind two-stage retrieval and rerankers, let's explore their practical implementation within the context of a RAG system. We'll use popular libraries and frameworks to demonstrate the integration of these techniques.
Setting Up the Environment
Before we dive into the code, let's set up our development environment. We'll be using Python and several popular NLP libraries, including Hugging Face Transformers, Sentence Transformers, and LanceDB.
# Install required libraries
!pip install datasets huggingface_hub sentence_transformers lancedb
Data Preparation
For demonstration purposes, we'll use the "ai-arxiv-chunked" dataset from Hugging Face Datasets, which contains over 400 ArXiv papers on machine learning, natural language processing, and large language models.
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
Next, we'll preprocess the data and split it into smaller chunks to facilitate efficient retrieval and processing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, chunk_size=512, overlap=64):
    # Tokenize without special tokens so the chunks can be decoded back to clean text
    tokens = tokenizer.encode(text, add_special_tokens=False)
    # Slide a window of chunk_size tokens, stepping by (chunk_size - overlap)
    # so that consecutive chunks overlap
    texts = []
    for start in range(0, len(tokens), chunk_size - overlap):
        texts.append(tokenizer.decode(tokens[start:start + chunk_size]))
    return texts

chunked_data = []
for doc in dataset:
    text = doc["chunk"]
    chunked_data.extend(chunk_text(text))
For the initial retrieval stage, we'll use a Sentence Transformer model to encode our documents and queries into dense vector representations, and then perform approximate nearest-neighbor search using a vector database like LanceDB.
import lancedb
from sentence_transformers import SentenceTransformer

# Load Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Connect to (or create) a LanceDB database on disk
db = lancedb.connect('/path/to/store')

# Index documents: store each chunk alongside its embedding in a 'docs' table
records = [
    {"vector": model.encode(text).tolist(), "text": text}
    for text in chunked_data
]
table = db.create_table('docs', data=records)
With our documents indexed, we can perform the initial retrieval by finding the nearest neighbors of a given query vector.
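A sketch of that first stage, reusing the model and table created above (the example query is illustrative):
# Example query (illustrative)
query = "What are the key benefits of retrieval augmented generation?"
query_vector = model.encode(query).tolist()

# First-stage retrieval: approximate nearest-neighbor search over the indexed chunks
initial_results = table.search(query_vector).limit(20).to_list()
initial_docs = [hit["text"] for hit in initial_results]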
Reranking
After the initial retrieval, we'll employ a reranking model to reorder the retrieved documents based on their relevance to the query. In this example, we'll use the ColBERT reranker exposed by LanceDB, a fast and accurate transformer-based model specifically designed for document ranking. The exact query-builder calls below follow LanceDB's reranking interface and may differ slightly between library versions.
from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()
# Rerank by combining the vector query with a full-text query over the "text" column
# (depending on the LanceDB version, the full-text index may require the tantivy package)
table.create_fts_index("text")
hits = (table.search(query_type="hybrid").vector(query_vector).text(query)
        .rerank(reranker=reranker).limit(10).to_list())
reranked_docs = [hit["text"] for hit in hits]
The reranked_docs list now contains the documents reordered by their relevance to the query, as determined by the ColBERT reranker.
Augmentation and Generation
With the reranked, relevant documents in hand, we can proceed to the augmentation and generation stages of the RAG pipeline. We'll use a language model from the Hugging Face Transformers library to generate the final response.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

gen_tokenizer = AutoTokenizer.from_pretrained("t5-base")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Augment the query with the top reranked documents
augmented_query = query + " " + " ".join(reranked_docs[:3])

# Generate a response from the language model (truncate the input to T5's context window)
input_ids = gen_tokenizer.encode(augmented_query, return_tensors="pt", truncation=True, max_length=512)
output_ids = gen_model.generate(input_ids, max_length=500)
response = gen_tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)