Using Self-Organizing Map To Bolster Retrieval-Augmented Generation In Large Language Models

SOM is proposed to bolster efficient retrieval of LLM context for RAG…

Photo by Werclive 👹 on Unsplash

Background

Large Language Models (LLMs), containing millions to billions of model parameters, are trained on large volumes of data with the goal of text generation, such as text completion, text summarization, language translation, and question answering. While LLMs develop a knowledge base of sorts from the training data sources, there is always a training cut-off date after which the LLM will not know about any newly generated data. For instance, the cut-off date for training OpenAI's GPT-3.5-turbo-instruct LLM is September 2021 (Ref: https://platform.openai.com/docs/models/gpt-3-5-turbo), and as such, GPT-3.5-turbo-instruct may not answer questions about 2022, 2023, or 2024 events accurately. Such data that is not part of the LLM's original training data is referred to as external data. Retrieval-Augmented Generation (RAG) is a technique meant to help in such cases by retrieving information contextual to the input prompt from authorized external sources and augmenting the prompt so that the LLM can generate accurate and relevant responses. Effectively, RAG forms the gateway between the LLM and the external data. Such augmentation eliminates the need to retrain or further fine-tune the LLM.

LLM’s Typical M.O.

LLMs are auto-regressive: they generate a new token based on the input prompt, which is tokenized into a sequence of tokens. The generation of the next best token is probability-based and can be expressed as follows:

P( Yn ∣ X0, X1, ... Xn-1, θ )

Essentially, the probability of the newly generated nth token, Yn, is conditioned on the probability of occurrence of the sequence of the n-1 previous tokens X and the learned model parameters θ. It should be noted here that the tokenized input sequence X plays a crucial role in generating the next token. In addition, self-attention mechanisms complement effective auto-regression, where each input token in the sequence computes its representation by attending to and weighing the importance of the other tokens in the sequence. Such intricate relationships and dependencies among the tokens in the sequence also enable the LLM to decipher the most probable next best token that 'gels well' with the tokens in the input sequence. The LLM appends the new token to the previous tokens to form a new input sequence and repeats the auto-regressive process until a completion condition is met, such as reaching the maximum token count.
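
To make the loop concrete, the following is a minimal greedy-decoding sketch of my own; the `model` and `tokenizer` objects are hypothetical stand-ins for any causal LLM and its tokenizer, and are not part of the example code later in this write-up.

import torch

def greedy_generate( model, tokenizer, prompt : str, max_new_tokens : int = 20 ) -> str :
    # Tokenize the prompt into the sequence X0, X1, ..., Xn-1
    token_ids = tokenizer.encode( prompt )

    for _ in range( max_new_tokens ):
        # The model produces logits over the vocabulary for the next position
        logits = model( torch.tensor( [ token_ids ] ) )[ 0, -1 ]

        # P( Yn | X0, X1, ... Xn-1, θ ): softmax turns the logits into a probability distribution
        probabilities = torch.softmax( logits, dim = -1 )
        next_token_id = int( torch.argmax( probabilities ) )

        # Append the new token and repeat until a completion condition is met
        token_ids.append( next_token_id )
        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode( token_ids )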

Such self-attention-driven auto-regression implies that the LLM relies predominantly on the input sequence to generate the next best token. As long as the input sequence helps determine the next best token through self-attention, the LLM continues in a 'virtuous' loop, generating coherent, comprehensible, and relevant outputs. On the contrary, the LLM will start relying on the model parameters if the prompt inputs do not help determine the next best token. In such a case, the model may succeed in generating the next best token if it has been trained to contain sufficient 'knowledge' contextual to the input prompt. Conversely, the model may go into a 'vicious' loop, generating non-coherent, incomprehensible, and possibly irrelevant outputs if the prompt inputs pertain to 'external data' that the LLM has never been trained on.

Various techniques tackle this issue. Prompt engineering is one of them, where the goal is to address the 'missing context' by adjusting the prompt to enhance the context so that the LLM can generate relevant output. RAG is another technique where the goal is to specifically address the 'missing context due to external data' by retrieving the most appropriate information contextual to the input prompt from external data sources in an automated manner and augmenting the prompt.

RAG’s Challenge

The primary responsibility of RAG is to search and retrieve data that is contextually related to the input prompt from external data sources such as informational databases, APIs, and other document repositories like Wikipedia. A simple keyword search would not cut it. Instead, RAG requires a semantic search. To facilitate semantic search, the textual information retrieved from external sources is transformed into numerical representations or vectors, commonly called text embeddings, and stored in vector databases. There are various models or algorithms for creating these embeddings from text. The prompt is first transformed into its vector representation to search and retrieve the closest matching external data vectors. Vector similarities (or vector distances) are then computed between the prompt vector and the previously stored external data vectors. The most similar or nearest vectors are sorted and filtered using a threshold, and their corresponding textual information is retrieved to augment the prompt's context. The following conceptual diagram captures the typical interactions between the different components for enabling RAG:

Conceptual View of Primary System Component Interactions for Enabling RAG — Image by Author

RAG's challenge is that conducting a vector-driven semantic search is non-trivial and requires significant computational resources because it involves calculating vector similarities or distances against a potentially vast number of vectors in the database. Computing similarity or distance measures against every stored vector in a vast vector database for every input prompt becomes infeasible. Besides, the lower the semantic match quality, the lower the LLM's generative output quality. Therefore, finding a way to conduct the semantic search efficiently becomes crucial.
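
To see why, here is a minimal brute-force sketch of my own (not part of the example code): every stored vector participates in a similarity computation for every single prompt.

import torch

def brute_force_semantic_search( prompt_vector : torch.Tensor, data_vectors : torch.Tensor,
                                 texts : list, sim_threshold : float = 0.8 ):
    # One cosine similarity per stored vector: O(N) work for every prompt
    sims = torch.nn.functional.cosine_similarity( data_vectors, prompt_vector.unsqueeze( 0 ), dim = 1 )

    # Keep only the matches above the threshold, most similar first
    kept = [ ( texts[ i ], sims[ i ].item() ) for i in range( len( texts ) ) if sims[ i ] >= sim_threshold ]
    return sorted( kept, key = lambda t: t[ 1 ], reverse = True )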

Solution

Several algorithmic solutions are employed to conduct efficient semantic searches. The typical approach of such algorithms is to group or cluster external data vectors as nearest neighbors and index them by mapping them to such clusters. Such indexing is offered as a built-in capability by most vector databases. During semantic search, the matched clusters are first evaluated for the input prompt vector. For each evaluated cluster, the indexed vectors are selected. Similarities between the input prompt vector and the selected vectors are then computed. The expectation here is that finding the 'nearest neighbors' as an intermediate step reduces the number of similarity computations significantly. Finally, the textual information corresponding to the most similar or nearest vectors, filtered through thresholding, is retrieved. Algorithms such as k-Nearest Neighbors, Ball-of-Radius-R, Locality-Sensitive Hashing, DBSCAN clustering, tree-like hierarchies, and graph-like hierarchies are typically implemented by vector databases to facilitate semantic searches.
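
The following is a rough sketch of that cluster-then-search idea (again my own illustration with hypothetical names, not any particular vector database's API): data vectors are indexed under their nearest centroid up front, and at query time only the vectors indexed under the nearest centroids are compared against the prompt vector.

import torch

def build_cluster_index( data_vectors : torch.Tensor, centroids : torch.Tensor ) -> dict :
    # Assign every data vector to its nearest centroid (its cluster)
    assignments = torch.cdist( data_vectors, centroids ).argmin( dim = 1 )
    return { c : ( assignments == c ).nonzero().flatten().tolist() for c in range( len( centroids ) ) }

def clustered_semantic_search( prompt_vector, data_vectors, centroids, cluster_index, topk_clusters : int = 2 ):
    # Evaluate only the nearest clusters instead of the whole database
    nearest_clusters = torch.cdist( prompt_vector.unsqueeze( 0 ), centroids ).topk( topk_clusters, largest = False ).indices[ 0 ]
    candidate_ids = [ i for c in nearest_clusters.tolist() for i in cluster_index[ c ] ]

    # Similarities are computed only for the indexed candidates
    candidates = data_vectors[ candidate_ids ]
    sims = torch.nn.functional.cosine_similarity( candidates, prompt_vector.unsqueeze( 0 ), dim = 1 )
    return sorted( zip( candidate_ids, sims.tolist() ), key = lambda t: t[ 1 ], reverse = True )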

There is no one-size-fits-all solution because different families of algorithms have different trade-offs in terms of memory efficiency, compute efficiency, latency, accuracy, vector dimensionality, dataset size, etc. For instance, clustering methods enable speed by narrowing the vector space for semantic search, while tree-like or graph-like methods offer improved accuracy for low-dimensional vector data.

Self-Organizing Maps

A Self-Organizing Map (SOM) is a neural network-based dimensionality reduction algorithm developed by Teuvo Kohonen in the 1980s. It is typically used to reduce high-dimensional feature vectors to low-dimensional (typically two-dimensional) feature vectors. The core idea behind SOM is to represent high-dimensional data vectors as specific nodes in a low-dimensional space while retaining the vectors' topology in the original space. The number of nodes in the low-dimensional space (SOM nodes) is fixed (a hyper-parameter). The exact locations of the SOM nodes are evaluated through multiple training epochs. The goal of the iterative training is to adjust the locations of the SOM nodes in the low-dimensional space so that they get mapped to the nearest neighboring vectors in the high-dimensional feature space. In other words, the goal is to map nearest-neighbor vectors in the high-dimensional space to SOM nodes that are also nearest neighbors in the low-dimensional space.
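
The update rule at the heart of that iterative training can be sketched in a few lines (a simplified, single-vector illustration of my own, not the full implementation shown later): find the best matching unit (BMU) for a data vector, then pull the BMU and its lattice neighbors toward that vector, with the pull decaying with lattice distance and shrinking over epochs via the learning rate and neighborhood radius.

import torch

def som_update_step( data_vector : torch.Tensor, node_weights : torch.Tensor,
                     node_coordinates : torch.Tensor, learning_rate : float, radius : float ) -> torch.Tensor :
    # 1. Best matching unit: the SOM node whose weight vector is closest to the data vector
    bmu = torch.norm( node_weights - data_vector, dim = 1 ).argmin()

    # 2. Neighborhood influence decays with lattice distance from the BMU
    lattice_distances = torch.norm( node_coordinates - node_coordinates[ bmu ], dim = 1 )
    influence = torch.exp( -( lattice_distances ** 2 ) / ( 2 * radius ** 2 ) ).unsqueeze( 1 )

    # 3. Pull the BMU and its neighbors toward the data vector
    return node_weights + learning_rate * influence * ( data_vector - node_weights )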

SOM for RAG

In this write-up, I wanted to share notes and findings from my experiments with SOM as a potential algorithm to propel RAG's semantic search. There are three crucial reasons SOM could be ideal compared to other algorithms:

  1. Vectors' high dimensionality can become a bottleneck for most other algorithms, such as trees and graphs (the so-called curse of dimensionality). On the contrary, SOM is built for dimensionality reduction, and therefore, it can be effectively applied in both high-dimensional and low-dimensional scenarios.
  2. SOM is less sensitive to random variations that may trickle into the original high-dimensional vector space, resulting in noise. Other algorithms can be sensitive to such noise, impacting the way they cluster or group high-dimensional vectors as nearest neighbors. Since SOM employs intermediate SOM nodes in a lower-dimensional vector space, which get evaluated as local averages of the mapped vectors from the higher-dimensional space, it effectively reduces noise.
  3. The large size of the external dataset may constrain other algorithms when creating semantic vector spaces, which can impact the latency and accuracy of semantic matching. On the other hand, SOM can tackle massive datasets because the number of SOM nodes in the low-dimensional space can be fine-tuned through a hyper-parameter proportional to the underlying dataset size. While training a SOM using a large dataset may take longer, query-time mapping remains quick once training is completed.

I demonstrate a simple example of using SOM to conduct RAG's semantic search to augment the context for question answering using OpenAI's GPT-3.5-turbo-instruct LLM. The primary reason for using OpenAI's GPT-3.5-turbo-instruct LLM is that its training cut-off date is September 2021 (Ref: https://platform.openai.com/docs/models/gpt-3-5-turbo), and as such, it may not answer questions about 2022, 2023, or 2024 events accurately. Therefore, information about 2022, 2023, or 2024 events can become 'external data' for OpenAI's GPT-3.5-turbo-instruct LLM. I used the Wikipedia API as the source for such 'external data' to fetch the events' information. The following are the steps I used to develop and test the example, along with the sample code.

Step 1: PyTorch-Based Kohonen's SOM Implementation

I utilized PyTorch Tensors to represent vectors and implemented Kohonen's SOM using PyTorch. This algorithm uses a two-dimensional lattice whose size is a hyper-parameter. The algorithm's mathematical details were derived from the well-crafted, lucid explanations in the following article, which is also referenced in the code comments: http://www.ai-junkie.com/ann/som/som1.html

The following code snippet shows the Python class for Kohonen's SOM. The complete code is available at this GitHub location. It's worth noting that this implementation is standalone, so it can be used outside of the RAG example.

class KohonenSOM():
    """
    The code is developed based on the following article:
    http://www.ai-junkie.com/ann/som/som1.html

    The vector and matrix operations are developed using PyTorch Tensors.
    """
    def __init__( ... )
    ...
    def find_topk_best_matching_units( self, data_points : torch.Tensor, topk : int = 1 ) -> List[ List[ int ] ] :
        if len( data_points.size() ) == 1:
            # A single data point is reshaped into a batch of size 1
            data_points = data_points.view( 1, data_points.shape[0] )

        topk = int( topk )

        # Distances between every data point and every SOM lattice node
        distances = self.dist_evaluator( data_points, self.lattice_node_weights )

        topk_best_matching_unit_indexes = torch.topk( distances, topk, dim=1, largest=False ).indices
        topk_best_matching_units = []

        for i in range( data_points.shape[0] ):
            best_matching_unit_indexes = topk_best_matching_unit_indexes[i]
            best_matching_units = [ self.lattice_coordinates[ bmu_index.item() ].tolist() for bmu_index in best_matching_unit_indexes ]
            topk_best_matching_units.append( best_matching_units )

        return topk_best_matching_units
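
A hypothetical usage sketch is shown below; since the constructor arguments are elided in the snippet above, the keyword names used here are assumptions on my part.

import torch

# Assumed constructor arguments; the real __init__ signature is elided above
som = KohonenSOM( lattice_height = 20, lattice_width = 30, learning_rate = 0.3 )

data_vectors = torch.randn( 1000, 1536 )    # e.g., 1,000 embeddings of dimension 1536
som.train( data_vectors, 100 )              # train for 100 epochs

# Map a query vector to the lattice coordinates of its 3 best matching units
query_vector = torch.randn( 1536 )
print( som.find_topk_best_matching_units( query_vector, topk = 3 ) )
# e.g., [[[4, 7], [4, 8], [5, 7]]]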

Step 2: SOM-Based Vector Indexer Implementation

The vector indexer is a utility that uses Kohonen's SOM to train SOM nodes with data vectors from an external dataset. Its primary purpose is to map each data vector to the nearest top-k SOM nodes, enabling efficient indexing of the data vectors. The following code snippet shows the train-and-index function of the vector indexer Python class. Its complete code is available at this GitHub location. Although its implementation is currently limited to the example's needs, it can be extended to meet other requirements.

class SOMBasedVectorIndexer():
    ...

    def train_n_gen_indexes(
            self, input_vectors : torch.Tensor,
            train_epochs : int = 100
    ):
        if self.generated_indexes:
            print( "WARNING: Indexes were already generated. Ignoring the request..." )
            return

        self.som.train( input_vectors, train_epochs )

        # Map every data vector to its top-k best matching SOM nodes
        topk_bmu_indexes = self.som.find_topk_best_matching_units( input_vectors, topk = self.topk_bmu_for_indexing )

        for idx in tqdm( range( len( topk_bmu_indexes ) ), desc="SOM-Based Indexed Vectors" ):
            bmu_indexes = topk_bmu_indexes[ idx ]

            # Index the data vector under each of its best matching SOM nodes
            for bmu_index in bmu_indexes:
                bmu_index_key = tuple( bmu_index )
                idx_set = self.som_node_idx_map.get( bmu_index_key, set() )
                idx_set.add( idx )
                self.som_node_idx_map[ bmu_index_key ] = idx_set

        self.generated_indexes = True
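
The lookup counterpart of this indexing, the `find_nearest_indexes` method used by the RAG utility in Step 5, is not shown above. A plausible sketch, under the assumption that it simply unions the data-vector ids indexed under the query's top-k best matching units, could look like this (the actual implementation is in the linked GitHub repository):

    def find_nearest_indexes( self, query_vector : torch.Tensor ) -> List[ List[ int ] ] :
        # Sketch only: union of the data-vector ids indexed under the query's top-k BMUs
        topk_bmu_indexes = self.som.find_topk_best_matching_units( query_vector, topk = self.topk_bmu_for_indexing )

        nearest_indexes = []
        for bmu_indexes in topk_bmu_indexes:
            idx_set = set()
            for bmu_index in bmu_indexes:
                idx_set.update( self.som_node_idx_map.get( tuple( bmu_index ), set() ) )
            nearest_indexes.append( list( idx_set ) )

        return nearest_indexes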

Step 3: OpenAI Embeddings-Based Text-To-Vector Encoder

The encoder's primary function is to convert text into vector representations using OpenAI's text embedding API. It's worth noting that an OpenAI account and API key are required to use the embedding API. Upon opening an account for the first time, OpenAI provides complimentary credit grants, which are sufficient to access the API for testing purposes. Below is a code snippet showcasing the batch encode function of the OpenAI encoder Python class. The complete code is available at this GitHub location.

import openai
from openai.embeddings_utils import get_embedding
...
from vector_encoder_parent import VectorEncoder
...

class OpenAIEmbeddingsVectorEncoder( VectorEncoder ):
    def __init__( ... )
    ...
    def encode_batch( self, list_of_text : List[ str ] ) -> torch.Tensor :
        if list_of_text == None or len( list_of_text ) == 0:
            raise ValueError( "ERROR: Required list_of_text is None or empty" )

        list_of_text = [ str( text ) for text in list_of_text ]

        openai.api_key = self.openai_key
        response = openai.Embedding.create(
            input = list_of_text,
            engine = self.vector_encoder_id
        )

        embeddings = [ data["embedding"] for data in response["data"] ]
        vectors = torch.tensor( embeddings, dtype=torch.float )
        return vectors
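
Note that the snippet above uses the pre-1.0 `openai` Python SDK (`openai.Embedding.create`). If you run it against the newer `openai>=1.0` client, the equivalent batch call looks roughly like this (a sketch of mine, not the code in the repository):

from typing import List

import torch
from openai import OpenAI

def encode_batch_v1( list_of_text : List[ str ], api_key : str,
                     model : str = "text-embedding-ada-002" ) -> torch.Tensor :
    # Equivalent batch-embedding call with the openai>=1.0 client interface
    client = OpenAI( api_key = api_key )
    response = client.embeddings.create( input = list_of_text, model = model )

    embeddings = [ item.embedding for item in response.data ]
    return torch.tensor( embeddings, dtype = torch.float )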

Note that the OpenAI vector encoder class extends a generic parent class, 'VectorEncoder,' which defines abstract encoding functions to be implemented through inheritance. It is possible to implement other types of vector encoders by inheriting from this parent class, enabling the pluggability of other encoding schemes. The complete code for the parent vector encoder class can be found at this GitHub location.

Step 4: Wikipedia API-Driven DataSource Implementation

This utility class is designed to encapsulate the data retrieval logic that integrates with the Wikipedia API. Its main function is to fetch events for a specified array of calendar years, format the retrieved events, and load them into a Pandas dataframe. The code snippet below captures the primary function of the utility class, while the complete code is available at this GitHub location.

import requests
import pandas as pd
from dateutil.parser import parse
...
class WikiEventsDataSource():
    ...
    def fetch_n_prepare_data( self ):
        if self.fetched:
            print( "WARNING: Wiki events for the specified years already fetched. Ignoring the request..." )
            return

        main_df = pd.DataFrame()

        for year in self.event_years_to_fetch:
            wiki_api_params = {
                "action": "query",
                "prop": "extracts",
                "exlimit": 1,
                "titles": year,
                "explaintext": 1,
                "formatversion": 2,
                "format": "json"
            }

            response = requests.get( "https://en.wikipedia.org/w/api.php", params=wiki_api_params )
            response_dict = response.json()

            # One row per line of the page's plain-text extract
            df = pd.DataFrame()
            df[ "text" ] = response_dict["query"]["pages"][0]["extract"].split("\n")
            df = self.__clean_df__( df, year )

            main_df = pd.concat( [ main_df, df ] )

        self.df = main_df.reset_index(drop=True)
        self.fetched = True
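
A brief usage sketch is shown below; `get_data` is the accessor the RAG utility calls in Step 5.

# Fetch and prepare the Wikipedia event pages for the years of interest
data_source = WikiEventsDataSource( [ 2022, 2023, 2024 ] )
data_source.fetch_n_prepare_data()

df = data_source.get_data()    # a Pandas dataframe with one text snippet per row
print( len( df ) )
print( df[ "text" ].head() )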

Step 5: SOM-Based RAG Utility Implementation

The SOM-based RAG utility is the central element of the example implementation. It utilizes the vector encoder, indexer, and data source to implement the core logic for the underlying semantic search. The complete code for the SOM-based RAG utility is available at this GitHub location.

The utility implements three primary functions. The first function is to load data from the external data source and encode it into vectors, as shown in the following code snippet.

...
from vector_encoder_parent import VectorEncoder
from vector_indexer import SOMBasedVectorIndexer

class SOM_Based_RAG_Util():
    ...
    def load_n_vectorize_data( self, data_source ):
        if self.data_loaded_n_vectorized:
            print( "WARNING: Data already loaded and vectorized. Ignoring the request..." )
            return

        data_source.fetch_n_prepare_data()
        self.df = data_source.get_data()

        vectors = None

        # Encode the text in batches and concatenate the resulting vectors
        for i in tqdm( range( 0, len( self.df ), self.vectorize_batch_size ), desc="Vectorized Data Batch" ):
            list_of_text = self.df.iloc[ i:i+self.vectorize_batch_size ]["text"].tolist()
            batch_encoded_vectors = self.vector_encoder.encode_batch( list_of_text )

            if vectors is None:
                vectors = batch_encoded_vectors
            else:
                vectors = torch.cat( [ vectors, batch_encoded_vectors ], dim=0 )

        self.vectors = vectors.to( self.device )
        self.data_loaded_n_vectorized = True

The second function is to train the SOM-based indexer to construct Kohonen's SOM nodes and then index the data vectors, as shown in the following code snippet.

    def train_n_index_data_vectors( self, train_epochs : int = 100 ):
        if not self.data_loaded_n_vectorized:
            raise ValueError( "ERROR: Data not loaded and vectorized." )

        if self.data_vectors_indexed:
            print( "WARNING: Data vectors already indexed. Ignoring the request..." )
            return

        self.vector_indexer.train_n_gen_indexes( self.vectors, train_epochs )
        self.data_vectors_indexed = True

The third function is to find similar information from the previously stored external dataset based on a query text. This function uses the encoder to convert the query text into a vector and then searches through the SOM-based indexer for the most likely matches. It then calculates the similarity between the query vector and the discovered data vectors using cosine similarity or another specified similarity evaluator. Finally, it retains the data vectors whose similarities are greater than or equal to the specified similarity threshold. The following code snippet captures the function implementation.

    def find_semantically_similar_data( self, query: str, sim_evaluator = None, sim_threshold : float = 0.8 ):
        if not self.data_vectors_indexed:
            raise ValueError( "ERROR: Data vectors not indexed." )

        if query is None or len( query.strip() ) == 0:
            raise ValueError( "ERROR: Required query text is not specified." )

        sim_threshold = float( sim_threshold )

        # Default to cosine similarity if no similarity evaluator is specified
        if sim_evaluator is None:
            sim_evaluator = nn.CosineSimilarity(dim=0, eps=1e-6)

        query_vector = self.vector_encoder.encode( query )
        query_vector = query_vector.view( self.vector_encoder.get_encoded_vector_dimensions() )
        query_vector = query_vector.to( self.device )

        # Narrow the search space to the data vectors indexed under the nearest SOM nodes
        nearest_indexes = self.vector_indexer.find_nearest_indexes( query_vector )
        nearest_indexes = nearest_indexes[0]

        sim_scores = []

        for idx in nearest_indexes:
            data_vector = self.vectors[ idx ]
            data_vector = data_vector.view( self.vector_encoder.get_encoded_vector_dimensions() )

            sim_score = sim_evaluator( query_vector, data_vector )

            if sim_score >= sim_threshold:
                sim_score_tuple = ( idx, sim_score.item() )
                sim_scores.append( sim_score_tuple )

        # Sort the retained matches by similarity score, most similar first
        sim_scores.sort( key = lambda x: x[1], reverse=True )

        semantically_similar_data = [
            {
                'text': self.df[ 'text' ][ idx ],
                'sim_score' : sim_score
            } for idx, sim_score in sim_scores
        ]

        return semantically_similar_data
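
For instance, assuming a constructed SOM_Based_RAG_Util instance named `som_driven_rag_util` (as set up in Step 8), a call like the following returns the matched snippets and their similarity scores:

results = som_driven_rag_util.find_semantically_similar_data(
    "Who won the 2022 soccer world cup?",
    sim_threshold = 0.68
)

for match in results:
    print( round( match[ "sim_score" ], 3 ), match[ "text" ][ :80 ] )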

An example output from the SOM-based RAG utility's semantic search function is shown below:

An Example Semantic Search Output — Image by Author

Step 6: Abstract Question/Answer ChatBot And Its OpenAI-Based Implementation

An abstract 'QuestionAnswerChatBot' Python class is developed to facilitate chatbot-like implementations. It augments the prompted question by using a standard instruction template and populating it with contextually similar information retrieved from the RAG utility.

The specified maximum number of new tokens limits the text size for context augmentation, while token counting is deferred to the underlying implementations. In LLM economics, tokens are like currency. Each token the model processes requires computational resources: memory, processing power, and time. Thus, the more tokens an LLM has to process, the greater the computational cost.
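
For instance, counting the tokens in a candidate context snippet with the `cl100k_base` encoding used later in Step 8 takes only a couple of lines:

import tiktoken

tokenizer = tiktoken.get_encoding( "cl100k_base" )

context_text = "Argentina won the 2022 FIFA World Cup, defeating France on penalties."
print( len( tokenizer.encode( context_text ) ) )    # number of tokens this context would cost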

Finally, this class delegates prompting of the LLM to the underlying implementation once the QA instruction has been populated. The following code snippet captures the primary function; the complete code is available at this GitHub location.

from abc import ABC, abstractmethod
import torch
import math

class QuestionAnswerChatBot( ABC ):
    ...
    def find_answer_to_question( self, query : str, sim_threshold = 0.68, max_new_tokens : int = 5 ):
        if query is None or len( query.strip() ) == 0:
            raise ValueError( "ERROR: Required query is not specified" )

        sim_threshold = float( sim_threshold )
        max_new_tokens = int( max_new_tokens )

        # Populate the QA instruction template with semantically similar context
        qa_instruction = self.get_qa_instruction( query, sim_threshold = sim_threshold )

        answer_text = self.__get_answer_text__( qa_instruction, max_new_tokens = max_new_tokens )
        answer_text = self.__clean_answer_text__( qa_instruction, answer_text )

        return answer_text
    ...
    def __qa_template__( self ):
        qa_template = """Context:

{}

---

Query: {}
Answer:"""
        return qa_template

The Python class 'OpenAIQuestionAnswerChatBot' extends the abstract 'QuestionAnswerChatBot' and implements the chatbot functionality using the OpenAI LLM API. The following code snippet shows the class's primary function. The complete code is available at this GitHub location.

import openai
import tiktoken
from qa_chatbot import QuestionAnswerChatBot

class OpenAIQuestionAnswerChatBot( QuestionAnswerChatBot ):
    ...
    def __get_answer_text__( self, qa_instruction : str, max_new_tokens : int = 5 ) -> str :
        openai.api_key = self.openai_key

        basic_answer = openai.Completion.create(
            model = self.openai_model_name,
            prompt = qa_instruction,
            # limit the number of new tokens generated for the answer
            max_tokens = max_new_tokens
        )

        answer_text = basic_answer[ "choices" ][0][ "text" ]
        return answer_text

    def __token_count__( self, text : str ):
        return len( self.tokenizer.encode( text ) )

The following is an example of how a prompted question gets augmented with context using similar information retrieved through semantic search:

An Example Context Augmented Query Prompt — Image by Author

Step 7: Sample Questions for Testing

The following are sample questions for testing the RAG setup with OpenAI's GPT-3.5-turbo-instruct LLM. They were developed to ensure that their answers pertain to events that occurred in 2022, 2023, and 2024.

sample_questions = [
    "Who won the 2022 soccer world cup?",
    "When did Sweden join NATO?",
    "Who joined NATO in 2023?",
    "Who joined NATO in 2024?",
    "Which is the 31st member of NATO?",
    "Which is the 32nd member of NATO?",
    "Who won the Cricket World Cup in 2023?",
    "Who defeated India in Cricket World Cup final in 2023?",
    "Name the former prime minister of Japan that was assassinated in 2022?",
    "When did Chandrayaan-3 land near the south pole of the Moon?",
    "Where did Chandrayaan-3 land on the Moon?",
    "Who acquired Twitter in 2022?",
    "Who owns Twitter?",
    "Who acquired Activision Blizzard in 2023?"
]

Step 8: Putting Everything Together

The complete Jupyter notebook that brings all the components together can be found at this GitHub location. The following code snippet shows the initialization of the main OpenAI-based QA chatbot. Note that OpenAI's text embedding model, "text-embedding-ada-002," is used for vector encoding. Likewise, the chatbot uses OpenAI's "cl100k_base" tokenizer, leveraging the built-in functions of the tiktoken Python library, to count tokens and limit the contextual text used to augment the question prompt.

openai_vector_encoder_id = "text-embedding-ada-002"
openai_encoded_vector_dimensions = 1536
openai_tokenizer_name = "cl100k_base"
openai_model_name = "gpt-3.5-turbo-instruct"

vector_encoder = OpenAIEmbeddingsVectorEncoder( openai_encoded_vector_dimensions, openai_vector_encoder_id, openai_key )

event_years_to_fetch = [ 2022, 2023, 2024 ]
data_source = WikiEventsDataSource( event_years_to_fetch )
...
som_driven_rag_util = SOM_Based_RAG_Util(
    vector_encoder = vector_encoder,
    som_lattice_height = 20,
    som_lattice_width = 30,
    learning_rate = 0.3,
    topk_bmu_for_indexing = 10,
    device = device
)
...
openai_chatbot = OpenAIQuestionAnswerChatBot(
    vector_db_util = som_driven_rag_util,
    openai_tokenizer_name = openai_tokenizer_name,
    openai_model_name = openai_model_name,
    openai_key = openai_key,
    question_input_max_token_count = 100,
    context_trim_percent = 0.1,
    device = device
)
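
With everything wired together, running the sample questions from Step 7 through the chatbot is a simple loop; the sketch below uses an illustrative similarity threshold and token limit.

# Load, vectorize, and index the external data before asking questions
som_driven_rag_util.load_n_vectorize_data( data_source )
som_driven_rag_util.train_n_index_data_vectors( train_epochs = 100 )

for question in sample_questions:
    answer = openai_chatbot.find_answer_to_question( question, sim_threshold = 0.68, max_new_tokens = 50 )
    print( "Q:", question )
    print( "A:", answer )
    print()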

The following sequence diagrams help visualize all the component interactions during the initialization and the actual question/answering phases.

Interactions of Various Components During Initialization — Image by Author
Interactions of Various Components During Question/Answering — Image by Author

Findings

The following image captures the questions and answers from OpenAI's GPT-3.5-turbo-instruct LLM with and without context augmentation.

OpenAI's GPT-3.5-turbo-instruct LLM's Answers With and Without Context Augmentation — Image by Author

Understandably, the LLM finds it difficult to answer questions about events that occurred after its September 2021 cut-off date. In most cases, it clearly responds that the questions are from a future time relative to its training cut-off date. On the contrary, the same LLM answers all the questions accurately when the context of the prompted questions is augmented with relevant information from the years 2022, 2023, and 2024 retrieved from Wikipedia. The real credit here goes to the SOM, which formed the basis for RAG's semantic search to retrieve and augment the prompted question's context with relevant information.

Suggested Next Steps

While the above example served as a proof of concept to assess the suitability of a Self-Organizing Map for enabling Retrieval-Augmented Generation of text by an LLM, a more comprehensive benchmarking is recommended to evaluate its performance compared to other algorithms using a much larger external dataset, where performance is measured in terms of the quality of LLM outputs (something like perplexity plus accuracy). In addition, since the current example enables a pluggable framework, it is recommended that other open-source and free QA LLMs be used to conduct such benchmarking to minimize the LLM usage expenses.

To help run the example in local environments, I included a 'requirements.txt' file, which contains the versions of the various Python libraries I used in my environment to run and test the above example. This file is available at this GitHub location.

I conclude by promising to share my findings in a separate write-up if I conduct any such benchmarks. Please stay tuned!!
