Text search vs. Vector search: Better together?
Photo by Aarón Blanco Tejedor on Unsplash

Text databases play a critical role in many business workloads, especially in e-commerce, where customers depend on product descriptions and reviews to make informed purchasing decisions. Vector search, a method that uses embeddings of text to find semantically similar documents, is another powerful tool out there. However, due to concerns about the complexity of integrating it into their current workflow, some businesses may be hesitant to try out vector search. But what if I told you that it could be done easily and with significant benefits?

In this blog post, I’ll show you how to easily create a hybrid setup that combines the power of text and vector search. This setup will give you the most comprehensive and accurate search results. I’ll be using OpenSearch as the search engine and Hugging Face’s Sentence Transformers for generating embeddings. The dataset I chose for this task is the “XMarket” dataset (which is described in greater depth here), where we will embed the title field into a vector representation during the indexing process.

First, let’s start by indexing our documents using Sentence Transformers. This library has pre-trained models that can generate embeddings for sentences or paragraphs. These embeddings act as a unique fingerprint for a piece of text. During the indexing process, I converted the title field to a vector representation and indexed it in OpenSearch. You can do this by simply importing the model and encoding any textual field.

The model can be imported and applied with the following lines:

from sentence_transformers import SentenceTransformer 

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

embedding = model.encode(text_field)

It’s that straightforward!

We will create an index named “products” by passing the following mapping:

{
    "products": {
        "mappings": {
            "properties": {
                "asin": {
                    "type": "keyword"
                },
                "description_vector": {
                    "type": "knn_vector",
                    "dimension": 384
                },
                "item_image": {
                    "type": "keyword"
                },
                "text_field": {
                    "type": "text",
                    "fields": {
                        "keyword_field": {
                            "type": "keyword"
                        }
                    },
                    "analyzer": "standard"
                }
            }
        }
    }
}

asin — the unique document ID, taken from the product metadata.

description_vector — this is where we will store our encoded product title field.

item_image — the image URL of the product.

text_field — the title of the product.

Note that we’re using the standard OpenSearch analyzer, which tokenizes each word in a field into single keywords. OpenSearch takes these keywords and uses them for the Okapi BM25 algorithm. I also saved the title field twice in the document: once in its raw format and once as a vector representation.
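For reference, here is a minimal sketch of creating this index with the opensearch-py client. The mapping above is shown in GET-mapping format, so only the inner "mappings" object is passed; the connection details are assumptions to adapt to your own cluster:

from opensearchpy import OpenSearch

# hypothetical connection details; adjust host, port, and auth to your cluster
os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_body = {
    "mappings": {
        "properties": {
            "asin": {"type": "keyword"},
            "description_vector": {"type": "knn_vector", "dimension": 384},
            "item_image": {"type": "keyword"},
            "text_field": {
                "type": "text",
                "fields": {"keyword_field": {"type": "keyword"}},
                "analyzer": "standard"
            }
        }
    }
}

os_client.indices.create(index="products", body=index_body)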

I’ll then use the model to encode the title field and create documents that will be bulk-indexed into OpenSearch:

import numpy as np
from opensearchpy import OpenSearch, helpers

def store_index(index_name: str, data: np.array, metadata: list, os_client: OpenSearch):
    documents = []
    for index_num, vector in enumerate(data):
        metadata_line = metadata[index_num]
        text_field = metadata_line["title"]
        embedding = model.encode(text_field)
        # normalize_data is a helper (e.g., L2 normalization of the embedding)
        norm_text_vector_np = normalize_data(embedding)
        document = {
            "_index": index_name,
            "_id": index_num,
            "asin": metadata_line["asin"],
            "description_vector": norm_text_vector_np.tolist(),
            "item_image": metadata_line["imgUrl"],
            "text_field": text_field
        }
        documents.append(document)
        if index_num % 1000 == 0:
            helpers.bulk(os_client, documents, request_timeout=1800)
            documents = []
            print(f"bulk {index_num} indexed successfully")
    # flush the final partial batch
    if documents:
        helpers.bulk(os_client, documents, request_timeout=1800)
    os_client.indices.refresh(INDEX_NAME)

The plan is to create a client that takes input from the user, generates an embedding using the Sentence Transformers model, and performs our hybrid search. The user will also be asked to provide a boost level, which is the amount of importance they want to give to either text or vector search. This way, the user can choose to prioritize one type of search over the other. So if, for example, the user wants the semantic meaning of the query to be taken into account more than its literal textual appearance in the description, they would give vector search a higher boost than text search.
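As a hypothetical example of collecting these inputs (the prompts and variable names here are illustrative, not part of the original client):

# hypothetical client-side prompts; names and defaults are illustrative
query = input("Enter your search query: ")
vector_boost_level = float(input("Vector search boost (e.g. 0.5): "))
bm25_boost_level = float(input("Text search boost (e.g. 0.5): "))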

We’ll first run a text search on the index using OpenSearch’s search method. This method takes in a query string and returns a list of documents that match it. OpenSearch obtains the results for text search by using Okapi BM25 as the ranking algorithm. Text search with OpenSearch is performed by sending the following request body:

bm25_query = {
    "size": 20,
    "query": {
        "match": {
            "text_field": query
        }
    },
    "_source": ["asin", "text_field", "item_image"],
}

Where query is the text written by the user. For my results to come back in a clean manner, I added “_source” so that OpenSearch will only return the specific fields I’m interested in seeing.

Since text and vector search use different scoring algorithms, we will need to bring the scores to the same scale in order to combine the results. To do this, we’ll normalize the scores for each document from the text search. The maximum BM25 score is the highest score that can be assigned to a document in a collection for a given query. It represents the maximum relevance of a document for that query, and its value depends on the parameters of the BM25 formula, such as the average document length, the term frequency, and the inverse document frequency. For this reason, I took the max score received from OpenSearch for each query and divided each of the result scores by it, giving us scores on a scale between 0 and 1. The following function demonstrates our normalization algorithm:

def normalize_bm25_formula(score, max_score):
    return score / max_score
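A brief sketch of how this is applied, assuming the os_client and bm25_query defined above (the response structure is standard OpenSearch, and the sketch assumes the query returns at least one hit):

# run the BM25 query and normalize every hit's score by the maximum score
bm25_response = os_client.search(index="products", body=bm25_query)
bm25_hits = bm25_response["hits"]["hits"]
max_score = bm25_response["hits"]["max_score"]
for hit in bm25_hits:
    hit["_score"] = normalize_bm25_formula(hit["_score"], max_score)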

Next, we’ll conduct a vector search using the vector search method. This method takes a list of embeddings and returns a list of documents that are semantically similar to them.

The search query for OpenSearch looks like the following:

cpu_request_body = {
    "size": 20,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "description_vector",
                    "query_value": get_vector_sentence_transformers(query).tolist(),
                    "space_type": "cosinesimil"
                }
            }
        }
    },
    "_source": ["asin", "text_field", "item_image"],
}

Where get_vector_sentence_transformers sends the text to model.encode(text_input), which returns a vector representation of the text. Also note that the higher your topK results, the more accurate your results will be, but this will increase latency as well.
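Based on that description, a minimal sketch of the helper might look like this (the function body is an assumption inferred from the prose above):

def get_vector_sentence_transformers(text_input):
    # encode the query text into a dense vector (a NumPy array)
    return model.encode(text_input)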

Photo by Amol Tyagi on Unsplash

Now we’ll need to combine the two search results. To do this, we’ll interpolate the results so that every document that occurred in both searches appears higher in the hybrid results list. This way, we can take advantage of the strengths of both text and vector search to get the most comprehensive results.

The following function is used to interpolate the results of keyword search and vector search. It returns a dictionary containing the elements common to the two sets of hits, along with the scores for each document. If a document appears in only one of the search results, we will assign it the lowest score that was retrieved.

def interpolate_results(vector_hits, bm25_hits):
    # gather all product ids
    bm25_ids_list = []
    vector_ids_list = []
    for hit in bm25_hits:
        bm25_ids_list.append(hit["_source"]["asin"])
    for hit in vector_hits:
        vector_ids_list.append(hit["_source"]["asin"])
    # find common product ids
    common_results = set(bm25_ids_list) & set(vector_ids_list)
    results_dictionary = dict((key, []) for key in common_results)
    # collect the vector score first, then the BM25 score, for every common id
    for common_result in common_results:
        for vector_hit in vector_hits:
            if vector_hit["_source"]["asin"] == common_result:
                results_dictionary[common_result].append(vector_hit["_score"])
        for BM_hit in bm25_hits:
            if BM_hit["_source"]["asin"] == common_result:
                results_dictionary[common_result].append(BM_hit["_score"])
    min_value = get_min_score(common_results, results_dictionary)
    # assign the minimum score to every result that appears in only one list
    for vector_hit in vector_hits:
        if vector_hit["_source"]["asin"] not in common_results:
            results_dictionary[vector_hit["_source"]["asin"]] = [min_value]
    for BM_hit in bm25_hits:
        if BM_hit["_source"]["asin"] not in common_results:
            results_dictionary[BM_hit["_source"]["asin"]] = [min_value]

    return results_dictionary

Ultimately, we will have a dictionary with the document ID as the key and an array of score values as the value. The first element in the array is the vector search score, and the second element is the normalized text search score.
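get_min_score is a small helper that isn’t shown in the snippet above; a plausible sketch, assuming it simply returns the lowest score collected among the common results (the fallback value for the no-overlap case is an arbitrary assumption):

def get_min_score(common_results, results_dictionary):
    # lowest score observed among the documents found by both searches
    if common_results:
        return min(min(scores) for scores in results_dictionary.values())
    # hypothetical floor value when the two result sets do not overlap
    return 0.01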

Finally, we apply a boost to our search results. We will iterate over the scores of the results and multiply the first element by the vector boost level and the second element by the text boost level.

def apply_boost(combined_results, vector_boost_level, bm25_boost_level):
    for element in combined_results:
        if len(combined_results[element]) == 1:
            combined_results[element] = (combined_results[element][0] * vector_boost_level
                                         + combined_results[element][0] * bm25_boost_level)
        else:
            combined_results[element] = (combined_results[element][0] * vector_boost_level
                                         + combined_results[element][1] * bm25_boost_level)
    # sort the results based on the new scores
    sorted_results = [k for k, v in sorted(combined_results.items(),
                                           key=lambda item: item[1], reverse=True)]
    return sorted_results

It’s time to see what we’ve got! This is what the whole workflow looks like:

GIF by the Author
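In code, the full flow can be stitched together roughly as follows. This is a sketch assuming the client, request bodies, and helper functions defined above, with the index name "products" as an assumption:

def hybrid_search(query: str, vector_boost_level: float, bm25_boost_level: float):
    # text search request, as shown earlier
    bm25_query = {
        "size": 20,
        "query": {"match": {"text_field": query}},
        "_source": ["asin", "text_field", "item_image"],
    }
    bm25_response = os_client.search(index="products", body=bm25_query)
    bm25_hits = bm25_response["hits"]["hits"]
    max_score = bm25_response["hits"]["max_score"]
    # normalize the BM25 scores to the 0-1 range
    for hit in bm25_hits:
        hit["_score"] = normalize_bm25_formula(hit["_score"], max_score)
    # vector search request, as shown earlier
    cpu_request_body = {
        "size": 20,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "description_vector",
                        "query_value": get_vector_sentence_transformers(query).tolist(),
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
        "_source": ["asin", "text_field", "item_image"],
    }
    vector_hits = os_client.search(index="products", body=cpu_request_body)["hits"]["hits"]
    # interpolate the two hit lists and apply the user's boosts
    combined_results = interpolate_results(vector_hits, bm25_hits)
    return apply_boost(combined_results, vector_boost_level, bm25_boost_level)

# e.g. equal weighting of text and vector search
top_asins = hybrid_search("an ice cream scoop", 0.5, 0.5)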

I searched for the sentence “an ice cream scoop” with a 0.5 boost for vector search and a 0.5 boost for text search, and this is what I got in the top few results:

Vector search returned —

Images from XMarket Dataset

Text search returned —

Images from XMarket Dataset

Hybrid search returned —

Images from XMarket Dataset

In this example, we searched for “an ice cream scoop” using both text and vector search. The text search returns documents containing the keywords “an”, “ice”, “cream”, and “scoop”. The result that came in fourth for text search is an ice cream machine, which is actually not a scoop. The reason it ranked so high is that its title, “Breville BCI600XL Smart Scoop Ice Cream Maker”, contained three of the keywords in the sentence: “Scoop”, “Ice”, “Cream”, and therefore scored highly on BM25 even though it didn’t match our search. Vector search, on the other hand, returns results that are semantically similar to the query, regardless of whether the keywords appear in the document. It knew that the fact that “scoop” appeared before “ice cream” meant that it could not match as well. Thus, we get a more comprehensive set of results that includes more than just documents mentioning “an ice cream scoop”.

Clearly, if you were to use only one type of search, you would miss out on valuable results or display inaccurate results and frustrate your customers. By using the advantages of both worlds, we receive more accurate results. So, I do believe that the answer to our question is that “better together” has proven itself to be true.

But wait, can better become even better? One way to improve the search experience is by using the power of the APU (Associative Processing Unit) in OpenSearch. By conducting the vector search on the APU using Searchium.ai’s plugin, we can take advantage of advanced algorithms and processing capabilities to further improve the latency and significantly cut the costs of our search (for example, $0.23 vs. $8.76) while still getting similar vector search results.

We can install the plugin, upload the index to the APU, and search by sending a slightly modified request body:

apu_request_body = {
    "size": 20,
    "query": {
        "gsi_knn": {
            "field": "description_vector",
            "vector": get_vector_sentence_transformers(query).tolist(),
        }
    },
    "_source": ["asin", "text_field", "item_image"],
}

All the other steps are identical!

In conclusion, by combining text and vector search using OpenSearch and Sentence Transformers, businesses can easily improve their search results. And by using the APU, businesses can take their search results to the next level while also cutting infrastructure costs. Don’t let concerns about complexity hold you back. Give it a try and see for yourself the benefits it can bring. Happy searching!

The full code can be found here.

A huge thanks to Yaniv Vaknin and Daphna Idelson for all of their help!
