
Similarity Search, Part 5: Locality Sensitive Hashing (LSH)


Explore how similarity information can be incorporated into hash functions

Similarity search is a problem where, given a query, the goal is to find the most similar documents to it among all the documents in a database.

In data science, similarity search often appears in the NLP domain, in search engines or recommender systems where the most relevant documents or items need to be retrieved for a query. There exists a large variety of different ways to improve search performance over massive volumes of data.

In previous parts of this article series, we discussed inverted file index, product quantization and HNSW and how they can be used together to improve search quality. In this chapter, we are going to look at a principally different approach that maintains both high search speed and high search quality.

Locality Sensitive Hashing (LSH) is a set of methods that is used to reduce the search scope by transforming data vectors into hash values while preserving information about their similarity.

We are going to discuss the traditional approach which consists of three steps:

  1. Shingling: encoding original texts into vectors.
  2. MinHashing: transforming vectors into a special representation called a signature which can be used to compare similarity between them.
  3. LSH function: hashing signature blocks into different buckets. If the signatures of a pair of vectors fall into the same bucket at least once, they are considered candidates.

We are going to gradually dive into the details of each of these steps throughout the article.

Shingling is the process of collecting k-grams from given texts. A k-gram is a group of k sequential tokens. Depending on the context, tokens can be words or symbols. The ultimate goal of shingling is to encode each document by using the collected k-grams. We will use one-hot encoding for this. Nevertheless, other encoding methods can also be applied.

Collecting unique shingles of length k = 3 for the sentence “learning data science is fascinating”

Firstly, the unique k-grams of each document are collected. Secondly, to encode each document, a vocabulary is needed which represents the set of unique k-grams across all documents. Then, for each document, a vector of zeros with length equal to the size of the vocabulary is created. For every k-gram appearing in the document, its position in the vocabulary is identified and a “1” is placed at the respective position of the document vector. Even if the same k-gram appears several times in a document, it does not matter: the value in the vector will always be 1.

One-hot encoding
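
To make this concrete, here is a minimal sketch in Python of how shingling and one-hot encoding could be implemented. The helper names (shingles, one_hot) and the word-level tokenisation are assumptions for illustration, not a reference implementation:

```python
def shingles(text: str, k: int = 3) -> set:
    """Collect the set of unique k-grams (here: k consecutive words) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

docs = [
    "learning data science is fascinating",
    "data science is a fascinating discipline",
]

# Vocabulary: union of all unique k-grams across the documents, in a fixed order
vocab = sorted(set().union(*(shingles(doc) for doc in docs)))

def one_hot(doc: str, vocab: list) -> list:
    """Encode a document as a binary vector over the shared shingle vocabulary."""
    doc_shingles = shingles(doc)
    return [1 if gram in doc_shingles else 0 for gram in vocab]

encoded = [one_hot(doc, vocab) for doc in docs]
print(encoded[0])
```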

At this stage, the initial texts have been vectorised. The similarity of vectors can be compared via the Jaccard index. Remember that the Jaccard index of two sets is defined as the number of common elements in both sets divided by the number of elements in their union.

Jaccard Index is defined as the intersection over the union of two sets

If a pair of encoded vectors is taken, the intersection in the Jaccard index formula is the number of rows where both vectors contain 1 (i.e. the k-gram appears in both vectors) and the union is the number of rows with at least one 1 (the k-gram is present in at least one of the vectors).

Formula for Jaccard Index of two vectors
Example of calculating Jaccard Index for 2 vectors using the formula above
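
Continuing the sketch above, the Jaccard index of two one-hot vectors could be computed as follows (assuming the `encoded` vectors from the previous snippet):

```python
def jaccard(u: list, v: list) -> float:
    """Jaccard index: rows where both vectors have a 1, divided by rows
    where at least one of them has a 1."""
    intersection = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return intersection / union if union > 0 else 0.0

print(jaccard(encoded[0], encoded[1]))
```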

The current problem is the sparsity of the encoded vectors. Computing a similarity score between two one-hot encoded vectors would take a lot of time. Transforming them into a dense format would make it more efficient to operate on them later. Ultimately, the goal is to design a function that transforms these vectors to a smaller dimension while preserving the information about their similarity. The method that constructs such a function is called MinHashing.

MinHashing is a hash function that permutes the components of an input vector and then returns the first index where the permuted vector component equals 1.

Example of calculating a minhash value for a given vector and permutation
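
A single minhash value could be computed roughly like this (a sketch reusing the vectors from the shingling example; the returned value is the 1-based position, in permuted order, of the first component equal to 1):

```python
import random

def minhash(vector: list, permutation: list) -> int:
    """Return the 1-based position (in permuted order) of the first component equal to 1."""
    for position, index in enumerate(permutation, start=1):
        if vector[index] == 1:
            return position
    return len(vector) + 1  # all-zero vector, should not occur for non-empty documents

permutation = list(range(len(vocab)))
random.shuffle(permutation)
print(minhash(encoded[0], permutation))
```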

To get a dense representation of a vector consisting of n numbers, n minhash functions can be used to obtain n minhash values which form a signature.

It may not sound obvious at first, but several minhash values can be used to approximate the Jaccard similarity between vectors. In fact, the more minhash values are used, the more accurate the approximation.

Calculation of the signature matrix and how it is used to compute similarities between vectors. Similarities computed using Jaccard similarity and signatures should normally be approximately equal.
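
Building on the minhash sketch above, a signature can be obtained by applying many random permutations, and the similarity of two signatures can then be estimated as the fraction of matching positions. This is illustrative code, not an optimized implementation:

```python
def signature(vector: list, permutations: list) -> list:
    """A signature: one minhash value per random permutation."""
    return [minhash(vector, p) for p in permutations]

# Generate 100 random permutations of the vocabulary indices
n_hashes = 100
permutations = []
for _ in range(n_hashes):
    p = list(range(len(vocab)))
    random.shuffle(p)
    permutations.append(p)

signatures = [signature(v, permutations) for v in encoded]

def signature_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions; approximates the Jaccard index."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Should be roughly equal to jaccard(encoded[0], encoded[1])
print(signature_similarity(signatures[0], signatures[1]))
```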

This is not just a useful observation. It turns out that there is a whole theorem behind the scenes. Let us understand why the Jaccard index can be estimated by using signatures.

Proof of the statement

Let us assume that a given pair of vectors contains only rows of type 01, 10 and 11. Then a random permutation of these vectors is performed. Since every row contains at least one 1, at least one of the two hash-value computations stops at the very first row of the permuted vectors, with the corresponding hash value equal to 1.

What is the probability that the second hash value will be equal to the first one? Obviously, this can only happen if the second hash value is also equal to 1. That means the first row must be of type 11. Since the permutation was taken randomly, the probability of such an event equals P = count(11) / (count(01) + count(10) + count(11)). This expression is exactly the Jaccard index formula. Therefore:

The probability of getting equal hash values for two binary vectors based on a random row permutation equals the Jaccard index.

However, while proving the statement above, we assumed that the initial vectors did not contain rows of type 00. It is clear that rows of type 00 do not change the value of the Jaccard index. Similarly, including rows of type 00 does not change the probability of getting the same hash values. For instance, if the first permuted row is 00, the minhash algorithm simply ignores it and switches to the next row until there is at least one 1 in the row. Of course, rows of type 00 can lead to different hash values than without them, but the probability of getting the same hash values stays the same.

We have proven an important statement. But how can the probability of getting the same minhash values be estimated? Certainly, it is possible to generate all possible permutations of the vectors and then calculate all minhash values to find the desired probability. For obvious reasons, this is not efficient since the number of possible permutations for a vector of size n equals n!. Nevertheless, the probability can be evaluated approximately: let us just use many hash functions to generate that many hash values.

The Jaccard index of two binary vectors approximately equals the proportion of matching values in their signatures.

Mathematical notation
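
In notation, the statement above can be written roughly as follows, where sig_A(i) denotes the i-th minhash value of document A and n is the signature length (the notation itself is an assumption for illustration):

```latex
J(A, B) \approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big[\operatorname{sig}_A(i) = \operatorname{sig}_B(i)\big]
```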

It is easy to notice that taking longer signatures results in more accurate estimates.

At this point, we can transform raw texts into dense signatures of equal length that preserve the information about similarity. Nevertheless, in practice, such dense signatures still usually have a high dimension and it would be inefficient to compare them directly.

Consider n = 10⁶ documents with signatures of length 100. Assuming that a single number of a signature requires 4 bytes to store, the whole signature requires 400 bytes. For storing n = 10⁶ documents, 400 MB of space is required, which is feasible in reality. But comparing each document with every other in a brute-force manner would require about n(n - 1) / 2 ≈ 5 * 10¹¹ comparisons, which is far too many, especially when n is even larger.

To avoid this problem, it is possible to build a hash table to accelerate search performance, but even if two signatures are very similar and differ in only one position, they are still likely to have a different hash (because vector remainders are likely to be different). However, we normally want them to fall into the same bucket. This is where LSH comes to the rescue.

The LSH mechanism builds a hash table consisting of several parts which puts a pair of signatures into the same bucket if they have at least one corresponding part.

LSH takes a signature matrix and horizontally divides it into b equal parts called bands, each containing r rows. Instead of plugging the whole signature into a single hash function, the signature is divided into b parts and each subsignature is processed independently by a hash function. As a consequence, each of the subsignatures falls into a separate bucket.

Example of using LSH. Two signatures of length 9 are divided into b = 3 bands, each containing r = 3 rows. Each subvector is hashed into one of k possible buckets. Since there is a match in the second band (both subvectors have the same hash value), we consider this pair of signatures as candidates to be nearest neighbours.
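
A minimal sketch of the banding mechanism in Python might look as follows (assuming the `signatures` list from the MinHashing section; Python's built-in `hash` stands in for an arbitrary bucket hash function, and the number of buckets k is left implicit):

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures: list, b: int, r: int) -> set:
    """Split each signature into b bands of r rows, hash every band into a bucket
    and report document pairs sharing at least one bucket within the same band."""
    assert len(signatures[0]) == b * r, "signature length must equal b * r"
    buckets = defaultdict(list)  # key: (band index, band hash) -> list of document ids
    for doc_id, sig in enumerate(signatures):
        for band in range(b):
            subvector = tuple(sig[band * r:(band + 1) * r])
            buckets[(band, hash(subvector))].append(doc_id)
    candidates = set()
    for doc_ids in buckets.values():
        for i in range(len(doc_ids)):
            for j in range(i + 1, len(doc_ids)):
                candidates.add((doc_ids[i], doc_ids[j]))
    return candidates

# With signatures of length 100: b = 20 bands of r = 5 rows each
print(lsh_candidate_pairs(signatures, b=20, r=5))
```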

If there is at least one collision between corresponding subvectors of two different signatures, the signatures are considered candidates. As we can see, this condition is more flexible since, to be considered candidates, the vectors do not need to be absolutely equal. Nevertheless, this increases the number of false positives: a pair of different signatures can have a single corresponding part while being completely different overall. Depending on the problem, it is always better to optimize the parameters b, r and k.

With LSH, it is possible to estimate the probability that two signatures with similarity s will be considered candidates, given the number of bands b and the number of rows r in each band. Let us derive the formula for it in several steps.

  1. The probability that one random row of both signatures is equal: s
  2. The probability that one random band with r rows is equal: s^r
  3. The probability that one random band with r rows is different: 1 - s^r
  4. The probability that all b bands in the table are different: (1 - s^r)^b
  5. The probability that at least one of the b bands is equal, so two signatures are candidates: P = 1 - (1 - s^r)^b

Note that the formula does not take into account collisions where different subvectors accidentally hash into the same bucket. Because of this, the real probability of signatures being candidates might differ insignificantly.

Example

To get a better sense of the formula we have just obtained, let us consider a simple example. Consider two signatures of length 35 which are equally split into 7 bands with 5 rows each. Here is the table which represents the probability of getting at least one equal band based on their Jaccard similarity:

Probability P of getting at least one corresponding band for two signatures based on their similarity s

We notice that if two similar signatures have a Jaccard similarity of 80%, then they have a corresponding band in 93.8% of cases (true positives). In the remaining 6.2% of scenarios such a pair of signatures is a false negative.

Now let us take two different signatures. Imagine that they are similar only by 20%. Therefore, they are false positive candidates in 0.224% of cases. In the other 99.776% of cases, they do not have a corresponding band, so they are true negatives.
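
These numbers can be reproduced directly from the probability formula above (a quick check, assuming b = 7 bands and r = 5 rows as in the example):

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """P(at least one of b bands is equal) = 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

print(round(candidate_probability(0.8, r=5, b=7), 3))   # 0.938 -> true positives for s = 0.8
print(round(candidate_probability(0.2, r=5, b=7), 5))   # 0.00224 -> false positives for s = 0.2
```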

Visualisation

Let us now visualise the relationship between similarity s and the probability P of two signatures becoming candidates. Normally, with higher signature similarity s, signatures should have a higher probability of being candidates. Ideally, it would look like below:

Ideal scenario. A pair of signatures is considered candidates only if their similarity is greater than a certain threshold t

Based on the probability formula obtained above, a typical curve looks like in the figure below:

A typical curve that increases slowly at the beginning and at the end and has a steep slope around a threshold t given by the approximate probability formula in the figure

It is possible to vary the number of bands b to shift the curve in the figure to the left or to the right. Increasing b moves the curve to the left and results in more FP; decreasing it shifts the curve to the right and leads to more FN. It is important to find a balance, depending on the problem.

With a higher number of bands the curve moves to the left, with a lower number it moves to the right
Moving the threshold to the left increases FP while shifting it to the right increases FN

Experiments with different numbers of bands and rows

Several line plots are shown below for different values of b and r. It is always better to adjust these parameters based on the specific task, so that all pairs of similar documents are successfully retrieved and those with different signatures are ignored.

Adjusting the number of bands
Adjusting the number of rows
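
As an illustration, curves of this kind could be reproduced with a short matplotlib sketch like the one below (the specific values of b and r are chosen arbitrarily for demonstration):

```python
import numpy as np
import matplotlib.pyplot as plt

s = np.linspace(0, 1, 200)

fig, (ax_bands, ax_rows) = plt.subplots(1, 2, figsize=(10, 4))

# Adjusting the number of bands (fixed r): more bands shift the curve to the left
for b in [2, 5, 10, 20]:
    ax_bands.plot(s, 1 - (1 - s ** 5) ** b, label=f"b={b}, r=5")
ax_bands.set(title="Adjusting the number of bands", xlabel="similarity s", ylabel="P(candidates)")
ax_bands.legend()

# Adjusting the number of rows (fixed b): more rows shift the curve to the right
for r in [2, 5, 10, 20]:
    ax_rows.plot(s, 1 - (1 - s ** r) ** 5, label=f"b=5, r={r}")
ax_rows.set(title="Adjusting the number of rows", xlabel="similarity s", ylabel="P(candidates)")
ax_rows.legend()

plt.tight_layout()
plt.show()
```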

We have walked through a classical implementation of the LSH method. LSH significantly optimizes search speed by using lower dimensional signature representations and a fast hashing mechanism to reduce the candidate search scope. At the same time, this comes at the cost of search accuracy, but in practice the difference is usually insignificant.

However, LSH is vulnerable to high dimensional data: more dimensions require longer signatures and more computations to maintain search quality. In such cases, it is recommended to use another index.

In fact, there exist different implementations of LSH, but all of them are based on the same paradigm of transforming input vectors into hash values while preserving information about their similarity. Basically, other algorithms simply define other ways of obtaining these hash values.

Random projections is another LSH method that will be covered in the next chapter and which is implemented as an LSH index in the Faiss library for similarity search.

All images unless otherwise noted are by the author.
