Similarity Search, Part 5: Locality Sensitive Hashing (LSH)
Introduction
Shingling
MinHashing
LSH Function
Error rate
Conclusion
Resources

Explore how similarity information can be incorporated into hash functions

Similarity search is a problem where, given a query, the goal is to find the most similar documents to it among all the database documents.

In data science, similarity search often appears in the NLP domain, in search engines or in recommender systems, where the most relevant documents or items need to be retrieved for a query. There exists a large variety of different ways to improve search performance on massive volumes of data.

In previous parts of this article series, we discussed inverted file index, product quantization and HNSW and how they can be used together to improve search quality. In this chapter, we are going to look at a principally different approach that maintains both high search speed and search quality.

Locality Sensitive Hashing (LSH) is a set of methods that is used to reduce the search scope by transforming data vectors into hash values while preserving information about their similarity.

We are going to discuss the classical approach, which consists of three steps:

  1. Shingling: encoding original texts into vectors.
  2. MinHashing: transforming vectors into a special representation called a signature which can be used to compare similarity between them.
  3. LSH function: hashing signature blocks into different buckets. If the signatures of a pair of vectors fall into the same bucket at least once, they are considered candidates.

We are going to gradually dive into the details of each of these steps throughout the article.

Shingling is the process of collecting k-grams on given texts. A shingle is a group of k sequential tokens. Depending on the context, tokens can be words or symbols. The ultimate goal of shingling is to encode each document by using the collected k-grams. We will be using one-hot encoding for this. Nevertheless, other encoding methods can also be applied.

Collecting unique shingles of length k = 3 for the sentence “learning data science is fascinating”

Firstly, the unique k-grams for each document are collected. Secondly, to encode each document, a vocabulary is needed, which is the set of unique k-grams across all documents. Then, for each document, a vector of zeros with length equal to the size of the vocabulary is created. For every k-gram appearing in the document, its position in the vocabulary is identified and a “1” is placed at the respective position of the document vector. Even if the same k-gram appears several times in a document, it does not matter: the value in the vector will always be 1.

One-hot encoding
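To make the procedure concrete, here is a minimal Python sketch of shingling and one-hot encoding. The function names, the word-level tokenisation and the second document are illustrative assumptions, not a fixed implementation:

```python
def shingles(text, k=3):
    """Collect the unique word-level k-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

# Illustrative documents (the second one is made up for the example).
docs = [
    "learning data science is fascinating",
    "data science is a fascinating field",
]

# Vocabulary: the set of unique shingles across all documents.
vocab = sorted(set().union(*(shingles(d) for d in docs)))

# One-hot encoding: 1 if the shingle appears in the document, 0 otherwise.
encoded = [[1 if sh in shingles(d) else 0 for sh in vocab] for d in docs]
```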

At this stage, the initial texts have been vectorised. The similarity of the vectors can be compared via the Jaccard index. Remember that the Jaccard index of two sets is defined as the number of elements common to both sets divided by the number of elements in their union.

Jaccard index is defined as the intersection over the union of two sets

If a pair of encoded vectors is taken, the intersection in the Jaccard index formula is the number of rows that both contain 1 (i.e. the k-gram appears in both vectors) and the union is the number of rows with at least one 1 (the k-gram is present in at least one of the vectors).

Formula for the Jaccard index of two vectors
Example of calculating the Jaccard index for two vectors using the formula above
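As a quick illustration, the Jaccard index of two one-hot vectors can be computed directly from the definition above. This is a minimal sketch, not an optimised implementation:

```python
def jaccard(u, v):
    """Jaccard index of two one-hot vectors: intersection over union of 1-rows."""
    intersection = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return intersection / union if union else 0.0

print(jaccard([1, 0, 1, 1, 0], [0, 1, 1, 1, 0]))  # 2 common / 4 non-zero rows = 0.5
```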

The immediate problem is the sparsity of the encoded vectors. Computing a similarity score between two one-hot encoded vectors would take a lot of time. Transforming them to a dense format would make it more efficient to operate on them later. Ultimately, the goal is to design a function that transforms these vectors to a smaller dimension while preserving the information about their similarity. The method that constructs such a function is called MinHashing.

MinHash is a hash function that permutes the components of an input vector and then returns the first index where the permuted vector component equals 1.

Example of calculating a minhash value for a given vector and permutation

To get a dense representation of a vector consisting of n numbers, n minhash functions can be used to obtain n minhash values which form a signature.

It may not sound obvious at first, but several minhash values can be used to approximate the Jaccard similarity between vectors. In fact, the more minhash values are used, the more accurate the approximation is.

Calculation of the signature matrix and how it is used to compute similarities between vectors. Similarities computed using the Jaccard index and using signatures should normally be roughly equal.
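Here is a minimal sketch of building signatures with explicit random permutations; in practice, libraries typically use cheaper universal hash functions instead of storing permutations, and all the names and sizes below are illustrative:

```python
import random

def minhash_signature(vector, permutations):
    """Read the vector in each permuted order and record the (1-based)
    position of the first 1; assumes each vector contains at least one 1."""
    signature = []
    for perm in permutations:
        for rank, idx in enumerate(perm, start=1):
            if vector[idx] == 1:
                signature.append(rank)
                break
    return signature

n_hashes, dim = 100, 8                       # illustrative sizes
rng = random.Random(0)
permutations = [rng.sample(range(dim), dim) for _ in range(n_hashes)]

u = [1, 0, 1, 1, 0, 0, 1, 0]
v = [0, 1, 1, 1, 0, 0, 1, 0]
sig_u = minhash_signature(u, permutations)
sig_v = minhash_signature(v, permutations)
```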

This is not just a useful observation. It turns out that there is a whole theorem behind the scenes. Let us understand why the Jaccard index can be calculated by using signatures.

Statement proof

Let us assume that a given pair of vectors contains only rows of type 01, 10 and 11. Then a random permutation on these vectors is performed. Since there exists at least one 1 in all the rows, while computing the two hash values, at least one of the two hash-value computation processes will stop at the first row of a vector with the corresponding hash value equal to 1.

What is the probability that the second hash value will be equal to the first one? Obviously, this can only happen if the second hash value is also equal to 1. That means the first row has to be of type 11. Since the permutation was taken randomly, the probability of such an event is equal to P = count(11) / (count(01) + count(10) + count(11)). This expression is exactly the Jaccard index formula. Therefore:

The probability of getting equal minhash values for two binary vectors equals their Jaccard index.

Nevertheless, while proving the statement above, we assumed that the initial vectors did not contain rows of type 00. It is clear that rows of type 00 do not change the value of the Jaccard index. Similarly, including rows of type 00 does not affect the probability of getting the same hash values. For example, if the first permuted row is 00, then the minhash algorithm just ignores it and switches to the next row until there is at least one 1 in a row. Therefore, the statement holds for arbitrary binary vectors as well.

We have proven an important statement. But how can the probability of getting the same minhash values be estimated? Certainly, it is possible to generate all possible permutations for the vectors and then calculate all minhash values to find the desired probability. For obvious reasons, this is not efficient, because the number of possible permutations for a vector of size n equals n!. Nevertheless, the probability can be evaluated approximately: let us just use many hash functions to generate that many hash values.

The Jaccard index of two binary vectors approximately equals the proportion of matching values in their signatures.

Mathematical notation: J(A, B) ≈ |{i : sig_A(i) = sig_B(i)}| / n, where n is the signature length

It is easy to notice that longer signatures result in more accurate estimates.
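Continuing the sketch above (sig_u and sig_v refer to the previous snippet), the Jaccard index can then be approximated as the fraction of positions where the two signatures agree:

```python
def signature_similarity(sig_u, sig_v):
    """Fraction of positions where two signatures agree: an estimate of Jaccard."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)

# For u and v above, the true Jaccard index is 3 common rows / 5 non-zero rows = 0.6,
# and the estimate below should land close to it.
print(signature_similarity(sig_u, sig_v))
```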

At this point, we can transform raw texts into dense signatures of equal length that preserve the information about similarity. Nevertheless, in practice, such dense signatures still usually have high dimensions and it would be inefficient to compare them directly.

Consider n = 10⁶ documents with signatures of length 100. Assuming that a single number of a signature requires 4 bytes to store, the whole signature requires 400 bytes. For storing n = 10⁶ documents, 400 MB of space is needed, which is feasible in reality. But comparing each document with every other in a brute-force manner would require roughly 5 * 10¹¹ comparisons, which is far too many, especially when n is even larger.

To avoid this problem, it is possible to build a hash table to accelerate search performance, but even if two signatures are very similar and differ in only 1 position, they are still likely to have different hashes (because vector remainders are likely to be different). Nevertheless, we normally want them to fall into the same bucket. This is where LSH comes to the rescue.

The LSH mechanism builds a hash table consisting of several parts which puts a pair of signatures into the same bucket if they have at least one corresponding part.

LSH takes a signature matrix and horizontally divides it into b equal parts called bands, each containing r rows. Instead of plugging the whole signature into a single hash function, the signature is divided into b parts and each subsignature is processed independently by a hash function. As a consequence, each of the subsignatures falls into a separate bucket.

Example of using LSH. Two signatures of length 9 are divided into b = 3 bands, each containing r = 3 rows. Each subvector is hashed into one of k possible buckets. Since there is a match in the second band (both subvectors have the same hash value), we consider this pair of signatures as candidates to be nearest neighbours.

If there is at least one collision between the corresponding subvectors of two different signatures, the signatures are considered candidates. As we can see, this condition is more flexible, since vectors do not have to be absolutely equal to be considered candidates. Nevertheless, it increases the number of false positives: a pair of different signatures can have a single corresponding part while being completely different overall. Depending on the problem, it is always better to optimise the parameters b, r and k.
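A minimal sketch of the banding step follows. For simplicity, the band's subsignature itself is used as the bucket key instead of hashing it into one of k buckets, and all names are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Split each signature into b bands of r rows each; any two signatures
    sharing an identical band fall into the same bucket and become candidates."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            # Use the band's subsignature as the bucket key (a simple stand-in
            # for hashing it into one of k buckets).
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates
```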

With LSH, it is possible to estimate the probability that two signatures with similarity s will be considered candidates, given a number of bands b and a number of rows r in each band. Let us find the formula for it in several steps.

  1. The probability that one random row of both signatures is equal: P = s
  2. The probability that one random band with r rows is equal: P = s^r
  3. The probability that one random band with r rows is different: P = 1 - s^r
  4. The probability that all b bands in the table are different: P = (1 - s^r)^b
  5. The probability that at least one of the b bands is equal, so the two signatures are candidates: P = 1 - (1 - s^r)^b

Note that the formula does not consider collisions where different subvectors accidentally hash into the same bucket. Because of this, the real probability of signatures being candidates might differ insignificantly.

Example

To get a better sense of the formula we have just obtained, let us consider a simple example. Consider two signatures with a length of 35 symbols which are equally split into 7 bands with 5 rows each. Here is the table which represents the probability of having at least one equal band based on their Jaccard similarity:

Probability P of getting at least one corresponding band of two signatures based on their similarity s, computed as P = 1 - (1 - s^5)^7:

s    0.2      0.4     0.6     0.8     0.9
P    0.224%   6.95%   43.3%   93.8%   99.8%

We notice that if two similar signatures have a Jaccard similarity of 80%, then they have a corresponding band in 93.8% of cases (true positives). In the remaining 6.2% of scenarios, such a pair of signatures is a false negative.

Now let us take two different signatures. For instance, they are only 20% similar. Therefore, they are false positive candidates in 0.224% of cases. In the other 99.776% of cases, they do not have a corresponding band, so they are true negatives.
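Both numbers can be checked directly against the formula. The function below is a hypothetical helper written for this check, not part of any library:

```python
def candidate_probability(s, b, r):
    """P(at least one of b bands of r rows fully matches) = 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

print(round(candidate_probability(0.8, b=7, r=5), 3))  # 0.938 -> true positive rate
print(round(candidate_probability(0.2, b=7, r=5), 5))  # 0.00224 -> false positive rate
```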

Visualisation

Let us now visualise the relationship between the similarity s and the probability P of two signatures becoming candidates. Normally, with a higher signature similarity s, signatures should have a higher probability of being candidates. Ideally, it would look like below:

Ideal scenario. A pair of signatures is considered candidates only if their similarity is greater than a certain threshold t

Based on the probability formula obtained above, a typical curve looks like in the figure below:

A typical curve that increases slowly at the beginning and at the end and has a steep slope at a threshold t given by the approximate probability formula in the figure

It is possible to vary the number of bands b to shift the curve in the figure to the left or to the right. Increasing b moves the curve to the left and results in more false positives; decreasing it shifts the curve to the right and leads to more false negatives. It is important to find a good balance, depending on the problem.

With a higher number of bands the curve moves to the left; with a lower number, to the right
Moving the threshold to the left increases false positives while shifting it to the right increases false negatives

Experiments with different numbers of bands and rows

Several line plots are shown below for different values of b and r. It is always better to adjust these parameters based on the specific task, so that all pairs of similar documents are successfully retrieved while pairs with different signatures are ignored.

Adjusting the number of bands
Adjusting the number of rows
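As a rough substitute for the plots, the following sketch tabulates P for a few illustrative values of b and r, showing the curve shifting left with more bands and right with more rows:

```python
def candidate_probability(s, b, r):
    return 1 - (1 - s ** r) ** b

similarities = [s / 10 for s in range(1, 10)]

# More bands (b up, r fixed) shifts the curve to the left.
for b in (5, 10, 20):
    print(f"b={b:2d}, r=5:", [round(candidate_probability(s, b, 5), 3) for s in similarities])

# More rows (r up, b fixed) shifts the curve to the right.
for r in (3, 5, 8):
    print(f"b=10, r={r}:", [round(candidate_probability(s, 10, r), 3) for s in similarities])
```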

We have walked through a classical implementation of the LSH method. LSH significantly optimises search speed by using lower dimensional signature representations and a fast hashing mechanism to reduce the candidate search scope. At the same time, this comes at the cost of search accuracy, but in practice the difference is usually insignificant.

Nevertheless, LSH is vulnerable to high dimensional data: higher dimensions require longer signatures and more computations to maintain good search quality. In such cases, it is recommended to use another index.

In fact, there exist different implementations of LSH, but all of them are based on the same paradigm of transforming input vectors into hash values while preserving information about their similarity. Essentially, other algorithms simply define other ways of obtaining these hash values.

Random projections is another LSH method that will be covered in the next chapter and which is implemented as an LSH index in the Faiss library for similarity search.

All images unless otherwise noted are by the author.
