## Introducing ChunkDot

Success! I managed to find vector representations for my 1 million items… I just need to calculate the cosine similarity between all pairs to find the 20 most similar items per item. Then, *from sklearn.metrics.pairwise import cosine_similarity*, throw my 1 million vectorized items into it, and bam!

My computer explodes with a *MemoryError*, or the kernel of my *Jupyter* notebook dies. The problem: I just tried to hold 1 trillion (10¹²) values in memory. Good luck!
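For reference, the naive attempt looks something like this sketch (the 256-dimensional random embeddings are just a placeholder):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 1 million items, each represented by a 256-dimensional embedding
embeddings = np.random.randn(1_000_000, 256)

# This tries to allocate a dense 10^6 x 10^6 matrix of float64 values:
# 10^12 values * 8 bytes ~ 8 TB. A MemoryError (or a dead kernel) is guaranteed.
similarities = cosine_similarity(embeddings)
```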

This blog post introduces *ChunkDot*, a Python package I built to solve this problem. *ChunkDot* can do that calculation in ~2 hours while keeping the memory consumption under 20 GB, as shown below.

If you would like to skip the explanation and jump directly to the usage instructions, scroll down to the installation and usage example at the end of this post.

The first time I encountered this problem was ~6 years ago while working for a client on an entity-matching use case. I was trying to match hundreds of thousands of records saved as free text to create golden records with the aggregated information from similar records². Back then, *Spark* was a new and very shiny tool, so I solved it using *Spark*, with some broadcasting and some *C++* code wrapped in *Cython* wrapped in a *UDF*. I remember feeling that there might be a more elegant and simple way to solve this problem. I finally had time to look into it.

Since then, I have seen this issue pop up in one too many use cases. The idea of having items as vector representations and then trying to find the K most similar or dissimilar items per item appears in quite a few Machine Learning use cases. Some of these use cases are:

- Similar product recommendations.
- Name and Entity matching.
- Forecasting cold-start problem.
- Recommendation systems cold-start problem.
- User segmentation.

The core of the problem is that the number of calculations scales with the square of the number of items. If we have *N* items, then the number of item pairs scales as *N²*. There is a silver lining, though: we only need the K most similar items per item. To determine which are the most similar K, we need to calculate all similarity pairs, but what if we do the calculation in batches (or chunks)? Keep the K most similar items for each batch, discard the rest, and then proceed with the next batch until we have processed all items. That is the idea behind *ChunkDot*.

Let's go over an example: suppose I have 10⁶ items, and I only require the top 100 most similar items per item. Then the number of similarity values I need is not 10¹² but 10⁶ × 100 = 10⁸, a reduction of 99.99%.³ This doesn't mean that the memory consumption will be reduced by 99.99%. It means the memory is no longer set by the output matrix but by the memory consumption of the algorithm itself.
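As a back-of-the-envelope calculation, assuming double-precision (8-byte) values:

```python
bytes_per_float64 = 8

all_pairs = 10**6 * 10**6   # 10^12 similarity values for the full matrix
top_k_only = 10**6 * 100    # 10^8 similarity values for the top 100 per item

print(all_pairs * bytes_per_float64 / 1e12)   # ~8.0 TB for the full matrix
print(top_k_only * bytes_per_float64 / 1e9)   # ~0.8 GB for the top-100 output
```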

## Cosine Similarity Top K

Let's go over how *ChunkDot* works. Though the core of the implementation is in `chunkdot._chunkdot`, I'll focus on the `chunkdot.cosine_similarity_top_k` function as shown below.
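At the public-API level it is a single function call; here is a minimal sketch of how it is invoked (the parameters mirror the usage example at the end of this post):

```python
import numpy as np
from chunkdot import cosine_similarity_top_k

embeddings = np.random.randn(10_000, 256)  # one row per item

# Returns a SciPy CSR matrix with the top_k non-zero similarities per row,
# computed chunk by chunk within the given memory budget.
similarities = cosine_similarity_top_k(embeddings, top_k=20, max_memory=20E9)
```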

The algorithm consists of three parts: splitting the embeddings matrix into the optimal number of chunks, performing the parallelized operations over the chunks, and collecting the data from each parallel thread to form the similarity matrix.

## Splitting the input embeddings into chunks

The first part of the algorithm receives as input a matrix of embeddings where each row is the vector representation of an item. It then calculates the optimal chunk size to use during the parallelization. The optimal chunk size is the maximum number of rows to process per thread without exceeding the given maximum memory. If no maximum memory is given, the system's available memory is used. In addition, an optional L2 normalization of the rows can be done. Finally, the normalized embeddings matrix gets split into pieces where each piece contains a number of rows equal to the chunk size. Now the data is ready to be parallelized.
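A minimal sketch of this first step might look like the following; the function names are illustrative, not ChunkDot's actual internals:

```python
import numpy as np

def normalize_rows(embeddings):
    """L2-normalize each row so that a dot product equals the cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.where(norms == 0, 1, norms)

def split_into_chunks(embeddings, chunk_size):
    """Split the (n_items x n_dim) matrix into pieces of at most chunk_size rows."""
    n_items = embeddings.shape[0]
    return [embeddings[i : i + chunk_size] for i in range(0, n_items, chunk_size)]
```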

## Divide and conquer

The second part of the algorithm performs all the parallel computations; each parallel thread calculates the cosine similarity between its chunk of items and all the items. As shown in the figure above, a matrix multiplication is sufficient since the rows are L2 normalized. Then, only the indices and values of the K most similar items are retrieved so that the dense similarity matrix can be garbage collected to release memory.
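Conceptually, each thread does something like the sketch below; the function name and the plain-NumPy style are mine, not ChunkDot's actual kernel:

```python
import numpy as np

def chunk_top_k(chunk, embeddings, top_k):
    """Cosine similarity of one chunk vs. all items, keeping only the top K per row.

    Both `chunk` and `embeddings` are assumed to be L2-normalized already.
    """
    # Dense similarities for this chunk only: (chunk_size x n_items)
    similarities = chunk @ embeddings.T
    # Indices of the K largest values per row, without fully sorting each row
    n_items = similarities.shape[1]
    top_indices = np.argpartition(similarities, n_items - top_k, axis=1)[:, n_items - top_k:]
    top_values = np.take_along_axis(similarities, top_indices, axis=1)
    return top_values, top_indices  # the dense `similarities` block can now be freed
```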

There is no such thing as a free lunch; the price to pay for the parallelization and memory footprint reduction is that instead of having one big matrix multiplication, we have thousands of smaller ones plus the overhead of parallelization. This added complexity can raise the execution time to an unusable level in practice. To manage the parallelization overhead efficiently and to perform these calculations as fast as possible, *ChunkDot* uses *Numba*.

“Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.”

To get the most out of *Numba*, the code to optimize must be supported in no-python mode. At the moment of this writing, the `numpy.argpartition` function is not supported, which is why there is a `numba_argpartition.py` file in *ChunkDot*'s source code. I submitted this implementation to *Numba* via the PR below, which has been merged. The support should become native in *Numba*'s next release.
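As a toy illustration of what no-python, parallel Numba code looks like (not ChunkDot's actual kernel):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(matrix):
    """Sum each row in a separate parallel iteration; compiled to machine code
    in no-python mode on the first call."""
    out = np.empty(matrix.shape[0])
    for i in prange(matrix.shape[0]):
        out[i] = matrix[i].sum()
    return out
```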

After the cosine similarity calculations and the retrieval of the K most similar items, we can collect all the values and indices to construct the final sparse similarity matrix.

## Similarities as a CSR sparse matrix

The third part of the algorithm consists of creating the sparse similarity matrix containing the K most similar items per item. This is a sparse matrix of dimensions *n_items x n_items* containing only *n_items x top_k* non-zero values. The final output is returned as a SciPy CSR sparse matrix.
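A minimal sketch of this last step, assuming the top-K values and column indices have already been collected for every row:

```python
import numpy as np
from scipy.sparse import csr_matrix

def to_csr(top_values, top_indices, n_items, top_k):
    """Build the (n_items x n_items) CSR matrix with top_k non-zeros per row."""
    # Each row holds exactly top_k stored values, so the row pointer is regular.
    indptr = np.arange(0, n_items * top_k + 1, top_k)
    return csr_matrix(
        (top_values.ravel(), top_indices.ravel(), indptr),
        shape=(n_items, n_items),
    )
```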

The idea behind calculating the largest possible chunk size is that the larger the chunk, the fewer threads are needed, therefore reducing the execution time of the algorithm.

To reduce memory utilization, all threads' outputs are collected in arrays that are passed by reference to each thread. *ChunkDot*'s algorithm memory consumption is then described by⁴:

where the number of threads is given by *Numba*'s `get_num_threads` function. The optimal chunk size is the solution for `chunk_size` in the equation above.
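As a rough illustration (this is not ChunkDot's exact formula), if the dominant cost is the dense chunk-vs-all-items block that each thread holds, the chunk size can be solved from the memory budget like this:

```python
def rough_chunk_size(n_items, max_memory_bytes, n_threads, bytes_per_value=8):
    """Illustrative only: the largest chunk size such that all threads' dense
    (chunk_size x n_items) similarity blocks fit within the memory budget."""
    per_row_bytes = n_items * bytes_per_value * n_threads
    return max(1, int(max_memory_bytes // per_row_bytes))

# e.g. 1 million items, a 20 GB budget, and 8 threads -> ~312 rows per chunk
print(rough_chunk_size(1_000_000, 20e9, 8))
```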

I made a benchmark comparison with SciKit-Learn’s cosine similarity on memory consumption and execution time.

This is not entirely a fair comparison since the purpose of SciKit-Learn's cosine similarity function is to retrieve the similarities of all pairs of items and not just the K most similar. Having said that, I do not know of any other similarly straightforward approach that can be executed on a normal laptop without a rigorous setup. I also think most of us would look first at the SciKit-Learn function to solve this problem. Therefore it seemed appropriate to benchmark against it.

The figure below shows the results. For the *SciKit-Learn* implementation, I didn't go above 50K items since it was already taking ~20 GB of memory. I did include a projection of the expected memory consumption.

The results show that the algorithm works; it can do the calculation for a large number of items while keeping the memory consumption capped. In the top-left figure, *SciKit-Learn*'s implementation is compared with *ChunkDot* using 3 different maximum memory settings: 5 GB, 10 GB, and 20 GB.

Nowadays it is common to have 20 GB of RAM; you can easily compute 100K different items and retrieve their K most similar items in around 1 minute.

Quite unexpectedly, the *ChunkDot* algorithm is also faster (thanks, *Numba*!), and it seems that the maximum memory to use doesn't matter that much for execution time… weird.

I did an extensive review of the results and methodology and reran them several times, but I wouldn't claim that *ChunkDot* is unequivocally faster. What I do know for certain is that I was able to do the calculation for 100K and 1 million items with no problem.

For 1 million items, which is 1 trillion comparisons, I only had to wait ~2 hours using 20 GB of memory, as shown in the figure below.

`pip install -U chunkdot`

Calculate the 50 most similar and dissimilar items for 100K items.

```python
import numpy as np
from chunkdot import cosine_similarity_top_k

embeddings = np.random.randn(100000, 256)

# using all your system's memory
cosine_similarity_top_k(embeddings, top_k=50)

# most dissimilar items using 20 GB
cosine_similarity_top_k(embeddings, top_k=-50, max_memory=20E9)
```

```
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 5000000 stored elements in Compressed Sparse Row format>
```

I hope you find this blog post and *ChunkDot* useful! I enjoyed working on this problem a lot and getting satisfactory results. Some potential improvements:

- Add support for input embeddings as a sparse matrix. This might already be supported, but I just haven't thoroughly unit tested for it, so I decided to return a `TypeError` exception to avoid silent errors. This would especially enable NLP use cases where the vectorization of items is done using n-grams, bag-of-words, or similar.
- The cosine similarity between item *i* and *j* is the same as the similarity between *j* and *i*. It is enough to perform the calculations that render the upper-triangular matrix. This could be done via the Einstein summation function in *NumPy*, which, unfortunately, is not supported by *Numba* for the time being.
- Instead of returning the K most similar items, return items whose similarity is above/below a certain threshold.
- Add GPU support since *Numba* supports it.