Cosine Similarity for 1 Trillion Pairs of Vectors Motivation ChunkDot Chunk size calculation Memory and speed Usage Conclusion


Photo by Tamas Pap on Unsplash

Success! I managed to seek out vector representations for my 1 million items… I just must calculate the cosine similarity between all pairs to seek out the 20 most similar items per item. Then, from sklearn.metrics.pairwise import cosine_similarity, throw my 1 million vectorized items into it, and bam!
My computer explodes with a MemoryError or the kernel of my Jupyter notebook dies. The problem, I just tried to carry 1 Trillion (10¹²) values in memory (~. Good luck!

This blog post introduces ChunkDot, a Python package I built to unravel this problem. ChunkDot can do that calculation in ~2 hours while keeping the memory consumption under 20 GB, as shown below.

In the event you would really like to skip the reason and jump directly into the usage instructions:

The primary time I encountered this problem was ~6 years ago while working for a client on an entity-matching use case. I used to be attempting to match hundreds of thousands of records saved as free text to create golden records with the aggregated information from similar records². Back then, Spark was a latest and really shiny tool, so I solved it using Spark, with some broadcasting and a few C++ code wrapped in Cython wrapped in a UDF. I remember feeling that there is perhaps a more elegant and straightforward technique to solve this problem. I finally had time to look into it.

Since then, I even have seen this issue pop up in one-to-many use cases. The concept of getting items as vector representations after which trying to seek out the K most similar or dissimilar items per item appears in quite a few use cases in Machine Learning. A few of these use cases are:

  • Similar product recommendations.
  • Name and Entity matching.
  • Forecasting cold-start problem.
  • Advice systems cold-start problem.
  • User segmentation.

The core of the issue is that the variety of calculations scales with the square of the variety of items. If we now have N items, then the variety of item pairs scales as N².
There may be a silver lining, though, we only need the K most similar items per item. To find out that are probably the most similar K, we want to calculate all similarity pairs, but what if we do the calculation in batches (or chunks)? Keep the K most similar items for every batch, discard the remainder, after which proceed with the following batch until we now have processed all items. That is the concept behind ChunkDot.

Let’s go over an example, suppose I even have 10⁶ items, and I only require the highest 100 most similar items per item. Then the variety of similarity values I would like isn’t 10¹² but 10⁶ x 100 = 10⁸, a discount of 99.99%.³ This doesn’t mean that the memory consumption can be reduced by 99.99%. It means the memory is not any longer set by the output matrix but by the memory consumption of the algorithm itself.

Cosine Similarity Top K

Let’s go over how ChunkDot works. Though the core of the implementation is in chunkdot._chunkdot, I’ll deal with the chunkdot.cosine_similarity_top_k function as shown below.

Diagram describing ChunkDot’s Cosine Similarity Top K algorithm

The algorithm consists of three parts, splitting the embeddings matrix into the optimal variety of chunks, performing the parallelized operations over the chunks, and collecting the information from each parallel thread to form the similarity matrix.

Splitting the input embeddings into chunks

First a part of ChunkDot’s Cosine Similarity Top K algorithm

The primary a part of the algorithm receives as input a matrix of embeddings where each row is the vector representation of an item. Then it calculates the optimal chunk size to make use of in the course of the parallelization. The optimal chunk size is the utmost variety of rows to process per thread without exceeding the given maximum memory to make use of. If no maximum memory to make use of is given, then the system’s available memory is used. In parallel, an optional L2 normalization of the rows might be done. Finally, the normalized embeddings matrix gets split into pieces where each bit comprises the variety of rows equal to the chunk size. Now the information is able to be parallelized.

Divide and conquer

Second a part of ChunkDot’s Cosine Similarity Top K algorithm

The second a part of the algorithm performs all of the parallel computations; each parallel thread calculates the cosine similarity between the chunk of things vs. all of the items. As shown within the figure above, matrix multiplication is sufficient since rows are L2 normalized. Then, only the indices and values of the K most similar items are retrieved in order that the density matrix is garbage collected to release memory.

There is no such thing as a free lunch; the worth to pay for the parallelization and memory footprint reduction is that as a substitute of getting one big matrix multiplication, we now have hundreds of smaller ones plus the overhead of parallelization. This added complexity can raise the execution time to a non-usable level in practice. To administer the parallelization overhead efficiently and to perform these calculations as fast as possible ChunkDot uses Numba.

“Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a number of easy annotations, array-oriented and math-heavy Python code might be just-in-time optimized to performance similar as C, C++ and Fortran, without having to change languages or Python interpreters.”

To take probably the most out of Numba, the code to optimize must be supported in no-python mode. For the time being of this writing, the numpy.argpartition function isn’t supported, which is why there’s a file in ChunkDot’s source code. I submitted this implementation to Numba via the PR below, which has been merged. The support should turn out to be native on Numba’s next release.

After the cosine similarity calculations and the retrieval of the K most similar items, we will collect all of the values and indices to construct the ultimate sparse similarity matrix.

Similarities as a CSR sparse matrix

Third a part of ChunkDot’s Cosine Similarity Top K algorithm

The third a part of the algorithm consists of making the sparse similarity matrix containing the K most similar items per item. This can be a sparse matrix of dimensions n_items x n_items containing only n_items x top_k non-zero values. The ultimate output will the returned as a SciPy CSR sparse matrix.

The concept behind calculating the largest possible chunk size is that the larger the chunk, the less threads are needed, due to this fact reducing the execution time of the algorithm.
To scale back memory utilization, all threads outputs are collected in arrays which can be passed by reference to every thread. ChunkDot’s algorithm memory consumption is then described by⁴:

Where the variety of threads is given by Numba’s get_num_threads function. The optimal chunk size is the answer for chunk_size within the equation above.

I made a benchmark comparison with SciKit-Learn’s cosine similarity on memory consumption and execution time.

This isn’t entirely a good comparison because the purpose of SciKit-Learn’s cosine similarity function is to retrieve the similarities of all pairs of things and never just the K most similar. Having said that, I have no idea of some other similar and straightforward approach that might be executed on a standard laptop with no rigorous setup. I also think most of us can be looking first on the SciKit-Learn function to unravel this problem. Subsequently it seemed appropriate to benchmark against it.

The figure below shows the outcomes. For the SciKit-Learn implementation, I didn’t go above 50K items because it was already taking ~20 GB of memory. I did include a projection of the expected memory consumption.

The outcomes show that the algorithm works, it will probably do the calculation for numerous items while keeping the memory consumption capped. Within the above-left figure, SciKit-Learn’s implementation is compared with ChunkDot using 3 different maximum memories: 5 GB, 10 GB, and 20 GB.

Nowadays is common to have 20 GB of RAM; you’ll be able to easily compute 100K different items and retrieve their K most similar items in around 1min.
Quite unexpectedly, the ChunkDot algorithm can also be faster, thanks, Numba! and evidently the utmost memory to make use of doesn’t matter that much for execution time… weird.
I did an intensive review of the outcomes and methodology and reran them several times, but I wouldn’t claim that ChunkDot is univocally faster. What I do know for certain is that I used to be capable of do the calculation for 100K and 1 million items with no problem.
For 1 million items, that are 1 trillion comparisons, I only had to attend ~2 hours using 20 GB of memory, as shown within the figure below.

pip install -U chunkdot

Calculate the 50 most similar and dissimilar items for 100K items.

import numpy as np
from chunkdot import cosine_similarity_top_k

embeddings = np.random.randn(100000, 256)
# using all you system's memory
cosine_similarity_top_k(embeddings, top_k=50)
# most dissimilar items using 20 GB
cosine_similarity_top_k(embeddings, top_k=-50, max_memory=20E9)

<100000x100000 sparse matrix of type ''
with 5000000 stored elements in Compressed Sparse Row format>

I hope you discover this blog post and ChunkDot useful! I enjoyed quite a bit working on this problem and getting satisfactory results. Some potential improvements:

  • Add support for input embeddings as a sparse matrix. This might already be supported, but I just haven’t thoroughly unit tested for this. I made a decision to return a TypeError exception to avoid silent errors. This can especially enable NLP use cases where the vectorization of things is completed using n-grams, bag-of-words, or similar.
  • The cosine similarity between item i and j, is the same as the similarity between j and i. It’s enough to perform the calculations that render the upper-triangular matrix. This might be done via the Einstein summation function in Numpy, unfortunately, isn’t supported by Numba for the time being.
  • As an alternative of returning the K most similar items, return items whose similarity is above/below a certain threshold.
  • Add GPU support since Numba supports it.


What are your thoughts on this topic?
Let us know in the comments below.


Notify of
1 Comment
Newest Most Voted
Inline Feedbacks
View all comments
how can i buy followers on instagram
how can i buy followers on instagram
6 months ago

An informative piece that leaves me wanting to learn more.

Share this article

Recent posts

Grey Wolf Optimizer — How It Can Be Used with Computer Vision

As a bonus, get the code to use feature extraction anywhereImage created by DALL·E 3 based on the prompt “Draw a pack of futuristic...

Artificial intelligence corporations flock to ‘AI representative city Gwangju’

Artificial intelligence (AI) specialized corporations are flocking to Gwangju, the representative city of artificial intelligence in Korea. Gwangju City (Mayor Kang Ki-jeong) held a gathering...

The Pillars of Responsible AI: Navigating Ethical Frameworks and Accountability in an AI-Driven World

Within the rapidly evolving realm of recent technology, the concept of ‘Responsible AI’ has surfaced to handle and mitigate the problems arising from AI...

Ministry of Culture-GIST, MOU to ascertain AI overseas news evaluation platform

The Ministry of Culture, Sports and Tourism (Minister Yoo In-chon) announced on the fifteenth that it could sign a business agreement with the Gwangju...

“Samsung significantly strengthens headset secret development team to reply to Apple’s ‘Vision Pro’”

A report has emerged that Samsung Electronics is significantly increasing the dimensions of its internal XR (mixed reality) headset development team following the launch...

Recent comments

бнанс рестраця для США on Model Evaluation in Time Series Forecasting
Bonus Pendaftaran Binance on Meet Our Fleet
Créer un compte gratuit on About Me — How I give AI artists a hand
To tài khon binance on China completely blocks ‘Chat GPT’
Regístrese para obtener 100 USDT on Reducing bias and improving safety in DALL·E 2
crystal teeth whitening on What babies can teach AI
binance referral bonus on DALL·E API now available in public beta prihlásení on Neural Networks and Life
Büyü Yapılmışsa Nasıl Bozulur on Introduction to PyTorch: from training loop to prediction
yıldızname on OpenAI Function Calling
Kısmet Bağlılığını Çözmek İçin Dua on Examining Flights within the U.S. with AWS and Power BI
Kısmet Bağlılığını Çözmek İçin Dua on How Meta’s AI Generates Music Based on a Reference Melody
Kısmet Bağlılığını Çözmek İçin Dua on ‘이루다’의 스캐터랩, 기업용 AI 시장에 도전장
uçak oyunu bahis on Thanks!
para kazandıran uçak oyunu on Make Machine Learning Work for You
medyum on Teaching with AI
aviator oyunu oyna on Machine Learning for Beginners !
yıldızname on Final DXA-nation
adet kanı büyüsü on ‘Fake ChatGPT’ app on the App Store
Eşini Eve Bağlamak İçin Dua on LLMs and the Emerging ML Tech Stack
aviator oyunu oyna on AI as Artist’s Augmentation
Büyü Yapılmışsa Nasıl Bozulur on Some Guy Is Trying To Turn $100 Into $100,000 With ChatGPT
Eşini Eve Bağlamak İçin Dua on Latest embedding models and API updates
Kısmet Bağlılığını Çözmek İçin Dua on Jorge Torres, Co-founder & CEO of MindsDB – Interview Series
gideni geri getiren büyü on Joining the battle against health care bias
uçak oyunu bahis on A faster method to teach a robot
uçak oyunu bahis on Introducing the GPT Store
para kazandıran uçak oyunu on Upgrading AI-powered travel products to first-class
para kazandıran uçak oyunu on 10 Best AI Scheduling Assistants (September 2023)
aviator oyunu oyna on 🤗Hugging Face Transformers Agent
Kısmet Bağlılığını Çözmek İçin Dua on Time Series Prediction with Transformers
para kazandıran uçak oyunu on How China is regulating robotaxis
bağlanma büyüsü on MLflow on Cloud
para kazandıran uçak oyunu on Can The 2024 US Elections Leverage Generative AI?
Canbar Büyüsü on The reverse imitation game
bağlanma büyüsü on The NYU AI School Returns Summer 2023
para kazandıran uçak oyunu on Beyond ChatGPT; AI Agent: A Recent World of Staff
Büyü Yapılmışsa Nasıl Bozulur on The Murky World of AI and Copyright
gideni geri getiren büyü on ‘Midjourney 5.2’ creates magical images
Büyü Yapılmışsa Nasıl Bozulur on Microsoft launches the brand new Bing, with ChatGPT inbuilt
gideni geri getiren büyü on MemCon 2023: We’ll Be There — Will You?
adet kanı büyüsü on Meet the Fellow: Umang Bhatt
aviator oyunu oyna on Meet the Fellow: Umang Bhatt
abrir uma conta na binance on The reverse imitation game
código de indicac~ao binance on Neural Networks and Life
Larry Devin Vaughn Wall on How China is regulating robotaxis
Jon Aron Devon Bond on How China is regulating robotaxis
otvorenie úctu na binance on Evolution of Blockchain by DLC
puravive reviews consumer reports on AI-Driven Platform Could Streamline Drug Development
puravive reviews consumer reports on How OpenAI is approaching 2024 worldwide elections Registrácia on DALL·E now available in beta