Modern Semantic Search for Images


A how-to article leveraging Python, Pinecone, Hugging Face, and the OpenAI CLIP model to create a semantic search application for your cloud photos.

Image by the author

You want to find “that one picture” from several years ago. You remember a number of specifics about the setting. Apple Photos doesn’t offer semantic search, and Google Photos is limited to a set of predetermined object classifiers. Neither will do well with this type of search. I’ll illustrate the problem with two unusual queries of my Google Photos: “donut birthday cake” and “busted lip from a snowball fight”. Then I’ll show you how to build your own semantic image search application.

Example #1

I like birthday cakes. I also like donuts. Last year, I had the great idea to combine the two with a stack of donuts as my birthday cake. Let’s try to find it.

Google Photos query: “donut birthday cake”

Results: Six pictures of cakes with no donuts followed by the one I wanted.

Image by the author

Semantic Search App query: “donut birthday cake”

Results: Two images and a video that were exactly what I wanted.

Image by the author

Example #2

I went to the snow with my teenage son and a big group of his friends. They climbed on top of an abandoned train tunnel. “Throw snowballs all at once, and I’ll get a slow-motion video of it!” I yelled. It was not my brightest moment, as I didn’t foresee the obvious conclusion that I’d end up being target practice for twenty teenage boys with strong arms.

Google Photos query: “busted lip from a snowball fight”

Results:

Image by the author

The current Google image classification model is limited to the words it has been trained on.

Semantic Search App query: “busted lip from a snowball fight”

Results: The busted lip picture (not shown) and the video that preceded the busted lip were results one and two.

Image by the author

CLIP stands for Contrastive Language-Image Pretraining. It is an open-source, multi-modal, zero-shot model trained on hundreds of millions of images paired with descriptive captions. CLIP learns how to associate image pixels with text, which gives it the flexibility to search for things like “donut cakes” and “busted lips”, things you’d never think to include when training an image classifier.

Given an image and text descriptions, the model can predict the most relevant text description for that image, without optimizing for a particular task.

Source: Nikos Karfitsas, Towards Data Science

The CLIP architecture that you find in most online tutorials is good enough for a POC but is not enterprise-ready. In these tutorials, CLIP and the Hugging Face processors hold embeddings in memory to act as the vector store for running similarity scores and retrieval.

Image by the author

A vector database like Pinecone is a key component for scaling an application like this. It provides simplified, robust, enterprise-ready features such as batch and stream processing of images, enterprise management of embeddings, low-latency retrieval, and metadata filtering.

Image by the author

The code and supporting files for this application can be found on GitHub at https://github.com/joshpoduska/llm-image-caption-semantic-search. Use them to build a semantic search application for your cloud photos.

The application runs locally on a laptop with sufficient memory. I tested it on a MacBook Pro.

Components needed to build the app

  • Pinecone or similar vector database for embedding storage and semantic search (the free version of Pinecone is sufficient for this tutorial)
  • Hugging Face models and pipelines
  • OpenAI CLIP model for image and query text embedding creation (accessible from Hugging Face)
  • Google Photos API to access your personal Google Photos

Helpful information before you begin

Access your images

The Google Photos API has several key data fields of note. See the API reference for more details.

  • id is immutable and can be used as a permanent reference to a media item
  • baseUrl lets you access the bytes of the media items. Base URLs are valid for 60 minutes.

A combination of the pandas, json, and requests libraries can be used to straightforwardly load a DataFrame of your image IDs, URLs, and dates.
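Here is a minimal sketch of that step. It assumes you already hold a valid OAuth 2.0 access token for the Google Photos Library API (the access_token variable below is a placeholder); the endpoint and field names follow the mediaItems.list reference.

import pandas as pd
import requests

# Sketch: page through mediaItems.list and collect id, baseUrl, and creation time.
# access_token is a placeholder for a valid OAuth 2.0 token with the Photos Library scope.
def list_media_items(access_token, page_size=100):
    url = "https://photoslibrary.googleapis.com/v1/mediaItems"
    headers = {"Authorization": f"Bearer {access_token}"}
    items, page_token = [], None
    while True:
        params = {"pageSize": page_size}
        if page_token:
            params["pageToken"] = page_token
        resp = requests.get(url, headers=headers, params=params).json()
        items.extend(resp.get("mediaItems", []))
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return items

items = list_media_items(access_token)
df = pd.DataFrame({
    "id": [i["id"] for i in items],
    "baseUrl": [i["baseUrl"] for i in items],
    "creationTime": [i["mediaMetadata"]["creationTime"] for i in items],
})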

Generate image embeddings

With Hugging Face and the OpenAI CLIP model, this step is the easiest of the entire application.

from sentence_transformers import SentenceTransformer

# load the CLIP image encoder and embed the list of PIL images
img_model = SentenceTransformer('clip-ViT-B-32')
embeddings = img_model.encode(images)
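For completeness, here is one way the images list used above could be assembled from the base URLs loaded earlier. This is a sketch that assumes the df DataFrame from the previous step and uses the Photos API convention of appending a size suffix to a baseUrl to request a downscaled copy.

import io
import requests
from PIL import Image

# Download each photo at a reduced size and convert it to a PIL image for CLIP.
def fetch_image(base_url, size="=w512-h512"):
    resp = requests.get(base_url + size)
    return Image.open(io.BytesIO(resp.content)).convert("RGB")

images = [fetch_image(u) for u in df["baseUrl"]]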

Creating metadata

Semantic search is often enhanced with metadata filters. In this application, I use the date of the photo to extract the year, month, and day. These are stored as a dictionary in a DataFrame field. Pinecone queries can use this dictionary to filter searches by the metadata it contains.
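A sketch of that extraction, assuming the creationTime column from earlier holds ISO 8601 timestamps:

import pandas as pd

# Parse the creation timestamps and store year/month/day as a metadata dictionary.
dates = pd.to_datetime(df["creationTime"])
df["metadata"] = [
    {"year": d.year, "month": d.month, "day": d.day} for d in dates
]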

Here is the first row of my pandas DataFrame with the image fields, vectors, and metadata dictionary field.

Image by the author

Load embeddings

There are Pinecone optimizations for async and parallel loading. The base loading function is simple, as follows.

index.upsert(vectors=ids_vectors_chunk, async_req=True)
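Putting it together, here is a sketch of chunked, asynchronous loading. It assumes the older pinecone-client style (pinecone.init / pinecone.Index); the index name and chunk size are arbitrary choices, and the 512 dimension matches the clip-ViT-B-32 embeddings.

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

index_name = "photo-search"  # hypothetical index name
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=512, metric="cosine")
index = pinecone.Index(index_name, pool_threads=4)  # pool_threads allows parallel async upserts

def chunks(seq, size=100):
    # yield successive fixed-size slices of the vector list
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# each vector is an (id, values, metadata) tuple built from the DataFrame
vectors = list(zip(df["id"], [e.tolist() for e in embeddings], df["metadata"]))
async_results = [index.upsert(vectors=chunk, async_req=True) for chunk in chunks(vectors)]
[r.get() for r in async_results]  # wait for all chunks to finish loading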

Query embeddings

To query the images with the CLIP model, we need to pass it the text of our semantic query. This is facilitated by loading the CLIP text embedding model.

text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

Now we can create an embedding for our search phrase and compare it to the embeddings of the images stored in Pinecone.

# create the query vector
xq = text_model.encode(query).tolist()

# now query
xc = index.query(
    xq,
    filter={
        "year": {"$in": years_filter},
        "month": {"$in": months_filter},
    },
    top_k=top_k,
    include_metadata=True,
)
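The response holds the ranked matches. Here is a small sketch of mapping them back to the photos, assuming the df DataFrame from earlier is still in memory.

# pull the matched ids out of the query response, preserving rank order
matched_ids = [match["id"] for match in xc["matches"]]
results = df.set_index("id").loc[matched_ids].reset_index()

for _, row in results.iterrows():
    # base URLs expire after 60 minutes, so refresh them if needed before viewing
    print(row["id"], row["baseUrl"])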

The CLIP model is amazing. It’s a general-knowledge, zero-shot model that has learned to associate images with text in a way that frees it from the constraints of training an image classifier on pre-defined classes. When we combine this with the power of an enterprise-grade vector database like Pinecone, we can create semantic image search applications with low latency and high fidelity. This is just one of the exciting applications of generative AI sprouting up every day.
