In this post, you'll learn how to build an image similarity system with 🤗 Transformers. Finding the similarity between a query image and potential candidates is an important use case for information retrieval systems, such as reverse image search, for example. All the system is trying to answer is, given a query image and a set of candidate images, which images are the most similar to the query image.
We'll leverage the 🤗 datasets library as it seamlessly supports parallel processing, which will come in handy when building this system.
Although the post uses a ViT-based model (nateraw/vit-base-beans) and a particular dataset (Beans), it can be extended to use other models supporting the vision modality and other image datasets. Some notable models you could try:
Also, the approach presented in the post can potentially be extended to other modalities as well.
To study the fully working image-similarity system, you can refer to the Colab Notebook linked at the beginning.
How do we define similarity?
To build this system, we first need to define how we want to compute the similarity between two images. One widely popular practice is to compute dense representations (embeddings) of the given images and then use the cosine similarity metric to determine how similar the two images are.
For this post, we'll use "embeddings" to represent images in vector space. This gives us a nice way to meaningfully compress the high-dimensional pixel space of images (224 x 224 x 3, for example) to something much lower dimensional (768, for example). The primary advantage of doing this is the reduced computation time in the subsequent steps.
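As a quick, minimal illustration of the metric (using random placeholder vectors rather than real image embeddings), cosine similarity in PyTorch looks like this:

import torch
import torch.nn.functional as F

# Two placeholder 768-d embeddings with a batch dimension of 1 each.
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

# Cosine similarity ranges from -1 (opposite directions) to 1 (same direction).
similarity = F.cosine_similarity(emb_a, emb_b)
print(similarity.item())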
Computing embeddings
To compute the embeddings from the images, we'll use a vision model that has some understanding of how to represent the input images in the vector space. This type of model is also commonly referred to as an image encoder.
For loading the model, we leverage the AutoModel class. It provides an interface for us to load any compatible model checkpoint from the Hugging Face Hub. Alongside the model, we also load the processor associated with the model for data preprocessing.
from transformers import AutoImageProcessor, AutoModel
model_ckpt = "nateraw/vit-base-beans"
processor = AutoImageProcessor.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
In this case, the checkpoint was obtained by fine-tuning a Vision Transformer-based model on the beans dataset.
Some questions that might arise here:
Q1: Why did we not use AutoModelForImageClassification?
This is because we want to obtain dense representations of the images and not discrete categories, which are what AutoModelForImageClassification would have provided.
Q2: Why this checkpoint in particular?
As mentioned earlier, we're using a specific dataset to build the system. So, instead of using a generalist model (like the ones trained on the ImageNet-1k dataset, for example), it's better to use a model that has been fine-tuned on the dataset being used. That way, the underlying model better understands the input images.
Note that you can also use a checkpoint that was obtained through self-supervised pre-training. The checkpoint doesn't necessarily have to come from supervised learning. In fact, when pre-trained well, self-supervised models can yield impressive retrieval performance.
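For example, a self-supervised checkpoint (here DINO's facebook/dino-vits16, used purely as an illustration of the idea) can be loaded in exactly the same way:

from transformers import AutoImageProcessor, AutoModel

# A self-supervised (DINO) ViT checkpoint, loaded just like the fine-tuned one.
ssl_ckpt = "facebook/dino-vits16"
ssl_processor = AutoImageProcessor.from_pretrained(ssl_ckpt)
ssl_model = AutoModel.from_pretrained(ssl_ckpt)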
Now that we have a model for computing the embeddings, we need some candidate images to query against.
Loading a dataset for candidate images
Later on, we'll be building hash tables mapping the candidate images to hashes. During query time, we'll use these hash tables. We'll talk more about hash tables in the respective section, but for now, to have a set of candidate images, we'll use the train split of the beans dataset.
from datasets import load_dataset
dataset = load_dataset("beans")
This is how a single sample from the training split looks:
The dataset has three features:
dataset["train"].features
>>> {'image_file_path': Value(dtype='string', id=None),
'image': Image(decode=True, id=None),
'labels': ClassLabel(names=['angular_leaf_spot', 'bean_rust', 'healthy'], id=None)}
To demonstrate the image similarity system, we'll use 100 samples from the candidate image dataset to keep the overall runtime short.
num_samples = 100
seed = 42
candidate_subset = dataset["train"].shuffle(seed=seed).select(range(num_samples))
The process of finding similar images
Below, you can find a pictorial overview of the process underlying fetching similar images.
Breaking down the above figure a bit, we have:
- Extract the embeddings from the candidate images (candidate_subset), storing them in a matrix.
- Take a query image and extract its embeddings.
- Iterate over the embedding matrix (computed in step 1) and compute the similarity score between the query embedding and the current candidate embeddings. We usually maintain a dictionary-like mapping between some identifier of the candidate image and the similarity scores.
- Sort the mapping structure w.r.t. the similarity scores and return the underlying identifiers. We use these identifiers to fetch the candidate samples.
We can write a simple utility and map() it to our dataset of candidate images to compute the embeddings efficiently.
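The utility below relies on a transformation_chain for preprocessing the images, which isn't shown here. A minimal sketch of what it could look like, built with torchvision from the processor's configuration (assuming processor.size is a dict with a "height" key, as in recent transformers versions):

import torchvision.transforms as T

# Preprocessing chain mirroring the processor's configuration:
# resize, center crop, tensor conversion, and normalization.
transformation_chain = T.Compose(
    [
        T.Resize(int((256 / 224) * processor.size["height"])),
        T.CenterCrop(processor.size["height"]),
        T.ToTensor(),
        T.Normalize(mean=processor.image_mean, std=processor.image_std),
    ]
)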
import torch


def extract_embeddings(model: torch.nn.Module):
    """Utility to compute embeddings."""
    device = model.device

    def pp(batch):
        images = batch["image"]
        # Apply the preprocessing chain to each image and batch them together.
        image_batch_transformed = torch.stack(
            [transformation_chain(image) for image in images]
        )
        new_batch = {"pixel_values": image_batch_transformed.to(device)}
        # Use the [CLS] token representation as the image embedding.
        with torch.no_grad():
            embeddings = model(**new_batch).last_hidden_state[:, 0].cpu()
        return {"embeddings": embeddings}

    return pp
And we can map extract_embeddings() like so:
device = "cuda" if torch.cuda.is_available() else "cpu"
extract_fn = extract_embeddings(model.to(device))

# Batch size for the map() call; adjust it to fit your hardware.
batch_size = 24
candidate_subset_emb = candidate_subset.map(extract_fn, batched=True, batch_size=batch_size)
Next, for convenience, we create a list containing the identifiers of the candidate images.
from tqdm.auto import tqdm

candidate_ids = []

for id in tqdm(range(len(candidate_subset_emb))):
    label = candidate_subset_emb[id]["labels"]

    # Create a unique identifier of the form "<sample index>_<label>".
    entry = str(id) + "_" + str(label)
    candidate_ids.append(entry)
We'll use the matrix of the embeddings of all the candidate images for computing the similarity scores with a query image. We have already computed the candidate image embeddings. In the next cell, we just gather them together in a matrix.
import numpy as np

all_candidate_embeddings = np.array(candidate_subset_emb["embeddings"])
all_candidate_embeddings = torch.from_numpy(all_candidate_embeddings)
We'll use cosine similarity to compute the similarity score between two embedding vectors. We'll then use it to fetch similar candidate samples given a query sample.
def compute_scores(emb_one, emb_two):
    """Computes cosine similarity between two vectors."""
    scores = torch.nn.functional.cosine_similarity(emb_one, emb_two)
    return scores.numpy().tolist()


def fetch_similar(image, top_k=5):
    """Fetches the `top_k` similar images with `image` as the query."""
    # Prepare the input query image for embedding computation.
    image_transformed = transformation_chain(image).unsqueeze(0)
    new_batch = {"pixel_values": image_transformed.to(device)}

    # Compute the embedding of the query image.
    with torch.no_grad():
        query_embeddings = model(**new_batch).last_hidden_state[:, 0].cpu()

    # Compute similarity scores with all the candidate images at one go.
    sim_scores = compute_scores(all_candidate_embeddings, query_embeddings)
    similarity_mapping = dict(zip(candidate_ids, sim_scores))

    # Sort the mapping dictionary and keep the `top_k` candidates.
    similarity_mapping_sorted = dict(
        sorted(similarity_mapping.items(), key=lambda x: x[1], reverse=True)
    )
    id_entries = list(similarity_mapping_sorted.keys())[:top_k]

    ids = list(map(lambda x: int(x.split("_")[0]), id_entries))
    labels = list(map(lambda x: int(x.split("_")[-1]), id_entries))
    return ids, labels
Perform a query
Given all the utilities, we're equipped to do a similarity search. Let's take a query image from the test split of the beans dataset:
test_idx = np.random.choice(len(dataset["test"]))
test_sample = dataset["test"][test_idx]["image"]
test_label = dataset["test"][test_idx]["labels"]
sim_ids, sim_labels = fetch_similar(test_sample)
print(f"Query label: {test_label}")
print(f"Top 5 candidate labels: {sim_labels}")
Results in:
Query label: 0
Top 5 candidate labels: [0, 0, 0, 0, 0]
Looks like our system got the right set of similar images. When visualized, we'd get something like the following:
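A minimal sketch of how the query image and the retrieved candidates could be plotted with matplotlib (plot_images here is a hypothetical helper, not one defined in the post):

import matplotlib.pyplot as plt


def plot_images(images, labels):
    # Show the query image followed by the retrieved candidates in a single row.
    plt.figure(figsize=(20, 10))
    columns = len(images)
    for i, image in enumerate(images):
        ax = plt.subplot(1, columns, i + 1)
        ax.set_title(f"Label: {labels[i]}")
        plt.imshow(image)
        plt.axis("off")
    plt.show()


images = [test_sample] + [candidate_subset_emb[idx]["image"] for idx in sim_ids]
labels = [test_label] + sim_labels
plot_images(images, labels)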
Further extensions and conclusions
We now have a working image similarity system. But in reality, you'll be dealing with a lot more candidate images. Taking that into consideration, our current procedure has multiple drawbacks:
- If we store the embeddings as is, the memory requirements can shoot up quickly, especially when dealing with hundreds of thousands of candidate images. The embeddings are 768-d in our case, which can still be relatively high in the large-scale regime.
- Having high-dimensional embeddings has a direct effect on the subsequent computations involved in the retrieval part.
If we can somehow reduce the dimensionality of the embeddings without disturbing their meaning, we can still maintain a good trade-off between speed and retrieval quality. The accompanying Colab Notebook of this post implements and demonstrates utilities for achieving this with random projection and locality-sensitive hashing.
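To make the intuition concrete (this is only a rough sketch, not the notebook's implementation), random hyperplanes can project each 768-d embedding down to a short binary hash, so that embeddings pointing in similar directions tend to land in the same bucket:

import torch

# Hypothetical hash size; each random hyperplane contributes one bit.
hash_size = 8
embedding_dim = all_candidate_embeddings.shape[-1]  # 768 in our case

# Fixed random hyperplanes (seeded for reproducibility).
torch.manual_seed(42)
random_vectors = torch.randn(embedding_dim, hash_size)

# Project the embeddings and keep only the sign of each projection:
# each row of 0/1 bits acts as the bucket identifier for that image.
projections = all_candidate_embeddings @ random_vectors
hash_codes = (projections > 0).int()
print(hash_codes.shape)  # (num_candidates, hash_size)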
🤗 Datasets offers direct integrations with FAISS, which further simplifies the process of building similarity systems. Let's say you have already extracted the embeddings of the candidate images (the beans dataset) and stored them in a feature called embeddings. You can now easily use the add_faiss_index() method of the dataset to build a dense index:
dataset_with_embeddings.add_faiss_index(column="embeddings")
Once the index is built, dataset_with_embeddings can be used to retrieve the nearest examples given query embeddings with get_nearest_examples():
scores, retrieved_examples = dataset_with_embeddings.get_nearest_examples(
    "embeddings", qi_embedding, k=top_k
)
The method returns scores and the corresponding candidate examples. To know more, you can check out the official documentation and this notebook.
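Tying this back to the earlier snippets, here is a minimal sketch of how dataset_with_embeddings, qi_embedding, and top_k could be produced (these names are assumptions used to connect the pieces, not the notebook's exact code):

import numpy as np
import torch

# Reuse the candidate dataset that already carries an "embeddings" column.
dataset_with_embeddings = candidate_subset_emb

# Compute a query embedding the same way as in fetch_similar().
image_transformed = transformation_chain(test_sample).unsqueeze(0)
with torch.no_grad():
    query_embedding = model(pixel_values=image_transformed.to(device)).last_hidden_state[:, 0].cpu()
qi_embedding = query_embedding.numpy()[0].astype(np.float32)

top_k = 5

With these in place, the add_faiss_index() and get_nearest_examples() calls shown above can be run as-is.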
Finally, you can check out the following Space that builds a mini image similarity application:
In this post, we ran through a quickstart for building image similarity systems. If you found this post interesting, we highly recommend building on top of the concepts we discussed here so you can get more comfortable with the inner workings.
Still looking to learn more? Here are some additional resources that might be useful for you:
