Imports & Data Loading
We start by importing a couple of handy libraries and modules.
import json
from transformers import CLIPProcessor, CLIPTextModelWithProjection
from torch import load, matmul, argsort
from torch.nn.functional import softmax
Next, we’ll import text and image chunks from the Multimodal LLMs and Multimodal Embeddings blog posts. These are saved in .json files, which can be loaded into Python as lists of dictionaries.
# load text chunks
with open('data/text_content.json', 'r', encoding='utf-8') as f:
    text_content_list = json.load(f)

# load images
with open('data/image_content.json', 'r', encoding='utf-8') as f:
    image_content_list = json.load(f)
While I won’t review the data preparation process here, the code I used is on the GitHub repo.
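For reference, each entry in these lists is a dictionary. Based on the fields used later in this post, the entries look roughly like this (the values are illustrative placeholders, not actual repo contents):
# an example text chunk (structure inferred from the search results shown later)
text_content_list[0]
# {'article_title': 'Multimodal Embeddings: An Introduction',
#  'section': 'Contrastive Learning',
#  'text': 'Two key aspects of CL contribute to its effectiveness...'}

# an example image item (structure inferred from the fields used when building the prompt)
image_content_list[0]
# {'article_title': '...',
#  'section': '...',
#  'image_path': '...',
#  'caption': '...'}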
We’ll also load the multimodal embeddings (from CLIP) for each item in text_content_list and image_content_list. These are saved as PyTorch tensors.
# load embeddings
text_embeddings = load('data/text_embeddings.pt', weights_only=True)
image_embeddings = load('data/image_embeddings.pt', weights_only=True)

print(text_embeddings.shape)
print(image_embeddings.shape)
# >> torch.Size([86, 512])
# >> torch.Size([17, 512])
Printing the shape of these tensors, we see they are represented by 512-dimensional embeddings, and that we have 86 text chunks and 17 images.
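While the precomputed embeddings are simply loaded from disk here, a minimal sketch of how the image embeddings could have been generated with CLIP’s vision tower is shown below. This is my own assumption about the preparation step, not the repo’s actual code (which also handles the text side).
# minimal sketch (assumption): precomputing image embeddings with CLIP's vision model
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModelWithProjection

vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# load the images referenced in image_content_list and embed them in one batch
pil_images = [Image.open(item['image_path']).convert('RGB') for item in image_content_list]
pixel_inputs = clip_processor(images=pil_images, return_tensors="pt")

with torch.no_grad():
    image_embeddings = vision_model(**pixel_inputs).image_embeds  # shape: (17, 512)

# save as a PyTorch tensor for later loading
torch.save(image_embeddings, 'data/image_embeddings.pt')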
Multimodal Search
With our knowledge base loaded, we can now define a query for vector search. This consists of translating the input query into an embedding using CLIP, which we do similarly to the examples from the previous post.
# query
query = "What's CLIP's contrastive loss function?"# embed query (4 steps)
# 1) load model
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")
# 2) load data processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
# 3) pre-process text
inputs = processor(text=[query], return_tensors="pt", padding=True)
# 4) compute embeddings with CLIP
outputs = model(**inputs)
# extract embedding
query_embed = outputs.text_embeds
print(query_embed.shape)
# >> torch.Size([1, 512])
Printing the shape, we see we now have a single vector representing the query.
To perform a vector search over the knowledge base, we need to do the following.
- Compute similarities between the query embedding and all of the text and image embeddings.
- Rescale the similarities to range from 0 to 1 via the softmax function.
- Sort the scaled similarities and return the top k results.
- Finally, filter the results to only keep items above a pre-defined similarity threshold.
Here’s what that looks like in code for the text chunks.
# define k and similarity threshold
k = 5
threshold = 0.05

# multimodal search over articles
text_similarities = matmul(query_embed, text_embeddings.T)
# rescale similarities via softmax
temp=0.25
text_scores = softmax(text_similarities/temp, dim=1)
# return top k filtered text results
isorted_scores = argsort(text_scores, descending=True)[0]
sorted_scores = text_scores[0][isorted_scores]
itop_k_filtered = [idx.item()
for idx, score in zip(isorted_scores, sorted_scores)
if score.item() >= threshold][:k]
top_k = [text_content_list[i] for i in itop_k_filtered]
print(top_k)
# top k results
[{'article_title': 'Multimodal Embeddings: An Introduction',
'section': 'Contrastive Learning',
'text': 'Two key aspects of CL contribute to its effectiveness'}]
Above, we see the top text results. Notice we only have one item, even though k=5. This is because the 2nd-5th items were below the 0.05 threshold.
Interestingly, this item doesn’t seem helpful for answering our initial query of “What’s CLIP’s contrastive loss function?” This highlights one of the key challenges of vector search: items similar to a given query may not necessarily help answer it.
One way we can mitigate this issue is to make the search restrictions less stringent by increasing k and lowering the similarity threshold, then hoping the LLM can figure out what’s helpful vs. not.
To do this, I’ll first package the vector search steps into a Python function.
def similarity_search(query_embed, target_embeddings, content_list,
                      k=5, threshold=0.05, temperature=0.5):
    """
    Perform similarity search over embeddings and return top k results.
    """
    # Calculate similarities (uses matmul imported from torch above)
    similarities = matmul(query_embed, target_embeddings.T)

    # Rescale similarities via softmax
    scores = softmax(similarities/temperature, dim=1)

    # Get sorted indices and scores
    sorted_indices = scores.argsort(descending=True)[0]
    sorted_scores = scores[0][sorted_indices]

    # Filter by threshold and get top k
    filtered_indices = [
        idx.item() for idx, score in zip(sorted_indices, sorted_scores)
        if score.item() >= threshold
    ][:k]

    # Get corresponding content items and scores
    top_results = [content_list[i] for i in filtered_indices]
    result_scores = [scores[0][i].item() for i in filtered_indices]

    return top_results, result_scores
Then, set more inclusive search parameters.
# search over text chunks
text_results, text_scores = similarity_search(query_embed, text_embeddings,
    text_content_list, k=15, threshold=0.01, temperature=0.25)

# search over images
image_results, image_scores = similarity_search(query_embed, image_embeddings,
image_content_list, k=5, threshold=0.25, temperature=0.5)
This leads to 15 text results and 1 image result.
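For reference, the text snippets below were printed with a quick inspection loop like the following (a small helper I’m adding here for readability; it isn’t part of the original pipeline).
# print the retrieved text snippets in order
for i, item in enumerate(text_results, start=1):
    print(f"{i} - {item['text']}")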
1 - Two key aspects of CL contribute to its effectiveness
2 - To make a class prediction, we must extract the image logits and evaluate
which class corresponds to the maximum.
3 - Next, we can import a version of the clip model and its associated data
processor. Note: the processor handles tokenizing input text and image
preparation.
4 - The basic idea behind using CLIP for 0-shot image classification is to
pass an image into the model along with a set of possible class labels. Then,
a classification can be made by evaluating which text input is most similar to
the input image.
5 - We can then match the best image to the input text by extracting the text
logits and evaluating which image corresponds to the maximum.
6 - The code for these examples is freely available on the GitHub repository.
7 - We see that (again) the model nailed this simple example. But let's try
some trickier examples.
8 - Next, we'll preprocess the image/text inputs and pass them into the model.
9 - Another practical application of models like CLIP is multimodal RAG, which
consists of the automated retrieval of multimodal context to an LLM. In the
next article of this series, we will see how this works under the hood and
review a concrete example.
10 - Another application of CLIP is essentially the inverse of Use Case 1.
Rather than identifying which text label matches an input image, we can
evaluate which image (in a set) best matches a text input (i.e. query)—in
other words, performing a search over images.
11 - This has sparked efforts toward expanding LLM functionality to include
multiple modalities.
12 - GPT-4o — Input: text, images, and audio. Output: text.
FLUX — Input: text. Output: images.
Suno — Input: text. Output: audio.
13 - The standard approach to aligning disparate embedding spaces is
contrastive learning (CL). A key intuition of CL is to represent different
views of the same information similarly [5].
14 - While the model is less confident about this prediction with a 54.64%
probability, it correctly implies that the image is not a meme.
15 - [8] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Prompting MLLM
Although most of these text results don’t seem helpful for our query, the image result is exactly what we’re looking for. Nevertheless, given these search results, let’s see how LLaMA 3.2 Vision responds to the query.
We’ll first structure the search results as well-formatted strings.
text_context = ""
for text in text_results:
if text_results:
text_context = text_context + "**Article title:** "
+ text['article_title'] + "n"
text_context = text_context + "**Section:** "
+ text['section'] + "n"
text_context = text_context + "**Snippet:** "
+ text['text'] + "nn"
image_context = ""
for image in image_results:
if image_results:
image_context = image_context + "**Article title:** "
+ image['article_title'] + "n"
image_context = image_context + "**Section:** "
+ image['section'] + "n"
image_context = image_context + "**Image Path:** "
+ image['image_path'] + "n"
image_context = image_context + "**Image Caption:** "
+ image['caption'] + "nn"
Note the metadata that accompanies each text and image item. This will help LLaMA better understand the context of the content.
Next, we combine the text and image results into a prompt.
# construct prompt template
prompt = f"""Given the query "{query}" and the next relevant snippets:{text_context}
{image_context}
Please provide a concise and accurate answer to the query, incorporating
relevant information from the provided snippets where possible.
"""
The final prompt is quite long, so I won’t print it here. However, it’s fully displayed in the example notebook on GitHub.
Finally, we can use ollama to pass this prompt to LLaMA 3.2 Vision.
import ollama

ollama.pull('llama3.2-vision')

response = ollama.chat(
model='llama3.2-vision',
messages=[{
'role': 'user',
'content': prompt,
'images': [image["image_path"] for image in image_results]
}]
)
print(response['message']['content'])
The image depicts a contrastive loss function for aligning text and image
representations in multimodal models. The function is designed to minimize the
difference between the similarity of positive pairs (text-image) and negative
pairs (text-text or image-image). This loss function is commonly used in CLIP,
which stands for Contrastive Language-Image Pre-training.

**Key Components:**
* **Positive Pairs:** Text-image pairs where the text describes an image.
* **Negative Pairs:** Text-text or image-image pairs that do not belong to
the same class.
* **Contrastive Loss Function:** Calculates the difference between positive
and negative pairs' similarities.
**How It Works:**
1. **Text-Image Embeddings:** Generate embeddings for both text and images
using a multimodal encoder (e.g., CLIP).
2. **Positive Pair Similarity:** Calculate the similarity score between each
text-image pair.
3. **Negative Pair Similarity:** Calculate the similarity scores between all
negative pairs.
4. **Contrastive Loss Calculation:** Compute the contrastive loss by
minimizing the difference between positive and negative pairs' similarities.
**Advantages:**
* **Multimodal Alignment:** Aligns text and image representations for better
understanding of visual content from text descriptions.
* **Improved Performance:** Enhances performance in downstream tasks like
image classification, retrieval, and generation.
The model correctly picks up that the image contains the information it needs and explains the general intuition of how it works. However, it misunderstands the meaning of positive and negative pairs, thinking that a negative pair corresponds to a pair of the same modality.
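For reference, the standard form of CLIP’s contrastive loss (the symmetric, InfoNCE-style objective from the original CLIP paper) makes the distinction clear: for a batch of N matched image-text pairs, the positives are the N matched pairings and the negatives are the mismatched image-text pairings within the batch, so a negative pair is always cross-modal:

\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[ \log\frac{\exp\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_i, t_j)/\tau\big)} + \log\frac{\exp\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_j, t_i)/\tau\big)} \right]

where v_i and t_i are the image and text embeddings of the i-th pair, sim(·,·) is cosine similarity, and τ is a learned temperature.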
While we went through the implementation details step by step, I packaged everything into a nice UI using Gradio in this notebook on the GitHub repo.
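For readers who want a starting point without opening the notebook, here is a minimal sketch of what such a Gradio wrapper might look like. This is my own simplified version, not the repo’s actual app; it assumes the objects defined earlier (model, processor, similarity_search, the embeddings, the content lists, and ollama) are in scope, and its context formatting is condensed relative to the full prompt template above.
# minimal sketch of a Gradio UI around the pipeline above (not the repo's actual app)
import gradio as gr

def answer_question(query):
    # embed the query with CLIP (same steps as earlier)
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    query_embed = model(**inputs).text_embeds

    # retrieve text and image context
    text_results, _ = similarity_search(query_embed, text_embeddings,
                                        text_content_list, k=15, threshold=0.01, temperature=0.25)
    image_results, _ = similarity_search(query_embed, image_embeddings,
                                         image_content_list, k=5, threshold=0.25, temperature=0.5)

    # condensed context formatting; the full version above also includes
    # article title, section, and image path
    text_context = "".join(f"**Snippet:** {t['text']}\n\n" for t in text_results)
    image_context = "".join(f"**Image Caption:** {i['caption']}\n\n" for i in image_results)

    prompt = f"""Given the query "{query}" and the following relevant snippets:

{text_context}
{image_context}

Please provide a concise and accurate answer to the query, incorporating
relevant information from the provided snippets where possible.
"""

    # ask LLaMA 3.2 Vision via ollama, passing retrieved images alongside the prompt
    response = ollama.chat(
        model='llama3.2-vision',
        messages=[{
            'role': 'user',
            'content': prompt,
            'images': [img["image_path"] for img in image_results]
        }]
    )
    return response['message']['content']

demo = gr.Interface(fn=answer_question,
                    inputs=gr.Textbox(label="Question"),
                    outputs=gr.Markdown(label="Answer"))
demo.launch()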
Multimodal RAG systems can synthesize knowledge stored in a variety of formats, expanding what’s possible with AI. Here, we reviewed 3 simple strategies for developing such a system and then saw an example implementation of a multimodal blog QA assistant.
Although the example worked well enough for this demonstration, there are clear limitations to the search process. A few techniques that may improve it include using a reranker to refine the similarity search results and improving search quality via fine-tuned multimodal embeddings.
If you want to see future posts on these topics, let me know in the comments 🙂
More on Multimodal models 👇