This guide introduces BLIP-2 from Salesforce Research, which enables a suite of state-of-the-art visual-language models that are now available in 🤗 Transformers.
We’ll show you how to use it for image captioning, prompted image captioning, visual question answering, and chat-based prompting.
Table of contents
- Introduction
- What’s under the hood in BLIP-2?
- Using BLIP-2 with Hugging Face Transformers
- Conclusion
- Acknowledgments
Introduction
Recent years have seen rapid advancements in computer vision and natural language processing. Still, many real-world
problems are inherently multimodal – they involve several distinct forms of information, such as images and text.
Visual-language models face the challenge of combining modalities so that they can open the door to a wide range of
applications. Image-to-text tasks that visual language models can tackle include image captioning, image-text
retrieval, and visual question answering. Image captioning can aid the visually impaired, create useful product descriptions,
identify inappropriate content beyond text, and more. Image-text retrieval can be applied in multimodal search, as well
as in applications such as autonomous driving. Visual question answering can aid in education, enable multimodal chatbots,
and assist in various domain-specific information retrieval applications.
Modern computer vision and natural language models have become more capable; however, they have also grown significantly
in size compared with their predecessors. While pre-training a single-modality model is already resource-consuming and expensive,
the cost of end-to-end vision-and-language pre-training has become increasingly prohibitive.
BLIP-2 tackles this challenge by introducing a new visual-language pre-training paradigm that can potentially leverage
any combination of pre-trained vision encoder and LLM without having to pre-train the whole architecture end to end.
This enables achieving state-of-the-art results on multiple visual-language tasks while significantly reducing the number
of trainable parameters and pre-training costs. Furthermore, this approach paves the way for a multimodal ChatGPT-like model.
What’s under the hood in BLIP-2?
BLIP-2 bridges the modality gap between vision and language models by adding a lightweight Querying Transformer (Q-Former)
between an off-the-shelf frozen pre-trained image encoder and a frozen large language model. Q-Former is the only
trainable part of BLIP-2; both the image encoder and the language model remain frozen.
Q-Former is a transformer model that consists of two submodules that share the same self-attention layers:
- an image transformer that interacts with the frozen image encoder for visual feature extraction
- a text transformer that can function as both a text encoder and a text decoder
The image transformer extracts a fixed number of output features from the image encoder, independent of input image resolution,
and receives learnable query embeddings as input. The queries can additionally interact with the text through the same self-attention layers.
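To make this fixed-size bottleneck concrete, here is a minimal, simplified sketch – not the actual BLIP-2 implementation – of how a fixed set of learnable query embeddings can cross-attend to frozen image features and always produce the same number of outputs. The sizes (32 queries, hidden size 768) and names are illustrative assumptions:
import torch
import torch.nn as nn

# Illustrative sketch only: a fixed set of learnable queries cross-attends to
# frozen image features, so the output shape stays the same no matter how many
# patch tokens the image encoder produces.
num_queries, hidden_dim = 32, 768  # assumed sizes for illustration
queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
cross_attention = nn.MultiheadAttention(hidden_dim, num_heads=12, batch_first=True)

image_features = torch.randn(1, 257, hidden_dim)  # stand-in for frozen ViT patch features
query_outputs, _ = cross_attention(queries, image_features, image_features)
print(query_outputs.shape)  # torch.Size([1, 32, 768]) -- fixed, regardless of image resolution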
Q-Former is pre-trained in two stages. In the first stage, the image encoder is frozen, and Q-Former is trained with three losses:
- Image-text contrastive loss: the pairwise similarity between each query output and the text output’s CLS token is calculated, and the highest one is picked (see the sketch after this list). Query embeddings and text don’t “see” each other.
- Image-grounded text generation loss: queries can attend to each other but not to the text tokens, while the text has a causal mask and can attend to all of the queries.
- Image-text matching loss: queries and text can see each other, and a logit is obtained to indicate whether the text matches the image or not. To obtain negative examples, hard negative mining is used.
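For intuition, here is a rough, hedged sketch of the image-text contrastive similarity described in the first bullet; the variable names and dimensions are made up for illustration and are not taken from the BLIP-2 code:
import torch

# Rough illustration of the image-text contrastive similarity described above:
# compare every query output with the text [CLS] embedding and keep the highest score.
query_outputs = torch.randn(4, 32, 256)  # (batch, num_queries, projection dim) -- assumed sizes
text_cls = torch.randn(4, 256)           # (batch, projection dim)
per_query_sims = torch.einsum("bqd,bd->bq", query_outputs, text_cls)
image_text_sim = per_query_sims.max(dim=-1).values  # highest-similarity query per pair
print(image_text_sim.shape)  # torch.Size([4])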
In the second pre-training stage, the query embeddings now contain the visual information relevant to the text, since it has
passed through an information bottleneck. These embeddings are then used as a visual prefix to the input of the LLM. This
pre-training phase effectively involves an image-grounded text generation task using the causal LM loss.
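As a hedged sketch of this second stage – with assumed dimensions and placeholder tensors, not the real code – the query outputs are projected to the LLM’s hidden size and prepended to the embedded text tokens as a visual prefix:
import torch
import torch.nn as nn

# Simplified illustration: project Q-Former query outputs to the LLM hidden size
# and prepend them to the text embeddings as a visual prefix for the frozen LLM.
query_outputs = torch.randn(1, 32, 768)       # stand-in Q-Former outputs (assumed sizes)
language_projection = nn.Linear(768, 2560)    # 2560 = assumed LLM hidden size
visual_prefix = language_projection(query_outputs)  # (1, 32, 2560)
text_embeds = torch.randn(1, 16, 2560)        # stand-in embedded text tokens
llm_inputs_embeds = torch.cat([visual_prefix, text_embeds], dim=1)  # fed to the frozen LLM
print(llm_inputs_embeds.shape)  # torch.Size([1, 48, 2560])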
As a visual encoder, BLIP-2 uses ViT, and for the LLM, the paper authors used OPT and Flan T5 models. You can find
pre-trained checkpoints for both OPT and Flan T5 on the Hugging Face Hub.
However, as mentioned before, the introduced pre-training approach allows combining any visual backbone with any LLM.
Using BLIP-2 with Hugging Face Transformers
Using Hugging Face Transformers, you can easily download and run a pre-trained BLIP-2 model on your images. Make sure to use a GPU environment with high RAM if you’d like to follow along with the examples in this blog post.
Let’s start by installing Transformers. As this model has been added to Transformers very recently, we need to install Transformers from source:
pip install git+https://github.com/huggingface/transformers.git
Next, we’ll need an input image. Every week, The New Yorker runs a cartoon captioning contest
among its readers, so let’s take one of these cartoons to put BLIP-2 to the test.
import requests
from PIL import Image
url = "https://media.newyorker.com/cartoons/63dc6847be24a6a76d90eb99/master/w_1160,c_limit/230213_a26611_838.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
display(image.resize((596, 437)))
We have an input image. Now we need a pre-trained BLIP-2 model and the corresponding preprocessor to prepare the inputs. You
can find the list of all available pre-trained checkpoints on the Hugging Face Hub.
Here, we’ll load a BLIP-2 checkpoint that leverages the pre-trained OPT model by Meta AI, which has 2.7 billion parameters.
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
Notice that BLIP-2 is a rare case where you cannot load the model with the Auto API (e.g. AutoModelForXXX); you need to
explicitly use Blip2ForConditionalGeneration. However, you can use AutoProcessor to fetch the appropriate processor
class – Blip2Processor in this case.
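If you prefer a Flan T5-based checkpoint, the same loading pattern should work as well – for example, Salesforce/blip2-flan-t5-xl on the Hub (a larger download; the rest of this post sticks with the OPT checkpoint):
# Alternative (not used in the rest of this post): a Flan T5-based BLIP-2 checkpoint.
flan_processor = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xl")
flan_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16)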
Let’s use a GPU to make text generation faster:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
Image Captioning
Let’s find out if BLIP-2 can caption a New Yorker cartoon in a zero-shot manner. To caption an image, we do not have to
provide any text prompt to the model, only the preprocessed input image. Without any text prompt, the model will start
generating text from the BOS (beginning-of-sequence) token, thus creating a caption.
inputs = processor(image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"two cartoon monsters sitting around a campfire"
This is an impressively accurate description for a model that wasn’t trained on New Yorker-style cartoons!
Prompted image captioning
We can extend image captioning by providing a text prompt, which the model will continue given the image.
prompt = "this can be a cartoon of"
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"two monsters sitting around a campfire"
prompt = "they give the impression of being like they're"
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"having time"
Visual question answering
For visual question answering, the prompt has to follow a specific format:
“Question: {} Answer:”
prompt = "Query: What's a dinosaur holding? Answer:"
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"A torch"
Chat-based prompting
Finally, we can create a ChatGPT-like interface by concatenating each generated response to the conversation. We prompt
the model with some text (like “What is a dinosaur holding?”), the model generates an answer for it (“a torch”), which we
can concatenate to the conversation. Then we do it again, building up the context.
However, make sure that the context doesn’t exceed 512 tokens, as this is the context length of the language models used by BLIP-2 (OPT and Flan T5).
context = [
("What is a dinosaur holding?", "a torch"),
("Where are they?", "In the woods.")
]
query = "What for?"
template = "Query: {} Answer: {}."
prompt = " ".join([template.format(context[i][0], context[i][1]) for i in range(len(context))]) + " Query: " + query + " Answer:"
print(prompt)
Question: What is a dinosaur holding? Answer: a torch. Question: Where are they? Answer: In the woods.. Question: What for? Answer:
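Since the language model’s context is limited to 512 tokens, it can be useful to check the prompt length before generating. Here is a small optional sanity check, not part of the original example, that assumes the processor exposes its underlying tokenizer:
# Optional sanity check: make sure the prompt stays within the 512-token context window.
num_tokens = len(processor.tokenizer(prompt).input_ids)
print(num_tokens)  # should be well below 512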
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
To light a fire.
Conclusion
BLIP-2 is a zero-shot visual-language model that could be used for multiple image-to-text tasks with image and image and
text prompts. It’s an efficient and efficient approach that could be applied to image understanding in quite a few scenarios,
especially when examples are scarce.
The model bridges the gap between vision and natural language modalities by adding a transformer between pre-trained models.
The brand new pre-training paradigm allows this model to maintain up with the advances in each individual modalities.
If you’d like to learn how to fine-tune BLIP-2 models for various vision-language tasks, check out the LAVIS library by Salesforce,
which offers comprehensive support for model training.
To see BLIP-2 in action, try its demo on Hugging Face Spaces.
Acknowledgments
Many due to the Salesforce Research team for working on BLIP-2, Niels Rogge for adding BLIP-2 to 🤗 Transformers, and
to Omar Sanseviero for reviewing this blog post.
