Multimodal models are architectures that simultaneously integrate and process different data types, such as text, images, and audio. Some examples include CLIP and DALL-E from OpenAI, both released in 2021. CLIP understands images and text jointly, allowing it to perform tasks like zero-shot image classification. DALL-E, on the other hand, generates images from textual descriptions, allowing the automation and enhancement of creative processes in gaming, advertising, and literature, among other sectors.
Visual language models (VLMs) are a special case of multimodal models. VLMs generate language based on visual inputs. One prominent example is Paligemma, which Google introduced in May 2024. Paligemma can be used for Visual Question Answering, object detection, and image segmentation.
Some blog posts explore the capabilities of Paligemma in object detection, such as this excellent read from Roboflow:
However, by the time I wrote this blog, the existing documentation on preparing data to use Paligemma for object segmentation was vague. That is why I wanted to evaluate whether it is easy to use Paligemma for this task. Here, I share my experience.
Before going into detail on the use case, let’s briefly revisit the inner workings of Paligemma.
Paligemma combines a SigLIP-So400m vision encoder with a Gemma language model to process images and text (see figure above). In the new version of Paligemma released in December of this year, the vision encoder can preprocess images at three different resolutions: 224px, 448px, or 896px. The vision encoder preprocesses an image and outputs a sequence of image tokens, which are linearly combined with input text tokens. This combination of tokens is further processed by the Gemma language model, which outputs text tokens. The Gemma model comes in different sizes, from 2B to 27B parameters.
An example of model output is shown in the next figure.
The Paligemma model was trained on various datasets such as WebLI, OpenImages, WIT, and others (see this Kaggle blog for more details). This means that Paligemma can identify objects without fine-tuning. However, such abilities are limited. That's why Google recommends fine-tuning Paligemma for domain-specific use cases.
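As an illustration of this zero-shot ability, here is a minimal sketch of how a pre-trained Paligemma checkpoint can be queried through Hugging Face Transformers; the checkpoint name, image file, and prompt are placeholders, not the exact setup used later in this post.

```python
# Minimal zero-shot sketch (assumes access to the Paligemma weights on Hugging Face).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("some_image.png").convert("RGB")  # any test image
prompt = "detect boat"  # other options: "segment water", "caption en", ...

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# The decoded text contains the prompt followed by location/segmentation tokens and the object name.
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```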
Input format
To fine-tune Paligemma, the input data must be in JSONL format. A dataset in JSONL format has each line as a separate JSON object, like a list of individual records. Each JSON object contains the following keys:
Image: The image's name.
Prefix: This specifies the task you want the model to perform.
Suffix: This provides the ground truth the model learns to make predictions.
Depending on the task, you must change the JSON object's prefix and suffix accordingly. Here are some examples:
{"image": "some_filename.png",
"prefix": "caption en" (To point that the model should generate an English caption for a picture),
"suffix": "That is a picture of a giant, white boat traveling within the ocean."
}
{"image": "another_filename.jpg",
"prefix": "What number of individuals are within the image?",
"suffix": "ten"
}
{"image": "filename.jpeg",
"prefix": "detect airplane",
"suffix": " airplane" (4 corner bounding box coords)
}
If you have several categories to be detected, separate them with a semicolon (;) in both the prefix and the suffix.
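For instance, a hypothetical two-category entry could look like this (the filename and location tokens are placeholders):
{"image": "some_scene.jpg",
"prefix": "detect airplane ; car",
"suffix": "<loc[value]><loc[value]><loc[value]><loc[value]> airplane ; <loc[value]><loc[value]><loc[value]><loc[value]> car"
}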
A complete and clear explanation of how to prepare the data for object detection in Paligemma can be found in this Roboflow post.
{"image": "filename.jpeg",
"prefix": "detect airplane",
"suffix": " airplane"
}
Note that for segmentation, apart from the object's bounding box coordinates, you must specify 16 extra segmentation tokens representing a mask that fits within the bounding box. According to Google's Big Vision repository, those tokens are codewords with 128 entries (<seg000>–<seg127>).
If you are interested in learning more about Paligemma, I recommend these blogs:
As mentioned above, Paligemma was trained on different datasets. Therefore, this model is expected to be good at segmenting “traditional” objects such as cars, people, or animals. But what about segmenting objects in satellite images? This question led me to explore Paligemma’s capabilities for segmenting water in satellite images.
Kaggle’s Satellite Image of Water Bodies dataset is suitable for this purpose. This dataset contains 2,841 images with their corresponding masks.
Some masks in this dataset were incorrect, and others needed further preprocessing. Faulty examples include masks in which all values were set to water while only a small portion of water was present in the original image, and masks that did not correspond to their RGB images. In addition, when an image is rotated, some masks mark the resulting empty areas as if they contained water.
Given these data limitations, I selected a sample of 164 images whose masks did not have any of the issues mentioned above. This set of images is used to fine-tune Paligemma.
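As a rough illustration of this filtering step, the sketch below flags masks that are almost entirely water; the folder path and the 90% threshold are assumptions for this example, not the exact rule I applied.

```python
# Sketch: flag masks whose water coverage looks suspicious (path and threshold are assumed).
import glob

import numpy as np
from PIL import Image

suspicious = []
for mask_path in glob.glob("Water Bodies Dataset/Masks/*.jpg"):
    mask = np.array(Image.open(mask_path).convert("L")) > 127  # water pixels are white
    coverage = mask.mean()                                     # fraction of the image labeled as water
    if coverage > 0.9:                                         # almost the whole image marked as water
        suspicious.append((mask_path, coverage))

print(f"{len(suspicious)} masks flagged for manual review")
```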
Preparing the JSONL dataset
As explained in the previous section, Paligemma needs entries that represent the object's bounding box coordinates in normalized image space (<loc0000>–<loc1023>), plus 16 extra segmentation tokens representing the mask (<seg000>–<seg127>).
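For reference, here is a small sketch of how a bounding box in pixel coordinates can be mapped to location tokens, following the (y_min, x_min, y_max, x_max) convention with 1024 bins described in the resources above; treat it as an illustration rather than the exact code from my repository.

```python
# Sketch: encode a bounding box as Paligemma location tokens.
# Convention: four tokens in (y_min, x_min, y_max, x_max) order, binned into 1024 values.
def box_to_loc_tokens(x_min, y_min, x_max, y_max, width, height, n_bins=1024):
    def to_bin(value, size):
        return min(int(value / size * n_bins), n_bins - 1)

    coords = [
        to_bin(y_min, height),
        to_bin(x_min, width),
        to_bin(y_max, height),
        to_bin(x_max, width),
    ]
    return "".join(f"<loc{c:04d}>" for c in coords)

# Example: a box covering the lower-right quadrant of a 224x224 image.
print(box_to_loc_tokens(112, 112, 224, 224, width=224, height=224))
# -> <loc0512><loc0512><loc1023><loc1023>
```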
By the time I wrote this blog (beginning of December), Google announced the second version of Paligemma. Following this event, Roboflow published a nice overview of preparing data to fine-tune Paligemma2 for various applications, including image segmentation. I use part of their code to finally obtain the correct segmentation codewords. What was my mistake? Well, to begin with, the masks must be resized to a tensor of shape [None, 64, 64, 1] and then passed through a pre-trained variational auto-encoder (VAE) to convert the annotation masks into text labels. Although the usage of a VAE model was briefly mentioned in the Big Vision repository, there is no explanation or example of how to use it.
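To give an idea of that first step, the sketch below crops a mask to the object's bounding box and resizes it to the 64×64 shape the VAE expects; the actual encoding into the 16 segmentation codewords relies on Big Vision's pre-trained VAE checkpoint (see convert.py in my repository) and is only indicated by a comment here. The file name and bounding box values are placeholders.

```python
# Sketch: bring a mask to the [None, 64, 64, 1] shape expected by the codeword VAE.
import numpy as np
from PIL import Image

mask = np.array(Image.open("some_mask.jpg").convert("L")) > 127     # binary water mask
x_min, y_min, x_max, y_max = 50, 80, 180, 200                       # object's bounding box (example values)

crop = mask[y_min:y_max, x_min:x_max].astype(np.uint8) * 255
crop_64 = np.array(Image.fromarray(crop).resize((64, 64)))
mask_tensor = (crop_64 > 127).astype(np.float32)[None, :, :, None]  # shape (1, 64, 64, 1)

# The 16 segmentation codewords are then produced by Big Vision's pre-trained VAE;
# that call is omitted here (see convert.py in the repository).
```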
The workflow I use to prepare the data for fine-tuning Paligemma is shown below:
As you can see, the number of steps needed to prepare the data for Paligemma is large, so I don't share all the code snippets here. However, if you want to explore the code, you can visit this GitHub repository. The script convert.py has all the steps mentioned in the workflow shown above. I also added the selected images so you can play with this script immediately.
When converting the segmentation codewords back to segmentation masks, we can see how these masks cover the water bodies in the images:
Before fine-tuning Paligemma, I tried its segmentation capabilities on the models uploaded to Hugging Face. This platform has a demo where you can upload images and interact with different Paligemma models.
The current version of Paligemma is generally good at segmenting water in satellite images, but it's not perfect. Let's see if we can improve these results!
There are two ways to fine-tune Paligemma: either through Hugging Face's Transformers library or by using Big Vision and JAX. I went for the latter option. Big Vision provides a Colab notebook, which I modified for my use case. You can open it by going to my GitHub repository:
I used a batch size of 8 and a learning rate of 0.003. I ran the training loop twice, which translates to 158 training steps. The total running time using a T4 GPU machine was 24 minutes.
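For clarity, these settings are summarized below as a plain Python dictionary; the key names are illustrative and do not correspond to Big Vision's actual config fields.

```python
# Fine-tuning settings used in the modified Colab notebook (illustrative key names).
finetune_settings = {
    "batch_size": 8,
    "learning_rate": 0.003,
    "train_steps": 158,       # from running the training loop twice
    "image_resolution": 224,  # the first Paligemma version expects 224x224 inputs
}
```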
The results were not as expected. Paligemma did not produce predictions for some images, and in others, the resulting masks were far from the ground truth. I also obtained segmentation codewords with more than 16 tokens in two images.
It's worth mentioning that I used the first Paligemma version. Perhaps the results would improve when using Paligemma2 or by tweaking the batch size or learning rate further. In any case, these experiments are out of the scope of this blog.
The demo results show that the default Paligemma model is better at segmenting water than my fine-tuned model. In my opinion, UNET is a better architecture if the aim is to build a model specialized in segmenting objects. For more information on how to train such a model, you can read my previous blog post:
Other limitations:
I want to mention some other challenges I encountered when fine-tuning Paligemma using Big Vision and JAX.
- Setting up different model configurations is difficult because there is still little documentation on those parameters.
- The first version of Paligemma was trained to handle images of different aspect ratios resized to 224×224. Make sure to resize your input images to this size only; this will prevent exceptions from being raised (see the resizing sketch after this list).
- When fine-tuning with Big Vision and JAX, you might run into JAX GPU-related problems. Ways to overcome this issue are:
a. Reducing the number of samples in your training and validation datasets.
b. Increasing the batch size from 8 to 16 or higher.
- The fine-tuned model has a size of ~5GB. Make sure you have enough space in your Drive to store it.
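Regarding the 224×224 input size mentioned above, here is a minimal resizing sketch (the file name is a placeholder):

```python
# Sketch: resize an input image to the 224x224 resolution expected by the first Paligemma version.
from PIL import Image

image = Image.open("satellite_tile.png").convert("RGB")
image_224 = image.resize((224, 224))
image_224.save("satellite_tile_224.png")
```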
Discovering a new AI model is exciting, especially in this age of multimodal algorithms transforming our society. However, working with state-of-the-art models can sometimes be difficult due to the lack of available documentation. Therefore, the launch of a new AI model should be accompanied by comprehensive documentation to ensure its smooth and widespread adoption, especially among professionals who are still inexperienced in this area.
Despite the difficulties I encountered fine-tuning Paligemma, the current pre-trained models are powerful at doing zero-shot object detection and image segmentation, which can be used for many applications, including assisted ML labeling.
Are you using Paligemma in your Computer Vision projects? Share your experience fine-tuning this model in the comments!
I hope you enjoyed this post. Once again, thanks for reading!
You can contact me via LinkedIn at:
https://www.linkedin.com/in/camartinezbarbosa/
Acknowledgments: I want to thank José Celis-Gil for all the fruitful discussions on data preprocessing and modeling.