Segmenting Water in Satellite Images Using Paligemma


Multimodal models are architectures that concurrently integrate and process different data types, such as text, images, and audio. Some examples include CLIP and DALL-E from OpenAI, both released in 2021. CLIP understands images and text jointly, allowing it to perform tasks like zero-shot image classification. DALL-E, on the other hand, generates images from textual descriptions, allowing the automation and enhancement of creative processes in gaming, advertising, and literature, among other sectors.

Visual language models (VLMs) are a special case of multimodal models. VLMs generate language based on visual inputs. One notable example is Paligemma, which Google introduced in May 2024. Paligemma can be used for Visual Question Answering, object detection, and image segmentation.

Some blog posts explore the capabilities of Paligemma in object detection, such as this excellent read from Roboflow:

However, at the time I wrote this blog, the existing documentation on preparing data to use Paligemma for object segmentation was vague. That is why I wanted to evaluate how easy it is to use Paligemma for this task. Here, I share my experience.

Before going into detail on the use case, let’s briefly revisit the inner workings of Paligemma.

Architecture of Paligemma2. Source: https://arxiv.org/abs/2412.03555

Paligemma combines a SigLIP-So400m vision encoder with a Gemma language model to process images and text (see figure above). In the second version of Paligemma, released in December 2024, the vision encoder can preprocess images at three different resolutions: 224px, 448px, or 896px. The vision encoder preprocesses an image and outputs a sequence of image tokens, which are linearly combined with the input text tokens. This combination of tokens is further processed by the Gemma language model, which outputs text tokens. The Gemma model comes in different sizes, from 2B to 27B parameters.

An example of model output is shown in the next figure.

Example of an object segmentation output. Source: https://arxiv.org/abs/2412.03555

The Paligemma model was trained on various datasets such as WebLI, OpenImages, WIT, and others (see this Kaggle blog for more details). This means that Paligemma can identify objects without fine-tuning. However, such abilities are limited. That's why Google recommends fine-tuning Paligemma for domain-specific use cases.
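To see this zero-shot ability in code, here is a minimal inference sketch using the Hugging Face Transformers integration (the checkpoint name, file name, and prompt are my own choices; the checkpoint is gated, so you need to request access on Hugging Face first):

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import torch

model_id = "google/paligemma-3b-mix-224"  # gated checkpoint; request access on Hugging Face
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

image = Image.open("satellite.png").convert("RGB")  # placeholder file name
prompt = "segment water"  # other tasks: "caption en", "detect water", ...
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

# Keep only the newly generated tokens (the suffix), not the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(generated[0][prompt_len:], skip_special_tokens=True))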

Input format

To fine-tune Paligemma, the input data must be in JSONL format. A dataset in JSONL format has each line as a separate JSON object, like a list of individual records. Each JSON object contains the following keys:

Image: The image’s name.

Prefix: This specifies the task you want the model to perform.

Suffix: This provides the ground truth the model learns to predict.

Depending on the task, you must change the JSON object's prefix and suffix accordingly. Here are some examples:

{"image": "some_filename.png", 
"prefix": "caption en" (To point that the model should generate an English caption for a picture),
"suffix": "That is a picture of a giant, white boat traveling within the ocean."
}
{"image": "another_filename.jpg", 
"prefix": "What number of individuals are within the image?",
"suffix": "ten"
}
{"image": "filename.jpeg", 
"prefix": "detect airplane",
"suffix": " airplane" (4 corner bounding box coords)
}

If you have several categories to detect, separate them with a semicolon (;) in both the prefix and the suffix, as in the example below.
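For example, a two-category detection entry would look as follows, where each <loc####> is a placeholder for one of the four normalized box-corner tokens of the corresponding object:

{"image": "filename.jpeg", 
"prefix": "detect airplane ; car",
"suffix": "<loc####><loc####><loc####><loc####> airplane ; <loc####><loc####><loc####><loc####> car"
}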

A complete and clear explanation of how to prepare the data for object detection in Paligemma can be found in this Roboflow post.

{"image": "filename.jpeg", 
"prefix": "detect airplane",
"suffix": " airplane"
}

Note that for segmentation, apart from the object's bounding box coordinates, you must specify 16 extra segmentation tokens representing a mask that fits within the bounding box. According to Google's Big Vision repository, those tokens are codewords from a codebook with 128 entries (<seg000> to <seg127>). How do we obtain these values? In my personal experience, it was difficult and frustrating to get them without proper documentation. But I'll give more details later.
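As a concrete reference, this is how such a suffix can be parsed in plain Python (the token values below are made up for illustration):

import re

suffix = "<loc0123><loc0456><loc0789><loc1000>" + "<seg064>" * 16 + " water"  # illustrative values

# The four <loc####> tokens encode ymin, xmin, ymax, xmax in 1024 bins (0-1023).
locs = [int(v) for v in re.findall(r"<loc(\d{4})>", suffix)]
# The sixteen <seg###> tokens are indices into the 128-entry codebook.
segs = [int(v) for v in re.findall(r"<seg(\d{3})>", suffix)]
label = re.sub(r"<loc\d{4}>|<seg\d{3}>", "", suffix).strip()

print(locs, len(segs), label)  # -> [123, 456, 789, 1000] 16 water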

If you are interested in learning more about Paligemma, I recommend these blogs:

As mentioned above, Paligemma was trained on different datasets. Therefore, this model is expected to be good at segmenting "traditional" objects such as cars, people, or animals. But what about segmenting objects in satellite images? This question led me to explore Paligemma's capabilities for segmenting water in satellite images.

Kaggle’s Satellite Image of Water Bodies dataset is suitable for this purpose. This dataset comprises 2841 images with their corresponding masks.

Here's an example of the water bodies dataset: The RGB image is shown on the left, while the corresponding mask appears on the right.

Some masks in this dataset were incorrect, and others needed further preprocessing. Faulty examples include masks with all values set to water although only a small portion of water was present in the original image, and masks that didn't correspond to their RGB images at all. In addition, some masks were rotated; when read in Python, the area outside the rotated content appears as if it contained water.

Example of a rotated mask. When reading this image in Python, the area outside the image appears as if it contained water. In this case, image rotation is required to correct this mask. Image made by the author.

Given these data limitations, I selected a sample of 164 images whose masks didn't have any of the issues mentioned above. This set of images is used to fine-tune Paligemma.
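To give an idea of the kind of check I ran, here is a simple water-fraction filter that flags suspicious masks for manual review (the folder layout and thresholds are my own hypothetical choices, not part of the dataset tooling):

import numpy as np
from PIL import Image
from pathlib import Path

def water_fraction(mask_path):
    # Fraction of pixels labeled as water in a binary mask (white = water).
    mask = np.array(Image.open(mask_path).convert("L")) > 127
    return float(mask.mean())

for mask_path in sorted(Path("Water Bodies Dataset/Masks").glob("*.jpg")):  # hypothetical path
    frac = water_fraction(mask_path)
    if frac > 0.98 or frac < 0.01:  # arbitrary thresholds for manual review
        print(f"check manually: {mask_path.name} (water fraction = {frac:.2f})")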

Preparing the JSONL dataset

As explained in the previous section, Paligemma needs entries that represent the object's bounding box coordinates in normalized image-space (the <loc####> tokens) plus 16 extra segmentation tokens representing 128 different codewords (the <seg###> tokens). Obtaining the bounding box coordinates in the desired format was easy, thanks to Roboflow's explanation. But how do we obtain the 128 codewords from the masks? There was no clear documentation or examples in the Big Vision repository that I could use for my use case. I naively assumed that the process of creating the segmentation tokens was similar to that of creating the bounding boxes. However, this led to an incorrect representation of the water masks, which in turn led to flawed prediction results.

By the time I wrote this blog (beginning of December), Google had announced the second version of Paligemma. Following this event, Roboflow published a nice overview of preparing data to fine-tune Paligemma2 for various applications, including image segmentation. I use part of their code to finally obtain the correct segmentation codewords. What was my mistake? Well, to begin with, the masks need to be resized to a tensor of shape [None, 64, 64, 1], which is then passed through a pre-trained variational auto-encoder (VAE) that converts the annotation masks into text labels. Although the usage of a VAE model was briefly mentioned in the Big Vision repository, there is no explanation or example of how to use it.
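To make that step concrete, here is a minimal sketch of the mask-to-suffix conversion. The helper encode_fn stands for a wrapper around the pre-trained VAE encoder (adapted from Big Vision / Roboflow's code); the function names and the label are mine, not the library's:

import numpy as np

def mask_to_suffix(mask, label, encode_fn):
    # mask: binary HxW numpy array; encode_fn: wraps the pre-trained VAE and
    # returns 16 codebook indices in [0, 127] for a mask crop resized to 64x64x1.
    h, w = mask.shape
    ys, xs = np.where(mask > 0)
    ymin, ymax, xmin, xmax = ys.min(), ys.max(), xs.min(), xs.max()

    # Four <loc####> tokens: box corners normalized to 1024 bins, in the order ymin, xmin, ymax, xmax.
    loc = [ymin / h, xmin / w, ymax / h, xmax / w]
    loc_tokens = "".join(f"<loc{int(round(v * 1023)):04d}>" for v in loc)

    # Sixteen <seg###> tokens: crop the mask to the box and encode it with the VAE.
    crop = mask[ymin:ymax + 1, xmin:xmax + 1]
    codes = encode_fn(crop)  # shape (16,), values in [0, 127]
    seg_tokens = "".join(f"<seg{int(c):03d}>" for c in codes)

    return f"{loc_tokens}{seg_tokens} {label}"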

The workflow I use to prepare the data to fine-tune Paligemma is shown below:

Steps to convert one original mask from the filtered water bodies dataset to a JSON object. This process is repeated over the 164 images of the train set and the 21 images of the test dataset to construct the JSONL dataset.

As you can see, the number of steps needed to prepare the data for Paligemma is large, so I don't share code snippets here. However, if you want to explore the code, you can visit this GitHub repository. The script convert.py has all the steps mentioned in the workflow shown above. I also added the selected images so you can play with this script right away.

When decoding the segmentation codewords back into segmentation masks, we can check how well these masks cover the water bodies in the images:
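For the reverse direction, here is a sketch of that sanity check. The helper decode_fn again stands for my own wrapper around the pre-trained VAE decoder and returns a 64x64 float mask:

import numpy as np

def suffix_to_mask(locs, segs, image_hw, decode_fn):
    # locs: (ymin, xmin, ymax, xmax) in 1024 bins; segs: 16 indices in [0, 127];
    # assumes a non-degenerate bounding box.
    h, w = image_hw
    ymin, xmin = int(locs[0] / 1023 * h), int(locs[1] / 1023 * w)
    ymax, xmax = int(locs[2] / 1023 * h), int(locs[3] / 1023 * w)

    small = decode_fn(np.asarray(segs))  # 64x64 mask defined inside the box
    box_h, box_w = ymax - ymin, xmax - xmin

    # Nearest-neighbor resize of the 64x64 mask to the box size, pasted into a full-size mask.
    rows = np.arange(box_h) * 64 // box_h
    cols = np.arange(box_w) * 64 // box_w
    full = np.zeros((h, w), dtype=np.uint8)
    full[ymin:ymax, xmin:xmax] = (small[np.ix_(rows, cols)] > 0.5).astype(np.uint8)
    return full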

Resulting masks when decoding the segmentation codewords in the train set. Image made by the author using this Notebook.

Before fine-tuning Paligemma, I tried its segmentation capabilities with the models uploaded to Hugging Face. This platform has a demo where you can upload images and interact with different Paligemma models.

Default Paligemma model at segmenting water in satellite images.

The current version of Paligemma is generally good at segmenting water in satellite images, but it's not perfect. Let's see if we can improve these results!

There are two ways to fine-tune Paligemma: either through Hugging Face's Transformers library or by using Big Vision and JAX. I went for the latter. Big Vision provides a Colab notebook, which I modified for my use case. You can open it by going to my GitHub repository:

I used a batch size of 8 and a learning rate of 0.003. I ran the training loop twice, which translates to 158 training steps. The total running time on a T4 GPU machine was 24 minutes.

The results were not as expected. Paligemma didn't produce predictions for some images, and for others, the resulting masks were far from the ground truth. I also obtained segmentation codewords with more than 16 tokens for two images.

Results of the fine-tuning where there were predictions. Image made by the author.

It's worth mentioning that I used the first Paligemma version. Perhaps the results would improve by using Paligemma2 or by tweaking the batch size or learning rate further. In any case, these experiments are out of the scope of this blog.

The demo results show that the default Paligemma model is better at segmenting water than my fine-tuned model. In my opinion, UNET is a better architecture if the goal is to build a model specialized in segmenting objects. For more information on how to train such a model, you can read my previous blog post:

Other limitations:

I would like to mention some other challenges I encountered when fine-tuning Paligemma using Big Vision and JAX.

  • Setting up different model configurations is difficult because there is still little documentation on those parameters.
  • The first version of Paligemma was trained to handle images of different aspect ratios resized to 224×224. Make sure to resize your input images to this size only; this will prevent exceptions. (A minimal resize sketch appears after this list.)
  • When fine-tuning with Big Vision and JAX, you might run into JAX GPU-related problems. Ways to overcome this issue are:

a. Reducing the number of samples in your training and validation datasets.

b. Increasing the batch size from 8 to 16 or higher.

  • The fine-tuned model has a size of ~5GB. Make sure you have enough space in your Drive to store it.
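The resize step mentioned in the list above, as a minimal sketch with Pillow (folder names are placeholders):

from PIL import Image
from pathlib import Path

src, dst = Path("images"), Path("images_224")
dst.mkdir(exist_ok=True)
for path in src.glob("*.jpg"):
    # Resize every input image to the 224x224 size expected by the first Paligemma version.
    Image.open(path).convert("RGB").resize((224, 224)).save(dst / path.name)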

Discovering a new AI model is exciting, especially in this age of multimodal algorithms transforming our society. However, working with state-of-the-art models can sometimes be difficult due to the lack of available documentation. Therefore, the launch of a new AI model should be accompanied by comprehensive documentation to ensure its smooth and widespread adoption, especially among professionals who are still inexperienced in this area.

Despite the difficulties I encountered fine-tuning Paligemma, the current pre-trained models are powerful at zero-shot object detection and image segmentation, which can be used for many applications, including assisted ML labeling.

Are you using Paligemma in your Computer Vision projects? Share your experience fine-tuning this model in the comments!

I hope you enjoyed this post. Once again, thanks for reading!

You can contact me via LinkedIn at:

https://www.linkedin.com/in/camartinezbarbosa/

Acknowledgments: I would like to thank José Celis-Gil for all of the fruitful discussions on data preprocessing and modeling.
