https://github.com/syrax90/dynamic-solov2-tensorflow2 – Source code of the project described in the article.
Disclaimer
⚠️ First of all, note that this project is not production-ready code.
Why I Decided to Implement It from Scratch
This project targets people who don't have high-performance hardware (particularly a GPU) but want to study computer vision, or who are at least on the way to discovering an interest in this area. I tried to make the code as clear as possible, so I used Google's description style for all methods and classes, added comments inside the code to make the logic and calculations clearer, and followed the Single Responsibility Principle and other OOP principles to make the code more human-readable.
As the title of the article suggests, I decided to implement Dynamic SOLO from scratch to deeply understand all the intricacies of implementing such models, including the full production cycle, to better understand the problems that can be encountered in computer vision tasks, and to gain valuable experience in creating computer vision models using TensorFlow. Looking ahead, I will say that I was not mistaken in this choice, because it brought me a lot of new skills and knowledge.
I would recommend implementing models from scratch to everyone who wants to understand their working principles more deeply. Here is why:
- When you encounter something you don't fully understand, you begin to delve deeper into that particular problem. By exploring the problem, you find the answer to why a specific approach was invented, and thus expand your knowledge in this area.
- When you understand the theory behind an approach or principle, you begin to explore how to implement it using existing technical tools. In this way, you improve your technical skills for solving specific problems.
- When implementing something from scratch, you better understand the value of the effort, time, and resources such tasks can consume. By comparing them with similar tasks, you can estimate the costs more accurately and get a better idea of the value of comparable work, including preparation, research, technical implementation, and even documentation.
TensorFlow was chosen as the framework simply because I use it for most of my machine learning tasks (nothing special here).
The project is an implementation of the Dynamic SOLO (SOLOv2) model with the TensorFlow 2 framework.
SOLO (Segmenting Objects by Locations) is a model designed for computer vision tasks, specifically for instance segmentation. It is a completely anchor-free framework that predicts masks without any bounding boxes. The paper presents several variants of the model: Vanilla SOLO, Decoupled SOLO, Dynamic SOLO, and Decoupled Dynamic SOLO. I did implement Vanilla SOLO first, because it is the simplest of them all, but I am not going to publish that code because there is no big difference between Vanilla and Dynamic SOLO from an implementation perspective.
Model
The model can actually be very flexible, in accordance with the principles described in the SOLO paper: from the number of FPN layers to the number of parameters in the layers. I decided to start with the simplest implementation. The fundamental idea of the model is to divide the entire image into cells, where one grid cell can represent only one instance: a predicted class plus a segmentation mask.
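As a toy illustration of the grid idea: if the image is divided into an S x S grid, an instance whose center falls into cell (i, j) is represented by that cell. The helper below is purely illustrative and not taken from the project; the actual assignment in SOLO uses the instance's center region and per-level scale ranges, which the project handles in its target-building code.

def grid_cell_for_center(center_x, center_y, img_w, img_h, S):
    # Map an instance center (in pixels) to a cell of an S x S grid
    j = int(center_x / img_w * S)   # column index, left to right
    i = int(center_y / img_h * S)   # row index, top to bottom
    return min(i, S - 1), min(j, S - 1)

# e.g. an instance centered at (300, 150) in a 640x480 image with S = 40
print(grid_cell_for_center(300, 150, 640, 480, 40))  # -> (12, 18)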

Backbone
I chose ResNet50 as the backbone because it is a lightweight network that is perfectly suited for getting started. I didn't use pretrained parameters for ResNet50 because I was experimenting with more than just the original COCO dataset. However, you can use pretrained parameters if you intend to use the original COCO dataset, since it saves time, speeds up the training process, and improves performance.
from tensorflow.keras.applications import ResNet50

# input_shape is defined elsewhere in the project's configuration
backbone = ResNet50(weights='imagenet', include_top=False, input_shape=input_shape)
backbone.trainable = False
Neck
FPN (Feature Pyramid Network) is used as the neck for extracting multi-scale features. Inside the FPN, we use all outputs C2, C3, C4, C5 from the corresponding residual blocks of ResNet50, as described in the FPN paper (Feature Pyramid Networks for Object Detection by Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie). Each FPN level represents a specific scale and has its own grid, as shown above.
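For illustration, here is a minimal sketch of how C2-C5 can be taken from Keras' ResNet50 and fused into P2-P5 with lateral 1x1 convolutions, top-down upsampling, and a final 3x3 smoothing convolution. The function name, channel count, and input size are my illustrative choices here, not necessarily the project's exact code.

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50

def build_resnet50_fpn(input_shape=(512, 512, 3), out_channels=256):
    # weights=None here; pass weights='imagenet' if you want the pretrained backbone
    backbone = ResNet50(weights=None, include_top=False, input_shape=input_shape)

    # Outputs of the last residual block at each stage (C2..C5)
    c2 = backbone.get_layer('conv2_block3_out').output   # stride 4
    c3 = backbone.get_layer('conv3_block4_out').output   # stride 8
    c4 = backbone.get_layer('conv4_block6_out').output   # stride 16
    c5 = backbone.get_layer('conv5_block3_out').output   # stride 32

    # Lateral 1x1 convolutions unify the channel dimension, then top-down merging
    p5 = layers.Conv2D(out_channels, 1, padding='same')(c5)
    p4 = layers.Add()([layers.UpSampling2D()(p5), layers.Conv2D(out_channels, 1, padding='same')(c4)])
    p3 = layers.Add()([layers.UpSampling2D()(p4), layers.Conv2D(out_channels, 1, padding='same')(c3)])
    p2 = layers.Add()([layers.UpSampling2D()(p3), layers.Conv2D(out_channels, 1, padding='same')(c2)])

    # 3x3 convolutions smooth the merged maps (P2..P5)
    outputs = [layers.Conv2D(out_channels, 3, padding='same')(p) for p in (p2, p3, p4, p5)]
    return tf.keras.Model(inputs=backbone.input, outputs=outputs)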
Head
The outputs of the FPN layers are used as inputs to the layers where the instance class and its mask are determined. The head contains two parallel branches for this purpose: the Classification branch and the Mask kernel branch.

- Classification branch (designated as “Category” in the figure above) – responsible for predicting the class of each instance (grid cell) in an image. It consists of a sequence of Conv2D -> GroupNorm -> ReLU blocks arranged in a row. I applied a sequence of 4 such blocks.
- Mask kernel branch (designated as “Mask” in the figure above) – here is a critical nuance: unlike in the Vanilla SOLO model, it doesn't generate masks directly. Instead, it predicts a mask kernel, which is later applied through dynamic convolution with the Mask feature described below. This design differentiates Dynamic SOLO from Vanilla SOLO by reducing the number of parameters and creating a more efficient, lightweight architecture. The Mask kernel branch predicts a mask kernel for each instance (grid cell) using the same structure as the Classification branch: a sequence of Conv2D -> GroupNorm -> ReLU blocks arranged in a row. I also implemented 4 such blocks in the model. A sketch of both branches follows this list.
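To make the head structure more concrete, below is a rough sketch of the two parallel branches applied to a single FPN level. It assumes a TensorFlow version (2.11+) where tf.keras.layers.GroupNormalization is available, omits the resizing of the feature map to the S x S grid, and uses illustrative parameter values rather than the project's exact ones.

import tensorflow as tf
from tensorflow.keras import layers

def build_head(fpn_channels=256, num_blocks=4, num_classes=80, kernel_depth=256):
    def conv_gn_relu_stack(x):
        # A row of Conv2D -> GroupNorm -> ReLU blocks, as described above
        for _ in range(num_blocks):
            x = layers.Conv2D(fpn_channels, 3, padding='same', use_bias=False)(x)
            x = layers.GroupNormalization(groups=32)(x)
            x = layers.ReLU()(x)
        return x

    fpn_level = layers.Input(shape=(None, None, fpn_channels))

    # Classification branch: class probabilities per grid cell
    cate = conv_gn_relu_stack(fpn_level)
    cate = layers.Conv2D(num_classes, 3, padding='same', activation='sigmoid')(cate)

    # Mask kernel branch: one convolution kernel per grid cell
    kernel = conv_gn_relu_stack(fpn_level)
    kernel = layers.Conv2D(kernel_depth, 3, padding='same')(kernel)

    return tf.keras.Model(inputs=fpn_level, outputs=[cate, kernel])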
Mask Feature
The Mask feature branch is combined with the Mask kernel branch to determine the final predicted mask. This layer fuses multi-level FPN features to produce a unified mask feature map. The authors of the paper evaluated two approaches to implementing the Mask feature branch: a separate mask feature for each FPN level, or one unified mask feature for all FPN levels. Like the authors, I chose the latter. The Mask feature branch and Mask kernel branch are combined via a dynamic convolution operation.
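As an illustration of the dynamic convolution, here is a minimal sketch that treats each predicted kernel as a 1x1 convolution over the unified mask feature map. This assumes 1x1 kernels; the paper also considers larger kernels (e.g. 3x3, with correspondingly more weights per grid cell), which this sketch does not cover.

import tensorflow as tf

def dynamic_convolution(mask_feature, kernels):
    # mask_feature: (H, W, E) unified mask feature map
    # kernels:      (N, E)    one predicted 1x1 kernel per positive grid cell
    # returns:      (N, H, W) one mask logit map per instance
    shape = tf.shape(mask_feature)                        # [H, W, E]
    flat = tf.reshape(mask_feature, [-1, shape[2]])       # (H*W, E)
    masks = tf.matmul(kernels, flat, transpose_b=True)    # each kernel acts as a 1x1 conv -> (N, H*W)
    return tf.reshape(masks, [tf.shape(kernels)[0], shape[0], shape[1]])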
Dataset
I chose to work with the COCO dataset format, training my model on both the original COCO dataset and a small custom dataset structured in the same format. I chose the COCO format because it has already been widely researched, which makes writing code for parsing it much easier. Moreover, the LabelMe tool I chose for building my custom dataset is able to convert a dataset directly to COCO format. Additionally, starting with a small custom dataset reduces training time and simplifies the development process. One more reason to create a dataset on your own is the opportunity to better understand the dataset creation process, participate in it directly, and gain new skills in interacting with tools like LabelMe. A small annotation file can also be explored faster and more easily than a large one if you want to dive deeper into the COCO format.
Here are some of the dataset-related sub-tasks that I encountered while implementing the project (they are all present in the project):
- Data augmentation. Data augmentation of an image dataset is the process of expanding the dataset by applying various image transformation methods to generate new samples that differ from the original ones. Mastering augmentation techniques is essential, especially for small datasets. I applied methods such as horizontal flip, brightness adjustment, random scaling, and random cropping to give an idea of how to do this and to show how important it is that the mask of the transformed image matches its new (augmented) image.
- Converting to targets. The SOLO model expects a specific data format for the targets. It takes a normalized image as input, nothing special. But for the targets, the model expects more complex data:
- We have to build a grid for each scale, dividing it into the number of grid cells defined for that particular scale. This means that if we have 4 FPN levels – P2, P3, P4, P5 – for different scales, then we will have 4 grids, each with a certain number of cells.
- For each instance, we have to determine by location the single cell, among all the grids, to which the instance belongs.
- For each such cell, the category and mask of the corresponding instance are assigned. There is the additional problem of converting the COCO-format mask into a binary mask consisting of ones for the mask pixels and zeros for the rest of the pixels.
- Combine all of the above into a list of tensors as the target. I understand that TensorFlow prefers a strict set of tensors over structures like a list, but I decided to choose a list for the added flexibility you might need if you decide to change the number of scales.
- Dataset in memory or generated on the fly. There are two main options for dataset allocation: storing samples in memory or generating data on the fly. Even though allocation in memory has a lot of benefits, and for many of you it would be no problem to load the entire training image directory of the COCO dataset into memory (only 19.3 GB), I intentionally chose to generate the dataset dynamically using tf.data.Dataset.from_generator, as sketched below. Here is why: I think it is a good skill to learn what problems you might encounter when interacting with big data and how to solve them, because in real-world problems, datasets may not only contain more samples than COCO, but their resolution may also be much higher. Working with dynamically generated datasets can be a bit more complex to implement, but it is more flexible. Of course, you can replace it with tf.data.Dataset.from_tensor_slices if you wish.
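Below is a small, self-contained sketch of the on-the-fly approach with tf.data.Dataset.from_generator. The sizes are deliberately small, and the zero-filled arrays stand in for the real image decoding and SOLO target building done in the project.

import numpy as np
import tensorflow as tf

IMG_SIZE, GRID_SIZE, NUM_CLASSES = 256, 12, 80

def sample_generator():
    # In the project, this loops over COCO annotations and builds real targets
    for _ in range(100):
        image = np.zeros((IMG_SIZE, IMG_SIZE, 3), np.float32)                    # normalized image
        cate_target = np.zeros((GRID_SIZE, GRID_SIZE, NUM_CLASSES), np.float32)  # per-cell class targets
        mask_target = np.zeros((GRID_SIZE * GRID_SIZE, IMG_SIZE // 4, IMG_SIZE // 4), np.float32)  # per-cell masks
        yield image, (cate_target, mask_target)

dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(IMG_SIZE, IMG_SIZE, 3), dtype=tf.float32),
        (
            tf.TensorSpec(shape=(GRID_SIZE, GRID_SIZE, NUM_CLASSES), dtype=tf.float32),
            tf.TensorSpec(shape=(GRID_SIZE * GRID_SIZE, IMG_SIZE // 4, IMG_SIZE // 4), dtype=tf.float32),
        ),
    ),
).batch(2).prefetch(tf.data.AUTOTUNE)   # samples are produced lazily, so the dataset never has to fit in memory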
Training Process
Loss Function
SOLO uses a loss function that is not natively implemented in TensorFlow, so I implemented it on my own.
$$L = L_{cate} + \lambda L_{mask}$$
Where:
- \(L_{cate}\) is the conventional Focal Loss for semantic category classification.
- \(L_{mask}\) is the loss for mask prediction.
- \(\lambda\) is a coefficient that is set to 3 in the paper.
$$
L_{mask} = \frac{1}{N_{pos}} \sum_k \mathbb{1}_{\{p^*_{i,j} > 0\}} \, d_{mask}(m_k, m^*_k)
$$
Where:
- \(N_{pos}\) is the number of positive samples.
- \(d_{mask}\) is implemented as the Dice Loss.
- \(i = \lfloor k/S \rfloor\), \(j = k \bmod S\) are the grid cell indices, indexing left to right and top to bottom.
- \(\mathbb{1}\) is the indicator function, which equals 1 if \(p^*_{i,j} > 0\) and 0 otherwise.
$$L_{Dice} = 1 - D(p, q)$$
Where \(D\) is the Dice coefficient, which is defined as
$$
D(p, q) = \frac{2 \sum_{x,y} (p_{x,y} \cdot q_{x,y})}{\sum_{x,y} p^2_{x,y} + \sum_{x,y} q^2_{x,y}}
$$
Where \(p_{x,y}\) and \(q_{x,y}\) are the pixel values at location \((x, y)\) of the predicted mask \(p\) and the ground truth mask \(q\). All details of the loss function are described in the original SOLO paper.
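For reference, a direct translation of the Dice formulas above into TensorFlow might look like the sketch below; the project's actual loss code may differ in details such as batching and the selection of positive samples.

import tensorflow as tf

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    # pred_mask, gt_mask: (H, W) tensors; pred_mask holds sigmoid probabilities, gt_mask is binary
    numerator = 2.0 * tf.reduce_sum(pred_mask * gt_mask)
    denominator = tf.reduce_sum(tf.square(pred_mask)) + tf.reduce_sum(tf.square(gt_mask))
    dice_coefficient = numerator / (denominator + eps)   # D(p, q), eps avoids division by zero
    return 1.0 - dice_coefficient                        # L_Dice = 1 - D(p, q)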
Resuming from Checkpoint
If you use a low-performance GPU, you might encounter situations where training the entire model in a single run is impractical. In order not to lose your trained weights and to be able to continue the training process, this project provides a resuming-from-checkpoint system. It allows you to save your model every few epochs (the interval is configurable) and resume training later. To enable this, set load_previous_model to True and specify model_path in config.py.
self.load_previous_model = True
self.model_path = './weights/coco_epoch00000001.keras'
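For comparison, with standard Keras utilities a similar save/resume setup could look roughly like the sketch below; this is not the project's exact mechanism, just one common way to do it.

import tensorflow as tf

# Save the full model after every epoch so training can be resumed from any of the files
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='./weights/coco_epoch{epoch:08d}.keras',
    save_weights_only=False,
)

# model.fit(train_dataset, epochs=10, callbacks=[checkpoint_cb])

# Resuming later:
# model = tf.keras.models.load_model('./weights/coco_epoch00000001.keras')
# model.fit(train_dataset, initial_epoch=1, epochs=10, callbacks=[checkpoint_cb])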
Evaluation Process
To see how effectively your model is trained and how well it behaves on previously unseen images, an evaluation process is used. For the SOLO model, I would break the process down into the following steps:
- Loading a test dataset.
- Preparing the dataset to be compatible with the model’s input.
- Feeding the data into the model.
- Suppressing the resulting masks that represent the same instance with lower probability.
- Visualizing the original test image with the final mask and predicted category for each instance.
The most unusual task I faced here was implementing Matrix NMS (non-maximum suppression), described in the original SOLO paper. NMS eliminates redundant masks that represent the same instance with lower probability. To avoid predicting the same instance multiple times, we need to suppress these duplicate masks. The authors provided Python pseudo-code for Matrix NMS, and one of my tasks was to interpret this pseudo-code and implement it using TensorFlow. My implementation:
import tensorflow as tf

def matrix_nms(masks, scores, labels, pre_nms_k=500, post_nms_k=100, score_threshold=0.5, sigma=0.5):
    """
    Perform class-wise Matrix NMS on instance masks.

    Parameters:
        masks (tf.Tensor): Tensor of shape (N, H, W) with each mask as a sigmoid probability map (0~1).
        scores (tf.Tensor): Tensor of shape (N,) with confidence scores for each mask.
        labels (tf.Tensor): Tensor of shape (N,) with class labels for each mask (ints).
        pre_nms_k (int): Number of top-scoring masks to keep before applying NMS.
        post_nms_k (int): Number of final masks to keep after NMS.
        score_threshold (float): Score threshold to filter out masks after NMS (default 0.5).
        sigma (float): Sigma value for Gaussian decay.

    Returns:
        tf.Tensor: Tensor of indices of masks kept after suppression.
    """
    # Binarize masks at 0.5 threshold
    seg_masks = tf.cast(masks >= 0.5, dtype=tf.float32)  # shape: (N, H, W)
    mask_sum = tf.reduce_sum(seg_masks, axis=[1, 2])  # shape: (N,)

    # If desired, select top pre_nms_k by score to limit computation
    num_masks = tf.shape(scores)[0]
    if pre_nms_k is not None:
        num_selected = tf.minimum(pre_nms_k, num_masks)
    else:
        num_selected = num_masks
    topk_indices = tf.argsort(scores, direction='DESCENDING')[:num_selected]
    seg_masks = tf.gather(seg_masks, topk_indices)  # select masks by top scores
    labels_sel = tf.gather(labels, topk_indices)
    scores_sel = tf.gather(scores, topk_indices)
    mask_sum_sel = tf.gather(mask_sum, topk_indices)

    # Flatten masks for matrix operations
    N = tf.shape(seg_masks)[0]
    seg_masks_flat = tf.reshape(seg_masks, (N, -1))  # shape: (N, H*W)

    # Compute intersection and IoU matrix (N x N)
    intersection = tf.matmul(seg_masks_flat, seg_masks_flat, transpose_b=True)  # pairwise intersect counts
    # Expand mask areas to full matrices
    mask_sum_matrix = tf.tile(mask_sum_sel[tf.newaxis, :], [N, 1])  # shape: (N, N)
    union = mask_sum_matrix + tf.transpose(mask_sum_matrix) - intersection
    iou = intersection / (union + 1e-6)  # IoU matrix (avoid div-by-zero)

    # Zero out diagonal and lower triangle so each pair is counted once,
    # with row i always the higher-scored mask
    iou = tf.linalg.band_part(iou, 0, -1) - tf.linalg.band_part(iou, 0, 0)

    # Only masks of the same class suppress each other
    labels_matrix = tf.tile(labels_sel[tf.newaxis, :], [N, 1])  # shape: (N, N)
    same_class = tf.cast(tf.equal(labels_matrix, tf.transpose(labels_matrix)), tf.float32)
    decay_iou = iou * same_class

    # For each mask, the largest IoU it has with any higher-scored mask (compensation term)
    compensate_iou = tf.reduce_max(decay_iou, axis=0)  # shape: (N,)
    compensate_iou = tf.tile(compensate_iou[:, tf.newaxis], [1, N])  # shape: (N, N)

    # Gaussian decay of scores based on overlaps with higher-scored masks
    decay_matrix = tf.exp(-sigma * (tf.square(decay_iou) - tf.square(compensate_iou)))
    decay_coefficient = tf.reduce_min(decay_matrix, axis=0)  # shape: (N,)

    # Decay the scores and drop masks that fall below the threshold
    new_scores = scores_sel * decay_coefficient
    keep_mask = new_scores >= score_threshold  # boolean mask of those above threshold
    new_scores = tf.where(keep_mask, new_scores, tf.zeros_like(new_scores))

    # Select top post_nms_k by the decayed scores
    if post_nms_k is not None:
        num_final = tf.minimum(post_nms_k, tf.shape(new_scores)[0])
    else:
        num_final = tf.shape(new_scores)[0]
    final_indices = tf.argsort(new_scores, direction='DESCENDING')[:num_final]
    final_indices = tf.boolean_mask(final_indices, tf.greater(tf.gather(new_scores, final_indices), 0))

    # Map back to original indices
    kept_indices = tf.gather(topk_indices, final_indices)
    return kept_indices
Below is an example of an image the model has never seen before, with the predicted masks overlaid:

Advice for Implementation from Scratch
- Which data do we map to which function? It is very important to make sure that we feed the right data to the model. The data should match what is expected at each layer, and each layer should process its input so that its output is suitable for the next layer, because we ultimately calculate the loss function based on this data. From the implementation of SOLO, I realized that some targets are not as simple as they might seem at first glance. I described this in the Dataset chapter.
- Research the paper. There is no escaping reading the paper your model is going to be based on. I know this is obvious, but despite the many references to other previous works and papers, you need to understand the principles. When you start researching a paper, you may be faced with a number of other papers that you need to read and understand first, and this can be quite a challenging task. But often, even the most modern paper relies on a set of principles that have been known for some time and are not new. This means you can find a lot of material on the web that describes these principles very clearly. You can use LLMs for this purpose, to summarize the information, give examples, and help you understand some of the works and papers.
- Start with small steps. This is trivial advice, but to implement a computer vision model with millions of parameters, you don't need to waste time on useless training, dataset preparation, evaluation, and so on while you are still at the development stage and not sure the model works correctly. Moreover, if you have a low-performance GPU, the process takes even longer. So, don't start with huge datasets, many parameters, and a long series of layers. You can even let the model overfit in the first stage of development with a small dataset and a small number of parameters, just to be sure that the data is correctly matched to the targets of the model.
- Debug your code. Debugging your code allows you to make sure that you get the expected code behaviour and data values at each step. I understand that everyone who has developed a software product at least once knows this and doesn't need the advice. But I would like to highlight it anyway, because when building models, writing loss functions, and preparing datasets for inputs and targets, we interact with math operations and tensors a lot. This requires increased attention from us, unlike the routine programming code we face all the time and know how it works without debugging.
Conclusion
This is a brief description of the project without deep technical details, to give a general picture and avoid reading fatigue. Obviously, a description of a project dedicated to a computer vision model cannot fit into a single article. If I see interest in the project from readers, I may write a more detailed article with technical details.