FastSAM for Image Segmentation Tasks — Explained Simply


Segmentation is a popular task in computer vision, whose goal is to partition an input image into multiple regions, where each region represents a separate object.

Several classic approaches from the past involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While fine-tuning works well, the emergence of GPT-2 and GPT-3 prompted the machine learning community to steadily shift focus toward the development of zero-shot learning solutions.

The zero-shot concept plays a very important role by allowing the fine-tuning phase to be skipped, with the hope that the model is intelligent enough to solve any task on the go.

In the context of computer vision, Meta released the widely known general-purpose “Segment Anything Model” (SAM) in 2023, which enabled segmentation tasks to be performed with decent quality in a zero-shot manner.

The segmentation task aims to partition an image into multiple parts, with each part representing a single object.

While the large-scale results of SAM were impressive, several months later, the Chinese Academy of Sciences Image and Video Analysis (CASIA IVA) group released the FastSAM model. As the adjective “fast” suggests, FastSAM addresses the speed limitations of SAM by accelerating the inference process by up to 50 times, while maintaining high segmentation quality.

Architecture

The inference process in FastSAM takes place in two steps:

  1. All-instance segmentation. The goal is to produce segmentation masks for all objects in the image.
  2. Prompt-guided selection. After obtaining all possible masks, prompt-guided selection returns the image region corresponding to the input prompt (a minimal code sketch of both steps follows below).

FastSAM inference takes place in two steps. After the segmentation masks are obtained, prompt-guided selection is used to filter and merge them into the final mask.
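To make these two steps concrete, here is a minimal usage sketch assuming the official FastSAM package is installed from the repository linked at the end of the article. The checkpoint name, image path, point coordinates, and keyword arguments are illustrative and may differ slightly between versions.

# A minimal sketch of the two inference steps, assuming the official FastSAM
# package is installed; checkpoint, paths, and coordinates are illustrative.
from fastsam import FastSAM, FastSAMPrompt

model = FastSAM("FastSAM-x.pt")

# Step 1: all-instance segmentation (masks for every object in the image).
everything_results = model(
    "dog.jpg", device="cpu", retina_masks=True, imgsz=1024, conf=0.4, iou=0.9
)

# Step 2: prompt-guided selection (keep only the masks matching the prompt).
prompt_process = FastSAMPrompt("dog.jpg", everything_results, device="cpu")
masks = prompt_process.point_prompt(points=[[320, 240]], pointlabel=[1])
prompt_process.plot(annotations=masks, output_path="./output/dog.jpg")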

Let us start with all-instance segmentation.

All-instance segmentation

Before visually examining the architecture, let us refer to the original paper, which describes the all-instance segmentation stage in terms of YOLOv8-seg and YOLACT.

The definition may appear complex for those who are not familiar with YOLOv8-seg and YOLACT. To better clarify the meaning behind these two models, I will provide a simple intuition about what they are and how they are used.

YOLACT (You Only Look At CoefficienTs)

YOLACT is a real-time convolutional instance segmentation model inspired by the YOLO model; it focuses on high-speed detection and achieves performance comparable to the Mask R-CNN model.

YOLACT consists of two primary modules (branches):

  1. Prototype branch. YOLACT creates a set of segmentation masks called prototypes.
  2. Prediction branch. YOLACT performs object detection by predicting bounding boxes and then estimates mask coefficients, which tell the model how to linearly combine the prototypes to create a final mask for each object (see the sketch below).

YOLACT architecture: yellow blocks indicate trainable parameters, while gray blocks indicate non-trainable parameters. Source: YOLACT, Real-time Instance Segmentation. The number of mask prototypes in the image is k = 4. Image adapted by the author.
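To build intuition for this, below is a toy NumPy sketch (not YOLACT’s actual code) of how the predicted coefficients linearly combine k prototype masks into a single instance mask; all shapes and values are made up.

# Toy illustration of assembling an instance mask from k prototype masks.
import numpy as np

def assemble_mask(prototypes, coefficients):
    """prototypes: (H, W, k) array; coefficients: (k,), predicted for one object."""
    combined = prototypes @ coefficients          # linear combination of the k prototypes
    probs = 1.0 / (1.0 + np.exp(-combined))       # sigmoid gives per-pixel probabilities
    return (probs > 0.5).astype(np.uint8)         # binarize into the final instance mask

# Example with k = 4 prototypes for a 480 x 640 image.
prototypes = np.random.rand(480, 640, 4)
coefficients = np.array([0.9, -1.2, 0.3, 2.0])    # one coefficient per prototype
instance_mask = assemble_mask(prototypes, coefficients)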

To extract initial features from the image, YOLACT uses ResNet, followed by a Feature Pyramid Network (FPN) to obtain multi-scale features. Each of the P-levels (shown in the image) processes features of a different size using convolutions (e.g., P3 handles the smallest-scale features, while P7 captures higher-level image features). This approach helps YOLACT account for objects at various scales.
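As a rough illustration of the multi-scale idea (not YOLACT’s exact implementation), the sketch below passes fake backbone feature maps of three different resolutions through torchvision’s generic FPN module; the channel counts and spatial sizes are arbitrary.

# Illustrative FPN example: multi-scale features with a shared channel count.
from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

# Fake backbone outputs at three spatial scales (like C3, C4, C5 from ResNet).
features = OrderedDict(
    c3=torch.rand(1, 256, 64, 64),
    c4=torch.rand(1, 512, 32, 32),
    c5=torch.rand(1, 1024, 16, 16),
)

fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024], out_channels=256)
pyramid = fpn(features)  # same spatial sizes as the inputs, all with 256 channels
print({name: tuple(f.shape) for name, f in pyramid.items()})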

YOLOv8-seg

YOLOv8-seg is a model based on YOLACT and incorporates the same principles regarding prototypes. It also has two heads:

  1. Detection head. Used to predict bounding boxes and classes.
  2. Segmentation head. Used to generate masks and combine them.

The key difference is that YOLOv8-seg uses a YOLO backbone architecture instead of the ResNet backbone and FPN used in YOLACT. This makes YOLOv8-seg lighter and faster during inference.
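For reference, YOLOv8-seg can be tried through the Ultralytics package; the model name and image path below are illustrative.

# Minimal YOLOv8-seg sketch using the Ultralytics package.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")       # model with detection and segmentation heads
results = model("dog.jpg")

for r in results:
    print(r.boxes.xyxy)              # predicted bounding boxes
    if r.masks is not None:
        print(r.masks.data.shape)    # one binary mask per detected instance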

FastSAM architecture

FastSAM’s architecture is based on YOLOv8-seg but also incorporates an FPN, similar to YOLACT. It includes both detection and segmentation heads, with k = 32 prototypes. However, since FastSAM performs segmentation of all possible objects in the image, its workflow differs from that of YOLOv8-seg and YOLACT:

  1. First, FastSAM performs segmentation by producing k = 32 image masks.
  2. These masks are then combined to produce the final segmentation mask.
  3. During post-processing, FastSAM extracts regions, computes bounding boxes, and performs instance segmentation for each object.

FastSAM architecture: yellow blocks indicate trainable parameters, while gray blocks indicate non-trainable parameters. Source: Fast Segment Anything. Image adapted by the author.

Note

Although the paper does not mention details about post-processing, it can be observed that the official FastSAM GitHub repository uses the cv2.findContours() method from OpenCV in the prediction stage.

# Use of the cv2.findContours() method during the prediction stage.
# Source: FastSAM repository (FastSAM / fastsam / prompt.py)  

def _get_bbox_from_mask(self, mask):
    mask = mask.astype(np.uint8)
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x1, y1, w, h = cv2.boundingRect(contours[0])
    x2, y2 = x1 + w, y1 + h
    if len(contours) > 1:
        for b in contours:
            x_t, y_t, w_t, h_t = cv2.boundingRect(b)
            # Merge multiple bounding boxes into one.
            x1 = min(x1, x_t)
            y1 = min(y1, y_t)
            x2 = max(x2, x_t + w_t)
            y2 = max(y2, y_t + h_t)
        h = y2 - y1
        w = x2 - x1
    return [x1, y1, x2, y2]

In practice, there are several methods to extract instance masks from the final segmentation mask. Some examples include contour detection (used in FastSAM) and connected component analysis, shown in the sketch below.
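For completeness, here is a small sketch of the connected-component alternative using OpenCV’s cv2.connectedComponents(); it is an illustration rather than FastSAM’s actual code.

# Split a binary segmentation mask into per-instance masks.
import cv2
import numpy as np

def instances_from_mask(mask):
    """mask: binary (H, W) array; returns one binary mask per connected region."""
    num_labels, labels = cv2.connectedComponents(mask.astype(np.uint8))
    # Label 0 is the background; every other label corresponds to one instance.
    return [(labels == i).astype(np.uint8) for i in range(1, num_labels)]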

Training

FastSAM researchers used the same SA-1B dataset as the SAM developers but trained the CNN detector on only 2% of the data. Despite this, the CNN detector achieves performance comparable to the original SAM, while requiring significantly fewer resources for segmentation. As a result, inference in FastSAM is up to 50 times faster!

For reference, SA-1B consists of 11 million diverse images and 1.1 billion high-quality segmentation masks.

Prompt-guided selection

The “segment anything task” involves producing a segmentation mask for a given prompt, which can be represented in several forms.

Several types of prompts processed by FastSAM. Source: Fast Segment Anything. Image adapted by the author.

Point prompt

After obtaining multiple prototypes for an image, a point prompt can be used to indicate that the object of interest is located (or not) in a particular area of the image. As a result, the specified point influences the coefficients for the prototype masks.

Similar to SAM, FastSAM allows selecting multiple points and specifying whether they belong to the foreground or background. If a foreground point corresponding to the object appears in multiple masks, background points can be used to filter out irrelevant masks.

However, if several masks still satisfy the point prompts after filtering, mask merging is applied to obtain the final mask for the object.
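A simplified sketch of this selection logic (not FastSAM’s actual implementation) might look as follows, where masks is a list of binary arrays produced during all-instance segmentation and the point coordinates are assumed to be valid pixel positions.

# Keep masks containing foreground points, drop masks containing background
# points, then merge whatever remains into the final object mask.
import numpy as np

def select_by_points(masks, fg_points, bg_points):
    """masks: list of binary (H, W) arrays; points: lists of (x, y) coordinates."""
    selected = []
    for mask in masks:
        contains_fg = any(mask[y, x] > 0 for x, y in fg_points)
        contains_bg = any(mask[y, x] > 0 for x, y in bg_points)
        if contains_fg and not contains_bg:
            selected.append(mask)
    if not selected:
        return None
    return (np.sum(selected, axis=0) > 0).astype(np.uint8)  # merge remaining masks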

Additionally, the authors apply morphological operators to smooth the final mask shape and remove small artifacts and noise.
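For example, a mask can be smoothed with morphological opening and closing in OpenCV; the kernel size below is an illustrative choice.

# Morphological smoothing of a binary mask.
import cv2
import numpy as np

def smooth_mask(mask, kernel_size=5):
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove small specks of noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes in the mask
    return mask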

Box prompt

The box prompt involves selecting the mask whose bounding box has the highest Intersection over Union (IoU) with the bounding box specified in the prompt.
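A simplified sketch of this logic (not FastSAM’s actual code), with boxes assumed to be in [x1, y1, x2, y2] format:

# Pick the mask whose bounding box best matches the prompt box.
def iou(box_a, box_b):
    """Intersection over Union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_by_box(mask_boxes, prompt_box):
    """mask_boxes: bounding boxes of the predicted masks; returns the best index."""
    scores = [iou(box, prompt_box) for box in mask_boxes]
    return scores.index(max(scores))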

Text prompt

Similarly, for the text prompt, the mask that best corresponds to the text description is selected. To achieve this, the CLIP model is used:

  1. The embeddings for the text prompt and the k = 32 prototype masks are computed.
  2. The similarities between the text embedding and the prototypes are then calculated. The prototype with the highest similarity is post-processed and returned (see the sketch below).

For the text prompt, the CLIP model is used to compute the text embedding of the prompt and the image embeddings of the mask prototypes. The similarities between the text embedding and the image embeddings are calculated, and the prototype corresponding to the image embedding with the highest similarity is selected.
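A simplified sketch of this idea using the Hugging Face CLIP implementation is shown below (the exact CLIP variant and code used by the official repository may differ); crops is assumed to be a list of image regions, one per candidate mask.

# Pick the mask whose image crop is most similar to the text prompt under CLIP.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_by_text(crops, text):
    """crops: list of PIL images, one per candidate mask; text: the prompt string."""
    inputs = processor(text=[text], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)   # normalize embeddings
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarities = (image_emb @ text_emb.T).squeeze(-1)            # cosine similarities
    return int(similarities.argmax())                              # index of the best mask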

FastSAM repository

Below is the link to the official FastSAM repository, which contains a clear README.md file and documentation: https://github.com/CASIA-IVA-Lab/FastSAM

In this article, we have looked at FastSAM, an improved version of SAM. Combining the best practices from the YOLACT and YOLOv8-seg models, FastSAM maintains high segmentation quality while achieving a significant boost in prediction speed, accelerating inference by several dozen times compared to the original SAM.

The ability to use prompts with FastSAM provides a flexible way to retrieve segmentation masks for objects of interest. Furthermore, it has been shown that decoupling prompt-guided selection from all-instance segmentation reduces complexity.

Below are some examples of FastSAM usage with different prompts, visually demonstrating that it still retains the high segmentation quality of SAM:

Examples of FastSAM segmentation with different prompts. Source: Fast Segment Anything

Resources

  1. Fast Segment Anything (the FastSAM paper)
  2. YOLACT: Real-time Instance Segmentation
  3. Segment Anything (the SAM paper)
  4. Official FastSAM repository: https://github.com/CASIA-IVA-Lab/FastSAM
