Segmentation is a popular task in computer vision whose goal is to partition an input image into multiple regions, where each region represents a separate object.
Several classic approaches involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While fine-tuning works well, the emergence of GPT-2 and GPT-3 prompted the machine learning community to steadily shift its focus toward the development of zero-shot learning solutions.
The zero-shot concept plays an important role by allowing the fine-tuning phase to be skipped, with the hope that the model is intelligent enough to solve any task on the fly.
In the context of computer vision, Meta released the widely known general-purpose “Segment Anything Model” (SAM) in 2023, which enabled segmentation tasks to be performed with decent quality in a zero-shot manner.
While the large-scale results of SAM were impressive, several months later the Image and Video Analysis group of the Chinese Academy of Sciences (CASIA IVA) released the FastSAM model. As the adjective “fast” suggests, FastSAM addresses the speed limitations of SAM by accelerating the inference process by up to 50 times while maintaining high segmentation quality.
Architecture
The inference process in FastSAM takes place in two steps:
- All-instance segmentation. The goal is to produce segmentation masks for all objects in the image.
- Prompt-guided selection. After obtaining all possible masks, prompt-guided selection returns the image region corresponding to the input prompt.

Let us start with all-instance segmentation.
All-instance segmentation
Before visually examining the architecture, let us refer to the original paper:
The definition may appear complex for those who are not familiar with YOLOv8-seg and YOLACT. In any case, to clarify the meaning behind these two models, I will provide a simple intuition about what they are and how they are used.
YOLACT (You Only Look At CoefficienTs)
YOLACT is a real-time instance segmentation convolutional model that focuses on high-speed detection, inspired by the YOLO model, and achieves performance comparable to the Mask R-CNN model.
YOLACT consists of two primary modules (branches):
- Prototype branch. YOLACT creates a set of segmentation masks called prototypes.
- Prediction branch. YOLACT performs object detection by predicting bounding boxes and then estimates mask coefficients, which tell the model how to linearly combine the prototypes into a final mask for each object (a small sketch of this combination step follows the list).

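The combination step itself is just a weighted sum of the prototypes followed by a sigmoid. Below is a minimal NumPy sketch of this idea; the tensor names, sizes, and the 0.5 threshold are illustrative assumptions, not YOLACT's actual implementation.

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Linearly combine prototype masks with per-object coefficients.

    prototypes:   (k, H, W) - k prototype masks from the prototype branch
    coefficients: (n, k)    - one coefficient vector per detected object
    returns:      (n, H, W) - one soft mask per object
    """
    k, h, w = prototypes.shape
    # Each object's mask is a weighted sum of the prototypes.
    masks = coefficients @ prototypes.reshape(k, h * w)
    # A sigmoid squashes the result into [0, 1] before thresholding.
    masks = 1.0 / (1.0 + np.exp(-masks))
    return masks.reshape(-1, h, w)

# Example: 32 prototypes of size 160x160 and 3 detected objects.
prototypes = np.random.randn(32, 160, 160)
coefficients = np.random.randn(3, 32)
object_masks = assemble_masks(prototypes, coefficients) > 0.5
```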
To extract initial features from the image, YOLACT uses ResNet, followed by a Feature Pyramid Network (FPN) to obtain multi-scale features. Each of the P-levels (shown in the image) processes features of a different size using convolutions (e.g., P3 contains the smallest features, while P7 captures higher-level image features). This approach helps YOLACT account for objects at various scales.
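For readers who want to see what an FPN looks like in code, the sketch below runs torchvision's FeaturePyramidNetwork on dummy multi-scale features. The channel sizes and level names are assumptions for illustration, not YOLACT's exact configuration.

```python
from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

# Dummy backbone outputs at three scales (channels and resolutions are assumptions).
features = OrderedDict(
    c3=torch.randn(1, 256, 80, 80),
    c4=torch.randn(1, 512, 40, 40),
    c5=torch.randn(1, 1024, 20, 20),
)

# The FPN fuses the levels top-down and maps them all to a common channel width.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024], out_channels=256)
pyramid = fpn(features)

for name, feat in pyramid.items():
    print(name, tuple(feat.shape))  # every level now has 256 channels
```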
YOLOv8-seg
YOLOv8-seg is a model based on YOLACT and incorporates the same principles regarding prototypes. It also has two heads:
- Detection head. Used to predict bounding boxes and classes.
- Segmentation head. Used to generate masks and mix them.
The key difference is that YOLOv8-seg uses a YOLO backbone architecture instead of the ResNet backbone and FPN used in YOLACT. This makes YOLOv8-seg lighter and faster during inference.
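As a quick usage reference, YOLOv8-seg can be run through the ultralytics package. The weights file name and image path below are placeholders.

```python
from ultralytics import YOLO

# Load pretrained YOLOv8 segmentation weights (file name is a placeholder).
model = YOLO("yolov8n-seg.pt")

# Run inference on an image; each result carries boxes and instance masks.
results = model("example.jpg")
for result in results:
    print(result.boxes.xyxy)            # predicted bounding boxes
    if result.masks is not None:
        print(result.masks.data.shape)  # per-instance mask tensors
```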
FastSAM architecture
FastSAM’s architecture is based on YOLOv8-seg but also incorporates an FPN, similar to YOLACT. It includes both detection and segmentation heads, with prototypes. However, since FastSAM performs segmentation of all possible objects in the image, its workflow differs from that of YOLOv8-seg and YOLACT:
- First, FastSAM performs segmentation by producing image masks.
- These masks are then combined to produce the final segmentation mask.
- During post-processing, FastSAM extracts regions, computes bounding boxes, and performs instance segmentation for each object.

Note
Although the paper does not mention details about post-processing, it can be observed that the official FastSAM GitHub repository uses the cv2.findContours() method from OpenCV during the prediction stage.
```python
# Usage of the cv2.findContours() method during the prediction stage.
# Source: FastSAM repository (FastSAM / fastsam / prompt.py), method of the FastSAMPrompt class.
import cv2
import numpy as np

def _get_bbox_from_mask(self, mask):
    # Binarize the mask and find the external contours of the segmented region.
    mask = mask.astype(np.uint8)
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x1, y1, w, h = cv2.boundingRect(contours[0])
    x2, y2 = x1 + w, y1 + h
    if len(contours) > 1:
        for b in contours:
            x_t, y_t, w_t, h_t = cv2.boundingRect(b)
            # Merge multiple bounding boxes into one.
            x1 = min(x1, x_t)
            y1 = min(y1, y_t)
            x2 = max(x2, x_t + w_t)
            y2 = max(y2, y_t + h_t)
        h = y2 - y1
        w = x2 - x1
    return [x1, y1, x2, y2]
```
In practice, there are several methods for extracting instance masks from the final segmentation mask, such as contour detection (used in FastSAM) and connected component analysis; a minimal sketch of the latter is shown below.
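The following sketch illustrates the connected-component alternative using OpenCV. It is not part of FastSAM; it simply shows how per-instance masks could be separated from a binary segmentation mask.

```python
import cv2
import numpy as np

def split_into_instances(binary_mask):
    """Split a binary segmentation mask into per-instance masks
    using connected component analysis."""
    mask = binary_mask.astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(mask)
    # Label 0 is the background; every other label is a separate instance.
    return [(labels == label).astype(np.uint8) for label in range(1, num_labels)]
```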
Training
FastSAM researchers used the same SA-1B dataset as the SAM developers but trained the CNN detector on only 2% of the data. Despite this, the CNN detector achieves performance comparable to the original SAM while requiring significantly fewer resources for segmentation. As a result, inference in FastSAM is up to 50 times faster!
For reference, SA-1B consists of 11 million diverse images and 1.1 billion high-quality segmentation masks.
Prompt-guided selection
The “segment anything task” involves producing a segmentation mask for a given prompt, which can be represented in several forms.

Point prompt
After obtaining multiple prototypes for an image, a point prompt can be used to indicate that the object of interest is located (or is not located) in a particular area of the image. As a result, the specified point influences the coefficients for the prototype masks.
Similar to SAM, FastSAM allows selecting multiple points and specifying whether they belong to the foreground or the background. If a foreground point corresponding to the object appears in multiple masks, background points can be used to filter out irrelevant masks.
However, if several masks still satisfy the point prompts after filtering, mask merging is applied to obtain the final mask for the object, as in the sketch below.
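Here is a minimal sketch of this selection logic, assuming the all-instance masks are available as a boolean array and points are given in pixel coordinates; the names and shapes are assumptions, not FastSAM's code.

```python
import numpy as np

def select_by_points(masks, points, labels):
    """Keep masks that contain every foreground point and no background point,
    then merge whatever remains into a single mask.

    masks:  (n, H, W) boolean array of all-instance masks
    points: list of (x, y) pixel coordinates
    labels: list of 1 (foreground) / 0 (background) flags
    """
    keep = []
    for mask in masks:
        ok = True
        for (x, y), label in zip(points, labels):
            inside = bool(mask[y, x])
            if (label == 1 and not inside) or (label == 0 and inside):
                ok = False
                break
        if ok:
            keep.append(mask)
    if not keep:
        return None
    # Merge the surviving masks into the final mask for the object.
    return np.logical_or.reduce(keep)
```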
Additionally, the authors apply morphological operators to smooth the final mask shape and remove small artifacts and noise.
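The paper does not specify which operators are used; a typical choice would be a morphological opening followed by a closing, as in the sketch below (the kernel size is an assumption).

```python
import cv2
import numpy as np

def smooth_mask(mask, kernel_size=5):
    """Remove small artifacts and smooth the mask border with
    morphological opening and closing."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = mask.astype(np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # drop small speckles
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask
```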
Box prompt
The box prompt involves selecting the mask whose bounding box has the highest Intersection over Union (IoU) with the bounding box specified in the prompt.
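A minimal sketch of this selection, assuming each candidate mask already has a bounding box in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_by_box(mask_boxes, prompt_box):
    """Return the index of the mask whose box best overlaps the prompt box."""
    scores = [iou(box, prompt_box) for box in mask_boxes]
    return scores.index(max(scores))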
Text prompt
Similarly, for the text prompt, the mask that best corresponds to the text description is chosen. To achieve this, the CLIP model is used:
- The embeddings for the text prompt and the k = 32 prototype masks are computed.
- The similarities between the text embedding and the prototypes are then calculated. The prototype with the highest similarity is post-processed and returned (a sketch of this matching step follows the list).

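The matching step can be sketched with OpenAI's clip package. The model name and the way the mask crops are prepared are simplified assumptions, not FastSAM's exact pipeline.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_by_text(mask_crops, text):
    """Pick the mask crop whose CLIP embedding is most similar to the text prompt.

    mask_crops: list of PIL images, one per candidate mask region.
    """
    images = torch.stack([preprocess(crop) for crop in mask_crops]).to(device)
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(images)
        text_emb = model.encode_text(tokens)
    # Cosine similarity between each crop embedding and the text embedding.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.T).squeeze(1)
    return int(similarity.argmax())
```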
FastSAM repository
Below is the link to the official FastSAM repository, which includes a clear README.md file and documentation.
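For reference, the repository's README describes a usage pattern along the following lines; the paths, thresholds, and prompt values here are illustrative, and the exact argument names may differ between versions, so check the README for the up-to-date interface.

```python
from fastsam import FastSAM, FastSAMPrompt

model = FastSAM("FastSAM.pt")      # pretrained weights from the repository
image_path = "images/dogs.jpg"     # placeholder image path

# Step 1: all-instance segmentation over the whole image.
everything_results = model(image_path, device="cuda", retina_masks=True,
                           imgsz=1024, conf=0.4, iou=0.9)

# Step 2: prompt-guided selection on top of the precomputed masks.
prompt_process = FastSAMPrompt(image_path, everything_results, device="cuda")
ann = prompt_process.everything_prompt()                        # all masks
ann = prompt_process.box_prompt(bbox=[200, 200, 300, 300])      # box prompt
ann = prompt_process.text_prompt(text="a photo of a dog")       # text prompt
ann = prompt_process.point_prompt(points=[[620, 360]], pointlabel=[1])

prompt_process.plot(annotations=ann, output_path="./output/dogs.jpg")
```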
In this article, we have looked at FastSAM, an improved version of SAM. Combining the best practices from the YOLACT and YOLOv8-seg models, FastSAM maintains high segmentation quality while achieving a significant boost in prediction speed, accelerating inference by several dozen times compared to the original SAM.
The ability to use prompts with FastSAM provides a flexible approach to retrieving segmentation masks for objects of interest. Moreover, it has been shown that decoupling prompt-guided selection from all-instance segmentation reduces complexity.
Below are some examples of FastSAM usage with different prompts, visually demonstrating that it still retains the high segmentation quality of SAM:


Resources