
The Data-centric AI Concepts in Segment Anything


Unpacking the data-centric AI concepts used in Segment Anything, the first foundation model for image segmentation

Segment Anything dataset construction. Image from the paper https://arxiv.org/pdf/2304.02643.pdf

Artificial Intelligence (AI) has made remarkable progress, especially in developing foundation models, which are trained on vast amounts of data and can be adapted to a wide range of downstream tasks.

A notable success of foundation models is Large Language Models (LLMs). These models can perform complex tasks with great precision, such as language translation, text summarization, and question-answering.

Foundation models are also starting to change the game in Computer Vision. Meta's Segment Anything is a recent development that is causing a stir.

The success of Segment Anything can be attributed to its large labeled dataset, which has played a crucial role in enabling its remarkable performance. The model architecture, as described in the Segment Anything paper, is surprisingly simple and lightweight.

In this article, drawing upon insights from our recent survey papers [1,2], we will take a closer look at Segment Anything through the lens of data-centric AI, a growing concept in the data science community.

What Can Segment Anything Do?

In a nutshell, the image segmentation task is to predict a mask that separates the areas of interest in an image, such as an object, a person, etc. Segmentation is an essential task in Computer Vision, making the image more meaningful and easier to analyze.

The difference between Segment Anything and other image segmentation approaches lies in introducing prompts to specify the segmentation location. Prompts can be vague, such as a point, a box, etc.

The image is a screenshot from https://segment-anything.com/ after uploading an image taken by the author.
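To make the prompt idea concrete, below is a minimal sketch of point-prompted segmentation with Meta's open-source segment-anything package. The checkpoint path, image file, and point coordinates are placeholders, and exact call signatures may vary slightly across package versions.

```python
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM model (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read an image and compute its embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point (x, y) serves as the prompt; label 1 means "foreground".
point_coords = np.array([[500, 375]])
point_labels = np.array([1])

# The model returns candidate masks plus its own mask-quality (IoU) estimates.
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
print(masks.shape, scores)  # e.g. (3, H, W) boolean masks and 3 quality scores
```

A box prompt can be passed in the same way via the `box` argument, which is what makes the prompt interface so flexible.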

What’s Data-centric AI?

Comparison between data-centric AI and model-centric AI. https://arxiv.org/abs/2301.04819. Image by the author.

Data-centric AI is a novel approach to AI system development, which has been gaining traction and is being promoted by AI pioneer Andrew Ng.

Data-centric AI is the discipline of systematically engineering the data used to build an AI system. — Andrew Ng

Previously, our primary focus was on developing better models using data that remained largely unchanged, an approach known as model-centric AI. However, this approach can be problematic in real-world scenarios because it fails to account for issues that may arise in the data, such as inaccurate labels, duplicates, and biases. Consequently, overfitting a dataset does not necessarily lead to better model behavior.

Data-centric AI, in contrast, prioritizes enhancing the quality and quantity of the data used to build AI systems. The focus is on the data itself, with relatively fixed models. Adopting a data-centric approach to developing AI systems holds more promise in real-world applications, since the maximum capability of a model is ultimately determined by the data used for training.

It is crucial to distinguish between "data-centric" and "data-driven" approaches. "Data-driven" methods merely rely on data to guide AI development, but the focus remains on building models rather than engineering data, which makes them fundamentally different from "data-centric" approaches.

The data-centric AI framework encompasses three main objectives:

  • Training data development: gathering and generating high-quality, diverse data to support the training of machine learning models.
  • Inference data development: constructing novel evaluation sets that provide fine-grained insights into the model, or that unlock specific capabilities of the model through engineered data inputs, such as prompt engineering.
  • Data maintenance: ensuring the quality and reliability of data in a continually changing environment.
Data-centric AI framework. https://arxiv.org/abs/2303.10158. Image by the author.

The Model Used in Segment Anything

Segment Anything Model. Image from the paper https://arxiv.org/pdf/2304.02643.pdf

The model design is surprisingly simple. It mainly consists of three parts:

  1. Prompt encoder: obtains the representation of the prompt, either through positional encodings or convolutions.
  2. Image encoder: directly uses a Vision Transformer (ViT) without any special modifications.
  3. Mask decoder: fuses the prompt embedding and the image embedding, using mechanisms such as attention. It is called lightweight because it has only a few layers.

The lightweight mask decoder is interesting, as it allows the model to be deployed easily, even on CPUs alone. Below is the comment from the authors of Segment Anything.

Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks.

Therefore, the secret behind Segment Anything's strong performance is very likely not the model design, since the architecture is quite simple and lightweight.
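To illustrate how the three parts fit together, here is a highly simplified, PyTorch-style sketch. It mirrors only the structure described above, not the actual SAM implementation: the layer choices and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class TinySegmenter(nn.Module):
    """A toy model with SAM's three-part structure: image encoder,
    prompt encoder, and a lightweight mask decoder. Dimensions are illustrative."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Stand-in for the heavy ViT image encoder (a single patchifying conv here).
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Stand-in for the prompt encoder: maps an (x, y) point to an embedding.
        self.prompt_encoder = nn.Linear(2, embed_dim)
        # Lightweight decoder: cross-attention between image tokens and the prompt,
        # followed by a small head that scores each image token as mask / not mask.
        self.cross_attention = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Linear(embed_dim, 1)

    def forward(self, image: torch.Tensor, point: torch.Tensor) -> torch.Tensor:
        b = image.shape[0]
        img_emb = self.image_encoder(image)                    # (B, C, H/16, W/16)
        h, w = img_emb.shape[-2:]
        img_tokens = img_emb.flatten(2).transpose(1, 2)        # (B, H*W/256, C)
        prompt_emb = self.prompt_encoder(point).unsqueeze(1)   # (B, 1, C)
        # Let image tokens attend to the prompt, then score each token.
        fused, _ = self.cross_attention(img_tokens, prompt_emb, prompt_emb)
        logits = self.mask_head(fused).squeeze(-1)             # (B, H*W/256)
        return logits.view(b, h, w)                            # low-resolution mask logits

# Usage: one 3x256x256 image and one (x, y) point prompt.
model = TinySegmenter()
mask_logits = model(torch.randn(1, 3, 256, 256), torch.tensor([[128.0, 96.0]]))
print(mask_logits.shape)  # torch.Size([1, 16, 16])
```

The point is that almost all of the capacity sits in the image encoder; the decoder that combines the two embeddings can stay very small.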

Data-centric AI Concepts in Segment Anything

The core of training Segment Anything lies in a large annotated dataset containing more than a billion masks, 400 times more than any existing segmentation dataset. How did they achieve this? The authors used a data engine to perform the annotation, which can be broadly divided into three stages:

  1. Assisted-manual stage. This step can be understood as an active learning process. First, an initial model is trained on public datasets. Next, annotators correct the predicted masks. Finally, the model is retrained on the newly annotated data. These three steps were repeated six times, ultimately resulting in 4.3 million mask annotations.
  2. Semi-automatic stage. The goal of this step is to increase the diversity of masks, and it can also be understood as an active learning process. In simple terms, if the model can automatically generate good masks, human annotators don't need to label them, and human effort can focus on masks where the model is not confident enough. The strategy used to find confident masks is quite interesting: it involves running object detection on the masks predicted in the first stage. For example, suppose there are 20 possible masks in an image. We first use the current model for segmentation, but it can only annotate a portion of the masks well. We now need to automatically identify which masks are good (confident). The paper's approach is to run an object detector on the predicted masks and check whether an object is detected in them; if so, the corresponding mask is considered confident. Suppose this process identifies eight confident masks; annotators then label the remaining 12, saving human effort (see the first sketch after this list). This process was repeated five times, adding another 5.9 million mask annotations.
  3. Fully automatic stage. Simply put, this step uses the model trained in the previous stages to annotate data. Several strategies were used to improve annotation quality, including:
    Filtering less confident masks based on predicted Intersection over Union (IoU) values (the model has a dedicated head that predicts IoU).
    Keeping only stable masks, meaning that if the threshold is adjusted slightly above or below 0.5, the masks remain mostly unchanged. Specifically, for each pixel, the model outputs a value between 0 and 1, and we typically use 0.5 as the threshold to decide whether a pixel belongs to the mask. Stability means that when the threshold is shifted around 0.5 (e.g., from 0.45 to 0.55), the resulting mask stays largely unchanged, indicating that the model's per-pixel predictions fall clearly on one side or the other of the decision boundary (see the second sketch after this list).
    Removing duplicate masks with non-maximal suppression (NMS).
    This step annotated 1.1 billion masks (an increase of more than 100 times in quantity).
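The first sketch below illustrates the confidence filtering of the semi-automatic stage. The `detect_objects` helper is purely hypothetical, standing in for whatever object detector is used; the real data engine's implementation details are not reproduced here.

```python
import numpy as np

def split_confident_masks(image, predicted_masks, detect_objects):
    """Split model-predicted masks into 'confident' masks (a detector finds an
    object in the masked region) and masks routed to human annotators.

    detect_objects(region) is a hypothetical helper: it returns True if an
    object detector fires on the given image region.
    """
    confident, needs_annotation = [], []
    for mask in predicted_masks:                       # mask: boolean array (H, W)
        region = np.where(mask[..., None], image, 0)   # black out pixels outside the mask
        if detect_objects(region):
            confident.append(mask)                     # auto-accepted, no human effort needed
        else:
            needs_annotation.append(mask)              # sent to human annotators
    return confident, needs_annotation
```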
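And the second sketch shows the two per-mask quality checks of the fully automatic stage. The thresholds are illustrative placeholders, and the sketch works on per-pixel probability maps rather than the raw logits used in the released SAM code.

```python
import numpy as np

def stability_score(prob_map: np.ndarray, center: float = 0.5, offset: float = 0.05) -> float:
    """IoU between the masks obtained at thresholds just below and just above `center`.

    A score close to 1.0 means the mask barely changes when the threshold moves
    (e.g., from 0.45 to 0.55), i.e., pixel probabilities sit far from the boundary.
    """
    loose_mask = prob_map > (center - offset)    # threshold 0.45: more pixels included
    strict_mask = prob_map > (center + offset)   # threshold 0.55: fewer pixels included
    intersection = np.logical_and(loose_mask, strict_mask).sum()
    union = np.logical_or(loose_mask, strict_mask).sum()
    return float(intersection / union) if union > 0 else 1.0

def keep_mask(prob_map: np.ndarray, predicted_iou: float,
              iou_thresh: float = 0.88, stability_thresh: float = 0.95) -> bool:
    """Keep a candidate mask only if the model's own IoU prediction and the
    stability score both clear their (illustrative) thresholds; duplicates among
    the survivors would then be removed with NMS."""
    return predicted_iou >= iou_thresh and stability_score(prob_map) >= stability_thresh
```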

Does this process sound familiar? That's right, the Reinforcement Learning from Human Feedback (RLHF) used in ChatGPT is quite similar to the process described above. The commonality between the two approaches is that instead of relying directly on humans to annotate data, a model is first trained with human inputs and then used to annotate the data. In RLHF, a reward model is trained to provide rewards for reinforcement learning, while in Segment Anything, the model is trained to annotate images directly.

Summary

The core contribution of Segment Anything lies in its large annotated dataset, demonstrating the crucial importance of the data-centric AI concept. The success of foundation models in the computer vision field can be considered inevitable, but it is surprising that it happened so quickly. Going forward, I believe other AI subfields, and even non-AI and non-computer-related fields, will see the emergence of foundation models as well.

Regardless of how technology evolves, improving data quality and quantity will always be an effective way to enhance AI performance, making the concept of data-centric AI increasingly important.

I hope this article can inspire you in your own work. You can learn more about the data-centric AI framework in the following papers/resources:

  • Data-centric Artificial Intelligence: A Survey (https://arxiv.org/abs/2303.10158)
  • Data-centric AI: Perspectives and Challenges (https://arxiv.org/abs/2301.04819)

If you found this article interesting, you may also want to check out my previous article: What Are the Data-Centric AI Concepts behind GPT Models?

Stay tuned!
