
YOLO-World: Real-Time Open-Vocabulary Object Detection


Object detection has been a fundamental challenge in the computer vision industry, with applications in robotics, image understanding, autonomous vehicles, and image recognition. In recent years, groundbreaking work in AI, particularly through deep neural networks, has significantly advanced object detection. Nevertheless, these models have a fixed vocabulary, limited to detecting objects within the 80 categories of the COCO dataset. This limitation stems from the training process, where object detectors are trained to recognize only specific categories, thus limiting their applicability.

To overcome this, we introduce YOLO-World, an innovative approach aimed at enhancing the YOLO (You Only Look Once) framework with open-vocabulary detection capabilities. This is achieved by pre-training the framework on large-scale datasets and implementing a vision-language modeling approach. Specifically, YOLO-World employs a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to foster interaction between linguistic and visual information. Through RepVL-PAN and region-text contrastive loss, YOLO-World can accurately detect a wide range of objects in a zero-shot setting, showing remarkable performance on open-vocabulary segmentation and object detection tasks.

This article aims to provide a thorough understanding of YOLO-World's technical foundations, model architecture, training process, and application scenarios. Let's dive in.

YOLO, or You Only Look Once, is one of the most popular methods for modern-day object detection in the computer vision industry. Renowned for its incredible speed and efficiency, the arrival of the YOLO mechanism has revolutionized the way machines interpret and detect specific objects within images and videos in real time. Traditional object detection frameworks implement a two-step approach: in the first step, the framework proposes regions that might contain the object, and in the second step, it classifies the object within those regions. The YOLO framework, however, integrates these two steps into a single neural network model, an approach that allows the framework to look at the image just once to predict the object and its location within the image, hence the name YOLO, or You Only Look Once.

Moreover, the YOLO framework treats object detection as a regression problem, and predicts the class probabilities and bounding boxes directly from the full image in a single glance. This approach not only increases the speed of the detection process, but also enhances the model's ability to generalize from complex and diverse data, making it a suitable choice for real-time applications like autonomous driving, speed detection, or number plate recognition. Furthermore, the significant advancement of deep neural networks in the past few years has also contributed greatly to the development of object detection frameworks, but their success is still limited because they can detect objects only within a limited vocabulary. This is primarily because once the object categories are defined and labeled in the dataset, the trained detectors can recognize only those specific categories, which limits the applicability of object detection models in real-time, open scenarios.
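To make the contrast with open-vocabulary detection concrete, here is a minimal sketch of how a conventional closed-vocabulary detector is used in practice. It assumes the ultralytics Python package and a COCO-pretrained yolov8n.pt checkpoint are available; both are assumptions about tooling, not part of the YOLO-World release.

```python
# A minimal sketch, assuming the ultralytics package and a COCO-pretrained
# yolov8n.pt checkpoint are available locally.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # fixed 80-category COCO vocabulary
results = model("street_scene.jpg")  # single forward pass: boxes + class scores

for box in results[0].boxes:
    cls_id = int(box.cls)            # index into the closed vocabulary
    print(model.names[cls_id], float(box.conf), box.xyxy.tolist())
```

Whatever prompt the user has in mind, this detector can only ever return one of the 80 labels it was trained on, which is exactly the limitation YOLO-World sets out to remove.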

Moving along, recently developed vision-language models employ distilled vocabulary knowledge from language encoders to handle open-vocabulary detection. Although these frameworks perform better than traditional object detection models on open-vocabulary detection, they still have limited applicability owing to the scarce availability of training data with sufficient vocabulary diversity. Moreover, certain frameworks train open-vocabulary object detectors at scale and frame the training of object detectors as region-level vision-language pre-training. Nevertheless, this approach still struggles to detect objects in real time for two primary reasons: a complex deployment process on edge devices, and heavy computational requirements. On the positive side, these frameworks have demonstrated encouraging results from pre-training large detectors to equip them with open recognition capabilities.

The YOLO-World framework aims to achieve highly efficient open-vocabulary object detection and explores the potential of large-scale pre-training approaches to boost the efficiency of traditional YOLO detectors for open-vocabulary detection. In contrast to previous work in object detection, the YOLO-World framework displays remarkable efficiency with high inference speeds and can be deployed on downstream applications with ease. The YOLO-World model follows the traditional YOLO architecture and encodes input texts by leveraging the capabilities of a pre-trained CLIP text encoder. Moreover, the YOLO-World framework includes a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) component in its architecture to connect image and text features for enhanced visual-semantic representations. During the inference phase, the framework removes the text encoder and re-parameterizes the text embeddings into RepVL-PAN weights, resulting in efficient deployment. The framework also incorporates region-text contrastive learning to study open-vocabulary pre-training methods for traditional YOLO models. The region-text contrastive learning method unifies image-text data, grounding data, and detection data into region-text pairs. Building on this, the YOLO-World framework pre-trained on region-text pairs demonstrates remarkable capabilities for open and large-vocabulary detection. Moreover, the YOLO-World framework also explores a prompt-then-detect paradigm with the aim of enhancing the efficiency of open-vocabulary object detection in real-time, real-world scenarios.

As demonstrated in the following image, traditional object detectors focus on closed-set detection with a fixed vocabulary of predefined categories, whereas open-vocabulary detectors detect objects by encoding user prompts with text encoders for an open vocabulary. In comparison, YOLO-World's prompt-then-detect approach first builds an offline vocabulary (different vocabularies for different needs) by encoding the user prompts, allowing the detector to work against the offline vocabulary in real time without having to re-encode the prompts.

YOLO-World: Method and Architecture

Region-Text Pairs

Traditionally, object detection frameworks, including the YOLO family of object detectors, are trained using instance annotations that contain category labels and bounding boxes. In contrast, the YOLO-World framework re-formulates the instance annotations as region-text pairs, where the text can be the description of the object, a noun phrase, or a category name. It is worth stating that the YOLO-World framework takes both texts and images as input, and outputs predicted boxes with their corresponding object embeddings.
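As a rough illustration of this formulation, a single training sample could be represented as a list of box-text pairs. The field names below are invented for clarity and do not come from the YOLO-World codebase.

```python
# Hypothetical illustration of region-text pairs; field names are invented
# for clarity and are not taken from the YOLO-World implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class RegionTextPair:
    box: List[float]   # bounding box [x1, y1, x2, y2] in pixel coordinates
    text: str          # category name, noun phrase, or object description

sample = [
    RegionTextPair(box=[34.0, 50.0, 210.0, 300.0], text="a brown dog"),
    RegionTextPair(box=[220.0, 80.0, 400.0, 260.0], text="frisbee"),
]
```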

Model Architecture

At its core, the YOLO-World model consists of a Text Encoder, a YOLO detector, and the Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) component, as illustrated in the next image. 

For an input text, the text encoder component encodes the text into text embeddings, while the image encoder in the YOLO detector component extracts multi-scale features from the input image. The Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) component then exploits cross-modality fusion between the text embeddings and the image features to enhance both the text and image representations.
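Putting the three components together, the overall data flow can be sketched as follows. The module names are placeholders standing in for the components described above, not the actual YOLO-World implementation.

```python
# High-level sketch of the YOLO-World forward pass; module names are
# placeholders rather than the real implementation.
def yolo_world_forward(image, prompts, text_encoder, yolo_backbone, repvl_pan, head):
    text_embeddings = text_encoder(prompts)              # (num_prompts, dim)
    multi_scale_feats = yolo_backbone(image)              # e.g. stride 8/16/32 features
    fused_feats, text_embeddings = repvl_pan(multi_scale_feats, text_embeddings)
    boxes, object_embeddings = head(fused_feats)
    # classification comes from the similarity between object and text embeddings
    return boxes, object_embeddings, text_embeddings
```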

YOLO Detector

The YOLO-World model is built on top of the existing YOLOv8 framework, which contains a Darknet backbone as its image encoder, a head for object embeddings and bounding box regression, and a PAN, or Path Aggregation Network, for multi-scale feature pyramids.

Text Encoder

For a given text, the YOLO-World model extracts the corresponding text embeddings by adopting a pre-trained CLIP Transformer text encoder with a certain number of nouns and a fixed embedding dimension. The primary reason why the YOLO-World framework adopts a CLIP text encoder is that it offers better visual-semantic capability for connecting texts with visual objects, significantly outperforming traditional text-only language encoders. However, if the input text is either a caption or a referring expression, the YOLO-World model opts for a simpler n-gram algorithm to extract the phrases. These phrases are then fed to the text encoder.
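The sketch below shows one way to obtain CLIP text embeddings using the Hugging Face transformers library. YOLO-World uses a pre-trained CLIP text encoder, but the specific checkpoint and library shown here are assumptions for illustration.

```python
# A minimal sketch of extracting CLIP text embeddings with Hugging Face
# transformers; the checkpoint choice is an assumption, not YOLO-World's exact setup.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["person", "yellow school bus", "dog catching a frisbee"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    text_embeds = text_model(**inputs).text_embeds      # (3, 512) projected embeddings

text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
```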

Text Contrastive Head

A decoupled head is a component utilized by earlier object detection models, and the YOLO-World framework adopts a decoupled head with two 3×3 convolutions to regress object embeddings and bounding boxes for a fixed number of objects. The YOLO-World framework employs a text contrastive head to obtain the object-text similarity, applying L2 normalization to both the object and text embeddings. Moreover, the YOLO-World model also employs an affine transformation with a shifting factor and a learnable scaling factor, with the L2 normalization and affine transformation enhancing the stability of the model during region-text training.
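In code, the similarity computed by the text contrastive head boils down to a normalized dot product followed by a learnable affine transformation. The sketch below illustrates that computation; variable names are illustrative.

```python
# Sketch of the text contrastive head's similarity: L2-normalize both embeddings,
# take the dot product, then apply a learnable affine transform (scale alpha, shift beta).
import torch
import torch.nn.functional as F

def object_text_similarity(obj_emb, txt_emb, alpha, beta):
    obj_emb = F.normalize(obj_emb, dim=-1)        # (num_objects, dim)
    txt_emb = F.normalize(txt_emb, dim=-1)        # (num_texts, dim)
    return alpha * obj_emb @ txt_emb.t() + beta   # (num_objects, num_texts) similarities
```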

Online Vocabulary Training

During the training phase, the YOLO-World model constructs an online vocabulary for each mosaic sample, which consists of 4 images. The model samples all positive nouns included in the mosaic images and randomly samples some negative nouns from the corresponding dataset. The vocabulary for each sample contains a maximum of n nouns, with the default value being 80.
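An illustrative sketch of this sampling step is shown below: keep every positive noun found in the mosaic, then pad with random negatives from the dataset's noun pool up to the maximum of 80 entries. Function and variable names are assumptions, not the paper's code.

```python
# Illustrative sketch of building an online vocabulary for one mosaic sample.
import random

def build_online_vocabulary(positive_nouns, dataset_nouns, max_nouns=80):
    vocab = list(dict.fromkeys(positive_nouns))           # de-duplicated positives
    negatives = [n for n in dataset_nouns if n not in vocab]
    num_negatives = max(0, max_nouns - len(vocab))
    vocab += random.sample(negatives, min(num_negatives, len(negatives)))
    return vocab
```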

Offline Vocabulary Inference

During inference, the YOLO-World model uses a prompt-then-detect strategy with an offline vocabulary to further enhance the efficiency of the model. The user first defines a series of custom prompts, which may include categories or even captions. The YOLO-World model then obtains the offline vocabulary embeddings by using the text encoder to encode these prompts. As a result, the offline vocabulary allows the model to avoid re-computing text embeddings for every input, and also allows the vocabulary to be adjusted flexibly based on the requirements.
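As a concrete illustration of this workflow, the sketch below uses the ultralytics YOLOWorld wrapper with a yolov8s-world.pt checkpoint; the wrapper and checkpoint names are assumptions about the available tooling rather than part of the original paper.

```python
# Sketch of the prompt-then-detect workflow, assuming the ultralytics
# YOLOWorld wrapper and the yolov8s-world.pt checkpoint are available.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")

# Build the offline vocabulary once: the prompts are encoded into text
# embeddings and cached, so later predictions skip the text encoder entirely.
model.set_classes(["red backpack", "person wearing a helmet", "traffic cone"])

results = model.predict("warehouse.jpg")   # detection against the cached vocabulary
results[0].show()
```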

Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN)

The following figure illustrates the structure of the proposed Re-parameterizable Vision-Language Path Aggregation Network, which follows top-down and bottom-up paths to establish the feature pyramid from multi-scale image features.

To enhance the interaction between text and image features, the YOLO-World model proposes an Image-Pooling Attention and a Text-guided CSPLayer (Cross-Stage Partial Layers), with the ultimate aim of improving the visual-semantic representations for open-vocabulary capabilities. During inference, the YOLO-World model re-parameterizes the offline vocabulary embeddings into the weights of the linear or convolutional layers for efficient deployment.
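To give a sense of the re-parameterization idea, once the offline vocabulary embeddings are fixed, the dot product between object embeddings and text embeddings can be folded into an ordinary 1×1 convolution. The sketch below illustrates that idea only; it is not the actual YOLO-World implementation.

```python
# Hedged sketch of deployment-time re-parameterization: fold cached text
# embeddings (num_classes x dim) into a plain 1x1 convolution.
import torch
import torch.nn as nn

def fold_vocabulary_into_conv(text_embeddings: torch.Tensor) -> nn.Conv2d:
    num_classes, dim = text_embeddings.shape
    conv = nn.Conv2d(dim, num_classes, kernel_size=1, bias=False)
    with torch.no_grad():
        conv.weight.copy_(text_embeddings.view(num_classes, dim, 1, 1))
    return conv

# usage: class_score_map = fold_vocabulary_into_conv(vocab_embeds)(object_embedding_map)
```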

As can be seen in the above figure, the YOLO-World model utilizes the CSPLayer after the top-down or bottom-up fusion and incorporates text guidance into the multi-scale image features, forming the Text-guided CSPLayer and thus extending the CSPLayer. For any given image feature and its corresponding text embedding, the model applies max-sigmoid attention after the last bottleneck block to aggregate text features into the image features. The updated image feature is then concatenated with the cross-stage features and presented as the output.
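The max-sigmoid guidance can be sketched as follows: for each spatial location, take the maximum similarity between the image feature and all text embeddings, squash it with a sigmoid, and use it to re-weight the image feature. Shapes and names below are illustrative, not the exact YOLO-World code.

```python
# Sketch of max-sigmoid text guidance inside the Text-guided CSPLayer.
import torch

def max_sigmoid_attention(img_feat, txt_emb):
    # img_feat: (B, C, H, W); txt_emb: (B, T, C)
    B, C, H, W = img_feat.shape
    flat = img_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
    sim = flat @ txt_emb.transpose(1, 2)            # (B, H*W, T) similarities
    gate = sim.max(dim=-1).values.sigmoid()         # (B, H*W) max over texts, squashed
    return img_feat * gate.view(B, 1, H, W)         # re-weighted image feature
```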

Moving on, the YOLO-World model aggregates image features to update the text embeddings by introducing an Image-Pooling Attention layer, which enriches the text embeddings with image-aware information. Instead of applying cross-attention directly on the image features, the model max-pools each of the multi-scale features into 3×3 regions, resulting in 27 patch tokens in total, over which the text embeddings are then updated.
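A minimal sketch of this mechanism is shown below, assuming three feature scales that share the same channel dimension; the module choices (adaptive max pooling, multi-head attention, residual update) are illustrative assumptions rather than the exact implementation.

```python
# Sketch of Image-Pooling Attention: pool each scale to 3x3 (27 tokens total)
# and let the text embeddings cross-attend over them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePoolingAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, txt_emb, multi_scale_feats):
        # txt_emb: (B, T, C); multi_scale_feats: list of (B, C, H, W), shared C assumed
        tokens = []
        for feat in multi_scale_feats:
            pooled = F.adaptive_max_pool2d(feat, output_size=3)  # (B, C, 3, 3)
            tokens.append(pooled.flatten(2).transpose(1, 2))     # (B, 9, C)
        tokens = torch.cat(tokens, dim=1)                        # (B, 27, C)
        updated, _ = self.attn(txt_emb, tokens, tokens)          # text queries image tokens
        return txt_emb + updated                                 # residual text update
```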

Pre-Training Schemes

The YOLO-World model follows two primary pre-training schemes: Learning from Region-Text Contrastive Loss and Pseudo Labeling with Image-Text Data. For the first pre-training scheme, the model outputs object predictions along with annotations for the given texts and mosaic samples. The YOLO-World framework matches the predictions with ground-truth annotations by leveraging task-aligned label assignment, and assigns each positive prediction a text index that serves as its classification label. The Pseudo Labeling with Image-Text Data pre-training scheme, on the other hand, proposes an automatic labeling approach that turns image-text pairs into region-text pairs instead of using the image-text pairs directly. The proposed labeling approach consists of three steps: noun phrase extraction, pseudo labeling, and filtering. The first step utilizes the n-gram algorithm to extract noun phrases from the input text; the second step adopts a pre-trained open-vocabulary detector to generate pseudo boxes for the given noun phrases in each image; and the third and final step employs a pre-trained CLIP framework to evaluate the relevance of the region-text and image-text pairs, after which the model filters out low-relevance pseudo images and annotations.
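The three-step labeling pipeline can be sketched as a single loop. All callables below are placeholders standing in for the components named above (the n-gram phrase extractor, the open-vocabulary detector, and the CLIP relevance scorer), and the threshold value is an assumption.

```python
# Hedged sketch of the pseudo-labeling pipeline: extract noun phrases, generate
# pseudo boxes with an open-vocabulary detector, then score and filter with CLIP.
def pseudo_label(image, caption, extract_noun_phrases, ovd_detector, clip_score,
                 threshold=0.3):
    region_text_pairs = []
    for phrase in extract_noun_phrases(caption):             # step 1: n-gram phrases
        for box in ovd_detector(image, phrase):               # step 2: pseudo boxes
            if clip_score(image, box, phrase) >= threshold:   # step 3: CLIP relevance filter
                region_text_pairs.append((box, phrase))
    return region_text_pairs
```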

YOLO-World: Results

Once the YOLO-World model has been pre-trained, it is evaluated directly on the LVIS dataset in a zero-shot setting. The LVIS dataset contains over 1,200 categories, significantly more than the pre-training datasets used by existing frameworks to test their performance on large-vocabulary detection. The following figure compares the performance of the YOLO-World framework with some of the existing state-of-the-art object detection frameworks on the LVIS dataset in a zero-shot setting.

As can be observed, the YOLO-World framework outperforms a majority of existing frameworks in terms of inference speed and zero-shot performance, even against frameworks like Grounding DINO, GLIP, and GLIPv2 that incorporate more data. Overall, the results reveal that small object detection models like YOLO-World-S, with only 13 million parameters, can be used for pre-training on vision-language tasks and still deliver remarkable open-vocabulary capabilities.

Final Thoughts

In this text, we’ve got talked about YOLO-World, an progressive approach that goals to reinforce the talents of the YOLO or You Only Look Once framework with open vocabulary detection capabilities by pre-training the framework on large-scale datasets, and implementing the vision-language modeling approach. To be more specific, the YOLO-World framework proposes to implement a Re-parameterizable Vision Language Path Aggregation Network or RepVL-PAN together with region-text contrastive loss to facilitate an interaction between the linguistic and the visual information. By implementing RepVL-PAN and region-text contrastive loss, the YOLO-World framework is in a position to accurately and effectively detect a big selection of objects in a zero-shot setting.
