Compare and Evaluate Object Detection Models From TorchVision
Introduction
What’s Object Detection
Finetuning Pre-trained Models
Image Data Formats
Evaluation Metrics for Object Detection
Challenges of Comparing Object Detection Models
Using Comet for Object Detection
Single Experiment View
Visualize Outputs
Project Level View
Conclusion

Visualizing the performance of Fast RCNN, Faster RCNN, Mask RCNN, RetinaNet, and FCOS

Comparing FastRCNN, FasterRCNN, MaskRCNN, FCOS, RetinaNet object detection models from PyTorch using Comet’s Image Panel
Comparing object detection models from PyTorch; Image by author

Object detection is one of the most popular applications of machine learning for computer vision. A detection model predicts both the class labels and locations of each distinct object in an image. Object detection models have a wide range of applications, including manufacturing, surveillance, health care, and more. TorchVision is a Python package that extends the PyTorch framework for computer vision use cases. But how can you systematically find the best model for a particular use case? I’m going to be using Comet to log my experiment data (full disclosure: I work at Comet), but feel free to use whatever tooling you prefer.

Follow along with the full code in the Colab and give it a try!

Object detection is a computer vision task that aims to identify instances of objects in images and assign them to specific classes. At a low level, object detection seeks to answer the question, “what objects are where?”

GIF of a green turtle underwater with many yellow fish in the background as an object detection model predicts multiple class labels
Detecting sea animals; GIF by author

Object detection algorithms are generally separated into two categories: single-stage (RetinaNet, SSD, FCOS, YOLO, etc.) and two-stage (Fast RCNN, Mask RCNN, FPN, etc.). In two-stage detectors, one model is used to extract generalized regions of objects, and a second model is used to classify and further refine the location of each object. Single-stage detectors do all of this in a single step. Single-stage detectors tend to be faster and less computationally expensive than two-stage detectors, but they are also generally less accurate.

A diagram of the differences between one-stage object detection models and two-stage object detection models, focusing on the presence of an object proposal component.
Image from Semantic Image Cropping, by Oriol Corcoll Andreu

The best object detection models are trained on tens, if not hundreds, of thousands of labeled images. What’s more, image datasets themselves are inherently computationally expensive to process. Training an object detection model from scratch requires a great deal of time and resources that aren’t always available, and training several object detection models for comparison requires even more. Thankfully, we don’t have to. Instead, we can use transfer learning or fine-tune pre-trained models.

In essence, both of these methods allow us to take advantage of the weights and biases learned from one task and repurpose them on a new task. By leveraging feature representations from a pre-trained model, we don’t have to train a new model from scratch, saving us time and compute resources. What’s more, these methods can contribute to rapid boosts in model performance for little overhead.

A graph showing the increased accuracy and performance of a model that uses transfer learning, compared with one that does not use transfer learning.
Image from A graphic representation of the potential advantages of transfer learning, by Laura Aelenei

Transfer learning and fine-tuning are similar processes, but with one key difference. In transfer learning, all previously trained layers are frozen, and (optionally) additional layers are added for retraining. In fine-tuning, all previously trained layers are retrained, but at a very low learning rate. In both cases, models typically see boosted initial performance, steeper improvement slopes, and elevated final performance.

TorchVision’s Pre-Trained Models


TorchVision’s detection module comes with several pre-trained models built in. For this tutorial we will be comparing Fast R-CNN, Faster R-CNN, Mask R-CNN, RetinaNet, and FCOS, with either ResNet50 or MobileNet v2 backbones. Each of these models was previously trained on the COCO dataset. We’ll download the trained models, replace the classifier heads to reflect our target classes, and retrain the models on our own data, as sketched below.
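Here is a minimal sketch of that head-swapping step for Faster R-CNN (the exact head attributes differ for RetinaNet, FCOS, and Mask R-CNN, the weights argument assumes torchvision 0.13 or newer, and the feature_extract flag simply mirrors the hyperparameter we log later):

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a Faster R-CNN model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Optionally freeze the backbone for pure transfer learning (feature extraction)
feature_extract = False
if feature_extract:
    for param in model.backbone.parameters():
        param.requires_grad = False

# Replace the box predictor head with one sized for our classes (pedestrian + background)
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)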

In computer vision, images are represented as matrices of pixel intensity values. Black and white (grayscale) images are usually two-dimensional, and color images are typically three-dimensional, with one “layer” each for the red, green, and blue pixels.

Just as there are several ways to represent images, there are also several ways we can represent our labels and predictions. In the full code for this tutorial, we provide methods for logging bounding boxes, segmentation masks, and polygon annotations to an experiment tracking tool. But when comparing our TorchVision models we will only use bounding boxes, as not all of our models are able to produce the other types of predictions.

To further complicate things, not all algorithms format bounding box annotations in the same way. Below we’ve listed a few of the most common bounding box formats you’re likely to run into, but we’ll be focusing on Pascal VOC and COCO formats in this tutorial.

An image of a cat sleeping in a stone corner next to a green plant with annotation formats from COCO, Pascal VOC, Albumentations, and YOLO plotted on top for comparison.
Image from Albumentations

  • Pascal VOC: [x_min, y_min, x_max, y_max] → [98, 345, 420, 462]
  • Albumentations: normalized [x_min, y_min, x_max, y_max] → [0.153125, 0.71875, 0.65625, 0.9625]
  • COCO: [x_min, y_min, width, height] → [98, 345, 322, 117]
  • YOLO: normalized [x_center, y_center, width, height] → [0.4046875, 0.8614583, 0.503125, 0.24375]

A naive approach to evaluating object detection models might be binary classification (“match” or “no match”, “1” or “0”), but this method leaves little room for nuance. We can do better!

Intersection Over Union (IoU)

The standard evaluation metric for comparing individual bounding boxes is Intersection over Union, or IoU. IoU evaluates the degree of overlap between the ground truth bounding box and the predicted bounding box, producing a value between 0 and 1.

A diagram showing the formula for Intersection Over Union, with example images using two overlapping squares.
Intersection over Union, image from Shivy Yohanandan in Towards Data Science
Using the Comet Image Panel tool to demonstrate the difference between the intersection and union of a set of bounding box predictions.
The difference between the intersection and union of a ground truth bounding box and a prediction bounding box; GIF by author
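As a minimal illustration (assuming Pascal VOC-style [x_min, y_min, x_max, y_max] boxes), IoU can be computed like this:

def iou(box_a, box_b):
    # Coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou([98, 345, 420, 462], [100, 350, 415, 460]))  # roughly 0.92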

IoU is computed for each pair of ground truth and predicted bounding boxes in an image, and then a threshold is applied. If an IoU meets the threshold, the prediction is marked as a “true positive.” All predictions not marked as true positives are marked as “false positives,” and any items left in our ground truth annotations list are marked as “false negatives.” The decision to mark a detection as TP, FP, or FN is entirely contingent on the choice of IoU threshold. The IoU threshold is commonly set at 0.5, but you may wish to experiment with this number. Once we’ve calculated our confusion matrix, we can compute precision and recall.

A row of three images showing the difference between True Positives (TP), False Positives (FP), and False Negatives (FN), using bounding boxes predicting the locations of turtles and fish in an image.
True positives, false positives, and false negatives; note that we don’t calculate true negatives in object detection. Image by author
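To make that thresholding step concrete, here is a rough greedy-matching sketch for a single image and class (it builds on the iou helper above; real evaluators such as COCO’s also sort predictions by confidence and sweep multiple IoU thresholds):

def match_detections(gt_boxes, pred_boxes, iou_threshold=0.5):
    # Greedily match each prediction to the best remaining ground truth box
    matched_gt = set()
    tp, fp = 0, 0
    for pred in pred_boxes:  # ideally iterated in order of descending confidence
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(gt_boxes):
            if i in matched_gt:
                continue
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_iou >= iou_threshold:
            tp += 1
            matched_gt.add(best_idx)
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched_gt)  # unmatched ground truths are misses
    return tp, fp, fn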

Mean Average Precision (mAP) and Mean Average Recall (mAR)

Precision (also known as the positive predictive value) is the degree of exactness of the model in identifying only relevant objects. The equation for precision is:

Precision = TP / (TP + FP)

Recall (also known as sensitivity) measures the ability of the model to detect all ground truth objects. The equation for recall is:

Recall = TP / (TP + FN)

In an ideal world, our perfect model would have a precision and recall of 1, meaning it produced zero false negatives and zero false positives. But in the real world this isn’t generally achievable. A precision-recall curve plots the value of precision against recall for different confidence thresholds. The area under this curve is also known as the Average Precision (AP). Average Recall (AR) describes twice the area under the recall-IoU curve.

Average precision is equal to the area under the Precision-Recall (PR) curve.
Average precision is equal to the area under the PR curve; image by author.

Mean Average Precision (mAP) and mean Average Recall (mAR) are calculated by taking the weighted mean of the AP or AR over all classes and/or over all IoU thresholds. They are two of the most common evaluation metrics for object detection and are used to evaluate submissions in popular computer vision competitions like the COCO and Pascal VOC challenges. We can derive many other metrics from mAP and mAR, including mAP across scales, at different IoU thresholds, and with a maximum number of detections per image.

A diagram of the 12 metrics used for characterizing the performance of an object detector with the COCO evaluator
The 12 metrics used for characterizing the performance of an object detector with the COCO evaluator; image from COCO

Other metrics

If we were building a model to detect very large objects (relative to the image’s field of view), we might be willing to consider models with poor “AP_small” scores, as this metric would be less relevant to our use case. If we were planning on using our model to assist in medical diagnoses, we might place a higher emphasis on mAR values than mAP values, since missing a positive sample would be far more costly than raising a false alarm.

For this tutorial, we use a combination of the mAP and mAR values calculated by the COCO evaluator and the torchmetrics.detection module. We’ll also log all relevant values to a DataFrame to give us a fuller picture of how different models perform in different scenarios. Finally, we’ll select our “best” model accordingly.
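A minimal sketch of the torchmetrics side (the import path can vary slightly between torchmetrics versions, and the boxes, scores, and labels below are made-up placeholders):

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")  # Pascal VOC-style corners

# One dict per image; boxes are [x_min, y_min, x_max, y_max]
preds = [{
    "boxes": torch.tensor([[98.0, 345.0, 420.0, 462.0]]),
    "scores": torch.tensor([0.84]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[100.0, 350.0, 415.0, 460.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
results = metric.compute()  # dict with map, map_50, map_small, mar_100, and friends
print({k: round(float(v), 3) for k, v in results.items() if v.numel() == 1})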

Comparing object detection models can be difficult for a number of reasons. Different models have different requirements when it comes to input size and shape, annotation formats, and other dataset attributes. Hyperparameters vary from algorithm to algorithm, and keeping track of which values produce which results can quickly become tedious and overwhelming.

Most computer vision pipelines incorporate image augmentation in some form, so dataset versioning becomes essential for reproducibility and explainability. What’s more, performance metrics only tell part of the story when it comes to object detection. Often, it’s crucial to visualize prediction annotations to understand where things are going right, and where they’re going wrong.


A graph showing that the Idealized accuracy-speed tradeoff has a logarithmic relationship
Idealized accuracy-speed tradeoff has a logarithmic relationship; image by Jeff Miller on ResearchGate

In real-world applications, we often make choices to balance accuracy and speed. The performance of a model under a given set of circumstances may not be relevant if we aren’t able to replicate those circumstances in production. So when searching for the “best” object detection model, it becomes essential to monitor a wide range of metrics relevant to your particular use case.

Clearly, comparing object detection models isn’t as simple as just minimizing a single loss function. We have a fairly wide range of metrics to calculate and log, some of which need to be visualized to fully understand, and each model has its own graph definition, set of hyperparameters, code output, and other features. To help keep track of all of these moving pieces, we’ll log our inputs, metrics, and outputs to Comet, an experiment tracking tool. By visualizing our data in an experiment tracking tool, we’ll be able to get a much more complete understanding of how each of our models behaves, under which circumstances, and with which data.

For this tutorial, we’ll be using the Penn-Fudan dataset, which consists of 170 images labeled with 345 instances of pedestrians. Pedestrian detection has several applications, including surveillance, training self-driving cars, and other traffic safety applications. Since we’re using PyTorch, we’ll need to define a custom dataset class that inherits from the torch.utils.data.Dataset class (a condensed sketch follows the example image below).

Example image from the PennFudan dataset with bounding box and segmentation mask labels, as shown in the Comet UI with the Image Panel.
Example image from the PennFudan dataset with bounding box and mask labels; GIF by author
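A condensed sketch of such a dataset class, closely following the official TorchVision detection tutorial (transforms and mask targets are trimmed for brevity; the boxes are derived from the dataset’s instance masks):

import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class PennFudanDataset(Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        self.imgs = sorted(os.listdir(os.path.join(root, "PNGImages")))
        self.masks = sorted(os.listdir(os.path.join(root, "PedMasks")))

    def __getitem__(self, idx):
        img = Image.open(os.path.join(self.root, "PNGImages", self.imgs[idx])).convert("RGB")
        mask = np.array(Image.open(os.path.join(self.root, "PedMasks", self.masks[idx])))

        boxes = []
        for obj_id in np.unique(mask)[1:]:  # the first id is the background
            ys, xs = np.where(mask == obj_id)
            boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])  # Pascal VOC corners

        target = {
            "boxes": torch.as_tensor(boxes, dtype=torch.float32),
            "labels": torch.ones((len(boxes),), dtype=torch.int64),  # one class: pedestrian
            "image_id": torch.tensor([idx]),
        }
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.imgs)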

All of the models in TorchVision’s detection module expect bounding boxes in Pascal VOC format, so we’ll format our boxes accordingly in our Dataset class. We’ll then need to convert the models’ predictions from Pascal VOC to COCO format for use with the COCO evaluator and Comet. If you’re using this tutorial with your own model, check your specific model’s annotation requirements to ensure proper formatting of all inputs.
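Since the two formats differ only in how the second corner is expressed, the conversion is essentially a one-liner (a hypothetical helper, shown with the example values from the formats list above):

def voc_to_coco(box):
    # [x_min, y_min, x_max, y_max] -> [x_min, y_min, width, height]
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]

print(voc_to_coco([98, 345, 420, 462]))  # [98, 345, 322, 117]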

In order to get a good understanding of how each of our models is performing, and with which hyperparameters, we’ll start by examining our results at the experiment level.

Store Hyper-parameters

Keeping track of our hyperparameters is crucial for reproducibility and explainability. Model hyperparameters can affect model performance, computational choices, and what information to retain for analysis. Hyperparameters vary from algorithm to algorithm, and some are more important than others, so this critical task can quickly become tedious and confusing.

Here, we log important hyperparameters with only a single command. For our project we’ll be monitoring the following hyperparameters, which we can adjust by simply editing the relevant key-value pairs:

from comet_ml import Experiment

# Create the Comet experiment that all metrics and assets will be logged to
# (the project name here is just an example)
experiment = Experiment(project_name="pytorch-object-detection")

hyper_params = {
    "lr": 0.0005,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "step_size": 3,
    "gamma": 0.1,
    "num_epochs": 1,
    "num_classes": 2,
    "model_name": "mask_rcnn",
    "backbone": "retinanet50",
    "feature_extract": False,
}
experiment.log_parameters(hyper_params)

Sometimes hyperparameter values that work well for one model may not work at all for another. For instance, the FCOS model tends to struggle with exploding gradients. When using it, we have to significantly decrease the learning rate to accommodate this. If, however, we use the reduced learning rate on a model like Fast-RCNN (typically one of our best-performing models), it performs unusually poorly because it fails to ever really “learn” the feature maps of our dataset.

In this experiment panel, green represents a Fast-RCNN model trained with a learning rate of 5e-4. Blue represents the same model trained on a learning rate of 5e-8, the rate needed to prevent exploding gradients in the FCOS model.
In this experiment panel, green represents a Fast-RCNN model trained with a learning rate of 5e-4. Blue represents the same model trained with a learning rate of 5e-8, the rate needed to prevent exploding gradients in the FCOS model.

Since we’re focusing on comparing different models in this tutorial, we will mostly be keeping the hyperparameters constant (apart from the learning rate). However, if we were looking to optimize the hyperparameters of a single model, we could also pass a list of values to each hyperparameter key and use an optimizer to iterate through them, as roughly sketched below.
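A rough sketch of what such a sweep might look like with Comet’s Optimizer (the parameter ranges, metric name, and project name here are hypothetical):

from comet_ml import Optimizer

sweep_config = {
    "algorithm": "bayes",
    "parameters": {
        "lr": {"type": "float", "min": 5e-8, "max": 5e-3},
        "momentum": {"type": "float", "min": 0.85, "max": 0.99},
    },
    "spec": {"metric": "epoch_mAP", "objective": "maximize"},
}

opt = Optimizer(sweep_config)
for experiment in opt.get_experiments(project_name="pytorch-object-detection"):
    lr = experiment.get_parameter("lr")
    momentum = experiment.get_parameter("momentum")
    # ... build the model with these values, train, and log "epoch_mAP" for the sweep to optimize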

System Metrics

Since object detection is such a resource-heavy task, we’ll also want to monitor our system metrics, including CPU and GPU usage. This can also help diagnose bottlenecks in our pipeline, aid with reproducibility, and debug crashed experiments.

A screenshot of the system metrics of our Fast RCNN experiment, as logged in the Comet UI.
System metrics of our Fast RCNN experiment

Evaluation Metrics

Each of our PyTorch detection models comes with relevant evaluation metrics built in. Comet is integrated with PyTorch, so each of these pre-defined metrics will be automatically logged to the experiment. This is very helpful when comparing multiple runs of the same model, or different object detection models with the same evaluation metrics, but the PyTorch models we’ve chosen don’t all come with the same built-in metrics. We’ll still use these auto-logged plots to get an initial impression of our models’ performance, but we’ll want to log some of our own metrics for cross-experiment comparisons.

A screenshot of auto-logged metrics as displayed in the Comet UI, including six separate charts that track loss, bbox_ctrness, bbox_regression, classification, loss_mask, and loss_box_reg.
Auto-logged metrics are super helpful when comparing multiple runs of the same model, or multiple models with the same evaluation metrics. But as you can see in the plot above, not all of our models have the same default evaluation metrics, making these plots less useful here. Instead, we’ll define our own.

We have the ability to manually log nearly any metric, asset, artifact, or graphic we wish. In this tutorial, we’ll track:

  • Mean Average Precision (mAP) of all validation images, per epoch
  • Mean Average Recall (mAR) of all validation images, per epoch
  • TorchMetrics’ 12 metrics for characterizing the performance of an object detector (very similar to COCO’s 12 metrics listed above), per image
  • Relevant code files from TorchVision (engine.py, transforms.py, etc.)
  • Graph definitions of our various models
  • Each image in the validation dataset, along with our model’s predicted bounding boxes and their corresponding labels and confidence scores
Individual values can be logged with log_metric, and dictionaries of related values with log_metrics:

log_metric(name, value, step=None, epoch=None, include_context=True)
log_metrics(dict, prefix=None, step=None, epoch=None)
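For example, a minimal sketch of logging our per-epoch values inside the training loop (the metric names and values here are hypothetical):

experiment.log_metrics(
    {"epoch_mAP": 0.71, "epoch_mAR": 0.65, "epoch_f1": 0.68},
    epoch=epoch,  # the current epoch counter from the training loop
)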
A screenshot of custom-logged evaluation metrics of our object detection model, as logged in the Comet UI, including epoch mAP (mean average precision), epoch mAR (mean average recall), epoch F1 score, model name, backbone, and learning rate.
For our experiment, we log epoch mAP, epoch mAR, epoch F1, and loss.

Graphics Tab

Understanding where your model is going right and where it’s going wrong can be especially difficult with image datasets. Loss metrics and other numerical values don’t always tell the whole story and can be hard to visualize. So we’ll also log each of our validation images, together with their predicted bounding boxes, per model per epoch. Flipping through a model’s predictions is a helpful way to see how our models improve over time.

Under the Graphics tab of your experiment view, sort your images by step (ascending), then search for a particular image name (here we used “image id: 77”). Then use the arrow to watch the model’s predictions over time.
Under the Graphics tab of your experiment view, sort your images by step (ascending), then search for a particular image name (here we used “image id: 77”). Then use the arrow to step through the model’s predictions over time.

To log images to Comet, we simply use the log_image method:

experiment.log_image(image, name, annotations, metadata)

Alternatively, we can also pass the annotations via the metadata parameter:

experiment.log_image(…, metadata = { "annotations": annotations })

In either case, the image annotations should be in JSON format and the bounding boxes should be in COCO format. Bounding boxes can either be passed as a dictionary (as shown below) or as a list of lists. Note that a new instance should be created for each bounding box, and polygon points are passed in the format [x1, y1, x2, y2, …, xn, yn].

[
    {
        "imageId": "7d045ad5a96b45f8b5e770d817ac429b",
        "experimentKey": "someExperimentKey",
        "metadata": {
            "annotations": [
                {
                    "name": "some name",
                    "data": [
                        {
                            "label": "person",
                            "score": 0.8004001479832858,
                            "boxes": [
                                { "x": 0, "y": 40, "width": 40, "height": 40 }
                            ],
                            "points": [
                                [230, 32, 30, 40, 50, 10, 23, 54, 94, 20],
                                [230, 32, 30, 40, 50, 10, 23, 54, 94, 20]
                            ]
                        }
                    ]
                }
            ]
        }
    }
]
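Putting it together, a sketch of what logging a single predicted box might look like in code (the image variable, layer name, and values are placeholders following the dictionary format above):

annotations = [{
    "name": "predictions",
    "data": [{
        "label": "person",
        "score": 0.80,
        "boxes": [{"x": 98, "y": 345, "width": 322, "height": 117}],  # COCO format
    }],
}]
experiment.log_image(image, name="image_id_77.png", annotations=annotations)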

Since we’re comparing object detection models in this tutorial, one of the most important ways we can use a tracking tool is to create a holistic project-level view. Comet automatically generates a basic model performance panel, but we also have the ability to customize our panels for our particular use case.

Image Panel

The Image Panel allows us to visualize different models’ predictions per experiment run, over time. Use the step slider to walk through each model’s predictions, or click on an individual image for a closer look.

A GIF showing how to use the slider feature to watch different object detection models’ predictions over time (steps) in Comet’s Image Panel.

From there, choose to smooth your image or render it in grayscale, select which class labels you’d like to examine, and set confidence thresholds with a sliding bar.

Using the smoothing and grayscale features, as well as the confidence score slider and label selector in Comet’s Image Panel.
Using the smoothing and grayscale features, as well as the confidence score slider and label selector; GIF by author

Data Panel

Sometimes we really need a deep dive into the numbers. With Comet’s Data Panel we can log any CSV, DataFrame, or table to our project and explore it interactively in the UI. We logged all twelve evaluation metrics from TorchMetrics’ mean_ap module, as shown below. If a given metric isn’t relevant to a particular image, it’s given a value of -1 (for example, if an image has no “large” bounding box predictions, then mAP_large for that image will be -1). We can reorder columns, sort them, and filter values. Below, we compare our most basic mAP and mAR measures and then sort them to see where precision is very different from recall. Alternatively, we could also check the epoch F1-score that we logged as an additional tool in our toolbox.

Comparing our image-level mAP and mAR scores and filtering/sorting the results using Comet’s Data Panel.
Comparing our image-level mAP and mAR scores and filtering/sorting the results; GIF by author
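For reference, a small sketch of how such a table might be logged with Comet’s log_table (the column names and values here are hypothetical):

import pandas as pd

# Hypothetical per-image results collected from the torchmetrics output
rows = [
    {"image_id": 77, "model": "fast_rcnn", "map": 0.72, "map_large": -1.0, "mar_100": 0.68},
    {"image_id": 78, "model": "fast_rcnn", "map": 0.64, "map_large": 0.81, "mar_100": 0.61},
]
df = pd.DataFrame(rows)

# Log the table so it can be explored interactively in the Data Panel
experiment.log_table("image_level_metrics.csv", tabular_data=df)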

Multiple Dashboards

Now that we’ve built all of these panels, we need a way to keep them organized! For this, we build and save multiple dashboards, each of which we’ll use for a different purpose. We’ll keep the auto-generated dashboard that Comet built for us, and we’ll organize the rest of our panels into four more dashboards. We have a project overview dashboard that gives us a very basic overview of our project’s stats (parameters used, number of experiments run, and some of the best metrics achieved). We’ll put our Image Panel and Data Panel into a Debugging dashboard, and we’ll store our plots and charts in a Metrics dashboard. Now we can easily navigate through all of our panels to find exactly what we’re looking for!

A GIF demonstrating how to navigate between multiple custom dashboards within the Comet UI.
Exploring multiple dashboards; GIF by author

Accuracy-Speed Tradeoff

At the beginning of this tutorial, we briefly explored machine learning’s accuracy-speed tradeoff. Models with higher precision and accuracy tend to consume more compute resources, and fast models tend to be less accurate. Depending on your use case, your definition of the “best” model may vary. Circling back to this thought, we’ll compare four of our models in terms of their general speed and accuracy in order to understand which models work “best” for which scenarios.

A screenshot using the experiment diffing feature in Comet, demonstrating that Faster RCNN is faster than Fast_RCNN, but it is also a lot less accurate.
As its name suggests, Faster RCNN is faster than Fast RCNN, but as you can see in the experiment diffing view above, it is also a lot less accurate.

We create a final dashboard called “Accuracy-Speed Tradeoff” and plot some basic evaluation and system metrics for four different models: Mask RCNN, Fast RCNN, RetinaNet, and FCOS. Keep in mind that both RCNN models are two-stage object detection models, which are generally more computationally expensive. RetinaNet and FCOS are both single-stage models.

A screenshot of one of our custom dashboards in Comet that illustrates the accuracy-speed tradeoff of several of our object detection models.
Our Accuracy-Speed Tradeoff dashboard for four of our base models

Each of our two-stage object detection models (in green and light blue above) far outperforms the single-stage models in mean average precision, epoch F1-score, and loss. Shifting to the bottom row of charts, however, we can see that they are also far more computationally expensive. It may come as no surprise that Mask RCNN is the slowest model of all, since it is based on Faster RCNN but produces additional outputs (segmentation masks).

For a general-purpose object detection model, we might conclude that Fast RCNN performs the best at bounding box prediction. It has the highest mAP and F1, the lowest loss, and consumes far less memory than Mask RCNN. But the “best” model is subjective and entirely dependent on your use case! If we were looking to deploy our model to a mobile device, Fast RCNN’s memory requirements might disqualify it from consideration.

Comparing and logging object detection models can be a tedious and overwhelming task, but when you have an experiment tracking tool like Comet, you can focus your attention where it really matters. Comet is a powerful tool for tracking your models, datasets, and metrics to keep your experiments organized, reproducible, and explainable.

Try out the code from this tutorial in this Colab and apply it to a dataset of your own! You can view the public project here or, to get started with your own project, create an account here for free!
