What if you want to write the entire object detection training pipeline from scratch, so you can understand each step and be able to customize it? That's what I set out to do. I examined several well-known object detection pipelines and designed one that best fits my needs and tasks. Thanks to the Ultralytics, YOLOX, DAMO-YOLO, RT-DETR and D-FINE repos, I leveraged them to gain a deeper understanding of various design details. I ended up implementing the SoTA real-time object detection model D-FINE in my custom pipeline.
Plan
- Dataset, Augmentations and transforms:
- Mosaic (with affine transforms)
- Mixup and Cutout
- Other augmentations with bounding boxes
- Letterbox vs simple resize
- Training:
- Optimizer
- Scheduler
- EMA
- Batch accumulation
- AMP
- Grad clipping
- Logging
- Metrics:
- mAPs from TorchMetrics / cocotools
- How to compute Precision, Recall, IoU?
- Pick a suitable solution for your case
- Experiments
- Attention to data preprocessing
- Where to start
Dataset
Dataset processing is the first thing you usually start working on. With object detection, you need to load your images and annotations. Annotations are often stored in COCO format as a JSON file, or in YOLO format, with a txt file for each image. Let's take a look at the YOLO format: each line is structured as `class_id, x_center, y_center, width, height`, where the bbox values are normalized between 0 and 1.
Once you have your images and txt files, you can write your dataset class, nothing tricky here. Load everything, transform (augmentations included) and return it during training. I prefer splitting the data by creating a CSV file for each split and then reading it in the Dataloader class rather than physically moving files into train/val/test folders. This is an example of a customization that helped my use case.
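Here is a minimal sketch of such a dataset class, assuming a per-split CSV with `image_path` and `label_path` columns (that column layout is my illustration, not the exact format used in the pipeline):

```python
import cv2
import pandas as pd
from torch.utils.data import Dataset


class DetectionDataset(Dataset):
    def __init__(self, split_csv: str, transform=None):
        self.items = pd.read_csv(split_csv)  # one CSV per train/val/test split
        self.transform = transform

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        row = self.items.iloc[idx]
        img = cv2.cvtColor(cv2.imread(row["image_path"]), cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]
        boxes, labels = [], []
        with open(row["label_path"]) as f:
            for line in f:  # YOLO txt: class_id x_center y_center width height (normalized)
                cls, xc, yc, bw, bh = map(float, line.split())
                # convert to pascal_voc (absolute x1, y1, x2, y2) for Albumentations
                boxes.append([
                    (xc - bw / 2) * w, (yc - bh / 2) * h,
                    (xc + bw / 2) * w, (yc + bh / 2) * h,
                ])
                labels.append(int(cls))
        if self.transform:
            t = self.transform(image=img, bboxes=boxes, class_labels=labels)
            img, boxes, labels = t["image"], t["bboxes"], t["class_labels"]
        return img, boxes, labels
```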
Augmentations
First, when augmenting images for object detection, it's crucial to apply the same transformations to the bounding boxes. To do that comfortably I use the Albumentations library. For example:
def _init_augs(self, cfg) -> None:
    if self.keep_ratio:
        resize = [
            A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),
            A.PadIfNeeded(
                min_height=self.target_h,
                min_width=self.target_w,
                border_mode=cv2.BORDER_CONSTANT,
                fill=(114, 114, 114),
            ),
        ]
    else:
        resize = [A.Resize(self.target_h, self.target_w)]

    norm = [
        A.Normalize(mean=self.norm[0], std=self.norm[1]),
        ToTensorV2(),
    ]

    if self.mode == "train":
        augs = [
            A.RandomBrightnessContrast(p=cfg.train.augs.brightness),
            A.RandomGamma(p=cfg.train.augs.gamma),
            A.Blur(p=cfg.train.augs.blur),
            A.GaussNoise(p=cfg.train.augs.noise, std_range=(0.1, 0.2)),
            A.ToGray(p=cfg.train.augs.to_gray),
            A.Affine(
                rotate=[90, 90],
                p=cfg.train.augs.rotate_90,
                fit_output=True,
            ),
            A.HorizontalFlip(p=cfg.train.augs.left_right_flip),
            A.VerticalFlip(p=cfg.train.augs.up_down_flip),
        ]

        self.transform = A.Compose(
            augs + resize + norm,
            bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
        )

    elif self.mode in ["val", "test", "bench"]:
        self.mosaic_prob = 0
        self.transform = A.Compose(
            resize + norm,
            bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
        )
Secondly, there are a lot of interesting and non-trivial augmentations:
- Mosaic. The idea is simple: take several images (for example, 4) and stack them together in a grid (2×2). Then apply some affine transforms and feed the result to the model.
- MixUp. Originally used in image classification (it's surprising that it works). The idea: take two images and overlay them on each other with some level of transparency. In classification models this usually means that if one image is 20% transparent and the second is 80%, the model should predict 80% for class 1 and 20% for class 2. In object detection we just get more objects in one image.
- Cutout. Cutout involves removing parts of the image (by replacing them with black pixels) to help the model learn more robust features.
I often see mosaic applied with probability 1.0 for the first ~90% of epochs. Then it's usually turned off, and lighter augmentations are used. The same idea applies to mixup, but I see it used a lot less (in the most popular detection framework, Ultralytics, it's turned off by default; in another one, I see p=0.15). Cutout seems to be used less frequently.
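To make the mosaic idea concrete, here is a minimal sketch of a 2×2 mosaic. It's my simplified illustration: boxes are in pascal_voc format, and a plain downscale stands in for the random affine/crop a real pipeline would apply.

```python
import cv2
import numpy as np


def mosaic_2x2(images, bboxes_list, out_size=640, fill=114):
    """Stack 4 images into a 2x2 grid and remap their pascal_voc boxes."""
    canvas = np.full((out_size * 2, out_size * 2, 3), fill, dtype=np.uint8)
    all_boxes = []
    offsets = [(0, 0), (out_size, 0), (0, out_size), (out_size, out_size)]
    for (img, boxes), (dx, dy) in zip(zip(images, bboxes_list), offsets):
        h, w = img.shape[:2]
        canvas[dy:dy + out_size, dx:dx + out_size] = cv2.resize(img, (out_size, out_size))
        sx, sy = out_size / w, out_size / h
        for x1, y1, x2, y2 in boxes:
            all_boxes.append([x1 * sx + dx, y1 * sy + dy, x2 * sx + dx, y2 * sy + dy])
    # Downscale the 2x grid back to the training size; a real implementation
    # would apply a random affine / crop here instead of a plain resize.
    mosaic = cv2.resize(canvas, (out_size, out_size))
    all_boxes = [[coord / 2 for coord in box] for box in all_boxes]
    return mosaic, all_boxes
```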
You can read more about these augmentations in these two articles: 1, 2.
Results from just turning on mosaic are pretty good (the darker curve without mosaic got mAP 0.89 vs 0.92 with it, tested on a real dataset).
Letterbox or simple resize?
During training, you usually resize the input image to a square. Models often use 640×640 and benchmark on the COCO dataset. There are two main ways to get there:
- Simple resize to a target size.
- Letterbox: resize the longest side to the target size (e.g., 640), preserving the aspect ratio, and pad the shorter side to reach the target dimensions.


Both approaches have advantages and disadvantages. Let's discuss them first, and then I'll share the results of the experiments I ran comparing these approaches.
Simple resize:
- Compute is spent on the whole image, with no useless padding.
- “Dynamic” aspect ratio may act as a type of regularization.
- Inference preprocessing perfectly matches training preprocessing (augmentations excluded).
- Kills real geometry. Resize distortion can affect the spatial relationships in the image. Although it may be a human bias to think that a fixed aspect ratio is important.
Letterbox:
- Preserves real aspect ratio.
- During inference, you can cut the padding and run on a non-square image if you don't lose accuracy (some models can degrade).
- You can train on a bigger image size, then run inference with the padding cut to get the same inference latency as with a simple resize. For example 640×640 vs 832×480. The second preserves the aspect ratio, and objects appear roughly the same size.
- Part of the compute is wasted on gray padding.
- Objects get smaller.
How to test it and decide which one to use?
Train from scratch with these parameters:
- Simple resize, 640×640
- Keep aspect ratio, max side 640, and add padding (as a baseline)
- Keep aspect ratio, larger image size (for example max side 832), and add padding

Then run inference with all 3 models. When the aspect ratio is preserved, cut the padding during inference (a minimal preprocessing sketch for this follows the list). Compare latency and metrics.
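Here is a minimal letterbox helper to illustrate the padding-cut trick, assuming a stride-32 model; the function name and defaults are my own, not from a specific repo:

```python
import cv2
import numpy as np


def letterbox(img: np.ndarray, target: int = 640, fill: int = 114,
              stride: int = 32, cut_padding: bool = False):
    """Resize the longest side to `target`, keep the aspect ratio, then pad.

    With cut_padding=True the short side is padded only up to a multiple of
    `stride` (e.g. 640x384 instead of 640x640). Whether your model tolerates
    non-square input without accuracy loss has to be checked per model.
    """
    h, w = img.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (new_w, new_h))
    if cut_padding:
        pad_h = int(np.ceil(new_h / stride) * stride)
        pad_w = int(np.ceil(new_w / stride) * stride)
    else:
        pad_h = pad_w = target
    out = np.full((pad_h, pad_w, 3), fill, dtype=img.dtype)
    out[:new_h, :new_w] = resized  # top-left placement; boxes scale by `scale`
    return out, scale
```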
Example of the same image from above with cut padding (640×384):

Here's what happens when you preserve the aspect ratio and run inference with the gray padding cut:

| params | F1 score | latency (ms) |
|---|---|---|
| ratio kept, 832 | 0.633 | 33.5 |
| no ratio, 640×640 | 0.617 | 33.4 |

As shown, training with a preserved aspect ratio at a bigger size (832) achieved a higher F1 score (0.633) compared to a simple 640×640 resize (F1 score of 0.617), while the latency remained similar. Note that some models may degrade if the padding is removed during inference, which kills the whole purpose of this trick, and possibly of the letterbox too.
What does this mean:
Training from scratch:
- With the same image size, a simple resize gets better accuracy than letterbox.
- For letterbox, if you cut the padding during inference and your model doesn't lose accuracy, you can train and run inference at a bigger image size to match the latency, and get slightly better metrics (as in the example above).
Training with pre-trained weights initialized:
- If you finetune, use the same tactic as the pre-trained model did; it should give you the best results if the datasets aren't too different.
For D-FINE, I see lower metrics when cutting the padding during inference. Also, the model was pre-trained with a simple resize. For YOLO, a letterbox is usually a good choice.
Training
Every ML engineer should know how to implement a training loop. Although PyTorch does much of the heavy lifting, you may still feel overwhelmed by the number of design choices available. Here are some key components to consider:
- Optimizer – start with Adam/AdamW/SGD.
- Scheduler – a fixed LR can be okay for Adam/AdamW, but take a look at StepLR, CosineAnnealingLR or OneCycleLR.
- EMA. This is a nice technique that makes training smoother and sometimes achieves better metrics. After each batch, you update a secondary model (often called the EMA model) by computing an exponential moving average of the primary model's weights (a minimal sketch follows this list).
- Batch accumulation is nice when your VRAM is very limited. When training a transformer-based object detection model, in some cases you can only fit 4 images into VRAM even with a mid-sized model. By accumulating gradients over several batches before performing an optimizer step, you effectively simulate a bigger batch size without exceeding your memory constraints. Another use case: when you have a lot of negatives (images without target objects) in your dataset and a small batch size, you can run into unstable training. Batch accumulation can help here too.
- AMP uses half precision automatically where applicable. It reduces VRAM usage and makes training faster (if you have a GPU that supports it). I see 40% less VRAM usage and at least a 15% training speed increase.
- Grad clipping. Often, when you use AMP, training can become less stable. This can also happen with higher LRs. When your gradients are too big, training will fail. Gradient clipping makes sure gradients never exceed a certain value.
- Logging. Try Hydra for configs and something like Weights & Biases or ClearML for experiment tracking. Also, log everything locally. Save your best weights and metrics, so that after numerous experiments you can always find all the information on the model you need.
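For the EMA part, here is a minimal sketch of a wrapper that matches the `update(ema_iter, model)` call used in the loop below; the decay value and warmup formula are illustrative assumptions, not the pipeline's exact implementation:

```python
import copy
import math

import torch


class EMAModel:
    """Keeps an exponential moving average of a model's weights."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, step: int, model: torch.nn.Module) -> None:
        # Warm up the decay so early steps don't anchor to the random initial weights
        d = self.decay * (1 - math.exp(-step / 2000))
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                v.copy_(v * d + msd[k].detach() * (1 - d))
```

And here is the full training loop that ties the optimizer step, scheduler, batch accumulation, AMP, grad clipping and EMA updates together: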
def train(self) -> None:
    best_metric = 0
    cur_iter = 0
    ema_iter = 0
    one_epoch_time = None

    def optimizer_step(step_scheduler: bool):
        """
        Clip grads, optimizer step, scheduler step, zero grad, EMA model update
        """
        nonlocal ema_iter
        if self.amp_enabled:
            if self.clip_max_norm:
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
            self.scaler.step(self.optimizer)
            self.scaler.update()
        else:
            if self.clip_max_norm:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
            self.optimizer.step()
        if step_scheduler:
            self.scheduler.step()
        self.optimizer.zero_grad()
        if self.ema_model:
            ema_iter += 1
            self.ema_model.update(ema_iter, self.model)

    for epoch in range(1, self.epochs + 1):
        epoch_start_time = time.time()
        self.model.train()
        self.loss_fn.train()
        losses = []

        with tqdm(self.train_loader, unit="batch") as tepoch:
            for batch_idx, (inputs, targets, _) in enumerate(tepoch):
                tepoch.set_description(f"Epoch {epoch}/{self.epochs}")
                if inputs is None:
                    continue
                cur_iter += 1

                inputs = inputs.to(self.device)
                targets = [
                    {
                        k: (v.to(self.device) if (v is not None and hasattr(v, "to")) else v)
                        for k, v in t.items()
                    }
                    for t in targets
                ]

                lr = self.optimizer.param_groups[0]["lr"]

                if self.amp_enabled:
                    with autocast(self.device, cache_enabled=True):
                        output = self.model(inputs, targets=targets)
                    with autocast(self.device, enabled=False):
                        loss_dict = self.loss_fn(output, targets)
                    loss = sum(loss_dict.values()) / self.b_accum_steps
                    self.scaler.scale(loss).backward()
                else:
                    output = self.model(inputs, targets=targets)
                    loss_dict = self.loss_fn(output, targets)
                    loss = sum(loss_dict.values()) / self.b_accum_steps
                    loss.backward()

                if (batch_idx + 1) % self.b_accum_steps == 0:
                    optimizer_step(step_scheduler=True)

                losses.append(loss.item())

                tepoch.set_postfix(
                    loss=np.mean(losses) * self.b_accum_steps,
                    eta=calculate_remaining_time(
                        one_epoch_time,
                        epoch_start_time,
                        epoch,
                        self.epochs,
                        cur_iter,
                        len(self.train_loader),
                    ),
                    vram=f"{get_vram_usage()}%",
                )

        # Final update for any leftover gradients from an incomplete accumulation step
        if (batch_idx + 1) % self.b_accum_steps != 0:
            optimizer_step(step_scheduler=False)

        wandb.log({"lr": lr, "epoch": epoch})

        metrics = self.evaluate(
            val_loader=self.val_loader,
            conf_thresh=self.conf_thresh,
            iou_thresh=self.iou_thresh,
            path_to_save=None,
        )

        best_metric = self.save_model(metrics, best_metric)
        save_metrics(
            {}, metrics, np.mean(losses) * self.b_accum_steps, epoch, path_to_save=None
        )

        if (
            epoch >= self.epochs - self.no_mosaic_epochs
            and self.train_loader.dataset.mosaic_prob
        ):
            self.train_loader.dataset.close_mosaic()

        if epoch == self.ignore_background_epochs:
            self.train_loader.dataset.ignore_background = False
            logger.info("Including background images")

        one_epoch_time = time.time() - epoch_start_time
Metrics
For object detection everyone uses mAP, and how it is measured is already standardized. Use pycocotools, faster-coco-eval or TorchMetrics for mAP. But mAP means we check how good the model is overall, across all confidence levels. mAP0.5 means the IoU threshold is 0.5 (everything lower is considered a wrong prediction). I personally don't fully like this metric, as in production we always use one confidence threshold. So why not set the threshold and then compute metrics? That's why I also always calculate confusion matrices, and based on them: Precision, Recall, F1-score, and IoU.
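For reference, computing mAP with TorchMetrics looks roughly like this (a minimal sketch with dummy boxes, assuming a recent torchmetrics version):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")  # boxes in absolute xyxy format

preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 100.0, 200.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 25.0, 105.0, 195.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
result = metric.compute()
print(result["map"], result["map_50"])
```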
But the matching logic can be tricky. Here's what I use (a short sketch of this matching follows the list):
- 1 GT (ground truth) object = 1 predicted object, and it's a TP if IoU > threshold. If there is no prediction for a GT object, it's an FN. If there is no GT for a prediction, it's an FP.
- 1 GT can be matched by a prediction only once. If there are 2 predictions for 1 GT, I count 1 TP and 1 FP.
- Class ids must also match. If the model predicts class_0 but the GT is class_1, it means FP += 1 and FN += 1.
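A minimal per-image sketch of these rules (greedy one-to-one matching; predictions are assumed to be pre-sorted by confidence, and the helper names are my own):

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match_image(preds, gts, iou_thresh=0.5):
    """preds/gts: lists of (box, class_id). Returns (TP, FP, FN) for one image."""
    tp = fp = 0
    matched = set()
    for p_box, p_cls in preds:  # iterate in descending confidence order
        best_iou, best_j = 0.0, -1
        for j, (g_box, g_cls) in enumerate(gts):
            if j in matched or g_cls != p_cls:
                continue  # a GT is matched at most once, and classes must agree
            iou = box_iou(p_box, g_box)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > iou_thresh:
            tp += 1
            matched.add(best_j)
        else:
            fp += 1  # duplicate or wrong-class predictions end up here
    fn = len(gts) - len(matched)
    return tp, fp, fn
```

Precision, Recall and F1 then follow directly from the summed TP/FP/FN over the dataset.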
During training, I pick the best model based on the metrics that are relevant to the task. I typically consider the average of mAP50 and F1-score.
Model and loss
I haven't discussed model architecture and loss function here. They typically go together, and you can pick any model you like and integrate it into your pipeline with everything from above. I did that with DAMO-YOLO and D-FINE, and the results were great.
Pick a suitable solution for your case
Many people use Ultralytics, but it is licensed under AGPL-3.0, and you can't use it in commercial projects unless your code is open source. So people often look into Apache 2.0 and MIT licensed models. Check out D-FINE, RT-DETR2 or some YOLO models like YOLOv9.
What if you want to customize something in the pipeline? When you build everything from scratch, you have full control. Otherwise, try picking a project with a smaller codebase, as a large one can make it difficult to isolate and modify individual components.
If you don't need anything custom and your usage is allowed by the Ultralytics license, it's a great repo to use: it supports multiple tasks (classification, detection, instance segmentation, keypoints, oriented bounding boxes), the models are efficient and achieve good scores. To reiterate once more, you probably don't need a custom training pipeline if you aren't doing very specific things.
Experiments
Let me share some results I got with the custom training pipeline and the D-FINE model, and compare them to the Ultralytics YOLO11 model on the VisDrone-DET2019 dataset.
Trained from scratch:
| model | mAP 0.50 | F1-score | Latency (ms) |
|---|---|---|---|
| YOLO11m TRT | 0.417 | 0.568 | 15.6 |
| YOLO11m TRT dynamic | - | 0.568 | 13.3 |
| YOLO11m OV | - | 0.568 | 122.4 |
| D-FINEm TRT | 0.457 | 0.622 | 16.6 |
| D-FINEm OV | 0.457 | 0.622 | 115.3 |
From COCO pre-trained:
| model | mAP 0.50 | F1-score |
|---|---|---|
| YOLO11m | 0.456 | 0.600 |
| D-FINEm | 0.506 | 0.649 |
Latency was measured on an RTX 3060 with TensorRT (TRT), static image size 640×640, including the time for cv2.imread.
OpenVINO (OV) latency was measured on an i5 14000f (no iGPU). Dynamic means that during inference the gray padding is cut for faster inference. It worked with the YOLO11 TensorRT version. More details about cutting gray padding are in the letterbox section above.
One disappointing result is the latency on an Intel N100 CPU with iGPU (a $150 mini PC):
| model | Latency (ms) |
|---|---|
| YOLO11m | 188 |
| D-FINEm | 272 |
| D-FINEs | 11 |

Here, traditional convolutional neural networks are noticeably faster, possibly due to optimizations in OpenVINO for GPUs.
Overall, I conducted over 30 experiments with different datasets (including real-world ones), models, and parameters, and I can say that D-FINE gets better metrics. It makes sense, as on COCO it is also better than all YOLO models.

VisDrone experiments:


Example of D-FINE model predictions (green – GT, blue – pred):

Final results
Knowing all the details, let's see a final comparison with the best settings for both models on an i5 12400F and RTX 3060 with the VisDrone dataset:
| model | F1-score | Latency (ms) |
|---|---|---|
| YOLO11m TRT dynamic | 0.600 | 13.3 |
| YOLO11m OV | 0.600 | 122.4 |
| D-FINEs TRT | 0.629 | 12.3 |
| D-FINEs OV | 0.629 | 57.4 |
As shown above, I was able to use a smaller D-FINE model and achieve both faster inference and better accuracy than YOLO11. Beating Ultralytics, the most widely used real-time object detection framework, in both speed and accuracy is quite an accomplishment, isn't it? The same pattern is observed across several other real-world datasets.
I also tried out YOLOv12, which came out while I was writing this article. It performed similarly to YOLO11 and even achieved slightly lower metrics (mAP 0.456 vs 0.452). It seems that YOLO models have been hitting a wall for the last couple of years. D-FINE was a great update for object detection models.
Finally, let's look at the difference between YOLO11m and D-FINEs visually. YOLO11m, conf 0.25, NMS IoU 0.5, latency 13.3 ms:

D-FINEs, conf 0.5, no NMS, latency 12.3 ms:

Both Precision and Recall are better with the D-FINE model. And it's also faster. Here is also the "m" version of D-FINE:

Isn't it crazy that even that one car on the left was detected?
Attention to data preprocessing
This part goes a little outside the scope of the article, but I want to at least mention it quickly, since some parts can be automated and used in the pipeline. What I definitely see as a Computer Vision engineer is that when engineers don't spend time working with the data, they don't get good models. You can have SoTA models and everything else done right, but garbage in, garbage out. So I always pay a ton of attention to how to approach the task and how to gather, filter, validate, and annotate the data. Don't assume that the annotation team will do everything right. Get your hands dirty and manually check a portion of the dataset to make sure that annotations are good and the collected images are representative.
Several quick ideas to look into:
- Remove duplicates and near-duplicates from val/test sets. The model should not be validated on one sample twice, and you definitely don't want a data leak from having two identical images, one in the training and one in the validation set.
- Check how small your objects can be. Everything that isn't visible to your eye should not be annotated. Also, remember that augmentations will make objects appear even smaller (for example, mosaic or zoom out). Configure these augmentations accordingly so you won't end up with unusably small objects in the image.
- When you already have a model for a certain task and need more data, try using your model to pre-annotate new images. Check the cases where the model fails and gather more similar cases.
Where to start
I worked a lot on this pipeline, and I'm ready to share it with everyone who wants to try it out. It uses the SoTA D-FINE model under the hood and adds features that were absent from the original repo (mosaic augmentations, batch accumulation, scheduler, more metrics, visualization of preprocessed images and eval predictions, exporting and inference code, better logging, unified and simplified configuration file).
Here is the link to my repo. Here is the original D-FINE repo, where I also contribute. If you need any help, please contact me on LinkedIn. Thanks for your time!
Citations and acknowledgments
@article{zhu2021detection,
  title={Detection and tracking meet drones challenge},
  author={Zhu, Pengfei and Wen, Longyin and Du, Dawei and Bian, Xiao and Fan, Heng and Hu, Qinghua and Ling, Haibin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={44},
  number={11},
  pages={7380--7399},
  year={2021},
  publisher={IEEE}
}

@misc{peng2024dfine,
  title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
  author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
  year={2024},
  eprint={2410.13842},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}