YOLOv2 & YOLO9000 Paper Walkthrough: Better, Faster, Stronger


"Better, Faster, Stronger" is the ambitious tagline the authors chose for the paper introducing both YOLOv2 and YOLO9000. The paper itself is titled "YOLO9000: Better, Faster, Stronger" [1], and it was published back in December 2016. The main focus of this paper is indeed to create YOLO9000, but let's make things clear: despite the title of the paper, the model proposed in the study is named YOLOv2. The name YOLO9000 refers to their proposed algorithm specialized to detect over 9000 object categories, which is built on top of the YOLOv2 architecture.

In this article I'm going to focus on how YOLOv2 works and implement the architecture from scratch with PyTorch. I will also talk a little bit about how the authors eventually ended up with YOLO9000.


From YOLOv1 to YOLOv2

As the name suggests, YOLOv2 is the successor of YOLOv1. Thus, in order to understand YOLOv2, I recommend you read my previous articles about YOLOv1 [2] and its loss function [3] before reading this one.

There were two main problems the authors raised about YOLOv1: first, the high localization error, or in other words the bounding box predictions made by the model are not quite accurate; second, the low recall, which is a condition where the model is unable to detect all objects inside the image. The authors made a number of modifications to YOLOv1 to handle these issues, and the changes are summarized in Figure 1. We are going to discuss each of these modifications one by one in the following sub-sections.

Figure 1. The changes the authors made to YOLOv1 to construct YOLOv2 [1].

Batch Normalization

The first modification the authors made was applying batch normalization layers. Keep in mind that YOLOv1 is quite old. It was introduced back when the BN layer was not yet popular, which is the reason why YOLOv1 doesn't utilize this normalization mechanism in the first place. It has already been proven that BN layers are able to stabilize training, speed up convergence, and regularize the model. For this reason, the dropout layer we previously had in YOLOv1 is omitted as we apply BN layers. It is mentioned in the paper that by attaching this kind of layer after each convolution they obtained a 2.4% improvement in mAP, from 63.4% to 65.8%.

Better Fine-Tuning

Next, the authors proposed a better way to perform fine-tuning. Previously in YOLOv1 the backbone model was pretrained on the ImageNet classification dataset, in which the images had a size of 224×224. Then, they replaced the classification head with a detection head and directly fine-tuned it on the PASCAL VOC detection dataset, which contains images of size 448×448. Here we can clearly see that there was something like a "jump" due to the different image resolutions in pretraining and fine-tuning. The pipeline used for training YOLOv2 is slightly modified, where the authors added an intermediate step, namely fine-tuning the model on 448×448 ImageNet images before fine-tuning it again on PASCAL VOC at the same image resolution. This extra step allows the model to adapt to the higher resolution images before being fine-tuned for detection, unlike in YOLOv1 where the model is forced to work on 448×448 images directly after being pretrained on 224×224 images. This new fine-tuning pipeline allowed the mAP to increase by 3.7%, from 65.8% to 69.5%.

Figure 2. The fine-tuning mechanism of YOLOv1 and YOLOv2 [4].

Anchor Box and Fully Convolutional Network

The next modification was related to the use of anchor boxes. If you're not yet familiar with it, an anchor box is essentially a template bounding box (a.k.a. prior box) corresponding to a single grid cell, which is rescaled to match the actual object size. The model is then trained to predict the offset from the anchor box rather than the bounding box coordinates like YOLOv1 does. We can think of an anchor box as the starting point the model uses to make a bounding box prediction. According to the paper, predicting offsets like this is easier than predicting coordinates, hence allowing the model to perform better. Figure 3 below illustrates 5 anchor boxes that correspond to the top-left grid cell. Later on, the same anchor boxes will be applied to all grid cells in the image.

Figure 3. Example of 5 anchor boxes applied to the top-left grid cell of an image [5].

Using anchor boxes also changed the way we do object classification. Previously in YOLOv1, each grid cell predicted two bounding boxes, yet it could only predict a single object class. YOLOv2 addresses this issue by attaching the object classification mechanism to the anchor box rather than the grid cell, allowing each anchor box from the same grid cell to predict a different object class. Mathematically speaking, the length of the prediction vector of YOLOv1 can be formulated as (B×5)+C for every grid cell, whereas in YOLOv2 this prediction vector length changes to B×(5+C), where B is the number of bounding boxes to be generated, C is the number of classes in the dataset, and 5 accounts for the bounding box coordinates (x, y, w, h) and the bounding box confidence value. With the mechanism introduced in YOLOv2, the prediction vector indeed becomes longer, but it allows each anchor box to predict its own class. The figure below illustrates the prediction vectors of YOLOv1 and YOLOv2, where we set B to 2 and C to 20. In this particular case, the lengths of the prediction vectors of the two models are (2×5)+20=30 and 2×(5+20)=50, respectively.

Figure 4. What the prediction vectors of YOLOv1 and YOLOv2 look like for the 20-class PASCAL VOC object detection dataset [5].
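Just to make the arithmetic above more concrete, here is a tiny snippet (my own sanity check, not anything from the paper) that computes the two prediction vector lengths.

# Sketch: per-grid-cell prediction vector lengths of YOLOv1 vs YOLOv2.
B = 2    # number of bounding boxes per grid cell (YOLOv1 default)
C = 20   # number of classes in PASCAL VOC

yolov1_length = (B * 5) + C    # all boxes in a cell share one class prediction
yolov2_length = B * (5 + C)    # every anchor box carries its own class prediction

print(yolov1_length)    # 30
print(yolov2_length)    # 50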

At this point the authors also replaced the fully-connected layers in YOLOv1 with a stack of convolution layers, turning the entire model into a fully convolutional network with a downsampling factor of 32. This downsampling factor causes an input tensor of size 448×448 to be reduced to 14×14. The authors argued that large objects are usually positioned in the middle of an image, so they made the output feature map have odd dimensions, ensuring that there is a single center cell to predict such objects. In order to achieve this, the authors changed the default input shape to 416×416 so that the output feature map has a spatial resolution of 13×13.

Interestingly, the use of anchor boxes and a fully convolutional network caused the mAP to decrease by 0.3%, from 69.5% to 69.2%, yet at the same time the recall increased by 7%, from 81% to 88%. This improvement in recall was mainly due to the increase in the number of predictions made by the model. In the case of YOLOv1, the model could only predict 7×7=49 objects in total, whereas YOLOv2 can predict up to 13×13×5=845 objects, where the number 5 comes from the default number of anchor boxes used. Meanwhile, the decrease in mAP indicated that there was room for improvement in the anchor boxes.
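The numbers above are easy to verify ourselves. The quick calculation below is just my own check on the grid sizes and the total number of predictions.

# Sketch: how many detections each model produces in a single forward pass.
yolov1_grid = 448 // 64    # YOLOv1 downsamples 448x448 by a factor of 64 -> 7x7 grid
yolov2_grid = 416 // 32    # YOLOv2 downsamples 416x416 by a factor of 32 -> 13x13 grid
num_anchors = 5

print(yolov1_grid * yolov1_grid)                  # 49 objects (one class per cell)
print(yolov2_grid * yolov2_grid * num_anchors)    # 845 predictions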

Prior Box Clustering and Constrained Predictions

The authors indeed saw an issue with the anchor boxes, and so in the next step they tried to modify the way they work. Previously, in Faster R-CNN, the anchor boxes were manually handpicked, which caused them to not optimally represent all object shapes in the dataset. To deal with this problem, the authors used K-means to cluster the distribution of bounding box sizes. They did so by taking the width and height values of the bounding boxes in the object detection dataset, putting them into a two-dimensional space, and clustering the datapoints using K-means as usual. The authors decided to use K=5, which essentially means that we will later have that number of clusters.

The illustration in Figure 5 below displays what the bounding box size distribution looks like, where each black datapoint represents a single bounding box in the dataset and the green circles are the centroids, which will then act as the sizes of our anchor boxes. Note that this illustration is created based on dummy data, but the idea here is that a datapoint positioned at the top-right represents a large square bounding box, one at the top-left is a vertical rectangular box, and so forth.

Figure 5. Example of a bounding box distribution. The bounding box sizes are scaled to 0–1 relative to the image size [5].

If you're familiar with K-means, you know we typically use Euclidean distance to measure the distance between datapoints and the centroids. But here the authors created a new distance metric specifically for this case, in which they used the complement of the IOU between the bounding boxes and the cluster centroids, i.e., d(box, centroid) = 1 − IOU(box, centroid). See the equation below for the details.

Figure 6. The distance metric the authors use to measure the distance between the bounding boxes in the dataset (black datapoints) and the anchor boxes (green centroids).
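To make this clustering step more tangible, below is a minimal NumPy sketch of K-means on the (width, height) pairs using the 1 − IOU distance, where the IOU is computed as if every box shares the same center. This is my own illustration of the idea based on dummy data, not the authors' original code.

# Sketch: anchor box clustering with K-means and the 1 - IOU distance metric.
import numpy as np

def iou_wh(boxes, centroids):
    # boxes: (N, 2), centroids: (K, 2), both holding (width, height) pairs.
    # The IOU is computed as if every box shares the same center point.
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union                                          # shape: (N, K)

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # smallest 1 - IOU distance is the same as largest IOU
        assignment = np.argmax(iou_wh(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = boxes[assignment == j].mean(axis=0)
    return centroids

# Dummy (w, h) pairs, scaled to 0-1 relative to the image size.
boxes = np.random.default_rng(42).uniform(0.05, 1.0, size=(1000, 2))
print(kmeans_anchors(boxes, k=5))    # 5 prior box sizes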

Using the distance metric above, we can see in the following table that the prior boxes generated using K-means clustering (highlighted in blue) have a better average IOU compared to the prior boxes used in Faster R-CNN (highlighted in green), despite the smaller number of prior boxes (5 vs 9). This essentially indicates that the proposed clustering mechanism allows the resulting prior boxes to represent the bounding box size distribution in the dataset better than the handpicked anchor boxes.

Figure 7. Comparison of different methods for generating prior boxes [1].

Still related to the prior boxes, the authors found that predicting anchor box offsets like Faster R-CNN was still not quite optimal due to the unbounded equations. If we take a look at Figure 8 below, there is a possibility that the box position could be shifted wildly across the entire image, making training difficult, especially in the earlier stages.

Figure 8. The equations used in Faster R-CNN for transforming anchor box coordinates and sizes [1].

Instead of making the prediction relative to the anchor box like Faster R-CNN, the authors solved this issue by adopting the idea of predicting location coordinates relative to the grid cell from YOLOv1. However, they further modified this by introducing a sigmoid function to constrain the coordinate predictions of the network, effectively bounding the values to the range 0 to 1 so that the predicted location never falls outside the corresponding grid cell, as shown in the first and second rows of Figure 9. Next, the width and height of the bounding box are processed with an exponential function (third and fourth rows), which is useful to prevent negative values since it is simply nonsense to have negative width or height. Meanwhile, the way the confidence score is computed in the fifth row is the same as in YOLOv1, namely by multiplying the objectness confidence by the IOU between the predicted and the target box.

Figure 9. The equations used by YOLOv2 to make bounding box predictions [1].

So, in simple words, we adopt the concept of the prior box introduced by Faster R-CNN, but instead of handpicking the boxes, we use clustering to automatically find the most optimal prior box sizes. The bounding box is then constructed with an additional sigmoid function for the center coordinates (x and y) and an exponential function for the size (w and h). It is worth noting that x and y are now relative to the grid cell while w and h are relative to the prior box. The authors found that this method improved mAP from 69.6% to 74.4%.
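The equations in Figure 9 translate almost directly into code. The little decoding sketch below is my own interpretation of how the raw outputs t_x, t_y, t_w, t_h, t_o would be turned into a box, assuming the grid cell offsets c_x, c_y and the prior box size p_w, p_h are already known; it is not taken from the official implementation.

# Sketch: decoding raw YOLOv2 outputs into a bounding box (Figure 9).
import torch

def decode_box(t_x, t_y, t_w, t_h, t_o, c_x, c_y, p_w, p_h):
    b_x = torch.sigmoid(t_x) + c_x    # center x, constrained to its grid cell
    b_y = torch.sigmoid(t_y) + c_y    # center y, constrained to its grid cell
    b_w = p_w * torch.exp(t_w)        # width, always positive, relative to the prior box
    b_h = p_h * torch.exp(t_h)        # height, always positive, relative to the prior box
    conf = torch.sigmoid(t_o)         # objectness confidence
    return b_x, b_y, b_w, b_h, conf

# Example: raw predictions for one anchor in the grid cell at (c_x, c_y) = (6, 6),
# with a prior box of size 3.5 x 2.0 (in grid-cell units).
raw = torch.randn(5)
print(decode_box(*raw, c_x=6.0, c_y=6.0, p_w=3.5, p_h=2.0))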

Passthrough Layer

The final output feature map of YOLOv2 has a spatial dimension of 13×13, in which each element corresponds to a single grid cell. The information contained within each grid cell is considered coarse, which absolutely makes sense because the maxpooling layers inside the network work by taking only the highest values from the earlier feature maps. This might not be an issue if the objects to be detected are considerably large. But when the objects are small, our model might have a hard time performing the detection due to the loss of information contained in the non-prominent pixels.

To deal with this problem, the authors proposed to apply the so-called passthrough layer. The objective of this layer is to preserve fine-grained information from an earlier feature map before it is downsampled by a maxpooling layer. In Figure 12, the part of the network known as the passthrough layer is the connection that branches out from the network before eventually merging back into the main flow at the end. The concept of this layer is quite similar to the identity mapping introduced in ResNet. However, the process done in that model is simpler because the tensor dimensions of the original flow and the skip connection match exactly, allowing them to be summed element-wise. The case is different in the passthrough layer, where the later feature map has a smaller spatial dimension, hence we need to find a way to combine the information from the two tensors. The authors came up with an idea where they divide the feature map in the passthrough layer and then stack the divided tensors in a channel-wise manner, as shown in Figure 10 below. By doing so, the spatial dimension of the resulting tensor matches the subsequent feature map, allowing them to be concatenated along the channel axis. The fine-grained information from the earlier layer is then combined with the higher-level features from the later layer using a convolution layer.

Figure 10. How the tensor is processed inside the passthrough layer to adapt its dimensions to the subsequent layer [5].
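Before we get to the actual implementation later in Codeblock 7b, here is a tiny standalone experiment that I think helps build intuition: a 1×1×4×4 tensor is rearranged into a 1×4×2×2 one, so every 2×2 block of pixels gets spread across four channels instead of being discarded. Keep in mind that the exact channel ordering here may differ from the reorder() method we will write later; the point is only that spatial detail is moved into the channel dimension.

# Sketch: space-to-depth on a tiny tensor, the core idea behind the passthrough layer.
import torch

x = torch.arange(16).reshape(1, 1, 4, 4)       # one 4x4 single-channel "feature map"
print(x)

B, C, H, W = x.shape
s = 2                                          # scale factor (2 in YOLOv2)
out = (x.reshape(B, C, H // s, s, W // s, s)   # split H and W into blocks of size s
        .permute(0, 1, 3, 5, 2, 4)             # move the two block dimensions next to the channels
        .reshape(B, C * s * s, H // s, W // s))
print(out)                                     # shape: (1, 4, 2, 2), every pixel preserved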

Multi-Scale Training

Previously I mentioned that in YOLOv2 all FC layers have been replaced with a stack of convolution layers. This essentially allows us to feed images of different scales within the same training process, considering that the weights of a CNN-based model correspond to the trainable parameters in the kernels, which are independent of the input image dimensions. In fact, this is actually the reason why the authors decided to remove the FC layers in the first place. During the training phase, the authors changed the input resolution every 10 batches, randomly choosing from 320×320, 352×352, 384×384, and so forth up to 608×608, all multiples of 32. This process can be considered their approach to augmenting the data so that the model can detect objects across various input dimensions, and I believe it also allows the model to predict objects of different scales with better performance. This process boosted mAP to 76.8% at the default input resolution of 416×416, and it got even higher, 78.6%, when they increased the image resolution further to 544×544.
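The paper does not spell out the exact data-loading code, so the loop below is only a rough sketch of the idea under the assumption that the images are simply resized on the fly: every 10 batches we draw a new resolution from the multiples of 32 between 320 and 608.

# Sketch: multi-scale training, picking a new input resolution every 10 batches.
import random
import torch
import torch.nn.functional as F

resolutions = list(range(320, 609, 32))    # 320, 352, ..., 608 (all multiples of 32)
current_res = 416

for batch_idx in range(30):                # dummy training loop
    if batch_idx % 10 == 0:
        current_res = random.choice(resolutions)
    images = torch.randn(8, 3, 448, 448)   # dummy batch as loaded from disk
    images = F.interpolate(images, size=(current_res, current_res),
                           mode='bilinear', align_corners=False)
    # ... forward pass, loss computation, and backward pass would happen here ...
    if batch_idx % 10 == 0:
        print(batch_idx, images.shape)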

Darknet-19

All the modifications to YOLOv1 we discussed in the previous sub-sections were related to how the authors improved detection quality in terms of mAP and recall. Now the focus of this sub-section is improving model performance in terms of speed. It is mentioned in the paper that the authors use a model known as Darknet-19 as the backbone, which has fewer operations compared to the backbone of YOLOv1 (5.58 billion vs 8.52 billion), allowing YOLOv2 to run faster than its predecessor. The original version of this model consists of 19 convolution layers and 5 maxpooling layers, the details of which can be seen in Figure 11 below.

Figure 11. The vanilla Darknet-19 architecture [1].

It is important to note that the above architecture is the vanilla Darknet-19 model, which is only suitable for classification tasks. To adapt it to the requirements of YOLOv2, we need to slightly modify it by adding a passthrough layer and replacing the classification head with a detection head. You can see the modified architecture in Figure 12 below.

Figure 12. The entire YOLOv2 architecture [5].

Here you can see that the passthrough layer is placed after the last 26×26 feature map. This passthrough layer will reduce the spatial dimension to 13×13, allowing it to be concatenated in a channel-wise manner with the 13×13 feature map from the main flow. Later in the next section, I'm going to demonstrate how to implement this Darknet-19 architecture from scratch, including the detection head as well as the passthrough layer.

9000-Class Object Detection

The YOLOv2 model was initially trained on the PASCAL VOC and COCO datasets, which have 20 and 80 object classes, respectively. The authors saw this as an issue because they thought that this number is very limited for the general case, and hence lacks versatility. For this reason, it is important to improve the model such that it can detect a greater variety of object classes. However, creating an object detection dataset is very expensive and laborious, because in addition to the object classes we are also required to annotate the bounding box information.

The authors came up with a very clever idea, where they combined ImageNet, which has over 22,000 classes, with COCO using a class hierarchy mechanism which they refer to as WordTree, as shown in Figure 13. You can see in the illustration that the blue nodes are the classes from the COCO dataset, while the red ones are from the ImageNet dataset. The object categories available in the COCO dataset are relatively general, whereas those in ImageNet are a lot more fine-grained. For instance, where COCO only has the general class airplane, ImageNet provides four more specific airplane types. So, using the idea of WordTree, the authors put these four airplane types as subclasses of airplane. You can think of the inference like this: the model works by predicting a bounding box and the parent class, then it checks whether that class has subclasses. If so, the model will continue predicting from the smaller subset of classes.

By combining the two datasets like this, we eventually end up with a model that is able to predict over 9000 object classes (9418 to be exact), hence the name YOLO9000.

Figure 13. The class grouping mechanism used in YOLO9000 [1].
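To give a flavor of how such an inference could look, below is a toy sketch of the tree traversal: starting from the root, we keep descending to the child with the highest conditional probability, multiplying the probabilities along the path, and we stop once the joint probability drops below a threshold. The tiny tree and the probability values are completely made up; only the traversal idea itself comes from the paper.

# Sketch: WordTree-style inference, walking down the class hierarchy.
# children: maps a node to its child classes; probs: conditional probability of
# each node given its parent. Both are toy values, not real model outputs.
children = {
    'physical object': ['animal', 'vehicle'],
    'animal': ['dog', 'cat'],
    'dog': ['terrier', 'retriever'],
}
probs = {
    'physical object': 1.0,
    'animal': 0.9, 'vehicle': 0.1,
    'dog': 0.8, 'cat': 0.2,
    'terrier': 0.55, 'retriever': 0.45,
}

def predict_class(root='physical object', threshold=0.4):
    node, joint = root, probs[root]
    while node in children:
        best = max(children[node], key=lambda c: probs[c])
        if joint * probs[best] < threshold:
            break                     # not confident enough to go deeper
        node, joint = best, joint * probs[best]
    return node, joint

print(predict_class())    # stops at 'dog' since 0.72 * 0.55 falls below the threshold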

YOLOv2 Architecture Implementation

As I promised earlier, in this section I'm going to demonstrate how to implement the YOLOv2 architecture from scratch so that you can get a better understanding of how an input image eventually becomes a tensor containing bounding box and class predictions.

The first thing we need to do is import the required modules, which is shown in Codeblock 1 below.

# Codeblock 1
import torch
import torch.nn as nn

Next, we create the ConvBlock class, which encapsulates the convolution layer itself, a batch normalization layer, and a leaky ReLU activation function. The negative_slope parameter is set to 0.1 as shown at line #(1), which is exactly the same as the one used in YOLOv1.

# Codeblock 2
class ConvBlock(nn.Module):
    def __init__(self, 
                 in_channels, 
                 out_channels, 
                 kernel_size, 
                 padding):
        super().__init__()
        
        self.conv = nn.Conv2d(in_channels=in_channels,
                              out_channels=out_channels, 
                              kernel_size=kernel_size, 
                              padding=padding)
        self.bn = nn.BatchNorm2d(num_features=out_channels)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(1)
        
    def forward(self, x):
        print(f'original\t: {x.size()}')

        x = self.conv(x)
        print(f'after conv\t: {x.size()}')
        
        x = self.bn(x)    # apply batch normalization before the activation
        
        x = self.leaky_relu(x)
        print(f'after leaky relu: {x.size()}')
        
        return x

Just to check if the above class works properly, here I test it with a very simple test case, where I initialize a ConvBlock instance which accepts an RGB image of size 416×416. You can see in the resulting output that the image now has 64 channels, proving that our ConvBlock works properly.

# Codeblock 3
convblock = ConvBlock(in_channels=3,
                      out_channels=64,
                      kernel_size=3,
                      padding=1)
x = torch.randn(1, 3, 416, 416)
out = convblock(x)
# Codeblock 3 Output
original         : torch.Size([1, 3, 416, 416])
after conv       : torch.Size([1, 64, 416, 416])
after leaky relu : torch.Size([1, 64, 416, 416])

Darknet-19 Implementation

Now let's use this ConvBlock class to construct the Darknet-19 architecture. The way to do so is pretty straightforward, as all we need to do is stack multiple ConvBlock instances followed by a maxpooling layer based on the architecture in Figure 12. See the details in Codeblock 4a below. Note that the maxpooling layer for stage4 is placed at the beginning of stage5, as shown at the line marked with #(1). This is done because the output of stage4 will be fed directly into the passthrough layer without being downsampled. In addition to this, it is important to note that the term "stage" is not officially mentioned in the paper. Rather, this is just a term I personally use for the sake of this implementation.

# Codeblock 4a
class Darknet(nn.Module):
    def __init__(self):
        super(Darknet, self).__init__()
        
        
        self.stage0 = nn.ModuleList([
            ConvBlock(3, 32, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        
        self.stage1 = nn.ModuleList([
            ConvBlock(32, 64, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
            
        self.stage2 = nn.ModuleList([
            ConvBlock(64, 128, 3, 1), 
            ConvBlock(128, 64, 1, 0), 
            ConvBlock(64, 128, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        
        self.stage3 = nn.ModuleList([
            ConvBlock(128, 256, 3, 1), 
            ConvBlock(256, 128, 1, 0), 
            ConvBlock(128, 256, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        
        self.stage4 = nn.ModuleList([
            ConvBlock(256, 512, 3, 1), 
            ConvBlock(512, 256, 1, 0), 
            ConvBlock(256, 512, 3, 1), 
            ConvBlock(512, 256, 1, 0), 
            ConvBlock(256, 512, 3, 1), 
        ])
        
        self.stage5 = nn.ModuleList([
            nn.MaxPool2d(kernel_size=2, stride=2),    #(1)
            ConvBlock(512, 1024, 3, 1), 
            ConvBlock(1024, 512, 1, 0), 
            ConvBlock(512, 1024, 3, 1), 
            ConvBlock(1024, 512, 1, 0), 
            ConvBlock(512, 1024, 3, 1), 
        ])

Now that all layers have been initialized, the next thing we do is connect them using the forward() method in Codeblock 4b below. Previously I said that we are going to take the output of stage4 as the input for the passthrough layer. To do so, I store the feature map produced by the last layer of stage4 in a separate variable which I refer to as x_stage4 (#(1)). We then do the same thing for the output of stage5 (#(2)) and return both x_stage4 and x_stage5 as the output of our Darknet (#(3)).

# Codeblock 4b
    def forward(self, x):
        print(f'original\t: {x.size()}')
        
        print()
        for i in range(len(self.stage0)):
            x = self.stage0[i](x)
            print(f'after stage0 #{i}\t: {x.size()}')
        
        print()
        for i in range(len(self.stage1)):
            x = self.stage1[i](x)
            print(f'after stage1 #{i}\t: {x.size()}')
        
        print()
        for i in range(len(self.stage2)):
            x = self.stage2[i](x)
            print(f'after stage2 #{i}\t: {x.size()}')
        
        print()
        for i in range(len(self.stage3)):
            x = self.stage3[i](x)
            print(f'after stage3 #{i}\t: {x.size()}')
        
        print()
        for i in range(len(self.stage4)):
            x = self.stage4[i](x)
            print(f'after stage4 #{i}\t: {x.size()}')
            
        x_stage4 = x.clone()        #(1)
        
        print()
        for i in range(len(self.stage5)):
            x = self.stage5[i](x)
            print(f'after stage5 #{i}\t: {x.size()}')
        
        x_stage5 = x.clone()        #(2)

        return x_stage4, x_stage5   #(3)

Next, I test the Darknet-19 model above by passing in the same dummy image as the one in our previous test case.

# Codeblock 5
darknet = Darknet()

x = torch.randn(1, 3, 416, 416)
out = darknet(x)
# Codeblock 5 Output
original        : torch.Size([1, 3, 416, 416])

after stage0 #0 : torch.Size([1, 32, 416, 416])
after stage0 #1 : torch.Size([1, 32, 208, 208])

after stage1 #0 : torch.Size([1, 64, 208, 208])
after stage1 #1 : torch.Size([1, 64, 104, 104])

after stage2 #0 : torch.Size([1, 128, 104, 104])
after stage2 #1 : torch.Size([1, 64, 104, 104])
after stage2 #2 : torch.Size([1, 128, 104, 104])
after stage2 #3 : torch.Size([1, 128, 52, 52])

after stage3 #0 : torch.Size([1, 256, 52, 52])
after stage3 #1 : torch.Size([1, 128, 52, 52])
after stage3 #2 : torch.Size([1, 256, 52, 52])
after stage3 #3 : torch.Size([1, 256, 26, 26])

after stage4 #0 : torch.Size([1, 512, 26, 26])
after stage4 #1 : torch.Size([1, 256, 26, 26])
after stage4 #2 : torch.Size([1, 512, 26, 26])
after stage4 #3 : torch.Size([1, 256, 26, 26])
after stage4 #4 : torch.Size([1, 512, 26, 26])

after stage5 #0 : torch.Size([1, 512, 13, 13])
after stage5 #1 : torch.Size([1, 1024, 13, 13])
after stage5 #2 : torch.Size([1, 512, 13, 13])
after stage5 #3 : torch.Size([1, 1024, 13, 13])
after stage5 #4 : torch.Size([1, 512, 13, 13])
after stage5 #5 : torch.Size([1, 1024, 13, 13])

Here we can see that our output matches exactly with the architectural details in Figure 12, indicating that our implementation of the Darknet-19 model is correct.

The Entire YOLOv2 Architecture

Before actually constructing the complete YOLOv2 architecture, we need to initialize the parameters for the model first. Here we want each cell to generate 5 anchor boxes, hence we set the NUM_ANCHORS variable to that number. Next, I set NUM_CLASSES to 20 because we assume that we want to train the model on the PASCAL VOC dataset.

# Codeblock 6
NUM_ANCHORS = 5
NUM_CLASSES = 20

Now it's time to define the YOLOv2 class. In Codeblock 7a below, we first define the __init__() method, where we initialize the Darknet model (#(1)), a single ConvBlock for the passthrough layer (#(2)), a stack of two convolution layers which I refer to as stage6 (#(3)), and another stack of two convolution layers, the last of which is used to map the tensor into a prediction vector with B×(5+C) channels (#(4)).

# Codeblock 7a
class YOLOv2(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.darknet = Darknet()                       #(1)
        
        self.passthrough = ConvBlock(512, 64, 1, 0)    #(2)
        
        self.stage6 = nn.ModuleList([                  #(3)
            ConvBlock(1024, 1024, 3, 1), 
            ConvBlock(1024, 1024, 3, 1), 
        ])

        self.stage7 = nn.ModuleList([
            ConvBlock(1280, 1024, 3, 1),
            ConvBlock(1024, NUM_ANCHORS*(5+NUM_CLASSES), 1, 0)    #(4)
        ])

Afterwards, we define the so-called reorder() method, which we will use to process the feature map in the passthrough layer. The logic of the code below is quite complicated, but the main idea is that it follows the principle given in Figure 10. Here I show you the output of every line so that you can get a better understanding of how the process goes inside the function, given an input tensor of shape 1×64×26×26, which represents a single image of size 26×26 with 64 channels. In the last step we can see that the final output tensor has the shape of 1×256×13×13. This shape matches exactly with our requirement, where the channel dimension becomes 4 times larger than that of the input while at the same time the spatial dimensions are halved.

# Codeblock 7b
    def reorder(self, x, scale=2):                      # ([1, 64, 26, 26])
        B, C, H, W = x.shape
        h, w = H // scale, W // scale

        x = x.reshape(B, C, h, scale, w, scale)         # ([1, 64, 13, 2, 13, 2])     
        x = x.transpose(3, 4)                           # ([1, 64, 13, 13, 2, 2])

        x = x.reshape(B, C, h * w, scale * scale)       # ([1, 64, 169, 4])
        x = x.transpose(2, 3)                           # ([1, 64, 4, 169])

        x = x.reshape(B, C, scale * scale, h, w)        # ([1, 64, 4, 13, 13])
        x = x.transpose(1, 2)                           # ([1, 4, 64, 13, 13])

        x = x.reshape(B, scale * scale * C, h, w)       # ([1, 256, 13, 13])

        return x

Next, Codeblock 7c below shows how we create the flow of the network. We start from the Darknet backbone, which returns x_stage4 and x_stage5 (#(1)). The x_stage5 tensor will directly be processed by the subsequent convolution layers which I refer to as stage6 (#(2)), whereas the x_stage4 tensor will be passed to the passthrough layer (#(3)) and processed by the reorder() method (#(4)) we defined in Codeblock 7b above. Afterwards, we concatenate both tensors in a channel-wise manner at line #(5). This concatenated tensor is then processed further with another stack of convolution layers called stage7 (#(6)), which returns the prediction vector.

# Codeblock 7c
    def forward(self, x):
        print(f'original\t\t\t: {x.size()}')
        
        x_stage4, x_stage5 = self.darknet(x)              #(1)
        print(f'\nx_stage4\t\t\t: {x_stage4.size()}')
        print(f'x_stage5\t\t\t: {x_stage5.size()}')
        
        print()
        x = x_stage5
        for i in range(len(self.stage6)):
            x = self.stage6[i](x)                         #(2)
            print(f'x_stage5 after stage6 #{i}\t: {x.size()}')
        
        x_stage4 = self.passthrough(x_stage4)             #(3)
        print(f'\nx_stage4 after passthrough\t: {x_stage4.size()}')
        
        x_stage4 = self.reorder(x_stage4)                 #(4)
        print(f'x_stage4 after reorder\t\t: {x_stage4.size()}')
        
        x = torch.cat([x_stage4, x], dim=1)               #(5)
        print(f'\nx after concatenate\t\t: {x.size()}')
        
        for i in range(len(self.stage7)):                 #(6)
            x = self.stage7[i](x)
            print(f'x after stage7 #{i}\t: {x.size()}')
        
        return x

Again, to test the above code we will pass a tensor of size 1×3×416×416 through it.

# Codeblock 8
yolov2 = YOLOv2()
x = torch.randn(1, 3, 416, 416)

out = yolov2(x)

And below is what the output looks like after the code is run. The outputs labeled stage0 to stage5 are the processes inside the Darknet backbone, which are exactly the same as the ones I showed you earlier in the Codeblock 5 Output. Afterwards, we can see in stage6 that the shape of the x_stage5 tensor doesn't change at all (#(1–3)). Meanwhile, the channel dimension of x_stage4 increased from 64 to 256 after being processed by the reorder() operation (#(4–5)). The tensor from the main flow is then concatenated with the one from the passthrough layer, which causes the number of channels in the resulting tensor to become 1024+256=1280 (#(6)). Lastly, we pass the tensor through stage7, which returns a prediction tensor of size 125×13×13, denoting that we now have 13×13 grid cells where each of these cells contains a prediction vector of length 125 (#(7)), storing the bounding box and the object class predictions.

# Codeblock 8 Output
original        : torch.Size([1, 3, 416, 416])

after stage0 #0 : torch.Size([1, 32, 416, 416])
after stage0 #1 : torch.Size([1, 32, 208, 208])

after stage1 #0 : torch.Size([1, 64, 208, 208])
after stage1 #1 : torch.Size([1, 64, 104, 104])

after stage2 #0 : torch.Size([1, 128, 104, 104])
after stage2 #1 : torch.Size([1, 64, 104, 104])
after stage2 #2 : torch.Size([1, 128, 104, 104])
after stage2 #3 : torch.Size([1, 128, 52, 52])

after stage3 #0 : torch.Size([1, 256, 52, 52])
after stage3 #1 : torch.Size([1, 128, 52, 52])
after stage3 #2 : torch.Size([1, 256, 52, 52])
after stage3 #3 : torch.Size([1, 256, 26, 26])

after stage4 #0 : torch.Size([1, 512, 26, 26])
after stage4 #1 : torch.Size([1, 256, 26, 26])
after stage4 #2 : torch.Size([1, 512, 26, 26])
after stage4 #3 : torch.Size([1, 256, 26, 26])
after stage4 #4 : torch.Size([1, 512, 26, 26])

after stage5 #0 : torch.Size([1, 512, 13, 13])
after stage5 #1 : torch.Size([1, 1024, 13, 13])
after stage5 #2 : torch.Size([1, 512, 13, 13])
after stage5 #3 : torch.Size([1, 1024, 13, 13])
after stage5 #4 : torch.Size([1, 512, 13, 13])
after stage5 #5 : torch.Size([1, 1024, 13, 13])

x_stage4        : torch.Size([1, 512, 26, 26])
x_stage5        : torch.Size([1, 1024, 13, 13])              #(1)

x_stage5 after stage6 #0   : torch.Size([1, 1024, 13, 13])   #(2)
x_stage5 after stage6 #1   : torch.Size([1, 1024, 13, 13])   #(3)

x_stage4 after passthrough : torch.Size([1, 64, 26, 26])     #(4)
x_stage4 after reorder     : torch.Size([1, 256, 13, 13])    #(5)

x after concatenate        : torch.Size([1, 1280, 13, 13])   #(6)
x after stage7 #0          : torch.Size([1, 1024, 13, 13])
x after stage7 #1          : torch.Size([1, 125, 13, 13])    #(7)
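As a final sanity check, the 125 channels can be viewed as 5 anchors × 25 values, i.e., x, y, w, h and confidence plus the 20 class scores. The short snippet below is not part of the original codeblocks; it simply reshapes the out tensor from Codeblock 8 to make that structure explicit.

# Sketch: splitting the 125 channels into 5 anchors x (5 box values + 20 classes).
pred = out.reshape(1, NUM_ANCHORS, 5 + NUM_CLASSES, 13, 13)
print(pred.shape)                # torch.Size([1, 5, 25, 13, 13])

boxes   = pred[:, :, :5, :, :]   # x, y, w, h, confidence for every anchor and cell
classes = pred[:, :, 5:, :, :]   # 20 class scores for every anchor and cell
print(boxes.shape, classes.shape)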

Ending

I think that's pretty much everything about YOLOv2 and its model architecture implementation from scratch. The code used in this article is also available on my GitHub repository [6]. Please let me know if you spot any mistakes in my explanation or in the code. Thanks for reading, I hope you learned something new from this article. See ya in my next writing!


References

[1] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. arXiv. https://arxiv.org/abs/1612.08242 [Accessed August 9, 2025].

[2] Muhammad Ardi. YOLOv1 Paper Walkthrough: The Day YOLO First Saw the World. Medium. https://medium.com/ai-advances/yolov1-paper-walkthrough-the-day-yolo-first-saw-the-world-ccff8b60d84b [Accessed January 24, 2026].

[3] Muhammad Ardi. YOLOv1 Loss Function Walkthrough: Regression for All. Towards Data Science. https://towardsdatascience.com/yolov1-loss-function-walkthrough-regression-for-all/ [Accessed January 24, 2026].

[4] Image originally created by the author, partially generated with Gemini.

[5] Image originally created by the author.

[6] MuhammadArdiPutra. Better, Faster, Stronger - YOLOv2 and YOLO9000. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/Better%2C%20Faster%2C%20Stronger%20-%20YOLOv2%20and%20YOLO9000.ipynb [Accessed August 9, 2025].
