YOLOv3 Paper Walkthrough: Even Higher, But Not That Much

-

to be the state-of-the-art object detection algorithm, looked to turn into obsolete due to the looks of other methods like SSD (Single Shot Multibox Detector), DSSD (Deconvolutional Single Shot Detector), and RetinaNet. Finally, after two years for the reason that introduction of YOLOv2, the authors decided to enhance the algorithm where they eventually got here up with the subsequent YOLO version reported in a paper titled “” [1]. Because the title suggests, there have been indeed not many things the authors improved upon YOLOv2 when it comes to the underlying algorithm. But hey, on the subject of performance, it actually looks pretty impressive.

In this text I’m going to speak in regards to the modifications the authors made to YOLOv2 to create YOLOv3 and tips on how to implement the model architecture from scratch with PyTorch. I highly recommend you reading my previous article about YOLOv1 [2, 3] and YOLOv2 [4] before this one, unless you already got a powerful foundation in how these two earlier versions of YOLO work.


What Makes YOLOv3 Higher Than YOLOv2

The Vanilla Darknet-53

The modification the authors made was mainly related to the architecture, during which they proposed a backbone model known as Darknet-53. See the detailed structure of this network in Figure 1. Because the name suggests, this model is an improvement upon the Darknet-19 utilized in YOLOv2. In the event you count the variety of layers in Darknet-53, you’ll discover that this network consists of 52 convolution layers and a single fully-connected layer at the top. Have in mind that later after we implement it on YOLOv3, we’ll feed it with images of size 416×416 fairly than 256×256 as written within the figure.

Figure 1. The vanilla Darknet-53 architecture [1].

In the event you’re acquainted with Darknet-19, you could keep in mind that it performs spatial downsmapling using maxpooling operations after every stack of several convolution layers. In Darknet-53, authors replaced these pooling operations with convolutions of stride 2. This was essentially done because maxpooling layer completely ignores non-maximum numbers, causing us to lose quite a lot of information contained within the lower intensity pixels. We will actually use average-pooling as a substitute, but in theory, this approach won’t be optimal either because all pixels inside the small region are weighted the identical. In order an answer, authors decided to make use of convolution layer with a stride of two, which by doing so the model will have the opportunity to scale back image resolution while capturing spatial information with specific weightings. You may see the illustration for this in Figure 2 below.

Figure 2. How maxpooling, average-pooling and convolution with stride 2 differ from one another [5].

Next, the backbone of this YOLO version is now equipped with residual blocks which the thought is originated from ResNet. One thing that I would like to emphasise regarding our implementation is the activation function inside the residual block. You may see in Figure 3 below that in keeping with the unique ResNet paper, the second activation function is placed after the element-wise summation. Nevertheless, based on the opposite tutorials that I read [6, 7], I discovered that within the case of YOLOv3 the second activation function is placed right after the load layer as an alternative (before summation). So later within the implementation, I made a decision to follow the guide in these tutorials for the reason that YOLOv3 paper doesn’t give any explanations about it.

Figure 3. A residual block [8]

Darknet-53 With Detection Heads

Have in mind that the architecture in Figure 1 is just meant for classification. Thus, we’d like to interchange all the pieces after the last residual block if we intend to make it compatible for detection tasks. Again, the unique YOLOv3 paper also doesn’t provide the detailed implementation guide, hence I made a decision to look for it and eventually got one from the paper referenced as [9]. I redraw the illustration from that paper to make the architecture looks clearer as shown in Figure 4 below.

Figure 4. The YOLOv3 architecture [5].

There are literally numerous things to elucidate regarding the above architecture. Now let’s start from the part I consult with because the . Different from the previous YOLO versions which relied on a single head, here in YOLOv3 now we have 2 additional heads. Thus, we’ll later have 3 prediction tensors for each single input image. These three detection heads have different specializations: the leftmost head (13×13) is the one responsible to detect large objects, the center head (26×26) is for detecting medium-sized objects, and the one on the best (52×52) is used to detect objects of small size. We will consider the 52×52 tensor because the feature map that comprises the detailed representation of a picture, hence is suitable to detect small objects. Conversely, the 13×13 prediction tensor is supposed to detect large objects due to its lower spatial resolution which is effective at capturing the overall shape of an object.

Still with the detection head, you can too see in Figure 4 that the three prediction tensors have 255 channels. To grasp where this number comes from, we first must know that every detection head has 3 prior boxes. Following the rule given in YOLOv2, each of those prior boxes is configured such that it may possibly predict its own object category independently. With this mechanism, the feature vector of every grid cell might be obtained by computing B×(5+C), where is the variety of prior boxes, is the variety of object classes, and 5 is the and the bounding box confidence (a.k.a. ). Within the case of YOLOv3, each detection head has 3 prior boxes and 80 classes, assuming that we train it on 80-class COCO dataset. By plugging these numbers to the formula, we obtain 3×(5+80)=255 prediction values for a single grid cell.

In truth, using multi-head mechanism like this enables the model to detect more objects as in comparison with the sooner YOLO versions. Previously in YOLOv1, since a picture is split into 7×7 grid cells and every of those can predict 2 bounding boxes, hence there are 98 objects possible to be detected. Meanwhile in YOLOv2, a picture is split into 13×13 grid cells during which a single cell is able to generating 5 bounding boxes, making YOLOv2 capable of detect as much as 845 objects inside a single image. This essentially allows YOLOv2 to have a greater recall than YOLOv1. In theory, YOLOv3 is potentially capable of achieve an excellent higher recall, especially when tested on a picture that comprises quite a lot of objects due to the larger variety of possible detections. We will calculate the variety of maximum bounding boxes for a single image in YOLOv3 by computing (13×13×3) + (26×26×3) + (52×52×3) = 507 + 2028 + 8112 = 10647, where 13×13, 26×26, and 52×25 are the variety of grid cells inside each prediction tensor, while 3 is the variety of prior boxes a single grid cell has.

We may see in Figure 4 that there are two concatenation steps incorporated within the network, i.e., between the unique Darknet-53 architecture and the detection heads. The target of those steps is to mix information from the deeper layer with the one from the shallower layer. Combining information from different depths like this is very important because on the subject of detecting smaller objects, we do need each an in depth spatial information (contained within the shallower layer) and a greater semantic information (contained within the deeper layer). Have in mind that the feature map from the deeper layer has a smaller spatial dimension, hence we’d like to expand it before actually doing the concatenation. This is basically the rationale that we’d like to position an layer right before we do the concatenation.

Multi-Label Classification

Other than the architecture, the authors also modified the category labeling mechanism. As an alternative of using an ordinary multiclass classification paradigm, they proposed to make use of the so-called multilabel classification. In the event you’re not yet acquainted with it, this is essentially a way where a picture might be assigned multiple labels directly. Take a take a look at Figure 5 below to higher understand this concept. In this instance, the image on the left belongs to the category , , , and concurrently. In a while, YOLOv3 can also be expected to have the opportunity to make multiple class predictions on the identical detected object.

Figure 5. Image with multiple labels [10].

To ensure that the model to predict multiple labels, we’d like to treat each class prediction output as an independent binary classifier. Have a look at Figure 6 below to see how multiclass classification differs from multilabel classification. The illustration on the left is a condition after we use a typical multiclass classification mechanism. Here you’ll be able to see that the possibilities of all classes sum to 1 due to the character of the softmax activation function inside the output layer. In this instance, for the reason that class is predicted with the very best probability, then the ultimate prediction could be no matter how high the prediction confidence of the opposite classes is.

However, if we use multilabel classification, there may be a possibility that the sum of all class prediction probabilities is larger than 1 because we use sigmoid activation function which by nature doesn’t restrict the sum of all prediction confidence scores to 1. Because of this reason, later within the implementation we will simply apply a particular threshold to think about a category predicted. In the instance below, if we assume that the brink is 0.7, then the image will likely be predicted as each and .

Figure 6. Multiclass vs multilabel classification [10].

Modified Loss Function

One other modification the authors made was related to the loss function. Now take a look at the loss function of YOLOv1 in Figure 7 below. As a refresher, the first and 2nd rows are responsible to compute the bounding box loss, the third and 4th rows are for the objectness confidence loss, and the fifth row is for computing the classification loss. Do not forget that in YOLOv1 the authors used SSE (Sum of Squared Errors) in all these five rows to make things easy.

Figure 7. The loss function of YOLOv1 [5, 11].

In YOLOv3, the authors decided to interchange the objectness loss (the third and 4th rows) with , considering that the predictions corresponding to this part is just either 1 or 0, i.e., whether there may be an object midpoint or not. Thus, it makes more sense to treat this as a binary classification fairly than a regression problem.

Binary cross entropy will even be utilized in the classification loss (fifth row). This is basically because we use multilabel classification mechanism we discussed earlier, where we treat each of the output neuron as an independent binary classifier. Do not forget that if we were to perform an ordinary classification task, we typically must set the loss function to as an alternative.

Now below is what the loss function looks like after we replace the SSE with binary cross entropy for the objectness (green) and the multilabel classification (blue) parts. Note that this equation is created based on the YouTube tutorial I watched given at reference number [12] because, again, the authors don’t explicitly provide the ultimate loss function within the paper.

Figure 8. The loss function of YOLOv3 [5, 12].

Some Experimental Results

With all of the modifications discussed above, the authors found that the advance in performance is pretty impressive. The primary experimental result I would like to indicate you is said to the performance of the backbone model in classifying images on ImageNet dataset. You may see in Figure 9 below that the advance from Darknet-19 (YOLOv2) to Darknet-53 (YOLOv3) is kind of significant when it comes to each the top-1 accuracy (74.1 to 77.2) and the top-5 accuracy (91.8 to 93.8). It’s vital to acknowledge that ResNet-101 and ResNet-152 indeed also perform nearly as good as Darknet-53 in accuracy, but when we compare the FPS (measured on Nvidia Titan X), we will see that Darknet-53 is rather a lot faster than each ResNet variants.

Figure 9. Performance of various backbones on ImageNet classification dataset [1].

The same behavior may be observed on object detection task, where it’s seen in Figure 10 that every one YOLOv3 variants successfully obtained the fastest computation time amongst all other methods despite not having the very best accuracy. You may see within the figure that the most important YOLOv3 variant is sort of 4 times faster than the most important RetinaNet variant (51 ms vs 198 ms). Furthermore, the most important YOLOv3 variant itself already surpasses the mAP of the smallest RetinaNet variant (33.0 vs 32.5) while still having a faster inference time (51 ms vs 73 ms). These experimental results essentially prove that YOLOv3 became the state-of-the-art object detection model when it comes to computational speed at that moment.

Figure 10. Performance of YOLOv3 in comparison with other object detection methods [10].

YOLOv3 Architecture Implementation

As now we have already discussed just about all the pieces in regards to the theory behind YOLOv3, we will now start implementing the architecture from scratch. In Codeblock 1 below, I import the torch module and its nn submodule. Here I also initialize the NUM_PRIORS and NUM_CLASS variables, during which these two correspond to the variety of prior boxes inside each grid cell and the variety of object classes within the dataset, respectively.

# Codeblock 1
import torch
import torch.nn as nn

NUM_PRIORS = 3
NUM_CLASS  = 80

Convolutional Block Implementation

What I’m going to implement first is the major constructing block of the network, which I consult with because the Convolutional block as seen in Codeblock 2. The structure of this block is kind of a bit the identical because the one utilized in YOLOv2, where it follows the pattern. Once we use this sort of structure, don’t forget to set the bias parameter of the conv layer to False (at line #(1)) because using bias term is somewhat useless if we directly place a batch normalization layer right after it. Here I also configure the padding of the conv layer such that it is going to mechanically set to 1 each time the kernel size is 3×3 or 0 each time we use 1×1 kernel (#(2)). Next, because the conv, bn, and leaky_relu have been initialized, we will simply connect all of them using the code written contained in the forward() method (#(3)).

# Codeblock 2
class Convolutional(nn.Module):
    def __init__(self, 
                 in_channels, 
                 out_channels, 
                 kernel_size, 
                 stride=1):
        super().__init__()
        
        self.conv = nn.Conv2d(in_channels=in_channels,
                              out_channels=out_channels, 
                              kernel_size=kernel_size, 
                              stride=stride,
                              bias=False,                            #(1)
                              padding=1 if kernel_size==3 else 0)    #(2)
        
        self.bn = nn.BatchNorm2d(num_features=out_channels)
        
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)
        
    def forward(self, x):        #(3)
        print(f'originalt: {x.size()}')

        x = self.conv(x)
        print(f'after convt: {x.size()}')
        
        x = self.bn(x)
        print(f'after bnt: {x.size()}')
        
        x = self.leaky_relu(x)
        print(f'after leaky relu: {x.size()}')
        
        return x

We just need to be sure that our major constructing block is working properly, we’ll test it by simulating the very first Convolutional block in Figure 1. Do not forget that since YOLOv3 takes a picture of size 416×416 because the input, here in Codeblock 3 I create a dummy tensor of that shape to simulate a picture passed through that layer. Also, note that here I leave the stride to the default (1) because at this point we don’t need to perform spatial downsampling.

# Codeblock 3
convolutional = Convolutional(in_channels=3,
                              out_channels=32,
                              kernel_size=3)

x = torch.randn(1, 3, 416, 416)
out = convolutional(x)
# Codeblock 3 Output
original         : torch.Size([1, 3, 416, 416])
after conv       : torch.Size([1, 32, 416, 416])
after bn         : torch.Size([1, 32, 416, 416])
after leaky relu : torch.Size([1, 32, 416, 416])

Now let’s test our Convolutional block again, but this time I’ll set the stride to 2 to simulate the second convolutional block within the architecture. We will see within the output below that the spatial dimension halves from 416×416 to 208×208, indicating that this approach is a sound substitute for the maxpooling layers we previously had in YOLOv1 and YOLOv2.

# Codeblock 4
convolutional = Convolutional(in_channels=32,
                              out_channels=64,
                              kernel_size=3, 
                              stride=2)

x = torch.randn(1, 32, 416, 416)
out = convolutional(x)
# Codeblock 4 Output
original          : torch.Size([1, 32, 416, 416])
after conv        : torch.Size([1, 64, 208, 208])
after bn          : torch.Size([1, 64, 208, 208])
after leaky relu  : torch.Size([1, 64, 208, 208])

Residual Block Implementation

Because the Convolutional block is completed, what I’m going to do now could be to implement the subsequent constructing block: Residual. This block generally follows the structure I displayed back in Figure 3, where it consists of a residual connection that skips through two Convolutional blocks. Take a take a look at the Codeblock 5 below to see how I implement it.

The 2 convolution layers themselves follow the pattern in Figure 1, where the primary Convolutional halves the variety of channels (#(1)) which can then be doubled again by the second Convolutional (#(3)). Here you furthermore mght must note that the primary convolution uses 1×1 kernel (#(2)) whereas the second uses 3×3 (#(4)). Next, what we do contained in the forward() method is just connecting the 2 convolutions sequentially, which the ultimate output is summed with the unique input tensor (#(5)) before being returned.

# Codeblock 5
class Residual(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        self.conv0 = Convolutional(in_channels=num_channels, 
                                   out_channels=num_channels//2,   #(1)
                                   kernel_size=1,       #(2)
                                   stride=1)
        
        self.conv1 = Convolutional(in_channels=num_channels//2,
                                   out_channels=num_channels,      #(3)
                                   kernel_size=3,       #(4)
                                   stride=1)
        
    def forward(self, x):
        original = x.clone()
        print(f'originalt: {x.size()}')
        
        x = self.conv0(x)
        print(f'after conv0t: {x.size()}')
        
        x = self.conv1(x)
        print(f'after conv1t: {x.size()}')
        
        x = x + original      #(5)
        print(f'after summationt: {x.size()}')
        
        return x

We’ll now test the Residual block we just created using the Codeblock 6 below. Here I set the num_channels parameter to 64 because I would like to simulate the very first residual block within the Darknet-53 architecture (see Figure 1).

# Codeblock 6
residual = Residual(num_channels=64)

x = torch.randn(1, 64, 208, 208)
out = residual(x)
# Codeblock 6 Output
original        : torch.Size([1, 64, 208, 208])
after conv0     : torch.Size([1, 32, 208, 208])
after conv1     : torch.Size([1, 64, 208, 208])
after summation : torch.Size([1, 64, 208, 208])

In the event you take a better take a look at the above output, you’ll notice that the form of the input and output tensors are the exact same. This essentially allows us to repeat multiple residual blocks easily. Within the Codeblock 7 below I attempt to stack 4 residual blocks and pass a tensor through it, simulating the very last stack of residual blocks within the architecture.

# Codeblock 7
residuals = nn.ModuleList([])
for _ in range(4):
    residual = Residual(num_channels=1024)
    residuals.append(residual)
    
x = torch.randn(1, 1024, 13, 13)

for i in range(len(residuals)):
    x = residuals[i](x)
    print(f'after residuals #{i}t: {x.size()}')
# Codeblock 7 Output
after residuals #0 : torch.Size([1, 1024, 13, 13])
after residuals #1 : torch.Size([1, 1024, 13, 13])
after residuals #2 : torch.Size([1, 1024, 13, 13])
after residuals #3 : torch.Size([1, 1024, 13, 13])

Darknet-53 Implementation

Using the Convolutional and Residual constructing blocks we created earlier, we will now actually construct the Darknet-53 model. Every little thing I initialize contained in the __init__() method below relies on the architecture in Figure 1. Nevertheless, keep in mind that we’d like to stop on the last residual block since we don’t need the worldwide average pooling and the fully-connected layers. Not only that, on the lines marked with #(1) and #(2) I store the intermediate feature maps in separate variables (branch0 and branch1). We’ll later return these feature maps alongside the output from the major flow (x) (#(3)) to implement the branches that flow into the three detection heads.

# Codeblock 8
class Darknet53(nn.Module):
    def __init__(self):
        super().__init__()

        self.convolutional0 = Convolutional(in_channels=3,
                                            out_channels=32,
                                            kernel_size=3)
        
        self.convolutional1 = Convolutional(in_channels=32,
                                            out_channels=64,
                                            kernel_size=3,
                                            stride=2)
        
        self.residuals0 = nn.ModuleList([Residual(num_channels=64) for _ in range(1)])
        
        self.convolutional2 = Convolutional(in_channels=64,
                                            out_channels=128,
                                            kernel_size=3,
                                            stride=2)
        
        self.residuals1 = nn.ModuleList([Residual(num_channels=128) for _ in range(2)])
        
        self.convolutional3 = Convolutional(in_channels=128,
                                            out_channels=256,
                                            kernel_size=3,
                                            stride=2)
        
        self.residuals2 = nn.ModuleList([Residual(num_channels=256) for _ in range(8)])
        
        self.convolutional4 = Convolutional(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=2)
        
        self.residuals3 = nn.ModuleList([Residual(num_channels=512) for _ in range(8)])
        
        self.convolutional5 = Convolutional(in_channels=512,
                                            out_channels=1024,
                                            kernel_size=3,
                                            stride=2)
        
        self.residuals4 = nn.ModuleList([Residual(num_channels=1024) for _ in range(4)])
        
    def forward(self, x):
        print(f'originaltt: {x.size()}n')
        
        x = self.convolutional0(x)
        print(f'after convolutional0t: {x.size()}')
        
        x = self.convolutional1(x)
        print(f'after convolutional1t: {x.size()}n')
        
        for i in range(len(self.residuals0)):
            x = self.residuals0[i](x)
            print(f'after residuals0 #{i}t: {x.size()}')
        
        x = self.convolutional2(x)
        print(f'nafter convolutional2t: {x.size()}n')
        
        for i in range(len(self.residuals1)):
            x = self.residuals1[i](x)
            print(f'after residuals1 #{i}t: {x.size()}')
            
        x = self.convolutional3(x)
        print(f'nafter convolutional3t: {x.size()}n')
        
        for i in range(len(self.residuals2)):
            x = self.residuals2[i](x)
            print(f'after residuals2 #{i}t: {x.size()}')
        
        branch0 = x.clone()           #(1)
            
        x = self.convolutional4(x)
        print(f'nafter convolutional4t: {x.size()}n')
        
        for i in range(len(self.residuals3)):
            x = self.residuals3[i](x)
            print(f'after residuals3 #{i}t: {x.size()}')
        
        branch1 = x.clone()           #(2)
            
        x = self.convolutional5(x)
        print(f'nafter convolutional5t: {x.size()}n')
        
        for i in range(len(self.residuals4)):
            x = self.residuals4[i](x)
            print(f'after residuals4 #{i}t: {x.size()}')
            
        return branch0, branch1, x    #(3)

Now we test our Darknet53 class by running the Codeblock 9 below. You may see within the resulting output that all the pieces seems to work properly as the form of the tensor appropriately transforms in keeping with the guide in Figure 1. One thing that I haven’t mentioned before is that this Darknet-53 architecture downscales the input image by an element of 32. So, with this downsampling factor, an input image of shape 256×256 will turn into 8×8 ultimately (as shown in Figure 1), whereas an input of shape 416×416 will lead to a 13×13 prediction tensor.

# Codeblock 9
darknet53 = Darknet53()

x = torch.randn(1, 3, 416, 416)
out = darknet53(x)
# Codeblock 9 Output
original              : torch.Size([1, 3, 416, 416])

after convolutional0  : torch.Size([1, 32, 416, 416])
after convolutional1  : torch.Size([1, 64, 208, 208])

after residuals0 #0   : torch.Size([1, 64, 208, 208])

after convolutional2  : torch.Size([1, 128, 104, 104])

after residuals1 #0   : torch.Size([1, 128, 104, 104])
after residuals1 #1   : torch.Size([1, 128, 104, 104])

after convolutional3  : torch.Size([1, 256, 52, 52])

after residuals2 #0   : torch.Size([1, 256, 52, 52])
after residuals2 #1   : torch.Size([1, 256, 52, 52])
after residuals2 #2   : torch.Size([1, 256, 52, 52])
after residuals2 #3   : torch.Size([1, 256, 52, 52])
after residuals2 #4   : torch.Size([1, 256, 52, 52])
after residuals2 #5   : torch.Size([1, 256, 52, 52])
after residuals2 #6   : torch.Size([1, 256, 52, 52])
after residuals2 #7   : torch.Size([1, 256, 52, 52])

after convolutional4  : torch.Size([1, 512, 26, 26])

after residuals3 #0   : torch.Size([1, 512, 26, 26])
after residuals3 #1   : torch.Size([1, 512, 26, 26])
after residuals3 #2   : torch.Size([1, 512, 26, 26])
after residuals3 #3   : torch.Size([1, 512, 26, 26])
after residuals3 #4   : torch.Size([1, 512, 26, 26])
after residuals3 #5   : torch.Size([1, 512, 26, 26])
after residuals3 #6   : torch.Size([1, 512, 26, 26])
after residuals3 #7   : torch.Size([1, 512, 26, 26])

after convolutional5  : torch.Size([1, 1024, 13, 13])

after residuals4 #0   : torch.Size([1, 1024, 13, 13])
after residuals4 #1   : torch.Size([1, 1024, 13, 13])
after residuals4 #2   : torch.Size([1, 1024, 13, 13])
after residuals4 #3   : torch.Size([1, 1024, 13, 13])

At this point we may see what the outputs produced by the three branches seem like just by printing out the shapes of branch0, branch1, and x as shown in Codeblock 10 below. Notice that the spatial dimensions of those three tensors vary. In a while, the tensors from the deeper layers will likely be upsampled in order that we will perform channel-wise concatenation with those from the shallower ones.

# Codeblock 10
print(out[0].shape)      # branch0
print(out[1].shape)      # branch1
print(out[2].shape)      # x
# Codeblock 10 Output
torch.Size([1, 256, 52, 52])
torch.Size([1, 512, 26, 26])
torch.Size([1, 1024, 13, 13])

Detection Head Implementation

In the event you return to Figure 4, you’ll notice that every of the detection heads consists of two convolution layers. Nevertheless, these two convolutions usually are not an identical. In Codeblock 11 below I take advantage of the Convolutional block for the primary one and the plain nn.Conv2d for the second. This is basically done since the second convolution acts as the ultimate layer, hence is liable for giving raw output (as an alternative of being normalized and ReLU-ed).

# Codeblock 11
class DetectionHead(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        
        self.convhead0 = Convolutional(in_channels=num_channels,
                                       out_channels=num_channels*2,
                                       kernel_size=3)
        
        self.convhead1 = nn.Conv2d(in_channels=num_channels*2, 
                                   out_channels=NUM_PRIORS*(NUM_CLASS+5), 
                                   kernel_size=1)
        
    def forward(self, x):
        print(f'originalt: {x.size()}')
        
        x = self.convhead0(x)
        print(f'after convhead0t: {x.size()}')
        
        x = self.convhead1(x)
        print(f'after convhead1t: {x.size()}')
   
        return x

Now in Codeblock 12 I’ll attempt to simulate the 13×13 detection head, hence I set the input feature map to have the form of 512×13×13 (#(1)). By the way in which you’ll know where the number 512 comes from later in the following section.

# Codeblock 12
detectionhead = DetectionHead(num_channels=512)

x = torch.randn(1, 512, 13, 13)    #(1)
out = detectionhead(x)

And below is what the resulting output looks like. We will see here that the tensor expands to 1024×13×13 before eventually shrink to 255×13×13. Do not forget that in YOLOv3 so long as we set NUM_PRIORS to three and NUM_CLASS to 80, the variety of output channel will all the time be 255 whatever the variety of input channel fed into the DetectionHead.

# Codeblock 12 Output
original        : torch.Size([1, 512, 13, 13])
after convhead0 : torch.Size([1, 1024, 13, 13])
after convhead1 : torch.Size([1, 255, 13, 13])

The Entire YOLOv3 Architecture

Okay now — since now we have initialized the major constructing blocks, what we’d like to do next is to construct the complete YOLOv3 architecture. Here I will even discuss the remaining components we haven’t covered. The code is kind of long though, so I break it down into two codeblocks: Codeblock 13a and Codeblock 13b. Just be sure that these two codeblocks are written inside the same notebook cell if you must run it on your individual.

In Codeblock 13a below, what we do first is to initialize the backbone model (#(1)). Next, we create a stack of 5 Convolutional blocks which alternately halves and doubles the variety of channels. The conv block that reduces the channel count uses 1×1 kernel while the one which increases it uses 3×3 kernel, similar to the structure we use within the Residual block. We initialize this stack of 5 convolutions for the three detection heads. Specifically for the feature maps that flow into the 26×26 and 52×52 heads, we’d like to initialize one other convolution layer (#(2) and #(4)) and an upsampling layer (#(3) and #(5)) along with the 5 convolutions.

# Codeblock 13a
class YOLOv3(nn.Module):
    def __init__(self):
        super().__init__()
        
        ###############################################
        # Backbone initialization.
        
        self.darknet53 = Darknet53()    #(1)
        
        
        ###############################################
        # For 13x13 output.
        
        self.conv0  = Convolutional(in_channels=1024, out_channels=512, kernel_size=1)
        self.conv1  = Convolutional(in_channels=512, out_channels=1024, kernel_size=3)
        self.conv2  = Convolutional(in_channels=1024, out_channels=512, kernel_size=1)
        self.conv3  = Convolutional(in_channels=512, out_channels=1024, kernel_size=3)
        self.conv4  = Convolutional(in_channels=1024, out_channels=512, kernel_size=1)
        
        self.detection_head_large_obj = DetectionHead(num_channels=512)
        
        
        ###############################################
        # For 26x26 output.
        
        self.conv5  = Convolutional(in_channels=512, out_channels=256, kernel_size=1)  #(2)
        self.upsample0 = nn.Upsample(scale_factor=2)      #(3)
        
        self.conv6  = Convolutional(in_channels=768, out_channels=256, kernel_size=1)
        self.conv7  = Convolutional(in_channels=256, out_channels=512, kernel_size=3)
        self.conv8  = Convolutional(in_channels=512, out_channels=256, kernel_size=1)
        self.conv9  = Convolutional(in_channels=256, out_channels=512, kernel_size=3)
        self.conv10 = Convolutional(in_channels=512, out_channels=256, kernel_size=1)
        
        self.detection_head_medium_obj = DetectionHead(num_channels=256)
        
        
        ###############################################
        # For 52x52 output.
        
        self.conv11  = Convolutional(in_channels=256, out_channels=128, kernel_size=1)  #(4)
        self.upsample1 = nn.Upsample(scale_factor=2)      #(5)
        
        self.conv12  = Convolutional(in_channels=384, out_channels=128, kernel_size=1)
        self.conv13  = Convolutional(in_channels=128, out_channels=256, kernel_size=3)
        self.conv14  = Convolutional(in_channels=256, out_channels=128, kernel_size=1)
        self.conv15  = Convolutional(in_channels=128, out_channels=256, kernel_size=3)
        self.conv16  = Convolutional(in_channels=256, out_channels=128, kernel_size=1)
        
        self.detection_head_small_obj = DetectionHead(num_channels=128)

Now in Codeblock 13b we define the flow of the network contained in the forward() method. Here we first pass the input tensor through the darknet53 model (#(1)), which produces 3 output tensors: branch0, branch1, and x. Then, what we do next is to attach the layers one after one other in keeping with the flow given in Figure 4.

# Codeblock 13b
    def forward(self, x):
        
        ###############################################
        # Backbone.
        branch0, branch1, x = self.darknet53(x)      #(1)
        print(f'branch0ttt: {branch0.size()}')
        print(f'branch1ttt: {branch1.size()}')
        print(f'xttt: {x.size()}n')
        
        
        ###############################################
        # Flow to 13x13 detection head.
        
        x = self.conv0(x)
        print(f'after conv0tt: {x.size()}')
        
        x = self.conv1(x)
        print(f'after conv1tt: {x.size()}')
        
        x = self.conv2(x)
        print(f'after conv2tt: {x.size()}')
        
        x = self.conv3(x)
        print(f'after conv3tt: {x.size()}')
        
        x = self.conv4(x)
        print(f'after conv4tt: {x.size()}')
        
        large_obj = self.detection_head_large_obj(x)
        print(f'large object detectiont: {large_obj.size()}n')
        
        
        ###############################################
        # Flow to 26x26 detection head.
        
        x = self.conv5(x)
        print(f'after conv5tt: {x.size()}')
        
        x = self.upsample0(x)
        print(f'after upsample0tt: {x.size()}')
        
        x = torch.cat([x, branch1], dim=1)
        print(f'after concatenatet: {x.size()}')
        
        x = self.conv6(x)
        print(f'after conv6tt: {x.size()}')
        
        x = self.conv7(x)
        print(f'after conv7tt: {x.size()}')
        
        x = self.conv8(x)
        print(f'after conv8tt: {x.size()}')
        
        x = self.conv9(x)
        print(f'after conv9tt: {x.size()}')
        
        x = self.conv10(x)
        print(f'after conv10tt: {x.size()}')
        
        medium_obj = self.detection_head_medium_obj(x)
        print(f'medium object detectiont: {medium_obj.size()}n')
        
        
        ###############################################
        # Flow to 52x52 detection head.
        
        x = self.conv11(x)
        print(f'after conv11tt: {x.size()}')
        
        x = self.upsample1(x)
        print(f'after upsample1tt: {x.size()}')
        
        x = torch.cat([x, branch0], dim=1)
        print(f'after concatenatet: {x.size()}')
        
        x = self.conv12(x)
        print(f'after conv12tt: {x.size()}')
        
        x = self.conv13(x)
        print(f'after conv13tt: {x.size()}')
        
        x = self.conv14(x)
        print(f'after conv14tt: {x.size()}')
        
        x = self.conv15(x)
        print(f'after conv15tt: {x.size()}')
        
        x = self.conv16(x)
        print(f'after conv16tt: {x.size()}')
        
        small_obj = self.detection_head_small_obj(x)
        print(f'small object detectiont: {small_obj.size()}n')
        

        ###############################################
        # Return prediction tensors.
        
        return large_obj, medium_obj, small_obj

As now we have accomplished the forward() method, we will now test the complete YOLOv3 model by passing a single RGB image of size 416×416 as shown in Codeblock 14.

# Codeblock 14
yolov3 = YOLOv3()

x = torch.randn(1, 3, 416, 416)
out = yolov3(x)

Below is what the output looks like after you run the codeblock above. Here we will see that all the pieces seems to work properly because the dummy image successfully passed through all layers within the network. One thing that you just might probably must know is that the 768-channel feature map at line #(4) is obtained from the concatenation between the tensor at lines #(2) and #(3). The same thing also applies to the 384-channel tensor at line #(6), during which it’s the concatenation between the feature maps at lines #(1) and #(5).

# Codeblock 14 Output
branch0                 : torch.Size([1, 256, 52, 52])    #(1)
branch1                 : torch.Size([1, 512, 26, 26])    #(2)
x                       : torch.Size([1, 1024, 13, 13])

after conv0             : torch.Size([1, 512, 13, 13])
after conv1             : torch.Size([1, 1024, 13, 13])
after conv2             : torch.Size([1, 512, 13, 13])
after conv3             : torch.Size([1, 1024, 13, 13])
after conv4             : torch.Size([1, 512, 13, 13])
large object detection  : torch.Size([1, 255, 13, 13])

after conv5             : torch.Size([1, 256, 13, 13])
after upsample0         : torch.Size([1, 256, 26, 26])    #(3)
after concatenate       : torch.Size([1, 768, 26, 26])    #(4)
after conv6             : torch.Size([1, 256, 26, 26])
after conv7             : torch.Size([1, 512, 26, 26])
after conv8             : torch.Size([1, 256, 26, 26])
after conv9             : torch.Size([1, 512, 26, 26])
after conv10            : torch.Size([1, 256, 26, 26])
medium object detection : torch.Size([1, 255, 26, 26])

after conv11            : torch.Size([1, 128, 26, 26])
after upsample1         : torch.Size([1, 128, 52, 52])    #(5)
after concatenate       : torch.Size([1, 384, 52, 52])    #(6)
after conv12            : torch.Size([1, 128, 52, 52])
after conv13            : torch.Size([1, 256, 52, 52])
after conv14            : torch.Size([1, 128, 52, 52])
after conv15            : torch.Size([1, 256, 52, 52])
after conv16            : torch.Size([1, 128, 52, 52])
small object detection  : torch.Size([1, 255, 52, 52])

And simply to make things clearer here I also print out the output of every detection head in Codeblock 15 below. We will see here that every one the resulting prediction tensors have the form that we expected earlier. Thus, I consider our YOLOv3 implementation is correct and hence able to train.

# Codeblock 15
print(out[0].shape)
print(out[1].shape)
print(out[2].shape)
# Codeblock 15 Output
torch.Size([1, 255, 13, 13])
torch.Size([1, 255, 26, 26])
torch.Size([1, 255, 52, 52])

I believe that’s just about all the pieces about YOLOv3 and its architecture implementation from scratch. As we’ve seen above, the authors successfully made a powerful improvement in performance in comparison with the previous YOLO version, despite the fact that the changes they made to the system weren’t what they considered significant — hence the title “.” 

Please let me know in the event you spot any mistake in this text. You can too find the fully-working code in my GitHub repo [13]. Thanks for reading, see ya in my next article!


References

[1] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. Arxiv. https://arxiv.org/abs/1804.02767 [Accessed August 24, 2025].

[2] Muhammad Ardi. YOLOv1 Paper Walkthrough: The Day YOLO First Saw the World. Medium. https://ai.gopubby.com/yolov1-paper-walkthrough-the-day-yolo-first-saw-the-world-ccff8b60d84b [Accessed March 1, 2026].

[3] Muhammad Ardi. YOLOv1 Loss Function Walkthrough: Regression for All. Medium. https://ai.gopubby.com/yolov1-loss-function-walkthrough-regression-for-all-18c34be6d7cb [Accessed March 1, 2026].

[4] Muhammad Ardi. YOLOv2 & YOLO9000 Paper Walkthrough: Higher, Faster, Stronger. Towards Data Science. https://towardsdatascience.com/yolov2-yolo9000-paper-walkthrough-better-faster-stronger/ [Accessed March 1, 2026].

[5] Image originally created by writer.

[6] YOLO v3 introduction to object detection with TensorFlow 2. PyLessons. https://pylessons.com/YOLOv3-TF2-introduction [Accessed August 24, 2025].

[7] aladdinpersson. YOLOv3. GitHub. https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/object_detection/YOLOv3/model.py [Accessed August 24, 2025].

[8] Kaiming He Deep Residual Learning for Image Recognition. Arxiv. https://arxiv.org/abs/1512.03385 [Accessed August 24, 2025].

[9] Langcai Cao A Text Detection Algorithm for Image of Student Exercises Based on CTPN and Enhanced YOLOv3. IEEE Access. https://ieeexplore.ieee.org/document/9200481 [Accessed August 24, 2025].

[10] Image by writer, partially generated by Gemini.

[11] Joseph Redmon You Only Look Once: Unified, Real-Time Object Detection. Arxiv. https://arxiv.org/pdf/1506.02640 [Accessed July 5, 2025].

[12] ML For Nerds. YOLO-V3: An Incremental Improvement || YOLO OBJECT DETECTION SERIES. YouTube. https://www.youtube.com/watch?v=9fhAbvPWzKs&t=174s [Accessed August 24, 2025].

[13] MuhammadArdiPutra. Even Higher, but Not That Much — YOLOv3. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/major/Even%20Better%2C%20but%20Not%20That%20Much%20-%20YOLOv3.ipynb [Accessed August 24, 2025].

ASK ANA

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x