When we attempt to train a very deep neural network, one issue we may encounter is the vanishing gradient problem. This is an issue where the weight updates of a model during training slow down and even stop, causing the model to stop improving. When a network is very deep, the computation during backpropagation involves multiplying many derivative terms together through the chain rule. Keep in mind that if we multiply small numbers (typically lower than 1) repeatedly, the result becomes extremely small. In the case of neural networks, these numbers are used as the basis of the weight update. So, if the gradient is very small, then the weight update will also be very small, causing the training to be slow as well.
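To get a feel for why this happens, here is a toy sketch; the 50-layer depth and the 0.25 derivative magnitude are made-up numbers purely for illustration:

```python
# Toy illustration of the vanishing gradient problem: backpropagation
# multiplies one local derivative per layer through the chain rule, so
# many factors below 1 shrink the gradient toward zero.
grad = 1.0
for _ in range(50):           # pretend we backpropagate through 50 layers
    local_derivative = 0.25   # a typical derivative magnitude below 1
    grad *= local_derivative

print(grad)  # around 8e-31, far too small to produce a meaningful update
```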
To deal with this vanishing gradient problem, we can use shortcut paths so that gradients can flow more easily through a deep network. One of the popular architectures that attempts to solve this is ResNet, which implements skip connections that hop over several layers in the network. This concept is adopted by DenseNet, where the skip connections are implemented far more aggressively, making it better than ResNet at handling the vanishing gradient problem. In this article I would like to talk about how exactly DenseNet works and how to implement the architecture from scratch.
The DenseNet Architecture
Dense Block
DenseNet was originally proposed in a paper titled "Densely Connected Convolutional Networks" written by Gao Huang et al. back in 2016 [1]. The main idea of DenseNet is indeed to solve the vanishing gradient problem. The reason it performs better than ResNet is the shortcut paths branching out from a single layer to all subsequent layers. To better illustrate this concept, you can see in Figure 1 below that the input tensor is forwarded not only to the first layer but also to every subsequent layer in the block. We do the same thing for all layers inside this block, making all tensors connected, hence the name dense block. With all these shortcut connections, information can flow seamlessly between layers. Not only that, this mechanism also enables feature reuse, where each layer can directly benefit from the features produced by all previous layers.
In a standard CNN with n layers, we also have just n connections. Assuming that the above illustration were a conventional 5-layer CNN, we would basically only have the 5 straight arrows coming out of each tensor. In DenseNet, if we have n layers, we will have n(n+1)/2 connections. So in the above case we get 5(5+1)/2 = 15 connections in total. You can confirm this by manually tallying the arrows one by one: 5 red arrows, 4 green arrows, 3 purple arrows, 2 yellow arrows, and 1 brown arrow.
Another key difference between ResNet and DenseNet is how they combine information from different layers. In ResNet, we combine information from two tensors by element-wise summation, i.e., x_l = H_l(x_(l-1)) + x_(l-1), as defined in Figure 2 below. Instead of performing element-wise summation, DenseNet combines information by channel-wise concatenation, i.e., x_l = H_l([x_0, x_1, …, x_(l-1)]), as expressed in Figure 3. With this mechanism, the feature maps produced by all previous layers are concatenated with the output of the current layer before eventually being used as the input of the subsequent layer.
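The two combination styles can be sketched in PyTorch with a pair of dummy tensors; the 64-channel shape here is an arbitrary example:

```python
import torch

x = torch.randn(1, 64, 56, 56)  # output of a previous layer
h = torch.randn(1, 64, 56, 56)  # output of the current layer

# ResNet-style: element-wise summation (shapes must match exactly)
resnet_out = x + h

# DenseNet-style: channel-wise concatenation (dim=1 is the channel axis)
densenet_out = torch.cat((x, h), dim=1)

print(resnet_out.shape)    # torch.Size([1, 64, 56, 56])
print(densenet_out.shape)  # torch.Size([1, 128, 56, 56])
```

Notice that summation preserves the channel count while concatenation adds the two counts together, which is exactly why DenseNet's channel count keeps growing.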


Performing channel-wise concatenation like this actually has a side effect: the number of feature maps grows as we get deeper into the network. In the example I showed you in Figure 1, we initially have an input tensor of 6 channels. The first layer processes this tensor and produces a 4-channel tensor. These two tensors are then concatenated before being forwarded to the second layer. This essentially means that the second layer accepts 10 channels. Following the same pattern, the subsequent layers accept tensors of 14, 18, and 22 channels, respectively. This is an example of a DenseNet that uses a growth rate of 4, meaning that each layer produces 4 new feature maps. Later on, we will use k to denote this parameter, as suggested in the original paper.
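The channel bookkeeping for this 5-layer illustration can be verified with a couple of lines; the input size of 6 and growth rate of 4 come from the figure, not from the actual DenseNet defaults:

```python
in_channels = 6   # channels of the input tensor in Figure 1
growth_rate = 4   # every layer adds 4 new feature maps

# channels accepted by the 2nd, 3rd, 4th, and 5th layers
accepted = [in_channels + i * growth_rate for i in range(1, 5)]
print(accepted)  # [10, 14, 18, 22]
```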
Despite having such complex connections, DenseNet is actually a lot more efficient compared to a standard CNN in terms of the number of parameters. Let's do a little bit of math to prove this. The structure given in Figure 1 consists of 4 conv layers (let's ignore the final layer for now). To compute how many parameters a convolution layer has, we can simply calculate in_channels × kernel_height × kernel_width × out_channels. Assuming that all these convolutions use a 3×3 kernel, the layers in the DenseNet architecture would have the following numbers of parameters:
- first layer → 6×3×3×4 = 216
- second layer → 10×3×3×4 = 360
- third layer → 14×3×3×4 = 504
- fourth layer → 18×3×3×4 = 648
By summing these 4 numbers, we get 1,728 params in total. Note that this number doesn't include the bias term. Now if we try to create the exact same structure with a conventional CNN, we'll require the following number of params for each layer:
- first layer → 6×3×3×10 = 540
- second layer → 10×3×3×14 = 1,260
- third layer → 14×3×3×18 = 2,268
- fourth layer → 18×3×3×22 = 3,564
Summing those up, a conventional CNN hits 7,632 params, which is over 4× higher! With this parameter count in mind, we can clearly see that DenseNet is indeed much more lightweight than traditional CNNs. The reason DenseNet can be so efficient is its feature reuse mechanism: instead of computing all feature maps from scratch, each layer only computes k new feature maps and concatenates them with the existing feature maps from the previous layers.
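The arithmetic above can be double-checked with a small helper (bias terms ignored, as in the text):

```python
# Parameters of a conv layer without bias:
# in_channels x kernel_height x kernel_width x out_channels
def conv_params(in_ch, kernel, out_ch):
    return in_ch * kernel * kernel * out_ch

# DenseNet-style: every layer only produces 4 new feature maps
densenet_total = sum(conv_params(c, 3, 4) for c in (6, 10, 14, 18))

# Conventional CNN producing the same channel counts at every stage
plain_total = sum(conv_params(c, 3, o)
                  for c, o in ((6, 10), (10, 14), (14, 18), (18, 22)))

print(densenet_total)  # 1728
print(plain_total)     # 7632
```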
Transition Layer
The structure I showed you earlier is actually just the main building block of the DenseNet model, which is called the dense block. Figure 4 below shows how these building blocks are assembled, where three of them are connected by the so-called transition layers. Each transition layer consists of a convolution followed by a pooling layer. This component has two main responsibilities: first, to reduce the spatial dimension of the tensor, and second, to reduce the number of channels. The reduction in spatial dimension is standard practice when constructing a CNN-based model, where deeper feature maps typically have a lower dimension than shallower ones. Meanwhile, reducing the number of channels is mandatory because the channel count would otherwise drastically increase as a consequence of the channel-wise concatenation done inside each layer in the dense block.

To understand how the transition layer reduces channels, we need to look at the compression factor. This parameter, which the authors refer to as θ (theta), must have a value somewhere between 0 and 1. Suppose we set θ to 0.2; then the number of channels forwarded to the next dense block will only be 20% of the total number of channels produced by the current dense block.
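As a quick sketch, the channel reduction performed by a transition layer amounts to a single multiplication (the helper function name here is made up for illustration):

```python
# Number of channels a transition layer forwards to the next dense block,
# given a compression factor theta between 0 and 1.
def compressed_channels(channels, theta):
    return int(channels * theta)

print(compressed_channels(100, 0.2))  # 20, i.e., only 20% of the channels
print(compressed_channels(136, 0.5))  # 68, the theta = 0.5 case used later
```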
The Entire DenseNet Architecture
As we have now understood the dense block and the transition layer, we can move on to the complete DenseNet architecture shown in Figure 5 below. It initially accepts an RGB image of size 224×224, which is then processed by a 7×7 conv and a 3×3 maxpooling layer. Keep in mind that these two layers use a stride of 2, causing the spatial dimension to shrink to 112×112 and 56×56, respectively. At this point the tensor is ready to be passed through the first dense block, which consists of 6 bottleneck blocks (I'll talk more about this component very soon). The resulting output will then be forwarded to the first transition layer, followed by the second dense block, and so on until we eventually reach the global average pooling layer. Finally, we pass the tensor to the fully-connected layer, which is responsible for making class predictions.

There are actually several more details I need to explain regarding the architecture above. First, the number of feature maps produced in each step is not explicitly mentioned. This is because the architecture is adaptive according to the growth rate k and compression factor θ. The only layer with a fixed number is the very first convolution layer (the 7×7 one), which produces 64 feature maps (not displayed in the figure). Second, it is also important to note that every convolution layer shown in the architecture follows the BN-ReLU-Conv-Dropout sequence, except for the 7×7 convolution, which doesn't include the dropout layer. Third, the authors implemented several DenseNet variants, which they refer to as DenseNet (the vanilla one), DenseNet-B (the variant that uses bottleneck blocks), DenseNet-C (the one that utilizes compression), and DenseNet-BC (the variant that employs both). The architecture given in Figure 5 is the DenseNet-B (or DenseNet-BC) variant.
The so-called bottleneck block itself is a stack of 1×1 and 3×3 convolutions. The 1×1 conv is used to reduce the number of channels to 4k before it is shrunk further to k by the subsequent 3×3 conv. The reason for this is that a 3×3 convolution is computationally expensive on tensors with many channels. So, to make the computation faster, we need to reduce the channels first using the 1×1 conv. Later in the coding section we're going to implement this DenseNet-BC variant. However, if you want to implement the standard DenseNet (or DenseNet-C) instead, you can simply omit the 1×1 conv so that each dense block only contains 3×3 convolutions.
Some Experimental Results
We can see in the paper that the authors performed a number of experiments comparing DenseNet with other models. In this section I'm going to show you some interesting things they found.

The first experimental result I found interesting is that DenseNet has significantly better performance than ResNet. Figure 6 above shows that it consistently outperforms ResNet across all network depths. When comparing variants with similar accuracy, DenseNet is actually a lot more efficient. Let's take a closer look at the DenseNet-201 variant. Here you can see that its validation error is almost the same as that of ResNet-101. Despite being roughly 2× deeper (201 vs 101 layers), it is roughly 2× smaller in terms of both parameters and FLOPs (floating point operations).

Next, the authors also performed an ablation study regarding the use of the bottleneck layer and the compression factor. We can see in Figure 7 above that utilizing both the bottleneck layer inside the dense block and the channel count reduction in the transition layer allows the model to achieve better accuracy (DenseNet-BC). It might sound a bit counterintuitive that reducing the number of channels through the compression factor actually improves accuracy. In fact, in deep learning, too many features might hurt accuracy due to information redundancy. So, reducing the number of channels can be perceived as a regularization mechanism that prevents the model from overfitting, allowing it to obtain better validation accuracy.
DenseNet From Scratch
As we have now understood the underlying theory behind DenseNet, we can implement the architecture from scratch. What we need to do first is import the required modules and initialize the configurable variables. In Codeblock 1 below, the growth rate k and compression factor θ we discussed earlier are denoted as GROWTH and COMPRESSION, whose values are set to 12 and 0.5, respectively. These two values are the defaults given in the paper, which we can certainly change if we want to. Next, here I also initialize the REPEATS list to store the number of bottleneck blocks inside each dense block.
# Codeblock 1
import torch
import torch.nn as nn
GROWTH = 12
COMPRESSION = 0.5
REPEATS = [6, 12, 24, 16]
Bottleneck Implementation
Now let's take a look at the Bottleneck class below to see how I implement the stack of 1×1 and 3×3 convolutions. I previously mentioned that every convolution layer follows the BN-ReLU-Conv-Dropout structure, so here we need to initialize all these layers in the __init__() method.
The two convolution layers are initialized as conv0 and conv1, each with its corresponding batch normalization layer. Don't forget to set the out_channels parameter of the conv0 layer to GROWTH*4 because we want it to return 4k feature maps (see the line marked with #(1)). This number of feature maps will then be shrunk even further to k by the conv1 layer by setting its out_channels to GROWTH (#(2)). Once all layers have been initialized, we can define the flow in the forward() method. Just keep in mind that at the end of the process we have to concatenate the resulting tensor (out) with the original one (x) to implement the skip connection (#(3)).
# Codeblock 2
class Bottleneck(nn.Module):
    def __init__(self, in_channels):
        super().__init__()

        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)

        self.bn0 = nn.BatchNorm2d(num_features=in_channels)
        self.conv0 = nn.Conv2d(in_channels=in_channels,
                               out_channels=GROWTH*4,    #(1)
                               kernel_size=1,
                               padding=0,
                               bias=False)

        self.bn1 = nn.BatchNorm2d(num_features=GROWTH*4)
        self.conv1 = nn.Conv2d(in_channels=GROWTH*4,
                               out_channels=GROWTH,      #(2)
                               kernel_size=3,
                               padding=1,
                               bias=False)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        out = self.dropout(self.conv0(self.relu(self.bn0(x))))
        print(f'after conv0\t: {out.size()}')
        out = self.dropout(self.conv1(self.relu(self.bn1(out))))
        print(f'after conv1\t: {out.size()}')
        concatenated = torch.cat((out, x), dim=1)    #(3)
        print(f'after concat\t: {concatenated.size()}')
        return concatenated
In order to check whether our Bottleneck class works properly, we'll now create one that accepts 64 feature maps and pass a dummy tensor through it. The bottleneck layer I instantiate below essentially corresponds to the very first bottleneck inside the first dense block (refer back to Figure 5 if you're unsure). So, to simulate the actual flow of the network, we're going to pass a tensor of size 64×56×56, which is exactly the shape produced by the 3×3 maxpooling layer.
# Codeblock 3
bottleneck = Bottleneck(in_channels=64)
x = torch.randn(1, 64, 56, 56)
x = bottleneck(x)
Once the above code is run, we'll get the following output on our screen.
# Codeblock 3 Output
original : torch.Size([1, 64, 56, 56])
after conv0 : torch.Size([1, 48, 56, 56]) #(1)
after conv1 : torch.Size([1, 12, 56, 56]) #(2)
after concat : torch.Size([1, 76, 56, 56])
Here we can see that our conv0 layer brought the feature maps from 64 to 48 (#(1)), where 48 is 4k (remember that our growth rate k is 12). This 48-channel tensor is then processed by the conv1 layer, which reduces the number of feature maps even further to 12 (#(2)). This output tensor is then concatenated with the original one, resulting in a tensor of 64+12 = 76 feature maps. And this is where the pattern starts. Later in the dense block, if we repeat this bottleneck multiple times, each layer will produce:
- second layer → 64+(2×12) = 88 feature maps
- third layer → 64+(3×12) = 100 feature maps
- fourth layer → 64+(4×12) = 112 feature maps
- and so forth …
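The pattern above is easy to verify with a quick sanity check (using the same 64 input channels and growth rate of 12 as in the example):

```python
GROWTH = 12         # growth rate k
base_channels = 64  # channels entering the first bottleneck

# channel count after each of the first four bottlenecks
channels_after = [base_channels + (i + 1) * GROWTH for i in range(4)]
print(channels_after)  # [76, 88, 100, 112]
```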
Dense Block Implementation
Now let's actually create the DenseBlock class to hold the sequence of Bottleneck instances. Take a look at Codeblock 4 below to see how I do that. The way to do it is pretty simple: we can just initialize a module list (#(1)) and then append the bottleneck blocks one by one (#(3)). Note that we need to keep track of the number of input channels of each bottleneck using the current_in_channels variable (#(2)). Lastly, in the forward() method we can simply pass the tensor through them sequentially.
# Codeblock 4
class DenseBlock(nn.Module):
    def __init__(self, in_channels, repeats):
        super().__init__()

        self.bottlenecks = nn.ModuleList()    #(1)
        for i in range(repeats):
            current_in_channels = in_channels + i*GROWTH    #(2)
            self.bottlenecks.append(Bottleneck(in_channels=current_in_channels))    #(3)

    def forward(self, x):
        for i, bottleneck in enumerate(self.bottlenecks):
            x = bottleneck(x)
            print(f'after bottleneck #{i}\t: {x.size()}')
        return x
We can test the code above by simulating the first dense block in the network. You can see in Figure 5 that it contains 6 bottleneck blocks, so in Codeblock 5 below I set the repeats parameter to that number (#(1)). We can see in the resulting output that the input tensor, which initially has the shape 64×56×56, is transformed to 136×56×56. The 136 feature maps come from 64+(6×12), which follows the pattern I gave you earlier.
# Codeblock 5
dense_block = DenseBlock(in_channels=64, repeats=6) #(1)
x = torch.randn(1, 64, 56, 56)
x = dense_block(x)
# Codeblock 5 Output
after bottleneck #0 : torch.Size([1, 76, 56, 56])
after bottleneck #1 : torch.Size([1, 88, 56, 56])
after bottleneck #2 : torch.Size([1, 100, 56, 56])
after bottleneck #3 : torch.Size([1, 112, 56, 56])
after bottleneck #4 : torch.Size([1, 124, 56, 56])
after bottleneck #5 : torch.Size([1, 136, 56, 56])
Transition Layer
The next component we're going to implement is the transition layer, which is shown in Codeblock 6 below. Similar to the convolution layers in the bottleneck blocks, here we also use the BN-ReLU-Conv-Dropout structure, yet this one comes with an additional average pooling layer at the end (#(1)). Don't forget to set the stride of this pooling layer to 2 to reduce the spatial dimension by half.
# Codeblock 6
class Transition(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()

        self.bn = nn.BatchNorm2d(num_features=in_channels)
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(in_channels=in_channels,
                              out_channels=out_channels,
                              kernel_size=1,
                              padding=0,
                              bias=False)
        self.dropout = nn.Dropout(p=0.2)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)    #(1)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        out = self.pool(self.dropout(self.conv(self.relu(self.bn(x)))))
        print(f'after transition: {out.size()}')
        return out
Now let's take a look at the testing code in Codeblock 7 below to see how a tensor transforms as it is passed through the above network. In this example I'm trying to simulate the very first transition layer, i.e., the one right after the first dense block. This is the reason I set this layer to accept 136 channels. I previously mentioned that this layer is used to shrink the channel dimension through the compression factor θ, so to implement it we can simply multiply the number of input feature maps by the COMPRESSION variable for the out_channels parameter.
# Codeblock 7
transition = Transition(in_channels=136, out_channels=int(136*COMPRESSION))
x = torch.randn(1, 136, 56, 56)
x = transition(x)
Once the above code is run, we should obtain the following output. Here you can see that the spatial dimension of the input tensor shrinks from 56×56 to 28×28, while the number of channels reduces from 136 to 68. This essentially indicates that our transition layer implementation is correct.
# Codeblock 7 Output
original : torch.Size([1, 136, 56, 56])
after transition : torch.Size([1, 68, 28, 28])
The Entire DenseNet Architecture
As we have successfully implemented the main components of the DenseNet model, we are now going to construct the complete architecture. Here I separate the __init__() and the forward() methods into two codeblocks as they're pretty long. Just make sure you place Codeblock 8a and 8b inside the same notebook cell if you want to run it on your own.
# Codeblock 8a
class DenseNet(nn.Module):
    def __init__(self):
        super().__init__()

        self.first_conv = nn.Conv2d(in_channels=3,
                                    out_channels=64,
                                    kernel_size=7,    #(1)
                                    stride=2,         #(2)
                                    padding=3,        #(3)
                                    bias=False)
        self.first_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)    #(4)

        channel_count = 64

        # Dense block #0
        self.dense_block_0 = DenseBlock(in_channels=channel_count,
                                        repeats=REPEATS[0])    #(5)
        channel_count = int(channel_count + REPEATS[0]*GROWTH)    #(6)
        self.transition_0 = Transition(in_channels=channel_count,
                                       out_channels=int(channel_count*COMPRESSION))
        channel_count = int(channel_count*COMPRESSION)    #(7)

        # Dense block #1
        self.dense_block_1 = DenseBlock(in_channels=channel_count,
                                        repeats=REPEATS[1])
        channel_count = int(channel_count + REPEATS[1]*GROWTH)
        self.transition_1 = Transition(in_channels=channel_count,
                                       out_channels=int(channel_count*COMPRESSION))
        channel_count = int(channel_count*COMPRESSION)

        # Dense block #2
        self.dense_block_2 = DenseBlock(in_channels=channel_count,
                                        repeats=REPEATS[2])
        channel_count = int(channel_count + REPEATS[2]*GROWTH)
        self.transition_2 = Transition(in_channels=channel_count,
                                       out_channels=int(channel_count*COMPRESSION))
        channel_count = int(channel_count*COMPRESSION)

        # Dense block #3
        self.dense_block_3 = DenseBlock(in_channels=channel_count,
                                        repeats=REPEATS[3])
        channel_count = int(channel_count + REPEATS[3]*GROWTH)

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))    #(8)
        self.fc = nn.Linear(in_features=channel_count, out_features=1000)    #(9)
What we do first in the __init__() method above is initialize the first_conv and the first_pool layers. Remember that these two layers belong to neither a dense block nor a transition layer, so we need to initialize them manually as nn.Conv2d and nn.MaxPool2d instances. In fact, these two initial layers are quite unique. The convolution layer uses a very large kernel of size 7×7 (#(1)) with a stride of 2 (#(2)). So, not only does this layer capture information from a large area, it also performs spatial downsampling at the same time. Here we also need to set the padding to 3 (#(3)) to compensate for the large kernel so that the spatial dimension doesn't get reduced too much. Next, the pooling layer is different from the ones in the transition layers, since here we use 3×3 max pooling rather than 2×2 average pooling (#(4)).
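The spatial arithmetic of these initial layers can be confirmed with the standard convolution output-size formula:

```python
# Output spatial size of a conv/pool layer:
# floor((in_size + 2*padding - kernel) / stride) + 1
def out_size(in_size, kernel, stride, padding):
    return (in_size + 2 * padding - kernel) // stride + 1

print(out_size(224, kernel=7, stride=2, padding=3))  # 112 (7x7 conv)
print(out_size(112, kernel=3, stride=2, padding=1))  # 56  (3x3 maxpool)
print(out_size(56,  kernel=2, stride=2, padding=0))  # 28  (transition avgpool)
```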
With the first two layers done, what we do next is initialize the dense blocks and the transition layers. The idea is pretty straightforward: we need to initialize dense blocks consisting of several bottleneck blocks, where the number of bottlenecks is passed through the repeats parameter (#(5)). Remember to keep track of the channel count at each step (#(6,7)) so that we can match the input shape of the subsequent layer with the output shape of the previous one. We then do the exact same thing for the remaining dense blocks and transition layers.
Once we have reached the last dense block, we initialize the global average pooling layer (#(8)), which is responsible for taking the average value across the spatial dimensions, before eventually initializing the classification head (#(9)). Finally, as all layers have been initialized, we can now connect them all inside the forward() method below.
# Codeblock 8b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        x = self.first_conv(x)
        print(f'after first_conv\t: {x.size()}')
        x = self.first_pool(x)
        print(f'after first_pool\t: {x.size()}')

        x = self.dense_block_0(x)
        print(f'after dense_block_0\t: {x.size()}')
        x = self.transition_0(x)
        print(f'after transition_0\t: {x.size()}')

        x = self.dense_block_1(x)
        print(f'after dense_block_1\t: {x.size()}')
        x = self.transition_1(x)
        print(f'after transition_1\t: {x.size()}')

        x = self.dense_block_2(x)
        print(f'after dense_block_2\t: {x.size()}')
        x = self.transition_2(x)
        print(f'after transition_2\t: {x.size()}')

        x = self.dense_block_3(x)
        print(f'after dense_block_3\t: {x.size()}')

        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')
        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')
        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')

        return x
That's basically the entire implementation of the DenseNet architecture. We can test whether it works properly by running Codeblock 9 below. Here we pass the x tensor through the network, which simulates a batch containing a single 224×224 RGB image.
# Codeblock 9
densenet = DenseNet()
x = torch.randn(1, 3, 224, 224)
x = densenet(x)
And below is what the output looks like. Here I intentionally print out the tensor shape after each step so that you can clearly see how the tensor transforms throughout the entire network. Despite having so many layers, this is actually the smallest DenseNet variant, i.e., DenseNet-121. You can make the model even larger by changing the values in the REPEATS list according to the number of bottleneck blocks inside each dense block given in Figure 5.
# Codeblock 9 Output
original : torch.Size([1, 3, 224, 224])
after first_conv : torch.Size([1, 64, 112, 112])
after first_pool : torch.Size([1, 64, 56, 56])
after bottleneck #0 : torch.Size([1, 76, 56, 56])
after bottleneck #1 : torch.Size([1, 88, 56, 56])
after bottleneck #2 : torch.Size([1, 100, 56, 56])
after bottleneck #3 : torch.Size([1, 112, 56, 56])
after bottleneck #4 : torch.Size([1, 124, 56, 56])
after bottleneck #5 : torch.Size([1, 136, 56, 56])
after dense_block_0 : torch.Size([1, 136, 56, 56])
after transition_0 : torch.Size([1, 68, 28, 28])
after bottleneck #0 : torch.Size([1, 80, 28, 28])
after bottleneck #1 : torch.Size([1, 92, 28, 28])
after bottleneck #2 : torch.Size([1, 104, 28, 28])
after bottleneck #3 : torch.Size([1, 116, 28, 28])
after bottleneck #4 : torch.Size([1, 128, 28, 28])
after bottleneck #5 : torch.Size([1, 140, 28, 28])
after bottleneck #6 : torch.Size([1, 152, 28, 28])
after bottleneck #7 : torch.Size([1, 164, 28, 28])
after bottleneck #8 : torch.Size([1, 176, 28, 28])
after bottleneck #9 : torch.Size([1, 188, 28, 28])
after bottleneck #10 : torch.Size([1, 200, 28, 28])
after bottleneck #11 : torch.Size([1, 212, 28, 28])
after dense_block_1 : torch.Size([1, 212, 28, 28])
after transition_1 : torch.Size([1, 106, 14, 14])
after bottleneck #0 : torch.Size([1, 118, 14, 14])
after bottleneck #1 : torch.Size([1, 130, 14, 14])
after bottleneck #2 : torch.Size([1, 142, 14, 14])
after bottleneck #3 : torch.Size([1, 154, 14, 14])
after bottleneck #4 : torch.Size([1, 166, 14, 14])
after bottleneck #5 : torch.Size([1, 178, 14, 14])
after bottleneck #6 : torch.Size([1, 190, 14, 14])
after bottleneck #7 : torch.Size([1, 202, 14, 14])
after bottleneck #8 : torch.Size([1, 214, 14, 14])
after bottleneck #9 : torch.Size([1, 226, 14, 14])
after bottleneck #10 : torch.Size([1, 238, 14, 14])
after bottleneck #11 : torch.Size([1, 250, 14, 14])
after bottleneck #12 : torch.Size([1, 262, 14, 14])
after bottleneck #13 : torch.Size([1, 274, 14, 14])
after bottleneck #14 : torch.Size([1, 286, 14, 14])
after bottleneck #15 : torch.Size([1, 298, 14, 14])
after bottleneck #16 : torch.Size([1, 310, 14, 14])
after bottleneck #17 : torch.Size([1, 322, 14, 14])
after bottleneck #18 : torch.Size([1, 334, 14, 14])
after bottleneck #19 : torch.Size([1, 346, 14, 14])
after bottleneck #20 : torch.Size([1, 358, 14, 14])
after bottleneck #21 : torch.Size([1, 370, 14, 14])
after bottleneck #22 : torch.Size([1, 382, 14, 14])
after bottleneck #23 : torch.Size([1, 394, 14, 14])
after dense_block_2 : torch.Size([1, 394, 14, 14])
after transition_2 : torch.Size([1, 197, 7, 7])
after bottleneck #0 : torch.Size([1, 209, 7, 7])
after bottleneck #1 : torch.Size([1, 221, 7, 7])
after bottleneck #2 : torch.Size([1, 233, 7, 7])
after bottleneck #3 : torch.Size([1, 245, 7, 7])
after bottleneck #4 : torch.Size([1, 257, 7, 7])
after bottleneck #5 : torch.Size([1, 269, 7, 7])
after bottleneck #6 : torch.Size([1, 281, 7, 7])
after bottleneck #7 : torch.Size([1, 293, 7, 7])
after bottleneck #8 : torch.Size([1, 305, 7, 7])
after bottleneck #9 : torch.Size([1, 317, 7, 7])
after bottleneck #10 : torch.Size([1, 329, 7, 7])
after bottleneck #11 : torch.Size([1, 341, 7, 7])
after bottleneck #12 : torch.Size([1, 353, 7, 7])
after bottleneck #13 : torch.Size([1, 365, 7, 7])
after bottleneck #14 : torch.Size([1, 377, 7, 7])
after bottleneck #15 : torch.Size([1, 389, 7, 7])
after dense_block_3 : torch.Size([1, 389, 7, 7])
after avgpool : torch.Size([1, 389, 1, 1])
after flatten : torch.Size([1, 389])
after fc : torch.Size([1, 1000])
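As a reference for scaling the model up, the repeat counts used by the deeper ImageNet variants in the original paper can be sketched as follows. Each bottleneck contributes 2 conv layers, and adding the first 7×7 conv, the 3 transition convs, and the final fully-connected layer recovers the depth in each variant's name:

```python
# Bottleneck repeats per dense block for the deeper ImageNet variants
# reported in the DenseNet paper.
REPEATS_PER_VARIANT = {
    121: [6, 12, 24, 16],
    169: [6, 12, 32, 32],
    201: [6, 12, 48, 32],
}

for depth, repeats in REPEATS_PER_VARIANT.items():
    # 2 convs per bottleneck + 7x7 conv + 3 transitions + FC layer
    assert 2 * sum(repeats) + 5 == depth
```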
Ending
I think that's pretty much everything about the theory and the implementation of the DenseNet model. You can also find all the code above in my GitHub repo [2]. See ya in my next article!
References
[1] Gao Huang Densely Connected Convolutional Networks. Arxiv. https://arxiv.org/abs/1608.06993 [Accessed September 18, 2025].
[2] MuhammadArdiPutra. DenseNet. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/DenseNet.ipynb [Accessed September 18, 2025].
