Welcome back to the Tiny Giant series, where I share what I have learned about the MobileNet architectures. In the previous two articles I covered MobileNetV1 and MobileNetV2; take a look at references [1] and [2] if you are interested in reading them. In today's article I would like to continue with the next version of the model: MobileNetV3.
MobileNetV3 was first proposed in the paper titled "Searching for MobileNetV3," written by Howard et al. in 2019 [3]. Just a quick review: the main idea of the first MobileNet version was replacing full convolutions with depthwise separable convolutions, which reduced the number of parameters by nearly 90% compared to its standard CNN counterpart. In the second MobileNet version, the authors introduced the inverted residual and linear bottleneck mechanisms, which they integrated into the original MobileNetV1 building blocks. Now, in the third MobileNet version, the authors attempted to push the performance of the network even further by incorporating Squeeze-and-Excitation (SE) modules and the hard-swish activation function into the building blocks. Moreover, the overall structure of MobileNetV3 itself is partially designed using NAS (Neural Architecture Search), which essentially works somewhat like hyperparameter tuning operating at the architectural level, maximizing accuracy while minimizing latency. Note, however, that in this article I won't go into how NAS works in detail. Instead, I will focus on the final design of MobileNetV3 proposed in the paper.
The Detailed MobileNetV3 Architecture
The authors propose two variants of this model, which they refer to as MobileNetV3-Large and MobileNetV3-Small. You can see the details of the two architectures in Figure 1 below.
Taking a closer look at the architecture, we can see that the two networks mainly consist of bottleneck (bneck) blocks. The configuration of each block is described in the exp size, #out, SE, NL, and s columns. The internal structure of these blocks, as well as the corresponding parameter configurations, will be discussed further in the following subsection.
The Bottleneck
MobileNetV3 uses a modified version of the building blocks used in MobileNetV2. As I mentioned earlier, what makes the two different is the presence of the SE module and the use of hard activation functions. You can see the two building blocks in Figure 2, with MobileNetV2 at the top and MobileNetV3 at the bottom.

Notice that the first two convolution layers in both building blocks are essentially the same: a pointwise convolution followed by a depthwise convolution. The former is used to expand the number of channels to the expansion size (exp size), whereas the latter is responsible for processing each channel of the resulting tensor independently. The only difference between the two building blocks here lies in the activation functions used, which the paper refers to as nonlinearities (NL). In MobileNetV2, the activation functions placed after these two convolution layers are fixed to ReLU6, whereas in MobileNetV3 each can be either ReLU6 or hard-swish. The RE and HS labels you saw earlier in Figure 1 refer to these two types of activations.
Next, in MobileNetV3 we place the SE module after the depthwise convolution layer. If you are not yet familiar with the SE module, it is essentially a building block that can be attached to any CNN-based model. This component assigns weights to the different channels, allowing the model to pay more attention to the important channels only. I have a separate article discussing the SE module in detail; click the link at reference number [4] if you want to read it. It is important to note that the SE module used here is slightly different, in that the last FC layer uses hard-sigmoid rather than the standard sigmoid activation function. (I will talk more about the hard activations used in MobileNetV3 in the following subsection.) In fact, the SE module itself is not always included in every bottleneck block. If you go back to Figure 1, you will notice that some of the bottleneck blocks have a checkmark in the SE column, indicating that the SE module is applied. Other blocks do not include the module, probably because the NAS process did not find any performance improvement from using SE modules in those blocks.
Once the SE module has been attached, we need to place another pointwise convolution, which is responsible for adjusting the number of output channels according to the #out column in Figure 1. This pointwise convolution does not include any activation function, aligning with the linear bottleneck design originally introduced in MobileNetV2. I do want to clarify something here. If you take a look at the MobileNetV2 building block in Figure 2 above, you will notice that the last pointwise convolution has a ReLU6 placed after it. I believe this is a mistake made by the authors, because according to the MobileNetV2 paper [6], the ReLU6 should instead be attached to the first pointwise convolution at the beginning of the block.
Last but not least, notice that there is also a residual connection that skips across all layers in the bottleneck block. This connection is only present when the output tensor has exactly the same dimensions as the input, i.e., when the number of input and output channels is the same and the s (stride) is 1.
Hard-Sigmoid and Hard-Swish
The activation functions used in MobileNetV3 are not commonly found in other deep learning models. To start with, let's look at the hard-sigmoid activation, which is the one used in the SE module as a substitute for the standard sigmoid. Take a look at Figure 3 below to see the difference between the two.

Here you might be wondering: why don't we just use the standard sigmoid? Why do we need a piecewise linear function that looks less smooth instead? To answer this question, we first need to understand the mathematical definition of the sigmoid function, which I provide in Figure 4 below.

We can clearly see in the above figure that the sigmoid function involves an exponential term in the denominator. This term makes the function computationally expensive, which in turn makes the activation less suitable for low-power devices. Not only that, the output of the sigmoid function itself is a high-precision floating-point value, which is also not preferable for low-power devices due to their limited support for handling such values.
If you look at Figure 3 again, you might think that the hard-sigmoid function is directly derived from the original sigmoid. That is actually not quite right. Despite having a similar shape, hard-sigmoid is constructed from ReLU6 instead, which can formally be expressed as in Figure 5 below. Here you can see that the equation is much simpler, since it only consists of basic arithmetic operations and clipping, allowing it to be computed much faster.

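If you prefer code to equations, below is a minimal PyTorch sketch of the two functions (my own illustration, not code from the paper). The hard-sigmoid follows the ReLU6-based formulation just described, i.e., ReLU6(x + 3) / 6, which is also what nn.Hardsigmoid implements.
import torch
import torch.nn.functional as F

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))   # original sigmoid: requires an exponential

def hard_sigmoid(x):
    return F.relu6(x + 3) / 6        # hard-sigmoid: only add, clip, and divide

x = torch.linspace(-5, 5, 5)         # tensor([-5.0, -2.5, 0.0, 2.5, 5.0])
print(sigmoid(x))                    # tensor([0.0067, 0.0759, 0.5000, 0.9241, 0.9933])
print(hard_sigmoid(x))               # tensor([0.0000, 0.0833, 0.5000, 0.9167, 1.0000])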
The next activation function we are going to use in MobileNetV3 is the so-called hard-swish, which, when used, is placed after each of the first two convolution layers in the bottleneck block. Just like sigmoid and hard-sigmoid, the graph of the hard-swish function appears very similar to that of the original swish.

The original swish function itself can mathematically be expressed by the equation in Figure 7. Again, since the equation involves a sigmoid, it will certainly slow down the computation. Hence, to speed up the process, we can simply replace the sigmoid with the hard-sigmoid we just discussed. By doing so, we obtain the hard version of the swish activation, as shown in Figure 8.


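Continuing the sketch above (again my own illustration, not code from the paper), swish and hard-swish can be written as follows; the hard version simply swaps the sigmoid for the hard-sigmoid we just defined, matching what nn.Hardswish computes.
def swish(x):
    return x * torch.sigmoid(x)      # original swish: x multiplied by sigmoid(x)

def hard_swish(x):
    return x * F.relu6(x + 3) / 6    # swish with sigmoid replaced by hard-sigmoid

x = torch.linspace(-5, 5, 5)
print(swish(x))                      # tensor([-0.0335, -0.1897,  0.0000,  2.3103,  4.9665])
print(hard_swish(x))                 # tensor([-0.0000, -0.2083,  0.0000,  2.2917,  5.0000])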
Some Experimental Results
Before we get into the experimental results, you need to know that there are two parameters in MobileNetV3 that allow us to adjust the model size according to our needs. These two parameters are the width multiplier and the input resolution, which in MobileNetV1 are known as α (alpha) and ρ (rho), respectively. Although we can technically set the two values freely, the authors already provide several numbers we can use. The width multiplier can be set to 0.35, 0.5, 0.75, 1.0, or 1.25, where a value smaller than 1.0 causes the model to have fewer channels than those listed in Figure 1, effectively reducing the model size. For instance, if we set this parameter to 0.35, the model will only have 35% of its default width (i.e., channel count) throughout the entire network.
Meanwhile, the input resolution can be 96, 128, 160, 192, 224, or 256, which, as the name suggests, directly controls the spatial dimension of the input image. It is worth noting that although a small input size reduces the number of operations during inference, it does not affect the model size at all. So, if your objective is to reduce model size, you need to adjust the width multiplier, whereas if your goal is to lower computational cost, you can play around with both the width multiplier and the input resolution.
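As a rough back-of-the-envelope sketch of why this is the case (my own illustration, not taken from the paper): the width multiplier scales every channel count, so a convolution's parameter count shrinks roughly quadratically with it, whereas the input resolution only changes the spatial size of the feature maps and therefore the number of multiply-adds.
# Illustrative numbers only: how the two knobs scale things.
width_multiplier = 0.5
input_resolution = 160

default_channels = 160                                        # e.g. the last bottleneck stage
scaled_channels  = int(width_multiplier * default_channels)   # 80 channels instead of 160

# Parameters of a conv layer scale with in_channels * out_channels,
# i.e. roughly with width_multiplier ** 2, independent of the input size.
param_scale = width_multiplier ** 2                           # 0.25

# Multiply-adds additionally scale with the spatial area of the feature maps.
madds_scale = param_scale * (input_resolution / 224) ** 2     # ~0.13

print(scaled_channels, param_scale, round(madds_scale, 2))    # 80 0.25 0.13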
Now, looking at the experimental results in Figure 9, we can clearly see that MobileNetV3 outperforms MobileNetV2 in terms of accuracy at similar latency. The MobileNetV3-Small with the default configuration (i.e., width multiplier 1.0 and 224×224 input) does have a lower accuracy than the largest MobileNetV2 variant. But if you take the default MobileNetV3-Large into account, it easily wins over the largest MobileNetV2 in terms of both accuracy and latency. Moreover, we can push the accuracy of MobileNetV3 even further by enlarging the model with a width multiplier of 1.25 (the blue data point at the top right), but keep in mind that doing so significantly sacrifices computational speed.

The authors also conducted a comparative evaluation against other lightweight models, the results of which are shown in the table in Figure 10.

The rows of the table are divided into two groups, where the upper group compares models with complexity similar to MobileNetV3-Large, while the lower group consists of models comparable to MobileNetV3-Small. Here you can see that both V3-Large and V3-Small obtained the best ImageNet accuracy within their respective groups. It is worth noting that although MnasNet-A1 and V3-Large have exactly the same accuracy, the number of multiply-adds (MAdds) of the former is higher, which results in higher latency, as seen in the P-1, P-2, and P-3 columns (measured in milliseconds). In case you are wondering, these labels correspond to the different Google Pixel phones used to measure the actual computational speed. Next, it is necessary to acknowledge that both MobileNetV3 variants have the highest parameter count (the Params column) compared to the other models in their group. However, this does not seem to be a major concern for the authors, since the primary goal of MobileNetV3 is to minimize computational latency, even if that means a slightly larger model.
The next experiment the authors conducted concerned the effects of value quantization, i.e., a technique that reduces the precision of floating-point numbers to speed up computation. While the networks already incorporate hard activation functions, which are friendly to quantized values, this experiment takes quantization a step further by applying it to the entire network to see how much the speed improves. The experimental results with quantization applied are shown in Figure 11 below.

If you compare the results of V2 and V3 in Figure 11 with the corresponding models in Figure 10, you will notice that there is a decrease in latency, proving that using low-precision numbers does improve computational speed. However, it is important to keep in mind that this also leads to a decrease in accuracy.
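Just to give a feel for what reducing precision means at the level of individual values, here is a tiny toy example (my own illustration of 8-bit quantization, not the full integer-quantization pipeline used in the paper):
import torch

# Map float values to 8-bit integers with a fixed scale, then back to floats.
x = torch.tensor([0.1234567, -1.8765432])
q = torch.quantize_per_tensor(x, scale=0.02, zero_point=0, dtype=torch.qint8)
print(q.int_repr())    # tensor([  6, -94], dtype=torch.int8)
print(q.dequantize())  # tensor([ 0.1200, -1.8800]) -- some precision lost, much cheaper to store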
MobileNetV3 Implementation
I think the explanations above cover pretty much everything you need to know about the theory behind MobileNetV3. Now, in this section, I am going to bring you into the most fun part of this article: implementing MobileNetV3 from scratch.
As always, the very first thing we do is import the required modules.
# Codeblock 1
import torch
import torch.nn as nn
Afterwards, we need to initialize the configurable parameters of the model, namely WIDTH_MULTIPLIER, INPUT_RESOLUTION, and NUM_CLASSES, as shown in Codeblock 2 below. I believe the first two variables are straightforward, as I explained them thoroughly in the previous section. Here I decided to assign default values to the two; you can definitely change these numbers based on the values provided in the paper if you want to adjust the complexity of the model. The third variable corresponds to the number of output neurons in the classification head. Here I set it to 1000 because the model was originally trained on the ImageNet-1K dataset. It is worth noting that the MobileNetV3 architecture is not limited to classification tasks only; it can also be used for object detection and semantic segmentation, as demonstrated in the paper. However, since the focus of this article is on implementing the backbone, let's just use the standard classification head for the output layer to keep things simple.
# Codeblock 2
WIDTH_MULTIPLIER = 1.0
INPUT_RESOLUTION = 224
NUM_CLASSES      = 1000
What we are going to do next is wrap the repeating components into separate classes. By doing this, we will later be able to simply instantiate them whenever needed instead of rewriting the same code over and over again. Let's begin with the Squeeze-and-Excitation module first.
The Squeeze-and-Excitation Module
The implementation of this component is shown in Codeblock 3. I am not going to go very deep into the code, since it is almost exactly the same as the one in my previous article [4]. Generally speaking, this code works by representing each input channel with a single number (line #(1)), processing the resulting vector with a sequence of linear layers (#(2–3)), then converting it into a weight vector (#(4)). Remember that in the original SE module we typically use the standard sigmoid activation function to obtain the weight vector, but here in MobileNetV3 we use hard-sigmoid instead. This weight vector is then multiplied with the original tensor, which reduces the influence of channels that do not contribute to the final output (#(5)).
# Codeblock 3
class SEModule(nn.Module):
    def __init__(self, num_channels, r):
        super().__init__()
        
        self.global_pooling = nn.AdaptiveAvgPool2d(output_size=(1,1))
        self.fc0 = nn.Linear(in_features=num_channels,
                             out_features=num_channels//r, 
                             bias=False)
        self.relu6 = nn.ReLU6()
        self.fc1 = nn.Linear(in_features=num_channels//r,
                             out_features=num_channels, 
                             bias=False)
        self.hardsigmoid = nn.Hardsigmoid()
    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        
        squeezed = self.global_pooling(x)              #(1)
        print(f'after avgpool\t\t: {squeezed.size()}')
        
        squeezed = torch.flatten(squeezed, 1)
        print(f'after flatten\t\t: {squeezed.size()}')
        
        excited = self.fc0(squeezed)                   #(2)
        print(f'after fc0\t\t: {excited.size()}')
        
        excited = self.relu6(excited)
        print(f'after relu6\t\t: {excited.size()}')
        
        excited = self.fc1(excited)                    #(3)
        print(f'after fc1\t\t: {excited.size()}')
        
        excited = self.hardsigmoid(excited)            #(4)
        print(f'after hardsigmoid\t: {excited.size()}')
        
        excited = excited[:, :, None, None]
        print(f'after reshape\t\t: {excited.size()}')
        
        scaled = x * excited                           #(5)
        print(f'after scaling\t\t: {scaled.size()}')
        
        return scaled
Now let's check whether the above code works properly by creating an SEModule instance and passing a dummy tensor through it. See Codeblock 4 below for the details. Here I configure the SE module to accept a 512-channel input. Meanwhile, the r (reduction ratio) parameter is set to 4, meaning that the vector length between the two FC layers is going to be 4 times smaller than that of its input and output. It is worth knowing that this number is different from the one mentioned in the original Squeeze-and-Excitation paper [7], where r = 16 is said to be the sweet spot for balancing accuracy and complexity.
# Codeblock 4
semodule = SEModule(num_channels=512, r=4)
x = torch.randn(1, 512, 28, 28)
out = semodule(x)
If the code above produces the following output, it indicates that our SE module implementation works as expected, since the input tensor successfully passed through all layers of the module and came out with its original shape.
# Codeblock 4 Output
original          : torch.Size([1, 512, 28, 28])
after avgpool     : torch.Size([1, 512, 1, 1])
after flatten     : torch.Size([1, 512])
after fc0         : torch.Size([1, 128])
after relu6       : torch.Size([1, 128])
after fc1         : torch.Size([1, 512])
after hardsigmoid : torch.Size([1, 512])
after reshape     : torch.Size([1, 512, 1, 1])
after scaling     : torch.Size([1, 512, 28, 28])
The Convolution Block
The next component I am going to create is the one wrapped in the ConvBlock class, whose detailed implementation can be seen in Codeblock 5. This is actually just a standard convolution layer, but we don't simply use nn.Conv2d because in CNNs we typically use the Conv-BN-ReLU structure. Hence, it is convenient to group these three layers together inside a single class. However, instead of strictly following this standard structure, we are going to customize it to match the requirements of the MobileNetV3 architecture.
# Codeblock 5
class ConvBlock(nn.Module):
    def __init__(self, 
                 in_channels,             #(1)
                 out_channels,            #(2)
                 kernel_size,             #(3)
                 stride,                  #(4)
                 padding,                 #(5)
                 groups=1,                #(6)
                 batchnorm=True,          #(7)
                 activation=nn.ReLU6()):  #(8)
        super().__init__()
        
        bias = False if batchnorm else True    #(9)
        
        self.conv = nn.Conv2d(in_channels=in_channels, 
                              out_channels=out_channels,
                              kernel_size=kernel_size, 
                              stride=stride, 
                              padding=padding, 
                              groups=groups,
                              bias=bias)
        self.bn = nn.BatchNorm2d(num_features=out_channels) if batchnorm else nn.Identity()  #(10)
        self.activation = activation
    
    def forward(self, x):    #(11)
        print(f'original\t\t: {x.size()}')
        
        x = self.conv(x)
        print(f'after conv\t\t: {x.size()}')
        
        x = self.bn(x)
        print(f'after bn\t\t: {x.size()}')
        
        x = self.activation(x)
        print(f'after activation\t: {x.size()}')
        
        return x
There are several parameters you need to pass in order to instantiate a ConvBlock. The first five (#(1–5)) are pretty straightforward, as they are basically just the standard parameters of the nn.Conv2d layer. I make the groups parameter configurable (#(6)) so that this class can be used flexibly not only for standard convolutions but also for depthwise convolutions. Next, at line #(7) I create a parameter called batchnorm, which determines whether or not a ConvBlock instance includes a batch normalization layer. This is done because there are some cases where we don't use this layer, namely the last two convolutions with the NBN label (which stands for no batch normalization) in Figure 1. The last parameter we have here is the activation function (#(8)). Later on, there will be cases that require us to set it to nn.ReLU6(), nn.Hardswish(), or nn.Identity() (no activation).
Inside the __init__() method, two things happen depending on the argument passed to the batchnorm parameter. When we set it to True, first, the bias term of the convolution layer is deactivated (#(9)), and second, bn becomes an nn.BatchNorm2d() layer (#(10)). The bias term is not used in this case because applying batch normalization right after the convolution would cancel it out anyway, so there is no point in using a bias in the first place. Meanwhile, if we set the batchnorm parameter to False, the bias variable is going to be True, since in this case it will not be canceled out. The bn itself will just be an identity layer, meaning that it won't do anything to the tensor.
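If you want to convince yourself that the bias really is redundant in front of batch normalization, here is a quick numerical check (my own sketch, not part of the model): the constant per-channel bias is removed by the mean subtraction inside BatchNorm, so the outputs with and without bias match.
conv_with_bias = nn.Conv2d(3, 8, kernel_size=3, bias=True)
conv_no_bias   = nn.Conv2d(3, 8, kernel_size=3, bias=False)
conv_no_bias.weight.data = conv_with_bias.weight.data.clone()   # share the same weights

bn = nn.BatchNorm2d(8)
x  = torch.randn(4, 3, 16, 16)

# In training mode, BatchNorm subtracts the per-channel batch mean,
# which cancels the constant bias added by the convolution.
print(torch.allclose(bn(conv_with_bias(x)), bn(conv_no_bias(x)), atol=1e-5))   # True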
Regarding the forward() method (#(11)), I don't think I need to explain anything, because all we do here is pass a tensor through the layers sequentially. Now let's move on to Codeblock 6 to see whether our ConvBlock implementation is correct. Here I create two ConvBlock instances, where the first one uses the default batchnorm and activation, whereas the second omits the batch normalization layer (#(1)) and uses the hard-swish activation function (#(2)). Instead of passing a tensor through them, this time I simply print out the two instances so you can see from the resulting output that our code correctly builds both structures according to the input arguments we pass.
# Codeblock 6
convblock1 = ConvBlock(in_channels=64, 
                       out_channels=128, 
                       kernel_size=3, 
                       stride=2, 
                       padding=1)
convblock2 = ConvBlock(in_channels=64, 
                       out_channels=128, 
                       kernel_size=3, 
                       stride=2, 
                       padding=1, 
                       batchnorm=False,             #(1)
                       activation=nn.Hardswish())   #(2)
print(convblock1)
print('')
print(convblock2)
# Codeblock 6 Output
ConvBlock(
  (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation): ReLU6()
)
ConvBlock(
  (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (bn): Identity()
  (activation): Hardswish()
)
The Bottleneck
Now that the SEModule and the ConvBlock are done, we can move on to the main component of the MobileNetV3 architecture: the bottleneck. What we essentially do in the bottleneck is simply place one layer after another, following the structure shown earlier in Figure 2. In the case of MobileNetV2, it only consists of three convolution layers, whereas here in MobileNetV3 we have an additional SE block placed between the second and the third convolutions. Take a look at Codeblocks 7a and 7b to see how I implement the bottleneck block for MobileNetV3.
# Codeblock 7a
class Bottleneck(nn.Module):
    def __init__(self, 
                 in_channels, 
                 out_channels, 
                 kernel_size, 
                 stride,
                 padding,
                 exp_size,     #(1)
                 se,           #(2)
                 activation):
        super().__init__()
        self.add = in_channels == out_channels and stride == 1    #(3)
        self.conv0 = ConvBlock(in_channels=in_channels,    #(4)
                               out_channels=exp_size,    #(5)
                               kernel_size=1,    #(6)
                               stride=1, 
                               padding=0,
                               activation=activation)
                               
        self.conv1 = ConvBlock(in_channels=exp_size,    #(7)
                               out_channels=exp_size,    #(8)
                               kernel_size=kernel_size,    #(9)
                               stride=stride, 
                               padding=padding,
                               groups=exp_size,    #(10)
                               activation=activation)
        self.semodule = SEModule(num_channels=exp_size, r=4) if se else nn.Identity()    #(11)
        self.conv2 = ConvBlock(in_channels=exp_size,    #(12)
                               out_channels=out_channels,    #(13)
                               kernel_size=1,    #(14)
                               stride=1, 
                               padding=0, 
                               activation=nn.Identity())    #(15)
At a glance, the input parameters of the Bottleneck class look similar to those of the ConvBlock class. This makes sense, because we will indeed use them to instantiate ConvBlock instances inside the Bottleneck. However, if you take a closer look, you will notice a couple of parameters you haven't seen before, namely exp_size (#(1)) and se (#(2)). Later on, the input arguments for these parameters will be taken from the configuration provided in the table in Figure 1.
Inside the __init__() method, the first thing we need to do is check whether the input and output tensor dimensions are the same, using the code at line #(3). By doing this, our add variable will contain either True or False. This dimensionality check is important because we need to decide whether or not to perform the element-wise summation between the two, which implements the skip connection that bypasses all layers within the bottleneck block.
Next, let's instantiate the layers themselves, of which the first two are a pointwise convolution (conv0) and a depthwise convolution (conv1). For conv0, we need to set the kernel size to 1×1 (#(6)), whereas for conv1 the kernel size should match the one in the input argument (#(9)), which can be either 3×3 or 5×5. It is necessary to apply padding in the ConvBlock to prevent the feature map from shrinking after every convolution operation. For kernel sizes of 1×1, 3×3, and 5×5, the required padding values are 0, 1, and 2, respectively (see the short sketch below). Regarding the number of channels, conv0 is responsible for expanding it from in_channels to exp_size (#(4–5)), while the number of input and output channels of conv1 are exactly the same (#(7–8)). Additionally, for the conv1 layer, the groups parameter should be set to exp_size (#(10)), because we want each input channel to be processed independently of the others.
After the first two convolution layers are done, what we need to instantiate next is the Squeeze-and-Excitation module (#(11)). Here we need to set the input channel count to exp_size, matching the tensor size produced by the conv1 layer. Keep in mind that the SE module is not always used; hence, this component is instantiated inside a conditional expression, where it is actually created only when the se parameter is True. Otherwise, it will just be an identity layer.
Finally, the last convolution layer (conv2) is responsible for mapping the number of channels from exp_size to out_channels (#(12–13)). Just like the conv0 layer, this one is also a pointwise convolution, hence we set the kernel size to 1×1 (#(14)) so that it only aggregates information along the channel dimension. The activation function of this layer is fixed to nn.Identity() (#(15)), because here we implement the idea of the linear bottleneck.
And that's pretty much everything for the layers within the bottleneck block. All we need to do afterwards is define the flow of the network in the forward() method, as shown in Codeblock 7b below.
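As a side note, the padding values mentioned above simply follow the usual "same padding" rule for odd kernel sizes, so they could also be derived on the fly instead of being hard-coded (a tiny sketch of my own):
# For an odd kernel size k with stride 1, padding k // 2 keeps the spatial size unchanged.
for kernel_size in (1, 3, 5):
    print(kernel_size, '->', kernel_size // 2)   # 1 -> 0, 3 -> 1, 5 -> 2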
    # Codeblock 7b
    def forward(self, x):
        residual = x
        print(f'original\t\t: {x.size()}')
        
        x = self.conv0(x)
        print(f'after conv0\t\t: {x.size()}')
        
        x = self.conv1(x)
        print(f'after conv1\t\t: {x.size()}')
        
        x = self.semodule(x)
        print(f'after semodule\t\t: {x.size()}')
        
        x = self.conv2(x)
        print(f'after conv2\t\t: {x.size()}')
        
        if self.add:
            x += residual
            print(f'after summation\t\t: {x.size()}')
        
        return x
Now I would like to test the Bottleneck class we just created by simulating the third row of the MobileNetV3-Large architecture in the table in Figure 1. Take a look at Codeblock 8 below to see how I do this. If you go back to the architectural details, you will notice that this bottleneck accepts a tensor of size 16×112×112 (#(7)). In this case, the bottleneck block is configured to expand the number of channels to 64 (#(3)) before eventually shrinking it to 24 (#(1)). The kernel size of the depthwise convolution is set to 3×3 (#(2)) and the stride is set to 2 (#(4)), which will reduce the spatial dimension by half. Here we use ReLU6 as the activation function (#(6)) of the first two convolutions. Lastly, the SE module is not implemented (#(5)), since there is no checkmark in the SE column of the table.
# Codeblock 8
bottleneck = Bottleneck(in_channels=16,
                        out_channels=24,   #(1)
                        kernel_size=3,     #(2)
                        exp_size=64,       #(3)
                        stride=2,          #(4)
                        padding=1, 
                        se=False,          #(5)
                        activation=nn.ReLU6())  #(6)
x = torch.randn(1, 16, 112, 112)           #(7)
out = bottleneck(x)
If you run the code above, the following output should appear on your screen.
# Codeblock 8 Output
original        : torch.Size([1, 16, 112, 112])
after conv0     : torch.Size([1, 64, 112, 112])
after conv1     : torch.Size([1, 64, 56, 56])
after semodule  : torch.Size([1, 64, 56, 56])
after conv2     : torch.Size([1, 24, 56, 56])
This output confirms that our implementation is correct in terms of tensor shape, where the spatial dimension halves from 112×112 to 56×56 while the number of channels correctly expands from 16 to 64 and then reduces from 64 to 24. Speaking more specifically about the SE module, we can see in the above output that the tensor is still passed through this component even though we set the se parameter to False. In fact, if you print out the detailed architecture of this bottleneck as I do in Codeblock 9, you will see that semodule is just an identity layer, which effectively makes this structure behave as if we were passing the output of conv1 directly to conv2.
# Codeblock 9
bottleneck
# Codeblock 9 Output
Bottleneck(
  (conv0): ConvBlock(
    (conv): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (activation): ReLU6()
  )
  (conv1): ConvBlock(
    (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=64, bias=False)
    (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (activation): ReLU6()
  )
  (semodule): Identity()
  (conv2): ConvBlock(
    (conv): Conv2d(64, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (activation): Identity()
  )
)
The above bottleneck behaves differently if we instantiate it with the se parameter set to True. In Codeblock 10 below, I create the bottleneck block from the fifth row of the MobileNetV3-Large architecture. In this case, if you print out the detailed structure, you will see that semodule consists of all the layers in the SEModule class we created earlier instead of just being an identity layer like before.
# Codeblock 10
bottleneck = Bottleneck(in_channels=24, 
                        out_channels=40, 
                        kernel_size=5, 
                        exp_size=72,
                        stride=2, 
                        padding=2, 
                        se=True, 
                        activation=nn.ReLU6())
bottleneck
# Codeblock 10 Output
Bottleneck(
  (conv0): ConvBlock(
    (conv): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (activation): ReLU6()
  )
  (conv1): ConvBlock(
    (conv): Conv2d(72, 72, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=72, bias=False)
    (bn): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (activation): ReLU6()
  )
  (semodule): SEModule(
    (global_pooling): AdaptiveAvgPool2d(output_size=(1, 1))
    (fc0): Linear(in_features=72, out_features=18, bias=False)
    (relu6): ReLU6()
    (fc1): Linear(in_features=18, out_features=72, bias=False)
    (hardsigmoid): Hardsigmoid()
  )
  (conv2): ConvBlock(
    (conv): Conv2d(72, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (activation): Identity()
  )
)
The Complete MobileNetV3
Now that all components have been created, what we need to do next is construct the main class of the MobileNetV3 model. But before doing so, I would like to initialize a list that stores the input arguments used for instantiating the bottleneck blocks, as shown in Codeblock 11 below. Keep in mind that these arguments are written for the MobileNetV3-Large version. You will need to adjust the values in the BOTTLENECKS list if you want to create the Small version instead.
# Codeblock 11
HS = nn.Hardswish()
RE = nn.ReLU6()
BOTTLENECKS = [[16,  16,  3, 16,  False, RE, 1, 1], 
               [16,  24,  3, 64,  False, RE, 2, 1], 
               [24,  24,  3, 72,  False, RE, 1, 1], 
               [24,  40,  5, 72,  True,  RE, 2, 2], 
               [40,  40,  5, 120, True,  RE, 1, 2], 
               [40,  40,  5, 120, True,  RE, 1, 2], 
               [40,  80,  3, 240, False, HS, 2, 1], 
               [80,  80,  3, 200, False, HS, 1, 1], 
               [80,  80,  3, 184, False, HS, 1, 1], 
               [80,  80,  3, 184, False, HS, 1, 1], 
               [80,  112, 3, 480, True,  HS, 1, 1], 
               [112, 112, 3, 672, True,  HS, 1, 1], 
               [112, 160, 5, 672, True,  HS, 2, 2], 
               [160, 160, 5, 960, True,  HS, 1, 2], 
               [160, 160, 5, 960, True,  HS, 1, 2]]
The arguments listed above are structured in the following order (from left to right): in_channels, out_channels, kernel_size, exp_size, se, activation, stride, and padding. Keep in mind that padding is not explicitly stated in the original table, but I include it here since it is required as an input when instantiating the bottleneck blocks.
Now let's actually create the MobileNetV3 class. See the implementation in Codeblocks 12a and 12b below.
# Codeblock 12a
class MobileNetV3(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.first_conv = ConvBlock(in_channels=3,    #(1)
                                    out_channels=int(WIDTH_MULTIPLIER*16),
                                    kernel_size=3,
                                    stride=2,
                                    padding=1, 
                                    activation=nn.Hardswish())
        
        self.blocks = nn.ModuleList([])    #(2)
        for config in BOTTLENECKS:         #(3)
            in_channels, out_channels, kernel_size, exp_size, se, activation, stride, padding = config
            self.blocks.append(Bottleneck(in_channels=int(WIDTH_MULTIPLIER*in_channels), 
                                          out_channels=int(WIDTH_MULTIPLIER*out_channels), 
                                          kernel_size=kernel_size, 
                                          exp_size=int(WIDTH_MULTIPLIER*exp_size), 
                                          stride=stride, 
                                          padding=padding, 
                                          se=se, 
                                          activation=activation))
        
        self.second_conv = ConvBlock(in_channels=int(WIDTH_MULTIPLIER*160), #(4)
                                     out_channels=int(WIDTH_MULTIPLIER*960),
                                     kernel_size=1,
                                     stride=1,
                                     padding=0, 
                                     activation=nn.Hardswish())
        
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))              #(5)
        
        self.third_conv = ConvBlock(in_channels=int(WIDTH_MULTIPLIER*960),  #(6)
                                    out_channels=int(WIDTH_MULTIPLIER*1280),
                                    kernel_size=1,
                                    stride=1,
                                    padding=0, 
                                    batchnorm=False,
                                    activation=nn.Hardswish())
        
        self.dropout = nn.Dropout(p=0.8)    #(7)
        
        self.output = ConvBlock(in_channels=int(WIDTH_MULTIPLIER*1280),     #(8)
                                out_channels=int(NUM_CLASSES),              #(9)
                                kernel_size=1,
                                stride=1,
                                padding=0, 
                                batchnorm=False,
                                activation=nn.Identity())
Notice in Figure 1 that we start with a standard convolution layer. In the above codeblock, I refer to this layer as first_conv (#(1)). It is worth noting that the input arguments for this layer are not included in the BOTTLENECKS list, hence we need to define them manually. Remember to multiply the channel counts at each step by WIDTH_MULTIPLIER, since we want the model size to be adjustable through that variable. Next, we initialize a placeholder named blocks for storing all of the bottleneck blocks (#(2)). With a simple loop at line #(3), we iterate through all items in the BOTTLENECKS list to actually instantiate the bottleneck blocks and append them one by one to blocks. In fact, this loop constructs the majority of the layers in the network, since it covers nearly all components listed in the table.
Once the sequence of bottleneck blocks is done, we proceed with the next convolution layer, which I refer to as second_conv (#(4)). Again, since the configuration parameters for this layer are not stored in the BOTTLENECKS list, we need to hard-code them manually. The output of this layer is then passed through a global average pooling layer (#(5)), which drops the spatial dimension to 1×1. Afterwards, we connect this layer to two consecutive pointwise convolutions (#(6) and #(8)) with a dropout layer in between (#(7)).
Speaking more specifically about these two convolutions, it is important to know that applying a 1×1 convolution to a tensor that has a 1×1 spatial dimension is essentially the same as applying an FC layer to a flattened tensor, where the number of channels corresponds to the number of neurons. This is the reason I set the output channel count of the last layer equal to the number of classes in the dataset (#(9)). The batchnorm parameter of both the third_conv and output layers is set to False, as suggested in the architecture.
Meanwhile, the activation function of third_conv is set to nn.Hardswish(), whereas the output layer uses nn.Identity(), which is the same as not applying any activation at all. This is done because during training the softmax is already included in the loss function (nn.CrossEntropyLoss()). Later, in the inference phase, we need to replace nn.Identity() with nn.Softmax() in the output layer so that the model directly returns the probability score of each class.
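In case you want to verify this equivalence yourself, here is a small standalone check (my own sketch, with the 960 and 1280 channel counts taken from Figure 1): a 1×1 convolution applied to a 1×1 feature map produces exactly the same result as a fully connected layer with the same weights applied to the flattened tensor.
conv1x1 = nn.Conv2d(960, 1280, kernel_size=1)
fc      = nn.Linear(960, 1280)
fc.weight.data = conv1x1.weight.data.view(1280, 960)   # reuse the conv weights
fc.bias.data   = conv1x1.bias.data

x = torch.randn(1, 960, 1, 1)
print(torch.allclose(conv1x1(x).flatten(1), fc(x.flatten(1)), atol=1e-5))   # True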
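For instance, assuming a trained model instance named mobilenetv3 like the one we create later in Codeblock 13, the swap for inference could look like this (just a sketch):
# Replace the identity activation of the output ConvBlock with softmax over the
# channel (class) dimension so the model returns probabilities at inference time.
mobilenetv3.output.activation = nn.Softmax(dim=1)
mobilenetv3.eval()   # also disables dropout and uses BatchNorm running statistics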
Next, let's take a look at the forward() method below, which I won't explain any further since I think it is pretty easy to understand.
# Codeblock 12b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        x = self.first_conv(x)
        print(f'after first_conv\t: {x.size()}')
        
        for i, block in enumerate(self.blocks):
            x = block(x)
            print(f'after bottleneck #{i}\t: {x.shape}')
        
        x = self.second_conv(x)
        print(f'after second_conv\t: {x.size()}')
        
        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')
        
        x = self.third_conv(x)
        print(f'after third_conv\t: {x.size()}')
        
        x = self.dropout(x)
        print(f'after dropout\t\t: {x.size()}')
        
        x = self.output(x)
        print(f'after output\t\t: {x.size()}')
        
        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')
            
        return x
The code in Codeblock 13 demonstrates how we initialize a MobileNetV3 instance and pass a dummy tensor through it. Remember that here we use the default input resolution, so we can basically think of the tensor as a batch containing a single RGB image of size 224×224.
# Codeblock 13
mobilenetv3 = MobileNetV3()
x = torch.randn(1, 3, INPUT_RESOLUTION, INPUT_RESOLUTION)
out = mobilenetv3(x)
And below is what the resulting output looks like, in which the tensor dimension after each block matches exactly with the MobileNetV3-Large architecture in Figure 1.
# Codeblock 13 Output
original             : torch.Size([1, 3, 224, 224])
after first_conv     : torch.Size([1, 16, 112, 112])
after bottleneck #0  : torch.Size([1, 16, 112, 112])
after bottleneck #1  : torch.Size([1, 24, 56, 56])
after bottleneck #2  : torch.Size([1, 24, 56, 56])
after bottleneck #3  : torch.Size([1, 40, 28, 28])
after bottleneck #4  : torch.Size([1, 40, 28, 28])
after bottleneck #5  : torch.Size([1, 40, 28, 28])
after bottleneck #6  : torch.Size([1, 80, 14, 14])
after bottleneck #7  : torch.Size([1, 80, 14, 14])
after bottleneck #8  : torch.Size([1, 80, 14, 14])
after bottleneck #9  : torch.Size([1, 80, 14, 14])
after bottleneck #10 : torch.Size([1, 112, 14, 14])
after bottleneck #11 : torch.Size([1, 112, 14, 14])
after bottleneck #12 : torch.Size([1, 160, 7, 7])
after bottleneck #13 : torch.Size([1, 160, 7, 7])
after bottleneck #14 : torch.Size([1, 160, 7, 7])
after second_conv    : torch.Size([1, 960, 7, 7])
after avgpool        : torch.Size([1, 960, 1, 1])
after third_conv     : torch.Size([1, 1280, 1, 1])
after dropout        : torch.Size([1, 1280, 1, 1])
after output         : torch.Size([1, 1000, 1, 1])
after flatten        : torch.Size([1, 1000])
To further verify that our implementation is correct, we can print out the number of parameters contained in the model using the following code.
# Codeblock 14
total_params = sum(p.numel() for p in mobilenetv3.parameters())
total_params
# Codeblock 14 Output
5476416
Here you can see that the model contains around 5.5 million parameters, which is roughly the same as the number disclosed in the original paper (see Figure 10). Moreover, the parameter count given in the PyTorch documentation is also close to this number, as you can see in Figure 12 below. Based on these facts, I believe I can confirm that our MobileNetV3-Large implementation is correct.

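If you would like to double-check this against torchvision yourself, something like the following should do the trick (it instantiates the official MobileNetV3-Large implementation, not ours; random weights are fine for counting parameters):
from torchvision.models import mobilenet_v3_large

tv_model = mobilenet_v3_large()
print(sum(p.numel() for p in tv_model.parameters()))   # roughly 5.48 million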
Ending
Well, that's pretty much everything about the MobileNetV3 architecture. I encourage you to actually train this model from scratch on any dataset you like. I also want you to play around with the parameter configurations of the bottleneck blocks to see whether we can push the performance of MobileNetV3 even further. By the way, the code used in this article is also available in my GitHub repo, which you can find via the link at reference number [9].
Thanks for reading. Feel free to reach out to me through LinkedIn [10] if you spot any mistakes in my explanation or in the code. See you in my next article!
References
[1] Muhammad Ardi. MobileNetV1 Paper Walkthrough: The Tiny Giant. AI Advances. https://medium.com/ai-advances/mobilenetv1-paper-walkthrough-the-tiny-giant-987196f40cd5 [Accessed October 24, 2025].
[2] Muhammad Ardi. MobileNetV2 Paper Walkthrough: The Smarter Tiny Giant. Towards Data Science. https://towardsdatascience.com/mobilenetv2-paper-walkthrough-the-smarter-tiny-giant/ [Accessed October 24, 2025].
[3] Andrew Howard et al. Searching for MobileNetV3. arXiv. https://arxiv.org/abs/1905.02244 [Accessed May 1, 2025].
[4] Muhammad Ardi. SENet Paper Walkthrough: The Channel-Wise Attention. AI Advances. https://medium.com/ai-advances/senet-paper-walkthrough-the-channel-wise-attention-8ac72b9cc252 [Accessed October 24, 2025].
[5] Image originally created by the author.
[6] Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv. https://arxiv.org/abs/1801.04381 [Accessed May 12, 2025].
[7] Jie Hu et al. Squeeze-and-Excitation Networks. arXiv. https://arxiv.org/abs/1709.01507 [Accessed May 12, 2025].
[8] Mobilenet_v3_large. PyTorch. https://docs.pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v3_large.html#torchvision.models.mobilenet_v3_large [Accessed May 12, 2025].
[9] MuhammadArdiPutra. The Tiny Giant Getting Even Smarter - MobileNetV3. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Tiny%20Giant%20Getting%20Even%20Smarter%20-%20MobileNetV3.ipynb [Accessed May 12, 2025].
[10] Muhammad Ardi Putra. LinkedIn. https://www.linkedin.com/in/muhammad-ardi-putra-879528152/ [Accessed May 12, 2025].
