Introduction
MobileNetV1 was a breakthrough in the field of computer vision because it proved that deep learning models don't necessarily have to be computationally expensive to achieve high accuracy. Last month I posted an article where I explained everything about that model along with its PyTorch implementation from scratch. Check the link at reference number [1] at the end of this article if you are interested in reading it. The first version of MobileNet was proposed back in April 2017 in a paper titled MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [2] by Howard et al. from Google. Not long after, in January 2018 to be precise, Sandler et al. from the same institution introduced its successor in a paper titled MobileNetV2: Inverted Residuals and Linear Bottlenecks [3], which brings significant improvements over the previous version in terms of both accuracy and efficiency. In this article, I'm going to walk you through the ideas proposed in the MobileNetV2 paper and show you how to implement the architecture from scratch.
The Improvements
The first version of MobileNet relies heavily on the so-called depthwise separable convolution layers. It is indeed worth acknowledging that using these layers instead of standard convolutions is what allows the model to be extremely lightweight. Nevertheless, the authors thought that this architecture could still be improved even further. They came up with an idea where, instead of only using depthwise separable convolutions, they also adopted the inverted residual and linear bottleneck mechanisms, which is where the title of the MobileNetV2 paper came from.
Inverted Residual
If you're familiar with ResNet [4], I believe you already know the so-called bottleneck block. For those who don't, it is essentially a mechanism where the building block of the network follows a wide → narrow → wide pattern. Figure 1 below displays an illustration of a bottleneck block used in ResNet. Here we can see that it initially accepts a 256-channel tensor, shrinks it to 64, and expands it back to 256.
The inverted version of the above block is commonly known as an inverted bottleneck, which follows the narrow → wide → narrow structure. Figure 2 below shows an example from the ConvNeXt paper [5], where the number of channels in the input tensor is 96, expanded to 384, and compressed back to 96 by the last convolution layer. It's important to note that in MobileNetV2 an inverted bottleneck block is referred to as an inverted residual block for some reason. So, from now on, I'll use the term inverted residual to avoid confusion.

At this point you might be wondering why we don't just use the standard bottleneck for MobileNetV2. The answer lies in the original purpose of the standard bottleneck design, which was first introduced to reduce computational complexity. This was essentially done because ResNet is computationally expensive by nature yet rich in information. For this reason, the ResNet authors proposed to reduce computational cost by shrinking the tensor size in the middle of each building block, which is how the bottleneck block was born.
This reduction in the number of channels doesn't hurt the model capacity that much since ResNet already has plenty of channels overall. However, MobileNetV2 is meant to be as lightweight as possible in the first place, which means its capacity is not as high as ResNet's. In order to increase model capacity, the authors expand the tensor size in the middle to form the inverted residual block, which allows the model to learn more patterns while only slightly increasing complexity. So, in short, the middle part of a bottleneck block (the narrow part) is used for efficiency, while the middle part of an inverted residual block (the wide part) is used to learn complex patterns. If we tried to apply a standard bottleneck to MobileNetV2 instead, the computation would be even faster, but this would cause a drop in accuracy since the model would lose a significant amount of information.
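To make this contrast a bit more concrete, below is a small sketch I put together (it is not part of the MobileNetV2 implementation we build later). It compares a hypothetical ResNet-style bottleneck that shrinks a 32-channel tensor down to 8 channels in the middle against a MobileNetV2-style inverted residual that expands the same tensor to 192 channels with a depthwise 3×3 convolution in between. The 8-channel middle and the expansion factor of 6 are purely illustrative choices on my part.

# A rough comparison of the two channel patterns (illustrative only).
import torch
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Hypothetical "standard bottleneck" with MobileNetV2-like channel counts:
# 32 -> 8 -> 8 -> 32 (wide -> narrow -> wide, ResNet-style 4x reduction).
standard_bottleneck = nn.Sequential(
    nn.Conv2d(32, 8, kernel_size=1, bias=False),
    nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(8, 32, kernel_size=1, bias=False),
)

# MobileNetV2-style inverted residual: 32 -> 192 -> 32
# (narrow -> wide -> narrow, expansion factor 6, depthwise 3x3 in the middle).
inverted_residual = nn.Sequential(
    nn.Conv2d(32, 192, kernel_size=1, bias=False),
    nn.Conv2d(192, 192, kernel_size=3, padding=1, groups=192, bias=False),
    nn.Conv2d(192, 32, kernel_size=1, bias=False),
)

x = torch.randn(1, 32, 28, 28)
print(standard_bottleneck(x).shape, count_params(standard_bottleneck))  # 1,088 params
print(inverted_residual(x).shape, count_params(inverted_residual))      # 14,016 params

Both blocks map a 32-channel tensor to a 32-channel tensor of the same spatial size, but the inverted residual spends its parameters on a much wider intermediate representation, which is exactly the capacity argument above.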
Linear Bottleneck
The next concept we need to understand is the so-called linear bottleneck. This one is actually pretty simple, since what we essentially do here is just omit the nonlinearity (i.e., the ReLU activation function) in the last layer of each inverted residual block. The whole point of using activation functions in neural networks in the first place is to allow the network to capture complex patterns. However, an activation can destroy important information instead if we apply it to a low-dimensional tensor, especially in the context of MobileNetV2 where the inverted residual block projects a high-dimensional tensor to a smaller one in its last convolution layer. By removing the activation function from the last convolution layer like this, we essentially prevent the model from losing important information. Figure 3 below shows what the inverted residual block used in MobileNetV2 looks like. Notice that ReLU is not applied after the last pointwise convolution, which essentially means that this layer behaves somewhat similarly to a standard linear regression layer. In addition to this figure, note that the block accepts a fixed number of input channels and produces a fixed number of output channels, while in the intermediate process we expand the number of channels by the expansion factor t before eventually shrinking it back down. I'll go into more detail on these variables in the next section.

ReLU6
So why do we use ReLU6 instead of regular ReLU? In case you're not yet familiar with it, this activation function is basically similar to ReLU, except that the output value is capped at 6. So, any input greater than 6 will be mapped to 6, while the behavior for negative inputs is exactly the same. Thus, we can simply say that the output of ReLU6 will always be within the range of 0 to 6 (inclusive). Take a look at Figure 4 below to better understand this concept.

In standard ReLU, there is a possibility that the input, and therefore the output, grows arbitrarily large, which can potentially cause instability in low-precision environments. Remember that MobileNet is meant to be able to run on small devices, and such devices typically expect small numbers to save memory, say 8-bit integers. In this particular case, having very large activation values could lead to precision loss or clipping when quantized to low-bit representations. Thus, to keep the values small and within a manageable range, we can simply employ ReLU6.
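If you want to see this behavior for yourself, the following quick check (my own addition, not part of the main implementation) applies nn.ReLU6 to a handful of values and confirms that it is equivalent to clamping into the [0, 6] range.

# Quick sanity check of ReLU6 behavior.
import torch
import torch.nn as nn

relu6 = nn.ReLU6()
x = torch.tensor([-3.0, 0.0, 2.5, 6.0, 100.0])

print(relu6(x))                  # tensor([0.0000, 0.0000, 2.5000, 6.0000, 6.0000])
print(torch.clamp(x, 0.0, 6.0))  # same result: ReLU6 is just a clamp into [0, 6]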
The Complete MobileNetV2 Architecture
Now let's take a look at the complete MobileNetV2 architecture in Figure 5 below. Just like the first version of MobileNet, which mostly consists of depthwise separable convolutions, most of the components inside MobileNetV2 are the inverted residual blocks with linear bottlenecks we discussed earlier. Every row in the following table labeled bottleneck corresponds to a single stage, each of which consists of several inverted residual blocks. Talking about the columns in the table, t represents the expansion factor used in the middle part of each block, c denotes the number of output channels of each block, n is the number of times the block is repeated within that stage, and s indicates the stride of the first block within the stage.
To better understand this, let's take a closer look at the stage whose input shape is 56×56×24. Here you can see that the corresponding parameters of this stage are t=6, c=32, n=3, and s=2. This essentially means that the stage consists of three inverted residual blocks. All these blocks are identical except that the first one uses stride 2, reducing the spatial dimension by half from 56×56 to 28×28. Next, c=32 is pretty straightforward since it simply says that the number of output channels of each block within the stage is 32. Meanwhile, t=6 indicates that the intermediate layer inside the blocks is 6 times wider than the input, forming the inverted bottleneck structure. So, in this case the number of channels in the process is going to be 32 → 192 → 32. However, it is important to note that the first block inside that stage is different, where it uses a 24 → 144 → 32 structure due to the 24-channel input tensor. If we refer back to Figure 3, these two structures essentially follow the narrow → wide → narrow pattern.
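If you prefer to see this bookkeeping in code, here is a tiny helper I wrote (purely for illustration; it is not used in the implementation later) that expands one (t, c, n, s) row of the table into the per-block channel transitions described above.

# Expand one (t, c, n, s) row of the table into per-block channel transitions.
def expand_stage(in_channels, t, c, n, s):
    blocks = []
    for i in range(n):
        stride = s if i == 0 else 1   # only the first block in a stage may downsample
        hidden = in_channels * t      # expanded (middle) channel count
        blocks.append((in_channels, hidden, c, stride))
        in_channels = c               # the next block takes this stage's output channels
    return blocks

# The stage discussed above: 24-channel input, t=6, c=32, n=3, s=2
for in_c, hidden, out_c, stride in expand_stage(24, t=6, c=32, n=3, s=2):
    print(f"{in_c} -> {hidden} -> {out_c} (stride {stride})")

# 24 -> 144 -> 32 (stride 2)
# 32 -> 192 -> 32 (stride 1)
# 32 -> 192 -> 32 (stride 1)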

In addition to the above architecture, we also have skip-connections placed within the inverted residual blocks. A skip-connection is only applied whenever the stride of the block is set to 1. This is because the spatial dimension of the feature map changes whenever we use stride 2, causing the output tensor to have a different shape from the input. Such a difference in tensor shapes effectively prevents us from performing element-wise summation between the original flow and the skip-connection. See Figure 6 below for the details. Note that the two illustrations in this figure are basically just visualizations of the table in Figure 3.

Parameter Tuning
Just like MobileNetV1, MobileNetV2 also has two adjustable parameters, the width multiplier and the input resolution. The former is used to control the width of the network, while the latter changes the resolution of the input image. The architecture you see in Figure 5 is the base configuration, where we set the width multiplier to 1 and the input resolution to 224×224. With these two parameters, we can tune the model to find a sweet spot that balances accuracy and efficiency based on our needs.
We can technically choose arbitrary numbers for the two parameters, but the authors already provided several predetermined values in their experiments. For the width multiplier, we can use 0.75, 0.5 or 0.35, all of which make the model smaller. For instance, if we use 0.5, then all numbers in column c in Figure 5 will be reduced to half of their defaults. For the input resolution, we can choose 192×192, 160×160, 128×128 or 96×96 instead of 224×224 if we want to lower the number of operations during inference.
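Below is a quick sketch of how the width multiplier would scale the c column. Note that I simply truncate with int() here, just like the implementation later in this article; the official torchvision implementation, for instance, additionally rounds channel counts to multiples of 8, so treat the exact rounding rule as an implementation detail.

# A rough illustration of how the width multiplier scales the "c" column.
base_channels = [32, 16, 24, 32, 64, 96, 160, 320, 1280]

for width_multiplier in [1.0, 0.75, 0.5, 0.35]:
    scaled = [int(c * width_multiplier) for c in base_channels]
    print(width_multiplier, scaled)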
Some Experimental Results
Figure 7 below shows the experimental results reported by the authors. Although MobileNetV1 is already considered lightweight, MobileNetV2 proved to perform even better across all metrics compared to its predecessor. Nevertheless, it is worth acknowledging that the base MobileNetV2 is not completely superior to other lightweight models, especially when taking all aspects into account at once.

In order to achieve even higher accuracy, the authors also tried to enlarge the model by changing the width multiplier to 1.4 for the 224×224 input resolution, which in the above figure corresponds to the result in the last row. Doing this definitely increases the model complexity as well as the computation time, but in return it allows the model to obtain the highest accuracy. The results in Figure 8 show a similar thing, where all MobileNetV2 variants completely outperform their MobileNetV1 counterparts, with the largest MobileNetV2 obtaining the highest accuracy among all models.

MobileNetV2 Implementation
Whenever I finish learning something, I always wonder whether I really understand what I just learned. In the case of deep learning, I (almost) always try to implement the architecture on my own right after reading the paper, just to prove to myself that I understand it. There is a famous quote that drives me to work this way.
This is essentially the reason why I always include the code implementation of the paper I'm explaining in my posts.
Anyway, that was a bit of an intermezzo. Now let's get our focus back to MobileNetV2. In this section I'm going to show you how we can implement the architecture from scratch. As always, the very first thing we need to do is import the required modules.
# Codeblock 1
import torch
import torch.nn as nn
from torchinfo import summary
Next, we also need to initialize some configuration variables so that we can easily rescale our model if we want to. The two variables I want to highlight in Codeblock 2 below are WIDTH_MULTIPLIER and IMAGE_SIZE, which essentially correspond to the width multiplier and input resolution parameters we discussed earlier. Here I set the two to 1.0 and 224 because I want to implement the base MobileNetV2 architecture.
# Codeblock 2
BATCH_SIZE = 1
IMAGE_SIZE = 224
IN_CHANNELS = 3
NUM_CLASSES = 1000
WIDTH_MULTIPLIER = 1.0
If we take a look at the architectural details in Figure 5, we can see that each row labeled bottleneck is a group of inverted residual blocks, which we previously referred to as a stage. Meanwhile, each row labeled conv2d is basically just a standard convolution layer. I'll start with the latter because it is simpler to implement.
The Standard Convolution Layer
Talking about the rows labeled conv2d, you might be asking why we need to wrap this single convolution layer in a separate class. Can't we just use nn.Conv2d directly in the main class? In fact, it is mentioned in the paper that every conv layer is always followed by a batch normalization layer before eventually being processed by the ReLU6 activation function. This is in line with MobileNetV1, which uses the Conv-BN-ReLU structure. In order to make the code cleaner, we can just wrap these layers inside a single class so that we don't have to define all of them repeatedly. Take a look at Codeblock 3 below to see how I create the Conv class.
# Codeblock 3
class Conv(nn.Module):
    def __init__(self, first=False):                     #(1)
        super().__init__()

        if first:
            in_channels = 3                               #(2)
            out_channels = int(32*WIDTH_MULTIPLIER)       #(3)
            kernel_size = 3                               #(4)
            stride = 2                                    #(5)
            padding = 1                                   #(6)
        else:
            in_channels = int(320*WIDTH_MULTIPLIER)       #(7)
            out_channels = int(1280*WIDTH_MULTIPLIER)     #(8)
            kernel_size = 1                               #(9)
            stride = 1                                    #(10)
            padding = 0                                   #(11)

        self.conv = nn.Conv2d(in_channels=in_channels,    #(12)
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              stride=stride,
                              padding=padding,
                              bias=False)
        self.bn = nn.BatchNorm2d(num_features=out_channels)  #(13)
        self.relu6 = nn.ReLU6()                               #(14)

    def forward(self, x):
        x = self.relu6(self.bn(self.conv(x)))                 #(15)
        return x
Whenever we want to instantiate a Conv instance, we need to pass a value for the first parameter, as shown at the line marked with #(1) in the above code. If you take a look at the architecture, you'll notice that this Conv layer is used either before the sequence of inverted residual stages or right after it. Figure 9 below displays the architecture again with the two convolutions highlighted in pink and green, respectively. Later in the main class, if we want to instantiate the pink layer, we can simply set the first flag to True, and if we want to instantiate the green one, we can run it without passing any arguments since I've set the flag to False by default.

Using a flag like this lets us apply different configurations to the two convolutions. When we use first=True, we set the convolution layer to accept 3 input channels (#(2)) and produce a 32-channel tensor (#(3)). The kernel size used is 3×3 (#(4)) with a stride of 2 (#(5)), effectively downsampling the spatial dimension by half. With this kernel size, we need to set the padding to 1 (#(6)) to prevent the convolution from reducing the spatial dimension even further. All these configurations are taken from the conv layer highlighted in pink.
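If you want to double-check the 224 to 112 reduction, you can plug the numbers into the usual convolution output-size formula. The small helper below is just my own sanity check and is not used anywhere else in this article.

# Output size of a convolution: floor((in + 2*padding - kernel) / stride) + 1
def conv_out_size(in_size, kernel_size, stride, padding):
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(224, kernel_size=3, stride=2, padding=1))  # 112 (the pink conv)
print(conv_out_size(7, kernel_size=1, stride=1, padding=0))    # 7   (the green pointwise conv)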
Meanwhile, when we use first=False, this convolution layer takes a 320-channel tensor as the input (#(7)) and produces another one having 1280 channels (#(8)). This green-highlighted layer is a pointwise convolution, hence we need to set the kernel size to 1 (#(9)). Since we won't perform spatial downsampling here, the stride parameter must be set to 1 as shown at line #(10) (notice that the input size of this layer and the next one are both 7×7 spatially). Lastly, we set the padding to 0 (#(11)) because a 1×1 kernel cannot reduce spatial dimensions by itself anyway.
Since the parameters for the convolution layer have been defined, the next thing we do in the Conv class above is initialize the convolution layer itself using nn.Conv2d (#(12)) as well as the batch normalization layer (#(13)) and the ReLU6 activation function (#(14)). Lastly, we assemble these layers to form the Conv-BN-ReLU6 structure in the forward() method (#(15)). In addition to the above code, don't forget to apply WIDTH_MULTIPLIER when specifying the number of input and output channels, i.e., at lines #(3), #(7), and #(8), so that we can adjust the model size simply by changing the value of that variable.
Now let's check whether we have implemented the Conv class correctly by running the two test cases below. The one in Codeblock 4 demonstrates the pink layer, while Codeblock 5 shows the green one. The shape of the dummy tensor x used in each test is set according to the input shape required by the corresponding layer. Based on the resulting outputs, we can confirm that our implementation is correct since the output tensor shapes match exactly with the expected input shapes of the corresponding subsequent layers.
# Codeblock 4
conv = Conv(first=True)
x = torch.randn(1, 3, 224, 224)
out = conv(x)
out.shape
# Codeblock 4 Output
torch.Size([1, 32, 112, 112])
# Codeblock 5
conv = Conv(first=False)
x = torch.randn(1, int(320*WIDTH_MULTIPLIER), 7, 7)
out = conv(x)
out.shape
# Codeblock 5 Output
torch.Size([1, 1280, 7, 7])
Inverted Residual Block for Stride 2
Now that we have completed the class for standard convolution layers, we will move on to the one for the inverted residual blocks. Keep in mind that there are cases where we use either stride 1 or stride 2, which results in a slight difference in the block structure (see Figure 6). In this case I decided to implement them in two separate classes. In terms of practicality, it might indeed be cleaner to put them within the same class. However, for the sake of this tutorial, I feel like breaking them down into two will make things easier to follow. I'm going to implement the one with stride 2 first since it is simpler thanks to the absence of the skip-connection. See the InvResidualS2 class in Codeblock 6 below for the details.
# Codeblock 6
class InvResidualS2(nn.Module):
    def __init__(self, in_channels, out_channels, t):                #(1)
        super().__init__()
        in_channels = int(in_channels*WIDTH_MULTIPLIER)               #(2)
        out_channels = int(out_channels*WIDTH_MULTIPLIER)             #(3)

        self.pwconv0 = nn.Conv2d(in_channels=in_channels,             #(4)
                                 out_channels=in_channels*t,
                                 kernel_size=1,
                                 stride=1,
                                 bias=False)
        self.bn_pwconv0 = nn.BatchNorm2d(num_features=in_channels*t)

        self.dwconv = nn.Conv2d(in_channels=in_channels*t,            #(5)
                                out_channels=in_channels*t,
                                kernel_size=3,                        #(6)
                                stride=2,
                                padding=1,
                                groups=in_channels*t,                 #(7)
                                bias=False)
        self.bn_dwconv = nn.BatchNorm2d(num_features=in_channels*t)

        self.pwconv1 = nn.Conv2d(in_channels=in_channels*t,           #(8)
                                 out_channels=out_channels,
                                 kernel_size=1,
                                 stride=1,
                                 bias=False)
        self.bn_pwconv1 = nn.BatchNorm2d(num_features=out_channels)

        self.relu6 = nn.ReLU6()

    def forward(self, x):
        print('original\t\t:', x.shape)
        x = self.pwconv0(x)
        print('after pwconv0\t\t:', x.shape)
        x = self.bn_pwconv0(x)
        print('after bn0_pwconv0\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        x = self.dwconv(x)
        print('after dwconv\t\t:', x.shape)
        x = self.bn_dwconv(x)
        print('after bn_dwconv\t\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        x = self.pwconv1(x)
        print('after pwconv1\t\t:', x.shape)
        x = self.bn_pwconv1(x)
        print('after bn_pwconv1\t:', x.shape)
        return x
The above class takes three parameters in order to work: in_channels, out_channels, and t, as written at line #(1). The first two correspond to the number of input and output channels of the inverted residual block, whereas t is the expansion factor that determines the channel count of the middle part of the block. So, what we basically do here is make the middle tensors have t times more channels than the input. The numbers of input and output channels themselves are adjustable via the WIDTH_MULTIPLIER variable we initialized earlier, as shown at lines #(2) and #(3).
What we need to do next is initialize the layers within the inverted residual block according to the structure shown in Figures 3 and 6. Notice in the two figures that we have a depthwise convolution layer placed between two pointwise convolutions. The first pointwise convolution (#(4)) is used to expand the channel dimension from in_channels to in_channels*t. Subsequently, the depthwise convolution at line #(5) is responsible for capturing information along the spatial dimension. Here we set the kernel size to 3×3 (#(6)), which allows the layer to capture spatial information from neighboring pixels. Don't forget to set the groups parameter to the number of input channels of this layer (#(7)), since we want the convolution operation to be performed independently for each channel. Next, we process the resulting tensor with the second pointwise convolution (#(8)), which projects the tensor to the expected number of output channels of the block.
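As a quick aside (this check is my own and is not part of the class above), the groups parameter is also what keeps the wide middle layer cheap. Comparing a standard 3×3 convolution with a depthwise one on the same 96 channels shows the difference in parameter count:

# Standard 3x3 convolution vs. depthwise 3x3 convolution (groups=channels).
import torch.nn as nn

channels = 96  # e.g., 16*t with t=6, as in the test case below

standard = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels, bias=False)

print(sum(p.numel() for p in standard.parameters()))   # 96*96*3*3 = 82,944
print(sum(p.numel() for p in depthwise.parameters()))  # 96*1*3*3  = 864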
In the forward() method, we place the layers one after another. Remember that we use the Conv-BN-ReLU6 structure except for the last convolution, following the linear bottleneck convention we discussed earlier. Moreover, here I also print out the output shape after each layer so that you can clearly see how the tensor transforms throughout the process.
Next, we're going to test whether the InvResidualS2 class works properly. The following testing code simulates the first inverted residual block of the third row in the architecture (i.e., the one having a 16×112×112 input shape).
# Codeblock 7
inv_residual_s2 = InvResidualS2(in_channels=16, out_channels=24, t=6)
x = torch.randn(1, int(16*WIDTH_MULTIPLIER), 112, 112)
out = inv_residual_s2(x)
You can see at the line marked with #(1) in the following output that the first pointwise convolution successfully expands the channel axis from 16 to 96. The spatial dimension then shrinks from 112×112 to 56×56 after the tensor is processed by the depthwise convolution layer in the middle (#(2)). Lastly, our second pointwise convolution compresses the number of channels to 24, as written at line #(3). This final tensor is now ready to be passed through the next inverted residual block within the same stage.
# Codeblock 7 Output
original : torch.Size([1, 16, 112, 112])
after pwconv0 : torch.Size([1, 96, 112, 112]) #(1)
after bn0_pwconv0 : torch.Size([1, 96, 112, 112])
after relu : torch.Size([1, 96, 112, 112])
after dwconv : torch.Size([1, 96, 56, 56]) #(2)
after bn_dwconv : torch.Size([1, 96, 56, 56])
after relu : torch.Size([1, 96, 56, 56])
after pwconv1 : torch.Size([1, 24, 56, 56]) #(3)
after bn_pwconv1 : torch.Size([1, 24, 56, 56])
Inverted Residual Block for Stride 1
The code used for implementing the inverted residual block with stride 1 is mostly similar to the one with stride 2. See the InvResidualS1 class in Codeblock 8 below.
# Codeblock 8
class InvResidualS1(nn.Module):
    def __init__(self, in_channels, out_channels, t):
        super().__init__()
        in_channels = int(in_channels*WIDTH_MULTIPLIER)               #(1)
        out_channels = int(out_channels*WIDTH_MULTIPLIER)             #(2)
        self.in_channels = in_channels
        self.out_channels = out_channels

        self.pwconv0 = nn.Conv2d(in_channels=in_channels,
                                 out_channels=in_channels*t,
                                 kernel_size=1,
                                 stride=1,
                                 bias=False)
        self.bn_pwconv0 = nn.BatchNorm2d(num_features=in_channels*t)

        self.dwconv = nn.Conv2d(in_channels=in_channels*t,
                                out_channels=in_channels*t,
                                kernel_size=3,
                                stride=1,                             #(3)
                                padding=1,
                                groups=in_channels*t,
                                bias=False)
        self.bn_dwconv = nn.BatchNorm2d(num_features=in_channels*t)

        self.pwconv1 = nn.Conv2d(in_channels=in_channels*t,
                                 out_channels=out_channels,
                                 kernel_size=1,
                                 stride=1,
                                 bias=False)
        self.bn_pwconv1 = nn.BatchNorm2d(num_features=out_channels)

        self.relu6 = nn.ReLU6()

    def forward(self, x):
        if self.in_channels == self.out_channels:                     #(4)
            residual = x                                              #(5)
            print(f'residual\t\t: {residual.size()}')

        x = self.pwconv0(x)
        print('after pwconv0\t\t:', x.shape)
        x = self.bn_pwconv0(x)
        print('after bn_pwconv0\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        x = self.dwconv(x)
        print('after dwconv\t\t:', x.shape)
        x = self.bn_dwconv(x)
        print('after bn_dwconv\t\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        x = self.pwconv1(x)
        print('after pwconv1\t\t:', x.shape)
        x = self.bn_pwconv1(x)
        print('after bn_pwconv1\t:', x.shape)

        if self.in_channels == self.out_channels:
            x = x + residual                                          #(6)
            print('after summation\t\t:', x.shape)

        return x
The first difference we have here is, of course, the stride parameter itself, especially the one belonging to the depthwise convolution layer at line #(3). By setting this stride parameter to 1, the spatial output dimension of this inverted residual block is going to be the same as the input.
Another thing that we didn't do previously is creating instance attributes for in_channels and out_channels, as shown at lines #(1) and #(2). We do this now because later on we will need to access these values from the forward() method. This is really just a basic OOP concept: if we don't assign them to self, they will only exist locally within the __init__() method and won't be available to other methods in the class.
Inside the forward() method itself, what we need to do first is check whether the numbers of input and output channels are the same (#(4)). If so, we keep the original input tensor (#(5)) to implement the skip-connection, and this tensor will later be element-wise summed with the one from the main flow (#(6)). This dimensionality check is performed because we need to make sure that the two tensors to be summed have exactly the same size. We have indeed guaranteed the spatial dimension to remain unchanged, since we set all three convolution layers to use stride 1. However, there is still a possibility that the number of output channels differs from the input, just like the first block within the stages highlighted in purple, blue and orange in Figure 10 below. In such cases, the skip-connection is not applied, since it's simply not possible to perform element-wise summation on tensors with different shapes.

Now let's test the InvResidualS1 class by running Codeblock 9 below. Here I'm going to simulate the second inverted residual block of the third row in the architecture, which is basically just the continuation of the previous test case. Here you can see that the dummy tensor we use has exactly the same shape as the one we obtained from Codeblock 7, i.e., 24×56×56.
# Codeblock 9
inv_residual_s1 = InvResidualS1(in_channels=24, out_channels=24, t=6)
x = torch.randn(1, int(24*WIDTH_MULTIPLIER), 56, 56)
out = inv_residual_s1(x)
And below is what the resulting output looks like. It is clearly seen here that the network indeed follows the narrow → wide → narrow structure, which in this case is 24 → 144 → 24. In addition, since the spatial dimensions of the input and output tensors are the same, we can technically stack this inverted residual block as many times as we want.
# Codeblock 9 Output
residual : torch.Size([1, 24, 56, 56])
after pwconv0 : torch.Size([1, 144, 56, 56])
after bn_pwconv0 : torch.Size([1, 144, 56, 56])
after relu : torch.Size([1, 144, 56, 56])
after dwconv : torch.Size([1, 144, 56, 56])
after bn_dwconv : torch.Size([1, 144, 56, 56])
after relu : torch.Size([1, 144, 56, 56])
after pwconv1 : torch.Size([1, 24, 56, 56])
after bn_pwconv1 : torch.Size([1, 24, 56, 56])
after summation : torch.Size([1, 24, 56, 56])
The Entire MobileNetV2 Architecture
Now that we have finished defining the Conv, InvResidualS2 and InvResidualS1 classes, we can assemble all of them to construct the entire MobileNetV2 architecture. Take a look at Codeblock 10 below to see how I do that.
# Codeblock 10
class MobileNetV2(nn.Module):
    def __init__(self):
        super().__init__()

        # Input shape: 3x224x224
        self.first_conv = Conv(first=True)

        # Input shape: 32x112x112
        self.inv_residual0 = InvResidualS1(in_channels=32,
                                           out_channels=16,
                                           t=1)

        # Input shape: 16x112x112
        self.inv_residual1 = nn.ModuleList([InvResidualS2(in_channels=16,
                                                          out_channels=24,
                                                          t=6)])
        self.inv_residual1.append(InvResidualS1(in_channels=24,
                                                out_channels=24,
                                                t=6))

        # Input shape: 24x56x56
        self.inv_residual2 = nn.ModuleList([InvResidualS2(in_channels=24,
                                                          out_channels=32,
                                                          t=6)])
        for _ in range(2):
            self.inv_residual2.append(InvResidualS1(in_channels=32,
                                                    out_channels=32,
                                                    t=6))

        # Input shape: 32x28x28
        self.inv_residual3 = nn.ModuleList([InvResidualS2(in_channels=32,
                                                          out_channels=64,
                                                          t=6)])
        for _ in range(3):
            self.inv_residual3.append(InvResidualS1(in_channels=64,
                                                    out_channels=64,
                                                    t=6))

        # Input shape: 64x14x14
        self.inv_residual4 = nn.ModuleList([InvResidualS1(in_channels=64,
                                                          out_channels=96,
                                                          t=6)])
        for _ in range(2):
            self.inv_residual4.append(InvResidualS1(in_channels=96,
                                                    out_channels=96,
                                                    t=6))

        # Input shape: 96x14x14
        self.inv_residual5 = nn.ModuleList([InvResidualS2(in_channels=96,
                                                          out_channels=160,
                                                          t=6)])
        for _ in range(2):
            self.inv_residual5.append(InvResidualS1(in_channels=160,
                                                    out_channels=160,
                                                    t=6))

        # Input shape: 160x7x7
        self.inv_residual6 = InvResidualS1(in_channels=160,
                                           out_channels=320,
                                           t=6)

        # Input shape: 320x7x7
        self.last_conv = Conv(first=False)

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))       #(1)
        self.dropout = nn.Dropout(p=0.2)                             #(2)
        self.fc = nn.Linear(in_features=int(1280*WIDTH_MULTIPLIER),  #(3)
                            out_features=1000)

    def forward(self, x):
        x = self.first_conv(x)
        print(f"after first_conv\t: {x.shape}")

        x = self.inv_residual0(x)
        print(f"after inv_residual0\t: {x.shape}")

        for i, layer in enumerate(self.inv_residual1):
            x = layer(x)
            print(f"after inv_residual1 #{i}\t: {x.shape}")

        for i, layer in enumerate(self.inv_residual2):
            x = layer(x)
            print(f"after inv_residual2 #{i}\t: {x.shape}")

        for i, layer in enumerate(self.inv_residual3):
            x = layer(x)
            print(f"after inv_residual3 #{i}\t: {x.shape}")

        for i, layer in enumerate(self.inv_residual4):
            x = layer(x)
            print(f"after inv_residual4 #{i}\t: {x.shape}")

        for i, layer in enumerate(self.inv_residual5):
            x = layer(x)
            print(f"after inv_residual5 #{i}\t: {x.shape}")

        x = self.inv_residual6(x)
        print(f"after inv_residual6\t: {x.shape}")

        x = self.last_conv(x)
        print(f"after last_conv\t\t: {x.shape}")

        x = self.avgpool(x)
        print(f"after avgpool\t\t: {x.shape}")

        x = torch.flatten(x, start_dim=1)
        print(f"after flatten\t\t: {x.shape}")

        x = self.dropout(x)
        print(f"after dropout\t\t: {x.shape}")

        x = self.fc(x)
        print(f"after fc\t\t: {x.shape}")

        return x
Despite being quite long, I think the above code is pretty straightforward, since what we basically do here is just place the blocks according to the given architectural details. However, I do want you to pay attention to the number of block repeats within a single stage (n) as well as whether or not the first block in a stage performs downsampling (s). This is because the architecture doesn't seem to follow a particular pattern. There is a case where the block is repeated four times, there are other cases where it is repeated two or three times, and there is even a stage that consists of a single block only. Not only that, it is also unclear under what conditions the authors decided to use stride 1 or 2 for the first block in a stage. I believe this final architecture was obtained from their internal design iterations and experiments, which are not discussed in the paper.
Going back to the code, after the stages have been initialized, what we need to do next is initialize the remaining layers, namely an average pooling layer (#(1)), a dropout layer (#(2)) and a linear layer (#(3)) for the classification head. If you go back to the architectural details, you'll notice that the final layer should be a pointwise convolution, not a linear layer like this. In fact, in the case where the spatial dimension of the input tensor is 1×1, a pointwise convolution and a linear layer are equivalent, so it's basically fine to use either one.
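In case you want to convince yourself of this equivalence, here is a small check I added (it is not part of the model). It copies the weights of a 1×1 convolution into a linear layer and confirms that both produce the same output once the spatial dimension is 1×1.

# Checking that a pointwise conv and a linear layer match on a 1x1 spatial input.
import torch
import torch.nn as nn

pwconv = nn.Conv2d(1280, 1000, kernel_size=1)
fc = nn.Linear(1280, 1000)

# Copy the conv weights into the linear layer so both compute the same function
with torch.no_grad():
    fc.weight.copy_(pwconv.weight.squeeze(-1).squeeze(-1))
    fc.bias.copy_(pwconv.bias)

x = torch.randn(1, 1280, 1, 1)              # e.g., the output of the average pooling layer
out_conv = pwconv(x).flatten(start_dim=1)   # shape (1, 1000)
out_fc = fc(x.flatten(start_dim=1))         # shape (1, 1000)

print(torch.allclose(out_conv, out_fc, atol=1e-6))  # True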
To make sure our MobileNetV2 is working properly, we can run Codeblock 11 below. Here we can see that this class instance runs without any errors. More importantly, the output shapes also match exactly with the architecture specified in the paper. This confirms that our implementation is correct, and thus ready for training. Just don't forget to adjust the output size of the final layer to match the number of classes in your dataset.
# Codeblock 11
mobilenetv2 = MobileNetV2()
x = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
out = mobilenetv2(x)
# Codeblock 11 Output
after first_conv : torch.Size([1, 32, 112, 112])
after inv_residual0 : torch.Size([1, 16, 112, 112])
after inv_residual1 #0 : torch.Size([1, 24, 56, 56])
after inv_residual1 #1 : torch.Size([1, 24, 56, 56])
after inv_residual2 #0 : torch.Size([1, 32, 28, 28])
after inv_residual2 #1 : torch.Size([1, 32, 28, 28])
after inv_residual2 #2 : torch.Size([1, 32, 28, 28])
after inv_residual3 #0 : torch.Size([1, 64, 14, 14])
after inv_residual3 #1 : torch.Size([1, 64, 14, 14])
after inv_residual3 #2 : torch.Size([1, 64, 14, 14])
after inv_residual3 #3 : torch.Size([1, 64, 14, 14])
after inv_residual4 #0 : torch.Size([1, 96, 14, 14])
after inv_residual4 #1 : torch.Size([1, 96, 14, 14])
after inv_residual4 #2 : torch.Size([1, 96, 14, 14])
after inv_residual5 #0 : torch.Size([1, 160, 7, 7])
after inv_residual5 #1 : torch.Size([1, 160, 7, 7])
after inv_residual5 #2 : torch.Size([1, 160, 7, 7])
after inv_residual6 : torch.Size([1, 320, 7, 7])
after last_conv : torch.Size([1, 1280, 7, 7])
after avgpool : torch.Size([1, 1280, 1, 1])
after flatten : torch.Size([1, 1280])
after dropout : torch.Size([1, 1280])
after fc : torch.Size([1, 1000])
Alternatively, it is also possible to inspect our MobileNetV2 model using the summary() function from torchinfo, which will also show us the number of parameters contained in each layer. If you scroll all the way down to the end of the output, you'll see that this model with the default configuration has 3,505,960 trainable parameters. This number is different from the one disclosed in the paper, where according to Figure 7 it should be 3.4 million. However, if we go to the official PyTorch documentation [7], it says that the parameter count of this model is 3,504,872, which is very close to our implementation. Let me know in the comments if you know which parts of the code I should change to make this number match exactly with the one from PyTorch; the quick cross-check after the output below points at one likely culprit.
# Codeblock 12
mobilenetv2 = MobileNetV2()
summary(mobilenetv2, input_size=(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))
# Codeblock 12 Output
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
MobileNetV2 [1, 1000] --
├─Conv: 1-1 [1, 32, 112, 112] --
│ └─Conv2d: 2-1 [1, 32, 112, 112] 864
│ └─BatchNorm2d: 2-2 [1, 32, 112, 112] 64
│ └─ReLU6: 2-3 [1, 32, 112, 112] --
├─InvResidualS1: 1-2 [1, 16, 112, 112] --
│ └─Conv2d: 2-4 [1, 32, 112, 112] 1,024
│ └─BatchNorm2d: 2-5 [1, 32, 112, 112] 64
│ └─ReLU6: 2-6 [1, 32, 112, 112] --
│ └─Conv2d: 2-7 [1, 32, 112, 112] 288
│ └─BatchNorm2d: 2-8 [1, 32, 112, 112] 64
│ └─ReLU6: 2-9 [1, 32, 112, 112] --
│ └─Conv2d: 2-10 [1, 16, 112, 112] 512
│ └─BatchNorm2d: 2-11 [1, 16, 112, 112] 32
├─ModuleList: 1-3 -- --
│ └─InvResidualS2: 2-12 [1, 24, 56, 56] --
│ │ └─Conv2d: 3-1 [1, 96, 112, 112] 1,536
│ │ └─BatchNorm2d: 3-2 [1, 96, 112, 112] 192
│ │ └─ReLU6: 3-3 [1, 96, 112, 112] --
│ │ └─Conv2d: 3-4 [1, 96, 56, 56] 864
│ │ └─BatchNorm2d: 3-5 [1, 96, 56, 56] 192
│ │ └─ReLU6: 3-6 [1, 96, 56, 56] --
│ │ └─Conv2d: 3-7 [1, 24, 56, 56] 2,304
│ │ └─BatchNorm2d: 3-8 [1, 24, 56, 56] 48
│ └─InvResidualS1: 2-13 [1, 24, 56, 56] --
│ │ └─Conv2d: 3-9 [1, 144, 56, 56] 3,456
│ │ └─BatchNorm2d: 3-10 [1, 144, 56, 56] 288
│ │ └─ReLU6: 3-11 [1, 144, 56, 56] --
│ │ └─Conv2d: 3-12 [1, 144, 56, 56] 1,296
│ │ └─BatchNorm2d: 3-13 [1, 144, 56, 56] 288
│ │ └─ReLU6: 3-14 [1, 144, 56, 56] --
│ │ └─Conv2d: 3-15 [1, 24, 56, 56] 3,456
│ │ └─BatchNorm2d: 3-16 [1, 24, 56, 56] 48
├─ModuleList: 1-4 -- --
│ └─InvResidualS2: 2-14 [1, 32, 28, 28] --
│ │ └─Conv2d: 3-17 [1, 144, 56, 56] 3,456
│ │ └─BatchNorm2d: 3-18 [1, 144, 56, 56] 288
│ │ └─ReLU6: 3-19 [1, 144, 56, 56] --
│ │ └─Conv2d: 3-20 [1, 144, 28, 28] 1,296
│ │ └─BatchNorm2d: 3-21 [1, 144, 28, 28] 288
│ │ └─ReLU6: 3-22 [1, 144, 28, 28] --
│ │ └─Conv2d: 3-23 [1, 32, 28, 28] 4,608
│ │ └─BatchNorm2d: 3-24 [1, 32, 28, 28] 64
│ └─InvResidualS1: 2-15 [1, 32, 28, 28] --
│ │ └─Conv2d: 3-25 [1, 192, 28, 28] 6,144
│ │ └─BatchNorm2d: 3-26 [1, 192, 28, 28] 384
│ │ └─ReLU6: 3-27 [1, 192, 28, 28] --
│ │ └─Conv2d: 3-28 [1, 192, 28, 28] 1,728
│ │ └─BatchNorm2d: 3-29 [1, 192, 28, 28] 384
│ │ └─ReLU6: 3-30 [1, 192, 28, 28] --
│ │ └─Conv2d: 3-31 [1, 32, 28, 28] 6,144
│ │ └─BatchNorm2d: 3-32 [1, 32, 28, 28] 64
│ └─InvResidualS1: 2-16 [1, 32, 28, 28] --
│ │ └─Conv2d: 3-33 [1, 192, 28, 28] 6,144
│ │ └─BatchNorm2d: 3-34 [1, 192, 28, 28] 384
│ │ └─ReLU6: 3-35 [1, 192, 28, 28] --
│ │ └─Conv2d: 3-36 [1, 192, 28, 28] 1,728
│ │ └─BatchNorm2d: 3-37 [1, 192, 28, 28] 384
│ │ └─ReLU6: 3-38 [1, 192, 28, 28] --
│ │ └─Conv2d: 3-39 [1, 32, 28, 28] 6,144
│ │ └─BatchNorm2d: 3-40 [1, 32, 28, 28] 64
├─ModuleList: 1-5 -- --
│ └─InvResidualS2: 2-17 [1, 64, 14, 14] --
│ │ └─Conv2d: 3-41 [1, 192, 28, 28] 6,144
│ │ └─BatchNorm2d: 3-42 [1, 192, 28, 28] 384
│ │ └─ReLU6: 3-43 [1, 192, 28, 28] --
│ │ └─Conv2d: 3-44 [1, 192, 14, 14] 1,728
│ │ └─BatchNorm2d: 3-45 [1, 192, 14, 14] 384
│ │ └─ReLU6: 3-46 [1, 192, 14, 14] --
│ │ └─Conv2d: 3-47 [1, 64, 14, 14] 12,288
│ │ └─BatchNorm2d: 3-48 [1, 64, 14, 14] 128
│ └─InvResidualS1: 2-18 [1, 64, 14, 14] --
│ │ └─Conv2d: 3-49 [1, 384, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-50 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-51 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-52 [1, 384, 14, 14] 3,456
│ │ └─BatchNorm2d: 3-53 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-54 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-55 [1, 64, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-56 [1, 64, 14, 14] 128
│ └─InvResidualS1: 2-19 [1, 64, 14, 14] --
│ │ └─Conv2d: 3-57 [1, 384, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-58 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-59 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-60 [1, 384, 14, 14] 3,456
│ │ └─BatchNorm2d: 3-61 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-62 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-63 [1, 64, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-64 [1, 64, 14, 14] 128
│ └─InvResidualS1: 2-20 [1, 64, 14, 14] --
│ │ └─Conv2d: 3-65 [1, 384, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-66 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-67 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-68 [1, 384, 14, 14] 3,456
│ │ └─BatchNorm2d: 3-69 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-70 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-71 [1, 64, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-72 [1, 64, 14, 14] 128
├─ModuleList: 1-6 -- --
│ └─InvResidualS1: 2-21 [1, 96, 14, 14] --
│ │ └─Conv2d: 3-73 [1, 384, 14, 14] 24,576
│ │ └─BatchNorm2d: 3-74 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-75 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-76 [1, 384, 14, 14] 3,456
│ │ └─BatchNorm2d: 3-77 [1, 384, 14, 14] 768
│ │ └─ReLU6: 3-78 [1, 384, 14, 14] --
│ │ └─Conv2d: 3-79 [1, 96, 14, 14] 36,864
│ │ └─BatchNorm2d: 3-80 [1, 96, 14, 14] 192
│ └─InvResidualS1: 2-22 [1, 96, 14, 14] --
│ │ └─Conv2d: 3-81 [1, 576, 14, 14] 55,296
│ │ └─BatchNorm2d: 3-82 [1, 576, 14, 14] 1,152
│ │ └─ReLU6: 3-83 [1, 576, 14, 14] --
│ │ └─Conv2d: 3-84 [1, 576, 14, 14] 5,184
│ │ └─BatchNorm2d: 3-85 [1, 576, 14, 14] 1,152
│ │ └─ReLU6: 3-86 [1, 576, 14, 14] --
│ │ └─Conv2d: 3-87 [1, 96, 14, 14] 55,296
│ │ └─BatchNorm2d: 3-88 [1, 96, 14, 14] 192
│ └─InvResidualS1: 2-23 [1, 96, 14, 14] --
│ │ └─Conv2d: 3-89 [1, 576, 14, 14] 55,296
│ │ └─BatchNorm2d: 3-90 [1, 576, 14, 14] 1,152
│ │ └─ReLU6: 3-91 [1, 576, 14, 14] --
│ │ └─Conv2d: 3-92 [1, 576, 14, 14] 5,184
│ │ └─BatchNorm2d: 3-93 [1, 576, 14, 14] 1,152
│ │ └─ReLU6: 3-94 [1, 576, 14, 14] --
│ │ └─Conv2d: 3-95 [1, 96, 14, 14] 55,296
│ │ └─BatchNorm2d: 3-96 [1, 96, 14, 14] 192
├─ModuleList: 1-7 -- --
│ └─InvResidualS2: 2-24 [1, 160, 7, 7] --
│ │ └─Conv2d: 3-97 [1, 576, 14, 14] 55,296
│ │ └─BatchNorm2d: 3-98 [1, 576, 14, 14] 1,152
│ │ └─ReLU6: 3-99 [1, 576, 14, 14] --
│ │ └─Conv2d: 3-100 [1, 576, 7, 7] 5,184
│ │ └─BatchNorm2d: 3-101 [1, 576, 7, 7] 1,152
│ │ └─ReLU6: 3-102 [1, 576, 7, 7] --
│ │ └─Conv2d: 3-103 [1, 160, 7, 7] 92,160
│ │ └─BatchNorm2d: 3-104 [1, 160, 7, 7] 320
│ └─InvResidualS1: 2-25 [1, 160, 7, 7] --
│ │ └─Conv2d: 3-105 [1, 960, 7, 7] 153,600
│ │ └─BatchNorm2d: 3-106 [1, 960, 7, 7] 1,920
│ │ └─ReLU6: 3-107 [1, 960, 7, 7] --
│ │ └─Conv2d: 3-108 [1, 960, 7, 7] 8,640
│ │ └─BatchNorm2d: 3-109 [1, 960, 7, 7] 1,920
│ │ └─ReLU6: 3-110 [1, 960, 7, 7] --
│ │ └─Conv2d: 3-111 [1, 160, 7, 7] 153,600
│ │ └─BatchNorm2d: 3-112 [1, 160, 7, 7] 320
│ └─InvResidualS1: 2-26 [1, 160, 7, 7] --
│ │ └─Conv2d: 3-113 [1, 960, 7, 7] 153,600
│ │ └─BatchNorm2d: 3-114 [1, 960, 7, 7] 1,920
│ │ └─ReLU6: 3-115 [1, 960, 7, 7] --
│ │ └─Conv2d: 3-116 [1, 960, 7, 7] 8,640
│ │ └─BatchNorm2d: 3-117 [1, 960, 7, 7] 1,920
│ │ └─ReLU6: 3-118 [1, 960, 7, 7] --
│ │ └─Conv2d: 3-119 [1, 160, 7, 7] 153,600
│ │ └─BatchNorm2d: 3-120 [1, 160, 7, 7] 320
├─InvResidualS1: 1-8 [1, 320, 7, 7] --
│ └─Conv2d: 2-27 [1, 960, 7, 7] 153,600
│ └─BatchNorm2d: 2-28 [1, 960, 7, 7] 1,920
│ └─ReLU6: 2-29 [1, 960, 7, 7] --
│ └─Conv2d: 2-30 [1, 960, 7, 7] 8,640
│ └─BatchNorm2d: 2-31 [1, 960, 7, 7] 1,920
│ └─ReLU6: 2-32 [1, 960, 7, 7] --
│ └─Conv2d: 2-33 [1, 320, 7, 7] 307,200
│ └─BatchNorm2d: 2-34 [1, 320, 7, 7] 640
├─Conv: 1-9 [1, 1280, 7, 7] --
│ └─Conv2d: 2-35 [1, 1280, 7, 7] 409,600
│ └─BatchNorm2d: 2-36 [1, 1280, 7, 7] 2,560
│ └─ReLU6: 2-37 [1, 1280, 7, 7] --
├─AdaptiveAvgPool2d: 1-10 [1, 1280, 1, 1] --
├─Dropout: 1-11 [1, 1280] --
├─Linear: 1-12 [1, 1000] 1,281,000
==========================================================================================
Total params: 3,505,960
Trainable params: 3,505,960
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 313.65
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 113.28
Params size (MB): 14.02
Estimated Total Size (MB): 127.91
==========================================================================================
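As an optional cross-check (this snippet is my own addition and requires torchvision), we can compare the parameter count of our model with the torchvision reference. As far as I can tell, the gap comes from the very first bottleneck stage: since its expansion factor is 1, the reference implementation skips the first pointwise convolution entirely, whereas our InvResidualS1(in_channels=32, out_channels=16, t=1) still creates a 32→32 expansion conv plus its batch norm, i.e., 32×32 + 2×32 = 1,088 extra parameters, which is exactly the difference between 3,505,960 and 3,504,872.

# Optional cross-check against the torchvision reference implementation.
from torchvision.models import mobilenet_v2

def count_trainable_params(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

reference = mobilenet_v2()   # random weights are fine for counting parameters
ours = MobileNetV2()

print(count_trainable_params(reference))  # 3,504,872
print(count_trainable_params(ours))       # 3,505,960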
Ending
And that's pretty much everything about MobileNetV2. I encourage you to explore this architecture on your own, at least by actually training it on an image classification dataset. Don't forget to play around with the width multiplier and the input resolution parameters to find the right balance between prediction accuracy and computational efficiency. You can also find the code used in this article in my GitHub repository [8], by the way.
I hope you learned something new today. Thanks for reading!
References
[1] Muhammad Ardi. MobileNetV1 Paper Walkthrough: The Tiny Giant. Towards Data Science. https://towardsdatascience.com/the-tiny-giant-mobilenetv1/ [Accessed September 25, 2025].
[2] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. https://arxiv.org/abs/1704.04861 [Accessed April 7, 2025].
[3] Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv. https://arxiv.org/abs/1801.04381 [Accessed April 12, 2025].
[4] Kaiming He et al. Deep Residual Learning for Image Recognition. arXiv. https://arxiv.org/abs/1512.03385 [Accessed April 12, 2025].
[5] Zhuang Liu et al. A ConvNet for the 2020s. arXiv. https://arxiv.org/abs/2201.03545 [Accessed April 12, 2025].
[6] Image created originally by the author.
[7] mobilenet_v2. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html#mobilenet-v2 [Accessed April 12, 2025].
[8] MuhammadArdiPutra. The Smarter Tiny Giant — MobileNetV2. GitHub. medium_articles/The Smarter Tiny Giant — MobileNetV2.ipynb at main · MuhammadArdiPutra/medium_articles [Accessed April 12, 2025].