If you read the title of this article, you might assume that ResNeXt is directly derived from ResNet. That's true, but I believe it's not the whole story. To me, ResNeXt is more like a mix of ResNet, VGG, and Inception at the same time, and I'll show you why in a moment. In this article we are going to talk about the ResNeXt architecture, covering its history, the details of the architecture itself, and, last but not least, a from-scratch implementation in PyTorch.
The History of ResNeXt
The hyperparameters we usually focus on when tuning a neural network are depth and width, which correspond to the number of layers and the number of channels, respectively. We see this in VGG and ResNet, where the authors proposed small kernels and skip-connections so that the depth of the model can be increased easily. In theory, this simple approach does expand model capacity. However, both of these hyperparameter dimensions are always tied to a large change in the number of parameters, which becomes a problem once the model has to grow very large just to gain a slight improvement in accuracy. On the other hand, Inception is computationally cheaper in theory, yet it has a complex architectural design, which requires more effort when tuning its depth and width. If you have ever learned about Inception, it essentially works by passing a tensor through several convolution branches with different kernel sizes and letting the network decide which one best represents the features for a particular task.
Xie et al. wondered whether they could take the best parts of the three models, so that tuning stays as easy as in VGG and ResNet while keeping the efficiency of Inception. Their ideas are wrapped up in a paper titled "Aggregated Residual Transformations for Deep Neural Networks" [1], where they named the network ResNeXt. This is where a new concept called cardinality comes from: it essentially adopts the idea behind Inception, i.e., passing a tensor through multiple branches, but in a simpler, more scalable way. We can think of cardinality as a new parameter that can be tuned alongside depth and width. By doing so, we essentially gain a third hyperparameter dimension (the "next" dimension, hence the name ResNeXt), which gives us a higher degree of freedom when tuning the model.
ResNeXt Module
According to the paper, there are three equivalent ways to implement cardinality, which you can see in Figure 1 below. The paper also mentions that setting cardinality to 32 is a good default, since it generally provides a good balance between accuracy and computational complexity, so I'll use this number in the following example.
The input of the three modules above is exactly the same, i.e., an image tensor with 256 channels. In variant (a), the input tensor is duplicated 32 times, and each copy is processed independently, forming the 32 paths. The first convolution layer in each path projects the 256-channel input down to 4 channels using a 1×1 kernel, and is followed by two more layers: a 3×3 convolution that preserves the number of channels, and a 1×1 convolution that expands the channels back to 256. The tensors from the 32 branches are then aggregated by element-wise summation before finally being summed again, through the skip-connection, with the original input tensor from the very beginning of the module.
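To make variant (a) more concrete, here is a minimal sketch of that branch-and-sum structure. This is my own illustration rather than code from the paper: batch normalization and ReLU are omitted for brevity, and the names BlockVariantA and path_width are just illustrative.

import torch
import torch.nn as nn

class BlockVariantA(nn.Module):
    def __init__(self, channels=256, cardinality=32, path_width=4):
        super().__init__()
        # 32 independent paths, each: 1x1 (256->4), 3x3 (4->4), 1x1 (4->256)
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, path_width, kernel_size=1, bias=False),
                nn.Conv2d(path_width, path_width, kernel_size=3, padding=1, bias=False),
                nn.Conv2d(path_width, channels, kernel_size=1, bias=False),
            )
            for _ in range(cardinality)
        ])

    def forward(self, x):
        # element-wise sum over the 32 paths, then the skip-connection
        out = sum(path(x) for path in self.paths)
        return out + x

x = torch.randn(1, 256, 56, 56)
print(BlockVariantA()(x).shape)   # torch.Size([1, 256, 56, 56])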
Keep in mind that Inception uses the idea of split-transform-merge. This is exactly what I just described for ResNeXt block variant (a), where the split is done before the first 1×1 convolution layer, the transform is performed inside each branch, and the merge is the element-wise summation. The same idea applies to ResNeXt module variant (b), in which case the merge operation is performed by channel-wise concatenation, resulting in a 128-channel tensor (4 channels × 32 paths). The resulting tensor is then projected back to the original dimension by a 1×1 convolution layer before finally being summed with the original input tensor.
Notice that there’s a word within the top-left corner of the above figure. Which means these three ResNeXt block variants are mainly the same by way of the variety of parameters, FLOPs, and the resulting accuracy scores. This notion is smart because they’re all mainly derived from the identical mathematical formulation. I’ll talk more about it later in the following section. Despite this equivalency, I’ll go together with option (c) later within the implementation part. It’s because this variant employs the so-called , which is far easier to implement than (a) and (b). In case you’re not yet aware of the term, it is actually a way in a convolution operation where we divide all input channels into several groups wherein each of those is responsible to process channels inside the same group before eventually concatenating them. Within the case of (c), we reduce the variety of channels from 256 to 128 before the splitting is completed, allowing us to have 32 convolution kernel groups where each responsible to process 4 channels. We then project the tensor back to the unique variety of channels in order that we will sum it with the unique input tensor.
Mathematical Definition
As I mentioned earlier, here is what the formal mathematical definition of a ResNeXt module looks like.

y = x + \sum_{i=1}^{C} \mathcal{T}_i(x)

The above equation encapsulates the entire operation, where x is the original input tensor, y is the output tensor, C is the cardinality parameter that determines the number of parallel paths, T_i is the transformation function applied to the i-th path, and the summation sign indicates that we merge the information from all transformed tensors. It is important to note, however, that although the sigma usually denotes summation, only variant (a) literally sums the tensors. Variants (b) and (c) instead do the merging through concatenation followed by a 1×1 convolution, which is in fact still equivalent to (a).
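The claim that the concatenation-plus-1×1-convolution merge of variants (b) and (c) is equivalent to the summation of variant (a) can also be checked numerically. The sketch below is my own construction: it builds 32 per-path 1×1 convolutions, then a single 1×1 convolution whose kernel is simply the per-path kernels stacked along the input-channel axis.

import torch
import torch.nn as nn

torch.manual_seed(0)
C, width, out_ch = 32, 4, 256   # cardinality, path width, output channels

# Variant (a): each path ends with its own 1x1 conv (4 -> 256), results are summed.
path_convs = [nn.Conv2d(width, out_ch, kernel_size=1, bias=False) for _ in range(C)]
paths_out = [torch.randn(1, width, 56, 56) for _ in range(C)]   # pretend 3x3 outputs
sum_merge = sum(conv(t) for conv, t in zip(path_convs, paths_out))

# Variant (b): concatenate the 4-channel tensors (-> 128 channels), then one 1x1 conv
# whose weight is the per-path weights concatenated along the input-channel axis.
concat = torch.cat(paths_out, dim=1)
merged_conv = nn.Conv2d(C * width, out_ch, kernel_size=1, bias=False)
merged_conv.weight.data = torch.cat([c.weight.data for c in path_convs], dim=1)
concat_merge = merged_conv(concat)

print(torch.allclose(sum_merge, concat_merge, atol=1e-4))  # True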
The Entire ResNeXt Architecture
The structure displayed in Figure 1 and the equation in Figure 2 only correspond to a single ResNeXt block. In order to construct the entire architecture, we need to stack this block multiple times following the structure shown in Figure 3 below.

Here you’ll be able to see that the structure of ResNeXt is sort of equivalent to ResNet. So, I consider you’ll later find the ResNeXt implementation extremely easy, especially if you’ve ever implemented ResNet before. The primary difference you may notice within the architecture is the variety of kernels of the primary two convolution layers in each block, where the ResNeXt block generally has twice as many kernels as that of the corresponding ResNet block, especially ranging from the stage all of the strategy to the stage. Secondly, additionally it is clearly seen that we’ve the cardinality parameter applied to the second convolution layer in each ResNeXt block.
The ResNeXt variant implemented above, which is equivalent to ResNet-50, is the one known as ResNeXt-50 (32×4d). This naming convention indicates that the variant consists of 50 layers in the main branch, with cardinality 32 and 4 channels per path in the conv2 stage. As of this writing, there are three ResNeXt variants already implemented in PyTorch, namely resnext50_32x4d, resnext101_32x8d, and resnext101_64x4d [2]. You can easily import them along with their pretrained weights if you want. However, in this article we are going to implement the architecture from scratch instead.
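If you only need the pretrained model rather than a from-scratch build, loading one of the torchvision variants looks roughly like the following. This is a usage sketch that assumes a reasonably recent torchvision; see the documentation in [2] for the exact weight enums available in your version.

import torch
from torchvision import models

# downloads the ImageNet-1K weights the first time it is called
model = models.resnext50_32x4d(weights=models.ResNeXt50_32X4D_Weights.IMAGENET1K_V1)
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 1000])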
ResNeXt Implementation
As we’ve understood the underlying theory behind ResNeXt, let’s now get our hands dirty with the code! The very first thing we do is to import the required modules as shown in Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
from torchinfo import summary
Here I’m going to implement the variant. So, I would like to set the parameters in Codeblock 2 in line with the architectural details shown back in Figure 3.
# Codeblock 2
CARDINALITY = 32 #(1)
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048] #(2)
NUM_BLOCKS = [3, 4, 6, 3] #(3)
NUM_CLASSES = 1000 #(4)
The CARDINALITY variable at line #(1) is self-explanatory, so I don't think I need to explain it any further. Next, the NUM_CHANNELS variable stores the number of output channels of each stage, except for index 0, which corresponds to the number of input channels (#(2)). At line #(3), NUM_BLOCKS determines how many times we will repeat the corresponding block in each stage. Note that we don't specify any number for the conv1 stage since it only consists of a single block. Lastly, we set the NUM_CLASSES parameter to 1000 since ResNeXt was originally pretrained on the ImageNet-1K dataset (#(4)).
The ResNeXt Module
Since the entire ResNeXt architecture is basically just a stack of ResNeXt modules, we can create a single class that defines the module and then use it repeatedly inside the main class. In this case, I refer to the module as Block. The implementation of this class is pretty long, though, so I decided to break it down into several codeblocks. Just make sure that all codeblocks with the same number are placed inside the same notebook cell if you want to run the code.
You can see in Codeblock 3a below that the __init__() method of this class accepts several parameters. The in_channels parameter (#(1)) sets the number of channels of the tensor to be passed into the block. I made it adjustable because the blocks in different stages have different input shapes. Secondly, the add_channel and downsample parameters (#(2,4)) are flags that control whether the block performs downsampling. If you take a closer look at Figure 3, you will notice that every time we move from one stage to another, the number of output channels of the block becomes twice as large as the output of the previous stage, while at the same time the spatial dimension is reduced by half. We need to set both add_channel and downsample to True whenever we move from one stage to the next; otherwise, we set the two parameters to False when we only move from one block to another within the same stage. The channel_multiplier parameter (#(3)), on the other hand, determines the number of output channels relative to the number of input channels by changing the multiplication factor. This parameter is important because there is a special case where the number of output channels needs to be 4 times larger instead of 2, i.e., when we move from the conv1 stage (64 channels) to the conv2 stage (256 channels).
# Codeblock 3a
class Block(nn.Module):
    def __init__(self,
                 in_channels,           #(1)
                 add_channel=False,     #(2)
                 channel_multiplier=2,  #(3)
                 downsample=False):     #(4)
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample

        if self.add_channel:    #(5)
            out_channels = in_channels*self.channel_multiplier    #(6)
        else:
            out_channels = in_channels    #(7)

        mid_channels = out_channels//2    #(8)

        if self.downsample:    #(9)
            stride = 2    #(10)
        else:
            stride = 1
The parameters we just discussed directly control the if statements at lines #(5) and #(9). The former is executed whenever add_channel is True, in which case the number of input channels is multiplied by channel_multiplier to obtain the number of output channels (#(6)). When it is False, the input and output dimensions are kept the same (#(7)). Here we set mid_channels to half the size of out_channels (#(8)), because according to Figure 3 the number of channels in the output of the first two convolution layers inside each block is half that of the third convolution layer. Next, the downsample flag we defined earlier controls the if statement at line #(9). Whenever it is set to True, the stride variable is assigned the value 2 (#(10)), which will later cause the convolution layer to reduce the spatial dimension of the image by half.
Still inside the __init__() method, let's now define the layers within the ResNeXt block. See Codeblock 3b below for the details.
# Codeblock 3b
        if self.add_channel or self.downsample:    #(1)
            self.projection = nn.Conv2d(in_channels=in_channels,    #(2)
                                        out_channels=out_channels,
                                        kernel_size=1,
                                        stride=stride,
                                        padding=0,
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,    #(3)
                               out_channels=mid_channels,  #(4)
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,   #(5)
                               out_channels=mid_channels,
                               kernel_size=3,
                               stride=stride,              #(6)
                               padding=1,
                               bias=False,
                               groups=CARDINALITY)         #(7)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,   #(8)
                               out_channels=out_channels,  #(9)
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)

        self.relu = nn.ReLU()
Keep in mind that there are cases where the output dimension of a ResNeXt block differs from the input. In such a case, the element-wise summation at the last step cannot be performed (refer to Figure 1). That is the reason we need to initialize a projection layer whenever either the add_channel or downsample flag is True (#(1)). This projection layer (#(2)), which is a 1×1 convolution, processes the tensor in the skip-connection so that its output shape matches the tensor produced by the main flow, allowing the two to be summed. Otherwise, if we want the ResNeXt module to preserve the tensor dimension, we set both flags to False so that the projection layer is not initialized, since we can then directly sum the skip-connection with the tensor from the main flow.
The main flow of the ResNeXt module itself comprises three convolution layers, which I refer to as conv0, conv1 and conv2, written at lines #(3), #(5) and #(8) respectively. If we take a closer look at these layers, we can see that conv0 and conv2 are responsible for manipulating the number of channels. At lines #(3) and #(4), conv0 changes the number of channels from in_channels to mid_channels, while conv2 changes it from mid_channels to out_channels (#(8-9)). The conv1 layer, on the other hand, is responsible for adjusting the spatial dimension through the stride parameter (#(6)), whose value is set according to the downsample flag we discussed earlier. Additionally, this conv1 layer performs its entire operation as a group convolution (#(7)), where the number of groups corresponds to the cardinality of ResNeXt.
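To get a feeling for how much the groups argument saves, the quick comparison below (my own, using the 256-channel case of the conv3 stage) counts the weights of a dense 3×3 convolution versus its grouped counterpart.

import torch.nn as nn

# dense 3x3 conv vs. the grouped version used inside the ResNeXt block
dense   = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False, groups=32)

print(dense.weight.numel())    # 589824  (256*256*3*3)
print(grouped.weight.numel())  # 18432   (256*8*3*3), i.e. 1/32 of the dense conv

The 18,432 figure is exactly what the summary() output later reports for the grouped 3×3 layers in that stage.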
Furthermore, here we initialize batch normalization layers named bn_proj, bn0, bn1, and bn2. Later in the forward() method, we will place each of them right after the corresponding convolution layer, following the conv-BN-ReLU structure, which is a common practice when constructing a CNN-based model. Notice also that we call nn.init.kaiming_normal_() after initializing each convolution layer. This is done so that the initial layer weights follow the Kaiming normal distribution, as mentioned in the paper.
That was everything about the __init__() method; now we are going to move on to the forward() method to actually define the flow of the ResNeXt module. See Codeblock 3c below.
# Codeblock 3c
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        if self.add_channel or self.downsample:    #(1)
            residual = self.bn_proj(self.projection(x))    #(2)
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x    #(3)
            print(f'no projection\t\t: {residual.size()}')

        x = self.conv0(x)    #(4)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')

        x = self.conv2(x)    #(5)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')

        x = x + residual
        x = self.relu(x)    #(6)
        print(f'after summation\t\t: {x.size()}')

        return x
Here you’ll be able to see that this function accepts x
because the only input, wherein it is essentially a tensor produced by the previous ResNeXt block. The if
statement I write at line #(1)
checks whether we’re about to perform downsampling. In that case, the tensor within the skip-connection goes to be passed through the projection
layer and the corresponding batch normalization layer before eventually stored within the residual
variable (#(2)
). But when downsampling just isn’t performed, we’re going to set residual
to be the exact same as x
(#(3)
). Next, we are going to process the predominant tensor x
using the stack of convolution layers ranging from conv0
(#(4)
) all of the strategy to conv2
(#(5)
). It is crucial to notice that the structure of the conv2
layer is barely different, where the ReLU activation function is applied after element-wise summation is performed (#(6)
).
Now let’s test the ResNeXt block we just created to search out out whether we’ve implemented it accurately. There are three conditions I’m going to check here, namely after we move from one stage to a different (setting each add_channel
and downsample
to True
), after we move from one block to a different inside the same stage (each add_channel
and downsample
are False
), and after we move from stage to stage (setting downsample
to False
and add_channel
to True
with 4 channel multiplier).
Test Case 1
Codeblock 4 below demonstrates the first test case, in which I simulate the first block of the conv3 stage. If you go back to Figure 3, you will see that the output of the previous stage is a 256-channel image, so we need to set the in_channels parameter according to this number. Meanwhile, the output of a ResNeXt block in the conv3 stage has 512 channels with a 28×28 spatial dimension. This tensor shape transformation is exactly the reason we set both flags to True. Here we assume that the x tensor passed through the network is a dummy image produced by the conv2 stage.
# Codeblock 4
block = Block(in_channels=256, add_channel=True, downsample=True)
x = torch.randn(1, 256, 56, 56)
out = block(x)
And below is what the output looks like. At line #(1) you can see that our projection layer successfully projected the tensor to 512×28×28, exactly matching the shape of the output tensor from the main flow (#(4)). The conv0 layer at line #(2) does not alter the tensor dimension at all since in this case our in_channels and mid_channels are the same. The actual spatial downsampling is performed by the conv1 layer, where the image resolution is reduced from 56×56 to 28×28 (#(3)) thanks to the stride of 2 used in this case. The process then continues to the conv2 layer, which doubles the number of channels from 256 to 512 (#(4)). Lastly, this tensor is summed element-wise with the projected skip-connection tensor (#(5)). And with that, we successfully converted our tensor from 256×56×56 to 512×28×28.
# Codeblock 4 Output
original : torch.Size([1, 256, 56, 56])
after projection : torch.Size([1, 512, 28, 28]) #(1)
after conv0-bn0-relu : torch.Size([1, 256, 56, 56]) #(2)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28]) #(3)
after conv2-bn2 : torch.Size([1, 512, 28, 28]) #(4)
after summation : torch.Size([1, 512, 28, 28]) #(5)
Test Case 2
To demonstrate the second test case, here I simulate a block inside the conv3 stage whose input is a tensor produced by the previous block within the same stage. In such a case, we want the input and output dimensions of the ResNeXt module to be the same, hence we need to set both add_channel and downsample to False. See Codeblock 5 and the resulting output below for the details.
# Codeblock 5
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)
out = block(x)
# Codeblock 5 Output
original : torch.Size([1, 512, 28, 28])
no projection : torch.Size([1, 512, 28, 28]) #(1)
after conv0-bn0-relu : torch.Size([1, 256, 28, 28]) #(2)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
after conv2-bn2 : torch.Size([1, 512, 28, 28]) #(3)
after summation : torch.Size([1, 512, 28, 28])
As I’ve mentioned earlier, the projection layer just isn’t going for use if the input tensor just isn’t downsampled. That is the rationale that at line #(1)
we’ve our skip-connection tensor shape unchanged. Next, we’ve our channel count reduced to 256 by the conv0
layer since on this case mid_channels
is half the scale of out_channels
(#(2)
). We eventually expand this variety of channels back to 512 using the layer (#(3)
). Moreover, this type of structure is usually generally known as because it follows the pattern, which was first introduced in the unique ResNet paper [3].
Test Case 3
The third test is a special case, since we are about to simulate the first block in the conv2 stage, where we need to set the add_channel flag to True while keeping downsample set to False. We don't want to perform spatial downsampling inside the convolution layer here because it has already been done by the maxpooling layer. Furthermore, you can also see in Figure 3 that the conv1 stage returns an image with 64 channels. For this reason, we need to set the channel_multiplier parameter to 4, since we want this conv2 stage to return 256 channels. See the details in Codeblock 6 below.
# Codeblock 6
block = Block(in_channels=64, add_channel=True, channel_multiplier=4, downsample=False)
x = torch.randn(1, 64, 56, 56)
out = block(x)
# Codeblock 6 Output
original : torch.Size([1, 64, 56, 56])
after projection : torch.Size([1, 256, 56, 56]) #(1)
after conv0-bn0-relu : torch.Size([1, 128, 56, 56]) #(2)
after conv1-bn1-relu : torch.Size([1, 128, 56, 56])
after conv2-bn2 : torch.Size([1, 256, 56, 56]) #(3)
after summation : torch.Size([1, 256, 56, 56])
It’s seen within the resulting output above that the ResNeXt module routinely utilize the projection
layer, which on this case it successfully converted the 64×56×56 tensor into 256×56×56 (#(1)
). Here you’ll be able to see that the variety of channels expanded to be 4 times larger while the spatial dimension remained the identical. Afterwards, we shrink the channel count to 128 (#(2)
) and expand it back to 256 (#(3)
) to simulate the mechanism. Thus, we will now perform summation between the tensor from the predominant flow and the one produced by the projection
layer.
At this point we already got our ResNeXt module works properly to handle the three cases. So, I consider this module is now able to be assembled to truly construct the complete ResNeXt architecture.
The Entire ResNeXt Architecture
Since the following ResNeXt class is pretty long, I break it down into two codeblocks to make things easier to follow. What we basically need to do in the __init__() method in Codeblock 7a is initialize the ResNeXt modules using the Block class we created earlier. Implementing the conv3 (#(9)), conv4 (#(12)) and conv5 (#(15)) stages is pretty straightforward, since all we need to do is initialize their blocks inside an nn.ModuleList. Keep in mind that the first block inside each of these stages is a downsampling block, while the rest are not intended to perform downsampling. For this reason, we initialize the first block manually by setting both the add_channel and downsample flags to True (#(10,13,16)), whereas the remaining blocks are initialized in loops that iterate according to the numbers stored in the NUM_BLOCKS list (#(11,14,17)).
# Codeblock 7a
class ResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage    #(1)
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,    #(2)
                                       stride=2,         #(3)
                                       padding=3,
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight,
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,    #(4)
                                             stride=2,
                                             padding=1)

        # conv2 stage    #(5)
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,         #(6)
                  channel_multiplier=4,
                  downsample=False)         #(7)
        ])
        for _ in range(NUM_BLOCKS[0]-1):    #(8)
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage    #(9)
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],    #(10)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):    #(11)
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))

        # conv4 stage    #(12)
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],    #(13)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[2]-1):    #(14)
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))

        # conv5 stage    #(15)
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],    #(16)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[3]-1):    #(17)
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))    #(18)
        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],    #(19)
                            out_features=NUM_CLASSES)
As we discussed earlier, the conv2 stage (#(5)) is a bit special, since the first block inside this stage does increase the number of channels, yet it does not reduce the spatial dimension. This is the reason I set the add_channel parameter to True (#(6)) while the downsample parameter is set to False (#(7)). The initialization of the remaining blocks is the same as in the other stages, where we can just do it with a simple loop (#(8)).
The conv1 stage (#(1)), on the other hand, does not use the Block class, since its structure is completely different from the other stages. According to Figure 3, this stage only contains a single 7×7 convolution layer (#(2)), which allows us to capture a larger context from the input image. The tensor produced by this layer has half the spatial dimensions of the input thanks to the stride parameter, which is set to 2 (#(3)). Further downsampling is performed by a maxpooling layer with the same stride, which again reduces the spatial dimension by half (#(4)). In fact, according to Figure 3 this maxpooling layer should belong to the conv2 stage, but in this implementation I put it outside the nn.ModuleList of that stage for the sake of simplicity.
Lastly, we need to initialize a global average pooling layer (#(18)), which works by taking the average value of each channel in the tensor produced by the last convolution layer. By doing this, we end up with a single number representing each channel. This tensor is then connected to the output layer that produces NUM_CLASSES (1000) neurons (#(19)), each of which corresponds to a class in the dataset.
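As a quick standalone illustration (not part of the model class), this is what the pooling, flattening, and classification head do to the tensor shape:

import torch
import torch.nn as nn

feat = torch.randn(1, 2048, 7, 7)                          # output of the conv5 stage
pooled = nn.AdaptiveAvgPool2d(output_size=(1, 1))(feat)    # -> (1, 2048, 1, 1)
flat = torch.flatten(pooled, start_dim=1)                  # -> (1, 2048)
logits = nn.Linear(2048, 1000)(flat)                       # -> (1, 1000)
print(pooled.shape, flat.shape, logits.shape)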
Now take a look at Codeblock 7b below to see how I define the forward() method. I don't think there is much to explain here, since what we basically do is just pass the tensor from one layer to the next sequentially.
# Codeblock 7b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1\t: {x.size()}')

        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}\t: {x.size()}')

        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')

        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')

        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')

        return x
Next, let’s test our ResNeXt class using the next code. Here I’m going to check it by passing a dummy tensor of size 3×224×224 which simulates a single RGB image of size 224×224.
# Codeblock 8
resnext = ResNeXt()
x = torch.randn(1, 3, 224, 224)
out = resnext(x)
# Codeblock 8 Output
original : torch.Size([1, 3, 224, 224])
after resnext_conv1 : torch.Size([1, 64, 112, 112]) #(1)
after resnext_maxpool1 : torch.Size([1, 64, 56, 56]) #(2)
after resnext_conv2 #0 : torch.Size([1, 256, 56, 56]) #(3)
after resnext_conv2 #1 : torch.Size([1, 256, 56, 56]) #(4)
after resnext_conv2 #2 : torch.Size([1, 256, 56, 56]) #(5)
after resnext_conv3 #0 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3 : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5 : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2 : torch.Size([1, 2048, 7, 7])
after avgpool : torch.Size([1, 2048, 1, 1]) #(6)
after flatten : torch.Size([1, 2048]) #(7)
after fc : torch.Size([1, 1000]) #(8)
We can see in the output above that our conv1 stage correctly reduces the spatial dimension from 224×224 to 112×112 while at the same time increasing the number of channels to 64 (#(1)). The downsampling is continued by the maxpooling layer, which reduces the spatial dimension of the image to 56×56 (#(2)). Moving on to the conv2 stage, we can see that the first block in this stage successfully converts the 64-channel image into 256 channels (#(3)), while the following blocks in the same stage preserve the dimensions of this tensor (#(4-5)). The same is done by the subsequent stages until we reach the global average pooling layer (#(6)). It is important to note that we need to flatten the tensor (#(7)) to drop the empty axes before finally connecting it to the output layer (#(8)). And that concludes how a tensor flows through the ResNeXt architecture.
Additionally, you can use the summary() function that we previously loaded from torchinfo if you want to dig even deeper into the architectural details. You can see at the end of the output below that we got 25,028,904 parameters in total. In fact, this number of parameters matches exactly the one belonging to the resnext50_32x4d model from PyTorch, so I believe our implementation here is correct. You can confirm this via the link at reference number [4].
# Codeblock 9
resnext = ResNeXt()
summary(resnext, input_size=(1, 3, 224, 224))
# Codeblock 9 Output
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
ResNeXt [1000] --
├─Conv2d: 1-1 [1, 64, 112, 112] 9,408
├─BatchNorm2d: 1-2 [1, 64, 112, 112] 128
├─ReLU: 1-3 [1, 64, 112, 112] --
├─MaxPool2d: 1-4 [1, 64, 56, 56] --
├─ModuleList: 1-5 -- --
│ └─Block: 2-1 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-1 [1, 256, 56, 56] 16,384
│ │ └─BatchNorm2d: 3-2 [1, 256, 56, 56] 512
│ │ └─Conv2d: 3-3 [1, 128, 56, 56] 8,192
│ │ └─BatchNorm2d: 3-4 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-5 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-6 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-7 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-8 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-9 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-10 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-11 [1, 256, 56, 56] --
│ └─Block: 2-2 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-12 [1, 128, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-13 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-14 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-15 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-16 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-17 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-18 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-19 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-20 [1, 256, 56, 56] --
│ └─Block: 2-3 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-21 [1, 128, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-22 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-23 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-24 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-25 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-26 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-27 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-28 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-29 [1, 256, 56, 56] --
├─ModuleList: 1-6 -- --
│ └─Block: 2-4 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-30 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-31 [1, 512, 28, 28] 1,024
│ │ └─Conv2d: 3-32 [1, 256, 56, 56] 65,536
│ │ └─BatchNorm2d: 3-33 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-34 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-35 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-36 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-37 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-38 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-39 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-40 [1, 512, 28, 28] --
│ └─Block: 2-5 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-41 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-42 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-43 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-44 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-45 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-46 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-47 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-48 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-49 [1, 512, 28, 28] --
│ └─Block: 2-6 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-50 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-51 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-52 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-53 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-54 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-55 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-56 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-57 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-58 [1, 512, 28, 28] --
│ └─Block: 2-7 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-59 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-60 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-61 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-62 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-63 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-64 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-65 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-66 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-67 [1, 512, 28, 28] --
├─ModuleList: 1-7 -- --
│ └─Block: 2-8 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-68 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-69 [1, 1024, 14, 14] 2,048
│ │ └─Conv2d: 3-70 [1, 512, 28, 28] 262,144
│ │ └─BatchNorm2d: 3-71 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-72 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-73 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-74 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-75 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-76 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-77 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-78 [1, 1024, 14, 14] --
│ └─Block: 2-9 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-79 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-80 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-81 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-82 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-83 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-84 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-85 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-86 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-87 [1, 1024, 14, 14] --
│ └─Block: 2-10 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-88 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-89 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-90 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-91 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-92 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-93 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-94 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-95 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-96 [1, 1024, 14, 14] --
│ └─Block: 2-11 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-97 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-98 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-99 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-100 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-101 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-102 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-103 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-104 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-105 [1, 1024, 14, 14] --
│ └─Block: 2-12 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-106 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-107 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-108 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-109 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-110 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-111 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-112 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-113 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-114 [1, 1024, 14, 14] --
│ └─Block: 2-13 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-115 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-116 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-117 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-118 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-119 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-120 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-121 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-122 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-123 [1, 1024, 14, 14] --
├─ModuleList: 1-8 -- --
│ └─Block: 2-14 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-124 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-125 [1, 2048, 7, 7] 4,096
│ │ └─Conv2d: 3-126 [1, 1024, 14, 14] 1,048,576
│ │ └─BatchNorm2d: 3-127 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-128 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-129 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-130 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-131 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-132 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-133 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-134 [1, 2048, 7, 7] --
│ └─Block: 2-15 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-135 [1, 1024, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-136 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-137 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-138 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-139 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-140 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-141 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-142 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-143 [1, 2048, 7, 7] --
│ └─Block: 2-16 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-144 [1, 1024, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-145 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-146 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-147 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-148 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-149 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-150 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-151 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-152 [1, 2048, 7, 7] --
├─AdaptiveAvgPool2d: 1-9 [1, 2048, 1, 1] --
├─Linear: 1-10 [1, 1000] 2,049,000
==========================================================================================
Total params: 25,028,904
Trainable params: 25,028,904
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 6.28
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 230.42
Params size (MB): 100.12
Estimated Total Size (MB): 331.13
==========================================================================================
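As an extra sanity check (assuming torchvision is installed), we can also compare the parameter count of our from-scratch model against torchvision's resnext50_32x4d mentioned in [4]; both should report the same 25,028,904.

from torchvision import models

# compare total parameter counts of the two implementations
ours = sum(p.numel() for p in ResNeXt().parameters())
theirs = sum(p.numel() for p in models.resnext50_32x4d().parameters())
print(ours, theirs)   # expected: 25028904 25028904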
Ending
I believe that’s all the pieces about ResNeXt and its implementation. You may as well find the complete code utilized in this text on my GitHub repo [5].
I hope you learn something latest today, and thanks very much for reading! See you in my next article.
References
[1] Saining Xie et al. Aggregated Residual Transformations for Deep Neural Networks. Arxiv. https://arxiv.org/abs/1611.05431 [Accessed March 1, 2025].
[2] ResNeXt. PyTorch. https://pytorch.org/vision/main/models/resnext.html [Accessed March 1, 2025].
[3] Kaiming He et al. Deep Residual Learning for Image Recognition. Arxiv. https://arxiv.org/abs/1512.03385 [Accessed March 1, 2025].
[4] resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 1, 2025].
[5] MuhammadArdiPutra. Taking ResNet to the NeXt Level — ResNeXt. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/Taking%20ResNet%20to%20the%20NeXt%20Level%20-%20ResNeXt.ipynb [Accessed April 7, 2025].