The Channel-Wise Attention | Squeeze and Excitation


When we talk about attention in computer vision, the first thing that probably comes to mind is the mechanism used in the Vision Transformer (ViT) architecture. However, that is not the only attention mechanism we have for image data. There is another one called the Squeeze and Excitation Network (SENet). While the attention in ViT operates spatially, i.e., assigning weights to different patches of an image, the attention mechanism proposed in SENet operates in a channel-wise manner, i.e., assigning weights to different channels. In this article, we are going to discuss how the Squeeze and Excitation module works, how to implement it from scratch, and how to integrate it into the ResNeXt model.


The Squeeze and Excitation Module

SENet, which was first proposed in a paper titled "Squeeze-and-Excitation Networks" by Hu et al. [1], is not a standalone network like VGG, Inception, or ResNet. Instead, it is a building block to be placed on top of an existing network. In CNN-based models, we assume that pixels spatially close to each other are highly correlated, which is the reason we employ small kernels to capture these correlations. This assumption is essentially the inductive bias of CNNs. On the other hand, SENet introduces a new inductive bias, where the authors assume that each image channel contributes differently to predicting a particular class. By applying SE modules to a CNN, the model not only relies on spatial patterns but also captures the importance of each channel. To better illustrate this, consider an image of fire, where the red channel would theoretically contribute more to the final prediction than the blue and green channels.

The structure of the SE module itself is shown in Figure 1. As the name of the network suggests, there are two main steps performed in this module: squeeze and excitation. The squeeze part corresponds to the operation denoted as F_sq, while the excitation part includes both F_ex and F_scale. On the other hand, the F_tr operation is actually not part of the SE module. Rather, it represents a transformation function that originally belongs to the model where the SE module is applied. For example, if we were to place this SE module on ResNet, the F_tr operation refers to the stack of convolution layers within the bottleneck block.

Figure 1. The structure of the Squeeze and Excitation module [1].

Talking more specifically about the squeeze operation, it essentially works by applying a global average pooling mechanism, which captures information from the entire spatial extent of each channel. By doing so, every channel of the input tensor is represented by a single number, which is simply the average value of the corresponding channel. The authors refer to this operation as squeeze. Mathematically, this can be written as the equation shown in Figure 2, where we basically sum all values across the height and width before dividing by the number of pixels in that channel (H × W).

Figure 2. The mathematical expression of the global average pooling mechanism in the SE module [1].
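Since the figure itself is not reproduced here, the squeeze equation can be written out as follows (reconstructed from the description above, using the paper's notation where u_c denotes the c-th channel of the transformed feature map):

z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)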

Meanwhile, the excitation and scaling operations together are referred to as adaptive recalibration, since what they essentially do is dynamically adjust the weighting of each channel in the input tensor according to its importance. In fact, the diagram in Figure 1 does not completely depict the entire SENet architecture. You can see in the figure that the excitation appears to be a single operation, yet it actually consists of two linear layers, each followed by an activation function. See Figure 3 below for the details.

Figure 3. The mathematical formulation of the excitation operation [1].
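For reference, the excitation operation in Figure 3 can be written out as follows (reconstructed using the paper's notation, where z is the squeezed vector):

s = F_{ex}(z, W) = \sigma \big( W_2 \, \delta(W_1 z) \big)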

The two linear layers are denoted as W_1 and W_2, whereas δ and σ represent the ReLU and sigmoid activation functions, respectively. So, based on this mathematical definition, what we basically need to do later in the implementation is to pass the tensor z (the average-pooled tensor) through the first linear layer, followed by the ReLU activation function, the second linear layer, and finally the sigmoid activation function. Remember that the sigmoid function normalizes input values to the range of 0 to 1. In this case, we will interpret the resulting output as the weight of each channel, where a value close to 1 indicates that the corresponding channel contains important information, hence we allow the model to pay more attention to that channel. Otherwise, if the resulting number is close to 0, it indicates that the corresponding channel does not contribute that much to the output.

In order to utilize these channel weights, we can perform the scaling operation F_scale, which is essentially just a multiplication of the original tensor u_c and the weight s_c, as shown in Figure 4 below. By doing this, we essentially retain the values within the important channels while at the same time suppressing the values of the unimportant ones.

Figure 4. The scaling process is just a multiplication of the original tensor and the channel weights [1].
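Written out per channel (again using the paper's notation), the scaling step is simply:

\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c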

By the way, sorry for getting a bit too mathy here, lol. But I believe this will help you understand the code later in the implementation section.

Where to Put the SE Module

Applying the SE module to a plain CNN model like VGG is simple, as we can just place it right after each convolution layer. However, it is not as straightforward in the case of Inception or ResNet due to the presence of parallel branches in these two networks. To address this, the authors provide a guide for implementing the SE module on these two models, as shown in Figure 5 below.

Figure 5. Where the SE module is placed in Inception and ResNet [1].

For the Inception model, instead of placing the SE module right after each convolution layer, we pass the input tensor through the entire Inception block (including all the branches inside) and then attach the SE module afterwards. The same approach also works for ResNet, but keep in mind that the summation between the tensor in the skip connection and the main flow happens after the main tensor has been processed by the SE module.

As I mentioned earlier, the excitation stage essentially consists of two linear layers. If we take a closer look at the structure above, we can see that the output shape of the first linear layer is 1×1×C/r. The variable r is known as the reduction ratio, which reduces the dimensionality of the weight tensor before it is eventually projected back to 1×1×C by the second linear layer. The dimensionality reduction done by the first layer acts as a bottleneck, which is useful for limiting model complexity and improving generalization. The authors conducted experiments with different r values, and they found that r = 16 produces the best balance between accuracy and complexity.
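To get a feel for why the bottleneck matters, the quick back-of-the-envelope calculation below (my own illustration, not from the paper) compares the number of weights in the two fully-connected layers for C = 512 with and without the reduction:

# Number of weights in the two FC layers of an SE module (biases disabled).
C = 512   # number of channels entering the SE module
r = 16    # reduction ratio

with_bottleneck = C * (C // r) + (C // r) * C   # W1: C x C/r, W2: C/r x C
without_bottleneck = C * C + C * C              # hypothetical case without reduction (r = 1)

print(with_bottleneck)     # 32768
print(without_bottleneck)  # 524288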

Figure 6. Several possible ways to attach the SE module in ResNet [1].

When it comes to implementing the SE module in ResNet, Figure 6 shows that there are actually several ways to do so. According to the experimental results in Figure 7, the standard SE, SE-PRE, and SE-Identity blocks obtained similar results, while all of them outperformed SE-POST by a significant margin. This indicates that the position of the SE module affects model performance in terms of accuracy. Based on these findings, the authors argue that we will obtain good results as long as we apply the SE module before the element-wise summation operation. Later in the coding section, I am going to demonstrate how to implement the standard SE block. (A compact sketch of the four integration strategies is given right after Figure 7.)

Figure 7. Experimental results on different SE module integration strategies [1].
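To make the four variants concrete, here is a rough sketch of the ordering of operations in each (my own pseudocode-style summary of Figure 6, not code from the paper; residual_branch stands for the ResNet block's convolutional branch and se for the SE module):

# Hypothetical helper names; this only illustrates the ordering, not a full implementation.
def standard_se(x, residual_branch, se):
    out = se(residual_branch(x))       # SE applied to the main branch, before the summation
    return out + x

def se_pre(x, residual_branch, se):
    return residual_branch(se(x)) + x  # SE applied before the convolutions

def se_identity(x, residual_branch, se):
    return residual_branch(x) + se(x)  # SE applied on the skip-connection branch

def se_post(x, residual_branch, se):
    return se(residual_branch(x) + x)  # SE applied after the summation (the weakest variant in Figure 7)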

More Experimental Results

There are actually a lot more experimental results discussed in the paper. One of them is a table showing accuracy improvements when the SE module is applied to existing CNN-based models. The table I am referring to is displayed in Figure 8 below.

Figure 8. Experimental results on applying SE module on different models [1][2].

The columns highlighted in blue represent the error rates of each model, and those in pink indicate the computational complexity measured in GFLOPs. The re-implementation column refers to the plain model that the authors implemented themselves, whereas the SENet column represents the same model equipped with the SE module. The table clearly shows that both top-1 and top-5 errors decrease when the SE module is applied. It is important to note that although adding the SE module increases the GFLOPs, this increase is marginal compared to the reduction in error rate.

Next, we can actually reveal interesting insights by printing out the values contained in the SE modules during the inference phase. Let's take a look at the charts in Figure 9 below to better illustrate this. The x-axis of these charts denotes the channel index, the y-axis represents how much weight each channel has according to its importance, and the color of the lines indicates the class being predicted.

Figure 9. What the activations of SE modules look like at different network depths [1].

In shallower layers, the features captured by the SE module are class-agnostic, which basically means that they capture generic information required to predict all classes. The charts labeled (a) and (b), which are the SE modules from ResNet stages 2 and 3, show that there is not much difference in channel activity from one class to another, indicating that these two modules do not capture information regarding a specific class. The case is different for the SE modules in deeper layers, i.e., those in stage 4 (c) and stage 5 (d). We can see that these two modules adjust channel weights differently depending on the class being predicted. This is essentially the reason that the SE modules in deeper layers are said to be class-specific. However, the authors acknowledge that there is unusual behavior happening in some of the SE modules, for instance in the 2nd block of stage 5 (e). Here the SE module does not show meaningful channel recalibration behavior, indicating that it does not contribute as much as the ones we discussed earlier.
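If you want to reproduce this kind of analysis with the implementation we build later, one possible approach (a sketch of my own, not code from the paper) is to attach a forward hook to the sigmoid layer inside each SE module and record its output during inference:

import torch

# Assumes the SEModule class defined later in this article (Codeblock 2),
# where the channel weights are the output of the `sigmoid` layer.
se_activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output has shape (batch_size, num_channels); store a copy per SE module
        se_activations[name] = output.detach().cpu()
    return hook

# Example usage with a trained model `seresnext` (built later in this article):
# for name, module in seresnext.named_modules():
#     if isinstance(module, SEModule):
#         module.sigmoid.register_forward_hook(make_hook(name))
#
# with torch.no_grad():
#     seresnext(images)          # `images` is a batch from your dataset
# print(se_activations.keys())   # channel weights per SE module, ready to plot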

The Detailed Architecture

In this text we’re going to implement the model, which in Figure 10 it corresponds to the one within the rightmost column. The ResNeXt model itself is analogous to ResNet, except that the group parameter of the second convolution layer inside each block is about to 32. In case you’re conversant in ResNeXt, this is basically the best yet effective option to implement the so-called . I like to recommend you read my previous article about ResNeXt in case you will not be yet conversant in it, which the link is provided at reference number [3] at the tip of this text.

Taking a closer look at the architecture, what differentiates SE-ResNet-50 from ResNet-50 is simply the presence of SE modules. The same also applies to SE-ResNeXt-50 compared to ResNeXt-50 (not displayed in the table). Notice in the figure below that the models with SE modules have an fc layer attached after the last convolution layer inside each block, where the corresponding two numbers indicate the output sizes of the first and second fully-connected layers inside the SE module.

Figure 10. The whole architecture of ResNet-50, SE-ResNet-50 and SE-ResNeXt-50 (32×4d) [1].
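As a quick sanity check on those fc numbers (my own illustration, not from the paper), the two values per stage simply follow from the block's output channels and a reduction ratio of 16:

# Output channels of the blocks in stages conv2 through conv5 of (SE-)ResNet-50 / (SE-)ResNeXt-50.
stage_channels = [256, 512, 1024, 2048]
r = 16

for c in stage_channels:
    print(f"fc, [{c // r}, {c}]")   # e.g., "fc, [16, 256]" for the conv2 stage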

From Scratch Implementation

Do not forget that here we’re about to integrate the SE module on ResNeXt, so we’d like to implement each of them from scratch. Technically speaking, it is definitely possible to take the ResNeXt architecture directly from PyTorch, then manually attach the SE module on it. Nonetheless, here I made a decision to make use of the ResNeXt implementation from my previous article as a substitute since I feel prefer it is rather a lot easier to know than the one from PyTorch. Note that here I’ll concentrate on constructing the SE module and the right way to attach it to the ResNeXt model moderately than explaining the ResNeXt itself since I’ve already covered it in that article [3].

Now let’s start the code by importing the required modules.

# Codeblock 1
import torch
import torch.nn as nn

Squeeze and Excitation Module

The following SE module implementation follows the diagram shown in Figure 5 (right). It is worth noting that the SEModule class below does not include the skip connection (the curved arrow), as the entire SE module is applied after the initial branching but before the merging (summation).

The __init__() method of this class accepts two parameters: num_channels and r, as shown at line #(1) in Codeblock 2a. We definitely want this SE module to be usable throughout the entire network. So, we need to make the num_channels parameter adjustable, since the number of output channels varies across ResNeXt blocks at different stages, as shown back in Figure 10. Meanwhile, although we typically use the same reduction ratio for all SE modules in the network, it is technically possible to use a different r for each stage, which could be an interesting thing to experiment with. This is essentially the reason that I also made the r parameter adjustable.

# Codeblock 2a
class SEModule(nn.Module):
    def __init__(self, num_channels, r):                     #(1)
        super().__init__()
        
        self.global_pooling = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(2)
        self.fc0 = nn.Linear(in_features=num_channels,       #(3)
                             out_features=num_channels//r, 
                             bias=False)
        self.relu = nn.ReLU()                                #(4)
        self.fc1 = nn.Linear(in_features=num_channels//r,    #(5)
                             out_features=num_channels, 
                             bias=False)
        self.sigmoid = nn.Sigmoid()                          #(6)

There are 5 layers we’d like to initialize contained in the __init__() method. I write them down in accordance with the sequence given in Figure 5, i.e., global average pooling layer (#(2)), linear layer (#(3)), ReLU activation function (#(4)), one other linear layer (#(5)), and sigmoid activation function (#(6)). Here you’ll be able to see that the primary linear layer is responsible to perform dimensionality reduction by shrinking the variety of channels from num_channels to num_channels//r, which is able to then be expanded back to num_channels by the second linear layer. Note that we set the bias term of each linear layers to False, which essentially means that we’ll only utilize the burden tensors. The absence of bias terms within the two layers forces the SE module to learn the correlation between one channel to the others moderately than simply adding fixed adjustments.

Still within the SEModule class, let's now move on to the forward() method to define the flow of the network. You can see at line #(1) in Codeblock 2b that we start from a single input x, which in the case of ResNeXt is essentially the tensor produced by the third convolution layer within the same ResNeXt block. As shown in Figure 5, what we need to do next is branch out the network. Here we directly process the branch using the global_pooling layer, and I name the resulting tensor squeezed (#(2)). The original input tensor x itself is left as is, since we are not going to perform any operation on it until the scaling phase. Next, we need to drop the spatial dimensions of the squeezed tensor using torch.flatten() (#(3)). This is done because we want to process it further with the linear layers at lines #(4) and #(5), which expect the channel values laid out along a single dimension per sample. The spatial dimensions are then introduced again at line #(6), allowing us to perform the multiplication between x (the original tensor) and excited (the channel weights) at line #(7). This whole process produces a recalibrated version of x, which we refer to as scaled. Here I print out the tensor dimension after each step so that you can better understand the flow of this SE module.

# Codeblock 2b
    def forward(self, x):                                  #(1)
        print(f'original\t\t: {x.size()}')
        
        squeezed = self.global_pooling(x)                  #(2)
        print(f'after avgpool\t\t: {squeezed.size()}')
        
        squeezed = torch.flatten(squeezed, 1)              #(3)
        print(f'after flatten\t\t: {squeezed.size()}')
        
        excited = self.relu(self.fc0(squeezed))            #(4)
        print(f'after fc0-relu\t\t: {excited.size()}')
        
        excited = self.sigmoid(self.fc1(excited))          #(5)
        print(f'after fc1-sigmoid\t: {excited.size()}')
        
        excited = excited[:, :, None, None]                #(6)
        print(f'after reshape\t\t: {excited.size()}')
        
        scaled = x * excited                               #(7)
        print(f'after scaling\t\t: {scaled.size()}')
        
        return scaled

Now we’re going to see if we’ve got implemented the network appropriately by passing a dummy tensor through it. In Codeblock 3 below, I initialize an SE module and configure it to simply accept a picture tensor of 512 channels and has a discount ratio of 16 (#(1)). In case you take a take a look at the SE-ResNeXt architecture in Figure 10, this SE module mainly corresponds to the one within the third stage (which the output size is 28×28). Thus, at line #(2) we’d like to regulate the form of the dummy tensor accordingly. We then feed this tensor into the network using the code at line #(3).

# Codeblock 3
semodule = SEModule(num_channels=512, r=16)    #(1)
x = torch.randn(1, 512, 28, 28)                #(2)

out = semodule(x)      #(3)

And below is what the print functions give us.

# Codeblock 3 Output
original          : torch.Size([1, 512, 28, 28])    #(1)
after avgpool     : torch.Size([1, 512, 1, 1])      #(2)
after flatten     : torch.Size([1, 512])            #(3)
after fc0-relu    : torch.Size([1, 32])             #(4)
after fc1-sigmoid : torch.Size([1, 512])            #(5)
after reshape     : torch.Size([1, 512, 1, 1])      #(6)
after scaling     : torch.Size([1, 512, 28, 28])    #(7)

You can see that the original tensor shape matches our dummy tensor exactly, i.e., 1×512×28×28 (#(1)). By the way, we can ignore the 1 in the 0th axis, since it simply denotes the batch size, which in this case I assume to be a single image per batch. After being pooled, the spatial dimensions collapse to 1×1, since each channel is now represented by a single number (#(2)). The purpose of the flatten operation I explained earlier is to drop the two singleton axes (#(3)), since the subsequent linear layers expect a (batch, channels) tensor. Here you can see that the first linear layer reduces the channel dimension to 32 thanks to the reduction ratio, which we previously set to 16 (#(4)). The length of this tensor is then expanded back to 512 by the second linear layer (#(5)). Next, we unsqueeze the tensor so that we get our 1×1 spatial dimensions back (#(6)), allowing us to multiply it with the input tensor (#(7)). Based on this detailed flow, you can see that an SE module basically preserves the original tensor dimensions, proving that this module can be attached to any CNN-based model without disrupting the original flow of the network.
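Just to reinforce that last point, here is a tiny check (my own addition, assuming the SEModule class from Codeblock 2 is already defined) confirming that the module preserves the input shape for a few different channel counts and spatial sizes:

# Verify that SEModule preserves the input shape for various configurations.
for channels, size in [(256, 56), (512, 28), (1024, 14), (2048, 7)]:
    module = SEModule(num_channels=channels, r=16)
    dummy = torch.randn(1, channels, size, size)
    assert module(dummy).shape == dummy.shape   # the shape is always preserved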

ResNeXt

As we’ve got understood the right way to implement SE module from scratch, now that I’m going to point out you ways we will attach it on a ResNeXt model. Before doing so, we’d like to initialize the parameters required to implement the ResNeXt architecture. Within the Codeblock 4 below the primary 4 variables are determined in accordance with the variant, whereas the last one (R) represents the reduction ratio for the SE modules.

# Codeblock 4
CARDINALITY  = 32
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048]
NUM_BLOCKS   = [3, 4, 6, 3]
NUM_CLASSES  = 1000
R = 16

The Block class defined in Codeblocks 5a and 5b is the ResNeXt block from my previous article. There are actually a lot of things we do inside the __init__() method, but the general idea is that we initialize three convolution layers called conv0 (#(1)), conv1 (#(2)), and conv2 (#(3)) before initializing the SE module at line #(4). We will later configure these layers according to the SE-ResNeXt architecture shown back in Figure 10.

# Codeblock 5a
class Block(nn.Module):
    def __init__(self, 
                 in_channels,
                 add_channel=False,
                 channel_multiplier=2,
                 downsample=False):
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample
        
        
        if self.add_channel:
            out_channels = in_channels*self.channel_multiplier
        else:
            out_channels = in_channels
        
        mid_channels = out_channels//2
        
        
        if self.downsample:
            stride = 2
        else:
            stride = 1
            

        if self.add_channel or self.downsample:
            self.projection = nn.Conv2d(in_channels=in_channels,
                                        out_channels=out_channels, 
                                        kernel_size=1, 
                                        stride=stride, 
                                        padding=0, 
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,       #(1)
                               out_channels=mid_channels,
                               kernel_size=1, 
                               stride=1, 
                               padding=0, 
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,      #(2)
                               out_channels=mid_channels, 
                               kernel_size=3, 
                               stride=stride,
                               padding=1, 
                               bias=False, 
                               groups=CARDINALITY)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,      #(3)
                               out_channels=out_channels,
                               kernel_size=1, 
                               stride=1, 
                               padding=0, 
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)
        
        self.relu = nn.ReLU()
        
        self.semodule = SEModule(num_channels=out_channels, r=R)    #(4)

The forward() method itself is mostly the same as in the original ResNeXt model, except that here we need to place the SE module right before the element-wise summation, as shown at line #(1) in Codeblock 5b below. Remember that this implementation follows the standard SE block architecture in Figure 6 (b).

# Codeblock 5b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        
        if self.add_channel or self.downsample:
            residual = self.bn_proj(self.projection(x))
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x
            print(f'no projection\t\t: {residual.size()}')
        
        x = self.conv0(x)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')
        
        x = self.conv2(x)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')
        
        x = self.semodule(x)      #(1)
        print(f'after semodule\t\t: {x.size()}')
        
        x = x + residual
        x = self.relu(x)
        print(f'after summation\t\t: {x.size()}')
        
        return x

With the above implementation, every time we instantiate a Block object we get a ResNeXt block that is already equipped with an SE module. Now we are going to test the above class to see if we have implemented it correctly. Here I am going to simulate a ResNeXt block within the third stage. The add_channel and downsample parameters are set to False since we want to preserve both the number of channels and the spatial dimensions of the input tensor.

# Codeblock 6
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)

out = block(x)

Below is what the output looks like. Here you can see that our first convolution layer successfully reduced the number of channels from 512 to 256 (#(1)), which is then expanded back to its original dimension by the third convolution layer (#(2)). Afterwards, the tensor goes through the SE block, whose output size is the same as its input, just like what we saw earlier in Codeblock 3 (#(3)). Once the processing with the SE module is done, we can finally perform the element-wise summation between the tensor from the main branch and the one from the skip connection (#(4)).

original             : torch.Size([1, 512, 28, 28])
no projection        : torch.Size([1, 512, 28, 28])
after conv0-bn0-relu : torch.Size([1, 256, 28, 28])    #(1)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
after conv2-bn2      : torch.Size([1, 512, 28, 28])    #(2)
after semodule       : torch.Size([1, 512, 28, 28])    #(3)
after summation      : torch.Size([1, 512, 28, 28])    #(4)

And below is how I implement the entire architecture. What we essentially need to do is just stack multiple SE-ResNeXt blocks according to the architecture in Figure 10. In fact, the SEResNeXt class in Codeblock 7 is exactly the same as the ResNeXt class in my previous article [3] (I literally copy-pasted it), since what makes SE-ResNeXt different from the original ResNeXt is only the presence of the SE module within the Block class we discussed earlier.

# Codeblock 7
class SEResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,
                                       stride=2,
                                       padding=3, 
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight, 
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,
                                             stride=2, 
                                             padding=1)

        # conv2 stage
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,
                  channel_multiplier=4,
                  downsample=False)
        ])
        for _ in range(NUM_BLOCKS[0]-1):
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],
                                                  add_channel=True, 
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))
            
            
        # conv4 stage
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],
                                                  add_channel=True, 
                                                  downsample=True)])
        
        for _ in range(NUM_BLOCKS[2]-1):
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))
            
            
        # conv5 stage
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],
                                                  add_channel=True, 
                                                  downsample=True)])
        
        for _ in range(NUM_BLOCKS[3]-1):
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))
 
       
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))

        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],
                            out_features=NUM_CLASSES)
        

    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        
        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1\t: {x.size()}')
        
        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1\t: {x.size()}')
        
        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}\t: {x.size()}')
            
        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}\t: {x.size()}')
            
        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}\t: {x.size()}')
            
        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}\t: {x.size()}')
        
        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')
        
        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')
        
        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')
        
        return x

Now that the entire architecture is complete, we are going to test it by passing a tensor of size 1×3×224×224 through the network, simulating a single RGB image of size 224×224. You can see in the output of Codeblock 8 below that the model seems to work properly, since the tensor successfully passed through all layers within the seresnext model without returning any error. Thus, I believe this model is now ready to be trained. By the way, don't forget to change the number of output neurons according to the number of classes in your dataset if you want to actually train this model (see the short note right after the output below).

# Codeblock 8
seresnext = SEResNeXt()
x = torch.randn(1, 3, 224, 224)

out = seresnext(x)
# Codeblock 8 Output
original               : torch.Size([1, 3, 224, 224])
after resnext_conv1    : torch.Size([1, 64, 112, 112])
after resnext_maxpool1 : torch.Size([1, 64, 56, 56])
after resnext_conv2 #0 : torch.Size([1, 256, 56, 56])
after resnext_conv2 #1 : torch.Size([1, 256, 56, 56])
after resnext_conv2 #2 : torch.Size([1, 256, 56, 56])
after resnext_conv3 #0 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3 : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5 : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2 : torch.Size([1, 2048, 7, 7])
after avgpool          : torch.Size([1, 2048, 1, 1])
after flatten          : torch.Size([1, 2048])
after fc               : torch.Size([1, 1000])
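Regarding that note about the output layer, one possible way to adapt the classification head to your own dataset (a sketch of my own, assuming for example 10 classes and the constants from Codeblock 4) is either to change NUM_CLASSES before instantiating the model or to replace the final fully-connected layer afterwards:

# Option 1: set the constant before building the model.
NUM_CLASSES = 10          # e.g., 10 classes in your dataset
seresnext = SEResNeXt()

# Option 2: replace the classification head of an existing model.
seresnext.fc = nn.Linear(in_features=NUM_CHANNELS[5], out_features=10)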

Moreover, we can also print out the number of parameters this model has using the following code. Here you can see that the codeblock returns 27,543,848. This number of parameters is slightly higher than that of the original ResNeXt counterpart, which only has 25,028,904 parameters, as mentioned in my previous article as well as in the official PyTorch documentation [4]. Such an increase in model size definitely makes sense, since the ResNeXt blocks throughout the network now contain additional layers due to the presence of the SE modules.

# Codeblock 9
def count_parameters(model):
    return sum([params.numel() for params in model.parameters()])

count_parameters(seresnext)
# Codeblock 9 Output
27543848

Ending

And that’s just about all the pieces in regards to the Squeeze and Excitation module. I do encourage you to explore from here by training this model on your individual dataset so that you’re going to see whether the findings presented within the paper also apply to your case. Not only that, I believe it could even be interesting in case you attempt to implement SE module on other neural network architectures like VGG or Inception by yourself.

I hope you learned something new today. Thanks for reading!


[1] Jie Hu et al. Squeeze-and-Excitation Networks. arXiv. https://arxiv.org/abs/1709.01507 [Accessed March 17, 2025].

[2] Image originally created by the author.

[3] Taking ResNet to the Next Level. Towards Data Science. https://towardsdatascience.com/taking-resnet-to-the-next-level/ [Accessed July 22, 2025].

[4] resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 17, 2025].

[5] MuhammadArdiPutra. The Channel-Wise Attention — Squeeze and Excitation. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/major/The%20Channel-Sensible%20Attention%20-%20Squeeze%20and%20Excitation.ipynb [Accessed April 7, 2025].
