Image Captioning, Transformer Mode On


Introduction

In my previous article, I discussed one of the earliest deep learning approaches for image captioning. If you're interested in reading it, you'll find the link to that article at the end of this one.

Today, I would like to talk about image captioning again, but this time with a more advanced neural network architecture. The model I'm going to discuss is the one proposed in the paper titled "CPTR: Full Transformer Network for Image Captioning," written by Wei Liu et al. back in 2021 [1]. Specifically, I'll reproduce the model proposed in the paper and explain the underlying theory behind the architecture. However, keep in mind that I won't actually demonstrate the training process, since I only want to focus on the model architecture.

The concept behind CPTR

In fact, the main idea of the CPTR architecture is exactly the same as the earlier image captioning model, as both use the encoder-decoder structure. Previously, in the paper titled "Show and Tell: A Neural Image Caption Generator" [2], the models used were GoogLeNet (a.k.a. Inception V1) and LSTM for the two components, respectively. The illustration of the model proposed in that paper is shown in the following figure.

Figure 1. The neural network architecture for image captioning proposed in the Show and Tell paper [2].

Despite having the same encoder-decoder structure, what makes CPTR different from the previous approach is the design of the encoder and the decoder themselves. In CPTR, we combine the encoder part of the ViT (Vision Transformer) model with the decoder part of the original Transformer model. The use of transformer-based architectures for both components is essentially where the name CPTR comes from: CaPtion TransformeR.

Note that the discussions in this article are highly related to ViT and the Transformer, so I highly recommend you read my previous articles about these two topics if you're not yet familiar with them. You can find the links at the end of this article.

Figure 2 shows what the original ViT architecture looks like. Everything inside the green box is the encoder part of the architecture, which is adopted as the CPTR encoder.

Figure 2. The Vision Transformer (ViT) architecture [3].

Next, Figure 3 displays the original Transformer architecture. The components enclosed in the blue box are the layers that we're going to implement in the CPTR decoder.

Figure 3. The original Transformer architecture [4].

If we combine the components inside the green and blue boxes above, we obtain the architecture shown in Figure 4 below. This is exactly what the CPTR model we're going to implement looks like. The idea here is that the ViT encoder (green) encodes the input image into a tensor representation, which is then used by the Transformer decoder (blue) to generate the corresponding caption.

Figure 4. The CPTR architecture [5].

That's pretty much everything you need to know for now. I'll explain the details as we go through the implementation.

Module imports & parameter configuration

As always, the first thing we need to do in the code is to import the required modules. In this case, we only import torch and torch.nn since we're implementing the model from scratch.

# Codeblock 1
import torch
import torch.nn as nn

Next, we're going to initialize some parameters in Codeblock 2. If you have read my previous article about image captioning with GoogLeNet and LSTM, you'll notice that here we have a lot more parameters to initialize. In this article, I want to reproduce the CPTR model as closely as possible to the original, so the parameters mentioned in the paper are used in this implementation.

# Codeblock 2
BATCH_SIZE         = 1              #(1)

IMAGE_SIZE         = 384            #(2)
IN_CHANNELS        = 3              #(3)

SEQ_LENGTH         = 30             #(4)
VOCAB_SIZE         = 10000          #(5)

EMBED_DIM          = 768            #(6)
PATCH_SIZE         = 16             #(7)
NUM_PATCHES        = (IMAGE_SIZE//PATCH_SIZE) ** 2  #(8)
NUM_ENCODER_BLOCKS = 12             #(9)
NUM_DECODER_BLOCKS = 4              #(10)
NUM_HEADS          = 12             #(11)
HIDDEN_DIM         = EMBED_DIM * 4  #(12)
DROP_PROB          = 0.1            #(13)

The first parameter I want to explain is BATCH_SIZE, which is written at the line marked with #(1). The number assigned to this variable is not particularly important in our case since we aren't actually going to train this model. It is set to 1 because, by default, PyTorch treats input tensors as a batch of samples, and here I assume we only have a single sample in a batch.

Next, keep in mind that in image captioning we're dealing with images and text simultaneously, which means we need to set parameters for both. The paper mentions that the model accepts an RGB image of size 384×384 as the encoder input, so we assign the values of the IMAGE_SIZE and IN_CHANNELS variables accordingly (#(2) and #(3)). On the other hand, the paper doesn't mention the parameters for the captions, so here I assume that a caption is no more than 30 words long (#(4)), with a vocabulary size of 10000 unique words (#(5)).

The remaining parameters are related to the model configuration. Here we set the EMBED_DIM variable to 768 (#(6)). On the encoder side, this number is the length of the feature vector that represents each 16×16 image patch (#(7)). The same concept applies to the decoder side, except that there the feature vector represents a single word in the caption. Speaking of the PATCH_SIZE parameter, we use its value to compute the total number of patches in the input image. Since the image has a size of 384×384, there will be (384/16)² = 576 patches in total (#(8)).

When using an encoder-decoder architecture, we can specify the number of encoder and decoder blocks to stack. Using more blocks typically allows the model to perform better in terms of accuracy, but in return it requires more computational power. The authors of this paper decided to stack 12 encoder blocks (#(9)) and 4 decoder blocks (#(10)). Next, since CPTR is a transformer-based model, we also need to specify the number of attention heads within the attention blocks inside the encoders and decoders; in this case the authors use 12 attention heads (#(11)). The value of the HIDDEN_DIM parameter is not mentioned anywhere in the paper, but following the ViT and Transformer papers, it is configured to be 4 times larger than EMBED_DIM (#(12)). The dropout rate is not mentioned in the paper either, so I arbitrarily set DROP_PROB to 0.1 (#(13)).

Encoder

Now that the modules and parameters have been set up, we can get into the encoder part of the network. In this section we're going to implement and explain every component inside the green box in Figure 4 one by one.

Patch embedding

Figure 5. Dividing the input image into patches and converting them into vectors [5].

You can see in Figure 5 above that the first step is to divide the input image into patches. This is done because, instead of focusing on local patterns like CNNs, ViT captures global context by learning the relationships between these patches. We can model this process with the Patcher class shown in Codeblock 3 below. For the sake of simplicity, I also include the linear projection step within the same class.

# Codeblock 3
class Patcher(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.unfold = nn.Unfold(kernel_size=PATCH_SIZE, stride=PATCH_SIZE)

        #(2)
        self.linear_projection = nn.Linear(in_features=IN_CHANNELS*PATCH_SIZE*PATCH_SIZE,
                                           out_features=EMBED_DIM)

    def forward(self, images):
        print(f'images\t\t: {images.size()}')
        images = self.unfold(images)  #(3)
        print(f'after unfold\t: {images.size()}')

        images = images.permute(0, 2, 1)  #(4)
        print(f'after permute\t: {images.size()}')

        features = self.linear_projection(images)  #(5)
        print(f'after lin proj\t: {features.size()}')

        return features

The patching itself is done using the nn.Unfold layer (#(1)). Here we need to set both the kernel_size and stride parameters to PATCH_SIZE (16) so that the resulting patches don't overlap with one another. This layer also automatically flattens the patches once it's applied to the input image. Meanwhile, the nn.Linear layer (#(2)) is employed to perform the linear projection, i.e., the process done by the linear projection block. By setting the out_features parameter to EMBED_DIM, this layer maps each flattened patch into a feature vector of length 768.

The entire process should make more sense once you read the forward() method. You can see at line #(3) in the same codeblock that the input image is processed directly by the unfold layer. Next, we process the resulting tensor with the permute() method (#(4)) to swap the first and second axes before feeding it to the linear_projection layer (#(5)). Additionally, I print out the tensor dimension after each layer so that you can better understand the transformation made at each step.

In order to check whether our Patcher class works properly, we can simply pass a dummy tensor through the network. Take a look at Codeblock 4 below to see how I do it.

# Codeblock 4
patcher  = Patcher()

images   = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = patcher(images)
# Codeblock 4 Output
images         : torch.Size([1, 3, 384, 384])
after unfold   : torch.Size([1, 768, 576])  #(1)
after permute  : torch.Size([1, 576, 768])  #(2)
after lin proj : torch.Size([1, 576, 768])  #(3)

The tensor I passed above represents an RGB image of size 384×384. Here we can see that after the unfold operation, the tensor dimension changed to 1×768×576 (#(1)), denoting the flattened 3×16×16 patch for each of the 576 patches. Unfortunately, this shape doesn't match what we need. Remember that in ViT we treat the image patches as a sequence, so we need to swap the 1st and 2nd axes because, conventionally, the 1st dimension of such a tensor represents the sequence axis, while the 2nd one represents the feature vector of each timestep. After the permute() operation, our tensor has the dimension 1×576×768 (#(2)). Lastly, we pass this tensor through the linear projection layer, whose output shape stays the same since we set EMBED_DIM to the same size (768) (#(3)). Despite having the same dimension, the information contained in the final tensor should be richer thanks to the transformation applied by the trainable weights of the linear projection layer.
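As a side note, the same patch embedding is often implemented with a single nn.Conv2d whose kernel size and stride equal the patch size, which fuses the unfold and the linear projection into one operation. This is not how our Patcher class works internally, just an equivalent formulation you may encounter elsewhere; a minimal sketch:

# Alternative patch embedding with a strided convolution (sketch, not part of the article's codeblocks)
conv_proj = nn.Conv2d(in_channels=IN_CHANNELS, out_channels=EMBED_DIM,
                      kernel_size=PATCH_SIZE, stride=PATCH_SIZE)
out = conv_proj(torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))
out = out.flatten(2).permute(0, 2, 1)  # (1, 768, 24, 24) -> (1, 768, 576) -> (1, 576, 768)
print(out.shape)                       # torch.Size([1, 576, 768])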

Learnable positional embedding

Figure 6. Injecting the learnable positional embeddings into the embedded image patches [5].

After the input image has successfully been converted into a sequence of patches, the next thing to do is to inject the so-called positional embedding tensor. This is done because a transformer without positional embedding is permutation-invariant, meaning that it treats the input sequence as if the order doesn't matter. Interestingly, since an image is not a literal sequence, we should make the positional embedding learnable so that the model can figure out for itself which arrangement of the patch sequence best represents the spatial information. However, keep in mind that the term "reordering" here doesn't mean we physically rearrange the sequence; rather, the model does so by adjusting the embedding weights.

The implementation is pretty easy. All we need to do is initialize a tensor using nn.Parameter whose dimension is set to match the output of the Patcher model, i.e., 576×768. Also, don't forget to write requires_grad=True just to ensure that the tensor is trainable. Take a look at Codeblock 5 below for the details.

# Codeblock 5
class LearnableEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.learnable_embedding = nn.Parameter(torch.randn(size=(NUM_PATCHES, EMBED_DIM)),
                                                requires_grad=True)

    def forward(self):
        pos_embed = self.learnable_embedding
        print(f'learnable embedding\t: {pos_embed.size()}')

        return pos_embed

Now let's run the following codeblock to see whether our LearnableEmbedding class works properly. You can see in the printed output that it successfully created the positional embedding tensor as expected.

# Codeblock 6
learnable_embedding = LearnableEmbedding()

pos_embed = learnable_embedding()
# Codeblock 6 Output
learnable embedding : torch.Size([576, 768])
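Note that this positional embedding tensor has no batch dimension (576×768), while the patch embeddings produced by Patcher are 1×576×768. When the two are added later inside the Encoder class, PyTorch broadcasting handles the mismatch automatically. A quick sketch to convince yourself:

# Broadcasting check: (576, 768) + (1, 576, 768) -> (1, 576, 768)
features  = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
pos_embed = learnable_embedding()
print((features + pos_embed).shape)  # torch.Size([1, 576, 768])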

The main encoder block

Figure 7. The main encoder block [5].

The next thing we're going to do is construct the encoder block displayed in Figure 7 above. Here you can see that this block consists of several sub-components, namely self-attention, layer normalization, FFN (Feed-Forward Network), and another layer normalization. Codeblock 7a below shows how I initialize these layers inside the __init__() method of the EncoderBlock class.

# Codeblock 7a
class EncoderBlock(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.self_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,
                                                    num_heads=NUM_HEADS,
                                                    batch_first=True,  #(2)
                                                    dropout=DROP_PROB)

        self.layer_norm_0 = nn.LayerNorm(EMBED_DIM)  #(3)

        self.ffn = nn.Sequential(  #(4)
            nn.Linear(in_features=EMBED_DIM, out_features=HIDDEN_DIM),
            nn.GELU(),
            nn.Dropout(p=DROP_PROB),
            nn.Linear(in_features=HIDDEN_DIM, out_features=EMBED_DIM),
        )

        self.layer_norm_1 = nn.LayerNorm(EMBED_DIM)  #(5)

I previously mentioned that the idea of ViT is to capture the relationships between patches within an image. This is done by the multihead self-attention layer I initialize at line #(1) in the above codeblock. One thing to keep in mind here is that we need to set the batch_first parameter to True (#(2)) so that the attention layer is compatible with our tensor shape, in which the batch dimension (batch_size) is on the 0th axis. Next, the two layer normalization layers need to be initialized separately, as shown at lines #(3) and #(5). Lastly, we initialize the FFN block at line #(4), whose layers, stacked using nn.Sequential, follow the structure defined in the following equation.

Figure 8. The operations done inside the FFN block [1].
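Written out explicitly, the FFN as implemented in Codeblock 7a computes the following (note that the original Transformer formulation uses ReLU, i.e. max(0, ·), instead of GELU; the GELU here simply mirrors the code above):

\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)\,W_2 + b_2, \qquad W_1 \in \mathbb{R}^{768 \times 3072},\; W_2 \in \mathbb{R}^{3072 \times 768}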

As the __init__() method is complete, we'll now proceed with the forward() method. Let's take a look at Codeblock 7b below.

# Codeblock 7b
    def forward(self, features):  #(1)

        residual = features  #(2)
        print(f'features & residual\t: {residual.size()}')

        #(3)
        features, self_attn_weights = self.self_attention(query=features,
                                                          key=features,
                                                          value=features)
        print(f'after self attention\t: {features.size()}')
        print(f"self attn weights\t: {self_attn_weights.shape}")

        features = self.layer_norm_0(features + residual)  #(4)
        print(f'after norm\t\t: {features.size()}')


        residual = features
        print(f'\nfeatures & residual\t: {residual.size()}')

        features = self.ffn(features)  #(5)
        print(f'after ffn\t\t: {features.size()}')

        features = self.layer_norm_1(features + residual)
        print(f'after norm\t\t: {features.size()}')

        return features

Here you can see that the input tensor is named features (#(1)). I name it this way because the input of the EncoderBlock is the image that has already been processed by Patcher and LearnableEmbedding, rather than a raw image. Before doing anything else, notice in the encoder block that there is a branch separated from the main flow which then returns to the normalization layer. This branch is commonly known as a residual connection. To implement it, we store the original input tensor in the residual variable, as I demonstrate at line #(2). Once the input tensor has been copied, we are ready to process the original input with the multihead attention layer (#(3)). Since this is a self-attention (not a cross-attention), the query, key, and value inputs for this layer are all derived from the features tensor. Next, the layer normalization operation is performed at line #(4), whose input already contains information from the attention block as well as the residual connection. The remaining steps are basically the same as what I just explained, except that here we replace the self-attention block with the FFN (#(5)).

In the following codeblock, I'll test the EncoderBlock class by passing a dummy tensor of size 1×576×768, simulating an output tensor from the previous operations.

# Codeblock 8
encoder_block = EncoderBlock()

features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
features = encoder_block(features)

Below is what the tensor dimension looks like throughout the entire process inside the model.

# Codeblock 8 Output
features & residual  : torch.Size([1, 576, 768])  #(1)
after self attention : torch.Size([1, 576, 768])
self attn weights    : torch.Size([1, 576, 576])  #(2)
after norm           : torch.Size([1, 576, 768])

features & residual  : torch.Size([1, 576, 768])
after ffn            : torch.Size([1, 576, 768])  #(3)
after norm           : torch.Size([1, 576, 768])  #(4)

Here you can see that the final output tensor (#(4)) has the same size as the input (#(1)), allowing us to stack multiple encoder blocks without having to worry about messing up the tensor dimensions. Not only that, the size of the tensor also appears to be unchanged from the beginning all the way to the last layer. In fact, there are plenty of transformations performed inside the attention block, but we just can't see them since the entire process is done internally by the nn.MultiheadAttention layer. One tensor produced inside the layer that we can observe is the attention weight (#(2)). This weight matrix, which has the size of 576×576, is responsible for storing information about the relationships between each patch and every other patch in the image. Additionally, changes in tensor dimension also happen inside the FFN layer: the feature vector of each patch, which initially has length 768, expands to 3072 and is immediately shrunk back to 768 (#(3)). However, this transformation is not printed since the process is wrapped inside nn.Sequential at line #(4) in Codeblock 7a.
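If you're curious about that hidden 3072-dimensional representation, you can probe the FFN sub-layers directly by indexing into the nn.Sequential container. A minimal sketch, assuming the encoder_block instance from Codeblock 8 is still in scope:

# Probing the FFN expansion (extra sketch, not one of the original codeblocks)
with torch.no_grad():
    x = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
    hidden = encoder_block.ffn[0](x)   # first nn.Linear: 768 -> 3072
    print(hidden.shape)                # torch.Size([1, 576, 3072])
    out = encoder_block.ffn[3](encoder_block.ffn[1](hidden))  # GELU, then 3072 -> 768
    print(out.shape)                   # torch.Size([1, 576, 768])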

ViT encoder

Figure 9. The entire ViT encoder in the CPTR architecture [5].

Now that we have finished implementing all the encoder components, we can assemble them to construct the actual ViT encoder. We're going to do it in the Encoder class in Codeblock 9.

# Codeblock 9
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.patcher = Patcher()  #(1)
        self.learnable_embedding = LearnableEmbedding()  #(2)

        #(3)
        self.encoder_blocks = nn.ModuleList(EncoderBlock() for _ in range(NUM_ENCODER_BLOCKS))

    def forward(self, images):  #(4)
        print(f'images\t\t\t: {images.size()}')

        features = self.patcher(images)  #(5)
        print(f'after patcher\t\t: {features.size()}')

        features = features + self.learnable_embedding()  #(6)
        print(f'after learn embed\t: {features.size()}')

        for i, encoder_block in enumerate(self.encoder_blocks):
            features = encoder_block(features)  #(7)
            print(f"after encoder block #{i}\t: {features.shape}")

        return features

Inside the __init__() method, what we need to do is initialize all the components we created earlier, i.e., Patcher (#(1)), LearnableEmbedding (#(2)), and EncoderBlock (#(3)). In this case, the EncoderBlock is initialized inside an nn.ModuleList since we want to repeat it NUM_ENCODER_BLOCKS (12) times. As for the forward() method, it starts by accepting a raw image as the input (#(4)). We then process it with the patcher layer (#(5)) to divide the image into small patches and transform them with the linear projection operation. The learnable positional embedding tensor is then injected into the resulting output by element-wise addition (#(6)). Lastly, we pass it through the 12 encoder blocks sequentially with a simple for loop (#(7)).

Now, in Codeblock 10, I'm going to pass a dummy image through the entire encoder. Note that since I want to focus on the flow of this Encoder class, I re-ran the previous classes we created earlier with the print() functions commented out so that the output looks neat.

# Codeblock 10
encoder = Encoder()

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = encoder(images)

And below is what the flow of the tensor looks like. Here, we can see that our dummy input image successfully passed through all layers in the network, including the encoder blocks that we repeat 12 times. The resulting output tensor is now context-aware, meaning that it already contains information about the relationships between patches within the image. Therefore, this tensor is ready to be processed further by the decoder, which will be discussed in the next section.

# Codeblock 10 Output
images                  : torch.Size([1, 3, 384, 384])
after patcher           : torch.Size([1, 576, 768])
after learn embed       : torch.Size([1, 576, 768])
after encoder block #0  : torch.Size([1, 576, 768])
after encoder block #1  : torch.Size([1, 576, 768])
after encoder block #2  : torch.Size([1, 576, 768])
after encoder block #3  : torch.Size([1, 576, 768])
after encoder block #4  : torch.Size([1, 576, 768])
after encoder block #5  : torch.Size([1, 576, 768])
after encoder block #6  : torch.Size([1, 576, 768])
after encoder block #7  : torch.Size([1, 576, 768])
after encoder block #8  : torch.Size([1, 576, 768])
after encoder block #9  : torch.Size([1, 576, 768])
after encoder block #10 : torch.Size([1, 576, 768])
after encoder block #11 : torch.Size([1, 576, 768])

ViT encoder (alternative)

I want to show you something before we talk about the decoder. If you think that our approach above is too complicated, you can actually use nn.TransformerEncoderLayer from PyTorch so that you don't have to implement the EncoderBlock class from scratch. To do so, I'm going to reimplement the Encoder class, but this time I'll name it EncoderTorch.

# Codeblock 11
class EncoderTorch(nn.Module):
    def __init__(self):
        super().__init__()
        self.patcher = Patcher()
        self.learnable_embedding = LearnableEmbedding()

        #(1)
        encoder_block = nn.TransformerEncoderLayer(d_model=EMBED_DIM,
                                                   nhead=NUM_HEADS,
                                                   dim_feedforward=HIDDEN_DIM,
                                                   dropout=DROP_PROB,
                                                   batch_first=True)

        #(2)
        self.encoder_blocks = nn.TransformerEncoder(encoder_layer=encoder_block,
                                                    num_layers=NUM_ENCODER_BLOCKS)

    def forward(self, images):
        print(f'images\t\t\t: {images.size()}')

        features = self.patcher(images)
        print(f'after patcher\t\t: {features.size()}')

        features = features + self.learnable_embedding()
        print(f'after learn embed\t: {features.size()}')

        features = self.encoder_blocks(features)  #(3)
        print(f'after encoder blocks\t: {features.size()}')

        return features

What we basically do in the above codeblock is that, instead of using the EncoderBlock class, here we use nn.TransformerEncoderLayer (#(1)), which automatically creates a single encoder block based on the parameters we pass to it. To repeat it multiple times, we can just use nn.TransformerEncoder and pass a number to the num_layers parameter (#(2)). With this approach, we don't need to write the forward pass in a loop like we did earlier (#(3)).
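One small difference to be aware of: nn.TransformerEncoderLayer uses ReLU in its feed-forward network by default, whereas our custom EncoderBlock uses GELU. If you want the built-in version to match more closely, you can pass activation='gelu' when constructing the layer. This tweak is my own suggestion, not something discussed in the paper; a minimal sketch:

# Sketch: matching the GELU activation of the custom EncoderBlock
encoder_block = nn.TransformerEncoderLayer(d_model=EMBED_DIM,
                                           nhead=NUM_HEADS,
                                           dim_feedforward=HIDDEN_DIM,
                                           dropout=DROP_PROB,
                                           activation='gelu',
                                           batch_first=True)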

The testing code in Codeblock 12 below is exactly the same as the one in Codeblock 10, except that here I use the EncoderTorch class. You can also see that the output is essentially the same as the previous one.

# Codeblock 12
encoder_torch = EncoderTorch()

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = encoder_torch(images)
# Codeblock 12 Output
images               : torch.Size([1, 3, 384, 384])
after patcher        : torch.Size([1, 576, 768])
after learn embed    : torch.Size([1, 576, 768])
after encoder blocks : torch.Size([1, 576, 768])

Decoder

Now that we have successfully created the encoder part of the CPTR architecture, let's talk about the decoder. In this section I'm going to implement every component inside the blue box in Figure 4. Based on the figure, we can see that the decoder accepts two inputs, i.e., the image caption ground truth (the lower part of the blue box) and the sequence of embedded patches produced by the encoder (the arrow coming from the green box). It is important to know that the architecture drawn in Figure 4 illustrates the training phase, where the entire caption ground truth is fed into the decoder. In the inference phase, on the other hand, we only provide a beginning-of-sentence (BOS) token for the caption input. The decoder then predicts each word sequentially based on the given image and the previously generated words. This process is commonly known as an autoregressive mechanism.
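To make the autoregressive idea concrete, below is a minimal greedy-decoding sketch of what inference could look like once the model is trained. It relies on the EncoderDecoder class and the create_mask() helper that we will build later in this article, and the BOS_TOKEN and EOS_TOKEN ids are hypothetical placeholders that depend on how you build your vocabulary.

# Greedy-decoding sketch (assumption: BOS_TOKEN and EOS_TOKEN come from your tokenizer)
BOS_TOKEN, EOS_TOKEN = 1, 2   # hypothetical ids

def generate_caption(model, image, max_length=SEQ_LENGTH):
    model.eval()
    caption = torch.full((1, max_length), BOS_TOKEN, dtype=torch.long)  # start with BOS only
    mask = create_mask(seq_length=max_length)
    with torch.no_grad():
        for t in range(max_length - 1):
            logits = model(image, caption, mask)        # (1, max_length, VOCAB_SIZE)
            next_word = logits[:, t, :].argmax(dim=-1)  # most likely word at position t
            caption[:, t + 1] = next_word
            if next_word.item() == EOS_TOKEN:
                break
    return caption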

Sinusoidal positional embedding

Figure 10. Where the sinusoidal positional embedding component is positioned in the decoder [5].

If you take a look at the CPTR model, you'll see that the first step in the decoder is to convert each word into the corresponding feature vector representation using the word embedding block. However, since this step is very easy, we're going to implement it later. For now, let's assume that this word vectorization process is already done, so we can move on to the positional embedding part.

As I mentioned earlier, since the transformer is permutation-invariant by nature, we need to apply positional embedding to the input sequence. Different from the previous one, here we use the so-called sinusoidal positional embedding. We can think of it as a way to label each word vector by assigning numbers obtained from sinusoidal waves. By doing so, we can expect our model to understand word order thanks to the information carried by the wave patterns.

If you go back to the Codeblock 6 output, you'll see that the positional embedding tensor in the encoder has the size of NUM_PATCHES × EMBED_DIM (576×768). What we basically want to do in the decoder is to create a tensor of size SEQ_LENGTH × EMBED_DIM (30×768), whose values are computed based on the equation shown in Figure 11. This tensor is set to be non-trainable because a sequence of words must maintain a fixed order to preserve its meaning.

Figure 11. The equation for creating sinusoidal positional encoding proposed in the Transformer paper [6].
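For reference, the equation in Figure 11 is the standard formulation from the Transformer paper [6], with d_model = EMBED_DIM = 768 in our case:

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)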

Here I only want to explain the following code briefly, because I have discussed it more thoroughly in my previous article about the Transformer. Generally speaking, what we do here is create the sine and cosine waves using torch.sin() (#(1)) and torch.cos() (#(2)). The resulting two tensors are then interleaved and merged using the code at lines #(3) and #(4).

# Codeblock 13
class SinusoidalEmbedding(nn.Module):
    def forward(self):
        pos = torch.arange(SEQ_LENGTH).reshape(SEQ_LENGTH, 1)
        print(f"pos\t\t: {pos.shape}")

        i = torch.arange(0, EMBED_DIM, 2)
        denominator = torch.pow(10000, i/EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")

        even_pos_embed = torch.sin(pos/denominator)  #(1)
        odd_pos_embed  = torch.cos(pos/denominator)  #(2)
        print(f"even_pos_embed\t: {even_pos_embed.shape}")

        stacked = torch.stack([even_pos_embed, odd_pos_embed], dim=2)  #(3)
        print(f"stacked\t\t: {stacked.shape}")

        pos_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)
        print(f"pos_embed\t: {pos_embed.shape}")

        return pos_embed

Now we can check whether the SinusoidalEmbedding class above works properly by running Codeblock 14 below. As expected, you can see that the resulting tensor has the size of 30×768. This dimension matches the tensor produced by the word embedding block, allowing them to be summed in an element-wise manner.

# Codeblock 14
sinusoidal_embedding = SinusoidalEmbedding()
pos_embed = sinusoidal_embedding()
# Codeblock 14 Output
pos            : torch.Size([30, 1])
denominator    : torch.Size([384])
even_pos_embed : torch.Size([30, 384])
stacked        : torch.Size([30, 384, 2])
pos_embed      : torch.Size([30, 768])

Look-ahead mask

Figure 12. A look-ahead mask needs to be applied to the masked self-attention layer [5].

The next thing I'm going to talk about in the decoder is the masked self-attention layer highlighted in the above figure. I'm not going to code the attention mechanism from scratch. Rather, I'll only implement the so-called look-ahead mask, which helps the self-attention layer so that it doesn't attend to the subsequent words in the caption during the training phase.

The way to do it is pretty easy: all we need to do is create a triangular matrix whose size matches the attention weight matrix, i.e., SEQ_LENGTH × SEQ_LENGTH (30×30). Take a look at the create_mask() function below for the details.

# Codeblock 15
def create_mask(seq_length):
    mask = torch.tril(torch.ones((seq_length, seq_length)))  #(1)
    mask[mask == 0] = -float('inf')  #(2)
    mask[mask == 1] = 0  #(3)
    return mask

Even though creating a triangular matrix can simply be done with torch.tril() and torch.ones() (#(1)), here we need to make a small modification by changing the 0 values to -inf (#(2)) and the 1s to 0 (#(3)). This is done because the nn.MultiheadAttention layer applies the mask by element-wise addition. By assigning -inf to the subsequent words, the attention mechanism completely ignores them. Again, the internal process inside an attention layer has been discussed in detail in my previous article about the Transformer.

Now I'm going to run the function with seq_length=7 so that you can see what the mask actually looks like. Later in the complete flow, we need to set the seq_length parameter to SEQ_LENGTH (30) so that it matches the actual caption length.

# Codeblock 16
mask_example = create_mask(seq_length=7)
mask_example
# Codeblock 16 Output
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf],
       [0., 0., -inf, -inf, -inf, -inf, -inf],
       [0., 0., 0., -inf, -inf, -inf, -inf],
       [0., 0., 0., 0., -inf, -inf, -inf],
       [0., 0., 0., 0., 0., -inf, -inf],
       [0., 0., 0., 0., 0., 0., -inf],
       [0., 0., 0., 0., 0., 0., 0.]])
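As a side note, recent PyTorch versions also provide a built-in helper that produces the same kind of causal mask, so the following should give an equivalent result (exact dtype and device handling may differ slightly across versions, so treat this as an alternative sketch):

# Alternative: PyTorch's built-in causal mask helper
mask_example = nn.Transformer.generate_square_subsequent_mask(7)
print(mask_example)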

The main decoder block

Figure 13. The main decoder block [5].

We can see in the above figure that the structure of the decoder block is a bit longer than that of the encoder block. It looks as if everything is nearly the same, except that the decoder has a cross-attention mechanism and an additional layer normalization step placed after it. This cross-attention layer can be perceived as the bridge between the encoder and the decoder, as it is employed to capture the relationships between each word in the caption and every single patch in the input image. The two arrows coming from the encoder are the key and value inputs for this attention layer, whereas the query is derived from the previous layer in the decoder itself. Take a look at Codeblocks 17a and 17b below to see the implementation of the entire decoder block.

# Codeblock 17a
class DecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.self_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,
                                                    num_heads=NUM_HEADS,
                                                    batch_first=True,
                                                    dropout=DROP_PROB)
        #(2)
        self.layer_norm_0 = nn.LayerNorm(EMBED_DIM)
        #(3)
        self.cross_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,
                                                     num_heads=NUM_HEADS,
                                                     batch_first=True,
                                                     dropout=DROP_PROB)

        #(4)
        self.layer_norm_1 = nn.LayerNorm(EMBED_DIM)

        #(5)
        self.ffn = nn.Sequential(
            nn.Linear(in_features=EMBED_DIM, out_features=HIDDEN_DIM),
            nn.GELU(),
            nn.Dropout(p=DROP_PROB),
            nn.Linear(in_features=HIDDEN_DIM, out_features=EMBED_DIM),
        )

        #(6)
        self.layer_norm_2 = nn.LayerNorm(EMBED_DIM)

In the __init__() method, we first initialize both the self-attention (#(1)) and cross-attention (#(3)) layers with nn.MultiheadAttention. These two layers look exactly the same for now, but later you'll see the difference in the forward() method. The three layer normalization operations are initialized separately, as shown at lines #(2), #(4) and #(6), since each of them will contain different normalization parameters. Lastly, the ffn layer (#(5)) is exactly the same as the one in the encoder, which basically follows the equation back in Figure 8.

Speaking of the forward() method below, it starts by accepting three inputs: features, captions, and attn_mask, which denote the tensor coming from the encoder, the tensor from the decoder itself, and the look-ahead mask, respectively (#(1)). The remaining steps are somewhat similar to those of the EncoderBlock, except that here we repeat the multihead attention block twice. The first attention mechanism takes captions as the query, key, and value parameters (#(2)). This is done because we want the layer to capture the context within the captions tensor itself, hence the name self-attention. Here we also need to pass the attn_mask parameter to this layer so that it cannot see the subsequent words during the training phase. The second attention mechanism is different (#(3)). Since we want to combine the information from the encoder and the decoder, we pass the captions tensor as the query, whereas the features tensor is passed as the key and value, hence the name cross-attention. A look-ahead mask is not necessary in the cross-attention layer since, later in the inference phase, the model is allowed to see the entire input image at once rather than looking at the patches one by one. Once the tensor has been processed by the two attention layers, we pass it through the feed-forward network (#(4)). Lastly, don't forget to create the residual connections and apply the layer normalization steps after each sub-component.

# Codeblock 17b
    def forward(self, features, captions, attn_mask):  #(1)
        print(f"attn_mask\t\t: {attn_mask.shape}")
        residual = captions
        print(f"captions & residual\t: {captions.shape}")

        #(2)
        captions, self_attn_weights = self.self_attention(query=captions,
                                                          key=captions,
                                                          value=captions,
                                                          attn_mask=attn_mask)
        print(f"after self attention\t: {captions.shape}")
        print(f"self attn weights\t: {self_attn_weights.shape}")

        captions = self.layer_norm_0(captions + residual)
        print(f"after norm\t\t: {captions.shape}")


        print(f"\nfeatures\t\t: {features.shape}")
        residual = captions
        print(f"captions & residual\t: {captions.shape}")

        #(3)
        captions, cross_attn_weights = self.cross_attention(query=captions,
                                                            key=features,
                                                            value=features)
        print(f"after cross attention\t: {captions.shape}")
        print(f"cross attn weights\t: {cross_attn_weights.shape}")

        captions = self.layer_norm_1(captions + residual)
        print(f"after norm\t\t: {captions.shape}")

        residual = captions
        print(f"\ncaptions & residual\t: {captions.shape}")

        captions = self.ffn(captions)  #(4)
        print(f"after ffn\t\t: {captions.shape}")

        captions = self.layer_norm_2(captions + residual)
        print(f"after norm\t\t: {captions.shape}")

        return captions

As the DecoderBlock class is complete, we can now test it with the following code.

# Codeblock 18
decoder_block = DecoderBlock()

features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)  #(1)
captions = torch.randn(BATCH_SIZE, SEQ_LENGTH, EMBED_DIM)   #(2)
look_ahead_mask = create_mask(seq_length=SEQ_LENGTH)  #(3)

captions = decoder_block(features, captions, look_ahead_mask)

Here we assume that features is a tensor containing a sequence of patch embeddings produced by the encoder (#(1)), while captions is a sequence of embedded words (#(2)). The seq_length parameter of the look-ahead mask is set to SEQ_LENGTH (30) to match the number of words in the caption (#(3)). The tensor dimensions after each step are displayed in the following output.

# Codeblock 18 Output
attn_mask             : torch.Size([30, 30])
captions & residual   : torch.Size([1, 30, 768])
after self attention  : torch.Size([1, 30, 768])
self attn weights     : torch.Size([1, 30, 30])    #(1)
after norm            : torch.Size([1, 30, 768])

features              : torch.Size([1, 576, 768])
captions & residual   : torch.Size([1, 30, 768])
after cross attention : torch.Size([1, 30, 768])
cross attn weights    : torch.Size([1, 30, 576])   #(2)
after norm            : torch.Size([1, 30, 768])

captions & residual   : torch.Size([1, 30, 768])
after ffn             : torch.Size([1, 30, 768])
after norm            : torch.Size([1, 30, 768])

Here we can see that our DecoderBlock class works properly, as it successfully processed the input tensors all the way to the last layer in the network. I want you to take a closer look at the attention weights at lines #(1) and #(2). Based on these two lines, we can confirm that our decoder implementation is correct: the attention weight produced by the self-attention layer has the size of 30×30 (#(1)), which basically means that this layer really captured the context within the input caption. Meanwhile, the attention weight matrix generated by the cross-attention layer has the size of 30×576 (#(2)), indicating that it successfully captured the relationships between the words and the patches. This essentially means that after the cross-attention operation is performed, the resulting captions tensor has been enriched with information from the image.

Transformer decoder

Figure 14. The entire Transformer decoder in the CPTR architecture [5].

Now that we have successfully created all the components for the entire decoder, what I'm going to do next is put them together into a single class. Take a look at Codeblocks 19a and 19b below to see how I do that.

# Codeblock 19a
class Decoder(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)

        #(2)
        self.sinusoidal_embedding = SinusoidalEmbedding()

        #(3)
        self.decoder_blocks = nn.ModuleList(DecoderBlock() for _ in range(NUM_DECODER_BLOCKS))

        #(4)
        self.linear = nn.Linear(in_features=EMBED_DIM,
                                out_features=VOCAB_SIZE)

If you compare this Decoder class with the Encoder class from Codeblock 9, you'll notice that they're somewhat similar in terms of structure. In the encoder, we convert image patches into vectors using Patcher, while in the decoder we convert each word in the caption into a vector using the nn.Embedding layer (#(1)), which I haven't explained earlier. Afterwards, we initialize the positional embedding layer, where for the decoder we use the sinusoidal rather than the learnable one (#(2)). Next, we stack multiple decoder blocks using nn.ModuleList (#(3)). The linear layer written at line #(4), which doesn't exist in the encoder, needs to be implemented here since it is responsible for mapping each of the embedded words into a vector of length VOCAB_SIZE (10000). Later on, this vector will contain the logit of every word in the dictionary, and what we need to do afterwards is simply take the index containing the highest value, i.e., the most likely word to be predicted.

The flow of the tensors within the forward() method itself is also pretty similar to the one in the Encoder class. In Codeblock 19b below we pass features, captions, and attn_mask as the input (#(1)). Keep in mind that in this case the captions tensor contains the raw word sequence, so we need to vectorize these words with the embedding layer first (#(2)). Next, we inject the sinusoidal positional embedding tensor using the code at line #(3) before eventually passing it through the four decoder blocks sequentially (#(4)). Finally, we pass the resulting tensor through the last linear layer to obtain the prediction logits (#(5)).

# Codeblock 19b
    def forward(self, features, captions, attn_mask):  #(1)
        print(f"features\t\t: {features.shape}")
        print(f"captions\t\t: {captions.shape}")

        captions = self.embedding(captions)  #(2)
        print(f"after embedding\t\t: {captions.shape}")

        captions = captions + self.sinusoidal_embedding()  #(3)
        print(f"after sin embed\t\t: {captions.shape}")

        for i, decoder_block in enumerate(self.decoder_blocks):
            captions = decoder_block(features, captions, attn_mask)  #(4)
            print(f"after decoder block #{i}\t: {captions.shape}")

        captions = self.linear(captions)  #(5)
        print(f"after linear\t\t: {captions.shape}")

        return captions

At this point you might be wondering why we don't implement the softmax activation function drawn in the illustration. This is because, during the training phase, softmax is typically included within the loss function, whereas in the inference phase, the index of the largest value remains the same regardless of whether softmax is applied.
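To illustrate, once we have the logits, picking the predicted word id per position is just an argmax, and applying softmax first wouldn't change the result. A quick sketch with random logits:

# Sketch: argmax over logits equals argmax over softmax probabilities
logits    = torch.randn(BATCH_SIZE, SEQ_LENGTH, VOCAB_SIZE)
pred_ids  = logits.argmax(dim=-1)                            # shape: (1, 30)
probs_ids = torch.softmax(logits, dim=-1).argmax(dim=-1)
print(torch.equal(pred_ids, probs_ids))                      # True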

Now let's run the following testing code to check whether there are errors in our implementation. As mentioned earlier, the captions input of the Decoder class is a raw word sequence. To simulate this, we can simply create a sequence of random integers ranging between 0 and VOCAB_SIZE (10000) with a length of SEQ_LENGTH (30) words (#(1)).

# Codeblock 20
decoder = Decoder()

features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(1)

captions = decoder(features, captions, look_ahead_mask)

And below is what the resulting output looks like. Here you can see in the last line that the linear layer produced a tensor of size 30×10000, indicating that our decoder is now capable of predicting the logit scores for every word in the vocabulary across all 30 sequence positions.

# Codeblock 20 Output
features               : torch.Size([1, 576, 768])
captions               : torch.Size([1, 30])
after embedding        : torch.Size([1, 30, 768])
after sin embed        : torch.Size([1, 30, 768])
after decoder block #0 : torch.Size([1, 30, 768])
after decoder block #1 : torch.Size([1, 30, 768])
after decoder block #2 : torch.Size([1, 30, 768])
after decoder block #3 : torch.Size([1, 30, 768])
after linear           : torch.Size([1, 30, 10000])

Transformer decoder (alternative)

It is also possible to make the code simpler by replacing the DecoderBlock class with nn.TransformerDecoderLayer, just like what we did in the ViT encoder. Below is what the code looks like if we use this approach instead.

# Codeblock 21
class DecoderTorch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)

        self.sinusoidal_embedding = SinusoidalEmbedding()

        #(1)
        decoder_block = nn.TransformerDecoderLayer(d_model=EMBED_DIM,
                                                   nhead=NUM_HEADS,
                                                   dim_feedforward=HIDDEN_DIM,
                                                   dropout=DROP_PROB,
                                                   batch_first=True)

        #(2)
        self.decoder_blocks = nn.TransformerDecoder(decoder_layer=decoder_block,
                                                    num_layers=NUM_DECODER_BLOCKS)

        self.linear = nn.Linear(in_features=EMBED_DIM,
                                out_features=VOCAB_SIZE)

    def forward(self, features, captions, tgt_mask):
        print(f"features\t\t: {features.shape}")
        print(f"captions\t\t: {captions.shape}")

        captions = self.embedding(captions)
        print(f"after embedding\t\t: {captions.shape}")

        captions = captions + self.sinusoidal_embedding()
        print(f"after sin embed\t\t: {captions.shape}")

        #(3)
        captions = self.decoder_blocks(tgt=captions,
                                       memory=features,
                                       tgt_mask=tgt_mask)
        print(f"after decoder blocks\t: {captions.shape}")

        captions = self.linear(captions)
        print(f"after linear\t\t: {captions.shape}")

        return captions

The main difference you will see in the __init__() method is the use of nn.TransformerDecoderLayer and nn.TransformerDecoder at lines #(1) and #(2), where the former is used to initialize a single decoder block, and the latter repeats the block multiple times. Next, the forward() method is mostly similar to the one in the Decoder class, except that the forward propagation through the decoder blocks is automatically repeated 4 times without needing to be put inside a loop (#(3)). One thing you need to pay attention to in the decoder_blocks layer is that the tensor coming from the encoder (features) must be passed as the argument for the memory parameter, while the tensor from the decoder itself (captions) has to be passed as the input to the tgt parameter.

The testing code for the DecoderTorch model below is essentially the same as the one written in Codeblock 20. Here you can see that this model also generates a final output tensor of size 30×10000.

# Codeblock 22
decoder_torch = DecoderTorch()

features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))

captions = decoder_torch(features, captions, look_ahead_mask)
# Codeblock 22 Output
features             : torch.Size([1, 576, 768])
captions             : torch.Size([1, 30])
after embedding      : torch.Size([1, 30, 768])
after sin embed      : torch.Size([1, 30, 768])
after decoder blocks : torch.Size([1, 30, 768])
after linear         : torch.Size([1, 30, 10000])

The entire CPTR model

Finally, it's time to put the encoder and the decoder parts we just created into a single class to actually construct the CPTR architecture. You can see in Codeblock 23 below that the implementation is very easy. All we need to do here is initialize the encoder (#(1)) and the decoder (#(2)) components, then pass the raw images and the corresponding caption ground truths, as well as the look-ahead mask, to the forward() method (#(3)). Additionally, you can also replace the Encoder and the Decoder with EncoderTorch and DecoderTorch, respectively.

# Codeblock 23
class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()  #EncoderTorch()  #(1)
        self.decoder = Decoder()  #DecoderTorch()  #(2)

    def forward(self, images, captions, look_ahead_mask):  #(3)
        print(f"images\t\t\t: {images.shape}")
        print(f"captions\t\t: {captions.shape}")

        features = self.encoder(images)
        print(f"after encoder\t\t: {features.shape}")

        captions = self.decoder(features, captions, look_ahead_mask)
        print(f"after decoder\t\t: {captions.shape}")

        return captions

We can test it by passing dummy tensors through it. See Codeblock 24 below for the details. In this case, images is basically just a tensor of random numbers with the dimension of 1×3×384×384 (#(1)), while captions is a tensor of size 1×30 containing random integers (#(2)).

# Codeblock 24
encoder_decoder = EncoderDecoder()

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)

captions = encoder_decoder(images, captions, look_ahead_mask)

Below is what the output looks like. We can see here that our input images and captions successfully went through all layers in the network, which basically means that the CPTR model we created is now ready to actually be trained on image captioning datasets.

# Codeblock 24 Output
images         : torch.Size([1, 3, 384, 384])
captions       : torch.Size([1, 30])
after encoder  : torch.Size([1, 576, 768])
after decoder  : torch.Size([1, 30, 10000])
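As a rough illustration of what that training could look like, below is a minimal sketch of a single teacher-forcing training step with cross-entropy loss. The way the target captions are shifted, the optimizer settings, and the use of dummy data are all assumptions on my part; they are not specified in the paper or in this article.

# Minimal single-training-step sketch (assumption: tokenized captions of length SEQ_LENGTH + 1,
# with captions[:, :-1] as the decoder input and captions[:, 1:] as the prediction target)
model     = EncoderDecoder()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images   = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH + 1))

inputs  = captions[:, :-1]                        # (1, 30)
targets = captions[:, 1:]                         # (1, 30), shifted by one position
mask    = create_mask(seq_length=SEQ_LENGTH)

logits = model(images, inputs, mask)              # (1, 30, 10000)
loss   = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()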

Ending

That was pretty much everything about the theory and implementation of the CaPtion TransformeR architecture. Let me know what deep learning architecture I should implement next. Feel free to leave a comment if you spot any mistakes in this article!

References

[1] Wei Liu et al. CPTR: Full Transformer Network for Image Captioning. arXiv. https://arxiv.org/pdf/2101.10804 [Accessed November 16, 2024].

[2] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. arXiv. https://arxiv.org/pdf/1411.4555 [Accessed December 3, 2024].

[3] Image originally created by the author based on: Alexey Dosovitskiy et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv. https://arxiv.org/pdf/2010.11929 [Accessed December 3, 2024].

[4] Image originally created by the author based on [6].

[5] Image originally created by the author based on [1].

[6] Ashish Vaswani et al. Attention Is All You Need. arXiv. https://arxiv.org/pdf/1706.03762 [Accessed December 3, 2024].
