1. Introduction
Ever since the introduction of the self-attention mechanism, Transformers have been the top choice for Natural Language Processing (NLP) tasks. Self-attention-based models are highly parallelizable and require substantially fewer parameters, making them far more computationally efficient, less prone to overfitting, and easier to fine-tune for domain-specific tasks [1]. Moreover, the key advantage of Transformers over earlier models (like RNNs, LSTMs, GRUs, and the other neural architectures that dominated the NLP domain before Transformers) is their ability to process input sequences of any length without losing context, by using the self-attention mechanism to focus on different parts of the input sequence, and on how those parts interact with the rest of the sequence, at different times [2]. Thanks to these qualities, Transformers have made it possible to train language models of unprecedented size, with more than 100B parameters, paving the way for the current state-of-the-art models like the Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT) [1].
However, in the field of computer vision, convolutional neural networks (CNNs) remain dominant in most, if not all, computer vision tasks. While there has been a growing body of research that attempts to apply self-attention-based architectures to computer vision tasks, very few have reliably outperformed CNNs with promising scalability [3]. The main challenge in applying the Transformer architecture to image-related tasks is that, by design, the self-attention mechanism, the core component of Transformers, has a quadratic time complexity with respect to sequence length, i.e. O(n²), as shown in Table I and as discussed further in Section 2.1. This is generally not an issue for NLP tasks, which use a relatively small number of tokens per input sequence (e.g., a 1,000-word paragraph will only have 1,000 input tokens, or a few more if sub-word units are used as tokens instead of full words). In computer vision, however, the input sequence (the image) can have a token count orders of magnitude greater than that of NLP input sequences. For instance, a relatively small 300 × 300 × 3 image can easily have up to 270,000 tokens and require a self-attention map with up to 72.9 billion parameters (270,000²) when self-attention is applied naively.
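As a quick sanity check on the numbers quoted above, the short Python snippet below reproduces the token count and attention-map size for a 300 × 300 × 3 image when every pixel value is naively treated as a token.

```python
# Back-of-the-envelope check: treat every pixel value of a small RGB image as one
# input token and see how large the resulting self-attention map becomes.
height, width, channels = 300, 300, 3

num_tokens = height * width * channels      # one token per pixel value
attention_entries = num_tokens ** 2         # self-attention map is n x n

print(f"tokens: {num_tokens:,}")                    # tokens: 270,000
print(f"attention entries: {attention_entries:,}")  # attention entries: 72,900,000,000
```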
Because of this, most of the research works that attempt to use self-attention-based architectures for computer vision tasks do so either by applying self-attention only locally, by using transformer blocks in combination with CNN layers, or by replacing only specific components of the CNN architecture while maintaining the overall structure of the network; never by using a pure Transformer alone [3]. The goal of Dosovitskiy et al. in their work, “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, is to show that it is indeed possible to perform image classification with self-attention alone, using the basic Transformer encoder architecture, while at the same time requiring significantly fewer computational resources to train and outperforming state-of-the-art convolutional neural networks like ResNet.
2. The Transformer
Transformers, introduced in the paper titled “Attention Is All You Need” by Vaswani et al. in 2017, are a class of neural network architectures that have revolutionized various natural language processing and machine learning tasks. A high-level view of the architecture is shown in Fig. 1.

Fig. 1. The Transformer model architecture, showing the encoder (left block) and decoder components (right block) [2].
Since their introduction, Transformers have served as the foundation for many state-of-the-art models in NLP, including BERT, GPT, and more. Fundamentally, they are designed to process sequential data, such as text, without the need for recurrent or convolutional layers [2]. They achieve this by relying heavily on a mechanism called self-attention.
The self-attention mechanism is the key innovation introduced in the paper; it allows the model to capture relationships between different elements of a given sequence by weighing the importance of each element in the sequence with respect to the other elements [2]. Say, for instance, you want to translate the following sentence:
“The animal didn’t cross the street because it was too tired.”
What does the word “it” in this particular sentence refer to? Is it referring to the street or the animal? For us humans, this is a trivial question to answer. But for an algorithm, this can be a complex task to perform. However, through the self-attention mechanism, the Transformer model is able to estimate the relative weight of each word with respect to all the other words in the sentence, allowing the model to associate the word “it” with “animal” in the context of our given sentence [4].

2.1. The Self-Attention Mechanism
A Transformer processes a given input sequence by passing each element through an encoder (or a stack of encoders) and a decoder (or a stack of decoders) block, in parallel [2]. Each encoder block comprises a self-attention block and a feed-forward neural network. Here, we only focus on the encoder block, as this is the component used by Dosovitskiy et al. in their Vision Transformer image classification model.
As is the case with general NLP applications, the first step in the encoding process is to turn each input word into a vector using an embedding layer, which converts our text data into a vector that represents the word in the vector space while retaining its contextual information. We then compile these individual word embedding vectors into a matrix X, where each row represents the embedding of one element of the input sequence. Then, we create three sets of vectors for each element in the input sequence; namely, Key (K), Query (Q), and Value (V). These sets are derived by multiplying matrix X with the corresponding trainable weight matrices W_Q, W_K, and W_V [2].
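Written out explicitly (a restatement of the step just described, using the same symbols, with X holding one embedding per row):

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
```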

Afterwards, we perform a matrix multiplication between Q and the transpose of K, divide the result by the square root of the dimensionality of K, and then apply a softmax function to normalize the output and generate weight values between 0 and 1 [2].
We will call this intermediate output the attention factor. This factor, shown in Eq. 4, represents the weight that each element in the sequence contributes to the calculation of the attention value at the current position (the word being processed). The idea behind the softmax operation is to amplify the words that the model thinks are relevant to the current position, and attenuate those that are irrelevant. For example, in Fig. 3, the input sentence “He later went to report Malaysia for one year” is passed into a BERT encoder unit to generate a heatmap that illustrates the contextual relationship of each word with the others. We can see that words that are deemed contextually associated produce higher weight values in their respective cells, visualized in a dark pink color, while words that are contextually unrelated have low weight values, represented in pale pink.
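Reconstructed from the description above (with d_k denoting the dimensionality of the key vectors), the attention factor referred to as Eq. 4 is the row-wise softmax of the scaled score matrix:

```latex
A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)
```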

Finally, we multiply the attention factor matrix by the value matrix V to compute the aggregated self-attention value matrix Z of this layer [2], where each row in Z represents the attention vector for the corresponding word in our input sequence. This aggregated value essentially bakes the “context” provided by the other words in the sentence into the current word being processed. The attention equation shown in Eq. 5 is sometimes also referred to as the scaled dot-product attention.
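For concreteness, here is a minimal NumPy sketch of the scaled dot-product attention just described; the toy shapes and variable names are illustrative and not taken from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n_tokens, n_tokens) score matrix
    # Row-wise softmax: each row sums to 1 and weighs every other token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n_tokens, d_v) aggregated values Z

# Toy example: 4 tokens, embedding dimension 8, projected to d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
Z = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape)  # (4, 8)
```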
2.2 The Multi-Headed Self-Attention
In the paper by Vaswani et al., the self-attention block is further augmented with a mechanism known as “multi-headed” self-attention, shown in Fig. 4. The idea is that, instead of relying on a single attention mechanism, the model employs multiple parallel attention “heads” (in the paper, Vaswani et al. used 8 parallel attention layers), where each of these attention heads learns different relationships and provides a unique perspective on the input sequence [2]. This improves the performance of the attention layer in two important ways:
First, it expands the ability of the model to focus on different positions within the sequence. Depending on the many variations involved in the initialization and training process, the calculated attention value for a given word (Eq. 5) can be dominated by certain other, unrelated words or phrases, or even by the word itself [4]. By computing multiple attention heads, the Transformer model has multiple opportunities to capture the correct contextual relationships, thus becoming more robust to variations and ambiguities in the input.

Second, since each of the Q, K, and V weight matrices is randomly initialized independently across all the attention heads, the training process yields several distinct sets of matrices (Eq. 5), which gives the Transformer multiple representation subspaces [4]. For example, one head might focus on syntactic relationships while another might attend to semantic meanings. Through this, the model is able to capture more diverse relationships within the data.
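In equation form (following the formulation in Vaswani et al. [2], with per-head projection matrices W_i^Q, W_i^K, W_i^V and an output projection W^O), the outputs of the individual heads are concatenated and projected back to the model dimension:

```latex
\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^Q,\; X W_i^K,\; X W_i^V\right), \qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^O
```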

3. The Vision Transformer
The fundamental innovation behind the Vision Transformer (ViT) revolves around the idea that images can be processed as sequences of tokens rather than grids of pixels. In traditional CNNs, input images are analyzed as overlapping tiles via a sliding convolutional filter, which are then processed hierarchically through a series of convolutional and pooling layers. In contrast, ViT divides the image into a set of non-overlapping patches, which are treated as the input sequence to a standard Transformer encoder unit.

Fig. 5. The Vision Transformer architecture; the Transformer encoder unit (right) is derived from Fig. 1 [3].
By defining the input tokens to the Transformer as non-overlapping image patches rather than individual pixels, we are therefore able to reduce the size of the attention map from (H × W)² to (n_ph × n_pw)², given n_ph ≪ H and n_pw ≪ W; where H and W are the height and width of the image, and n_ph and n_pw are the number of patches along the corresponding axes. By doing so, the model is also able to handle images of varying sizes without requiring extensive architectural changes [3].
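To make the savings concrete, here is a worked example with 16 × 16 pixel patches; the specific 224 × 224 input resolution is an assumption chosen for illustration:

```latex
n_{p_h} = n_{p_w} = \frac{224}{16} = 14
\;\Rightarrow\;
(n_{p_h} \times n_{p_w})^2 = 196^2 \approx 3.8 \times 10^{4}
\quad\text{vs.}\quad
(H \times W)^2 = (224 \times 224)^2 \approx 2.5 \times 10^{9}.
```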
These image patches are then linearly embedded into lower-dimensional vectors, similar to the word embedding step that produces matrix X in Section 2.1. Since Transformers contain neither recurrence nor convolutions, they lack the capacity to encode the positional information of the input tokens and are therefore permutation invariant [2]. Hence, as is done in NLP applications, a positional embedding is added to each linearly embedded vector before it is fed into the Transformer model, in order to encode the spatial information of the patches and ensure that the model understands the position of each token relative to the other tokens in the image. Moreover, an extra learnable classification embedding (the class token) is prepended to the input sequence. All of these (the linear embeddings of each 16 × 16 patch, the extra learnable classification embedding, and their corresponding positional embedding vectors) are passed through a standard Transformer encoder unit as discussed in Section 2. The output corresponding to the added learnable embedding is then used to perform classification via a standard MLP classifier head [3].
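Below is a minimal PyTorch sketch of this input pipeline (patch flattening, linear projection, a prepended learnable class token, added positional embeddings, and a standard Transformer encoder). The layer sizes and the use of torch.nn.TransformerEncoder are illustrative stand-ins rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=192, depth=6, num_heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size

        # Flattened patches -> lower-dimensional embeddings (the linear projection).
        self.patchify = nn.Unfold(kernel_size=patch_size, stride=patch_size)
        self.proj = nn.Linear(patch_dim, embed_dim)

        # Learnable class token and positional embeddings (one per token, incl. class token).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)   # classifier head (single layer here)

    def forward(self, images):                       # images: (B, C, H, W)
        patches = self.patchify(images)              # (B, patch_dim, num_patches)
        tokens = self.proj(patches.transpose(1, 2))  # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)               # standard Transformer encoder
        return self.head(encoded[:, 0])              # classify from the class-token output

logits = MinimalViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

With the default arguments, the 224 × 224 input is split into 196 patches, so the encoder sees a sequence of 197 tokens including the class token.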
4. The Results
In the paper, the two largest models, ViT-H/14 and ViT-L/16, each pre-trained on the JFT-300M dataset, are compared with state-of-the-art CNNs, as shown in Table II, including Big Transfer (BiT), which employs supervised transfer learning with large ResNets, and Noisy Student, a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed [3]. At the time of the study's publication, Noisy Student held the state-of-the-art position on ImageNet, while BiT-L did so on the other datasets used in the paper [3]. All models were trained on TPUv3 hardware, and the number of TPUv3-core-days it took to pre-train each model was recorded.

We can see from the table that the Vision Transformer models pre-trained on the JFT-300M dataset outperform the ResNet-based baseline models on all datasets, while at the same time requiring significantly fewer computational resources (TPUv3-core-days) to pre-train. A secondary ViT-L/16 model was also pre-trained on the much smaller public ImageNet-21k dataset, and is shown to also perform relatively well while requiring up to 97% fewer computational resources than its state-of-the-art counterparts [3].
Fig. 6 compares the performance of the BiT and ViT models (measured using the ImageNet Top-1 accuracy metric) across pre-training datasets of various sizes. We see that the ViT-Large models underperform compared to the base models on small datasets like ImageNet, and perform roughly equivalently on ImageNet-21k. However, when pre-trained on larger datasets like JFT-300M, ViT clearly outperforms the base models [3].

Further exploring how the size of the dataset relates to model performance, the authors trained the models on random subsets of the JFT dataset: 9M, 30M, 90M, and the full JFT-300M. No additional regularization was added on the smaller subsets, in order to assess the intrinsic model properties (and not the effect of regularization) [3]. Fig. 7 shows that ViT models overfit more than ResNets on smaller datasets. The data shows that ResNets perform better with smaller pre-training datasets but plateau sooner than ViT, which then outperforms them with larger pre-training. The authors conclude that on smaller datasets, convolutional inductive biases play a key role in CNN performance, and ViT models lack these biases. However, with large enough data, learning the relevant patterns directly outweighs the inductive biases, and this is where ViT excels [3].

Finally, the authors analyzed the models' transfer performance from JFT-300M versus the total pre-training compute allocated, across the different architectures, as shown in Fig. 8. Here, we see that Vision Transformers outperform ResNets for the same computational budget across the board. ViT uses roughly 2-4 times less compute to attain performance similar to ResNet [3]. Implementing a hybrid model does improve performance for smaller model sizes, but the discrepancy vanishes for larger models, which the authors find surprising, since the initial hypothesis was that convolutional local feature processing should be able to assist ViT regardless of compute size [3].

4.1 What does the ViT model learn?
In order to understand how ViT processes image data, it is important to analyze its internal representations. In Section 3, we saw that the input patches generated from the image are fed into a linear embedding layer that projects each 16×16 patch into a lower-dimensional vector space, and the resulting embedded representations are then combined with positional embeddings. Fig. 9 shows that the model indeed learns to encode the relative position of each patch in the image. The authors computed the cosine similarity between the learned positional embeddings of the patches [3]. High cosine similarity values emerge in the region of each position embedding's similarity map that corresponds to the patch's own location; e.g., the top-right patch (row 1, col 7) has correspondingly high cosine similarity values (yellow pixels) in the top-right area of its similarity map [3].
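The analysis itself is straightforward to reproduce in a few lines; in this sketch, pos_embed is a hypothetical stand-in for the learned positional embedding matrix (one row per patch, class token excluded), not the authors' actual variable.

```python
import numpy as np

def position_embedding_similarity(pos_embed):
    """Cosine similarity between every pair of learned patch position embeddings."""
    normed = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
    return normed @ normed.T        # (num_patches, num_patches), values in [-1, 1]

# Each row i can then be reshaped to the patch grid (e.g. 7 x 7 as in Fig. 9) and
# plotted as a heatmap: patches close to patch i should show high similarity.
sim = position_embedding_similarity(np.random.randn(49, 64))   # dummy embeddings
print(sim.shape)  # (49, 49)
```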

Meanwhile, Fig. 10 (left) shows the top principal components of the learned embedding filters that are applied to the raw image patches prior to the addition of the positional embeddings. What is interesting to me is how similar these are to the learned hidden-layer representations you get from convolutional neural networks, an example of which is shown in the same figure (right) using the AlexNet architecture.

Fig. 10. Top principal components of the learned embedding filters of ViT (left) [3] and the first layer of filters from AlexNet (right) [6].
By design, the self-attention mechanism should allow ViT to integrate information across the entire image, even at the lowest layer, effectively giving ViTs a global receptive field from the start. We can, to some extent, see this effect in Fig. 10, where the learned embedding filters capture lower-level features like lines and grids, as well as higher-level patterns combining lines and color blobs. This is in contrast with CNNs, whose receptive field at the lowest layer is very small (because the convolution operation is applied locally, only to the area defined by the filter size), and only widens towards the deeper convolutions, as further applications of convolutions extract context from the combined information produced by the lower layers. The authors further tested this by measuring the mean attention distance, which is computed as the "average distance in the image space across which information is integrated, based on the attention weights [3]." The results are shown in Fig. 11.

From the figure, we can see that even at the very lowest layers of the network, some heads already attend to most of the image (as indicated by data points with a high mean attention distance at low values of network depth), demonstrating the ability of the ViT model to integrate image information globally even in its lowest layers.
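A minimal sketch of how such a mean attention distance could be computed for one layer is shown below; the shapes, variable names, and the exclusion of the class token are assumptions, and the authors' exact procedure may differ in such details.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (num_heads, num_patches, num_patches) attention weights for one layer.
    Returns the attention-weighted average pixel distance per head."""
    coords = np.array([(r, c) for r in range(grid_size) for c in range(grid_size)])
    coords = coords * patch_size                                # patch positions in pixels
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Weight each query->key distance by its attention weight, then average over queries.
    return (attn * dists[None]).sum(axis=-1).mean(axis=-1)      # (num_heads,)

attn = np.random.rand(12, 196, 196)
attn /= attn.sum(axis=-1, keepdims=True)                        # rows sum to 1, like softmax output
print(mean_attention_distance(attn, grid_size=14))
```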
Finally, the authors also computed the attention maps from the output token to the input space using Attention Rollout: averaging the attention weights of ViT-L/16 across all heads and then recursively multiplying the weight matrices of all layers. This results in a nice visualization of what the output layer attends to prior to classification, shown in Fig. 12 [3].
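A compact sketch of the Attention Rollout computation as described (average over heads, then recursively multiply across layers) is given below. The identity-mixing step, which accounts for the residual connections, is part of the usual Attention Rollout formulation rather than something stated in the text above, so it is flagged as an assumption in the comments.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer weights, each (num_heads, num_tokens, num_tokens).
    Returns a (num_tokens, num_tokens) map of how much each output token attends
    to each input token, accumulated across all layers."""
    rollout = None
    for layer_attn in attentions:
        A = layer_attn.mean(axis=0)                         # average over heads
        # Assumption: mix in the identity and renormalize to account for the
        # residual connections, as in the usual Attention Rollout formulation.
        A = 0.5 * A + 0.5 * np.eye(A.shape[-1])
        A /= A.sum(axis=-1, keepdims=True)
        rollout = A if rollout is None else A @ rollout     # recursively multiply layers
    return rollout

layers = [np.random.rand(16, 197, 197) for _ in range(24)]  # e.g. ViT-L/16: 24 layers, 16 heads
layers = [a / a.sum(axis=-1, keepdims=True) for a in layers]
rollout = attention_rollout(layers)
print(rollout[0, 1:].shape)  # class-token row over the 196 patches -> reshape to a 14 x 14 mask
```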

5. So, is ViT the future of Computer Vision?
The Vision Transformer (ViT) introduced by Dosovitskiy et al. in the research study showcased in this article is a groundbreaking architecture for computer vision tasks. Unlike previous methods that introduce image-specific biases, ViT treats an image as a sequence of patches and processes it with a standard Transformer encoder, much as Transformers are used in NLP. This simple yet scalable strategy, combined with pre-training on extensive datasets, has yielded impressive results, as discussed in Section 4: ViT either matches or surpasses the state of the art on numerous image classification datasets (Figs. 6, 7, and 8), all while remaining cost-effective to pre-train [3].
However, like any technology, it has its limitations. First, in order to perform well, ViTs require a very large amount of training data that not everyone can access at the required scale, especially compared to traditional CNNs. The authors of the paper used the JFT-300M dataset, which is a limited-access dataset managed by Google [7]. The dominant way around this is to use a model pre-trained on the large dataset and then fine-tune it for smaller (downstream) tasks. Second, however, there are still relatively few pre-trained ViT models available compared to the selection of pre-trained CNN models, which limits the transfer-learning options for these smaller, much more specific computer vision tasks. Third, by design, ViTs process images as sequences of tokens (discussed in Section 3), which means they do not naturally capture spatial information [3]. While adding positional embeddings does help remedy this lack of spatial context, ViTs may not perform as well as CNNs on image localization tasks, given that the convolutional layers of CNNs are excellent at capturing these spatial relationships.
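For readers taking the transfer-learning route mentioned above, here is a minimal sketch using a torchvision pre-trained ViT; the weight enum and attribute names follow recent torchvision releases and are assumptions about your environment rather than anything from the paper.

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT-B/16 pre-trained on ImageNet and swap the classifier head for a
# smaller downstream task (attribute names follow recent torchvision versions).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                                      # freeze the pre-trained backbone
model.heads.head = nn.Linear(model.heads.head.in_features, 10)   # new 10-class head

# ...then train only the new head (or unfreeze and fine-tune end to end) as usual.
```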
Moving forward, the authors mention the need to further study scaling ViT to other computer vision tasks such as detection and segmentation, as well as other training methods like self-supervised pre-training [3]. Future research may focus on making ViTs more efficient and scalable, such as developing smaller and more lightweight ViT architectures that still deliver the same competitive performance. Moreover, providing better accessibility by creating and sharing a wider range of pre-trained ViT models for various tasks and domains would further facilitate the development of this technology in the future.
References
- [1] N. Pogeant, “Transformers - the NLP revolution,” Medium, https://medium.com/mlearning-ai/transformers-the-nlp-revolution-5c3b6123cfb4 (accessed Sep. 23, 2023).
- [2] A. Vaswani et al., “Attention is all you need,” NIPS 2017.
- [3] A. Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
- [4] X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, “Large-scale multi-modal pre-trained models: A comprehensive survey,” Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023, doi: 10.1007/s11633-022-1410-8.
- [5] H. Wang, “Addressing Syntax-Based Semantic Complementation: Incorporating Entity and Soft Dependency Constraints into Metonymy Resolution,” Scientific Figure on ResearchGate, https://www.researchgate.net/figure/Attention-matrix-visualization-a-weights-in-BERT-Encoding-Unit-Entity-BERT-b_fig5_359215965 (accessed Sep. 24, 2023).
- [6] A. Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS 2012.
- [7] C. Sun et al., “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,” Google Research, ICCV 2017.