Hi everyone! For those who don't know me yet, my name is Francois, and I'm a Research Scientist at Meta. I have a passion for explaining advanced AI concepts and making them more accessible.
Today, let's dive into one of the most significant contributions in the field of Computer Vision: the Vision Transformer (ViT).
The Vision Transformer was introduced by Alexey Dosovitskiy et al. (Google Brain) in 2021 in the paper An Image is Worth 16×16 Words. At the time, Transformers had proven to be the key to unlocking great performance on NLP tasks, having been introduced in the seminal 2017 paper Attention Is All You Need.
Between 2017 and 2021, there were several attempts to integrate the attention mechanism into Convolutional Neural Networks (CNNs). However, these were mostly hybrid models (combining CNN layers with attention layers) and lacked scalability. Google addressed this by eliminating convolutions entirely and leveraging its computational resources to scale the model.