As the title suggests, in this article I’m going to implement the Transformer architecture from scratch with PyTorch (yes, literally from scratch). Before we get into it, let me provide a brief overview of the architecture. The Transformer was first introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017 [1]. This neural network model is designed to perform seq2seq (sequence-to-sequence) tasks, where it accepts a sequence as input and is expected to return another sequence as output, as in machine translation and question answering.
Before the Transformer was introduced, we commonly used RNN-based models such as LSTM or GRU to perform seq2seq tasks. These models are indeed capable of capturing context, yet they do so in a sequential manner. This approach makes it difficult to capture long-range dependencies, especially when the relevant context lies far behind the current timestep. In contrast, the Transformer can freely attend to any part of the sequence it considers important, without being constrained by sequential processing.
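To make the contrast concrete, here is a minimal sketch rather than the implementation built later in the article. The tensor sizes are arbitrary illustrative choices, and the attention part reuses the raw input as query, key, and value with no learned projections, which is a simplifying assumption. The recurrent loop has to pass information along one timestep at a time, while the attention computation relates every position to every other position in a single matrix product.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not values from the paper).
seq_len, d_model = 6, 8
x = torch.randn(1, seq_len, d_model)  # (batch, sequence, features)

# RNN-style processing: the hidden state is updated one timestep at a time,
# so information from the first token reaches the last token only after
# passing through every intermediate update.
gru_cell = torch.nn.GRUCell(d_model, d_model)
h = torch.zeros(1, d_model)
for t in range(seq_len):
    h = gru_cell(x[:, t], h)

# Attention-style processing: every position scores every other position
# directly. Here x plays the role of query, key, and value (a simplification).
scores = x @ x.transpose(-2, -1) / d_model ** 0.5  # (1, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)                # each row sums to 1
context = weights @ x                              # (1, seq_len, d_model)
```

Row `i` of `weights` shows how strongly position `i` draws on each position in the sequence, including distant ones, without stepping through the positions in between.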