We can also apply multiple consecutive transformations to a vector. So if we have two transformations represented by the matrices A1 and A2, we apply them consecutively as A2(A1(vector)). But that is different from applying them in the reverse order, A1(A2(vector)), because matrix multiplication is not commutative.
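A minimal NumPy sketch may make this concrete: applying A1 and then A2 to a vector is the same as multiplying by the single product matrix A2 @ A1, and swapping the order gives a different result. The specific matrices here (a rotation and a shear) are illustrative choices, not taken from the text.

```python
import numpy as np

# Two example transformations: a 90-degree rotation and a horizontal shear
A1 = np.array([[0.0, -1.0],
               [1.0,  0.0]])   # rotate 90 degrees counter-clockwise
A2 = np.array([[1.0,  1.0],
               [0.0,  1.0]])   # shear along the x-axis

v = np.array([1.0, 2.0])

# Applying A1 first and then A2 equals multiplying by the product A2 @ A1
assert np.allclose(A2 @ (A1 @ v), (A2 @ A1) @ v)

# Reversing the order gives a different vector: matrix products do not commute
print(A2 @ (A1 @ v))  # [-1.  1.]
print(A1 @ (A2 @ v))  # [-2.  3.]
```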
The complex math behind transformer models, in easy words

Inside the encoder, there are two add & norm layers:
- one connects the input of the multi-head attention sub-layer to its output
- one connects the input of the feedforward network to its output
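Each add & norm layer is a residual connection followed by layer normalization: the sub-layer's input is added to its output, and the sum is normalized. Below is a minimal NumPy sketch of that step, under some assumptions not stated in the text: the learnable gain and bias of layer normalization are omitted, the attention and feedforward outputs are stand-in random tensors, and the shapes (sequence length, model dimension) are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance
    # (learnable gain/bias omitted for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # Residual connection: add the sub-layer's input to its output,
    # then layer-normalize the sum
    return layer_norm(sublayer_input + sublayer_output)

# Toy usage with shapes (sequence length, model dimension) = (4, 8);
# the random arrays stand in for real sub-layer outputs
x = np.random.randn(4, 8)
attn_out = np.random.randn(4, 8)   # stand-in for multi-head attention output
y = add_and_norm(x, attn_out)      # first add & norm layer
ffn_out = np.random.randn(4, 8)    # stand-in for feedforward network output
z = add_and_norm(y, ffn_out)       # second add & norm layer
print(z.shape)                     # (4, 8)
```

The residual path lets gradients flow past each sub-layer unchanged, which is one reason deep transformer stacks remain trainable.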