What LayerNorm really does for Attention in Transformers

2 things, not 1…

Normalization via LayerNorm has been part and parcel of the Transformer architecture for a while. If you asked most AI practitioners why we have LayerNorm, the generic answer would be that we use LayerNorm to normalize the activations on the forward pass and the gradients on the backward pass. But that default…
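
As a quick refresher on that generic answer, here is a minimal sketch of what LayerNorm computes on the forward pass: each token's feature vector is normalized to zero mean and unit variance, then rescaled by learned parameters. This is an illustrative NumPy implementation, not code from the article; the variable names and shapes are assumptions for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance,
    then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)       # per-token mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)         # per-token variance over the feature dimension
    x_hat = (x - mean) / np.sqrt(var + eps)     # normalized activations
    return gamma * x_hat + beta                 # learned affine transform

# Example: a batch of 2 sequences, 4 tokens each, hidden size 8 (hypothetical shapes)
x = np.random.randn(2, 4, 8)
gamma, beta = np.ones(8), np.zeros(8)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # ~0 for every token
print(y.std(axis=-1))   # ~1 for every token
```

With gamma initialized to ones and beta to zeros, the output is simply the standardized activations; during training those two vectors let the network undo or reshape the normalization where that helps.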