Pipeline parallelism splits a model “vertically” by layer. It’s also possible to “horizontally” split certain operations inside a layer, which is normally called Tensor Parallel training. For a lot of modern models (akin to the Transformer), the...
We show that autoregressive language models can learn to infill text after we apply a simple transformation to the dataset, which simply moves a span of text from the center of a document to...