Pipeline parallelism splits a model “vertically” by layer. It’s also possible to “horizontally” split certain operations within a layer, which is usually called *Tensor Parallel* training. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix with a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns; it’s possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum up the results. With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results.
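The two sharding strategies can be sketched with NumPy, using a list of arrays to stand in for shards held on separate GPUs (the names and shapes below are illustrative, not from any particular framework):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activation batch (batch x features)
W = rng.standard_normal((8, 6))   # weight matrix (features x outputs)
n_shards = 2

# Strategy 1: column sharding. Each "GPU" holds a slice of W's columns and
# computes an independent subset of the output dot products; the results
# are concatenated (an all-gather in practice).
col_shards = np.split(W, n_shards, axis=1)
Y_col = np.concatenate([X @ shard for shard in col_shards], axis=1)

# Strategy 2: row sharding. Each "GPU" holds a slice of W's rows (and the
# matching slice of X's columns), computes a partial contribution to every
# output dot product, and the partials are summed (an all-reduce in practice).
row_shards = np.split(W, n_shards, axis=0)
x_shards = np.split(X, n_shards, axis=1)
Y_row = sum(x @ w for x, w in zip(x_shards, row_shards))

# Both strategies reproduce the unsharded product.
assert np.allclose(Y_col, X @ W)
assert np.allclose(Y_row, X @ W)
```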

One example is Megatron-LM, which parallelizes matrix multiplications within the Transformer’s self-attention and MLP layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.
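A sketch in the spirit of Megatron-LM’s MLP sharding: the first weight matrix is split by columns and the second by rows, so each device can apply the elementwise nonlinearity locally and only one sum (an all-reduce) is needed at the end. This is a simplified NumPy illustration (ReLU stands in for the actual activation; shapes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))    # input activations (batch x hidden)
W1 = rng.standard_normal((8, 16))  # first MLP weight
W2 = rng.standard_normal((16, 8))  # second MLP weight
n_shards = 2

W1_cols = np.split(W1, n_shards, axis=1)  # column-parallel first layer
W2_rows = np.split(W2, n_shards, axis=0)  # row-parallel second layer

# Each "GPU" i computes relu(X @ W1_i) @ W2_i with no communication between
# the two matmuls: the elementwise nonlinearity acts independently on each
# column shard, so only the final partial sums must be combined.
partials = [np.maximum(X @ w1, 0) @ w2 for w1, w2 in zip(W1_cols, W2_rows)]
Y = sum(partials)  # one all-reduce combines the partial sums

assert np.allclose(Y, np.maximum(X @ W1, 0) @ W2)
```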

Sometimes the input to the network can be parallelized across a dimension with a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, where an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed with more granularly-sized examples.
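A minimal sketch of the idea, assuming a position-wise operation (one that needs no communication across time steps): the sequence is split along the time axis, chunks are processed independently, and peak activation memory scales with the chunk size rather than the full sequence length.

```python
import numpy as np

rng = np.random.default_rng(2)
seq = rng.standard_normal((12, 8))  # (time, features) activations for one example
n_chunks = 4

def pointwise_layer(x):
    # Stand-in for a position-wise op (e.g. a normalization or per-step MLP);
    # such ops treat each time step independently, so chunks can be processed
    # in parallel or one at a time.
    return np.tanh(x)

# Split across time and process each chunk separately, then reassemble.
chunks = np.split(seq, n_chunks, axis=0)
out = np.concatenate([pointwise_layer(c) for c in chunks], axis=0)

# Chunked processing matches running the layer on the whole sequence.
assert np.allclose(out, pointwise_layer(seq))
```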
