The Reformer – Pushing the boundaries of language modeling



Patrick von Platen





How the Reformer uses less than 8GB of RAM to train on sequences of half a million tokens

The Reformer model as introduced by Kitaev, Kaiser et al. (2020) is one of the most memory-efficient transformer models for long sequence modeling as of today.

Recently, long sequence modeling has experienced a surge of interest, as can be seen by the many submissions from this year alone: Beltagy et al. (2020), Roy et al. (2020), Tay et al., Wang et al., to name just a few.
The motivation behind long sequence modeling is that many tasks in NLP, e.g. summarization and question answering, require the model to process longer input sequences than models such as BERT are able to handle. In tasks that require the model to process a large input sequence, long sequence models do not have to cut the input sequence to avoid memory overflow and thus have been shown to outperform standard “BERT”-like models, cf. Beltagy et al. (2020).

The Reformer pushes the limit of long sequence modeling by its ability to process up to half a million tokens at once, as shown in this demo. As a comparison, a conventional bert-base-uncased model limits the input length to only 512 tokens. In Reformer, each part of the standard transformer architecture is re-engineered to optimize for minimal memory requirement without a significant drop in performance.

The memory improvements can be attributed to 4 features which the Reformer authors introduced to the transformer world:

  1. Reformer Self-Attention Layer: How to efficiently implement self-attention without being restricted to a local context?
  2. Chunked Feed Forward Layers: How to get a better time-memory trade-off for large feed forward layers?
  3. Reversible Residual Layers: How to drastically reduce memory consumption in training by a smart residual architecture?
  4. Axial Positional Encodings: How to make positional encodings usable for extremely large input sequences?

The goal of this blog post is to give the reader an in-depth understanding of each of the 4 Reformer features mentioned above. While the explanations are focused on the Reformer, the reader should get a better intuition under which circumstances each of the 4 features can be effective for other transformer models as well.
The 4 sections are only loosely connected, so they can very well be read individually.

Reformer is part of the 🤗Transformers library. For all users of the Reformer, it is advised to go through this very detailed blog post to better understand how the model works and how to correctly set its configuration. All equations are accompanied by their equivalent name for the Reformer config, e.g. config., so that the reader can quickly relate to the official docs and configuration file.

Note: Axial Positional Encodings are not explained in the official Reformer paper, but are extensively used in the official codebase. This blog post gives the first in-depth explanation of Axial Positional Encodings.



1. Reformer Self-Attention Layer

Reformer uses two kinds of special self-attention layers: local self-attention layers and Locality Sensitive Hashing (LSH) self-attention layers.

To better introduce these new self-attention layers, we will briefly recap
conventional self-attention as introduced in Vaswani et al. 2017.

This blog post uses the same notation and coloring as the popular blog post The Illustrated Transformer, so the reader is strongly advised to read that blog first.

Important: While Reformer was originally introduced for causal self-attention, it can very well be used for bi-directional self-attention as well. In this post, Reformer’s self-attention is presented for bidirectional self-attention.



Recap Global Self-Attention

The core of every Transformer model is the self-attention layer. To recap the conventional self-attention layer, which we refer to here as the global self-attention layer, let us assume we apply a transformer layer on the embedding vector sequence $\mathbf{X} = \mathbf{x}_1, \ldots, \mathbf{x}_n$

In short, a global self-attention layer projects $\mathbf{X}$ to the query, key and value matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V}$

Visually, we can illustrate this operation as follows for $n=16, d_h=3$

Note that for all visualizations batch_size and config.num_attention_heads are assumed to be 1. Some vectors, e.g. $\mathbf{x}_3$

Important to remember is that for each output vector $\mathbf{z}_{i}$

This is also the reason why bert-base-cased has a config.max_position_embedding_size of only 512.
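To make the recap concrete, here is a minimal pure-Python sketch of single-head global self-attention for the running example ($n=16$, $d_h=3$). The helper names (`matmul`, `softmax`, `random_matrix`, `global_self_attention`) and the random projection weights are illustrative assumptions, not part of the Reformer code, and the usual scaling of the scores is omitted for brevity. The point to notice is the $n \times n$ score matrix, which is the quadratic term discussed above.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of Gaussian values."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def global_self_attention(x, w_q, w_k, w_v):
    """Return (Z, attention weights) for one head and batch size 1."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # n x n matrix: the O(n^2) memory term
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

random.seed(0)
n, d_h = 16, 3
x = random_matrix(n, d_h)
w_q, w_k, w_v = random_matrix(d_h, d_h), random_matrix(d_h, d_h), random_matrix(d_h, d_h)
z, weights = global_self_attention(x, w_q, w_k, w_v)
print(len(weights), len(weights[0]))  # 16 16
```

Every output vector depends on one full row of `weights`, so doubling the sequence length quadruples the number of stored scores.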



Local Self-Attention

Local self-attention is the obvious solution to reducing the $\mathcal{O}(n^2)$

Let’s take our input sequence for $n=16, d_h=3$

Assuming $l_{c} = 4, n_{c} = 4$

As can be seen, the attention operation is applied to each chunk $\mathbf{X}_{1:4}, \mathbf{X}_{5:8}, \mathbf{X}_{9:12}, \mathbf{X}_{13:16}$

A simple remedy is to augment each chunk with config.local_num_chunks_before, i.e. $n_{p}$

$$\mathbf{Z}^{\text{loc}} = \left[\mathbf{Z}_{1:l_{c}}^{\text{loc}}, \ldots, \mathbf{Z}_{(n_{c} - 1) * l_{c} : n_{c} * l_{c}}^{\text{loc}}\right],$$

Okay, this formula looks quite complicated. Let’s make it easier.
In Reformer’s self-attention layers $n_{a}$

$$\mathbf{Z}_{1:l_{c}}^{\text{loc}} = \text{SelfAttn}(\mathbf{X}_{-l_{c} + 1: l_{c}})\left[l_{c}:\right]$$

We notice that we have a circular relationship so that the first segment can attend to the last segment as well. Let’s illustrate this slightly enhanced local attention again. First, we apply self-attention within each windowed segment and keep only the central output segment.

Finally, the relevant output is concatenated to $\mathbf{Z}^{\text{loc}}$

Note that local self-attention is implemented efficiently so that no output is computed and subsequently “thrown out”, as shown here for illustration purposes by the red cross.

It is important to note here that extending the input vectors for each chunked self-attention function allows each single output vector $\mathbf{z}_{i}$

The gain in memory consumption is quite obvious: The $\mathcal{O}(n^2)$
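The windowing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, positions are 0-based, each chunk is extended by $n_p$ previous chunks and by no following chunks, and indices wrap around so that the first chunk sees the last one.

```python
def local_attention_windows(n, l_c, n_p=1):
    """For each chunk, list the key positions (0-based) its queries attend to.

    Each chunk of length l_c is extended by the n_p chunks before it; indices
    wrap around, so the first chunk also attends to the last chunk.
    """
    assert n % l_c == 0, "sequence length must be a multiple of the chunk length"
    n_c = n // l_c
    windows = []
    for c in range(n_c):
        start = (c - n_p) * l_c
        windows.append([(start + j) % n for j in range((n_p + 1) * l_c)])
    return windows

def score_entries(n, l_c, n_p=1):
    """Number of query-key scores computed by this chunked local attention."""
    return (n // l_c) * l_c * (n_p + 1) * l_c

windows = local_attention_windows(n=16, l_c=4, n_p=1)
print(windows[0])                     # [12, 13, 14, 15, 0, 1, 2, 3]
print(score_entries(16, 4), 16 * 16)  # 128 256
```

For the toy example the saving is only a factor of two, but the number of scores grows linearly in $n$ for a fixed chunk length instead of quadratically.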

This enhanced local self-attention is better than the vanilla local self-attention architecture but still has a major drawback in that every input vector can only attend to a local context of predefined size. For NLP tasks that do not require the transformer model to learn long-range dependencies between the input vectors, which arguably include speech recognition, named entity recognition and causal language modeling of short sentences, this might not be a big issue. Many NLP tasks do require the model to learn long-range dependencies, so that local self-attention could lead to significant performance degradation, e.g.

  • Question-answering: the model has to learn the relationship between the question tokens and relevant answer tokens, which will most likely not be in the same local range
  • Multiple-choice: the model has to compare multiple answer token segments to each other, which are usually separated by a significant length
  • Summarization: the model has to learn the relationship between a long sequence of context tokens and a shorter sequence of summary tokens, whereas the relevant relationships between context and summary can most likely not be captured by local self-attention
  • etc…

Local self-attention on its own is most likely not sufficient for the transformer model to learn the relevant relationships of input vectors (tokens) to each other.

Therefore, Reformer additionally employs an efficient self-attention layer that approximates global self-attention, called LSH self-attention.



LSH Self-Attention

Alright, now that we have understood how local self-attention works, we can take a stab at the probably most innovative piece of Reformer: Locality sensitive hashing (LSH) Self-Attention.

The premise of LSH self-attention is to be more or less as efficient as local self-attention while approximating global self-attention.

LSH self-attention relies on the LSH algorithm as presented in Andoni et al (2015), hence its name.

The idea behind LSH self-attention is based on the insight that if $n$ is large, the softmax applied on the $\mathbf{Q}\mathbf{K}^T$

Let’s explain this in more detail.
Let $\mathbf{k}_{i} \in \mathbf{K} = \left[\mathbf{k}_1, \ldots, \mathbf{k}_n \right]^T$

First, the authors of Reformer notice that sharing the query and key projections: $\mathbf{Q} = \mathbf{K}$

For each set of indices $C_{m}$

Second, the authors make use of the LSH algorithm to cluster the query vectors into a predefined number of buckets $n_{b}$

Visually, we can illustrate this as follows for our original example:

Third, it can be noted that having clustered all query vectors in $n_{b}$

Let’s clarify with our example input vectors $\mathbf{X} = \mathbf{x}_1, \ldots, \mathbf{x}_{16}$

The self-attention mechanism should be applied for each cluster individually so that for each cluster $\mathcal{C}_m$

Let’s illustrate this again for our example.

As can be seen, the self-attention function operates on different sizes of matrices, which is suboptimal for efficient batching on GPU and TPU.

To overcome this problem, the permuted input can be chunked the same way it is done for local attention so that each chunk is of size config.lsh_chunk_length. By chunking the permuted input, a bucket might be split into two different chunks. To remedy this problem, in LSH self-attention each chunk attends to its previous chunk (config.lsh_num_chunks_before=1) in addition to itself, the same way local self-attention does (config.lsh_num_chunks_after is usually set to 0). This way, we can be assured that all vectors in a bucket attend to each other with a high probability

All in all for all chunks $k \in \{1, \ldots, n_{c}\}$

$$\mathbf{Z'}_{l_{c} * k + 1:l_{c} * (k + 1)}^{\text{LSH}} = \text{SelfAttn}_{\mathbf{Q} = \mathbf{K}}(\mathbf{X'}_{l_{c} * k + 1: l_{c} * (k + 1)})\left[l_{c}:\right]$$

with $\mathbf{X'}$

The permuted vectors $\mathbf{X'}$

Finally, the output $\mathbf{Z'}^{\text{LSH}}$

One important feature to mention here as well is that the accuracy of LSH self-attention can be improved by running LSH self-attention config.num_hashes, e.g. $n_{h}$

Great. That’s it. Now we know how LSH self-attention works in Reformer.

Regarding the memory complexity, we now have two terms that compete with each other to be the memory bottleneck: the dot-product: $\mathcal{O}(n_{h} * n_{c} * l_{c}^2) = \mathcal{O}(n * n_{h} * l_{c})$
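As a quick sanity check of the dot-product term, the following sketch plugs in illustrative numbers. The values $n = 16384$, $l_c = 64$ and $n_h = 1$ are assumptions chosen for the example, and the constant factor from also attending to the previous chunk is ignored, as in the big-O expression.

```python
def lsh_score_entries(n, l_c, n_h):
    """Entries in the chunked dot-product tensor: n_h * n_c * l_c^2."""
    n_c = n // l_c
    return n_h * n_c * l_c ** 2

n, l_c, n_h = 16384, 64, 1
print(lsh_score_entries(n, l_c, n_h))  # 1048576, equal to n * n_h * l_c
print(n * n)                           # 268435456 for global attention
```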

Let’s recap quickly what we have gone through above:

  1. We want to approximate global attention using the knowledge that the softmax operation only puts significant weights on very few key vectors.
  2. If key vectors are equal to query vectors, this means that for each query vector $\mathbf{q}_{i}$
  3. This relationship works in both ways, meaning if $\mathbf{q}_{j}$
  4. We apply local self-attention on the permuted input and re-order the output to its original permutation.
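The bucketing and re-ordering steps can be sketched as follows. The hash used here (random projections followed by an argmax over the projections and their negations) follows the angular LSH scheme described in the Reformer paper; it is not spelled out in the text above, so treat it, along with all function names, as an assumption of this sketch rather than the library's implementation.

```python
import random

def lsh_bucket(vec, rotations):
    """Angular LSH: project, then take the argmax over [proj, -proj]."""
    proj = [sum(v * r for v, r in zip(vec, direction)) for direction in rotations]
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=scores.__getitem__)

def sort_by_bucket(vectors, n_buckets, seed=0):
    """Return (buckets, permutation); the permutation groups equal buckets."""
    rng = random.Random(seed)
    d = len(vectors[0])
    # n_buckets / 2 random directions, each a d-dimensional Gaussian vector
    rotations = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_buckets // 2)]
    buckets = [lsh_bucket(v, rotations) for v in vectors]
    # sorted() is stable, so positions keep their order inside a bucket
    permutation = sorted(range(len(vectors)), key=lambda i: buckets[i])
    return buckets, permutation

def undo_permutation(permuted, permutation):
    """Put permuted items back into their original order."""
    restored = [None] * len(permuted)
    for new_pos, old_pos in enumerate(permutation):
        restored[old_pos] = permuted[new_pos]
    return restored

rng = random.Random(1)
queries = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(16)]
buckets, perm = sort_by_bucket(queries, n_buckets=4)
permuted = [queries[i] for i in perm]  # X' in the notation above
print([buckets[i] for i in perm])      # bucket ids are now contiguous
```

After this step, the chunked local attention from the previous section can be applied to `permuted`, and `undo_permutation` restores the original token order of the outputs.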




Benchmark

Benchmark tools were recently added to Transformers; see here for a more detailed explanation.

To show how much memory can be saved using “local” + “LSH” self-attention, the Reformer model google/reformer-enwik8 is benchmarked for different local_attn_chunk_length and lsh_attn_chunk_length. The default configuration and usage of the google/reformer-enwik8 model can be checked in more detail here.

Let’s first do some necessary imports and installs.

#@title Installs and Imports
# pip installs
!pip -qq install git+https://github.com/huggingface/transformers.git
!pip install -qq py3nvml

from transformers import ReformerConfig, PyTorchBenchmark, PyTorchBenchmarkArguments

First, let’s benchmark the memory usage of the Reformer model using global self-attention. This can be achieved by setting lsh_attn_chunk_length = local_attn_chunk_length = 16386 (the value used in the configuration below), so that for all input sequences shorter than or equal to that length the model automatically switches to global self-attention.

config = ReformerConfig.from_pretrained("google/reformer-enwik8", lsh_attn_chunk_length=16386, local_attn_chunk_length=16386, lsh_num_chunks_before=0, local_num_chunks_before=0)
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[2048, 4096, 8192, 16386], batch_sizes=[1], models=["Reformer"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config], args=benchmark_args)
result = benchmark.run()



1 / 1
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 8.87 GiB already allocated; 1.92 GiB free; 8.88 GiB reserved in total by PyTorch)

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
           Reformer                  1              2048            1465     
           Reformer                  1              4096            2757     
           Reformer                  1              8192            7893     
           Reformer                  1             16386            N/A      
--------------------------------------------------------------------------------

The longer the input sequence, the more visible is the quadratic relationship $\mathcal{O}(n^2)$

For a google/reformer-enwik8 model using global attention, a sequence length of over 16K results in a memory overflow.

Now, let’s activate local and LSH self-attention by using the model’s default parameters.

config = ReformerConfig.from_pretrained("google/reformer-enwik8")
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[2048, 4096, 8192, 16384, 32768, 65436], batch_sizes=[1], models=["Reformer"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config], args=benchmark_args)
result = benchmark.run()
1 / 1
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 7.85 GiB already allocated; 1.74 GiB free; 9.06 GiB reserved in total by PyTorch)
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 11.17 GiB total capacity; 6.56 GiB already allocated; 3.99 GiB free; 6.81 GiB reserved in total by PyTorch)

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
           Reformer                  1              2048            1785     
           Reformer                  1              4096            2621     
           Reformer                  1              8192            4281     
           Reformer                  1             16384            7607     
           Reformer                  1             32768            N/A      
           Reformer                  1             65436            N/A      
--------------------------------------------------------------------------------

As expected, using local and LSH self-attention is much more memory efficient for longer input sequences, so that the model runs out of memory only beyond 16K tokens for an 11GB RAM GPU in this notebook.



2. Chunked Feed Forward Layers

Transformer-based models often employ very large feed forward layers after the self-attention layer in parallel. Thereby, this layer can take up a significant amount of the overall memory and sometimes even represent the memory bottleneck of a model.
First introduced in the Reformer paper, feed forward chunking is a technique that allows one to effectively trade better memory consumption for increased time consumption.



Chunked Feed Forward Layer in Reformer

In Reformer, the LSH- or local self-attention layer is usually followed by a residual connection, which then defines the first part in a transformer block. For more detail on this please refer to this blog.

The output of the first part of the transformer block, called normed self-attention output, can be written as $\mathbf{\overline{Z}} = \mathbf{Z} + \mathbf{X}$

For our example input $\mathbf{x}_1, \ldots, \mathbf{x}_{16}$

Now, the second part of a transformer block usually consists of two feed forward layers

$$\mathbf{Y}_{\text{out}} = \text{Linear}_{\text{out}}(\mathbf{Y}_\text{int}) = \text{Linear}_{\text{out}}(\text{Linear}_{\text{int}}(\mathbf{\overline{Z}})).$$

It is important to remember at this point that mathematically the output of a feed forward layer at position $\mathbf{y}_{\text{out}, i}$

Let’s illustrate the feed forward layers for $\mathbf{\overline{z}}_1, \ldots, \mathbf{\overline{z}}_{16}$

As can be seen from the illustration, all input vectors $\mathbf{\overline{z}}_{i}$

It becomes interesting when one takes a look at the output dimensions of the feed forward layers. In Reformer, the output dimension of $\text{Linear}_{\text{int}}$

The Reformer authors observed that in a transformer model the intermediate dimension $d_{f}$

To get a better feeling for the differences in dimensions let’s picture the matrices $\mathbf{Y}_\text{int}$

It is becoming quite obvious that the tensor $\mathbf{Y}_\text{int}$

Assuming $c_{f}=1$

By processing the inputs in chunks of size 1, the only tensors that have to be stored in memory at the same time are $\mathbf{Y}_\text{out}$

Finally, it is important to remember that chunked linear layers yield a mathematically equivalent output to conventional linear layers and can therefore be applied to all transformer linear layers. Making use of config.chunk_size_feed_forward therefore allows a better trade-off between memory and speed in certain use cases.
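The equivalence claimed above is easy to check on a toy example. The sketch below is a pure-Python illustration, not the library implementation: the function names are hypothetical, biases and the activation between the two layers are omitted, and the dimensions ($n=16$, $d_h=3$, $d_f=12$) are chosen only for the example. Because each position is processed independently, the chunked and unchunked versions produce identical outputs, while the chunked version only ever materializes `chunk_size` rows of the intermediate tensor.

```python
import random

def linear(rows, weight):
    """Apply y = x W to every row x independently (bias omitted)."""
    return [[sum(a * w for a, w in zip(row, col)) for col in zip(*weight)] for row in rows]

def feed_forward(z_bar, w_int, w_out):
    """Unchunked: the full n x d_f intermediate tensor exists at once."""
    return linear(linear(z_bar, w_int), w_out)

def chunked_feed_forward(z_bar, w_int, w_out, chunk_size):
    """Chunked: only chunk_size x d_f intermediate values exist at a time."""
    out = []
    for start in range(0, len(z_bar), chunk_size):
        y_int = linear(z_bar[start:start + chunk_size], w_int)
        out.extend(linear(y_int, w_out))
    return out

random.seed(0)
n, d_h, d_f = 16, 3, 12
z_bar = [[random.gauss(0, 1) for _ in range(d_h)] for _ in range(n)]
w_int = [[random.gauss(0, 1) for _ in range(d_f)] for _ in range(d_h)]
w_out = [[random.gauss(0, 1) for _ in range(d_h)] for _ in range(d_f)]

full = feed_forward(z_bar, w_int, w_out)
chunked = chunked_feed_forward(z_bar, w_int, w_out, chunk_size=1)
print(full == chunked)  # True
```

Any position-wise activation placed between the two layers would preserve this equivalence, since it also acts on each row independently.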



More information on chunked linear / feed forward layers can be found here on the 🤗Transformers docs.



Benchmark

Let’s test how much memory can be saved by using chunked feed forward layers.

#@title Installs and Imports
# pip installs
!pip -qq install git+https://github.com/huggingface/transformers.git
!pip install -qq py3nvml

from transformers import ReformerConfig, PyTorchBenchmark, PyTorchBenchmarkArguments

First, let’s compare the default google/reformer-enwik8 model without chunked feed forward layers to the one with chunked feed forward layers.

config_no_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8")  # no chunk
config_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=1)  # feed forward chunk
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[1024, 2048, 4096], batch_sizes=[8], models=["Reformer-No-Chunk", "Reformer-Chunk"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_no_chunk, config_chunk], args=benchmark_args)
result = benchmark.run()
1 / 2
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 7.85 GiB already allocated; 1.74 GiB free; 9.06 GiB reserved in total by PyTorch)
2 / 2
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 7.85 GiB already allocated; 1.24 GiB free; 9.56 GiB reserved in total by PyTorch)

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
      Reformer-No-Chunk              8              1024            4281     
      Reformer-No-Chunk              8              2048            7607     
      Reformer-No-Chunk              8              4096            N/A      
        Reformer-Chunk               8              1024            4309     
        Reformer-Chunk               8              2048            7669     
        Reformer-Chunk               8              4096            N/A      
--------------------------------------------------------------------------------

Interestingly, chunked feed forward layers do not seem to help here at all. The reason is that config.feed_forward_size is not sufficiently large to make a real difference. In the results above, the chunked model even uses marginally more memory at sequence lengths 1024 and 2048, and both configurations run out of memory at 4096.

Let’s see what happens to the peak memory usage if we increase the size of the feed forward layer by a factor of 4 and reduce the number of attention heads also by a factor of 4, so that the feed forward layer becomes the memory bottleneck.

config_no_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=0, num_attention_heads=2, feed_forward_size=16384)  # no chunk
config_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=1, num_attention_heads=2, feed_forward_size=16384)  # feed forward chunk
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[1024, 2048, 4096], batch_sizes=[8], models=["Reformer-No-Chunk", "Reformer-Chunk"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_no_chunk, config_chunk], args=benchmark_args)
result = benchmark.run()
1 / 2
2 / 2

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
      Reformer-No-Chunk              8              1024            3743     
      Reformer-No-Chunk              8              2048            5539     
      Reformer-No-Chunk              8              4096            9087     
        Reformer-Chunk               8              1024            2973     
        Reformer-Chunk               8              2048            3999     
        Reformer-Chunk               8              4096            6011     
--------------------------------------------------------------------------------

Now a clear decrease in peak memory usage can be seen for longer input sequences.
As a conclusion, it should be noted that chunked feed forward layers only make sense for models having few attention heads and large feed forward layers.
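A back-of-the-envelope estimate helps to see why the intermediate feed forward tensor dominates in this second configuration. The sketch below assumes float32 values (4 bytes each) and that chunking with chunk_size_feed_forward=1 processes one sequence position at a time for the whole batch; both are assumptions of this illustration, and the function name is hypothetical. It estimates a single tensor in a single layer, not the total memory reported by the benchmark.

```python
def intermediate_bytes(batch_size, seq_len, d_f, bytes_per_value=4):
    """Size of the feed forward intermediate tensor, assuming float32."""
    return batch_size * seq_len * d_f * bytes_per_value

full = intermediate_bytes(batch_size=8, seq_len=4096, d_f=16384)
chunked = intermediate_bytes(batch_size=8, seq_len=1, d_f=16384)
print(full / 2**30)     # 2.0 (GiB) when all positions are processed at once
print(chunked / 2**10)  # 512.0 (KiB) when one position is processed at a time
```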



3. Reversible Residual Layers

Reversible residual layers were first introduced in N. Gomez et al and used to reduce memory consumption when training the popular ResNet model. Mathematically, reversible residual layers are slightly different
from “real” residual layers but do not require the activations to be saved during the forward pass, which can drastically reduce memory consumption for training.



Reversible Residual Layers in Reformer

Let’s start by investigating why training a model requires
much more memory than the inference of the model.

When running a model in inference, the required memory equals more or less the memory it takes to compute the single largest tensor in the model.
On the other hand, when training a model, the required memory equals more or less the sum of all differentiable tensors.

This is not surprising when considering how auto differentiation works in deep learning frameworks. These lecture slides by Roger Grosse of the University of Toronto are great to better understand auto differentiation.

In a nutshell, in order to calculate the gradient of a differentiable function (e.g. a layer), auto differentiation requires the gradient of the function’s output and the function’s input and output tensor. While the gradients are dynamically computed and subsequently discarded, the input and output tensors (a.k.a. activations) of a function are stored during the forward pass.

Alright, let’s apply this to a transformer model. A transformer model includes a stack of multiple so-called transformer layers. Each additional transformer layer forces the model to store more activations during the forward pass and thus increases the required memory for training.
Let’s take a more detailed look. A transformer layer essentially consists of two residual layers. The first residual layer represents the self-attention mechanism as explained in section 1) and the second residual layer represents the linear or feed-forward layers as explained in section 2).

Using the same notation as before, the input of a transformer layer i.e. $\mathbf{X}$ is first normalized

Let’s illustrate a complete transformer layer using the example of $\mathbf{x}_1, \ldots, \mathbf{x}_{16}$

To calculate the gradient of e.g. the self-attention block $G$, three tensors have to be known beforehand: the gradient $\partial \mathbf{Z}$, the output $\mathbf{Z}$, and the input $\mathbf{X}$. While $\partial \mathbf{Z}$ can be calculated on-the-fly and discarded afterward, the values for $\mathbf{Z}$ and $\mathbf{X}$ have to be calculated and stored during the forward pass since it is not possible to recalculate them easily on-the-fly during backpropagation. Therefore, during the forward pass, large tensor outputs, such as the query-key dot product matrix $\mathbf{Q}\mathbf{K}^T$

Here, reversible residual layers come to our help. The idea is relatively straightforward. The residual block is designed in such a way that instead of having to store the input and output tensor of a function, both can easily be recalculated during the backward pass so that no tensor has to be stored in memory during the forward pass.
This is achieved by using two input streams $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}$

The reversible transformer layer can be visualized for $\mathbf{x}_1, \ldots, \mathbf{x}_{16}$

As can be seen, the outputs $\mathbf{\overline{Y}}^{(1)}, \mathbf{\overline{Y}}^{(2)}$

If we assume to know $\mathbf{\overline{Y}}^{(1)}, \mathbf{\overline{Y}}^{(2)}$
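The recomputation idea can be sketched with two toy functions standing in for the self-attention block and the feed forward block. The update rule used here (first stream plus a function of the second, then second stream plus a function of the new first) follows the reversible residual formulation of Gomez et al.; the stand-in functions `g` and `f`, the helper names, and the example values are assumptions made purely for illustration.

```python
import math

def g(x):
    """Stand-in for the self-attention block (any deterministic function works)."""
    return [math.tanh(v) for v in x]

def f(x):
    """Stand-in for the feed forward block."""
    return [0.5 * v * v for v in x]

def reversible_forward(x1, x2):
    """Compute both output streams from both input streams."""
    y1 = [a + b for a, b in zip(x1, g(x2))]
    y2 = [a + b for a, b in zip(x2, f(y1))]
    return y1, y2

def reversible_inverse(y1, y2):
    """Recover the inputs from the outputs alone, without stored activations."""
    x2 = [a - b for a, b in zip(y2, f(y1))]
    x1 = [a - b for a, b in zip(y1, g(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2, 2.0], [1.5, 0.1, -0.7]
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(x1 + x2, r1 + r2)))  # True
```

Because the inputs can be rebuilt from the outputs, a backward pass can walk the stack from the last layer to the first, recomputing each layer's inputs as it goes, at the cost of evaluating `f` and `g` a second time.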

Note: Recently, major deep learning frameworks have released code that allows one to store only certain activations and recompute larger ones during the backward propagation (TensorFlow here and PyTorch here). For standard reversible layers, this still means that at least one activation has to be stored for each transformer layer, but by defining which activations can dynamically be recomputed a lot of memory can be saved.





Benchmark

In order to measure the effect of reversible residual layers, we will compare the memory consumption of BERT with Reformer in training for an increasing number of layers.

#@title Installs and Imports
# pip installs
!pip -qq install git+https://github.com/huggingface/transformers.git
!pip install -qq py3nvml

from transformers import ReformerConfig, BertConfig, PyTorchBenchmark, PyTorchBenchmarkArguments

Let’s measure the required memory for the standard bert-base-uncased BERT model by increasing the number of layers from 4 to 12.

config_4_layers_bert = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=4)
config_8_layers_bert = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=8)
config_12_layers_bert = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=12)
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[512], batch_sizes=[8], models=["Bert-4-Layers", "Bert-8-Layers", "Bert-12-Layers"], training=True, no_inference=True, no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_4_layers_bert, config_8_layers_bert, config_12_layers_bert], args=benchmark_args)
result = benchmark.run()



1 / 3
2 / 3
3 / 3

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
        Bert-4-Layers                8              512             4103     
        Bert-8-Layers                8              512             5759     
        Bert-12-Layers               8              512             7415     
--------------------------------------------------------------------------------

It can be seen that adding a single layer of BERT linearly increases the required memory by more than 400MB.

config_4_layers_reformer = ReformerConfig.from_pretrained("google/reformer-enwik8", num_hidden_layers=4, num_hashes=1)
config_8_layers_reformer = ReformerConfig.from_pretrained("google/reformer-enwik8", num_hidden_layers=8, num_hashes=1)
config_12_layers_reformer = ReformerConfig.from_pretrained("google/reformer-enwik8", num_hidden_layers=12, num_hashes=1)
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[512], batch_sizes=[8], models=["Reformer-4-Layers", "Reformer-8-Layers", "Reformer-12-Layers"], training=True, no_inference=True, no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_4_layers_reformer, config_8_layers_reformer, config_12_layers_reformer], args=benchmark_args)
result = benchmark.run()
1 / 3
2 / 3
3 / 3

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
      Reformer-4-Layers              8              512             4607     
      Reformer-8-Layers              8              512             4987     
      Reformer-12-Layers             8              512             5367     
--------------------------------------------------------------------------------

For Reformer, on the other hand, adding a layer adds significantly less memory in practice. Adding a single layer increases the required memory on average by less than 100MB, so that a much larger 12-layer reformer-enwik8 model requires less memory than a 12-layer bert-base-uncased model.
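The per-layer figures quoted above can be reproduced directly from the two result tables. The helper name below is hypothetical; the numbers are the "Memory in MB" values reported by the benchmark runs.

```python
# Training memory in MB, keyed by number of layers (from the tables above)
bert = {4: 4103, 8: 5759, 12: 7415}
reformer = {4: 4607, 8: 4987, 12: 5367}

def mb_per_layer(results):
    """Average memory increase per added layer between the smallest and largest model."""
    layers = sorted(results)
    return (results[layers[-1]] - results[layers[0]]) / (layers[-1] - layers[0])

print(mb_per_layer(bert))      # 414.0
print(mb_per_layer(reformer))  # 95.0
```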



4. Axial Positional Encodings

Reformer makes it possible to process huge input sequences. However, for such long input sequences standard positional encoding weight matrices alone would use more than 1GB to store their weights.
To prevent such large positional encoding matrices, the official Reformer code introduced Axial Position Encodings.

Important: Axial Position Encodings were not explained in the official paper, but can be well understood from looking into the code and talking to the authors.



Axial Positional Encodings in Reformer

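Transformers need positional encodings to account for the order of words in the input because self-attention layers have no notion of order.
Positional encodings are usually defined by a simple look-up matrix $\mathbf{E} = \left[\mathbf{e}_1, \ldots, \mathbf{e}_{n_\text{max}}\right]$, where $n_\text{max}$ is the maximum number of positions the model can process (config.max_position_embeddings) and each vector $\mathbf{e}_i$ has dimension $d_h$ (config.hidden_size).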

Assuming $d_h = 4$ and $n_\text{max} = 49$, such a positional encoding matrix can be visualized as follows:

alt text

Here, we showcase only the positional encodings $\mathbf{e}_1$, $\mathbf{e}_2$, and $\mathbf{e}_{49}$.

Let’s say we want to train a Reformer model on sequences of a length of up to 0.5M tokens and an input vector config.hidden_size of 1024 (see notebook here). The corresponding positional embeddings have a size of $0.5M \times 1024 \sim 512M$ parameters, which corresponds to roughly 2GB when stored as 32-bit floats.

Such positional encodings would use an unnecessarily large amount of memory, both when loading the model into memory and when saving the model to disk.
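As a quick sanity check of these numbers, here is a back-of-the-envelope sketch. It assumes that "0.5M tokens" means 512 × 1024 = 524,288 positions and that each parameter is stored as a 4-byte float; the helper name embedding_size_gb is made up for this illustration.

```python
def embedding_size_gb(num_positions, hidden_size, bytes_per_param=4):
    """Return (parameter count, size in GB) of a plain look-up matrix."""
    num_params = num_positions * hidden_size
    return num_params, num_params * bytes_per_param / 1024**3

# Assumption: 0.5M positions = 512 * 1024, stored as 32-bit floats.
num_params, size_gb = embedding_size_gb(num_positions=512 * 1024, hidden_size=1024)
print(num_params)  # 536870912
print(size_gb)     # 2.0
```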

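The Reformer authors managed to drastically shrink the positional encodings in size by cutting the config.hidden_size dimension in two and smartly factorizing the $n_\text{max}$ dimension. In 🤗Transformers, config.axial_pos_shape sets the two factors $n_\text{max}^1$ and $n_\text{max}^2$ with $n_\text{max}^1 \times n_\text{max}^2 = n_\text{max}$, and config.axial_pos_embds_dim sets the two parts $d_h^1$ and $d_h^2$ with $d_h^1 + d_h^2 = d_h$.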

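One can think of factorizing $n_\text{max}$ into $n_\text{max}^1$ and $n_\text{max}^2$ as folding the position dimension into a third axis, shown in the following for $n_\text{max}^1 = n_\text{max}^2 = 7$: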

alt text

Each of the three standing rectangular prisms corresponds to one of the encoding vectors $\mathbf{e}_{1}, \mathbf{e}_{2}, \mathbf{e}_{49}$.

alt text

We can see that we have cut the embedding vectors into $\mathbf{e}_\text{down}$ and $\mathbf{e}_\text{up}$. The resulting embedding vectors $\mathbf{e'}_{i}$ then correspond to

$$\mathbf{e'}_{i} = \left[ \left[\mathbf{e}_{\text{down, } i \% n_\text{max}^1}\right]^T, \left[\mathbf{e}_{\text{up, } \left\lfloor \frac{i}{n_\text{max}^2} \right\rfloor}\right]^T \right]^T$$

where $n_\text{max}^1 = 7$ and $n_\text{max}^2 = 7$ in our example.

In the following, these axial position encodings are illustrated in more detail for our example.

alt text

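Now it should be easier to understand how the final positional encoding vectors $\mathbf{E'}$ are computed from the much smaller sets of vectors $\mathbf{e}_\text{down}$ and $\mathbf{e}_\text{up}$.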

The crucial aspect to see here is that Axial Positional Encodings make sure that none of the vectors $\left[\mathbf{e'}_1, \ldots, \mathbf{e'}_{n_\text{max}}\right]$ are equal to each other by design, while the encoding matrix as a whole becomes much smaller.
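The construction can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the 🤗Transformers implementation: positions are 0-based, the vectors are randomly initialized, $d_h = 4$ is split into a lower part of size 1 and an upper part of size 3 (a hypothetical split chosen for the example), and the helper name build_axial_encodings is made up. The sketch pairs index i % n1 with i // n1, which coincides with the formula above for the 7 × 7 example.

```python
import random

def build_axial_encodings(n1, n2, d_down, d_up, seed=0):
    """Return n1 * n2 position vectors of size d_down + d_up."""
    rng = random.Random(seed)
    # Only n1 "down" vectors and n2 "up" vectors are actually stored.
    e_down = [[rng.gauss(0.0, 1.0) for _ in range(d_down)] for _ in range(n1)]
    e_up = [[rng.gauss(0.0, 1.0) for _ in range(d_up)] for _ in range(n2)]
    encodings = []
    for i in range(n1 * n2):
        # Concatenate the shared lower part and the shared upper part.
        encodings.append(e_down[i % n1] + e_up[i // n1])
    return encodings, e_down, e_up

encodings, e_down, e_up = build_axial_encodings(n1=7, n2=7, d_down=1, d_up=3)

# Number of stored values versus number of distinct position vectors.
num_stored = sum(len(v) for v in e_down) + sum(len(v) for v in e_up)
num_unique = len({tuple(v) for v in encodings})
print(len(encodings), len(encodings[0]))
print(num_stored)
print(num_unique)
```

With these settings, 28 stored values produce 49 distinct vectors of dimension 4, whereas a plain look-up matrix would store 49 × 4 = 196 values.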

To demonstrate the drastic reduction in size, let’s assume we had set config.axial_pos_shape = [1024, 512] and config.axial_pos_embds_dim = [512, 512] for a Reformer model that can process inputs up to a length of 0.5M tokens. The resulting axial positional encoding matrix would have a size of only $1024 \times 512 + 512 \times 512 \sim 800K$ parameters, which corresponds to roughly 3MB when stored as 32-bit floats. This is a drastic reduction from the roughly 2GB that a standard positional encoding matrix would require in this case.
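The same arithmetic can be written as a short sketch for arbitrary settings. The two helper functions are hypothetical names introduced for this example; they only mirror the counting argument from the text and assume that the product of axial_pos_shape gives the number of positions and the sum of axial_pos_embds_dim gives the hidden size.

```python
from math import prod

def standard_pos_params(axial_pos_shape, axial_pos_embds_dim):
    """Parameters of a plain look-up matrix covering the same positions."""
    return prod(axial_pos_shape) * sum(axial_pos_embds_dim)

def axial_pos_params(axial_pos_shape, axial_pos_embds_dim):
    """Parameters stored by axial encodings: one small matrix per axis."""
    return sum(n * d for n, d in zip(axial_pos_shape, axial_pos_embds_dim))

shape, dims = [1024, 512], [512, 512]
n_standard = standard_pos_params(shape, dims)
n_axial = axial_pos_params(shape, dims)

print(n_standard)               # 536870912
print(n_axial)                  # 786432
print(n_axial * 4 / 1024**2)    # 3.0 (MB at 4 bytes per parameter)
```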

For a more condensed and math-heavy explanation, please refer to the 🤗Transformers docs here.



Benchmark

Lastly, let’s also compare the peak memory consumption of conventional positional embeddings to axial positional embeddings.

#@title Installs and Imports
# pip installs
!pip -qq install git+https://github.com/huggingface/transformers.git
!pip install -qq py3nvml

from transformers import ReformerConfig, PyTorchBenchmark, PyTorchBenchmarkArguments, ReformerModel

Positional embeddings depend on only two configuration parameters: the maximum allowed length of input sequences config.max_position_embeddings and config.hidden_size. Let’s use a model that pushes the maximum allowed length of input sequences to half a million tokens, called google/reformer-crime-and-punishment, to see the effect of using axial positional embeddings.

First, we will compare the shape of axial position encodings with standard positional encodings and the number of parameters in the model.

config_no_pos_axial_embeds = ReformerConfig.from_pretrained("google/reformer-crime-and-punishment", axial_pos_embds=False)  # disable axial positional embeddings
config_pos_axial_embeds = ReformerConfig.from_pretrained("google/reformer-crime-and-punishment", axial_pos_embds=True, axial_pos_embds_dim=(64, 192), axial_pos_shape=(512, 1024))  # enable axial positional embeddings

print("Default Positional Encodings")
print(20 * '-')
model = ReformerModel(config_no_pos_axial_embeds)
print(f"Positional embeddings shape: {model.embeddings.position_embeddings}")
print(f"Num parameters of model: {model.num_parameters()}")
print(20 * '-' + '\n\n')

print("Axial Positional Encodings")
print(20 * '-')
model = ReformerModel(config_pos_axial_embeds)
print(f"Positional embeddings shape: {model.embeddings.position_embeddings}")
print(f"Num parameters of model: {model.num_parameters()}")
print(20 * '-' + '\n\n')



Default Positional Encodings
--------------------
Positional embeddings shape: PositionEmbeddings(
  (embedding): Embedding(524288, 256)
)
Num parameters of model: 136572416
--------------------


Axial Positional Encodings
--------------------
Positional embeddings shape: AxialPositionEmbeddings(
  (weights): ParameterList(
      (0): Parameter containing: [torch.FloatTensor of size 512x1x64]
      (1): Parameter containing: [torch.FloatTensor of size 1x1024x192]
  )
)
Num parameters of model: 2584064
--------------------

Having read the theory, the shape of the axial positional encoding weights should not be a surprise to the reader.

Regarding the results, it can be seen that for models capable of processing such long input sequences, it is not practical to use default positional encodings.
In the case of google/reformer-crime-and-punishment, standard positional encodings alone contain more than 100M parameters.
Axial positional encodings reduce this number to just over 200K.
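These two figures follow from the shapes printed above. The short check below only uses numbers copied from that output; it also confirms that the gap between the two total parameter counts is fully explained by the positional embeddings.

```python
# Values copied from the printed model summaries above.
total_params_standard = 136572416
total_params_axial = 2584064

standard_pos = 524288 * 256                 # Embedding(524288, 256)
axial_pos = 512 * 1 * 64 + 1 * 1024 * 192   # weights of size 512x1x64 and 1x1024x192

print(standard_pos)  # 134217728
print(axial_pos)     # 229376

# The rest of the model is unchanged, so both differences should match.
same_gap = (total_params_standard - total_params_axial) == (standard_pos - axial_pos)
print(same_gap)      # True
```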

Lastly, let’s also compare the required memory at inference time.

benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[512], batch_sizes=[8], models=["Reformer-No-Axial-Pos-Embeddings", "Reformer-Axial-Pos-Embeddings"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_no_pos_axial_embeds, config_pos_axial_embeds], args=benchmark_args)
result = benchmark.run()
1 / 2
2 / 2

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
Reformer-No-Axial-Pos-Embeddin       8              512             959      
Reformer-Axial-Pos-Embeddings        8              512             447      
--------------------------------------------------------------------------------

It can be seen that using axial positional embeddings reduces the memory requirement to approximately half in the case of google/reformer-crime-and-punishment.


