Transformer-based Encoder-Decoder Models

By Patrick von Platen

!pip install transformers==4.2.1
!pip install sentencepiece==0.1.95

The transformer-based encoder-decoder model was introduced by Vaswani
et al. in the famous Attention Is All You Need paper
and is today the de-facto standard encoder-decoder architecture in
natural language processing (NLP).

Recently, there has been a lot of research on different pre-training
objectives for transformer-based encoder-decoder models, e.g. T5,
Bart, Pegasus, ProphetNet, Marge, etc., but the model architecture
has remained largely the same.

The goal of the blog post is to give an in-detail explanation of
how the transformer-based encoder-decoder architecture models
sequence-to-sequence problems. We will focus on the mathematical model
defined by the architecture and how the model can be used at inference.
Along the way, we will give some background on sequence-to-sequence
models in NLP and break down the transformer-based encoder-decoder
architecture into its encoder and decoder parts. We provide many
illustrations and establish the link between the theory of
transformer-based encoder-decoder models and their practical usage in
🤗Transformers for inference. Note that this blog post does not explain
how such models can be trained, as this will be the topic of a future blog
post.

Transformer-based encoder-decoder models are the result of years of
research on representation learning and model architectures. This
notebook provides a short summary of the history of neural
encoder-decoder models. For more context, the reader is advised to read
this awesome blog post by Sebastian Ruder. Additionally, a basic
understanding of the self-attention architecture is recommended. The
following blog post by Jay Alammar serves as a good refresher on the
original Transformer model here.

At the time of writing this notebook, 🤗Transformers contains the
encoder-decoder models T5, Bart, MarianMT, and Pegasus, which
are summarized in the docs under model summaries.

The notebook is split into four parts:

  • Background - A short history of neural encoder-decoder models
    is given with a focus on RNN-based models.
  • Encoder-Decoder - The transformer-based encoder-decoder model
    is presented and it is explained how the model is used for
    inference.
  • Encoder - The encoder part of the model is explained in
    detail.
  • Decoder - The decoder part of the model is explained in
    detail.

Each part builds upon the previous part, but can also be read on its
own.



Background

Tasks in natural language generation (NLG), a subfield of NLP, are best
expressed as sequence-to-sequence problems. Such tasks can be defined as
finding a model that maps a sequence of input words to a sequence of
target words. Some classic examples are summarization and
translation. In the following, we assume that each word is encoded
into a vector representation. $n$ input words can then be represented as
a sequence of $n$ input vectors:

$$\mathbf{X}_{1:n} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}.$$
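As a toy illustration of such a vector representation, words can be mapped to vectors with a simple embedding lookup. The vocabulary and embedding values below are made up for illustration:

```python
import random

# hypothetical toy vocabulary; each word gets an arbitrary 4-dimensional vector
vocab = {"I": 0, "want": 1, "to": 2, "buy": 3, "a": 4, "car": 5, "EOS": 6}
embedding_matrix = [[random.random() for _ in range(4)] for _ in vocab]

def encode_words(words):
    """Map a sequence of n words to a sequence of n input vectors X_{1:n}."""
    return [embedding_matrix[vocab[w]] for w in words]

X = encode_words(["I", "want", "to", "buy", "a", "car", "EOS"])
print(len(X), len(X[0]))  # 7 input vectors, each of dimension 4
```

In practice, the embedding matrix is a learned parameter of the model rather than random values.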

Consequently, sequence-to-sequence problems can be solved by finding a
mapping $f$ from an input sequence of $n$ vectors $\mathbf{X}_{1:n}$ to
a sequence of $m$ target vectors $\mathbf{Y}_{1:m}$, whereby the number
of target vectors $m$ is unknown a priori:

$$f: \mathbf{X}_{1:n} \to \mathbf{Y}_{1:m}.$$

Sutskever et al. (2014) noted that
deep neural networks (DNNs), “*despite their flexibility and power, can
only define a mapping whose inputs and targets can be sensibly encoded
with vectors of fixed dimensionality.*” ${}^1$

Using a DNN model${}^2$ to solve sequence-to-sequence problems would
therefore mean that the number of target vectors $m$ has to be known a
priori and would have to be independent of the input $\mathbf{X}_{1:n}$.

In 2014, Cho et al. and
Sutskever et al. proposed to use an
encoder-decoder model purely based on recurrent neural networks (RNNs)
for sequence-to-sequence tasks. In contrast to DNNs, RNNs are capable
of modeling a mapping to a variable number of target vectors. Let’s
dive a bit deeper into the functioning of RNN-based encoder-decoder
models.

During inference, the encoder RNN encodes an input sequence
$\mathbf{X}_{1:n}$ into a single hidden state $\mathbf{c}$, which
thereby defines an encoding of the input:

$$f_{\theta_{enc}}: \mathbf{X}_{1:n} \to \mathbf{c}.$$

Then, the decoder’s hidden state is initialized with the input encoding,
and during inference the decoder RNN is used to auto-regressively
generate the target sequence. Let’s explain.

Mathematically, the decoder defines the probability distribution of a
target sequence $\mathbf{Y}_{1:m}$ given the hidden state $\mathbf{c}$:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{c}).$$

By the chain rule of probability, the distribution can be decomposed
into conditional distributions of single target vectors as follows:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{c}) = \prod_{i=1}^{m} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c}).$$

Thus, if the architecture can model the conditional distribution of the
next target vector, given all previous target vectors:

$$p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c}), \forall i \in \{1, \ldots, m\},$$

then it can model the distribution of any target vector sequence given
the hidden state $\mathbf{c}$ by simply multiplying all conditional
probabilities.
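The factorization can be made concrete with a small numerical sketch; the per-step probabilities below are invented for illustration, but the principle holds for any decoder:

```python
# hypothetical per-step conditional probabilities p(y_i | Y_{0:i-1}, c)
step_probs = [0.9, 0.8, 0.95]  # e.g. for a 3-word target sequence

# the probability of the whole target sequence is the product of the
# step-wise conditional probabilities
sequence_prob = 1.0
for p in step_probs:
    sequence_prob *= p

print(sequence_prob)  # 0.9 * 0.8 * 0.95 ≈ 0.684
```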

So how does the RNN-based decoder architecture model $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c})$?

In computational terms, the model sequentially maps the previous inner
hidden state $\mathbf{c}_{i-1}$ and the previous target vector
$\mathbf{y}_{i-1}$ to the current inner hidden state $\mathbf{c}_i$ and
a logit vector $\mathbf{l}_i$:

$$f_{\theta_{\text{dec}}}(\mathbf{y}_{i-1}, \mathbf{c}_{i-1}) \to \mathbf{l}_i, \mathbf{c}_i.$$

A softmax operation is then applied to the logit vector to define the
conditional distribution of the next target vector:

$$p(\mathbf{y}_i | \mathbf{l}_i) = \textbf{Softmax}(\mathbf{l}_i), \text{ with } \mathbf{l}_i = f_{\theta_{\text{dec}}}(\mathbf{y}_{i-1}, \mathbf{c}_{\text{prev}}).$$

For more detail on the logit vector and the resulting probability
distribution, please see footnote ${}^4$.
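A minimal sketch of how a logit vector is turned into a probability distribution over the vocabulary via the softmax operation; the vocabulary size and logit values below are made up:

```python
import math

def softmax(logits):
    """Normalize a logit vector into a probability distribution."""
    exps = [math.exp(l - max(logits)) for l in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

l_i = [2.0, 1.0, 0.1, -1.0]  # hypothetical logit vector over a 4-word vocabulary
p = softmax(l_i)

print(sum(p))           # probabilities sum to 1
print(p.index(max(p)))  # the highest logit gets the highest probability -> index 0
```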

The space of possible target vector sequences $\mathbf{Y}_{1:m}$ is
enormous, so that at inference one has to rely on efficient decoding
methods that approximately find the most likely target sequence.

Given such a decoding method, during inference the next input vector
$\mathbf{y}_i$ can be sampled from
$p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c})$
and appended to the input of the decoder, so that the process can be
repeated auto-regressively.

An important feature of RNN-based encoder-decoder models is the
definition of special vectors, such as the $\text{EOS}$ and
$\text{BOS}$ vectors. The $\text{EOS}$ vector often represents the final
input vector $\mathbf{x}_n$, signaling the end of the input sequence,
and also marks the end of the target sequence: as soon as $\text{EOS}$
is sampled from a logit vector, generation is complete. The
$\text{BOS}$ vector represents the input vector $\mathbf{y}_0$ that is
fed to the decoder RNN at the very first decoding step.

The unfolded RNN encoder is coloured in green and the unfolded RNN
decoder is coloured in red.

The English sentence “I want to buy a car”, represented by the input
vectors $\mathbf{x}_1 = \text{I}$ through $\mathbf{x}_7 = \text{EOS}$,
is translated into German: “Ich will ein Auto kaufen”.

To generate the first target vector, the decoder is fed the $\text{BOS}$
vector, illustrated as $\mathbf{y}_0$, and the first target vector is
then sampled from the distribution:

$$p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \mathbf{c}).$$

The word $\text{Ich}$ is sampled (shown by the grey arrow connecting
$\mathbf{l}_1$ and $\mathbf{y}_1$) and fed back into the decoder, from
which the second target word is then sampled:

$$\text{will} \sim p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \text{Ich}, \mathbf{c}).$$

And so on, until at step $i=6$ the $\text{EOS}$ vector is sampled from
$\mathbf{l}_6$ and decoding is finished.

To sum it up, an RNN-based encoder-decoder model, represented by
$f_{\theta_{\text{enc}}}$ and $p_{\theta_{\text{dec}}}$, models the
distribution $p(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n})$ by factorizing it
into conditional distributions:

$$p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{X}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c}), \text{ with } \mathbf{c} = f_{\theta_{\text{enc}}}(\mathbf{X}_{1:n}).$$

During inference, efficient decoding methods can auto-regressively
generate the target sequence $\mathbf{Y}_{1:m}$.
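The auto-regressive loop can be sketched with a toy stand-in for the decoder; the transition table below is invented, and a real decoder would compute each distribution with an RNN over continuous vectors:

```python
# hypothetical next-word probabilities, standing in for the decoder RNN
next_word_probs = {
    "BOS": {"Ich": 0.7, "will": 0.2, "EOS": 0.1},
    "Ich": {"will": 0.8, "EOS": 0.2},
    "will": {"EOS": 0.9, "Ich": 0.1},
}

def greedy_decode(start="BOS", max_len=10):
    """Auto-regressively pick the most likely next word until EOS is generated."""
    sequence = [start]
    while sequence[-1] != "EOS" and len(sequence) < max_len:
        probs = next_word_probs[sequence[-1]]
        sequence.append(max(probs, key=probs.get))  # greedy choice
    return sequence

print(greedy_decode())  # ['BOS', 'Ich', 'will', 'EOS']
```

Greedy decoding is only one such method; beam search, discussed later, keeps several candidate sequences per step instead of one.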

The RNN-based encoder-decoder model took the NLG community by storm. In
2016, Google announced that it would completely replace its heavily
feature-engineered translation service with a single RNN-based
encoder-decoder model (see here).

Nevertheless, RNN-based encoder-decoder models have two pitfalls. First,
RNNs suffer from the vanishing gradient problem, making it very
difficult to capture long-range dependencies, cf. Hochreiter et al.
(2001). Second, the inherent recurrent architecture of RNNs prevents
efficient parallelization when encoding, cf. Vaswani et al. (2017).





Encoder-Decoder

In 2017, Vaswani et al. introduced the Transformer and thereby gave
birth to transformer-based encoder-decoder models.

Analogous to RNN-based encoder-decoder models, transformer-based
encoder-decoder models consist of an encoder and a decoder, which are
both stacks of residual attention blocks. The key innovation of
transformer-based encoder-decoder models is that such residual attention
blocks can process an input sequence $\mathbf{X}_{1:n}$ of variable
length $n$ without exhibiting a recurrent structure, which makes them
highly parallelizable.

As a reminder, to solve a sequence-to-sequence problem, we need to
find a mapping of an input sequence $\mathbf{X}_{1:n}$ to an output
sequence $\mathbf{Y}_{1:m}$ of variable length $m$.

Similar to RNN-based encoder-decoder models, the transformer-based
encoder-decoder models define a conditional distribution of target
vectors $\mathbf{Y}_{1:m}$ given an input sequence $\mathbf{X}_{1:n}$:

$$p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}).$$

The transformer-based encoder part encodes the input sequence
$\mathbf{X}_{1:n}$ to a sequence of hidden states
$\mathbf{\overline{X}}_{1:n}$:

$$f_{\theta_{\text{enc}}}: \mathbf{X}_{1:n} \to \mathbf{\overline{X}}_{1:n}.$$

The transformer-based decoder part then models the conditional
probability distribution of the target vector sequence
$\mathbf{Y}_{1:m}$ given the sequence of encoded hidden states
$\mathbf{\overline{X}}_{1:n}$:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{\overline{X}}_{1:n}).$$

By the chain rule of probability, this distribution can be factorized
into a product of conditional probability distributions of the target
vector $\mathbf{y}_i$ given the encoded hidden states and all previous
target vectors:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{\overline{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}).$$

The transformer-based decoder hereby maps the sequence of encoded hidden
states $\mathbf{\overline{X}}_{1:n}$ and all previous target vectors
$\mathbf{Y}_{0:i-1}$ to the logit vector $\mathbf{l}_i$, which is then
processed by a softmax operation to define the conditional distribution
$p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n})$.

Having defined the conditional distribution
$p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n})$,
we can now auto-regressively sample target vectors at inference and thus
map an input sequence $\mathbf{X}_{1:n}$ to an output sequence
$\mathbf{Y}_{1:m}$.

Let’s visualize the complete process of auto-regressive generation with
transformer-based encoder-decoder models.


The transformer-based encoder is coloured in green and the
transformer-based decoder is coloured in red. As in the previous
section, we show how the English sentence “I want to buy a car”,
represented by the input vectors $\mathbf{x}_1 = \text{I}$ through
$\mathbf{x}_7 = \text{EOS}$, is translated into German: “Ich will ein
Auto kaufen”.

To begin with, the encoder processes the complete input sequence
$\mathbf{X}_{1:7}$ = “I want to buy a car EOS” and encodes it into the
hidden states $\mathbf{\overline{X}}_{1:7}$.

Next, the input encoding $\mathbf{\overline{X}}_{1:7}$ is fed to the
decoder together with the $\text{BOS}$ vector $\mathbf{y}_0$, and the
decoder models the distribution:

$$p_{\theta_{enc, dec}}(\mathbf{y} | \mathbf{y}_0, \mathbf{X}_{1:7}) = p_{\theta_{enc, dec}}(\mathbf{y} | \text{BOS}, \text{I want to buy a car EOS}) = p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \mathbf{\overline{X}}_{1:7}).$$

Next, the first target vector $\mathbf{y}_1$ is sampled from the
distribution and fed back to the decoder, which then models the
distribution:

$$p_{\theta_{dec}}(\mathbf{y} | \text{BOS Ich}, \mathbf{\overline{X}}_{1:7}).$$

We can sample again and produce the target vector $\mathbf{y}_2$, and so
on in an auto-regressive fashion, until at step 6 the $\text{EOS}$
vector is sampled:

$$\text{EOS} \sim p_{\theta_{dec}}(\mathbf{y} | \text{BOS Ich will ein Auto kaufen}, \mathbf{\overline{X}}_{1:7}).$$

It is important to understand that the encoder is only used in the first
forward pass to map $\mathbf{X}_{1:n}$ to $\mathbf{\overline{X}}_{1:n}$.
From the second forward pass on, the decoder can directly make use of
the previously computed encoding $\mathbf{\overline{X}}_{1:n}$.


As can be seen, only in step $i=1$ does the transformer-based encoder
have to map $\mathbf{X}_{1:n}$ to $\mathbf{\overline{X}}_{1:n}$; from
step $i=2$ on, the decoder simply reuses $\mathbf{\overline{X}}_{1:n}$.
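This reuse of the encoder output can be sketched in a few lines. The `encode` and `decode_step` functions below are hypothetical stand-ins, instrumented with a counter to show that encoding happens only once:

```python
encoder_calls = 0

def encode(input_sequence):
    """Hypothetical encoder: called exactly once per input sequence."""
    global encoder_calls
    encoder_calls += 1
    return [f"hidden({w})" for w in input_sequence]  # stand-in for the encoded hidden states

def decode_step(encoded, prefix):
    """Hypothetical decoder step: reuses the cached encoding every step."""
    return "EOS" if len(prefix) >= 3 else f"word{len(prefix)}"

encoded = encode(["I", "want", "to", "buy", "a", "car", "EOS"])  # first forward pass only
prefix = ["BOS"]
while prefix[-1] != "EOS":
    prefix.append(decode_step(encoded, prefix))  # the encoder is NOT re-run here

print(encoder_calls)  # 1
print(prefix[-1])     # EOS
```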

In 🤗Transformers, this auto-regressive generation is done under the hood
when calling the .generate() method. Let’s use one of our translation
models to see this in action.

from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# create ids of encoded input vectors
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids

# translate example
output_ids = model.generate(input_ids)[0]

# decode and print
print(tokenizer.decode(output_ids))

Output:

     Ich will ein Auto kaufen

Calling .generate() does many things under the hood. First, it passes
the input_ids to the encoder. Second, it passes a pre-defined decoder
start token, which is the $\text{&lt;pad&gt;}$ symbol in the case of
MarianMTModel, together with the encoded input_ids to the decoder.
Third, it applies a beam search decoding mechanism to
auto-regressively sample the next output word given the last decoder
output.
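The beam search idea can be illustrated on a toy transition table. The log-probabilities below are invented, and transformers' actual implementation additionally handles batching, length penalties, and more; this only shows the core mechanism of keeping the highest-scoring partial sequences at every step:

```python
import math

# hypothetical next-word log-probabilities given the previous word
log_probs = {
    "BOS": {"Ich": math.log(0.6), "Er": math.log(0.4)},
    "Ich": {"will": math.log(0.9), "EOS": math.log(0.1)},
    "Er": {"will": math.log(0.5), "EOS": math.log(0.5)},
    "will": {"EOS": math.log(1.0)},
}

def beam_search(num_beams=2, max_len=5):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["BOS"], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "EOS":  # finished sequences are carried over unchanged
                candidates.append((seq, score))
                continue
            for word, lp in log_probs[seq[-1]].items():
                candidates.append((seq + [word], score + lp))
        # prune to the num_beams best candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(seq[-1] == "EOS" for seq, _ in beams):
            break
    return beams[0][0]

print(beam_search())  # ['BOS', 'Ich', 'will', 'EOS']
```

Unlike greedy decoding, beam search can recover a globally better sequence even when the first sampled word is not locally optimal.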