!pip install transformers==4.2.1
!pip install sentencepiece==0.1.95
The transformer-based encoder-decoder model was introduced by Vaswani
et al. in the famous Attention is all you need
paper and is today the de-facto
standard encoder-decoder architecture in natural language processing
(NLP).
Recently, there has been a lot of research on different pre-training
objectives for transformer-based encoder-decoder models, e.g. T5,
Bart, Pegasus, ProphetNet, Marge, etc., but the model architecture
has stayed largely the same.
The goal of this blog post is to give an in-detail explanation of
how the transformer-based encoder-decoder architecture models
sequence-to-sequence problems. We will focus on the mathematical model
defined by the architecture and how the model can be used for inference.
Along the way, we will give some background on sequence-to-sequence
models in NLP and break down the transformer-based encoder-decoder
architecture into its encoder and decoder parts. We provide many
illustrations and establish the link between the theory of
transformer-based encoder-decoder models and their practical usage in
🤗Transformers for inference. Note that this blog post does not explain
how such models can be trained – this is the subject of a future blog
post.
Transformer-based encoder-decoder models are the result of years of
research on representation learning and model architectures. This
notebook provides a short summary of the history of neural
encoder-decoder models. For more context, the reader is advised to read
this awesome blog
post by
Sebastian Ruder. Additionally, a basic understanding of the
self-attention architecture is recommended. The following blog post by
Jay Alammar serves as a refresher on the original Transformer model
here.
At the time of writing this notebook, 🤗Transformers comprises the
encoder-decoder models T5, Bart, MarianMT, and Pegasus, which
are summarized in the docs under model
summaries.
The notebook is split into four parts:
- Background – A short history of neural encoder-decoder models is given with a focus on RNN-based models.
- Encoder-Decoder – The transformer-based encoder-decoder model is presented and it is explained how the model is used for inference.
- Encoder – The encoder part of the model is explained in detail.
- Decoder – The decoder part of the model is explained in detail.
Each part builds upon the previous part, but can also be read on its
own.
Background
Tasks in natural language generation (NLG), a subfield of NLP, are best
expressed as sequence-to-sequence problems. Such tasks can be defined as
finding a model that maps a sequence of input words to a sequence of
target words. Some classic examples are summarization and
translation. In the following, we assume that each word is encoded
into a vector representation. $n$ input words can then be represented as
a sequence of $n$ input vectors:

$$\mathbf{X}_{1:n} = \mathbf{x}_1, \ldots, \mathbf{x}_n.$$

Consequently, sequence-to-sequence problems can be solved by finding a
mapping $f$ from an input sequence of $n$ vectors $\mathbf{X}_{1:n}$ to
a sequence of $m$ target vectors $\mathbf{Y}_{1:m}$, where the number
$m$ of target vectors is unknown a priori and depends on the input
sequence:

$$f: \mathbf{X}_{1:n} \to \mathbf{Y}_{1:m}.$$
Sutskever et al. (2014) noted that
deep neural networks (DNNs), “*despite their flexibility and power, can
only define a mapping whose inputs and targets can be sensibly encoded
with vectors of fixed dimensionality.*”

Using a DNN model to solve sequence-to-sequence problems would
therefore mean that the number $m$ of target vectors has to be known
a priori and would have to be independent of the input
$\mathbf{X}_{1:n}$. This is suboptimal because, for tasks in NLG, the
number of target words usually depends on the input $\mathbf{X}_{1:n}$
and not just on the input length $n$. E.g., an article of 1000 words
can be summarized to both 200 words and 100 words depending on its
content.
In 2014, Cho et al. and
Sutskever et al. proposed to use an
encoder-decoder model purely based on recurrent neural networks (RNNs)
for sequence-to-sequence tasks. In contrast to DNNs, RNNs are capable
of modeling a mapping to a variable number of target vectors. Let's
dive a bit deeper into the functioning of RNN-based encoder-decoder
models.
During inference, the encoder RNN encodes an input sequence
$\mathbf{X}_{1:n}$ by successively updating its hidden state.
After having processed the last input vector $\mathbf{x}_n$, the
encoder's hidden state defines the input encoding $\mathbf{c}$. Thus,
the encoder defines the mapping:

$$f_{\theta_{enc}}: \mathbf{X}_{1:n} \to \mathbf{c}.$$

Then, the decoder's hidden state is initialized with the input encoding
and during inference, the decoder RNN is used to auto-regressively
generate the target sequence. Let's explain.
Mathematically, the decoder defines the probability distribution of a
target sequence $\mathbf{Y}_{1:m}$ given the hidden state $\mathbf{c}$:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{c}).$$

By Bayes' rule, the distribution can be decomposed into conditional
distributions of single target vectors as follows:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{c}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{c}).$$

Thus, if the architecture can model the conditional distribution of the
next target vector, given all previous target vectors:

$$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{c}), \forall i \in \{1, \ldots, m\},$$

then it can model the distribution of any target vector sequence given
the hidden state $\mathbf{c}$ by simply multiplying all conditional
probabilities.
So how does the RNN-based decoder architecture model
$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{c})$?

In computational terms, the model sequentially maps the previous inner
hidden state $\mathbf{c}_{i-1}$ and the previous target vector
$\mathbf{y}_{i-1}$ to the current inner hidden state $\mathbf{c}_i$ and a
logit vector $\mathbf{l}_i$ (shown in dark red below). $\mathbf{c}_0$
is thereby defined as $\mathbf{c}$, being the output
hidden state of the RNN-based encoder. Subsequently, the softmax
operation is used to transform the logit vector $\mathbf{l}_i$ to a
conditional probability distribution of the next target vector:

$$p(\mathbf{y}_i | \mathbf{l}_i) = \text{Softmax}(\mathbf{l}_i).$$

For more detail on the logit vector and the resulting probability
distribution, please see the footnote. From the above equation, we
can see that the distribution of the current target vector
$\mathbf{y}_i$ is directly conditioned on the previous target vector
$\mathbf{y}_{i-1}$ and the previous hidden state $\mathbf{c}_{i-1}$.
Because the previous hidden state $\mathbf{c}_{i-1}$ depends on all
previous target vectors $\mathbf{y}_0, \ldots, \mathbf{y}_{i-2}$, it can
be stated that the RNN-based decoder implicitly (i.e. indirectly)
models the conditional distribution
$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{c})$.

The space of possible target vector sequences is
prohibitively large, so that at inference, one has to rely on decoding
methods that efficiently sample high-probability target vector
sequences from $p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{c})$.
Given such a decoding method, during inference, the next input vector
$\mathbf{y}_i$ can then be sampled from
$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{c})$
and is consequently appended to the input sequence so that the decoder
RNN then models
$p_{\theta_{dec}}(\mathbf{y}_{i+1} | \mathbf{Y}_{0:i}, \mathbf{c})$
to sample the next input vector $\mathbf{y}_{i+1}$ and so on in an
auto-regressive fashion.
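To make the auto-regressive loop concrete, here is a minimal numpy sketch with a toy, randomly initialized decoder RNN. All dimensions, weight matrices, and the greedy "sampling" rule are hypothetical stand-ins for a trained model; only the control flow mirrors the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: hidden size 8, vocabulary of 5 words (index 4 = EOS).
hidden_dim, vocab_size, eos_id, bos_id = 8, 5, 4, 0

# Randomly initialized weights stand in for a trained decoder RNN and LM head.
W_h = rng.normal(size=(hidden_dim, hidden_dim))    # hidden-to-hidden
W_y = rng.normal(size=(hidden_dim, vocab_size))    # previous-word-to-hidden
W_out = rng.normal(size=(vocab_size, hidden_dim))  # LM head: hidden -> logits

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

def decode(c, max_len=10):
    """Greedy auto-regressive decoding from the input encoding c."""
    h, y = c, bos_id  # decoder hidden state initialized with the encoding
    tokens = []
    for _ in range(max_len):
        # One RNN step: previous hidden state + previous target word -> new hidden state.
        h = np.tanh(W_h @ h + W_y[:, y])
        p = softmax(W_out @ h)  # logit vector -> conditional distribution p(y_i | Y_{0:i-1}, c)
        y = int(p.argmax())     # greedy "sampling": pick the most likely word
        tokens.append(y)
        if y == eos_id:         # EOS sampled -> generation is complete
            break
    return tokens

c = rng.normal(size=hidden_dim)  # pretend encoder output
print(decode(c))
```

Each loop iteration consumes the previously generated word, exactly as in the factorization above; a real decoding method would sample from `p` rather than take the argmax.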
An important feature of RNN-based encoder-decoder models is the
definition of special vectors, such as the EOS and BOS vectors. The EOS
vector often represents the final input vector $\mathbf{x}_n$ to "cue"
the encoder that the input sequence has ended and also defines the end
of the target sequence. As soon as the EOS is sampled from a logit
vector, the generation is complete. The BOS vector represents the input
vector $\mathbf{y}_0$ fed to the decoder RNN at the very first decoding
step. To output the first logit $\mathbf{l}_1$, an input is required and
since no input has been generated at the first step, a special BOS
input vector is fed to the decoder RNN. Okay – quite complicated! Let's
illustrate and walk through an example.
The unfolded RNN encoder is coloured in green and the unfolded RNN
decoder is coloured in red.
The English sentence "I want to buy a car", represented by the input
vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$, is translated into German: "Ich will ein
Auto kaufen", defined as the target vectors $\mathbf{y}_1, \ldots, \mathbf{y}_m$. To begin with, the input vector $\mathbf{x}_1$ is processed by the encoder RNN and updates
its hidden state. Note that because we are only interested in the final
encoder hidden state $\mathbf{c}$, we can disregard the RNN
encoder's target vector. The encoder RNN then processes the rest of the
input sentence in the same fashion, updating its hidden
state at each step until the EOS vector is reached. In the illustration above, the horizontal arrow connecting the
unfolded encoder RNN represents the sequential updates of the hidden
state. The final hidden state of the encoder RNN, represented by $\mathbf{c}$, then completely defines the encoding of the input
sequence and is used as the initial hidden state of the decoder RNN.
This can be seen as conditioning the decoder RNN on the encoded input.
To generate the first target vector, the decoder is fed the BOS
vector, illustrated as $\mathbf{y}_0$ in the design above. The output
vector of the RNN is then further mapped to the logit vector
$\mathbf{l}_1$ by means of the LM Head feed-forward layer to define
the conditional distribution of the first target vector as explained
above:

$$p_{\theta_{dec}}(\mathbf{y}_1 | \text{BOS}, \mathbf{c}).$$

The word "Ich" is sampled (shown by the grey arrow) and consequently the second target
vector can be sampled:

$$p_{\theta_{dec}}(\mathbf{y}_2 | \text{"Ich"}, \text{BOS}, \mathbf{c}).$$

And so on, until at step $m$ the EOS vector is sampled from
$\mathbf{l}_m$ and the decoding is finished. The resulting target
sequence $\mathbf{Y}_{1:m}$ amounts to
"Ich will ein Auto kaufen" in our example above.
To sum it up, an RNN-based encoder-decoder model, represented by
$f_{\theta_{enc}}$ and $p_{\theta_{dec}}$, defines
the distribution $p(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n})$ by
factorization:

$$p_{\theta_{enc}, \theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{c}), \text{ with } \mathbf{c} = f_{\theta_{enc}}(\mathbf{X}_{1:n}).$$

During inference, efficient decoding methods can auto-regressively
generate the target sequence $\mathbf{Y}_{1:m}$.
The RNN-based encoder-decoder model took the NLG community by storm. In
2016, Google announced that it would fully replace its heavily
feature-engineered translation service by a single RNN-based
encoder-decoder model (see
here).

However, RNN-based encoder-decoder models have two pitfalls. First,
RNNs suffer from the vanishing gradient problem, making it very
difficult to capture long-range dependencies, cf. Hochreiter et al.
(2001). Second,
the inherent recurrent architecture of RNNs prevents efficient
parallelization when encoding, cf. Vaswani et al.
(2017).
The original quote from the paper is “Despite their flexibility
and power, DNNs can only be applied to problems whose inputs and targets
can be sensibly encoded with vectors of fixed dimensionality“, which
is slightly adapted here.
The same essentially holds true for convolutional neural networks
(CNNs). While an input sequence of variable length can be fed into a
CNN, the dimensionality of the target will always depend on the
input dimensionality or be fixed to a specific value.
At the first step, the hidden state is initialized as a zero
vector and fed to the RNN together with the first input vector
$\mathbf{x}_1$.
A neural network can define a probability distribution over all
words as follows. First, the network defines a mapping from the input
$\mathbf{x}'$ to an embedded vector representation $\mathbf{h}'$,
which corresponds to the RNN target vector. The embedded
vector representation $\mathbf{h}'$ is then passed to the "language
model head" layer, which means that it is multiplied by the word
embedding matrix, so that a score between $\mathbf{h}'$ and each
encoded word vector is computed. The resulting
vector is called the logit vector $\mathbf{l}$ and can be
mapped to a probability distribution over all words by applying a
softmax operation: $p(\mathbf{w} | \mathbf{x}') = \text{Softmax}(\mathbf{l})$.
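The footnote's computation can be sketched in a few lines of numpy; all sizes and the random weights are hypothetical toy values, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 4  # hypothetical toy sizes

W_emb = rng.normal(size=(vocab_size, hidden_dim))  # word embedding matrix
h = rng.normal(size=hidden_dim)                    # embedded vector representation h'

# One similarity score per vocabulary word: the logit vector l.
logits = W_emb @ h

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

p = softmax(logits)  # conditional distribution over the whole vocabulary
print(p.sum())       # sums to 1 (up to floating point)
```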
Beam-search decoding is an example of such a decoding method.
Different decoding methods are out of scope for this notebook. The
reader is advised to refer to this interactive
notebook on decoding
methods.
Sutskever et al. (2014)
reverse the order of the input so that in the above example the input
vectors would correspond to the reversed sentence "car a buy to want I". The
motivation is to allow for a shorter connection between corresponding
word pairs such as "I" and "Ich". The research group emphasizes that the
reversal of the input sequence was a key reason for their model's
improved performance on machine translation.
Encoder-Decoder
In 2017, Vaswani et al. introduced the Transformer and thereby gave
birth to transformer-based encoder-decoder models.

Analogous to RNN-based encoder-decoder models, transformer-based
encoder-decoder models consist of an encoder and a decoder, which are
both stacks of residual attention blocks. The key innovation of
transformer-based encoder-decoder models is that such residual attention
blocks can process an input sequence of variable
length without exhibiting a recurrent structure. Not relying on a
recurrent structure allows transformer-based encoder-decoders to be
highly parallelizable, which makes the model orders of magnitude more
computationally efficient than RNN-based encoder-decoder models on
modern hardware.
As a reminder, to solve a sequence-to-sequence problem, we need to
find a mapping of an input sequence $\mathbf{X}_{1:n}$ to an output
sequence $\mathbf{Y}_{1:m}$ of variable length $m$. Let's see how
transformer-based encoder-decoder models are used to find such a
mapping.
Similar to RNN-based encoder-decoder models, transformer-based
encoder-decoder models define a conditional distribution of target
vectors $\mathbf{Y}_{1:m}$ given an input sequence $\mathbf{X}_{1:n}$:

$$p_{\theta_{enc}, \theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}).$$

The transformer-based encoder part encodes the input sequence
$\mathbf{X}_{1:n}$ to a sequence of hidden states
$\overline{\mathbf{X}}_{1:n}$, thus defining the mapping:

$$f_{\theta_{enc}}: \mathbf{X}_{1:n} \to \overline{\mathbf{X}}_{1:n}.$$

The transformer-based decoder part then models the conditional
probability distribution of the target vector sequence
$\mathbf{Y}_{1:m}$ given the sequence of encoded hidden states
$\overline{\mathbf{X}}_{1:n}$:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \overline{\mathbf{X}}_{1:n}).$$

By Bayes' rule, this distribution can be factorized into a product of
conditional probability distributions of the target vector
$\mathbf{y}_i$ given the encoded hidden states
$\overline{\mathbf{X}}_{1:n}$ and all previous target vectors
$\mathbf{Y}_{0:i-1}$:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \overline{\mathbf{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n}).$$
The transformer-based decoder hereby maps the sequence of encoded hidden
states $\overline{\mathbf{X}}_{1:n}$ and all previous target vectors
$\mathbf{Y}_{0:i-1}$ to the logit vector $\mathbf{l}_i$. The logit
vector $\mathbf{l}_i$ is then processed by the softmax operation to
define the conditional distribution
$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n})$,
just as it is done for RNN-based decoders. However, in contrast to
RNN-based decoders, the distribution of the target vector $\mathbf{y}_i$
is explicitly (or directly) conditioned on all previous target vectors
$\mathbf{y}_0, \ldots, \mathbf{y}_{i-1}$, as we will see later in more
detail. The 0th target vector $\mathbf{y}_0$ is hereby represented by a
special "begin-of-sentence" BOS vector.
Having defined the conditional distribution
$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n})$,
we can now auto-regressively generate the output and thus define a
mapping of an input sequence $\mathbf{X}_{1:n}$ to an output sequence
$\mathbf{Y}_{1:m}$ at inference.
Let's visualize the complete process of auto-regressive generation of
transformer-based encoder-decoder models.
The transformer-based encoder is coloured in green and the
transformer-based decoder is coloured in red. As in the previous section,
we show how the English sentence "I want to buy a car", represented by
$\mathbf{x}_1, \ldots, \mathbf{x}_7$, is translated into German: "Ich will ein
Auto kaufen", defined as $\mathbf{y}_1, \ldots, \mathbf{y}_6$.

To begin with, the encoder processes the complete input sequence
$\mathbf{X}_{1:7}$ = "I want to buy a car" (represented by the light
green vectors) to a contextualized encoded sequence
$\overline{\mathbf{X}}_{1:7}$. E.g., $\overline{\mathbf{x}}_4$ defines
an encoding that depends not only on the input $\mathbf{x}_4$ = "buy",
but also on all other words "I", "want", "to", "a", "car" and
"EOS", i.e. the context.
Next, the input encoding $\overline{\mathbf{X}}_{1:7}$ together with the
BOS vector, i.e. $\mathbf{y}_0$, is fed to the decoder. The decoder
processes the inputs $\overline{\mathbf{X}}_{1:7}$ and $\mathbf{y}_0$ to
the first logit $\mathbf{l}_1$ (shown in darker red) to define the
conditional distribution of the first target vector $\mathbf{y}_1$:

$$p_{\theta_{dec}}(\mathbf{y}_1 | \mathbf{y}_0, \overline{\mathbf{X}}_{1:7}).$$

Next, the first target vector $\mathbf{y}_1$ = "Ich" is sampled
from the distribution (represented by the grey arrows) and can now be
fed to the decoder again. The decoder now processes both
$\mathbf{y}_0$ = "BOS" and $\mathbf{y}_1$ = "Ich" to define the conditional
distribution of the second target vector $\mathbf{y}_2$:

$$p_{\theta_{dec}}(\mathbf{y}_2 | \mathbf{y}_0, \mathbf{y}_1, \overline{\mathbf{X}}_{1:7}).$$

We can sample again and produce the target vector $\mathbf{y}_2$ =
"will". We continue in auto-regressive fashion until at step 6 the EOS
vector is sampled from the conditional distribution:

$$p_{\theta_{dec}}(\mathbf{y}_6 | \mathbf{Y}_{0:5}, \overline{\mathbf{X}}_{1:7}).$$

And so on in an auto-regressive fashion.
It is important to understand that the encoder is only used in the first
forward pass to map $\mathbf{X}_{1:n}$ to $\overline{\mathbf{X}}_{1:n}$.
As of the second forward pass, the decoder can directly make use of the
previously calculated encoding $\overline{\mathbf{X}}_{1:n}$. For
clarity, let's illustrate the first and the second forward pass for our
example above.

As can be seen, only in step $i=1$ do we have to encode "I want to buy
a car EOS" to $\overline{\mathbf{X}}_{1:7}$. At step $i=2$, the
contextualized encodings of "I want to buy a car EOS" are simply
reused by the decoder.
In 🤗Transformers, this auto-regressive generation is done under-the-hood
when calling the .generate() method. Let's use one of our translation
models to see this in action.
from transformers import MarianMTModel, MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids
output_ids = model.generate(input_ids)[0]
print(tokenizer.decode(output_ids))
Output:
Ich will ein Auto kaufen
Calling .generate() does many things under-the-hood. First, it passes
the input_ids to the encoder. Second, it passes a pre-defined token,
which is the <pad> symbol in the case of
MarianMTModel, together with the encoded input_ids to the decoder.
Third, it applies the beam search decoding mechanism to
auto-regressively sample the next output word from the last decoder
output. For more detail on how beam search decoding works, one is
advised to read this blog
post.

In the Appendix, we have included a code snippet that shows how a simple
generation method can be implemented "from scratch". To fully
understand how auto-regressive generation works under-the-hood, it is
highly recommended to read the Appendix.
To sum it up:
- The transformer-based encoder defines a mapping from the input sequence $\mathbf{X}_{1:n}$ to a contextualized encoding sequence $\overline{\mathbf{X}}_{1:n}$.
- The transformer-based decoder defines the conditional distribution $p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n})$.
- Given an appropriate decoding mechanism, the output sequence $\mathbf{Y}_{1:m}$ can auto-regressively be sampled from $p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n})$.
Great, now that we have a general overview of how
transformer-based encoder-decoder models work, we can dive deeper into
both the encoder and decoder parts of the model. More specifically, we
will see exactly how the encoder makes use of the self-attention layer
to yield a sequence of context-dependent vector encodings and how
self-attention layers allow for efficient parallelization. Then, we will
explain in detail how the self-attention layer works in the decoder
model and how the decoder is conditioned on the encoder's output with
cross-attention layers to define the conditional distribution
$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n})$.
Along the way, it will become obvious how transformer-based
encoder-decoder models solve the long-range dependency problem of
RNN-based encoder-decoder models.
In the case of "Helsinki-NLP/opus-mt-en-de", the decoding
parameters can be accessed
here,
where we can see that the model applies beam search with num_beams=6.
Encoder
As mentioned in the previous section, the transformer-based encoder
maps the input sequence to a contextualized encoding sequence:

$$f_{\theta_{enc}}: \mathbf{X}_{1:n} \to \overline{\mathbf{X}}_{1:n}.$$
Taking a closer look at the architecture, the transformer-based encoder
is a stack of residual encoder blocks. Each encoder block consists of
a bi-directional self-attention layer, followed by two feed-forward
layers. For simplicity, we disregard the normalization layers in this
notebook. Also, we will not further discuss the role of the two
feed-forward layers, but simply see them as a final vector-to-vector
mapping required in each encoder block. The bi-directional
self-attention layer puts each input vector $\mathbf{x}'_j$ into
relation with all input vectors $\mathbf{x}'_1, \ldots, \mathbf{x}'_n$
and by doing so
transforms the input vector $\mathbf{x}'_j$ to a more "refined"
contextual representation of itself, defined as $\mathbf{x}''_j$.
Thereby, the first encoder block transforms each input vector of the
input sequence $\mathbf{X}_{1:n}$ (shown in light green below) from a
context-independent vector representation to a context-dependent
vector representation, and the following encoder blocks further refine
this contextual representation until the last encoder block outputs the
final contextual encoding $\overline{\mathbf{X}}_{1:n}$ (shown in darker
green below).
Let's visualize how the encoder processes the input sequence "I want
to buy a car EOS" to a contextualized encoding sequence. Similar to
RNN-based encoders, transformer-based encoders also add a special
"end-of-sequence" input vector to the input sequence to hint to the
model that the input vector sequence is finished.

Our exemplary transformer-based encoder consists of three encoder
blocks, where the second encoder block is shown in more detail in the
red box on the right for the first three input vectors. The bi-directional
self-attention mechanism is illustrated by the fully-connected graph in
the lower part of the red box and the two feed-forward layers are shown
in the upper part of the red box. As stated before, we will focus only
on the bi-directional self-attention mechanism.
As can be seen, each output vector $\mathbf{x}''_i$ of the
self-attention layer depends directly on
all input vectors $\mathbf{x}'_1, \ldots, \mathbf{x}'_n$. This means,
e.g., that the input vector representation of the word "want",
i.e. $\mathbf{x}'_2$, is put into direct relation with the word "buy",
i.e. $\mathbf{x}'_4$, but also with the word "I", i.e. $\mathbf{x}'_1$.
The output vector representation of "want", i.e. $\mathbf{x}''_2$,
thus represents a more refined contextual
representation of the word "want".
Let's take a closer look at how bi-directional self-attention works.
Each input vector $\mathbf{x}'_i$ of an input sequence of an encoder
block is projected to a key vector $\mathbf{k}_i$, a value vector
$\mathbf{v}_i$ and a query vector $\mathbf{q}_i$ (shown in orange, blue,
and purple respectively below)
through three trainable weight matrices
$\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v$.
Note that the same weight matrices are applied to each input vector
$\mathbf{x}'_i$. After projecting each
input vector to a query, key, and value vector, each
query vector $\mathbf{q}_i$ is compared
to all key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_n$. The more
similar one of the key vectors $\mathbf{k}_j$ is to
a query vector $\mathbf{q}_i$, the more important the corresponding
value vector $\mathbf{v}_j$ is for the output vector
$\mathbf{x}''_i$. More
specifically, an output vector $\mathbf{x}''_i$ is defined as the
weighted sum of all value vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$
plus the input vector $\mathbf{x}'_i$. Thereby, the weights are
proportional to the cosine similarity between $\mathbf{q}_i$ and the
respective key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_n$, which is
mathematically expressed by the softmax operation as
illustrated in the equation below. For a complete description of the
self-attention layer, the reader is advised to take a look at
this blog post or
the original paper.
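The described computation for a single query vector can be sketched in numpy as follows. The dimensions and random projection matrices are toy assumptions; the $1/\sqrt{d}$ scaling from the original paper is included and normalization layers are omitted, as in the rest of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy model/key dimension
n = 6  # number of input vectors

X = rng.normal(size=(n, d))  # input vectors x'_1 ... x'_n, one per row

# Three trainable projection matrices, shared across all positions.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v  # query, key and value vectors

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

# Refined representation of position 2 ("want" in the running example):
i = 1
weights = softmax(Q[i] @ K.T / np.sqrt(d))  # compare q_i with ALL key vectors
x_out = weights @ V + X[i]                  # weighted sum of value vectors + residual input

print(x_out.shape)  # (4,)
```

The attention weights form a probability distribution over all positions, so every input vector can contribute to the refined representation in a single step.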
Alright, this sounds quite complicated. Let's illustrate the
bi-directional self-attention layer for one of the query vectors of our
example above. For simplicity, it is assumed that our exemplary
transformer-based encoder uses only a single attention head
config.num_heads = 1 and that no normalization is applied.

On the left, the previously illustrated second encoder block is shown
again and on the right, a detailed visualization of the bi-directional
self-attention mechanism is given for the second input vector
$\mathbf{x}'_2$ that corresponds to the input word "want". At first,
all input vectors $\mathbf{x}'_1, \ldots, \mathbf{x}'_7$ are projected
to their respective query vectors
(only the first three query vectors are shown in purple above), value
vectors (shown in blue), and key
vectors (shown in orange). The
query vector $\mathbf{q}_2$ is then multiplied by the transpose of all
key vectors, followed by the
softmax operation to yield the self-attention weights. The
self-attention weights are finally multiplied by the respective value
vectors and the input vector $\mathbf{x}'_2$ is added to output the
"refined" representation of the word "want", i.e. $\mathbf{x}''_2$
(shown in dark green on the right). The whole equation is illustrated in
the upper part of the box on the right. The multiplication of
$\mathbf{q}_2$ and the transposed key vectors thereby makes it
possible to compare the vector representation of "want" to all other
input vector representations "I", "to", "buy", "a", "car",
"EOS" so that the self-attention weights reflect the importance of each of
the other input vector representations for the refined representation
$\mathbf{x}''_2$ of the word "want".
To further understand the implications of the bi-directional
self-attention layer, let's assume the following sentence is processed:
"The house is beautiful and well located in the middle of the city
where it is easily accessible by public transport". The word "it"
refers to "house", which is 12 "positions away". In
transformer-based encoders, the bi-directional self-attention layer
performs a single mathematical operation to put the input vector of
"house" into relation with the input vector of "it" (compare to the
first illustration of this section). In contrast, in an RNN-based
encoder, a word that is 12 "positions away" would require at least 12
mathematical operations, meaning that in an RNN-based encoder a linear
number of mathematical operations is required. This makes it much
harder for an RNN-based encoder to model long-range contextual
representations. Also, it becomes clear that a transformer-based encoder
is much less likely to lose important information than an RNN-based
encoder-decoder model, since the sequence length of the encoding is
kept the same, i.e. $n$,
while an RNN compresses the input of length $n$ to just a single
vector $\mathbf{c}$, which makes it very difficult for RNNs
to effectively encode long-range dependencies between input words.
In addition to making long-range dependencies more easily learnable, we
can see that the Transformer architecture is able to process text in
parallel. Mathematically, this can easily be shown by writing the
self-attention formula as a product of query, key, and value matrices:

The output $\overline{\mathbf{X}}_{1:n}$
is computed via a series of matrix multiplications and a softmax
operation, which can be parallelized effectively. Note that, in an
RNN-based encoder model, the computation of the hidden state
$\mathbf{c}$ has to be done sequentially: compute the hidden state of the
first input vector $\mathbf{x}_1$, then compute the hidden state of the
second input vector, which depends on the hidden state of the first input
vector, and so on. The sequential nature of RNNs prevents effective
parallelization and makes them much more inefficient compared to
transformer-based encoder models on modern GPU hardware.
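A small numpy sketch makes the parallelism claim tangible: the matrix form computes all output vectors at once, and it agrees with a sequential per-query loop. The matrices are toy random values; the residual connection and normalization are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4  # toy sequence length and model dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))  # toy query/key/value matrices

def softmax_rows(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Matrix form: all n output vectors in one batched computation.
out_parallel = softmax_rows(Q @ K.T / np.sqrt(d)) @ V

# Sequential form: one query vector at a time (the order an RNN would be forced into).
out_sequential = np.stack(
    [softmax_rows(Q[i:i + 1] @ K.T / np.sqrt(d)) @ V for i in range(n)]
).squeeze(1)

print(np.allclose(out_parallel, out_sequential))  # True
```

On a GPU, the single batched matrix multiplication is what makes the encoder so much faster than a step-by-step recurrence, even though both compute the same result.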
Great, now we should have a better understanding of a) how
transformer-based encoder models effectively model long-range contextual
representations and b) how they efficiently process long sequences of
input vectors.

Now, let's code up a short example of the encoder part of our
MarianMT encoder-decoder model to verify that the explained theory
holds in practice.
An in-detail explanation of the role the feed-forward layers play
in transformer-based models is out-of-scope for this notebook. It is
argued in Yun et al. (2017)
that feed-forward layers are crucial to map each contextual vector
individually to the desired output space, which the
self-attention layer does not manage to do on its own. It should be
noted here that each output token is processed by the
same feed-forward layer. For more detail, the reader is advised to read
the paper.
However, the EOS input vector does not have to be appended to the
input sequence, though it has been shown to improve performance in many
cases. In contrast, the 0th BOS target vector of the
transformer-based decoder is required as a starting input vector to
predict the first target vector.
from transformers import MarianMTModel, MarianTokenizer
import torch
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
embeddings = model.get_input_embeddings()
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids
encoder_hidden_states = model.base_model.encoder(input_ids, return_dict=True).last_hidden_state
input_ids_perturbed = tokenizer("I want to buy a house", return_tensors="pt").input_ids
encoder_hidden_states_perturbed = model.base_model.encoder(input_ids_perturbed, return_dict=True).last_hidden_state
print(f"Length of input embeddings {embeddings(input_ids).shape[1]}. Length of encoder_hidden_states {encoder_hidden_states.shape[1]}")
print("Is encoding for `I` equal to its perturbed version?: ", torch.allclose(encoder_hidden_states[0, 0], encoder_hidden_states_perturbed[0, 0], atol=1e-3))
Outputs:
Length of input embeddings 7. Length of encoder_hidden_states 7
Is encoding for `I` equal to its perturbed version?: False
We compare the length of the input word embeddings, i.e.
embeddings(input_ids), corresponding to $\mathbf{X}_{1:n}$, with the
length of the encoder_hidden_states, corresponding to
$\overline{\mathbf{X}}_{1:n}$. Also, we have forwarded the word sequence
"I want to buy a car" and a slightly perturbed version "I want to
buy a house" through the encoder to check whether the first output encoding,
corresponding to "I", differs when only the last word is changed in
the input sequence.

As expected, the output lengths of the input word embeddings and the
encoder output encodings, i.e. $n$, are equal. Second, it can be
noted that the values of the encoded output vector of
"I" are different when the last word
is changed from "car" to "house". This however should not come as a
surprise if one has understood bi-directional self-attention.
On a side note, autoencoding models, such as BERT, have the exact same
architecture as transformer-based encoder models. Autoencoding
models leverage this architecture for large-scale self-supervised
pre-training on open-domain text data so that they can map any word
sequence to a deep bi-directional representation. In Devlin et al.
(2018), the authors show that a
pre-trained BERT model with a single task-specific classification layer
on top can achieve SOTA results on eleven NLP tasks. All autoencoding
models of 🤗Transformers can be found
here.
Decoder
As mentioned in the Encoder-Decoder section, the transformer-based
decoder defines the conditional probability distribution of a target
sequence given the contextualized encoding sequence:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \overline{\mathbf{X}}_{1:n}),$$

which by Bayes' rule can be decomposed into a product of conditional
distributions of the next target vector given the contextualized
encoding sequence and all previous target vectors:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \overline{\mathbf{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n}).$$
Let's first understand how the transformer-based decoder defines a
probability distribution. The transformer-based decoder is a stack of
decoder blocks followed by a dense layer, the "LM head". The stack
of decoder blocks maps the contextualized encoding sequence
$\overline{\mathbf{X}}_{1:n}$ and a target vector sequence prepended by
the BOS vector and cut to the last target vector, i.e.
$\mathbf{Y}_{0:m-1}$, to an encoded sequence of target vectors
$\overline{\mathbf{Y}}_{0:m-1}$. Then, the "LM head" maps the encoded
sequence of target vectors $\overline{\mathbf{Y}}_{0:m-1}$ to a
sequence of logit vectors $\mathbf{L}_{1:m} = \mathbf{l}_1, \ldots, \mathbf{l}_m$, where the
dimensionality of each logit vector $\mathbf{l}_i$ corresponds to the
size of the vocabulary. This way, for each $i \in \{1, \ldots, m\}$, a
probability distribution over the whole vocabulary can be obtained by
applying a softmax operation on $\mathbf{l}_i$. These distributions
define the conditional distributions:

$$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n}), \forall i \in \{1, \ldots, m\},$$

respectively. The "LM head" is often tied to the transpose of the word
embedding matrix, i.e. $\mathbf{W}_{emb}^{\intercal}$. Intuitively this means that for all
$i \in \{1, \ldots, m\}$ the "LM head" layer compares the encoded output vector
$\overline{\mathbf{y}}_{i-1}$ to all word embeddings in the vocabulary so that the logit
vector $\mathbf{l}_i$ represents the similarity scores between the
encoded output vector and each word embedding. The softmax operation
simply transforms the similarity scores to a probability distribution.
For each $i$, the following equation holds:

$$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n}) = \text{Softmax}(\mathbf{W}_{emb}^{\intercal} \, \overline{\mathbf{y}}_{i-1}) = \text{Softmax}(\mathbf{l}_i).$$
Putting it all together, in order to model the conditional distribution
of a target vector sequence $\mathbf{Y}_{1:m}$, the target vectors
$\mathbf{Y}_{1:m-1}$ prepended by the special BOS vector,
i.e. $\mathbf{y}_0$, are first mapped together with the contextualized
encoding sequence $\overline{\mathbf{X}}_{1:n}$ to the logit vector
sequence $\mathbf{L}_{1:m}$. Consequently, each logit vector
$\mathbf{l}_i$ is transformed into a conditional probability
distribution of the target vector $\mathbf{y}_i$ using the softmax
operation. Finally, the conditional probabilities of all target vectors
$\mathbf{y}_1, \ldots, \mathbf{y}_m$ are multiplied together to yield the
conditional probability of the complete target vector sequence:

$$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \overline{\mathbf{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \overline{\mathbf{X}}_{1:n}).$$
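The factorization can be sketched numerically: given hypothetical random logit vectors for each position and a made-up sequence of target word indices, the sequence probability is just the product of the per-position softmax probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
m, vocab_size = 5, 10  # hypothetical: 5 target words, vocabulary of 10

L = rng.normal(size=(m, vocab_size))    # logit vectors l_1 ... l_m, one per row
target_ids = np.array([3, 1, 4, 1, 5])  # hypothetical target word indices

def softmax_rows(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

P = softmax_rows(L)                       # one distribution over the vocabulary per position
cond_probs = P[np.arange(m), target_ids]  # p(y_i | Y_{0:i-1}, X_{1:n}) for each i
seq_prob = cond_probs.prod()              # probability of the whole target sequence

print(seq_prob)
```

In practice one sums log-probabilities instead of multiplying probabilities to avoid numerical underflow for long sequences.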
In contrast to transformer-based encoders, in transformer-based
decoders the encoded output vector $\overline{\mathbf{y}}_{i-1}$ should
be a good representation of the next target vector $\mathbf{y}_i$ and
not of the input vector itself. Additionally, the encoded output vector
$\overline{\mathbf{y}}_{i-1}$ should be conditioned on the complete
contextualized encoding sequence $\overline{\mathbf{X}}_{1:n}$. To meet
these requirements, each decoder block consists of a uni-directional
self-attention layer, followed by a cross-attention layer and two
feed-forward layers. The uni-directional self-attention layer
puts each of its input vectors into relation only with
all previous input vectors, for every position, to model the
probability distribution of
the next target vectors. The cross-attention layer puts each of its
input vectors into relation with all contextualized
encoding vectors $\overline{\mathbf{X}}_{1:n}$ to condition the
probability distribution of the next target vectors on the input of the
encoder as well.
Alright, let’s visualize the transformer-based decoder for our
English to German translation example.
We can see that the decoder maps the input $\mathbf{Y}_{0:5}$ "BOS", "Ich", "will", "ein", "Auto", "kaufen" (shown in light red) together with the contextualized sequence of "I", "want", "to", "buy", "a", "car", "EOS", i.e. $\mathbf{\overline{X}}_{1:7}$ (shown in dark green), to the logit vectors $\mathbf{L}_{1:6}$ (shown in dark red).
Applying a softmax operation on each logit vector $\mathbf{l}_1, \ldots, \mathbf{l}_6$ can thus define the conditional probability distributions:

$$p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{\overline{X}}_{1:7}), \forall i \in \{1, \ldots, 6\}.$$

The overall conditional probability

$$p_{\theta_{\text{dec}}}(\mathbf{Y}_{1:6} | \mathbf{\overline{X}}_{1:7})$$

can therefore be computed as the following product:

$$p_{\theta_{\text{dec}}}(\mathbf{Y}_{1:6} | \mathbf{\overline{X}}_{1:7}) = \prod_{i=1}^{6} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{\overline{X}}_{1:7}).$$
The red box on the right shows a decoder block for the first three target vectors $\mathbf{y}_0, \mathbf{y}_1, \mathbf{y}_2$. In the lower part, the uni-directional self-attention mechanism is illustrated, and in the middle, the cross-attention mechanism is illustrated. Let's first focus on uni-directional self-attention.
As in bi-directional self-attention, in uni-directional self-attention the query vectors $\mathbf{q}_0, \ldots, \mathbf{q}_{m-1}$ (shown in purple below), key vectors $\mathbf{k}_0, \ldots, \mathbf{k}_{m-1}$ (shown in orange below), and value vectors $\mathbf{v}_0, \ldots, \mathbf{v}_{m-1}$ (shown in blue below) are projected from their respective input vectors $\mathbf{y}'_0, \ldots, \mathbf{y}'_{m-1}$ (shown in light red below). However, in uni-directional self-attention, each query vector $\mathbf{q}_i$ is compared only to its respective key vector and all previous ones, namely $\mathbf{k}_0, \ldots, \mathbf{k}_i$, to yield the respective attention weights. This prevents an output vector $\mathbf{y}''_i$ (shown in dark red below) from including any information about the following input vector $\mathbf{y}'_j$ for all $j > i$. As is the case in bi-directional self-attention, the attention weights are then multiplied by their respective value vectors and summed together.
We can summarize uni-directional self-attention as follows:

$$\mathbf{y}''_i = \mathbf{V}_{0:i} \, \text{Softmax}(\mathbf{K}_{0:i}^{\intercal} \mathbf{q}_i) + \mathbf{y}'_i.$$

Note that the index range of the key and value vectors is $0, \ldots, i$ instead of $0, \ldots, m-1$, which would be the range of the key vectors in bi-directional self-attention.
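A minimal NumPy sketch of uni-directional self-attention, using random toy query, key, and value vectors (an illustration, not the model's learned projections): the causal mask sets the score of every future key to minus infinity before the softmax, so the attention weight on all positions $j > i$ becomes exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 8                       # toy sequence length and head dimension
Q = rng.normal(size=(m, d))       # query vectors q_0 ... q_{m-1}
K = rng.normal(size=(m, d))       # key vectors
V = rng.normal(size=(m, d))       # value vectors

scores = Q @ K.T / np.sqrt(d)     # similarity of every query to every key

# Causal mask: query i may only attend to keys 0..i, so mask out all j > i
mask = np.triu(np.ones((m, m), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax; exp(-inf) = 0, so masked positions get zero weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V              # weighted sum of the value vectors

# Position 0 attends only to itself: all other weights are zero
assert np.allclose(weights[0, 1:], 0.0)
```

The residual connection $+\,\mathbf{y}'_i$ from the formula above is omitted here to keep the sketch focused on the masking.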
Let's illustrate uni-directional self-attention for the input vector $\mathbf{y}'_1$ for our example above.
As can be seen, $\mathbf{y}''_1$ only depends on $\mathbf{y}'_0$ and $\mathbf{y}'_1$. Therefore, we put the vector representation of the word "Ich", i.e. $\mathbf{y}'_1$, only into relation with itself and the "BOS" target vector, i.e. $\mathbf{y}'_0$, but *not* with the vector representation of the word "will", i.e. $\mathbf{y}'_2$.
So why is it important that we use uni-directional self-attention in the decoder instead of bi-directional self-attention? As stated above, a transformer-based decoder defines a mapping from a sequence of input vectors $\mathbf{y}_0, \ldots, \mathbf{y}_{m-1}$ to the logits corresponding to the *next* decoder input vectors, namely $\mathbf{l}_1, \ldots, \mathbf{l}_m$. In our example, this means, e.g., that the input vector $\mathbf{y}_1$ = "Ich" is mapped to the logit vector $\mathbf{l}_2$, which is then used to predict the input vector $\mathbf{y}_2$. Thus, if $\mathbf{y}''_1$ had access to the following input vectors $\mathbf{y}_2, \ldots, \mathbf{y}_5$, the decoder would simply copy the vector representation of "will", i.e. $\mathbf{y}_2$, to be its output $\mathbf{\overline{y}}_1$. This would be forwarded to the last layer so that the encoded output vector would essentially just correspond to the vector representation $\mathbf{y}_2$.
This is clearly disadvantageous, as the transformer-based decoder would never learn to predict the next word given all previous words, but would just copy the target vector $\mathbf{y}_i$ through the network to $\mathbf{\overline{y}}_{i-1}$ for all $i \in \{1, \ldots, m\}$. In order to define a conditional distribution of the next target vector, the distribution cannot be conditioned on the next target vector itself. It does not make much sense to predict $\mathbf{y}_i$ from $p(\mathbf{y} | \mathbf{Y}_{0:i}, \mathbf{\overline{X}}_{1:n})$ because the distribution is conditioned on the very target vector it is supposed to model. The uni-directional self-attention architecture, therefore, allows us to define a *causal* probability distribution, which is necessary to effectively model a conditional distribution of the next target vector.
Great! Now we can move to the layer that connects the encoder and decoder: the *cross-attention* mechanism!
The cross-attention layer takes two vector sequences as inputs: the outputs of the uni-directional self-attention layer, i.e. $\mathbf{y}''_0, \ldots, \mathbf{y}''_{m-1}$, and the contextualized encoding vectors $\mathbf{\overline{x}}_1, \ldots, \mathbf{\overline{x}}_n$. As in the self-attention layer, the query vectors $\mathbf{q}_0, \ldots, \mathbf{q}_{m-1}$ are projections of the output vectors of the previous layer, i.e. $\mathbf{y}''_0, \ldots, \mathbf{y}''_{m-1}$. However, the key and value vectors $\mathbf{k}_1, \ldots, \mathbf{k}_n$ and $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are projections of the contextualized encoding vectors $\mathbf{\overline{x}}_1, \ldots, \mathbf{\overline{x}}_n$. Having defined key, value, and query vectors, a query vector $\mathbf{q}_i$ is then compared to *all* key vectors, and the corresponding score is used to weight the respective value vectors, just as is the case for bi-directional self-attention, to give the output vector $\mathbf{y}'''_i$ for all $i \in \{0, \ldots, m-1\}$. Cross-attention can be summarized as follows:

$$\mathbf{y}'''_i = \mathbf{V}_{1:n} \, \text{Softmax}(\mathbf{K}_{1:n}^{\intercal} \mathbf{q}_i) + \mathbf{y}''_i.$$

Note that the index range of the key and value vectors is $1, \ldots, n$, corresponding to the number of contextualized encoding vectors.
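Cross-attention can likewise be sketched in NumPy (random toy projection matrices, not trained weights): queries come from the decoder side, while keys and values are projections of the $n$ encoder outputs, so the attention weight matrix has shape $m \times n$ and every decoder position sees all encoder positions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 7, 3, 8                  # toy: n encoder positions, m decoder positions
X_bar = rng.normal(size=(n, d))    # contextualized encoding vectors (encoder output)
Y_prime = rng.normal(size=(m, d))  # outputs of the decoder self-attention layer

Wq = rng.normal(size=(d, d))       # toy projection matrices
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

Q = Y_prime @ Wq                   # queries are projected from the decoder side
K = X_bar @ Wk                     # keys and values are projected from the
V = X_bar @ Wv                     # encoder output -- the defining trait of cross-attention

scores = Q @ K.T / np.sqrt(d)      # (m, n): no mask -- all encoder positions visible
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V + Y_prime     # weighted value sum plus residual connection

assert output.shape == (m, d)
```

Note that nothing in the computation depends on `n` being fixed: the same projection matrices handle any input length, which is exactly the independence from the number of encoding vectors discussed below.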
Let's visualize the cross-attention mechanism for the input vector $\mathbf{y}''_1$ for our example above.
We can see that the query vector $\mathbf{q}_1$ (shown in purple) is derived from $\mathbf{y}''_1$ (shown in red) and therefore relies on a vector representation of the word "Ich". The query vector $\mathbf{q}_1$ is then compared to the key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_7$ (shown in yellow) corresponding to the contextual encoding representation of all encoder input vectors $\mathbf{X}_{1:7}$ = "I want to buy a car EOS". This puts the vector representation of "Ich" into direct relation with all encoder input vectors. Finally, the attention weights are multiplied by the value vectors $\mathbf{v}_1, \ldots, \mathbf{v}_7$ (shown in turquoise) to yield, in addition to the input vector $\mathbf{y}''_1$, the output vector $\mathbf{y}'''_1$ (shown in dark red).
So intuitively, what happens here exactly? Each output vector $\mathbf{y}'''_i$ is a weighted sum of all value projections of the encoder inputs $\mathbf{v}_1, \ldots, \mathbf{v}_n$ plus the input vector itself $\mathbf{y}''_i$ (c.f. the illustrated formula above). The key mechanism to understand is the following: the more similar a query projection of the *decoder input vector* $\mathbf{q}_i$ is to a key projection of the *encoder input vector* $\mathbf{k}_j$, the more important the value projection of the encoder input vector $\mathbf{v}_j$ becomes. In loose terms, this means that the more "related" a decoder input representation is to an encoder input representation, the more that input representation influences the decoder output representation.
Cool! Now we can see how this architecture nicely conditions each output vector $\mathbf{y}'''_i$ on the interaction between the encoder input vectors $\mathbf{\overline{X}}_{1:n}$ and the input vector $\mathbf{y}''_i$. Another important observation at this point is that the architecture is completely independent of the number $n$ of contextualized encoding vectors $\mathbf{\overline{X}}_{1:n}$ on which the output vector $\mathbf{y}'''_i$ is conditioned. All projection matrices to derive the key vectors and the value vectors, respectively, are shared across all positions $1, \ldots, n$, and all value vectors are summed together to a single weighted averaged vector. Now it also becomes obvious why the transformer-based decoder does not suffer from the long-range dependency problem that the RNN-based decoder suffers from. Because each decoder logit vector is *directly* dependent on every encoded output vector, the number of mathematical operations to compare the first encoded output vector and the last decoder logit vector amounts essentially to just one.
To conclude, the uni-directional self-attention layer is responsible for conditioning each output vector on all previous decoder input vectors and the current input vector, and the cross-attention layer is responsible for further conditioning each output vector on all encoded input vectors.
To verify our theoretical understanding, let's continue our code example from the encoder section above.
The word embedding matrix $\mathbf{W}_{\text{emb}}$ gives every input word a unique context-independent vector representation. This matrix is often tied to the "LM Head" layer. However, the "LM Head" layer can very well consist of a completely independent "encoded vector-to-logit" weight mapping.
Again, an in-detail explanation of the role the feed-forward layers play in transformer-based models is out-of-scope for this notebook. It is argued in Yun et al. (2017) that feed-forward layers are crucial to map each contextual vector individually to the desired output space, which the self-attention layer does not manage to do on its own. It should be noted here that every output token is processed by the same feed-forward layer. For more detail, the reader is advised to read the paper.
from transformers import MarianMTModel, MarianTokenizer
import torch

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
embeddings = model.get_input_embeddings()

# create token ids for encoder input
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids

# pass input token ids to encoder
encoder_output_vectors = model.base_model.encoder(input_ids, return_dict=True).last_hidden_state

# create token ids for decoder input
decoder_input_ids = tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# pass decoder input ids and encoded input vectors to decoder
decoder_output_vectors = model.base_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors).last_hidden_state

# derive logits by multiplying decoder outputs with the (tied) embedding weights
lm_logits = torch.nn.functional.linear(decoder_output_vectors, embeddings.weight, bias=model.final_logits_bias)

# change the decoder input slightly ("ein" -> "das") and recompute the logits
decoder_input_ids_perturbed = tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output_vectors_perturbed = model.base_model.decoder(decoder_input_ids_perturbed, encoder_hidden_states=encoder_output_vectors).last_hidden_state
lm_logits_perturbed = torch.nn.functional.linear(decoder_output_vectors_perturbed, embeddings.weight, bias=model.final_logits_bias)

# compare shapes and the logits of the first position
print(f"Shape of decoder input vectors {embeddings(decoder_input_ids).shape}. Shape of decoder logits {lm_logits.shape}")
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))
Output:
Shape of decoder input vectors torch.Size([1, 5, 512]). Shape of decoder logits torch.Size([1, 5, 58101])
Is encoding for `Ich` equal to its perturbed version?: True
We compare the output shape of the decoder input word embeddings, i.e. embeddings(decoder_input_ids) (corresponds to $\mathbf{Y}_{0:4}$; here `<pad>` corresponds to BOS and "Ich will ein" is tokenized to 4 tokens), with the dimensionality of the lm_logits (corresponds to $\mathbf{L}_{1:5}$). Also, we have passed the word sequence "`<pad>` Ich will ein" and a slightly perturbed version "`<pad>` Ich will das" together with the encoder_output_vectors through the decoder to check whether the second lm_logit, corresponding to "Ich", differs when only the last word is changed in the input sequence ("ein" -> "das").
As expected, the output shapes of the decoder input word embeddings and the lm_logits, i.e. the dimensionalities of $\mathbf{Y}_{0:4}$ and $\mathbf{L}_{1:5}$, differ in the last dimension. While the sequence length is the same (=5), the dimensionality of a decoder input word embedding corresponds to model.config.hidden_size, whereas the dimensionality of an lm_logit corresponds to the vocabulary size model.config.vocab_size, as explained above. Second, it can be noted that the logit vector corresponding to "Ich" is unchanged when the last word is changed from "ein" to "das". This, however, should not come as a surprise if one has understood uni-directional self-attention.
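This invariance can also be checked without loading a model. In the toy NumPy causal-attention function below (no learned projections, purely for illustration), perturbing a later input leaves the outputs at all earlier positions bit-for-bit unchanged:

```python
import numpy as np

def causal_attention(X):
    """Toy causal self-attention using X itself as queries, keys, and values."""
    m, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    # mask out future positions j > i
    scores[np.triu(np.ones((m, m), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
X_perturbed = X.copy()
X_perturbed[-1] += 1.0   # change only the LAST input vector

out = causal_attention(X)
out_p = causal_attention(X_perturbed)

# Outputs at earlier positions are unaffected by a change to a later input,
# while the output at the perturbed position itself does change.
assert np.allclose(out[:-1], out_p[:-1])
assert not np.allclose(out[-1], out_p[-1])
```

This is the same effect the Marian experiment above demonstrates with `torch.allclose` on the first logit vector.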
On a final side-note, *auto-regressive* models, such as GPT2, have the same architecture as transformer-based *decoder* models *if* one removes the cross-attention layer, because stand-alone auto-regressive models are not conditioned on any encoder outputs. So auto-regressive models are essentially the same as *auto-encoding* models, but replace bi-directional attention with uni-directional attention. These models can also be pre-trained on massive open-domain text data to show impressive performance on natural language generation (NLG) tasks. In Radford et al. (2019), the authors show that a pre-trained GPT2 model can achieve SOTA or close-to-SOTA results on a variety of NLG tasks without much fine-tuning. All *auto-regressive* models of 🤗Transformers can be found here.
Alright, that's it! Now you should have a good understanding of transformer-based encoder-decoder models and how to use them with the 🤗Transformers library.
Thanks a lot to Victor Sanh, Sasha Rush, Sam Shleifer, Oliver Åstrand, Ted Moskovitz and Kristian Kyvik for giving valuable feedback.
Appendix
As mentioned above, the following code snippet shows how one can program a simple generation method for transformer-based encoder-decoder models. Here, we implement a simple *greedy* decoding method using torch.argmax to sample the target vector.
from transformers import MarianMTModel, MarianTokenizer
import torch

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# create ids of encoded input vectors
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids

# create BOS token
decoder_input_ids = tokenizer("<pad>", add_special_tokens=False, return_tensors="pt").input_ids

assert decoder_input_ids[0, 0].item() == model.config.decoder_start_token_id, "`decoder_input_ids` should correspond to `model.config.decoder_start_token_id`"

# STEP 1: pass input_ids to encoder and BOS token to decoder to retrieve the first logit
outputs = model(input_ids, decoder_input_ids=decoder_input_ids, return_dict=True)

# get encoded sequence (reused in the following steps) and logits
encoded_sequence = (outputs.encoder_last_hidden_state,)
lm_logits = outputs.logits

# greedily sample the token with the highest probability and append it
next_decoder_input_ids = torch.argmax(lm_logits[:, -1:], axis=-1)
decoder_input_ids = torch.cat([decoder_input_ids, next_decoder_input_ids], axis=-1)

# STEP 2: reuse the encoded sequence, pass BOS + "Ich" to the decoder, sample again
lm_logits = model(None, encoder_outputs=encoded_sequence, decoder_input_ids=decoder_input_ids, return_dict=True).logits
next_decoder_input_ids = torch.argmax(lm_logits[:, -1:], axis=-1)
decoder_input_ids = torch.cat([decoder_input_ids, next_decoder_input_ids], axis=-1)

# STEP 3: repeat once more
lm_logits = model(None, encoder_outputs=encoded_sequence, decoder_input_ids=decoder_input_ids, return_dict=True).logits
next_decoder_input_ids = torch.argmax(lm_logits[:, -1:], axis=-1)
decoder_input_ids = torch.cat([decoder_input_ids, next_decoder_input_ids], axis=-1)

print(f"Generated so far: {tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)}")
Outputs:
Generated so far: Ich will ein
In this code example, we show exactly what was described earlier. We pass an input "I want to buy a car" together with the $\text{BOS}$ token to the encoder-decoder model and sample from the first logit $\mathbf{l}_1$ (i.e. the first lm_logits line). Hereby, our sampling strategy is simple: greedily choose the next decoder input vector that has the highest probability. In an auto-regressive fashion, we then pass the sampled decoder input vector together with the previous inputs to the encoder-decoder model and sample again. We repeat this a third time. As a result, the model has generated the words "Ich will ein". The result is spot-on: this is the beginning of the correct translation of the input.
In practice, more complicated decoding methods are used to sample the lm_logits. Most of them are covered in this blog post.