
Transformers: How Do They Transform Your Data?


Diving into the Transformer architecture and what makes it so effective at language tasks

Image by the author

In the rapidly evolving landscape of artificial intelligence and machine learning, one innovation stands out for its profound impact on how we process, understand, and generate data: Transformers. Transformers have revolutionized the field of natural language processing (NLP) and beyond, powering some of today's most advanced AI applications. But what exactly are Transformers, and how do they manage to transform data in such groundbreaking ways? This article demystifies the inner workings of Transformer models, focusing on the encoder architecture. We'll start by going through the implementation of a Transformer encoder in Python, breaking down its main components. Then, we will visualize how Transformers process and adapt input data during training.

While this blog doesn't cover every architectural detail, it provides an implementation and an overall understanding of the transformative power of Transformers. For an in-depth explanation of Transformers, I suggest you take a look at the excellent Stanford CS224N course.

I also recommend following the GitHub repository associated with this article for additional details. 😊

The Transformer model from Attention Is All You Need

This picture shows the original Transformer architecture, combining an encoder and a decoder for sequence-to-sequence language tasks.

In this article, we will focus on the encoder architecture (the red block in the image). This is what the popular BERT model uses under the hood: the primary focus is on understanding and representing the data, rather than generating sequences. It can be used for a variety of applications: text classification, named-entity recognition (NER), extractive question answering, etc.

So, how is the data actually transformed by this architecture? We'll explain each component in detail, but here is an overview of the process.

  • The input text is tokenized: the Python string is transformed into a list of tokens (integers)
  • Each token is passed through an Embedding layer that outputs a vector representation for each token
  • The embeddings are then further encoded with a Positional Encoding layer, adding information about the position of each token in the sequence
  • These new embeddings are transformed by a series of Encoder Layers, using a self-attention mechanism
  • A task-specific head can be added. For example, we will later use a classification head to classify movie reviews as positive or negative

It is essential to understand that the Transformer architecture transforms the embedding vectors by mapping them from one representation in a high-dimensional space to another within the same space, applying a series of complex transformations.

The Positional Encoder Layer

Unlike RNN models, the attention mechanism makes no use of the order of the input sequence. The PositionalEncoder class adds positional encodings to the input embeddings, using two mathematical functions: cosine and sine.

Positional encoding matrix definition from Attention Is All You Need
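In the paper's notation, where pos is the token position, i is the dimension index, and d_model is the embedding dimension, the encodings are defined as:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$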

Note that positional encodings don't contain trainable parameters: they are the results of deterministic computations, which makes this method very tractable. Also, sine and cosine functions take values between -1 and 1 and have useful periodicity properties that help the model learn patterns about the relative positions of words.
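The snippets below assume a handful of standard imports; here is a minimal set (my assumption, the repository may organize them slightly differently):

import math

import torch
import torch.nn as nn
import torch.nn.functional as F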

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)

        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))

        # Calculate and assign position encodings to the matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]  # add the positional encodings to the input embeddings
        return x
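As a quick sanity check (a hypothetical example with arbitrary values, not taken from the article), you can pass a batch of dummy embeddings through the layer and verify that the shape is unchanged:

# Hypothetical sanity check: shapes are (batch_size, seq_length, d_model)
pos_encoder = PositionalEncoder(d_model=256, max_length=256)
dummy_embeddings = torch.zeros(2, 10, 256)  # batch of 2 sequences, 10 tokens each
print(pos_encoder(dummy_embeddings).shape)  # torch.Size([2, 10, 256])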

Multi-Head Self-Attention

The self-attention mechanism is the key component of the encoder architecture. Let's ignore the "multi-head" part for now. Attention is a way to determine, for each token (i.e. each embedding), the relevance of all other embeddings to that token, in order to obtain a more refined and contextually relevant encoding.

How does "it" attend to other words in the sequence? (The Illustrated Transformer)

There are a few steps in the self-attention mechanism.

  • Use matrices Q, K, and V to respectively transform the inputs "query", "key", and "value". Note that for self-attention, the query, key, and value are all equal to our input embeddings
  • Compute the attention scores using the dot product (a similarity measure) between the query and the key. Scores are scaled by the square root of the embedding dimension to stabilize the gradients during training
  • Use a softmax layer to turn these scores into probabilities
  • The output is the weighted average of the values, using the attention scores as the weights

Mathematically, this corresponds to the following formula.

The Attention Mechanism from Attention Is All You Need
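In the paper's notation, with d_k the dimension of the keys, this reads:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$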

What does "multi-head" mean? Basically, we can apply the described self-attention mechanism several times, in parallel, then concatenate and project the outputs. This allows each head to focus on different semantic aspects of the sentence.

We start by defining the number of heads, the dimension of the embeddings (d_model), and the dimension of each head (head_dim). We also initialize the Q, K, and V matrices (linear layers), and the final projection layer.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)

When using multi-head attention, we apply each attention head with a reduced dimension (head_dim instead of d_model) as in the original paper, making the total computational cost similar to a one-head attention layer with full dimensionality. Note that this is a logical split only. What makes multi-head attention so powerful is that it can still be expressed as a single matrix operation, making computations very efficient on GPUs.

    def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)
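To make the reshaping concrete, here is a hypothetical shape trace (the values are chosen purely for illustration):

# Hypothetical shape trace for split_heads with batch_size=2, seq_length=10,
# d_model=256 and num_heads=4 (so head_dim=64):
#   input x:             (2, 10, 256)
#   after view:          (2, 10, 4, 64)
#   after permute+view:  (2 * 4, 10, 64) = (8, 10, 64)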

We compute the attention scores and use a mask to avoid applying attention to padded tokens. We apply a softmax activation to turn these scores into probabilities.

    def compute_attention(self, query, key, mask=None):
        # Compute scaled dot-product attention scores
        # dimensions of query and key are (batch_size * num_heads, seq_length, head_dim)
        scores = query @ key.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Now, the dimensions of scores are (batch_size * num_heads, seq_length, seq_length)
        if mask is not None:
            # Group the scores by batch element so the padding mask lines up with each example
            scores = scores.view(-1, self.num_heads, mask.shape[1], mask.shape[2])
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, float('-1e20'))  # avoid attention on padding tokens
            scores = scores.view(-1, mask.shape[1], mask.shape[2])  # reshape back to the original shape
        # Normalize attention scores into attention weights
        attention_weights = F.softmax(scores, dim=-1)

        return attention_weights

The forward method performs the multi-head logical split and computes the attention weights. Then, we get the output by multiplying these weights by the values. Finally, we reshape the output and project it with a linear layer.

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)

        attention_weights = self.compute_attention(query, key, mask)

        # Multiply attention weights by values, concatenate and linearly project outputs
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim)
        output = output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)
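As before, a quick hypothetical sanity check (shapes only, arbitrary values) confirms that the layer preserves the input dimensions:

# Hypothetical sanity check for MultiHeadAttention
mha = MultiHeadAttention(d_model=256, num_heads=4)
x = torch.randn(2, 10, 256)  # (batch_size, seq_length, d_model)
print(mha(x, x, x).shape)    # self-attention: query = key = value = x -> torch.Size([2, 10, 256])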

The Encoder Layer

This is the main component of the architecture, which leverages multi-head self-attention. We first implement a simple class to perform a feed-forward operation through 2 dense layers.

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

We can now code the logic for the encoder layer. We start by applying self-attention to the input, which gives a vector of the same dimension. We then use our mini feed-forward network with Layer Norm layers. Note that we also use skip connections before applying normalization.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # skip connection and normalization
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))  # skip connection and normalization

Putting Everything Together

It's time to create our final model. We pass our data through an embedding layer, which transforms our raw tokens (integers) into numerical vectors. We then apply our positional encoder and several (num_layers) encoder layers.

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x
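A hypothetical end-to-end shape check (the vocabulary size and hyperparameters here are placeholders) looks like this:

# Hypothetical shape check for the full encoder
toy_encoder = TransformerEncoder(vocab_size=30522, d_model=256, num_layers=4, num_heads=4,
                                 d_ff=512, dropout=0.1, max_sequence_length=256)
toy_ids = torch.randint(0, 30522, (2, 10))        # (batch_size, seq_length)
toy_mask = torch.ones(2, 10, 10, dtype=torch.long)  # no padding in this toy example
print(toy_encoder(toy_ids, toy_mask).shape)       # torch.Size([2, 10, 256])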

We also create a ClassifierHead class, which is used to transform the final embedding into class probabilities for our classification task.

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x[:, 0, :])  # the first token corresponds to the classification token
        return F.softmax(logits, dim=-1)
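A quick hypothetical check on a fake encoder output (shapes only) shows that the head reads a single position and returns one probability vector per sequence:

# Hypothetical check: the head only reads the first position of the sequence
head = ClassifierHead(d_model=256, num_classes=2)
fake_encoder_output = torch.randn(2, 10, 256)  # (batch_size, seq_length, d_model)
print(head(fake_encoder_output).shape)         # torch.Size([2, 2]) -> class probabilities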

Note that the dense and softmax layers are only applied to the first embedding (corresponding to the first token of the input sequence). This is because, when tokenizing the text, the first token is the [CLS] token, which stands for "classification." The [CLS] token is designed to aggregate the entire sequence's information into a single embedding vector, serving as a summary representation that can be used for classification tasks.

Note: the concept of including a [CLS] token originates from BERT, which was initially trained on tasks like next-sentence prediction. The [CLS] token was inserted to predict the likelihood that sentence B follows sentence A, with a [SEP] token separating the two sentences. For our model, the [SEP] token simply marks the end of the input sentence, as shown below.

[CLS] Token in BERT Architecture (All About AI)

If you think about it, it's really mind-blowing that this single [CLS] embedding is able to capture so much information about the entire sequence, thanks to the self-attention mechanism's ability to weigh and synthesize the importance of every piece of the text in relation to the others.

Hopefully, the previous section gives you a better understanding of how our Transformer model transforms the input data. We'll now write our training pipeline for our binary classification task using the IMDB dataset (movie reviews). Then, we will visualize the embedding of the [CLS] token during the training process to see how our model transformed it.

We first define our hyperparameters, as well as a BERT tokenizer. In the GitHub repository, you can see that I also coded a function to select a subset of the dataset with only 1200 train and 200 test examples.
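The training code additionally relies on these imports (again a minimal set; my assumption of how the repository organizes them):

from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer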

num_classes = 2        # binary classification
d_model = 256          # dimension of the embedding vectors
num_heads = 4          # number of heads for self-attention
num_layers = 4         # number of encoder layers
d_ff = 512             # dimension of the dense layers in the encoder layers
sequence_length = 256  # maximum sequence length
dropout = 0.4          # dropout to avoid overfitting
num_epochs = 20
batch_size = 32

loss_function = torch.nn.CrossEntropyLoss()

dataset = load_dataset("imdb")
dataset = balance_and_create_dataset(dataset, 1200, 200)  # check the GitHub repo

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=sequence_length)

Once the dataset has been tokenized (see the mapping step just below), you can inspect the tokenizer's output on one of the sentences:

print(tokenized_datasets['train']['input_ids'][0])

Every sequence should start with the token 101, corresponding to [CLS], followed by some non-zero integers and padded with zeros if the sequence length is smaller than 256. Note that these zeros are ignored during the self-attention computation thanks to our mask.
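The encode_examples function lives in the GitHub repository; a minimal sketch of what it might look like (an assumption on my part, not the exact repo code) is:

# Hypothetical sketch of encode_examples (the actual function is in the GitHub repo)
def encode_examples(examples):
    # Tokenize a batch of reviews, padding/truncating to the maximum sequence length
    return tokenizer(examples['text'], padding='max_length', truncation=True)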

tokenized_datasets = dataset.map(encode_examples, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataloader = DataLoader(tokenized_datasets['train'], batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(tokenized_datasets['test'], batch_size=batch_size, shuffle=True)

vocab_size = tokenizer.vocab_size

encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
classifier = ClassifierHead(d_model, num_classes)

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)

We can now write our train function:

def train(dataloader, encoder, classifier, optimizer, loss_function, num_epochs):
    for epoch in range(num_epochs):
        # Collect and store embeddings before each epoch starts for visualization purposes (check the repo)
        all_embeddings, all_labels = collect_embeddings(encoder, dataloader)
        reduced_embeddings = visualize_embeddings(all_embeddings, all_labels, epoch, show=False)
        dic_embeddings[epoch] = [reduced_embeddings, all_labels]

        encoder.train()
        classifier.train()
        correct_predictions = 0
        total_predictions = 0
        for batch in tqdm(dataloader, desc="Training"):
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']  # indicates where the padded tokens are
            # These 2 lines turn the attention_mask into a matrix instead of a vector
            attention_mask = attention_mask.unsqueeze(-1)
            attention_mask = attention_mask & attention_mask.transpose(1, 2)
            labels = batch['label']
            optimizer.zero_grad()
            output = encoder(input_ids, attention_mask)
            classification = classifier(output)
            loss = loss_function(classification, labels)
            loss.backward()
            optimizer.step()
            preds = torch.argmax(classification, dim=1)
            correct_predictions += torch.sum(preds == labels).item()
            total_predictions += labels.size(0)

        epoch_accuracy = correct_predictions / total_predictions
        print(f'Epoch {epoch} Training Accuracy: {epoch_accuracy:.4f}')
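The test_dataloader defined earlier is not used in the snippet above; a minimal evaluation loop (my own sketch, not from the article) would mirror the training loop while disabling gradients and dropout:

# Hypothetical evaluation loop on the test set
def evaluate(dataloader, encoder, classifier):
    encoder.eval()
    classifier.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask'].unsqueeze(-1)
            attention_mask = attention_mask & attention_mask.transpose(1, 2)
            labels = batch['label']
            preds = torch.argmax(classifier(encoder(input_ids, attention_mask)), dim=1)
            correct += torch.sum(preds == labels).item()
            total += labels.size(0)
    return correct / total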

You can find the collect_embeddings and visualize_embeddings functions in the GitHub repo. They store the [CLS] token embedding for each sentence of the training set, apply a dimensionality reduction technique called t-SNE to turn them into 2D vectors (instead of 256-dimensional vectors), and save an animated plot.

Let's visualize the results.

Projected [CLS] embeddings for each training point (blue corresponds to positive sentences, red corresponds to negative sentences)

Observing the plot of projected [CLS] embeddings for each training point, we can see the clear distinction between positive (blue) and negative (red) sentences after a few epochs. This visual shows the remarkable capability of the Transformer architecture to adapt embeddings over time and highlights the power of the self-attention mechanism. The data is transformed in such a way that embeddings for each class are well separated, significantly simplifying the task for the classifier head.

As we conclude our exploration of the Transformer architecture, it's evident that these models are adept at tailoring data to a given task. With the use of positional encoding and multi-head self-attention, Transformers go beyond mere data processing: they interpret and understand information with a level of sophistication previously unseen. The ability to dynamically weigh the relevance of different parts of the input data allows for a more nuanced understanding and representation of the input text. This enhances performance across a wide variety of downstream tasks, including text classification, question answering, named-entity recognition, and more.

Now that you have a better understanding of the encoder architecture, you are ready to delve into decoder and encoder-decoder models, which are very similar to what we have just explored. Decoders play a pivotal role in generative tasks and are at the core of the popular GPT models.
