2017 was a historic year in machine learning: it was the year the Transformer model first appeared. It has performed remarkably well on many benchmarks and has proven suitable for a wide range of problems in data science. Thanks to its efficient architecture, many other Transformer-based models have since been developed that specialise in particular tasks.
One such model is BERT. It is primarily known for its ability to construct embeddings that accurately represent text and capture the semantic meaning of long text sequences. As a result, BERT embeddings have become widely used in machine learning. Understanding how BERT builds text representations is crucial because it opens the door to tackling a wide range of NLP tasks.
In this article, we will refer to the original BERT paper, look at the BERT architecture and examine the core mechanisms behind it. In the first sections, we will give a high-level overview of BERT. After that, we will gradually dive into its internal workflow and how information is passed through the model. Finally, we will see how BERT can be fine-tuned to solve particular NLP problems.
The Transformer architecture consists of two primary parts: encoders and decoders. The goal of the stacked encoders is to construct a meaningful embedding of the input that preserves its essential context. The output of the last encoder is passed to the inputs of all decoders, which then generate new information.
BERT is a successor of the Transformer that inherits its stacked bidirectional encoders. Most of the architectural principles in BERT are the same as in the original Transformer.
There are two main versions of BERT: Base and Large. Their architecture is identical except for their size (number of layers, hidden size and attention heads), which gives them different numbers of parameters. Overall, BERT Large has 3.09 times more parameters to tune than BERT Base.
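As a rough check on the size gap, the sketch below counts encoder parameters from the layer shapes. The hyperparameters are assumptions that are not stated above: they are the values reported in the BERT paper and used by the released uncased checkpoints (12 layers with hidden size 768 for Base, 24 layers with hidden size 1024 for Large, a 30,522-token vocabulary, 512 positions and a feed-forward width of 4 × hidden). The pre-training output heads are not counted.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT encoder (assumed shapes, see text)."""
    ffn = 4 * hidden                      # feed-forward inner width
    layer_norm = 2 * hidden               # scale and bias
    # Token, position and segment tables plus one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value and output projections (with biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (with biases) plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] output.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(base, large, round(large / base, 2))
```

Under these assumptions the totals come to roughly 109.5M and 335.1M parameters, a ratio of about 3.06. The 3.09 figure corresponds to the rounded sizes reported in the paper (110M and 340M).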
As the letter “B” in BERT’s name suggests, BERT is a bidirectional model: it captures connections between words better because information is passed in both directions (left-to-right and right-to-left). This requires more training resources than unidirectional models, but it also leads to better prediction accuracy.
For a better understanding, we can visualise the BERT architecture in comparison with other popular NLP models.
Before diving into how BERT is trained, it is important to understand the format in which it accepts data. As input, BERT takes a single sentence or a pair of sentences. Each sentence is split into tokens. Additionally, two special tokens are added to the input:
- [CLS]: placed before the first sentence to indicate the beginning of the sequence. [CLS] is also used for a classification objective during training (discussed in the sections below).
- [SEP]: placed between sentences to indicate the end of the first sentence and the beginning of the second.
Passing two sentences makes it possible for BERT to handle a large variety of tasks where the input contains two sentences (e.g. question and answer, hypothesis and premise, etc.).
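The input layout described above can be sketched in a few lines. This is only an illustration: whitespace splitting stands in for BERT's WordPiece tokenizer, the function name `build_bert_input` is hypothetical, and the trailing [SEP] after the last sentence follows the reference implementation rather than the description above.

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Return (tokens, segment_ids) for one sentence or a sentence pair."""
    # Assumption: whitespace splitting replaces the real WordPiece tokenizer.
    tokens = ["[CLS]"] + sentence_a.lower().split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # everything so far belongs to sentence A
    if sentence_b is not None:
        tokens_b = sentence_b.lower().split() + ["[SEP]"]
        tokens += tokens_b
        segment_ids += [1] * len(tokens_b)   # sentence B and its closing [SEP]
    return tokens, segment_ids


tokens, segment_ids = build_bert_input("Where is the book", "It is on the table")
print(tokens)
print(segment_ids)
```

The segment ids produced here are what the segment embeddings discussed next are looked up from.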
After tokenisation, an embedding is built for every token. To make input embeddings more representative, BERT constructs three kinds of embeddings for every token:
- Token embeddings capture the semantic meaning of tokens.
- Segment embeddings take one of two possible values and indicate which sentence a token belongs to.
- Position embeddings contain information about the position of a token in the sequence.
These embeddings are summed and the result is passed to the first encoder of the BERT model.
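To make the summation concrete, here is a minimal sketch with tiny randomly initialised lookup tables. The vocabulary, the hidden size of 4 and the table sizes are toy assumptions chosen for readability; in the real model the tables are learned and much larger.

```python
import random

random.seed(0)
HIDDEN = 4  # toy embedding size (assumption)
VOCAB = {"[CLS]": 0, "[SEP]": 1, "my": 2, "dog": 3, "barks": 4}


def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(len(VOCAB), HIDDEN)
segment_table = random_table(2, HIDDEN)     # sentence A or sentence B
position_table = random_table(8, HIDDEN)    # one row per position


def embed(tokens, segment_ids):
    """Sum token, segment and position embeddings for every input token."""
    rows = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        t = token_table[VOCAB[token]]
        s = segment_table[segment]
        p = position_table[position]
        rows.append([a + b + c for a, b, c in zip(t, s, p)])
    return rows


tokens = ["[CLS]", "my", "dog", "barks", "[SEP]"]
inputs = embed(tokens, [0] * len(tokens))
print(len(inputs), len(inputs[0]))  # one vector of size HIDDEN per token
```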
Each encoder takes n embeddings as input and outputs the same number of processed embeddings of the same dimensionality. Ultimately, the whole BERT output also contains n embeddings, each of which corresponds to its initial token.
BERT training consists of two stages:
- Pre-training. BERT is trained on unlabeled pairs of sentences over two prediction tasks: masked language modeling (MLM) and next sentence prediction (NSP). For every pair of sentences, the model makes predictions for these two tasks and, based on the loss values, performs backpropagation to update its weights.
- Fine-tuning. BERT is initialised with pre-trained weights, which are then optimised for a specific problem on labeled data.
Compared with fine-tuning, pre-training usually takes a much larger share of the time because the model is trained on a large corpus of data. That is why there are many online repositories of pre-trained models, which can then be fine-tuned relatively quickly to solve a specific task.
We are going to look in detail at both problems solved by BERT during pre-training.
Masked Language Modeling
The authors propose training BERT by masking a certain proportion of tokens in the initial text and predicting them. This gives BERT the ability to construct resilient embeddings that use the surrounding context to guess a given word, which in turn leads to an appropriate embedding for the masked word as well. The process works as follows:
1. After tokenisation, 15% of tokens are randomly chosen to be masked. The chosen tokens will then be predicted at the end of the iteration.
2. The chosen tokens are replaced in one of three ways:
 - 80% of the tokens are replaced by the [MASK] token.
 Example: I purchased a book → I purchased a [MASK]
 - 10% of the tokens are replaced by a random token.
 Example: He’s eating a fruit → He’s drawing a fruit
 - 10% of the tokens remain unchanged.
 Example: A home is near me → A home is near me
3. All tokens are passed to the BERT model, which outputs an embedding for every token it received as input.
4. Output embeddings corresponding to the tokens processed at step 2 are independently used to predict the masked tokens. The result of each prediction is a probability distribution over all the tokens in the vocabulary.
5. The cross-entropy loss is calculated by comparing the probability distributions with the true masked tokens.
6. The model weights are updated using backpropagation.
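The token selection and replacement steps can be sketched as below. Two simplifications are assumptions of this sketch rather than details from the text: each token is chosen independently with probability 0.15 (so the chosen share is 15% only on average), and the special tokens [CLS] and [SEP] are never chosen. The function name `mask_tokens` is hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Skip special tokens and tokens that are not chosen for prediction.
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = token                      # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


rng = random.Random(42)
sentence = ["[CLS]", "i", "purchased", "a", "book", "[SEP]"]
print(mask_tokens(sentence, vocab=["book", "fruit", "home", "near"], rng=rng))
```

Only positions with a non-empty label contribute to the cross-entropy loss in step 5.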
Next Sentence Prediction
For this classification task, known in the paper as next sentence prediction (NSP), BERT tries to predict whether the second sentence follows the first. The whole prediction is made using only the embedding from the final hidden state of the [CLS] token, which is supposed to contain aggregated information from both sentences.
As with MLM, the resulting probability distribution (binary in this case) is used to calculate the model’s loss and update the weights of the model through backpropagation.
For this task, the authors recommend choosing 50% of sentence pairs in which the sentences follow each other in the corpus (positive pairs) and 50% of pairs in which the second sentence is taken randomly from the corpus (negative pairs).
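The 50/50 sampling of positive and negative pairs can be sketched as follows. This is a simplified illustration: the corpus is a single ordered list of at least three sentences, negatives are drawn from the same list (excluding the first sentence itself and its true successor), and the function name `make_sentence_pairs` is hypothetical.

```python
import random


def make_sentence_pairs(sentences, rng=random):
    """Build (first, second, follows) triples from an ordered list of sentences."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive pair: the second sentence really follows the first.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative pair: any sentence except the first one and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs


corpus = ["I woke up early.", "I made coffee.", "Then I went to work.", "It was raining."]
for first, second, follows in make_sentence_pairs(corpus, rng=random.Random(3)):
    print(follows, "|", first, "|", second)
```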
Training details
According to the paper, BERT is pre-trained on BooksCorpus (800M words) and English Wikipedia (2,500M words). To extract longer continuous texts, the authors took only text passages from Wikipedia, ignoring tables, headers and lists.
BERT is trained for 1,000,000 steps with a batch size of 256 sequences, which is approximately 40 epochs over the 3.3 billion word corpus. Each sequence contains up to 128 (90% of the time) or 512 (10% of the time) tokens.
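The figure of roughly 40 epochs can be reproduced with a back-of-the-envelope calculation. It rests on two simplifying assumptions: every sequence is counted at the maximum length of 512 tokens, and one token is treated as one word.

```python
steps = 1_000_000
sequences_per_batch = 256
max_tokens_per_sequence = 512            # upper bound; most steps use 128 tokens
corpus_words = 800_000_000 + 2_500_000_000

tokens_per_batch = sequences_per_batch * max_tokens_per_sequence
epochs = steps * tokens_per_batch / corpus_words

print(tokens_per_batch)    # 131072
print(round(epochs, 1))    # 39.7
```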
According to the original paper, the training parameters are the following:
- Optimiser: Adam (learning rate l = 1e-4, weight decay L₂ = 0.01, β₁ = 0.9, β₂ = 0.999, ε = 1e-6).
- The learning rate is warmed up over the first 10,000 steps and then decayed linearly.
- Dropout with probability 0.1 is applied on all layers.
- Activation function: GELU.
- The training loss is the sum of the mean MLM likelihood and the mean next sentence prediction likelihood.
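The warmup and linear decay of the learning rate can be written as a small function of the step number. The sketch assumes that the rate grows linearly from zero during warmup and decays to exactly zero at the last of the 1,000,000 training steps mentioned earlier; the end point of the decay is an assumption of this sketch.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed end point)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step))
```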
Once pre-training is complete, BERT can construct embeddings that capture much of the semantic meaning of words. The goal of fine-tuning is to gradually adjust the BERT weights to solve a specific downstream task.
Data format
Thanks to the robustness of the self-attention mechanism, BERT can be easily fine-tuned for a specific downstream task. Another advantage of BERT is its ability to construct bidirectional text representations. This gives a higher chance of discovering correct relations between two sentences when working with pairs. Previous approaches consisted of independently encoding both sentences and then applying bidirectional cross-attention to them. BERT unifies these two stages.
Depending on the problem, BERT accepts several input formats. The framework for solving all downstream tasks with BERT is the same: taking a sequence of text as input, BERT outputs a set of token embeddings, which are then fed to a task-specific model. Most of the time, not all of the output embeddings are used.
Let us have a look at common problems and the ways they are solved by fine-tuning BERT.
Sentence pair classification
The goal of sentence pair classification is to understand the relationship between a given pair of sentences. The most common types of tasks are:
- Natural language inference: determining whether the second sentence logically follows from the first.
- Similarity analysis: finding the degree of similarity between sentences.
For fine-tuning, both sentences are passed to BERT. As a rule of thumb, the output embedding of the [CLS] token is then used for the classification task. According to the researchers, the [CLS] token is supposed to contain the main information about sentence relationships.
Of course, other output embeddings can also be used, but they are usually omitted in practice.
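As an illustration of how the [CLS] embedding can feed a classifier, the sketch below applies a single linear layer followed by a softmax. The choice of a linear head, the three-dimensional embedding and all the numbers are assumptions made for the example; in fine-tuning, the weights of the head are learned together with BERT.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_embedding, weights, biases):
    """Linear layer with one weight row per class, followed by a softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_embedding)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


cls_embedding = [0.2, -0.1, 0.4]                  # toy [CLS] output (assumption)
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]     # two classes
biases = [0.0, 0.0]
probs = classify(cls_embedding, weights, biases)
print(probs)
```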
Question answering task
The objective of question answering is to find the answer to a specific question in a text paragraph. Most of the time, the answer is given in the form of two numbers: the start and end token positions of the answer within the passage.
As input, BERT takes the question and the paragraph and outputs a set of embeddings for them. Since the answer is contained within the paragraph, we are only interested in the output embeddings corresponding to paragraph tokens.
To find the position of the start answer token in the paragraph, the scalar product between every output embedding and a special trainable vector Tₛₜₐᵣₜ is calculated. In most cases, when the model and the vector Tₛₜₐᵣₜ are trained accordingly, a higher scalar product indicates a higher likelihood that the corresponding token is in fact the start answer token. To normalise the scalar products, they are passed to the softmax function and can then be interpreted as probabilities. The token with the highest probability is predicted as the start answer token. Based on the true probability distribution, the loss value is calculated and backpropagation is performed. The analogous process is performed with the vector Tₑₙ𝒹 for predicting the end token.
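The scoring procedure for the start and end positions can be sketched as follows. The two-dimensional paragraph embeddings and the vectors `t_start` and `t_end` are made-up toy values; in practice they come from the fine-tuned model.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_position(paragraph_embeddings, vector):
    """Score every paragraph token with a scalar product and pick the most likely one."""
    scores = [sum(e * v for e, v in zip(embedding, vector))
              for embedding in paragraph_embeddings]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy output embeddings for four paragraph tokens (assumption).
paragraph_embeddings = [[0.1, 0.3], [0.9, 0.2], [0.4, 0.8], [0.0, 0.1]]
t_start = [1.0, 0.0]
t_end = [0.0, 1.0]

start, start_probs = predict_position(paragraph_embeddings, t_start)
end, end_probs = predict_position(paragraph_embeddings, t_end)
print(start, end)

# Cross-entropy loss for the start position if the true start token is at index 1.
start_loss = -math.log(start_probs[1])
```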
Single sentence classification
The difference from the previous downstream tasks is that here only a single sentence is passed to BERT. Typical problems solved by this configuration are the following:
- Sentiment analysis: understanding whether a sentence has a positive or negative attitude.
- Topic classification: classifying a sentence into one of several categories based on its contents.
The prediction workflow is the same as for sentence pair classification: the output embedding for the [CLS] token is used as the input for the classification model.
Single sentence tagging
Named entity recognition (NER) is a machine learning problem that aims to map every token of a sequence to one of the respective entities.
For this objective, embeddings are computed for the tokens of an input sentence, as usual. Then every embedding (except for [CLS] and [SEP]) is passed independently to a model that maps it to a given NER class, or to no class if the token is not an entity.
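A minimal sketch of per-token tagging is shown below. It assumes a linear scorer with one weight row per class and uses the label "O" for tokens that are not entities; the embeddings, weights and label set are toy values chosen so that the outcome is easy to follow.

```python
def tag_tokens(tokens, embeddings, weights, labels):
    """Assign the highest-scoring label to every token except [CLS] and [SEP]."""
    tags = []
    for token, embedding in zip(tokens, embeddings):
        if token in ("[CLS]", "[SEP]"):
            continue                          # special tokens are not tagged
        scores = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
        best = max(range(len(scores)), key=scores.__getitem__)
        tags.append((token, labels[best]))
    return tags


labels = ["O", "PER", "LOC"]                  # "O" means "not an entity" (assumption)
tokens = ["[CLS]", "anna", "visited", "paris", "[SEP]"]
embeddings = [                                # toy output embeddings (assumption)
    [0.0, 0.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.0, 0.2, 0.9],
    [0.0, 0.0, 0.0],
]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # one row per label
print(tag_tokens(tokens, embeddings, weights, labels))
```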
Sometimes we deal not only with text but also with other kinds of features, for example numerical ones. It is naturally desirable to construct embeddings that incorporate information from both text and non-text features. Here are the recommended strategies to apply:
- Concatenation of text with non-text features. For instance, if we work with profile descriptions of people in the form of text and there are other separate features like their name or age, then a new text description can be obtained in the form: “My name is <name>. <profile description>. I’m <age> years old”. Finally, such a text description can be fed into the BERT model.
- Concatenation of embeddings with features. It is possible to construct BERT embeddings, as discussed above, and then concatenate them with other features. The only thing that changes in the configuration is that the classification model for the downstream task now has to accept input vectors of higher dimensionality.
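Both strategies are simple to express in code. The sketch below uses hypothetical helper names (`describe` and `concat_features`) and a made-up two-dimensional text embedding; in practice the numeric features would usually be scaled before concatenation, which is omitted here.

```python
def describe(name, age, profile):
    """Strategy 1: fold the non-text features into the text itself."""
    return f"My name is {name}. {profile}. I'm {age} years old"


def concat_features(text_embedding, numeric_features):
    """Strategy 2: append the non-text features to the text embedding."""
    return list(text_embedding) + [float(x) for x in numeric_features]


print(describe("Ann", 30, "I like hiking"))

text_embedding = [0.1, 0.2]                   # toy BERT embedding (assumption)
combined = concat_features(text_embedding, [30])
print(combined, len(combined))                # the classifier input grows by one dimension
```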
In this article, we have dived into the processes of BERT training and fine-tuning. This knowledge is enough to approach the majority of tasks in NLP, thanks to the fact that BERT can incorporate much of the information in text data into embeddings.
In recent years, other BERT-based models such as SBERT and RoBERTa have appeared. There even exists a special field of study called “BERTology”, which analyses BERT’s capabilities in depth in order to derive new high-performing models. These facts reinforce the point that BERT marked a revolution in machine learning and made significant advances in NLP possible.
All images unless otherwise noted are by the author



