Large Language Models: DistilBERT — Smaller, Faster, Cheaper and Lighter


Unlocking the secrets of BERT compression: a student-teacher framework for maximum efficiency

In recent years, the evolution of large language models has skyrocketed. BERT became one of the most popular and efficient models, allowing a wide range of NLP tasks to be solved with high accuracy. After BERT, a set of other models later appeared on the scene, demonstrating outstanding results as well.

The obvious trend that has become easy to observe is that, over time, large language models (LLMs) tend to become more complex by exponentially increasing the number of parameters and the amount of data they are trained on. Research in deep learning has shown that such techniques usually lead to better results. Unfortunately, the machine learning world has already run into several problems with LLMs, and scalability has become the main obstacle to training, storing and using them effectively.

With this issue in mind, special techniques have been developed for compressing LLMs. The objectives of compression algorithms are decreasing training time, reducing memory consumption or accelerating model inference. The three most common compression techniques used in practice are the following:

  • Knowledge distillation involves training a smaller model that attempts to reproduce the behaviour of a larger model.
  • Quantization is the process of reducing the memory needed to store the numbers representing a model’s weights.
  • Pruning refers to discarding the least important model weights.

In this article, we will look at the distillation mechanism applied to BERT, which led to a new model called DistilBERT. By the way, the techniques discussed below can be applied to other NLP models as well.

The goal of distillation is to create a smaller model that can imitate a larger model. In practice, it means that if the large model predicts something, then the smaller model is expected to make a similar prediction.

To achieve this, a larger model needs to be already pretrained (BERT in our case). Then an architecture for the smaller model must be chosen. To increase the chances of successful imitation, it is recommended that the smaller model have an architecture similar to that of the larger model, with a reduced number of parameters. Finally, the smaller model learns from the predictions made by the larger model on a certain dataset. For this objective, it is important to choose an appropriate loss function that helps the smaller model learn better.

In distillation terminology, the larger model is called the teacher and the smaller model is called the student.

In general, the distillation procedure is applied during pretraining, but it can be applied during fine-tuning as well.

DistilBERT learns from BERT and updates its weights by using a loss function that consists of three components:

  • Masked language modeling (MLM) loss
  • Distillation loss
  • Similarity loss

Below, we are going to discuss these loss components and understand why each of them is needed. However, before diving into the details, it is essential to grasp an important concept called temperature in the softmax activation function. The temperature concept is used in the DistilBERT loss function.

It is common to see a softmax transformation as the last layer of a neural network. Softmax normalizes all model outputs so that they sum up to 1 and can be interpreted as probabilities.

There is a variant of the softmax formula in which all model outputs are divided by a temperature parameter T:

Softmax temperature formula: pᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T), where zᵢ is the model output (logit) and pᵢ is the normalized probability for the i-th object, and T is the temperature parameter.

The temperature T controls the smoothness of the output distribution:

  • If T > 1, then the distribution becomes smoother.
  • If T = 1, then the distribution is the same as if the standard softmax had been applied.
  • If T < 1, then the distribution becomes sharper (more peaked).

To make things clear, let us look at an example. Consider a classification task with 5 labels in which a neural network produced 5 values indicating the confidence of an input object belonging to the corresponding class. Applying softmax with different values of T results in different output distributions.

An example of a neural network producing different probability distributions based on the temperature T

The greater the temperature is, the smoother the probability distribution becomes.

Softmax transformation of logits (natural numbers from 1 to 5) for different values of the temperature T. As the temperature increases, the softmax values become closer to one another.
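
To make the effect of T concrete, here is a minimal sketch in PyTorch that applies the temperature softmax to the logits 1 to 5 from the example above (softmax_with_temperature is just an illustrative helper, not a library function):

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Softmax where the logits are divided by the temperature T first."""
    return torch.softmax(logits / T, dim=-1)

# Logits from the example above: natural numbers from 1 to 5
logits = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])

for T in (0.5, 1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```

Running the loop shows the distribution flattening as T grows and sharpening as T shrinks below 1.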

Masked language modeling loss

Similarly to the teacher model (BERT), during pretraining the student (DistilBERT) learns language by making predictions for the masked language modeling task. After producing a prediction for a certain token, the predicted probability distribution is compared to the one-hot encoded probability distribution of the teacher model.

The one-hot encoded distribution denotes a probability distribution where the probability of the most likely token is set to 1 and the probabilities of all other tokens are set to 0.

As in most language models, the cross-entropy loss is calculated between the predicted and target distributions, and the weights of the student model are updated through backpropagation.

Masked language modeling loss computation example
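
As a rough sketch of this computation in PyTorch, with random placeholder tensors standing in for real model outputs and targets:

```python
import torch
import torch.nn.functional as F

vocab_size = 30522                      # size of BERT's WordPiece vocabulary
batch_size, seq_len = 8, 128

# Placeholders for the student's output logits and the target token ids
# behind the one-hot distribution.
student_logits = torch.randn(batch_size, seq_len, vocab_size)
target_token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Roughly 15% of positions are treated as masked, as in BERT's MLM setup;
# all other positions are skipped via the conventional -100 ignore label.
masked = torch.rand(batch_size, seq_len) < 0.15
labels = target_token_ids.masked_fill(~masked, -100)

# Cross-entropy between the predicted distribution and the one-hot target.
mlm_loss = F.cross_entropy(student_logits.view(-1, vocab_size),
                           labels.view(-1),
                           ignore_index=-100)
print(mlm_loss.item())
```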

Distillation loss

Actually, it is possible to use only the student (masked language modeling) loss to train the student model. However, in many cases it might not be enough. The common problem with using only the student loss lies in its softmax transformation, in which the temperature T is set to 1. In practice, the resulting distribution with T = 1 tends to take a form where one of the possible labels has a very high probability close to 1 and all other label probabilities are low, close to 0.

Such a situation does not align well with cases where two or more classification labels are valid for a particular input: the softmax layer with T = 1 will be very likely to exclude all valid labels but one and will make the probability distribution close to a one-hot encoded distribution. This results in a loss of potentially useful information that could be learned by the student model, making it less diverse.

That is why the authors of the paper introduce the distillation loss, in which softmax probabilities are calculated with a temperature T > 1, making it possible to smoothly align the probabilities and thus take several possible answers into account for the student.

In the distillation loss, the same temperature T is applied to both the student and the teacher. The one-hot encoding of the teacher’s distribution is removed.

Distillation loss computation example

Instead of the cross-entropy loss, it is possible to use the KL divergence loss.
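
Here is a minimal sketch of a temperature-scaled distillation loss in PyTorch using the KL divergence; the default T = 2.0 and the T² scaling factor are common conventions from the knowledge-distillation literature rather than values taken from the DistilBERT paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL divergence between the softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures;
    # it is an assumption here, not a detail confirmed by the paper.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (T ** 2)
```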

Similarity loss

The researchers also state that it is beneficial to add a cosine similarity loss between the hidden state embeddings.

Cosine loss formula

This way, the student is likely not only to reproduce masked tokens correctly but also to construct embeddings that are similar to those of the teacher. It also opens the door to preserving the same relations between embeddings in both models’ embedding spaces.

Similarity loss computation example
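
A minimal sketch of such a loss in PyTorch might look as follows; the 1 − cos formulation is one common way to turn cosine similarity into a loss (PyTorch’s nn.CosineEmbeddingLoss is an alternative).

```python
import torch
import torch.nn.functional as F

def similarity_loss(student_hidden: torch.Tensor,
                    teacher_hidden: torch.Tensor) -> torch.Tensor:
    """1 - cos(h_student, h_teacher), averaged over all token positions.

    Both hidden-state tensors have shape (batch, seq_len, 768), since the
    student keeps the teacher's hidden size.
    """
    cos = F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)
    return (1.0 - cos).mean()
```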

Triple loss

Finally, a linear combination of all three loss functions is calculated, and this defines the loss function in DistilBERT. Based on the loss value, backpropagation is performed on the student model to update its weights.

DistilBERT loss function
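
A minimal sketch of this combination in PyTorch is shown below; the alpha weights are placeholders for the mixing coefficients, which are hyperparameters chosen by the authors and not reproduced here.

```python
import torch

def triple_loss(mlm_loss: torch.Tensor,
                distillation_loss: torch.Tensor,
                similarity_loss: torch.Tensor,
                alpha_mlm: float = 1.0,
                alpha_dist: float = 1.0,
                alpha_cos: float = 1.0) -> torch.Tensor:
    """Linear combination of the three components; backpropagating through
    the result updates only the student's weights (the teacher is frozen)."""
    return (alpha_mlm * mlm_loss
            + alpha_dist * distillation_loss
            + alpha_cos * similarity_loss)
```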

As an interesting fact, among the three loss components, the masked language modeling loss has the least influence on the model’s performance. The distillation loss and similarity loss have a much higher impact.

The inference process in DistilBERT works exactly as during the training phase. The only subtlety is that the softmax temperature T is set to 1. This is done to obtain probabilities close to those calculated by BERT.

In general, DistilBERT uses the same architecture as BERT apart from these changes:

  • DistilBERT has only half of BERT’s layers. Each layer in the model is initialized by taking one out of every two BERT layers.
  • Token-type embeddings are removed.
  • The dense layer which is applied to the hidden state of the [CLS] token for a classification task is removed.
  • For more robust performance, the authors use the best ideas proposed in RoBERTa:
    – usage of dynamic masking
    – removal of the next sentence prediction objective
    – training on larger batches
    – the gradient accumulation technique for optimized gradient computations

The final hidden layer size (768) in DistilBERT is the same as in BERT. The authors reported that reducing it does not lead to considerable improvements in terms of computational efficiency. According to them, reducing the total number of layers has a much higher impact.
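
The layer-halving initialization can be illustrated with a short sketch using the Hugging Face transformers library. Note the simplifications: the student here reuses the BertModel class (real DistilBERT has its own DistilBertModel without token-type embeddings), and taking the even-numbered layers is just one way to read “one layer out of two”.

```python
from transformers import BertConfig, BertModel

# Load the pretrained 12-layer teacher.
teacher = BertModel.from_pretrained("bert-base-uncased")

# Build a 6-layer student with the same hidden size (768).
student_config = BertConfig.from_pretrained("bert-base-uncased",
                                            num_hidden_layers=6)
student = BertModel(student_config)

# Copy the embeddings and every second encoder layer from the teacher.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for i, student_layer in enumerate(student.encoder.layer):
    student_layer.load_state_dict(teacher.encoder.layer[2 * i].state_dict())
```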

DistilBERT is trained on the same corpus of data as BERT, which consists of BooksCorpus (800M words) and English Wikipedia (2,500M words).

The key performance parameters of BERT and DistilBERT were compared on several of the most popular benchmarks. Here are the facts that are important to retain:

  • During inference, DistilBERT is 60% faster than BERT.
  • DistilBERT has 44M fewer parameters and in total is 40% smaller than BERT.
  • DistilBERT retains 97% of BERT performance.

BERT vs DistilBERT comparison (on the GLUE dataset)

DistilBERT marked a huge step in BERT’s evolution by significantly compressing the model while achieving comparable performance on various NLP tasks. Apart from that, DistilBERT weighs only 207 MB, which makes integration on devices with restricted capacities easier. Knowledge distillation is not the only technique that can be applied: DistilBERT can be further compressed with quantization or pruning algorithms.

All images unless otherwise noted are by the author
