
Say Once! Repeating Words Is Not Helping AI


AI data crisis
image by Karen Vardazaryan on Unsplash

As we have seen, more parameters do not equate to better performance. For better performance, we need quality tokens (text), but these are in short supply. How can we obtain them? Can artificial intelligence help us?

Why can't we use ChatGPT to produce text?

If we humans are not producing enough text, why not automate the process? A recent study shows that this approach is not optimal. Stanford Alpaca was trained on 52,000 examples derived from GPT-3, but it only apparently achieved comparable performance. In reality, the model learns the style of the target model but not its knowledge.

Why not train longer?

For PaLM, Gopher, and LLaMA (and the other LLMs as well), it is clearly stated that the models were trained for few epochs (one, or in any case only a few). This is not a limitation of the transformer architecture: Vision Transformers (ViT), for instance, have been trained for 300 epochs on ImageNet (1 million images), as shown in the table:

Large Language Model LLM overfitting
image source: here

Simply because it is extremely expensive. In the LLaMA paper, the authors trained for only one epoch (and two epochs for only part of the dataset). Nonetheless, the authors report:

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. (source)

Training an LLM for even a few epochs is incredibly expensive. As calculated by Dmytro Nikolaiev (Dimid), this means about 4 million dollars if you train a model similar to META's LLaMA on the Google Cloud Platform.
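
A quick back-of-the-envelope check, using the GPU count and training time quoted above and an assumed on-demand price of about $3.93 per A100-80GB GPU-hour on GCP (the hourly rate is an assumption for illustration, not a figure from the article):

```python
# Back-of-the-envelope estimate of LLaMA-65B training cost on GCP.
# GPU count and duration come from the LLaMA quote above;
# the hourly price per A100 80GB is an assumed on-demand rate.
n_gpus = 2048
days = 21
price_per_gpu_hour = 3.93  # USD, assumed GCP on-demand rate for A100 80GB

gpu_hours = n_gpus * days * 24          # ~1.03M GPU-hours
cost = gpu_hours * price_per_gpu_hour   # ~4.1M USD

print(f"{gpu_hours:,.0f} GPU-hours -> ${cost:,.0f}")
```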

So training for additional epochs would lead to a prohibitive increase in costs. Also, we do not know whether this additional training is really useful: it has not been tested yet.

Recently, a group of researchers at the National University of Singapore studied what happens when an LLM is trained for multiple epochs:

Large Language Model LLM overfitting
Image by Unseen Studio on Unsplash

So far, we know that a model's performance depends not only on the number of parameters but also on the number of quality tokens used for training. However, quality tokens are not infinite, and we are approaching the limit. If we cannot find enough quality tokens and generating them with AI is not an option, what can we do?

Can we use the same training set and train for longer?

There is a Latin saying that repetition helps (repetita iuvant), but over time someone added "but continued, it bores" (continuata secant).

The same is true for neural networks: increasing the number of epochs improves performance (the loss decreases); at some point, however, while the loss on the training set continues to fall, the loss on the validation set begins to rise. The network has gone into overfitting: it starts to latch onto patterns that are only present in the training set and loses the ability to generalize.

Large Language Model LLM overfitting
Overfitting/overtraining in supervised learning. Image source: here
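
In practice, this is usually caught by monitoring the validation loss and keeping the checkpoint where it was lowest. A minimal, PyTorch-style sketch of that early-stopping logic (the `train_one_epoch` and `evaluate` callables are hypothetical helpers returning a loss):

```python
# Minimal early-stopping sketch (PyTorch-style, with hypothetical helpers):
# keep training while validation loss improves, stop after `patience`
# epochs without improvement and roll back to the best checkpoint.

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=3):
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_loss = train_one_epoch(model)          # returns training loss
        val_loss = evaluate(model)                   # returns validation loss
        print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:               # validation loss stopped improving
                break
    if best_state is not None:
        model.load_state_dict(best_state)            # keep the best checkpoint
    return model
```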

Okay, this has been studied extensively for small neural networks, but what about huge transformers?

The authors of this study used the T5 model (an encoder-decoder model) on the C4 dataset. They trained several versions of the model, increasing the number of parameters until the larger model outperformed the smaller one (indicating that the larger model received a sufficient number of tokens, as predicted by the Chinchilla scaling law). They noted a linear relationship between the number of tokens required and the size of the model (confirming what DeepMind observed with Chinchilla).

Large Language Model LLM overfitting
Image source: here
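
As a reminder of what "a sufficient number of tokens" means in this context, the Chinchilla result is often summarized as roughly 20 training tokens per parameter. A quick illustration of that rule of thumb (the factor of 20 is the commonly quoted approximation, not a number taken from this study):

```python
# Rough Chinchilla rule of thumb: compute-optimal training needs
# about 20 tokens per model parameter (an approximation, not an exact law).
TOKENS_PER_PARAM = 20

for params in (1e9, 7e9, 65e9, 175e9):   # 1B, 7B (LLaMA), 65B (LLaMA), 175B (GPT-3)
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:>5.0f}B params -> ~{optimal_tokens / 1e12:.2f}T tokens")
```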

The C4 dataset is limited (it does not have infinite tokens), so when increasing the number of parameters the authors found themselves in a token-scarcity condition. They therefore decided to simulate what happens when an LLM sees repeated data: they sampled a certain number of tokens, so that the model encountered them again during training (a sketch of this setup follows the figure below). This showed:

  • Repeated tokens result in degraded performance.
  • Larger models are more prone to overfitting under token-crisis conditions (so, despite theoretically consuming more computational resources, they end up with degraded performance).
Large Language Model LLM overfitting
Image source: here
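
To picture the setup: the token budget stays fixed, but only a fraction of it is unique and the rest is filled by re-sampling that same fraction, so the model effectively sees several "epochs" of a smaller corpus. A hypothetical sketch of such a repeated-data stream (an illustration of the idea, not the authors' actual pipeline):

```python
import random

# Illustrative sketch: keep the total token budget fixed, but draw all of it
# from a smaller pool of unique tokens, so the model sees the same data
# several times (as if it were trained for multiple epochs).
def build_repeated_stream(corpus_tokens, unique_fraction, budget, seed=0):
    rng = random.Random(seed)
    pool_size = int(len(corpus_tokens) * unique_fraction)
    pool = corpus_tokens[:pool_size]        # the only tokens the model will ever see
    stream = []
    while len(stream) < budget:
        chunk = pool[:]                     # one "epoch" over the unique pool
        rng.shuffle(chunk)
        stream.extend(chunk)
    return stream[:budget]

corpus = list(range(1_000_000))             # stand-in for a tokenized corpus
stream = build_repeated_stream(corpus, unique_fraction=0.25, budget=1_000_000)
print(len(stream), len(set(stream)))        # same budget, 4x fewer unique tokens
```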

In addition, these models are used for downstream tasks. Often an LLM is trained unsupervised on a large amount of text and then fine-tuned on a smaller dataset for a downstream task. Or it may undergo a process called alignment (as in the case of ChatGPT).

When an LLM is trained on repeated data, even if it is then fine-tuned on another dataset, performance is degraded. So the downstream tasks are impacted as well.

Large Language Model LLM overfitting
Image source: here
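
For readers unfamiliar with that pipeline, here is a generic two-stage sketch in PyTorch (a toy model with random data, not the architecture or datasets used in the study): the backbone is first trained on an unsupervised objective, then reused with a new head for a supervised downstream task.

```python
import torch
import torch.nn as nn

# Generic two-stage pipeline (toy model and random data, not the study's setup):
# stage 1 pretrains a backbone on an unsupervised objective,
# stage 2 reuses that backbone with a new head for a downstream task.
vocab, seq_len, hidden = 1000, 16, 128
backbone = nn.Sequential(nn.Embedding(vocab, 64), nn.Flatten(), nn.Linear(64 * seq_len, hidden))
lm_head = nn.Linear(hidden, vocab)   # stage 1 head: predict tokens
cls_head = nn.Linear(hidden, 2)      # stage 2 head: downstream classification

tokens = torch.randint(0, vocab, (32, seq_len))      # stand-in for unlabeled text

# Stage 1: "pretraining" on unlabeled data (toy token-prediction objective).
opt = torch.optim.AdamW([*backbone.parameters(), *lm_head.parameters()], lr=1e-3)
loss = nn.functional.cross_entropy(lm_head(backbone(tokens)), tokens[:, -1])
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tuning on a small labeled dataset, typically with a lower learning rate.
labels = torch.randint(0, 2, (32,))                  # stand-in for downstream labels
opt = torch.optim.AdamW([*backbone.parameters(), *cls_head.parameters()], lr=1e-5)
loss = nn.functional.cross_entropy(cls_head(backbone(tokens)), labels)
opt.zero_grad(); loss.backward(); opt.step()
```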
Large Language Model LLM overfitting
Image by Brett Jordan on Unsplash

We just saw that repeated tokens harm training. But why does this occur?

The authors decided to investigate by keeping the number of repeated tokens fixed and increasing the total number of tokens in the dataset. The results show that a larger dataset alleviates multi-epoch degradation.

Large Language Model LLM overfitting
Image source: here

Last year Galactica was published (a model that was supposed to help scientists but lasted only three days). Apart from the spectacular debacle, its article suggested that part of the results came from the quality of the data. According to the authors, data quality reduced the risk of overfitting:

We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens. (source)

Large Language Model LLM overfitting
image source: here

For the Galactica authors, repeated tokens not only did not harm training but actually improved downstream performance.

In this new study, the authors use the Wikipedia dataset, which is considered a higher-quality dataset than C4, and add repeated tokens. The results show a similar level of degradation, which contradicts what is stated in Galactica's article.

Large Language Model LLM overfitting
image source: here

The authors also tried to investigate whether the degradation is due to model scaling. When a model is scaled up, both the number of parameters and the computational cost increase. The authors decided to study these two factors separately (a minimal sketch of both ideas follows below):

  • Mixture-of-Experts (MoE), which increases the number of parameters while keeping roughly the same computational cost.
  • ParamShare, on the other hand, which reduces the number of parameters while keeping the same computational cost.
Large Language Model LLM overfitting
image source: here
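
To make the distinction concrete, here is a minimal toy sketch of the two ideas (simplified layers, not the exact architectures used in the paper): the MoE layer routes each input to one of several experts, adding parameters at roughly constant compute per example, while the parameter-sharing layer reuses the same weights several times, keeping compute while removing parameters.

```python
import torch
import torch.nn as nn

# Toy illustrations of the two scaling knobs discussed above
# (simplified; not the exact architectures used in the paper).

class ToyMoE(nn.Module):
    """More parameters, ~same compute: each example uses only 1 of n experts."""
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                       # x: (batch, dim)
        choice = self.router(x).argmax(dim=-1)  # pick one expert per example
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class ToyParamShare(nn.Module):
    """Fewer parameters, same compute: the same layer is applied repeatedly."""
    def __init__(self, dim, n_repeats=4):
        super().__init__()
        self.layer = nn.Linear(dim, dim)
        self.n_repeats = n_repeats

    def forward(self, x):
        for _ in range(self.n_repeats):         # depth without new parameters
            x = torch.relu(self.layer(x))
        return x

x = torch.randn(8, 32)
print(ToyMoE(32)(x).shape, ToyParamShare(32)(x).shape)
```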

The results show that the model with fewer parameters is less affected by repeated tokens. In contrast, the MoE model (more parameters) is more prone to overfitting. The result is interesting because MoE has been used successfully in many AI models, so the authors suggest that although MoE is a useful technique when there is enough data, it can hurt performance when there are not enough tokens.

The authors also explored whether the training objective affects performance degradation. In general, there are two training objectives: causal language modeling (predicting the next token from the preceding context) and masked language modeling (reconstructing masked or corrupted spans).

Recently, with PaLM 2, Google introduced UL2, which is a mixture of these two training objectives. UL2 has been shown to speed up model training; interestingly, however, UL2 is more prone to overfitting and shows greater multi-epoch degradation.

Large Language Model LLM overfitting
image source: here
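
To see what the two objectives look like at the data level, here is a tiny sketch of how (input, target) pairs could be built for each (purely illustrative; real pipelines work on token ids and, for span corruption, use sentinel tokens as T5 does):

```python
import random

# Illustrative (input, target) construction for the two common objectives.
# Real tokenizers use integer ids; words are used here only for readability.
tokens = "the cat sat on the mat and looked at the dog".split()

# 1) Causal language modeling: predict the next token from the prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(causal_pairs[3])   # (['the', 'cat', 'sat', 'on'], 'the')

# 2) Span corruption (denoising): mask a span, learn to reconstruct it.
rng = random.Random(0)
start = rng.randrange(len(tokens) - 3)
span = tokens[start:start + 3]
corrupted = tokens[:start] + ["<extra_id_0>"] + tokens[start + 3:]
print(corrupted, "->", ["<extra_id_0>"] + span)
```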

The authors next explored how multi-epoch degradation might be alleviated. Since regularization techniques are used precisely to prevent overfitting, the authors tested whether these techniques had a beneficial effect here as well.

Dropout turns out to be one of the most effective techniques for alleviating the problem. This is not surprising: it is one of the most effective regularization techniques, it is easy to parallelize, and it is used by most models.

Large Language Model LLM overfitting
image source: here

Furthermore, the authors find it works best to start without dropout and only add dropout at a later point in training.

Large Language Model LLM overfitting
image source: here
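
A minimal sketch of that idea, keeping dropout at zero for the first part of training and switching it on afterwards (the switch-on point and the dropout rate are arbitrary choices for the example, not the paper's recipe):

```python
import torch.nn as nn

# Illustrative "late dropout" schedule: train without dropout first,
# then switch it on after a chosen fraction of the training steps.

def set_dropout(model: nn.Module, p: float) -> None:
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

model = nn.Sequential(nn.Linear(128, 128), nn.Dropout(p=0.0), nn.Linear(128, 10))

total_steps = 10_000
for step in range(total_steps):
    if step == total_steps // 4:      # switch dropout on a quarter of the way in
        set_dropout(model, 0.1)
    # ... forward / backward / optimizer.step() would go here ...
```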

On the other hand, the authors note that using dropout in some models, especially the larger ones, can lead to a slight reduction in performance. So although it helps against overfitting, it can lead to unexpected behavior in other contexts. So much so that GPT-3, PaLM, LLaMA, Chinchilla, and Gopher do not use it in their architectures.

Large Language Model LLM overfitting
image source: here

As described in the table below, the authors used what are by now considered fairly small models for their experiments. Even so, it is expensive to test different hyperparameters when designing an LLM:

For instance, in our specific scenario, training T5-XL five times would require approximately $37,000 USD for renting Google Cloud TPUs. Considering even larger models like PaLM and GPT-4, trained on even larger datasets, this cost becomes unmanageable (source)

Large Language Model LLM overfitting
image source: here

Since, in their experiments, a sparse MoE model approximates the behavior of a dense model (which is more computationally expensive), one can use it to search for the best hyperparameters.

For example, the authors show that one can test different learning rates for the MoE model, and it exhibits the same performance trends as the equivalent dense model. So, for the authors, one can test different hyperparameters with the MoE model and then train the dense model with the chosen parameters, thus saving cost:

sweeping the MoE Large model incurred an expenditure of approximately 10.6K USD on the Google Cloud Platform. Conversely, training the Dense XL model only once required 7.4K USD. Consequently, the entire development process, including sweeping, amounted to a total cost of 18K USD, which is only 0.48 times the expense of directly tuning the Dense XL model (source)

Large Language Model LLM overfitting
image source: here
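
Put as a recipe: sweep the cheap sparse proxy, pick the winning hyperparameters, then spend the large budget once on the dense model. A schematic sketch of that workflow (the `train_and_eval` helper and the candidate learning rates are hypothetical):

```python
# Schematic sweep-then-train workflow suggested by the paper:
# tune hyperparameters on the cheaper sparse MoE proxy, then train the
# dense model once with the winning configuration.
# `train_and_eval(model_type, lr)` is a hypothetical helper returning a validation loss.

def sweep_then_train(train_and_eval, candidate_lrs=(1e-4, 3e-4, 1e-3)):
    # 1) cheap sweep on the MoE proxy
    results = {lr: train_and_eval("moe_proxy", lr) for lr in candidate_lrs}
    best_lr = min(results, key=results.get)
    # 2) single expensive run of the dense model with the chosen learning rate
    final_loss = train_and_eval("dense_xl", best_lr)
    return best_lr, final_loss
```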
