2023 has seen a surge of public interest in Large Language Models (LLMs), and now that most people have an idea of what they are and can do, the public debates around open versus closed source have reached a wide audience as well. At Hugging Face, we follow open models with great interest, as they allow research to be reproducible, empower the community to take part in the development of AI models, permit easier scrutiny of model biases and limitations, and lower the overall carbon impact of our field by favoring checkpoint reuse (among many other benefits).
So let’s do a retrospective of the year in open LLMs!
To keep this document manageable in length, we won’t look at code models.
🍜 Recipe for a pretrained Large Language Model
First, how do you get a Large Language Model? (Feel free to skim this section if you already know!)
The model architecture (its code) describes its specific implementation and mathematical shape: it’s a list of all its parameters, as well as how they interact with inputs. At the moment, most highly performing LLMs are variations on the “decoder-only” Transformer architecture (more details in the original transformers paper).
The training dataset contains all the examples and documents on which the model is trained (i.e., on which the parameters are learned), and therefore the specific patterns it learns. Most of the time, these documents contain text, either in natural language (e.g., French, English, Chinese), a programming language (e.g., Python, C), or any kind of structured data expressible as text (e.g., tables in markdown or LaTeX, equations, …).
A tokenizer defines how the text from the training dataset is converted to numbers (as a model is a mathematical function and therefore needs numbers as inputs). Tokenization is done by transforming text into sub-units called tokens (which can be words, sub-words, or characters, depending on the tokenization method). The vocabulary size of the tokenizer indicates how many different tokens it knows, typically between 32k and 200k. The size of a dataset is often measured as the number of tokens it contains once split into a sequence of these individual, “atomistic” units, and datasets today range from several hundred billion tokens to several trillion tokens!
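As a concrete illustration, here is a minimal tokenization sketch using the Hugging Face `transformers` library (the GPT-2 checkpoint is just an example; any tokenizer on the Hub would work the same way, and the exact tokens printed depend on the tokenizer):

```python
from transformers import AutoTokenizer

# Load a tokenizer from the Hub (checkpoint chosen purely as an example)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Open models empower the community."
token_ids = tokenizer.encode(text)                    # text -> list of integer ids
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # ids -> human-readable sub-words

print(tokens)          # e.g. ['Open', 'Ġmodels', 'Ġempower', 'Ġthe', 'Ġcommunity', '.']
print(token_ids)       # the numbers the model actually sees
print(len(tokenizer))  # vocabulary size (~50k for GPT-2's tokenizer)
```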
Training hyperparameters then define how the model is trained. How much should the parameters change to fit each new example? How fast should the model be updated?
Once these hyperparameters have been chosen, you only need 1) a lot of computing power to train the model and 2) competent (and kind) people to run and monitor the training. The training itself consists in instantiating the architecture (creating the matrices on the hardware used for training) and running the training algorithm on the training dataset with the above-mentioned hyperparameters. The result is a set of model weights. These are the model parameters after learning, and what most people mean when discussing access to an open pretrained model. These weights can then be used for inference, i.e. for prediction on new inputs, for instance to generate text.
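To make this more concrete, here is a heavily simplified sketch of a pretraining loop in PyTorch: instantiate an architecture with random weights, then repeatedly train it to predict the next token over a (toy) corpus. Model size, data loading, and distributed training are all glossed over; the tiny config and toy corpus are illustrative assumptions, not anyone's real setup.

```python
import torch
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# 1) Instantiate the architecture with random weights (a tiny config, for illustration only)
config = GPT2Config(n_layer=2, n_head=2, n_embd=128)
model = GPT2LMHeadModel(config)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # training hyperparameters

# 2) Run the training algorithm on the (toy) training dataset
corpus = ["Open models empower the community.", "Tokenizers turn text into numbers."]
for epoch in range(3):
    for text in corpus:
        batch = tokenizer(text, return_tensors="pt")
        # decoder-only training: the model learns to predict each next token, so labels = inputs
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# 3) The result is a set of model weights, which can be saved and shared
model.save_pretrained("my-tiny-pretrained-model")
```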
Pretrained LLMs can also be specialized or adapted for a specific task after pretraining, particularly when the weights are openly released. They are then used as a starting point for use cases and applications through a process called fine-tuning. Fine-tuning involves applying additional training steps to the model on a different – often more specialized and smaller – dataset to optimize it for a specific application. Though this step has a cost in terms of compute power needed, it is usually much less costly than training a model from scratch, both financially and environmentally. This is one reason high-quality open-source pretrained models are very interesting, as they can be freely used and built upon by the community even when practitioners only have access to a limited computing budget.
🗝️ 2022, from a race for size to a race for data
What open models were available to the community before 2023?
Until early 2022, the trend in machine learning was that the bigger a model was (i.e. the more parameters it had), the better its performance. In particular, it seemed that models going above specific size thresholds jumped in capabilities, two concepts which were dubbed emergent abilities and scaling laws. Pretrained open-source model families published in 2022 mostly followed this paradigm.
- BLOOM (BigScience Large Open-science Open-access Multilingual Language Model)
BLOOM is a family of models released by BigScience, a collaborative effort including 1000 researchers across 60 countries and 250 institutions, coordinated by Hugging Face, in collaboration with the French organizations GENCI and IDRIS. These models use decoder-only transformers, with minor modifications (post-embedding normalization[^1] and the use of ALiBi positional embeddings[^2]). The biggest model of this family is a 176B parameters model, trained on 350B tokens of multilingual data in 46 human languages and 13 programming languages. Most of the training data was released, and details of its sources, curation, and processing were published. It is the biggest open-source massively multilingual model to date.
- OPT (Open Pre-trained Transformer)
The OPT model family was released by Meta. These models use a decoder-only transformer architecture, following the tricks of the GPT-3 paper (a specific weights initialization, pre-normalization), with some changes to the attention mechanism (alternating dense and locally banded attention layers). The biggest model of this family is a 175B parameters model trained on 180B tokens of data from mostly public sources (books, social data through Reddit, news, Wikipedia, and other various internet sources). This model family was of comparable performance to GPT-3 models, using coding optimizations to make it less compute-intensive.
- GLM-130B (General Language Model)
GLM-130B was released by Tsinghua University and Zhipu.AI. It uses a full transformer architecture with some changes (post-layer-normalisation with DeepNorm, rotary embeddings). The 130B parameters model was trained on 400B tokens of English and Chinese internet data (The Pile, Wudao Corpora, and other Chinese corpora). It was also of comparable performance to GPT-3 models.
- Smaller or more specialized open LLMs
Smaller open-source models were also released, mostly for research purposes: Meta released the Galactica series, LLMs of up to 120B parameters, pre-trained on 106B tokens of scientific literature, and EleutherAI released the GPT-NeoX-20B model, an entirely open-source (architecture, weights, data included) decoder transformer model trained on 500B tokens (using RoPE and some changes to attention and initialization), to provide a full artifact for scientific investigations.
These huge models were exciting, but also very expensive to run! When performing inference (computing predictions from a model), the model needs to be loaded in memory, but a 100B parameters model will typically require 220GB of memory to be loaded (we explain this process below), which is very large, and not accessible to most organizations and practitioners!
However, in March 2022, a new paper by DeepMind came out, investigating what the optimal ratio of tokens to model parameters is for a given compute budget. In other words, if you only have a given amount of money X to spend on model training, what should the respective model and data sizes be? The authors found that, overall, for the average compute budget being spent on LLMs, models should be smaller but trained on considerably more data. Their own model, Chinchilla (not open source), was a 70B parameters model (a third of the size of the above models) but trained on 1.4T tokens of data (between 3 and 4 times more data). It had similar or better performance than its bigger counterparts, both open and closed source.
This paradigm shift, while probably already known in closed labs, took the open science community by storm.
🌊 2023, a year of open releases
The rise of small Large Language Models
2023 saw a wave of decoder-style transformers arise, with new pretrained models released every month, and soon every week or even every day: LLaMA (by Meta) in February, StableLM (by StabilityAI) and Pythia (by Eleuther AI) in April, MPT (by MosaicML) in May, X-GEN (by Salesforce) and Falcon (by TIIUAE) in June, Llama 2 (by Meta) in July, StableLM v2 (by StabilityAI) in August, Qwen (by Alibaba) and Mistral (by Mistral.AI) in September, Yi (by 01-ai) in November, DeciLM (by Deci), Phi-2, and SOLAR (by Upstage) in December.
All these releases a) included model weights (under varyingly open licenses) and b) had good performance for models on the smaller side (between 3B and 70B parameters), and were therefore immediately adopted by the community. Almost all of these models use the decoder transformer architecture, with various tweaks (ALiBi or RoPE, RMS pre-normalization, SwiGLU), as well as some changes to the attention functions (Flash Attention, GQA, sliding windows) and different code base implementations to optimize for training or inference speed. These tweaks are likely to affect the performance and training speed to some extent; however, as all the architectures have been released publicly with the weights, the core differences that remain are the training data and the licensing of the models.
The first model family in this series was the LLaMA family, released by Meta AI. The explicit objective of the researchers was to train a set of models of various sizes with the best possible performance for a given computing budget. For one of the first times, the research team explicitly decided to consider not only the training budget but also the inference cost (for a given performance objective, how much does it cost to run inference with the model?). In this perspective, they decided to train smaller models on even more data and for more steps than was usually done, thereby reaching better performance at a smaller model size (the trade-off being training compute efficiency). The biggest model in the Llama 1 family is a 65B parameters model trained on 1.4T tokens, while the smaller models (resp. 6 and 13B parameters) were trained on 1T tokens. The small 13B LLaMA model outperformed GPT-3 on most benchmarks, and the biggest LLaMA model was state of the art when it came out. The weights were released with a non-commercial license though, limiting adoption by the community.
The Pythia models were released by the open-source non-profit lab Eleuther AI. They were a suite of LLMs of different sizes, trained on completely public data, provided to help researchers understand the different steps of LLM training.
The MPT models, which came out a couple of months later, released by MosaicML, were close in performance but with a license allowing commercial use, and with the details of their training mix. The first MPT model was a 7B model, followed by 30B versions in June, both trained on 1T tokens of English and code (using data from C4, CommonCrawl, The Stack, S2ORC).
The MPT models were quickly followed by the 7 and 30B models from the Falcon series, released by TIIUAE, and trained on 1 to 1.5T tokens of English and code (RefinedWeb, Project Gutenberg, Reddit, StackOverflow, Github, arXiv, Wikipedia, among other sources) – later in the year, a big 180B model was also released. The Falcon models, data, and training process were detailed in a technical report and a later research paper.
Inheriting from the GPT-Neo-X model, StabilityAI released the StableLM-Base-Alpha models, a small (3B and 7B) pre-trained series using 1.5T tokens of an experimental dataset built on ThePile, followed by a v2 series with a data mix including RefinedWeb, RedPajama, ThePile, and undisclosed internal datasets, and lastly by a very small 3B model, the StableLM-3B-4e1T, complete with a detailed technical report.
Where previous models were mostly public about their data, from then on, following releases gave close to no details about what was used to train the models, and their efforts cannot be reproduced – however, they provide starting points for the community through the released weights.
Early in the summer came the X-Gen models from Salesforce, 7B parameters models trained on 1.5T tokens of “natural language and code”, in several steps, following a data scheduling system (not all data is introduced to the model at the same time).
X-Gen was a bit overshadowed by the much more visible new LLaMA-2 family from Meta, a range of 7 to 70B models trained on 2T tokens “from publicly available sources”, with a permissive community license and an extensive process of fine-tuning from human preferences (RLHF), the so-called alignment procedure.
A couple of months later, the first model from the newly created startup Mistral, the so-called Mistral-7B, was released, trained on an undisclosed number of tokens from data “extracted from the open Web”. The end of 2023 was busy with model releases, with a second larger model from Mistral (Mixtral 8x7B), a first impressive model from Deci.AI called DeciLM, as well as a bigger merge of models from Upstage, SOLAR, also trained on an undisclosed amount and sources of data. All these models brought steady improvements on the leaderboards and open benchmarks.
In parallel, a notable event of the end of 2023 was the rise in performance of a number of models trained in China and openly released. Two bilingual English-Chinese model series were released: Qwen, from Alibaba, models of 7 to 70B parameters trained on 2.4T tokens, and Yi, from 01-AI, models of 6 to 34B parameters, trained on 3T tokens. The performance of these models was a step ahead of previous models both on open leaderboards like the Open LLM leaderboard and on some of the most difficult benchmarks like Skill-Mix. Another strong contender from late 2023 was the DeepSeek coding model from DeepSeek AI, trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese (mostly a code model).
Dialog models everywhere
Compared to 2022, almost all pretrained models released in 2023 came with both a pre-trained version and a dialog-finetuned version, using one of several existing approaches. While approaches for adapting models to chat settings were developed in 2022 and before, wide adoption of these techniques really took off in 2023, emphasizing the growing use of these chat models by the general public as well as the growing manual evaluation of the models by chatting with them (“vibe-check” evaluation). We detail the most well-known approaches to adapt pretrained models for chat here, but many variations exist!
Chat-based fine-tuning is a variant of supervised fine-tuning, where the annotated data is chat data (multiturn dialogue-like data, much like what you’d find on social media) that you fine-tune your model on. You use the same technique as when training your model: for decoder transformers, you teach your model to predict the next words one by one (this is called an auto-regressive approach).
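A minimal sketch of this idea: flatten a conversation into a single token sequence and train with the usual next-token objective. The role markers below are made up for illustration (real chat models each define their own template), and the GPT-2 checkpoint is only a placeholder for an open pretrained LLM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice this would be an open pretrained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A multi-turn conversation flattened into one string with made-up role markers
conversation = (
    "<|user|> What is a tokenizer?\n"
    "<|assistant|> It converts text into the numbers a model can process.\n"
    "<|user|> Why does vocabulary size matter?\n"
    "<|assistant|> It sets how many distinct tokens the model can represent.\n"
)

# Same auto-regressive objective as pretraining: predict the next token
batch = tokenizer(conversation, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```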
Instruction fine-tuning (IFT) follows the same approach but with instruction datasets, which contain a collection of query-like prompts plus answers (with optional additional input if needed). These datasets teach the models how to follow an instruction and can be human or LLM-generated.
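For a sense of what such data looks like, here is a made-up entry in the Alpaca-style three-field convention (one common format among several):

```python
# One (made-up) entry of an instruction dataset, Alpaca-style
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Open LLMs allow reproducible research, community participation, ...",
    "output": "Open LLMs make research reproducible and let the community build on shared models.",
}

# For fine-tuning, the fields are usually concatenated into a single prompt, and the
# model is trained to generate the target answer that follows it
prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Input:\n{example['input']}\n\n### Response:\n"
)
target = example["output"]
```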
Using large-scale synthetic datasets of model outputs (datasets composed of model generations, e.g., generations from GPT-4, either from instructions or from interactions between users and said model) is one of the ways to perform instruction and chat finetuning. This is often called distillation, as it involves taking the knowledge from a high-performing model to train or fine-tune a smaller model.
Both these methods are relatively easy to implement: you just need to find or generate related datasets and then fine-tune your model using the same technique as when training. A great number of instruct datasets were published last year, which improved model performance in dialogue-like setups. For more information on this topic, you can read an intro blog here. However, the models, though better, still cannot match what humans expect.
Reinforcement learning from human feedback (RLHF) is a specific approach that aims to align what the model predicts with what humans like best (depending on specific criteria). It was (at the beginning of the year) a new technique for fine-tuning. From a given prompt, the model generates several possible answers; humans rank these answers; the rankings are used to train what is called a preference model (which learns to give a score reflecting human preference for answers); the preference model is then used to fine-tune the language model using reinforcement learning. For more detailed information, see this blog post, the original RLHF paper, or the Anthropic paper on RLHF. It is a costly method (annotating/ranking + training a new model + fine-tuning is quite expensive) that has been mostly used to align models for safety objectives. A cheaper variation of this method has been developed that uses a high-quality LLM to rank model outputs instead of humans: reinforcement learning from AI feedback (RLAIF).
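The heart of the preference-model step can be sketched in a few lines: the preference (reward) model outputs a scalar score per answer and is trained so that the answer humans preferred scores higher than the rejected one. This is a simplified sketch that ignores how the scoring head is attached to the LLM; the numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: push the preferred answer's score above the rejected one's."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# In practice these scores come from a reward model (an LLM with a scalar head) run on each answer
score_chosen = torch.tensor([1.3, 0.2])
score_rejected = torch.tensor([0.4, 0.9])
print(preference_loss(score_chosen, score_rejected))  # lower when chosen answers score higher
```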
Direct preference optimization (DPO) is another variation of RLHF, but does not require the training and use of a separate preference model – the method requires the same human or AI ranking dataset but uses this data to update the model directly by looking at the difference between its original policy (way of predicting) and the optimal one (which would predict the best-ranked answers). In other words, the aligned model is also the preference model, which makes the optimization procedure a lot simpler while giving what appear to be equivalent final performances.
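A simplified sketch of the DPO objective, assuming you already have the summed log-probabilities of the chosen and rejected answers under both the model being trained and a frozen reference copy (β is a strength hyperparameter; the values below are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Reward answers the policy prefers more than the reference does, when humans preferred them."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# log-probabilities (summed over answer tokens) would come from the two models
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```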
So, to come back to our wave of small open-weights models from (mostly) private companies, a lot of them were released with fine-tuned counterparts: MPT-7B also came with an instruct and a chat version, instruct-tuned versions of Falcon and XGen models were released at the end of the year, Llama-2, Qwen and Yi were released with chat versions, and DeciLM with an instruct version. The release of Llama-2 was particularly notable due to its strong focus on safety, both in the pretrained and fine-tuned models.
What about the community?
While chat models and instruction fine-tuned models were usually provided directly with new model releases, the community and researchers didn’t take this for granted: a wide and healthy community of model fine-tuners bloomed over the fruitful grounds provided by these base models, with discussions spontaneously occurring on Reddit, Discord, the Hugging Face Hub, and Twitter. Community model releases were frequent, in parallel with the creation of new interesting datasets (also used to finetune models to ascertain their good performance and quality).
At the beginning of 2023, a few datasets for instruction/chat finetuning had already been released. For instance, for human preferences, the WebGPT dataset by OpenAI, the HH-RLHF dataset by Anthropic, and Summarize by OpenAI were pioneers in this direction. Examples of instruction datasets are the Public Pool of Prompts by BigScience, FLAN 1 and 2 by Google, Natural Instructions by AllenAI, Self Instruct, a framework to generate automatic instructions by researchers from different affiliations, SuperNatural Instructions, an expert-created instruction benchmark sometimes used as fine-tuning data, and Unnatural Instructions, an automatically generated instruction dataset by Tel Aviv University and Meta, among others.
❄️ Winter 2022/2023: In January this year, the Human ChatGPT Instruction corpus (HC3) was released by Chinese researchers from various institutions, and contained human versus model answers to various questions. March was full of releases: Stanford opened the Alpaca model, which was the first instruction-following LLaMA model (7B), and the associated dataset, 52K instructions generated with an LLM. LAION (a non-profit open-source lab) released the Open Instruction Generalist (OIG) dataset, 43M instructions both created with data augmentation and compiled from other pre-existing data sources. The same month, LMSYS org (at UC Berkeley) released Vicuna, also a LLaMA fine-tune (13B), this time on chat data: conversations between users and ChatGPT, shared publicly by the users themselves on ShareGPT. The Guanaco dataset, an extension of the Alpaca dataset (containing an added 500K entries in more languages), was also released, as well as the associated LLaMA-7B fine-tune.
🌱 Spring: In April, BAIR (Berkeley AI Research lab) released Koala, a chat-tuned LLaMA model, using several of the previous datasets (Alpaca, HH-RLHF, WebGPT, ShareGPT), and DataBricks released the Dolly dataset, a great human effort of 15K manually generated instructions, as well as the associated model, a Pythia fine-tune. In May, Tsinghua University released UltraChat, a dataset of 1.5M conversations containing instructions, and UltraLLaMA, a fine-tune on said dataset. Microsoft then released the GPT4-LLM dataset/framework to generate instructions with GPT4, and in June, Microsoft research shared a new method, Orca, to construct instruction datasets by using the reasoning traces of larger models (which explain their step-by-step reasoning) – it was soon reproduced by the community (notably Alignmentlab.ai), who created the Open Orca datasets, several million entries, then used to fine-tune a number of models (Llama, Mistral, …). In May and June, Camel-AI released a number of instruction or chat datasets on different topics (more than 20K examples in each domain: physics, biology, chemistry, …) obtained with GPT4. In June, too, the Airoboros framework to fine-tune models using model-generated data (following the self-instruct approach) was released, along with a number of instruct datasets.
🌻 Summer: In August, UltraLM (a high-performing chat fine-tune of LLaMA) was released by OpenBMB, a Chinese non-profit, and in September, they released the associated preference dataset UltraFeedback, a feedback dataset of inputs compared by GPT4 (with annotations). Throughout the summer, NousResearch, a collective, released several fine-tunes (notably the Hermes and Capybara collections) based on several private and public instruct datasets. In September, a student team from Tsinghua University released OpenChat, a LLaMA fine-tune using a new RL finetuning strategy, and Intel released an Orca-style DPO dataset.
🍂 Autumn: In October, Hugging Face released Zephyr, a Mistral fine-tune using DPO and AIF on UltraChat and UltraFeedback, and community members released OpenHermes 2, a Mistral-7B fine-tuned on 900K entries either from the web or generated with Axolotl. LMSYS released LMSYS-Chat-1M, real-life user conversations with 25 LLMs. In November, OpenBuddy released OpenBuddy-Zephyr, a Zephyr fine-tuned on multi-turn dialogue data, and Argilla released Notus, a DPO fine-tune of Zephyr. NVIDIA released HelpSteer, an alignment fine-tuning dataset providing prompts, associated model responses, and grades of said answers on several criteria, while Microsoft Research released the Orca-2 model, a Llama 2 fine-tuned on a new synthetic reasoning dataset, and Intel released Neural Chat, a Mistral fine-tune on Orca and with DPO. In December, Berkeley released Starling, an RLAIF fine-tune of OpenChat, and the associated dataset, Nectar, 200K entries of comparison data.
As we can see, this whole year’s development relies both on the creation of new datasets through the use of high-quality pretrained LLMs, as well as on all the open models released by the community, making the field go forward by leaps and bounds! And if you now see one of these names in a model name, you’ll be able to get an idea of where it’s coming from 🤗
Note: Some more specialized datasets (such as the MetaMath or MathInstruct math problem fine-tuning datasets, Evol-Instruct math and code instructions, and CodeAlpaca and CodeCapybara code instructions) were also released, but we won’t cover them in detail here, though they have also been used to improve model performance on specific tasks. You can also see the awesome instructions dataset for a compilation of other relevant datasets.
Democratizing access
Note: A number of tools also emerged to support inference and deployment for more beginner users, such as llama.cpp, ollama, text-generation-inference, and vllm, among others. They are out of scope for this document.
Merging: Extreme customization
In typical open-source fashion, one of the landmarks of the community is model/data merging. With each merge/commit, it can become harder to trace both the data used (as a number of released datasets are compilations of other datasets) and the models’ history, as highly performing models are fine-tuned versions of fine-tuned versions of similar models (see Mistral’s “child models tree” here). In this summary, we haven’t had the time yet to talk about this amazing technique, so let’s spend a couple of final words on it.
But what does it mean to merge a model?
Model merging is a way to fuse the weights of different models together into a single model, to (ideally) combine the respective strengths of each model in a unified single model. A few techniques to do so exist, which have been extended and often published mostly in community forums, a striking case of fully decentralized research happening all over the world between a community of practitioners, researchers, and hobbyists. One of the simplest published methods consists in averaging the parameters of a set of models sharing a common architecture (example 1, example 2), but more complex parameter combinations exist, such as determining which parameters are the most influential in each model for a given task (weighted averaging), or considering parameter interference between models before selecting which parameters to keep when merging (TIES merging). For an overview of the literature, you can check this cool paper collection!
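The simplest of these approaches, plain parameter averaging, fits in a few lines. This is a sketch assuming the two models share exactly the same architecture and parameter names; the model identifiers are placeholders.

```python
from transformers import AutoModelForCausalLM

# Names are placeholders; any two fine-tunes of the same base model would do
model_a = AutoModelForCausalLM.from_pretrained("my-org/finetune-a")
model_b = AutoModelForCausalLM.from_pretrained("my-org/finetune-b")

state_a = model_a.state_dict()
state_b = model_b.state_dict()

# Uniform average of every weight tensor (both models must have identical parameter names/shapes)
merged_state = {name: (state_a[name] + state_b[name]) / 2 for name in state_a}

model_a.load_state_dict(merged_state)        # reuse one model as the container for the merge
model_a.save_pretrained("my-org/merged-model")
```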
These techniques allow anyone to easily generate combinations of models and are made especially easy by the fact that most models nowadays are variations on the same architecture. That is the reason some models submitted to the open LLM leaderboard have names such as llama2-zephyr-orca-ultra. This particular example is likely a merge of llama2 and zephyr models, fine-tuned on orca and ultra datasets. Usually, more details can be found in the respective model card on the Hugging Face hub.
PEFT: Personalization at the tip of your fingers
Sometimes, you may want more controlled personalization, without enough memory to load a whole model into memory to fine-tune it. Did you know that you don’t need to use a whole model when fine-tuning?
You might want to use what is called parameter-efficient fine-tuning (PEFT).
This technique first freezes the parameters of your pretrained model of interest, then adds a number of new parameters on top of it, called the adapters. What you then fine-tune on your task are only the (lightweight) adapter weights, considerably smaller than the original model. You then just need to share your small adapter weights (and the base model)! You can find a list of interesting approaches for PEFT here.
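A short sketch using the `peft` library with LoRA adapters, one popular PEFT method. The base model, rank, and target modules below are placeholder choices for illustration, not recommended settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

# Freeze the base model and add small trainable adapter matrices to the attention projections
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)

model.print_trainable_parameters()  # typically well under 1% of the full model's parameters
# ... fine-tune as usual, then share only the adapter weights:
model.save_pretrained("my-tiny-adapter")
```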
Quantization: Models running everywhere
We have seen that well-performing models now come in all shapes and sizes… but even then, it doesn’t mean that they are accessible to all! A 30B parameters model can require more than 66GB of RAM just to load into memory (not even use), and not everyone in the community has the hardware necessary to do so.
That’s where quantization comes in! Quantization is a special technique which reduces a model’s size by changing the precision of its parameters.
What does it mean?
In a computer, numbers are stored with a given precision (such as float32, float16, int8, and so forth). A precision indicates both the number type (is it a floating point number or an integer?) as well as how much memory the number is stored on: float32 stores floating point numbers on 32 bits. For a more in-depth explanation, see this link. So, the higher the precision, the more physical memory a number takes, as it will be stored on more bits.
So, if you reduce the precision, you reduce the memory each model parameter takes up in storage, therefore reducing the model size! It also means that you reduce… the actual precision of the computations, which can degrade the model’s performance. However, it turns out that on bigger models, this performance degradation is actually very limited.
To go back to our example above, a 30B parameters model in float16 requires a bit less than 66GB of RAM, in 8bit it only requires half that, so 33GB of RAM, and in 4bit we reach even half of this, so around 16GB of RAM, making it considerably more accessible.
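The arithmetic behind these numbers is straightforward: memory ≈ number of parameters × bytes per parameter (plus some overhead, which is why the figures above are slightly higher than this back-of-the-envelope sketch):

```python
n_params = 30e9  # a 30B parameters model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = n_params * bits / 8 / 1e9  # bytes per parameter = bits / 8
    print(f"{name:>8}: ~{gigabytes:.0f} GB just to hold the weights")

# float16: ~60 GB, int8: ~30 GB, int4: ~15 GB (roughly matching the figures above)
```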
There are many ways to go from one precision to another, with many different “translation” schemes existing, each with its own benefits and drawbacks. Popular approaches include bitsandbytes, GPTQ, and AWQ. Some users, such as TheBloke, are even converting popular models to make them accessible to the community. All are quite recent and still developing, and we hope to see even more progress on this as time goes on.
What’s next?
The year is not over yet! And these last months, days, even hours have already come with their share of surprises: will a new architecture finally outperform the simple and efficient Transformer?
New releases include:
- A mixture of experts:
  - Mixtral, where the model is made of 8 sub-models (transformer decoders), and for each input, a router picks the 2 best sub-models and sums their outputs (see the sketch after this list).
- Several state space models (models that map input to output through a latent space and which can be expressed as either an RNN or a CNN depending on the task; this resource is great at explaining state space models if you want more information):
  - Mamba, a state space model with an added selection mechanism
  - Striped Hyena, a state space model with fast convolution kernels
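As a rough illustration of the mixture-of-experts idea mentioned above (a toy sketch, not Mixtral’s actual implementation – its experts are full feed-forward blocks and the routing is more involved), a top-2 router can look like this:

```python
import torch
import torch.nn as nn

class Top2MoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks 2 experts per token and sums their weighted outputs."""

    def __init__(self, hidden_size: int = 64, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden_size)
        scores = self.router(x)                           # (tokens, num_experts)
        top_scores, top_idx = scores.topk(2, dim=-1)      # keep the 2 best experts per token
        weights = top_scores.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):
            for i, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == i              # tokens routed to expert i in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = Top2MoELayer()
print(layer(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```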
It’s still a bit too early to say whether these new approaches will take over from the Transformer, but state space models are quite promising!
Takeaways
- This year has seen a rise of open releases from all kinds of actors (big companies, start-ups, research labs), which empowered the community to start experimenting and exploring at a rate never seen before.
- Model announcement openness has seen ebbs and flows, from early releases this year being very open (dataset mixes, weights, architectures) to late releases indicating nothing about their training data, and therefore being unreproducible.
- Open models emerged from many new places, including China, with several new actors positioning themselves as strong contenders in the LLM game.
- Personalization possibilities reached an all-time high, with new strategies for fine-tuning (RLHF, adapters, merging), which are only at their beginning.
- Smaller model sizes and upgrades in quantization made LLMs really accessible to many more people!
- New architectures have also appeared – will they finally replace the Transformer?
That’s it folks!
I hope you enjoyed this yr’s review, learned a thing or two, and feel as enthusiastic as me about how much of AI progress now relies on open source and community effort! 🤗
[^1]: Post-embedding normalization is a trick to make learning more stable.
[^2]: ALiBi positional embeddings introduce a penalty when tokens that are too far apart in a sequence are connected together by the model (whereas normal positional embeddings just store information about the order and respective positions of tokens in a sequence).
