Finally, a Replacement for BERT: Introducing ModernBERT



This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older generation encoders across the board, with an 8,192 token sequence length, better downstream performance, and much faster processing.

ModernBERT is available as a slot-in replacement for any BERT-like model, in both a base (149M params) and a large (395M params) model size.

How to use these models with transformers

ModernBERT will be included in v4.48.0 of transformers. Until then, it requires installing transformers from main:

pip install git+https://github.com/huggingface/transformers.git

Since ModernBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.

⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to achieve the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)


masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)

Using a pipeline:

import torch
from transformers import pipeline
from pprint import pprint

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

Note: ModernBERT doesn’t use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the token_type_ids parameter.



Introduction

BERT was released in 2018 (millennia ago in AI-years!) and yet it’s still widely used today: in fact, it’s currently the second most downloaded model on the Hugging Face Hub, with more than 68 million monthly downloads, second only to another encoder model fine-tuned for retrieval. That’s because its encoder-only architecture makes it ideal for the kinds of real-world problems that come up every day, like retrieval (such as for RAG), classification (such as content moderation), and entity extraction (such as for privacy and regulatory compliance).

Finally, 6 years later, we have a replacement! Today, we at Answer.AI and LightOn (and friends!) are releasing ModernBERT. ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy. This model takes dozens of advances from recent years of work on large language models (LLMs), and applies them to a BERT-style model, including updates to the architecture and the training process.

We expect to see ModernBERT become the new standard in the many applications where encoder-only models are now deployed, such as in RAG (Retrieval Augmented Generation) pipelines and recommendation systems.

In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data. These features open up new application areas that were previously inaccessible through open models, such as large-scale code search, new IDE features, and new types of retrieval pipelines based on full document retrieval rather than small chunks.

But in order to explain just what we did, let’s first take a step back and look at where we’ve come from.



Decoder-only models

The recent high-profile advances in LLMs have been in models like GPT, Llama, and Claude. These are decoder-only models, or generative models. Their ability to generate human-like content has enabled astonishing new GenAI application areas like generated art and interactive chat. These striking applications have attracted major investment, funded booming research, and led to rapid technical advances. What we’ve done, essentially, is port these advances back to an encoder-only model.

Why? Because many practical applications need a model that’s lean and mean! And it doesn’t need to be a generative model.

More bluntly, decoder-only models are too big, slow, private, and expensive for many jobs. Consider that the original GPT-1 was a 117 million parameter model. The Llama 3.1 model, by contrast, has 405 billion parameters, and its technical report describes a data synthesis and curation recipe that is too complex and expensive for most companies to reproduce. So to use such a model, like ChatGPT, you pay in cents and wait in seconds to get an API reply back from heavyweight servers outside of your control.

Of course, the open-ended capabilities of these giant generative models mean that you can, in a pinch, press them into service for non-generative or discriminative tasks, such as classification. That’s because you can describe a classification task in plain English and … just ask the model to classify. But while this workflow is great for prototyping, you don’t want to pay prototype prices once you’re in mass production.

The popular buzz around GenAI has obscured the role of encoder-only models. These are the workhorses of practical language processing, the models that are actually being used for such workloads right now in many scientific and industrial applications.



Encoder-only models

The output of an encoder-only model is a list of numerical values (an embedding vector). You could say that instead of answering with text, an encoder model literally encodes its “answer” into this compressed, numerical form. That vector is a compressed representation of the model’s input, which is why encoder-only models are sometimes referred to as representational models.

While decoder-only models (like a GPT) can do the work of an encoder-only model (like a BERT), they’re hamstrung by a key constraint: since they’re generative models, they’re mathematically “not allowed” to “peek” at later tokens. They can only ever look backwards. This is in contrast to encoder-only models, which are trained so each token can look forwards and backwards (bi-directionally). They’re built for this, and it makes them very efficient at what they do.

Basically, a frontier model like OpenAI’s O1 is like a Ferrari SF-23. It’s an obvious triumph of engineering, designed to win races, and that’s why we talk about it. But it takes a special pit crew just to change the tires and you can’t buy one for yourself. In contrast, a BERT model is like a Honda Civic. It’s also an engineering triumph, but more subtly, since it is engineered to be affordable, fuel-efficient, reliable, and extremely useful. And that’s why they’re absolutely everywhere.

You can see this by looking at it in a number of ways.

Supporting generative models: One way to understand the prevalence of representational models (encoder-only) is to note how frequently they are used in concert with a decoder-only model to make a system which is safe and efficient.

The obvious example is RAG. Instead of relying on the LLM’s knowledge trained into the model’s parameters, the system uses a document store to furnish the LLM with information relevant to the query. But of course this only defers the problem: if the LLM doesn’t know which documents are relevant to the query, then the system will need some other process to select those documents. It will need a model which is fast and cheap enough that it can be used to encode the large quantities of information needed to make the LLM useful. That model is often a BERT-like encoder-only model.
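To make this concrete, here’s a minimal sketch of encoder-based retrieval: embed the documents and the query with an encoder, then rank by similarity. The pooling choice and example texts are illustrative, and the base MLM checkpoint is not fine-tuned for retrieval, so treat this as the shape of the workflow rather than a recipe:

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens (one common pooling choice).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = ["The Eiffel Tower is in Paris.", "BERT is an encoder-only transformer."]
query_emb = embed(["Which city is the Eiffel Tower in?"])
doc_embs = embed(docs)
scores = torch.nn.functional.cosine_similarity(query_emb, doc_embs)
print(docs[int(scores.argmax())])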

Another example is supervision architectures, where a cheap classifier might be used to ensure that generated text doesn’t violate content safety requirements.

In short, whenever you see a decoder-only model in deployment, there’s a reasonable chance an encoder-only model is also part of the system. But the converse is not true.

Encoder-based systems: Before there was GPT, there were content recommendations in social media and on platforms like Netflix. There was ad targeting in those venues, in search, and elsewhere. There was content classification for spam detection, abuse detection, etc. These systems weren’t built on generative models, but on representational models like encoder-only models. And all these systems are still out there and still running at enormous scale. Imagine how many ads are targeted per second around the world!

Downloads: On HuggingFace, RoBERTa, one of the leading BERT-based models, has more downloads than the 10 most popular LLMs on HuggingFace combined. In fact, currently, encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models with their 397 million monthly downloads. In fact, the `fill-mask` model category, composed of encoder “base models” such as ModernBERT, ready to be fine-tuned for other downstream applications, is the most downloaded model category overall.

Inference costs: What the above suggests is that, on an inference-per-inference basis, there are many times more inferences performed per year on encoder-only models than on decoder-only or generative models. An interesting example is FineWeb-Edu, where model-based quality filtering had to be performed over 15 trillion tokens. The FineWeb-Edu team chose to generate annotations with a decoder-only model, Llama-3-70b-Instruct, and perform the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, which, at Hugging Face Inference Endpoints’ pricing of $10/hour, comes to a total of $60,000. On the other hand, feeding 15 trillion tokens to popular decoder-only models, even with the lowest-cost option of using Google’s Gemini Flash and its low inference cost of $0.075/million tokens, would cost over a million dollars!
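As a quick back-of-the-envelope check of those numbers:

# Back-of-the-envelope check of the filtering cost comparison above.
encoder_cost = 6_000 * 10                       # 6,000 H100 hours at $10/hour
tokens = 15e12                                  # 15 trillion tokens to filter
gemini_flash_cost = tokens / 1e6 * 0.075        # $0.075 per million input tokens
print(f"Encoder-based filtering: ${encoder_cost:,.0f}")        # $60,000
print(f"Gemini Flash input cost: ${gemini_flash_cost:,.0f}")   # $1,125,000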



Performance



Overview

Here’s a snapshot of the accuracy of ModernBERT and other models across a range of tasks, as measured by standard academic benchmarks – as you can see, ModernBERT is the only model which is a top scorer across every category, which makes it the one model you can use for all your encoder-based tasks:

If you’ve ever done an NLP competition on Kaggle, then you’ll know that DeBERTaV3 has been the choice of champions for years. But no longer: not only is ModernBERT the first base-size model to beat DeBERTaV3 on GLUE, it also uses less than 1/5th of DeBERTa’s memory.

And of course, ModernBERT is fast. It’s twice as fast as DeBERTa – in fact, up to 4x faster in the more common situation where inputs are mixed length. Its long context inference is nearly 3 times faster than other high-quality models such as NomicBERT and GTE-en-MLM.

ModernBERT’s context length of 8,192 tokens is over 16x larger than most existing encoders. This is critical, for instance, in RAG pipelines, where a small context often makes chunks too small for semantic understanding. ModernBERT is also the state-of-the-art long context retriever with ColBERT, and is 9 percentage points above the other long context models. Even more impressive: this very quickly trained model, simply tuned for comparison with other backbones, outperforms even widely-used retrieval models on long-context tasks!

For code retrieval, ModernBERT is unique. There’s nothing to really compare it to, since there’s never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT’s specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.

This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.

Compared to the mainstream models, ModernBERT performs better across nearly all three broad task categories of retrieval, natural language understanding, and code retrieval. While it slightly lags DeBERTaV3 in one area (natural language understanding), it is many times faster. Please note that ModernBERT, like any other base model, can only do masked word prediction out-of-the-box. To be able to perform other tasks, the base model should be fine-tuned as done in these boilerplates.
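For illustration, here’s a minimal fine-tuning sketch using the standard Trainer API; the dataset and hyperparameters are placeholders rather than the recipe from the boilerplates:

# Minimal classification fine-tuning sketch; any labelled text-classification dataset works.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("imdb")  # placeholder dataset for the sketch
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-classifier",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()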

Compared to the specialized models, ModernBERT is comparable or superior in most tasks. In addition, ModernBERT is faster than most models across most tasks, and can handle inputs up to 8,192 tokens, 16x longer than the mainstream models.



Efficiency

Here are the memory (max batch size, BS) and inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 for ModernBERT and other encoder models:

The first thing you might notice is that we’re analysing the efficiency on an affordable consumer GPU, rather than the latest unobtainable hyped hardware. First and foremost, ModernBERT is focused on practicality, not hype.

As part of this focus, it also means we’ve made sure ModernBERT works well for real-world applications, rather than just benchmarks. Models of this kind are normally tested on just the one exact size they’re best at – their maximum context length. That’s what the “fixed” column in the table shows. But input sizes vary in the real world, so that’s the performance we worked hard to optimise – the “variable” column. As you can see, for variable length inputs, ModernBERT is much faster than all other models.

For long context inputs, which we believe will be the basis for the most valuable and important future applications, ModernBERT is 2-3x faster than the next fastest model. And, on the “practicality” dimension again: ModernBERT doesn’t require the additional heavy “xformers” dependency, but instead only requires the now commonplace Flash Attention as a dependency.

Moreover, thanks to ModernBERT’s efficiency, it can use a larger batch size than nearly any other model, and can be used effectively on smaller and cheaper GPUs. The efficiency of the base size, in particular, may enable new applications that run directly in browsers, on phones, and so forth.



Why is ModernBERT, well, Modern?

Now, we’ve made our case for why we should give some more love to encoder models. As trusted, under-appreciated workhorses, they’ve had surprisingly few updates since 2018’s BERT!

Even more surprising: since RoBERTa, there has been no encoder providing overall improvements without tradeoffs (fancily known as “Pareto improvements”): DeBERTaV3 had better GLUE and classification performance, but sacrificed both efficiency and retrieval. Other models, such as ALBERT, or newer ones, like GTE-en-MLM, all improved over the original BERT and RoBERTa in some ways but regressed in others.

However, since the duo’s original release, we’ve learned an enormous amount about how to build better language models. If you’ve used LLMs at all, you’re very well aware of it: while they’re rare in the encoder world, Pareto improvements are constant in decoder-land, where models consistently become better at everything. And as we’ve all learned by now: model improvements are only partially magic, and mostly engineering.

The goal of the (hopefully aptly named) ModernBERT project was thus fairly simple: bring this modern engineering to encoder models. We did so in three core ways:

  1. a modernized transformer architecture
  2. particular attention to efficiency
  3. modern data scales & sources



Meet the New Transformer, Same as the Old Transformer

The Transformer architecture has become dominant, and is used by the vast majority of models nowadays. However, it’s important to remember that there isn’t one Transformer but many Transformers. The main thing they share in common is their deep belief that attention is indeed all you need, and as such, they build various improvements centered around the attention mechanism.

ModernBERT takes huge inspiration from Transformer++ (as coined by Mamba), first used by the Llama2 family of models. In particular, we replace older BERT-like building blocks with their improved equivalents. Namely, we:

  • Replace the old positional encoding with “rotary positional embeddings” (RoPE): this makes the model significantly better at understanding where words are in relation to one another, and allows us to scale to longer sequence lengths.
  • Switch out the old MLP layers for GeGLU layers, improving on the original BERT’s GeLU activation function (see the sketch after this list).
  • Streamline the architecture by removing unnecessary bias terms, letting us spend our parameter budget more effectively.
  • Add an extra normalization layer after embeddings, which helps stabilize training.
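For readers who like code, here’s a rough sketch of a GeGLU feed-forward block; the dimensions are illustrative, not ModernBERT’s actual configuration:

# Rough sketch of a GeGLU feed-forward block (replacing the classic Linear -> GeLU -> Linear MLP).
import torch
import torch.nn as nn

class GeGLU(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=2048):
        super().__init__()
        # A single input projection produces both the "gate" and the "value" halves.
        self.Wi = nn.Linear(hidden_size, intermediate_size * 2, bias=False)
        self.Wo = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        gate, value = self.Wi(x).chunk(2, dim=-1)
        return self.Wo(self.act(gate) * value)

x = torch.randn(1, 16, 768)
print(GeGLU()(x).shape)  # torch.Size([1, 16, 768])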



Upgrading a Honda Civic for the Race Track

We’ve covered this already: encoders are no Ferraris, and ModernBERT is no exception. However, that doesn’t mean it can’t be fast. When you get on the highway, you generally don’t go and trade in your car for a race car, but rather hope that your everyday reliable ride can comfortably hit the speed limit.

In fact, for all the application cases we mentioned above, speed is essential. Encoders are very popular in uses where they either have to process tons of data, where even tiny speed increments add up very quickly, or where latency is very important, as is the case in RAG. In a lot of situations, encoders are even run on CPU, where efficiency is even more important if we want results in a reasonable amount of time.

As with most things in research, we build while standing on the shoulders of giants, and heavily leverage Flash Attention 2’s speed improvements. Our efficiency improvements rely on three key components: Alternating Attention, to improve processing efficiency; Unpadding and Sequence Packing, to reduce computational waste; and Hardware-Aware Model Design, to maximise hardware utilization.



Global and Local Attention

One of ModernBERT’s most impactful features is Alternating Attention, rather than full global attention. In technical terms, this means that our attention mechanism only attends to the full input every 3 layers (global attention), while all other layers use a sliding window where every token only attends to the 128 tokens nearest to itself (local attention). As attention’s computational complexity balloons up with every additional token, this means ModernBERT can process long input sequences considerably faster than any other model.

In practice, it looks like this:

Conceptually, the reason this works is pretty simple: picture yourself reading a book. For every sentence you read, do you need to be fully aware of the entire plot to understand most of it (full global attention)? Or is awareness of the current chapter enough (local attention), as long as you occasionally think back on its significance to the main plot (global attention)? In the vast majority of cases, it’s the latter.
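To make the pattern concrete, here’s a small illustrative sketch; the sizes are tiny for readability (the real local window covers 128 tokens), and the real model builds these masks inside its attention implementation:

# Illustrative sketch of the alternating attention pattern described above:
# a global-attention layer every third layer, sliding-window (local) attention elsewhere.
import torch

num_layers, seq_len, window = 9, 12, 4

layer_types = ["global" if i % 3 == 0 else "local" for i in range(num_layers)]
print(layer_types)  # ['global', 'local', 'local', 'global', 'local', 'local', ...]

positions = torch.arange(seq_len)
# Global layers: every token can attend to every other token.
global_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
# Local layers: each token only attends to tokens within the sliding window around it.
local_mask = (positions[:, None] - positions[None, :]).abs() <= window // 2
print(local_mask.int())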



Unpadding and Sequence Packing

Another core mechanism contributing to ModernBERT’s efficiency is its use of unpadding and sequence packing.

In order to be able to process multiple sequences within the same batch, encoder models require them to be the same length, so they can perform parallel computation. Traditionally, we’ve relied on padding to achieve this: figure out which sentence is the longest, and add meaningless tokens (padding tokens) to fill up every other sequence.

While padding solves the problem, it doesn’t do so elegantly: a lot of compute ends up being spent and wasted on padding tokens, which don’t contribute any semantic information.

Padding vs sequence packing
Comparing padding with sequence packing. Sequence packing (‘unpadding’) avoids wasting compute on padding tokens and has more consistent non-padding token counts per batch. Samples are still processed individually through careful masking.

Unpadding solves this issue: rather than keeping these padding tokens, we remove them all, and concatenate them into mini-batches with a batch size of one, avoiding all unnecessary computations. If you’re using Flash Attention, our implementation of unpadding is even faster than previous methods, which heavily relied on unpadding and repadding sequences as they went through the model: we go one step further by introducing our own implementation of unpadding, relying heavily on recent developments in Flash Attention’s RoPE support. This allows ModernBERT to only have to unpad once, and optionally repad sequences after processing, resulting in a 10-20% speedup over previous methods.

To speed up pre-training even further, unpadding is in good company within our model, as we use it in conjunction with sequence packing. Sequence packing here is a logical next step: as we’re concatenating inputs into a single sequence, and GPUs are very good at parallelisation, we want to maximise the computational efficiency we can squeeze out of a single forward model pass. To do so, we use a greedy algorithm to group individual sequences into concatenated ones that are as close to the model’s maximum input length as possible.
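Here’s an illustrative greedy packing heuristic (first-fit decreasing); it captures the idea, not our exact packing code:

# Minimal sketch of greedy sequence packing: group variable-length sequences into
# concatenated "packs" that stay under the model's maximum input length.
def pack_sequences(lengths, max_len=8192):
    packs = []         # each pack is a list of sequence indices
    pack_lengths = []  # current token count of each pack
    for idx, length in sorted(enumerate(lengths), key=lambda x: -x[1]):
        # Place the sequence into the first pack with enough room left.
        for i, used in enumerate(pack_lengths):
            if used + length <= max_len:
                packs[i].append(idx)
                pack_lengths[i] += length
                break
        else:
            packs.append([idx])
            pack_lengths.append(length)
    return packs

print(pack_sequences([5000, 3000, 4000, 2000, 8000], max_len=8192))
# [[4], [0, 1], [2, 3]] — three packed sequences instead of five padded ones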



Paying Attention to Hardware

Finally, the third facet of ModernBERT’s efficiency is hardware-aware model design.

We attempted to balance two insights that have been highlighted by previous research:

  1. Deep & Narrow vs Wide & Shallow: Research shows that deeper models with narrower layers often perform better than shallow models with fewer, wider layers. However, this is a double-edged sword: the deeper the model, the less parallelizable it becomes, and thus, the slower it runs at identical parameter counts.
  2. Hardware Efficiency: Model dimensions need to align well with GPU hardware for best performance, and different target GPUs result in different constraints.

Sadly, there is no magic recipe to make a model run similarly well on a wide range of GPUs, but there is an excellent cookbook: The Case for Co-Designing Model Architectures with Hardware, in which the ways to optimize a model architecture for a given GPU are carefully laid out. We came up with a heuristic to extend their method to a basket of GPUs, while respecting a given set of constraints. Logically, the first step is to define said constraints, in our case:

  • Defining our target GPUs as common inference GPUs (RTX 3090/4090, A10, T4, L4)
  • Roughly defining our target model sizes as 130-to-150 million parameters for ModernBERT-Base, and 350-to-420 million for ModernBERT-Large
  • Requiring the final embedding sizes to match the original BERT’s dimensions, 768 for base and 1024 for large, to maximise backwards compatibility
  • Setting performance constraints which are common across the basket of GPUs

Afterwards, we experimented with multiple model designs via a constrained grid search, varying both layer counts and layer widths. Once we’d identified the shapes that appeared to be the most efficient ones, we confirmed that our heuristics matched real-world GPU performance, and settled on the final model designs.
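As a flavour of what such a constrained search looks like, here’s a simplified sketch that enumerates (depth, width) candidates and keeps those within the parameter budget; the parameter-count formula and candidate values are rough approximations, not the actual search:

# Illustrative constrained grid search over model shapes for the base-size budget.
VOCAB, TARGET = 50_368, (130e6, 150e6)   # vocabulary size and base-model parameter budget

def approx_params(layers, hidden, intermediate):
    attn = 4 * hidden * hidden                      # Q, K, V, O projections (no biases)
    mlp = 3 * hidden * intermediate                 # GeGLU: two input projections + one output
    return layers * (attn + mlp) + VOCAB * hidden   # transformer blocks + embedding matrix

candidates = [
    (layers, 768, intermediate)                     # hidden size fixed at 768 for BERT compatibility
    for layers in range(16, 30, 2)
    for intermediate in (1152, 2048, 2304, 3072)
]
viable = [c for c in candidates if TARGET[0] <= approx_params(*c) <= TARGET[1]]
for layers, hidden, intermediate in viable:
    print(layers, hidden, intermediate, f"{approx_params(layers, hidden, intermediate) / 1e6:.0f}M")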



Training



def data(): return ['text', 'bad_text', 'math', 'code']

https://media1.tenor.com/m/xJSM2Ky3WpgAAAAd/steve-ballmer-microsoft.gif
Picture this exact scene, but replace Developers with Data

Another big aspect in which encoders have been trailing behind is training data. This is often understood to mean solely training data scale, but this isn’t actually the case: previous encoders, such as DeBERTaV3, were trained for long enough that they may even have breached the trillion tokens scale!

The issue, rather, has been training data diversity: many of the older models train on limited corpora, generally consisting of Wikipedia and Wikibooks. These data mixtures are very noticeably single text modality: they contain nothing but high-quality natural text.

In contrast, ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion tokens, of which most are unique, rather than the standard 20-to-40 repetitions common in previous encoders.

The impact of this is immediately noticeable: out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We’re particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.



Process

We stick to the original BERT’s training recipe, with some slight upgrades inspired by subsequent work: we remove the Next-Sentence Prediction objective, since it has since been shown to add overhead for no clear gains, and we increase the masking rate from 15% to 30%.
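In 🤗 Transformers terms, the higher masking rate is a one-line change to the standard MLM data collator (shown here as a sketch, not our training code):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,  # 30% of tokens masked, up from BERT's 15%
)

batch = collator([tokenizer("Encoders are the workhorses of practical NLP.")])
print(batch["input_ids"])  # tokens randomly replaced by tokenizer.mask_token_id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere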

Both models are trained with a three-phase process. First, we train on 1.7T tokens at a sequence length of 1024. We then adopt a long-context adaptation phase, training on 250B tokens at a sequence length of 8192, while keeping the total tokens seen per batch more or less consistent by lowering the batch size. Finally, we perform annealing on 50 billion tokens sampled differently, following the long-context extension ideal mix highlighted by ProLong.

Training in three phases is our way of ensuring our model is good across the board, which is reflected in its results: it’s competitive on long-context tasks, at no cost to its ability to process short context…

… But it has another benefit: for the first two phases, we train using a constant learning rate once the warmup phase is complete, and only perform learning rate decay on the final 50 billion tokens, following the Trapezoidal (or Warmup-Stable-Decay) learning rate schedule. And what’s more: we will release every immediate intermediate checkpoint from these stable phases, inspired by Pythia. Our main reason for doing so was supporting future research and applications: anyone is free to restart training from any of our pre-decay checkpoints, and perform annealing on domain-appropriate data for their intended use!
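As a sketch, a trapezoidal schedule looks like this; the step counts, peak learning rate, and linear decay shape are placeholders, not our actual configuration:

# Illustrative trapezoidal (Warmup-Stable-Decay) learning rate schedule:
# linear warmup, a long constant plateau, then a decay on the final tokens
# (linear decay here for simplicity).
def trapezoidal_lr(step, peak_lr=8e-4, warmup=2_000, stable=100_000, decay=10_000):
    if step < warmup:                               # linear warmup
        return peak_lr * step / warmup
    if step < warmup + stable:                      # long stable phase at peak LR
        return peak_lr
    progress = (step - warmup - stable) / decay     # final decay to zero
    return max(peak_lr * (1 - progress), 0.0)

for step in (0, 1_000, 50_000, 105_000, 112_000):
    print(step, f"{trapezoidal_lr(step):.2e}")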



The tricks, it’s all about the tricks!

If you’ve made it this far into this announcement, you’re probably used to this: of course, we use tricks to make things quicker here too. To be precise, we have two main tricks.

Let’s start with the first one, which is pretty common: since the initial training steps are updating random weights, we adopt batch-size warmup: we start with a smaller batch size so the same number of tokens update the model weights more often, then gradually increase the batch size to the final training size. This significantly speeds up the initial phase of model training, where the model learns its most basic understanding of language.

The second trick is far more unusual: weight initialization via tiling for the larger model size, inspired by Microsoft’s Phi family of models. This one’s based on the following realization: why initialize ModernBERT-large’s initial weights with random numbers when we have a perfectly good (if we dare say so ourselves) set of ModernBERT-base weights just sitting there?

And indeed, it turns out that tiling ModernBERT-base’s weights across ModernBERT-large works better than initializing from random weights. It also has the added benefit of stacking nicely with batch size warmup for even faster initial training.
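Here’s an illustrative sketch of what tiling weights can look like; the wrap-around scheme is one simple interpretation, not the exact recipe we used:

# Illustrative weight initialization via tiling: fill a larger weight matrix
# by repeating (wrapping) the smaller model's weights along each dimension.
import torch

def tile_weights(small, large_shape):
    rows = [small[i % small.shape[0]] for i in range(large_shape[0])]
    tiled_rows = torch.stack(rows)
    cols = [tiled_rows[:, j % small.shape[1]] for j in range(large_shape[1])]
    return torch.stack(cols, dim=1)

base_weight = torch.randn(768, 768)     # e.g. a ModernBERT-base projection matrix
large_weight = tile_weights(base_weight, (1024, 1024))
print(large_weight.shape)               # torch.Size([1024, 1024])
assert torch.equal(large_weight[:768, :768], base_weight)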



Conclusion

In this blog post we introduced the ModernBERT models, a new state-of-the-art family of small and efficient encoder-only models, finally giving BERT a much needed do-over.

ModernBERT demonstrates that encoder-only models can be improved by modern methods. They continue to offer very strong performance on some tasks, providing an extremely attractive size/performance ratio.

More than anything, we’re really looking forward to seeing what creative ways to use these models the community will come up with! To encourage this, we’re opening a call for demos until January 10th, 2025: the 5 best ones will get added to this post in a showcase section and win a $100 (or local currency equivalent) Amazon gift card, as well as a 6-month HuggingFace Pro subscription! If you need a hint to get started, here’s a demo we thought of: code similarity HF space! And remember, this is an encoder model, so all the cool downstream applications will likely require some sort of fine-tuning (on real or perhaps decoder-model synthetic data?). Thankfully, there are lots of cool frameworks out there to support fine-tuning encoders: 🤗 Transformers itself for various tasks, including classification, GliNER for zero-shot Named Entity Recognition, or Sentence-Transformers for retrieval and similarity tasks!



Links

LightOn sponsored the compute for this project on Orange Business Cloud Avenue.


