— that he saw further only by standing on the shoulders of giants — captures a timeless truth about science. Every breakthrough rests on countless layers of prior progress, until someday … all of it just works. Nowhere is this more evident than in the recent and ongoing revolution in natural language processing (NLP), driven by the Transformer architecture that underpins most generative AI systems today.
In this article, I take on the role of an academic Sherlock Holmes, tracing the evolution of today's language models back to an earlier idea from Information Retrieval.
Like all scientific revolutions, language modelling didn't emerge overnight but builds on a rich heritage. In this article, I concentrate on a small slice of the vast literature in the field. Specifically, our journey begins with a pivotal earlier technology — the Relevance-Based Language Models of Lavrenko and Croft — which marked a step change in the performance of Information Retrieval systems in the early 2000s and continues to leave its mark in TREC competitions. From there, the trail leads to 2017, when Google published the seminal Attention Is All You Need paper, unveiling the Transformer architecture that revolutionised sequence-to-sequence translation tasks.
The key link between the two approaches is, at its core, quite simple: both distribute weight over a collection of evidence according to its similarity to a query. Just as Lavrenko and Croft's Relevance Modelling estimates which terms are most likely to co-occur with a query, the Transformer's attention mechanism computes the similarity between a query and all tokens in a sequence, weighting each token's contribution to the query's contextual meaning.
Both models are generative frameworks over text, differing mainly in their scope: RM1 models short queries from documents, while Transformers model full sequences.
In the next sections, we’ll explore the background of Relevance Models and the Transformer architecture, highlighting their shared foundations and clarifying the parallels between them.
Relevance Modelling — Introducing Lavrenko’s RM1 Mixture Model
Let's dive into the conceptual parallel between Lavrenko & Croft's Relevance Modelling framework in Information Retrieval and the Transformer's attention mechanism. The two emerged in different domains and eras, but they share the same intellectual DNA. We will walk through the background on Relevance Models before outlining the key link to the later Transformer architecture.
When Victor Lavrenko and W. Bruce Croft introduced the Relevance Model in the early 2000s, they offered an elegant probabilistic formulation for bridging the gap between queries and documents. At their core, these models start from a simple idea: assume there exists a hidden "relevance model" R — a probability distribution over vocabulary terms — that characterises documents a user would consider relevant to their query. The task then becomes estimating this distribution from the observed data, namely the user query and the document collection.
The first Relevance Modelling variant — RM1 (there were two other models in the same family, not covered in detail here) — does this directly by marginalising over the documents in the collection, essentially modelling relevance as a latent language model that sits behind both queries and documents:
P(w \mid R, q) = \sum_{d \in D} P(w \mid d) \, P(d \mid q)
with the posterior probability of a document d given a query q given by:
P(d \mid q) = \frac{P(q \mid d)\, P(d)}{\sum_{d' \in D} P(q \mid d')\, P(d')}, \qquad P(q \mid d) = \prod_{w \in q} P(w \mid d)
This is the classic RM1 formulation proposed in the original paper by Lavrenko and Croft. To estimate this relevance model, RM1 uses the top-retrieved documents as pseudo-relevance feedback, assuming the highest-scoring documents are likely to be relevant. This means no costly relevance judgements are required, a key advantage of Lavrenko's formulation.

To build an intuition for how the RM1 model works, we'll code it up step by step in Python, using a simple toy document corpus consisting of three "documents", defined below, with the query "cat".
import math
from collections import Counter, defaultdict
# -----------------------
# Step 1: Example corpus
# -----------------------
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog barked at the cat",
    "d3": "dogs and cats are friends"
}
# Query
query = ["cat"]
Next — for the purposes of this toy IR scenario — we lightly pre-process the document collection by splitting the documents into tokens, counting each token within each document, and defining the vocabulary:
# -----------------------
# Step 2: Preprocess
# -----------------------
# Tokenize and count
doc_tokens = {d: doc.split() for d, doc in docs.items()}
doc_lengths = {d: len(toks) for d, toks in doc_tokens.items()}
doc_term_counts = {d: Counter(toks) for d, toks in doc_tokens.items()}
# Vocabulary
vocab = set(w for toks in doc_tokens.values() for w in toks)
If we run the above code we'll get the following output, with four simple data structures holding the information we need to compute the RM1 distribution of relevance for any query.
doc_tokens = {
    'd1': ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    'd2': ['the', 'dog', 'barked', 'at', 'the', 'cat'],
    'd3': ['dogs', 'and', 'cats', 'are', 'friends']
}
doc_lengths = {
    'd1': 6,
    'd2': 6,
    'd3': 5
}
doc_term_counts = {
    'd1': Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}),
    'd2': Counter({'the': 2, 'dog': 1, 'barked': 1, 'at': 1, 'cat': 1}),
    'd3': Counter({'dogs': 1, 'and': 1, 'cats': 1, 'are': 1, 'friends': 1})
}
vocab = {
    'the', 'cat', 'sat', 'on', 'mat',
    'dog', 'barked', 'at',
    'dogs', 'and', 'cats', 'are', 'friends'
}
If we look at the RM1 equation defined earlier, we can break it up into key components. P(w|d) defines the probability distribution of the words w in a document d. P(w|d) is normally computed using Dirichlet prior smoothing (Zhai & Lafferty, 2001). This prior avoids zero probabilities for unseen words and balances document-specific evidence with background collection statistics. It is defined as:
P(w \mid d) = \frac{\mathrm{tf}(w, d) + \mu\, P(w \mid C)}{|d| + \mu}
The above equation gives us a smoothed language model for each of the documents in our corpus. As an aside, you can imagine how these days — with powerful language models available on Hugging Face — we could swap out this formulation for, e.g., a BERT-based variant, using embeddings to estimate the distribution P(w|d).
To do this, we can derive a document embedding g(d) via mean pooling and a word embedding e_w, then combine them in the following equation:
P(w \mid d) = \frac{\exp\big(\cos(e_w, g(d)) / \tau\big)}{\sum_{w' \in V} \exp\big(\cos(e_{w'}, g(d)) / \tau\big)}
Here V denotes the pruned vocabulary (e.g., the union of document terms) and τ is a temperature parameter. This would be the first step towards building a fully neural RM1 model, a relatively untouched and potentially novel direction in the field of IR. A minimal sketch of this idea is shown below.
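To make this concrete, here is a minimal sketch of the embedding-based P(w|d) above. The embed helper is a hypothetical stand-in that returns a deterministic pseudo-random vector per string (any Hugging Face sentence or word encoder could be slotted in instead), and the temperature value is purely illustrative:
import hashlib
import numpy as np

def embed(text, dim=64):
    """Stand-in encoder: deterministic pseudo-random vector per string.
    Swap in a real Hugging Face encoder (mean-pooled) for actual use."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def neural_p_w_given_d(d, tau=0.1):
    """Embedding-based P(w|d): softmax over cosine similarities between
    each vocabulary word embedding e_w and the document embedding g(d)."""
    g_d = embed(docs[d])
    words = sorted(vocab)
    sims = []
    for w in words:
        e_w = embed(w)
        cos = e_w @ g_d / (np.linalg.norm(e_w) * np.linalg.norm(g_d))
        sims.append(cos / tau)
    sims = np.array(sims)
    weights = np.exp(sims - sims.max())  # numerically stable softmax
    weights /= weights.sum()
    return dict(zip(words, weights))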
Back to the original formulation: this smoothed estimate can be coded up in Python, as our first estimate of P(w|d):
# -----------------------
# Step 3: P(w|d)
# -----------------------
def p_w_given_d(w, d, mu=2000):
    """Dirichlet-smoothed language model."""
    tf = doc_term_counts[d][w]
    doc_len = doc_lengths[d]
    # collection probability
    cf = sum(doc_term_counts[dd][w] for dd in docs)
    collection_len = sum(doc_lengths.values())
    p_wc = cf / collection_len
    return (tf + mu * p_wc) / (doc_len + mu)
Next up, we compute the query likelihood under the document model — P(q|d):
# -----------------------
# Step 4: P(q|d)
# -----------------------
def p_q_given_d(q, d):
    """Query likelihood under doc d."""
    score = 0.0
    for w in q:
        score += math.log(p_w_given_d(w, d))
    return math.exp(score)  # return likelihood, not log
RM1 requires the posterior P(d|q), so we flip the probability with Bayes' rule — P(d|q) ∝ P(q|d) P(d):
# -----------------------
# Step 5: P(d|q)
# -----------------------
def p_d_given_q(q):
    """Posterior distribution over documents given query q."""
    # Compute query likelihoods for all documents
    scores = {d: p_q_given_d(q, d) for d in docs}
    # Assume uniform prior P(d), so proportionality is just scores
    Z = sum(scores.values())  # normalization
    return {d: scores[d] / Z for d in docs}
We assume here that the document prior is uniform, and so it cancels. We also then normalize across all documents so the posteriors sum to 1:
P(d \mid q) = \frac{P(q \mid d)}{\sum_{d' \in D} P(q \mid d')}
Similar to P(w|d), it's worth considering how we could neuralise the P(d|q) term in RM1. A first approach would be to use an off-the-shelf cross- or dual-encoder model (such as the MS MARCO–fine-tuned BERT cross-encoder) to score the query against each document, producing a similarity score, and normalise the scores with a softmax:
P(d \mid q) = \frac{\exp\big(s(q, d) / \tau\big)}{\sum_{d' \in D} \exp\big(s(q, d') / \tau\big)}
where s(q, d) is the encoder's relevance score for the query–document pair.
With P(w|d) and P(d|q) both converted to neural, embedding-based estimates, we can plug the two together to get a simple initial version of a neural RM1 model that gives us back P(w|R, q). A sketch follows below.
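Here is a minimal sketch of how the two neural estimates could be plugged together. It assumes the neural_p_w_given_d stand-in from earlier, uses one commonly available MS MARCO–fine-tuned cross-encoder checkpoint from the sentence-transformers library, and treats the temperature as illustrative:
import numpy as np
from sentence_transformers import CrossEncoder

# One commonly used MS MARCO-fine-tuned cross-encoder checkpoint
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def neural_p_d_given_q(q_text, tau=1.0):
    """Softmax-normalised cross-encoder scores as a stand-in for P(d|q)."""
    doc_ids = list(docs)
    scores = cross_encoder.predict([(q_text, docs[d]) for d in doc_ids])
    logits = np.array(scores) / tau
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return dict(zip(doc_ids, weights))

def neural_rm1(q_text):
    """Neural RM1 sketch: combine embedding-based P(w|d) with cross-encoder P(d|q)."""
    pdq = neural_p_d_given_q(q_text)
    pwd = {d: neural_p_w_given_d(d) for d in docs}
    pwRq = {w: sum(pwd[d][w] * pdq[d] for d in docs) for w in sorted(vocab)}
    Z = sum(pwRq.values())  # already ~1, but normalise for safety
    return dict(sorted(((w, p / Z) for w, p in pwRq.items()), key=lambda x: -x[1]))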
For the purposes of this article, however, we'll switch back to the classic RM1 formulation. Let's run the (non-neural, standard RM1) code so far to see the output of the various components we've just discussed. Recall that our toy document corpus is:
d1: "the cat sat on the mat"
d2: "the dog barked on the cat"
d3: "dogs and cats are friends"
Assuming Dirichlet smoothing (with μ=2000), the P(w|d) values will be pulled towards the collection probability of "cat", since the documents are very short. For illustration:
- d1: "cat" appears once in 6 words → P(cat|d1) is roughly 0.16
- d2: "cat" appears once in 6 words → P(cat|d2) is roughly 0.16
- d3: "cat" never appears → P(cat|d3) is roughly 0 (with smoothing, a small non-zero value)
We now normalize these likelihoods to arrive at the posterior distribution P(d|q):
{
    'P(d1|q)': 0.4997,
    'P(d2|q)': 0.4997,
    'P(d3|q)': 0.0006
}
P(q|d) tells us how well the document explains the query. Imagine that each document is itself a mini language model: if it were generating text, how likely would it be to produce the words we see in the query? This probability is high if the query words look natural under the document's word distribution. For example, for the query "cat", a document that literally mentions "cat" will give a high likelihood; one about "dogs and cats" a bit less; one about "Charles Dickens" near zero.
P(d|q) then flips this around: instead of evaluating how well the document explains the query, we treat the query likelihoods as evidence for relevance and normalise them into a distribution over all documents. This gives us a retrieval score turned into probability mass — the higher it is, the more likely this document is relevant compared with the rest of the collection.
We now have all components to complete our implementation of Lavrenko’s RM1 model:
# -----------------------
# Step 6: RM1: P(w|R,q)
# -----------------------
def rm1(q):
    pdq = p_d_given_q(q)
    pwRq = defaultdict(float)
    for w in vocab:
        for d in docs:
            pwRq[w] += p_w_given_d(w, d) * pdq[d]
    # normalize
    Z = sum(pwRq.values())
    for w in pwRq:
        pwRq[w] /= Z
    return dict(sorted(pwRq.items(), key=lambda x: -x[1]))
We can now see that RM1 defines a probability distribution over the vocabulary that tells us which words are most likely to occur in documents relevant to the query. This distribution can then be used for query expansion, by adding high-probability words, or for re-ranking documents by measuring the KL divergence between each document's language model and the query's relevance model.
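To reproduce the listing below, a short driver (a minimal sketch assuming the functions defined above; the exact probabilities will vary with the smoothing parameter μ) runs RM1 on our toy query and prints the distribution:
# -----------------------
# Step 7: Run RM1 on the toy query
# -----------------------
relevance_model = rm1(query)
print(f"Top terms from RM1 for query {query}")
for w, p in relevance_model.items():
    print(f"{w:<10s} {p:.4f}")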
Top terms from RM1 for query ['cat']
cat 0.1100
the 0.1050
dog 0.0800
sat 0.0750
mat 0.0750
barked 0.0700
on 0.0700
at 0.0680
dogs 0.0650
friends 0.0630
In our toy example, the term "cat" naturally rises to the top, as it matches the query directly. High-frequency background words like "the" also appear strongly, though in practice these would be filtered out as stop words. More interestingly, content words from documents containing "cat" (such as "sat", "mat", "dog" and "barked") are elevated as well. This is the power of RM1: it introduces related terms not present in the query itself, without requiring explicit relevance judgments or supervision. Words unique to d3 (e.g., "dogs", "friends") receive small but nonzero probabilities because of smoothing.
Having now seen how RM1 builds a query-specific language model by reweighting document terms based on their posterior relevance, it's hard not to notice the parallel with what came much later in deep learning:
In RM1, we estimate a new distribution P(w|R, q) over words by combining document language models, weighted by how likely each document is to be relevant given the query. The Transformer architecture does something remarkably similar: given a token (the "query"), it computes a similarity to all other tokens (the "keys"), then uses those scores to weight their "values." This produces a new, context-sensitive representation of the query token.
Lavrenko’s RM1 Model as a “proto-Transformer”
The attention mechanism, introduced as part of the Transformer architecture, was designed to overcome a key weakness of earlier sequence models like LSTMs and RNNs: their strictly sequential processing of tokens. While recurrent models struggled to capture long-range dependencies, attention made it possible to directly connect any token in a sequence with any other, regardless of the distance within the sequence.
What's interesting is that the mathematics of attention looks very much like what RM1 was doing years earlier. In RM1, as we've seen, we construct a query-specific distribution by weighting documents; in Transformers, we construct a token-specific representation by weighting other tokens in the sequence. The principle is the same — assign the most weight to the most relevant context — but applied at the token level rather than the document level.
This may seem a bold claim, so it's incumbent upon us to provide some evidence!
Let's first dig a little deeper into the attention mechanism; I defer to the fantastic wealth of high-quality existing introductory material for a fuller and deeper dive.
In the Transformer's attention layer — scaled dot-product attention — given a query vector q, we compute its similarity to all other tokens' keys k. These similarities are normalised into weights through a softmax. Finally, those weights are used to combine the corresponding values v, producing a new, context-aware representation of the query token.
Scaled dot-product attention is:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
Here, Q = query vector(s), K = key vectors (the documents, in our analogy), and V = value vectors (the words/features to be mixed). The softmax is a normalised distribution over the keys.
Now, recall RM1 (Lavrenko & Croft 2001):
P(w \mid R, q) = \sum_{d \in D} P(w \mid d) \, P(d \mid q)
The attention weights in scaled dot-product attention parallel the document–query distribution P(d|q) in RM1. Reformulating attention in per-query form makes this connection explicit:
\mathrm{Attention}(q, K, V) = \sum_i \alpha_i \, v_i
\alpha_i = \frac{\exp(q \cdot k_i / \sqrt{d_k})}{\sum_j \exp(q \cdot k_j / \sqrt{d_k})}
The value vector v in attention can be thought of as corresponding to P(w|d) in the RM1 model, but instead of an explicit word distribution, v is a dense semantic vector — a compressed stand-in for the full distribution. It is effectively the content we mix together once we have arrived at the relevance scores for each document.
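To make the parallel concrete, here is a minimal NumPy sketch of the per-query form above, with toy shapes chosen purely for illustration. Structurally it is the same weighted mixture as the rm1 function: the keys play the role of documents, the softmax weights the role of P(d|q), and the values the role of P(w|d).
import numpy as np

def attention_for_one_query(q, K, V):
    """Single-query scaled dot-product attention: softmax(q.K^T / sqrt(d_k)) @ V."""
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)            # similarity of the query to every key (~ relevance scores)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # normalised distribution over keys (~ P(d|q))
    return weights @ V                       # weighted mixture of values (~ sum_d P(w|d) P(d|q))

# Toy example: 4 tokens ("documents"), key/value dimension 8
rng = np.random.default_rng(0)
K, V = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
q = rng.standard_normal(8)
context_vector = attention_for_one_query(q, K, V)  # the query token's new, context-aware representation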
We can draw further parallels with the broader Transformer architecture:
- Robust Probability Estimation: As we have seen, RM1 needs smoothing (e.g., Dirichlet) to handle zero counts and avoid overfitting to rare terms. Similarly, Transformers use residual connections and layer normalisation to stabilise training and avoid collapsing attention distributions. Both models build robustness into probability estimation when the data signal is sparse or noisy.
- Pseudo Relevance Feedback: RM1 performs a single round of probabilistic expansion through pseudo-relevance feedback (PRF), restricting attention to the top-K retrieved documents. The PRF set functions like an attention context window: the query distributes probability mass over a limited set of documents, and words are reweighted accordingly. Similarly, Transformer attention is limited to the input sequence. Unlike RM1, however, Transformers stack many layers of attention, each reweighting and refining token representations. Deep attention stacking can thus be seen as iterated pseudo-relevance feedback — repeatedly pooling across related context to build richer representations.
The analogy between RM1 and the Transformer is summarised in the table below, which ties the components together and draws the links between them:
| RM1 (Lavrenko & Croft) | Transformer attention |
| --- | --- |
| Query q (bag of query terms) | Query vector q |
| Documents d in the collection | Key vectors k (other tokens in the sequence) |
| Posterior P(d\|q) | Attention weights softmax(qKᵀ/√d_k) |
| Document language model P(w\|d) | Value vectors v |
| Relevance model P(w\|R,q) = Σ_d P(w\|d) P(d\|q) | Context-aware output Σ_i α_i v_i |
| Dirichlet smoothing | Residual connections & layer normalisation |
| Pseudo-relevance feedback over top-K documents | Attention over the input sequence, stacked across layers |
Nearly twenty years later, the same principle re-emerged in the Transformer's attention mechanism — now at the level of tokens rather than documents. What began as a statistical model for query expansion in Information Retrieval evolved into the mathematical core of modern Large Language Models (LLMs). It's a reminder that beautiful ideas in science rarely disappear; they travel forward through time, reshaped and reinterpreted in new contexts.
Sometimes the simplest ideas are the most powerful. Who would have imagined that "attention" could turn out to be the key to unlocking language? And yet, it is.
Conclusions and Final Thoughts
In this article, we have traced one branch of the vast tree that is language modelling, uncovering a compelling connection between the development of relevance models in early information retrieval and the emergence of Transformers in modern NLP. RM1 — the first variant in the family of relevance models — was, in many ways, a proto-Transformer for IR, foreshadowing the mechanism that would later reshape how machines understand language.
We even sketched a neural variant of the Relevance Model, using modern encoder models, thereby unifying past (relevance model) and present (Transformer architecture) in the same probabilistic framework!
At the outset, we invoked Newton's image of standing on the shoulders of giants. Let us close with another of his reflections: "I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me."
I hope you agree that the path from RM1 to Transformers is just such a discovery — a highly polished pebble on the shore of a much greater ocean of AI discoveries yet to come.
