The Machine Learning “Advent Calendar” Day 22: Embeddings in Excel

For Day 22 of this series, we'll talk about deep learning.

And when people talk about deep learning, we immediately picture those diagrams of deep neural network architectures, with many layers, neurons, and parameters.

In practice, the actual shift introduced by deep learning is elsewhere.

It’s about learning data representations.

In this article, we focus on text embeddings, explain their role within the machine learning landscape, and show how they can be understood and explored in Excel.

1. Classic Machine Learning vs. Deep Learning

In this part, we discuss why embeddings are introduced.

1.1 Where does deep learning fit?

To understand embeddings, we first have to clarify where deep learning fits.

We will use the term classic machine learning to describe methods that don't rely on deep architectures.

All of the previous articles deal with classic machine learning, which can be described in two complementary ways.

Learning paradigms

  • Supervised learning
  • Unsupervised learning

Model families

  • Distance-based models
  • Tree-based models
  • Weight-based models

Across this series, we’ve already studied the training algorithms behind these models. Specifically, we’ve seen that gradient descent applies to all weight-based models, from linear regression to neural networks.

Deep learning is usually reduced to neural networks with many layers.

But this explanation is incomplete.

From an optimization standpoint, deep learning doesn’t introduce a brand new learning rule.

So what does it introduce?

1.2 Deep learning as data representation learning

Deep learning is about how features are created.

Instead of manually designing features, deep learning learns representations automatically, often through multiple successive transformations.

This also raises an important conceptual question:

Where is the boundary between feature engineering and model learning?

Some examples make this clearer:

  • Polynomial regression is still a linear model, but the features are polynomial
  • Kernel methods project data into a high-dimensional feature space
  • Density-based methods implicitly transform the data before learning

Deep learning continues this concept, but at scale.
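As a concrete illustration of the first example above, here is a minimal scikit-learn sketch (the data is synthetic and purely for illustration): the model stays linear, but the hand-crafted polynomial features let it fit a curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear target: y = x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

# Manual feature engineering: expand x into [1, x, x^2]
X_poly = PolynomialFeatures(degree=2).fit_transform(x)

# The model itself stays linear; only the representation changed
model = LinearRegression().fit(X_poly, y)
print(model.coef_)  # the weight on the x^2 column should be close to 1
```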

From this angle, deep learning belongs to:

  • the feature engineering philosophy, for representation
  • the weight-based model family, for learning

1.3 Images and convolutional neural networks

Images are represented as pixels.

From a technical standpoint, image data is already numerical and structured: a grid of numbers. However, the information contained in these pixels is not structured in a way that classical models can easily exploit.

Pixels don’t explicitly encode: edges, shapes, textures, or objects.

Convolutional Neural Networks (CNNs) are designed to extract information from pixels. They apply filters to detect local patterns, then progressively combine them into higher-level representations.
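As a rough sketch of what a single filter does (the filter values here are hand-written; a real CNN learns them during training):

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny 6x6 "image": dark left half, bright right half
image = np.concatenate([np.zeros((6, 3)), np.ones((6, 3))], axis=1)

# A hand-written vertical-edge filter (a CNN would learn these values)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# The feature map has large absolute values where the brightness changes
feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map)
```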

I have previously published an article showing how CNNs can be implemented in Excel to make this process explicit.

CNN in Excel – all images by author

For images, the challenge is not to make the data numerical, but to extract meaningful representations from already numerical data.

1.4 Text data: a distinct problem

Text presents a fundamentally different challenge.

Unlike images, text is not numerical by nature.

Before modeling context or order, the first problem is more basic:

How can we represent words numerically?

Creating a numerical representation for text is the first step.

In deep learning for text, this step is handled by embeddings.

Embeddings transform discrete symbols (words) into vectors that models can work with. Once embeddings exist, we can then model context, order, and relationships between words.

In this article, we focus on this first and essential step:
how embeddings create numerical representations for text, and how this process can be explored in Excel.

2. Two ways to learn text embeddings

In this article, we'll use the IMDB movie reviews dataset to illustrate both approaches. The dataset is distributed under the Apache License 2.0.

There are two main ways to learn embeddings for text, and we'll do both with this dataset:

  • supervised: we'll create embeddings to predict the sentiment
  • unsupervised or self-supervised: we'll use the word2vec algorithm

In both cases, the goal is the same:
to transform words into numerical vectors that can be used by machine learning models.

Before comparing these two approaches, we first need to clarify what embeddings are and how they relate to classic machine learning.

IMDB dataset – image by author – Apache License 2.0

2.1 Embeddings and classic machine learning

In classic machine learning, categorical data is generally handled with:

  • label encoding, which assigns fixed integers but introduces artificial order
  • one-hot encoding, which removes order but produces high-dimensional sparse vectors

How they can be used depends on the nature of the models.

Distance-based models cannot effectively use one-hot encoding, because all categories end up being equally distant from one another. Label encoding could work only if we can assign meaningful numerical values to the categories, which is usually not the case in classic models.

Weight-based models can use one-hot encoding, because the model learns a weight for each category. In contrast, with label encoding, the numerical values are fixed and cannot be adjusted to represent meaningful relationships.

Tree-based models treat all variables as categorical splits rather than numerical magnitudes, which makes label encoding acceptable in practice. However, most implementations, including scikit-learn, still require numerical inputs. Consequently, categories must be converted to numbers, either through label encoding or one-hot encoding. If the numerical values carried semantic meaning, this would again be useful.
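A small scikit-learn sketch of these two classic encodings (the word list is a toy example; `sparse_output` requires scikit-learn 1.2 or later), including the equal-distance issue mentioned above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

words = np.array(["good", "bad", "great", "boring"]).reshape(-1, 1)

# Label encoding: fixed integers, with an artificial alphabetical order
print(LabelEncoder().fit_transform(words.ravel()))   # [2 0 3 1]

# One-hot encoding: no order, but sparse high-dimensional vectors
one_hot = OneHotEncoder(sparse_output=False).fit_transform(words)
print(one_hot)

# With one-hot, every pair of distinct words is equally distant
print(np.linalg.norm(one_hot[0] - one_hot[1]))       # sqrt(2) for any pair
```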

Overall, this highlights a limitation of classic approaches:
category values are fixed and never learned.

Embeddings extend this idea by learning the representation itself.
Each word is associated with a trainable vector, turning the representation of categories into a learning problem rather than a preprocessing step.
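Conceptually, an embedding layer is nothing more than a lookup table of trainable parameters. A minimal sketch with a toy vocabulary (the names and values are made up for illustration):

```python
import numpy as np

vocab = {"good": 0, "bad": 1, "movie": 2}   # toy vocabulary
embedding_dim = 4

# One row of trainable parameters per word, initialized randomly;
# during training, gradient descent updates these rows like any other weights
E = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))

# "Encoding" a word is a row lookup, not a fixed preprocessing rule
print(E[vocab["good"]])
```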

2.2 Supervised embeddings

In supervised learning, embeddings are learned as a part of a prediction task.

For instance, the IMDB dataset comes with sentiment labels, so we can learn embeddings as part of a sentiment prediction task.

In our case, we use a very simple architecture: each word is mapped to a one-dimensional embedding.

This is possible because the target is binary sentiment classification.

Once training is complete, we can export the embeddings and explore them in Excel.
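Below is a minimal Keras sketch of such an architecture. The variable names (`X_train`, `y_train`, `vocab_size`) are mine, and I assume the reviews have already been converted to padded integer sequences:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10_000  # assumed vocabulary size

# One-dimensional embedding -> average over words -> logistic output
model = tf.keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=1),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_train: (n_reviews, max_len) integer word indices, y_train: 0/1 sentiment
# model.fit(X_train, y_train, epochs=5, validation_split=0.2)

# Export the learned embedding table (one value per word) for Excel
embeddings = model.layers[0].get_weights()[0]   # shape: (vocab_size, 1)
np.savetxt("embeddings.csv", embeddings, delimiter=",")
```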

When plotting the embeddings on the x-axis and word frequency on the y-axis, a clear pattern appears:

  • positive values are associated with words that carry positive sentiment
  • negative values are associated with words that carry negative sentiment

Depending on the initialization, the sign can be inverted, because the logistic regression layer also has parameters that influence the final prediction.

Finally, in Excel, we reconstruct the full pipeline that corresponds to the architecture we defined earlier.

Input column
The input text (a review) is split into words, and each row corresponds to one word.

Embedding lookup
Using a lookup function, the embedding value associated with each word is retrieved from the embedding table learned during training.

Global average
The global average embedding is computed by averaging the embeddings of all words seen so far. This corresponds to a very simple sentence representation: the mean of word vectors.

Probability prediction
The averaged embedding is then passed through a logistic function to produce a sentiment probability.
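The same pipeline can be sketched in a few lines of Python; the embedding values below are made up for illustration, whereas in practice they come from the exported table:

```python
import math

# Toy embedding table (in practice, exported from the trained model)
embedding = {"great": 2.1, "boring": -1.8, "movie": 0.05, "the": 0.0}

review = ["the", "movie", "great"]  # a hypothetical tokenized review

# Embedding lookup for each word (unknown words get 0, like a blank cell)
values = [embedding.get(word, 0.0) for word in review]

# Global average: the mean of the word embeddings seen so far
avg = sum(values) / len(values)

# Logistic function turns the average into a sentiment probability
probability = 1 / (1 + math.exp(-avg))
print(avg, probability)
```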

What we observe

  • Words with strongly positive embeddings push the average upward.
  • Words with strongly negative embeddings pull the average downward.
  • Neutral or weakly weighted words have little influence.

As more words are added, the global average embedding stabilizes, and the sentiment prediction becomes more confident.

2.3 Word2Vec: embeddings from co-occurrence

In Word2Vec, similarity doesn't mean that two words have the same meaning.
It means that they appear in similar contexts.

Word2Vec learns word embeddings by looking at which words tend to co-occur within a fixed window in the text. Two words are considered similar if they often appear around the same neighboring words, even when their meanings are opposite.

As shown in the Excel sheet below, we compute the cosine similarity for a word such as “good” and retrieve the most similar words.

From the model's perspective, the surrounding words are almost identical. The only thing that changes is the adjective itself.

As a result, Word2Vec learns that “good” and “bad” play a similar role in language, although their meanings are opposite.

So, Word2Vec captures distributional similarity, not semantic polarity.

A useful way to think about it: two words are similar if one can replace the other in a sentence without making the sentence sound unnatural.
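For readers who want to reproduce the similarity table outside Excel, here is a minimal gensim sketch. It assumes the IMDB reviews have already been lowercased and tokenized into a list of word lists called `tokenized_reviews` (a name introduced here for illustration):

```python
from gensim.models import Word2Vec

# tokenized_reviews: list of reviews, each a list of lowercase tokens
model = Word2Vec(
    sentences=tokenized_reviews,
    vector_size=100,   # embedding dimension
    window=5,          # co-occurrence window
    min_count=5,       # ignore rare words
    workers=4,
)

# Cosine similarity in embedding space: distributional, not semantic, similarity
print(model.wv.most_similar("good", topn=10))

# Export the vectors to a text file that can be opened in Excel
model.wv.save_word2vec_format("imdb_word2vec.txt")
```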

2.4 How embeddings are used

In modern systems such as RAG (Retrieval-Augmented Generation), embeddings are often used to retrieve documents or passages for question answering.
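To make the retrieval step concrete, here is a minimal sketch of embedding-based retrieval using averaged word vectors and cosine similarity. The `word_vectors` lookup is assumed to come from a trained model such as the Word2Vec one above, and the helper functions are my own illustration, not a specific library API:

```python
import numpy as np

def embed(text, word_vectors, dim=100):
    """Average the word vectors of a text: a deliberately naive document embedding."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query, documents, word_vectors, k=3):
    """Return the k documents whose embeddings are closest to the query embedding."""
    q = embed(query, word_vectors)
    scored = [(cosine(q, embed(doc, word_vectors)), doc) for doc in documents]
    return sorted(scored, reverse=True)[:k]
```

Because the ranking relies purely on proximity in embedding space, this simple retriever inherits whatever the embeddings do, and do not, capture.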

However, this approach has limitations.

Most commonly used embeddings are trained in a self-supervised way, based on co-occurrence or contextual prediction objectives. Consequently, they capture general language similarity, not task-specific meaning.

This means:

  • embeddings may retrieve text that is linguistically similar but not relevant
  • semantic proximity doesn’t guarantee answer correctness

Other embedding strategies can be used, including task-adapted or supervised embeddings, but they often remain self-supervised at their core.

Understanding how embeddings are created, what they encode, and what they don't encode is therefore essential before using them in downstream systems such as RAG.

Conclusion

Embeddings are learned numerical representations of words that make similarity measurable.

Whether learned through supervision or through co-occurrence, embeddings map words to vectors based on how they are used in data. By exporting them to Excel, we can inspect these representations directly, compute similarities, and understand what they capture and what they don't.

This makes embeddings less mysterious and clarifies their role as a foundation for more complex systems such as retrieval or RAG.
