were first introduced for images, and for images they are relatively easy to understand.
A filter slides over pixels and detects edges, shapes, or textures. You can read the article I wrote earlier to understand how CNNs work for images with Excel.
For text, the concept is similar.
Instead of pixels, we slide filters over words.
Instead of visual patterns, we detect linguistic patterns.
And many important patterns in text are very local. Let us take these very simple examples:
- “good” is positive
- “bad” is negative
- “not good” is negative
- “not bad” is often positive
In my previous article, we saw how to represent words as numbers using embeddings.
We also saw a key limitation: when we used a global average, word order was completely ignored.
From the model’s point of view, “not good” and “good not” looked exactly the same.
So the next challenge is clear: we want the model to take word order into account.
A 1D Convolutional Neural Network is a natural tool for this, since it scans a sentence with small sliding windows and reacts when it recognizes familiar local patterns.
1. Understanding a 1D CNN for Text: Architecture and Depth
1.1. Building a 1D CNN for text in Excel
In this article, we build a 1D CNN architecture in Excel with the following components:
- Embedding dictionary
We use a 2-dimensional embedding, because one dimension is not enough for this task. One dimension encodes sentiment, and the second dimension encodes negation.
- Conv1D layer
This is the core component of a CNN architecture. It consists of filters that slide across the sentence with a window length of two words. We choose 2 words to keep things simple.
- ReLU and global max pooling
These steps keep only the strongest matches detected by the filters. We will also discuss the fact that ReLU is optional.
- Logistic regression
This is the final classification layer, which combines the detected patterns into a probability.
This pipeline corresponds to a standard CNN text classifier.
The only difference here is that we explicitly write and visualize the forward pass in Excel.
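For readers who want to see the same pipeline in code, here is a minimal Keras sketch. The layer sizes (2-dimensional embedding, 4 filters, window of 2 words) mirror the Excel version; the vocabulary size is an illustrative assumption, not something taken from the spreadsheet.

```python
# Minimal Keras sketch of the same pipeline.
# The vocabulary size (1000) is an assumption for illustration.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=2),               # word id -> 2D vector (like senti, neg)
    layers.Conv1D(filters=4, kernel_size=2, activation="relu"),   # 4 pattern detectors on 2-word windows
    layers.GlobalMaxPooling1D(),                                  # keep the strongest activation per filter
    layers.Dense(1, activation="sigmoid"),                        # logistic regression -> probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```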
1.2. What “deep learning” means on this architecture
Before going further, let us take a step back.
Yes, I know, I do this often, but having a global view of models really helps to understand them.
The definition of deep learning is often blurred.
For many people, deep learning simply means “many layers”.
Here, I will take a slightly different point of view.
What really characterizes deep learning is not the number of layers, but the depth of the transformation applied to the input data.
With this definition:
- Even a model with a single convolution layer can be considered deep learning, because the input is transformed into a more structured and abstract representation.
- On the other hand, taking raw input data, applying one-hot encoding, and stacking many fully connected layers does not necessarily make a model deep in a meaningful sense.
In theory, if the input does not need any transformation, one layer is enough.
In CNNs, the presence of multiple layers has a very concrete motivation.
Consider a sentence like:
This movie is not very good
With a single convolution layer and a small window, we can detect simple local patterns such as: “very + good”
But we cannot yet detect higher-level patterns such as: “not + (very good)”
This is why CNNs are often stacked:
- the first layer detects simple local patterns,
- the second layer combines them into more complex ones.
In this article, we deliberately focus on one convolution layer.
This makes every step visible and easy to understand in Excel, while keeping the logic identical to deeper CNN architectures.
2. Turning words into embeddings
Let us start with some simple words. We will try to detect negation, so we will use these terms, together with other words (that we will not model):
- “good”
- “bad”
- “not good”
- “not bad”
We keep the representation intentionally small in order that every step is visible.
We will only use a dictionary of three words: good, bad, and not.
All other words will have 0 as their embedding.
2.1 Why one dimension isn’t enough
In a previous article on sentiment detection, we used a single dimension.
That worked for “good” versus “bad”.
But now we want to handle negation.
One dimension can only represent one concept well.
So we need two dimensions:
- senti: sentiment polarity
- neg: negation marker
2.2 The embedding dictionary
Each word becomes a 2D vector:
- good → (senti = +1, neg = 0)
- bad → (senti = -1, neg = 0)
- not → (senti = 0, neg = +1)
- some other word → (0, 0)

This is not how real embeddings look. Real embeddings are learned, high-dimensional, and not directly interpretable.
But for understanding how Conv1D works, this toy embedding is perfect.
In Excel, this is just a lookup table.
In a real neural network, this embedding matrix would be trainable.
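The same lookup can be sketched in a few lines of Python. The `embed` helper and its name are mine, not part of the Excel file:

```python
import numpy as np

# Toy embedding dictionary: each word maps to (senti, neg).
# Any word outside the dictionary gets (0, 0).
EMBEDDINGS = {
    "good": np.array([ 1.0, 0.0]),
    "bad":  np.array([-1.0, 0.0]),
    "not":  np.array([ 0.0, 1.0]),
}

def embed(sentence):
    """Turn a sentence into an (n_words, 2) matrix of (senti, neg) vectors."""
    return np.stack([EMBEDDINGS.get(word, np.zeros(2)) for word in sentence.lower().split()])

print(embed("it is not bad at all"))
# rows: it, is, not, bad, at, all -> only "not" and "bad" are non-zero
```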

3. Conv1D filters as sliding pattern detectors
Now we arrive at the core idea of a 1D CNN.
A Conv1D filter is nothing mysterious. It is just a small set of weights plus a bias that slides over the sentence.
Because:
- each word embedding has 2 values (senti, neg)
- our window contains 2 words
each filter has:
- 4 weights (2 dimensions × 2 positions)
- 1 bias
That’s all.
You can think of a filter as repeatedly asking the same question at every position:
“Do these two neighboring words match a pattern I care about?”
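In code, one filter is literally a dot product between its 4 weights and the 4 numbers of a window, plus a bias. A minimal sketch (the weights shown here are those of the “not good” filter defined in the next subsection):

```python
import numpy as np

def apply_filter(window, weights, bias):
    """Apply one Conv1D filter to one 2-word window.

    window:  (2, 2) array, two words with their (senti, neg) values
    weights: (2, 2) array, one weight per value in the window
    bias:    scalar
    """
    return float(np.sum(window * weights) + bias)

# The "not good" filter: neg of the previous word + senti of the current word - 1
w_not_good = np.array([[0.0, 1.0],    # previous word: weight on its neg value
                       [1.0, 0.0]])   # current word:  weight on its senti value

window_not_good = np.array([[0.0, 1.0],   # "not"  -> (senti=0, neg=1)
                            [1.0, 0.0]])  # "good" -> (senti=1, neg=0)

print(apply_filter(window_not_good, w_not_good, bias=-1.0))  # 1.0 -> pattern detected
```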
3.1 Sliding windows: how Conv1D sees a sentence
Consider this sentence:
it is not bad at all
We choose a window size of two words.
This means the model looks at every adjacent pair:
- (it, is)
- (is, not)
- (not, bad)
- (bad, at)
- (at, all)
Important point:
The filters slide everywhere, even when both words are neutral (all zeros).
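Generating these windows in code is a one-liner; here is a minimal sketch:

```python
def sliding_windows(sentence, size=2):
    """List every group of `size` adjacent words in the sentence."""
    words = sentence.lower().split()
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

print(sliding_windows("it is not bad at all"))
# [('it', 'is'), ('is', 'not'), ('not', 'bad'), ('bad', 'at'), ('at', 'all')]
```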

3.2 Four intuitive filters
To make the behavior easy to understand, we use 4 filters.

Filter 1 – “I see GOOD”
This filter looks only at the sentiment of the current word.
Plain-text equation for one window:
z = senti(current_word)
If the word is “good”, z = 1
If the word is “bad”, z = -1
If the word is neutral, z = 0
After ReLU, negative values become 0. But ReLU is optional, as we will see later.
Filter 2 – “I see BAD”
This one is symmetric.
z = -senti(current_word)
So:
- “bad” → z = 1
- “good” → z = -1 → ReLU → 0
Filter 3 – “I see NOT GOOD”
This filter looks at two things at the same time:
- neg(previous_word)
- senti(current_word)
Equation:
z = neg(previous_word) + senti(current_word) – 1
Why the “-1”?
It acts like a threshold, so that both conditions must be true.
Results:
- “not good” → 1 + 1 – 1 = 1 → activated
- “is good” → 0 + 1 – 1 = 0 → not activated
- “not bad” → 1 – 1 – 1 = -1 → ReLU → 0
Filter 4 – “I see NOT BAD”
Same idea, with a slightly different sign:
z = neg(previous_word) + (-senti(current_word)) – 1
Results:
- “not bad” → 1 + 1 – 1 = 1
- “not good” → 1 – 1 – 1 = -1 → 0
This is an important intuition:
A CNN filter can behave like a local logical rule, learned from data.
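Putting the four filters together: each one is just a (2 × 2) weight matrix plus a bias, hand-set to match the rules above. The sketch below reuses the `embed` and `apply_filter` helpers from the earlier snippets:

```python
import numpy as np

# The 4 hand-crafted filters as (weights, bias) pairs.
# Weight rows = (previous word, current word), columns = (senti, neg).
FILTERS = {
    "good":     (np.array([[0.0, 0.0], [ 1.0, 0.0]]),  0.0),  # senti(current)
    "bad":      (np.array([[0.0, 0.0], [-1.0, 0.0]]),  0.0),  # -senti(current)
    "not_good": (np.array([[0.0, 1.0], [ 1.0, 0.0]]), -1.0),  # neg(prev) + senti(current) - 1
    "not_bad":  (np.array([[0.0, 1.0], [-1.0, 0.0]]), -1.0),  # neg(prev) - senti(current) - 1
}

def conv1d(sentence):
    """Apply every filter to every 2-word window; returns raw z values per filter."""
    x = embed(sentence)  # (n_words, 2)
    return {name: [apply_filter(x[i:i + 2], w, b) for i in range(len(x) - 1)]
            for name, (w, b) in FILTERS.items()}

print(conv1d("it is not bad at all"))
# "not_bad" gives [-1.0, -1.0, 1.0, -1.0, -1.0]: it fires only on the (not, bad) window
```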
3.3 Outcome of sliding windows
Here is the final result of these 4 filters.

4. ReLU and max pooling: from local to global
4.1 ReLU
After computing z for each window, we apply ReLU:
ReLU(z) = max(0, z)
Meaning:
- negative evidence is ignored
- positive evidence is kept
Each filter becomes a presence detector.
By the way, this is just an activation function in the neural network. So a neural network is not that complicated after all.

4.2 Global max pooling
Then comes global max pooling.
For each filter, we keep only:
max activation over all windows
Interpretation:
“I don’t care where the pattern appears, only whether it appears strongly somewhere.”
At this point, the whole sentence is summarized by 4 numbers:
- strongest “good” signal
- strongest “bad” signal
- strongest “not good” signal
- strongest “not bad” signal
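In the running code sketch, ReLU and global max pooling are one short function each, building on the `conv1d` helper from the previous snippet:

```python
def relu(values):
    """Keep positive evidence, zero out negative evidence."""
    return [max(0.0, v) for v in values]

def global_max_pool(conv_out):
    """For each filter, keep only its strongest activation over all windows."""
    return {name: max(relu(values)) for name, values in conv_out.items()}

print(global_max_pool(conv1d("it is not bad at all")))
# {'good': 0.0, 'bad': 1.0, 'not_good': 0.0, 'not_bad': 1.0}
```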

4.3 What happens if we remove ReLU?
Without ReLU:
- negative values stay negative
- max pooling may select negative values
This mixes two ideas:
- absence of a pattern
- opposite of a pattern
The filter stops being a clean detector and becomes a signed score.
The model could still work mathematically, but interpretation becomes harder.
5. The final layer is logistic regression
Now we combine these signals.
We compute a score using a linear combination:
score = 2 × F_good – 2 × F_bad – 3 × F_not_good + 3 × F_not_bad – bias

Then we convert the score into a probability:
probability = 1 / (1 + exp(-score))
This is exactly logistic regression.
So yes:
- the CNN extracts features: this step can be thought of as feature engineering, right?
- logistic regression makes the final decision: it is a classic machine learning model we know well
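The same final layer, in the code sketch, uses the weights from the score formula above. The bias value of 1.0 is an illustrative assumption, since it is not specified here; the `predict` helper is reused on the full examples in the next section.

```python
import math

# Weights taken from the score formula above; the bias (1.0) is an assumption.
W = {"good": 2.0, "bad": -2.0, "not_good": -3.0, "not_bad": 3.0}
BIAS = 1.0

def predict(sentence):
    """Full forward pass: embeddings -> Conv1D -> ReLU + max pooling -> logistic regression."""
    features = global_max_pool(conv1d(sentence))
    score = sum(W[name] * features[name] for name in W) - BIAS
    return 1.0 / (1.0 + math.exp(-score))  # probability of positive sentiment
```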

6. Full examples with sliding filters
Example 1
“it is bad, so it is not good at all”
The sentence contains “bad”, “good”, and the pattern “not good”.
After max pooling:
- F_good = 1 (because “good” exists)
- F_bad = 1
- F_not_good = 1
- F_not_bad = 0
The final score becomes strongly negative.
Prediction: negative sentiment.

Example 2
“it is good. yes, not bad.”
The sentence contains “good”, “bad”, and the pattern “not bad”.
After max pooling:
- F_good = 1
- F_bad = 1 (because the word “bad” appears)
- F_not_good = 0
- F_not_bad = 1
The final linear layer learns that “not bad” should outweigh “bad”.
Prediction: positive sentiment.
This also shows something important: max pooling keeps all strong signals.
The final layer decides how to combine them.
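Running the code sketch on both examples reproduces this reasoning (punctuation removed, since the toy tokenizer just splits on spaces):

```python
for sentence in ["it is bad so it is not good at all",   # example 1
                 "it is good yes not bad"]:              # example 2
    features = global_max_pool(conv1d(sentence))
    print(sentence, features, round(predict(sentence), 2))
# example 1 -> "not_good" fires, probability ~0.02 (negative)
# example 2 -> "not_bad" fires and outweighs "bad", probability ~0.88 (positive)
```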

Example 3: a limitation that explains why CNNs get deeper
Try this sentence:
“it is not very bad”
With a window of size 2, the model sees: (it, is), (is, not), (not, very), (very, bad).
It never sees (not, bad), so the “not bad” filter never fires.
This explains why real models use:
- larger windows
- multiple convolution layers
- or other architectures for longer dependencies

Conclusion
The strength of Excel is visibility.
You possibly can see:
- the embedding dictionary
- all filter weights and biases
- every sliding window
- every ReLU activation
- the max pooling result
- the logistic regression parameters
Training is just the process of adjusting these numbers.
Once you see that, CNNs stop being mysterious.
They become what they really are: structured, trainable pattern detectors that slide over data.
