as a black box. We all know that it learns from data, but the real question is how it truly learns.
In this article, we will build a tiny Convolutional Neural Network (CNN) directly in Excel to understand, step by step, how a CNN actually works for images.
We will open this black box and watch each step happen right before our eyes. We will understand all of the calculations that form the foundation of what we call "deep learning".
This article is part of a series about implementing machine learning and deep learning algorithms in Excel. You can find all of the Excel files on this Kofi link.
1. How Images are Seen by Machines
1.1 Two Ways to Detect Something in an Image
When we try to detect an object in an image, such as a cat, there are two main approaches: the deterministic approach and the machine learning approach. Let's see how these two approaches work for this example of recognizing a cat in an image.
The deterministic way means writing rules by hand.
For instance, we can say that a cat has a round face, two triangular ears, a body, a tail, etc. The developer does all of the work of defining the rules.
Then the computer runs all these rules and gives a similarity score.
The machine learning approach means that we don't write the rules ourselves.
Instead, we give the computer many examples: pictures with cats and pictures without cats. Then it learns by itself what makes a cat a cat.

That's where things can become mysterious.
We usually say that the machine will figure it out by itself, but the real question is how.
In fact, we still have to tell the machine how to create these rules, and the rules have to be learnable. So the key point is: how do we define the kind of rules that can be learned?
To understand how to define rules, we first have to understand what an image is.
1.2 Understanding What an Image Is
A cat is a complex shape, so let's take a simpler and clearer example: recognizing handwritten digits from the MNIST dataset.
First, what is an image?
A digital image can be seen as a grid of pixels. Each pixel is a number that indicates how dark it is, from 0 for white to 255 for black.
In Excel, we can represent this grid with a table where each cell corresponds to one pixel.

The original dimension of the MNIST digits is 28×28. But to keep things simple, we will use a 10×10 table. It's small enough for quick calculations but still large enough to show the general shape.
So we will reduce the dimension.
For instance, the handwritten digit "1" can be represented by a 10×10 grid as below in Excel.
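The Excel table above can be reproduced in code. Here is a minimal NumPy sketch; the pixel values are illustrative (a hand-drawn "1"), not taken from the actual MNIST data:

```python
import numpy as np

# A toy 10x10 grid for a handwritten "1" (0 = white, 255 = black).
# The pixel values here are illustrative, not real MNIST pixels.
image = np.zeros((10, 10), dtype=int)
image[1:9, 4] = 255   # vertical stroke down the middle
image[1, 3] = 255     # small diagonal tip at the top

print(image.shape)    # (10, 10): one Excel cell per pixel
```

Each entry of the array plays the role of one Excel cell.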

1.3 Before Deep Learning: Classic Machine Learning for Images
Before using CNNs or any deep learning method, we can already recognize simple images with classic machine learning algorithms such as logistic regression or decision trees.
In this approach, each pixel becomes one feature. For instance, a 10×10 image has 100 pixels, so there are 100 features as input.
The algorithm then learns to associate patterns of pixel values with labels such as "0", "1", or "2".
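Turning the grid into features is just a flattening step. A small sketch with a hypothetical random image:

```python
import numpy as np

# Hypothetical 10x10 image: in the classic ML approach each pixel
# becomes one independent feature, so we simply flatten the grid.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(10, 10))

features = image.flatten()
print(features.shape)  # (100,): 100 input features for a classifier
```

These 100 values would then be fed to a model such as logistic regression.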

In fact, with this simple machine learning approach, logistic regression can achieve quite good results, with an accuracy around 90%.
This shows that classic models are able to learn useful information from raw pixel values.
However, they have a major limitation: they treat each pixel as an independent value, without considering its neighbors. As a result, they cannot understand spatial relationships between the pixels.
So intuitively, we know that the performance will not be good for complex images. This method does not scale.
Now, if you already understand how classic machine learning works, you know that there is no magic. And in fact, you already know what to do: you have to improve the feature engineering step and transform the features, in order to extract more meaningful information from the pixels.
2. Constructing a CNN Step by Step in Excel
2.1 From complex CNNs to a simple one in Excel
When we talk about Convolutional Neural Networks, we often see very deep and complicated architectures, like VGG-16. Many layers, millions of parameters, and countless operations: it looks so complex that it seems impossible to understand exactly how it works.

The main idea behind the layers is detecting patterns step by step.
With the example of handwritten digits, let's ask a question: what could be the simplest possible CNN architecture?
First, for the hidden layers, let's reduce the number. How many? Just one. That's right: only one.
As for the filters, what about their dimensions? In real CNN layers, we usually use 3×3 filters to detect small patterns. But let's begin with big ones.
How big? 10×10!
Yes, why not?
This also means that we don't have to slide the filter across the image. This way, we can directly compare the input image with the filter and see how well they match.
This simple case is not about performance, but about clarity.
It shows how CNNs detect patterns step by step.
Now, we have to define the number of filters. We will use 10; it's the minimum. Why? Because there are 10 digits, so we need at least 10 filters. We will see how they can be found in the next section.
In the image below, you can see the diagram of this simplest CNN architecture:

2.2 Training the Filters (or Designing Them Ourselves)
In a real CNN, the filters are not written by hand. They are learned during training.
The neural network adjusts the values inside each filter to detect the patterns that best help it recognize the images.
In our simple Excel example, we won't train the filters.
Instead, we will create them ourselves to understand what they represent.
Since we already know the shapes of handwritten digits, we can design filters that look like each digit.
For instance, we can draw a filter that matches the shape of 0, another for 1, and so on.
Another option is to take the average image of all examples for each digit and use that as the filter.
Each filter will then represent the "average shape" of a digit.
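The averaging option is easy to sketch in code. Here is a minimal version with a hypothetical training set of random 10×10 images and labels (real MNIST data would be used in practice):

```python
import numpy as np

# Hypothetical tiny training set: 50 random 10x10 "images" labeled 0-9.
rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(50, 10, 10))
train_labels = rng.integers(0, 10, size=50)

# One filter per digit: the average of all training images with that label.
filters = np.stack([
    train_images[train_labels == d].mean(axis=0)
    if np.any(train_labels == d) else np.zeros((10, 10))
    for d in range(10)
])
print(filters.shape)  # (10, 10, 10): ten 10x10 "average shape" filters
```

In Excel, the same thing is done with an AVERAGE over all the example grids of each digit.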
This is where the frontier between human and machine becomes visible again. We can either let the machine discover the filters, or we can use our own knowledge to build them manually.
That is the point: machines don't define the nature of the operations. Machine learning researchers define them. Machines are only good at running loops to find the optimal values for these defined rules. And in simple cases, humans are often better than machines.
So, if there are only 10 filters to define, we know that we can directly define the 10 digits. We know, intuitively, the nature of these filters. But there are other options, of course.
Now, to define the numerical values of these filters, we can directly use our knowledge. We can also use the training dataset.
Below you can see the 10 filters created by averaging all the images of each handwritten digit. Each one shows the typical pattern that defines a digit.

2.3 How a CNN Detects Patterns
Now that we have the filters, we have to compare the input image to those filters.
The central operation in a CNN is called cross-correlation. It's the key mechanism that allows the computer to match patterns in an image.
It works in two simple steps:
- Multiply values/dot product: we take each pixel in the input image and multiply it by the pixel in the same position in the filter. This means that the filter "looks" at each pixel of the image and measures how similar it is to the pattern stored in the filter. If the two values are both large, the product is large.
- Add results/sum: the products of these multiplications are then added together to produce a single number. This number expresses how strongly the input image matches the filter.
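The two steps above can be checked by hand on a tiny example. Here is a 2×2 sketch (illustrative numbers, not real pixels):

```python
import numpy as np

# Tiny 2x2 example so the arithmetic is easy to verify by hand.
patch  = np.array([[1, 0],
                   [0, 2]])
kernel = np.array([[3, 1],
                   [1, 4]])

# Step 1: multiply each pixel by the filter value at the same position.
products = patch * kernel   # [[3, 0], [0, 8]]

# Step 2: add everything up into a single matching score.
score = products.sum()      # 3 + 0 + 0 + 8 = 11
print(score)                # 11
```

In Excel, this is exactly what SUMPRODUCT over the two ranges computes.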

In our simplified architecture, the filter has the same size as the input image (10×10).
For this reason, the filter doesn't have to move across the image.
Instead, the cross-correlation is applied once, comparing the whole image with the filter directly.
The resulting number represents how well the image matches the pattern contained in the filter.
If the filter looks like the average shape of a handwritten "5", a high value means that the image is probably a "5".
By repeating this operation with all filters, one per digit, we can see which pattern gives the best match.
2.4 Building a Simple CNN in Excel
We can now create a small CNN from end to end to see how the full process works in practice.
- Input: a 10×10 matrix represents the image to classify.
- Filters: we define ten filters of size 10×10, each one representing the average image of a handwritten digit from 0 to 9. These filters act as pattern detectors for each digit.
- Cross-correlation: each filter is applied to the input image, producing a single score that measures how well the image matches that filter's pattern.
- Decision: the filter with the highest score gives the predicted digit. In deep learning frameworks, this step is usually handled by a Softmax function, which converts all scores into probabilities.
In our simple Excel version, taking the maximum score is enough to decide which digit the image most likely represents.
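The whole pipeline fits in a few lines. A sketch with hypothetical random data standing in for the real image and the averaged filters:

```python
import numpy as np

# Hypothetical data: the image to classify and ten 10x10 digit filters.
rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(10, 10))
filters = rng.integers(0, 256, size=(10, 10, 10))

# Cross-correlation: elementwise product with each filter, then sum.
scores = np.array([np.sum(image * filters[d]) for d in range(10)])

# Decision: the filter with the highest score gives the predicted digit,
# the same as taking the MAX of the ten scores in Excel.
predicted_digit = int(np.argmax(scores))
print(predicted_digit)
```

With real averaged filters, the score for the correct digit would usually dominate the other nine.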

2.5 Convolution or Cross Correlation?
At this point, you might wonder why we call it a Convolutional Neural Network when the operation we described is actually cross-correlation.
The difference is subtle but simple:
- Convolution means flipping the filter both horizontally and vertically before sliding it over the image.
- Cross-correlation means applying the filter directly, without flipping.
For more information, you can read this article:
For historical reasons, the term "convolution" stayed, whereas the operation actually done in a CNN is cross-correlation.
In fact, most deep learning frameworks, such as PyTorch or TensorFlow, actually use cross-correlation when performing "convolutions".

In brief:
CNNs are “convolutional” in name, but “cross-correlational” in practice.
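The difference is easy to see numerically. A tiny sketch with made-up 2×2 values:

```python
import numpy as np

image  = np.array([[1, 2],
                   [3, 4]])
kernel = np.array([[0, 1],
                   [2, 0]])

# Cross-correlation: apply the filter directly.
cross_corr = np.sum(image * kernel)            # 1*0 + 2*1 + 3*2 + 4*0 = 8

# Convolution: flip the filter on both axes first.
convolution = np.sum(image * np.flip(kernel))  # 1*0 + 2*2 + 3*1 + 4*0 = 7

print(cross_corr, convolution)  # 8 7
```

Same filter, different result: the flip is the only difference between the two operations.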
3. Constructing More Complex Architectures
3.1 Small filters to detect more detailed patterns
In the previous example, we used a single 10×10 filter per digit to compare the whole image with one pattern.
This was enough to understand the principle of cross-correlation and how a CNN detects similarity between an image and a filter.
Now we can take one step further.
Instead of one global filter, we will use several smaller filters, each of size 5×5. These filters will look at smaller regions of the image, detecting local details instead of the full shape.
Let's take an example with four 5×5 filters applied to a handwritten digit.
The input image is cut into four smaller parts of 5×5 pixels, one for each filter.
We can still use the average values of the digits as a starting point. Each filter will then give 4 values instead of 1.

At the end, we can apply a Softmax function to get the final prediction.
But in this simple case, it is also possible to simply sum all the values.
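The four-quadrant idea can be sketched as follows, with a hypothetical random image and one random 5×5 filter standing in for the averaged ones:

```python
import numpy as np

# Hypothetical data: a 10x10 image and one 5x5 local filter.
rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(10, 10))
filt  = rng.integers(0, 256, size=(5, 5))

# Cut the 10x10 image into four non-overlapping 5x5 quadrants
# and cross-correlate the filter with each one.
values = [np.sum(image[r:r + 5, c:c + 5] * filt)
          for r in (0, 5) for c in (0, 5)]

print(len(values))   # 4 values per filter instead of 1
total = sum(values)  # in this simple case, summing them is enough
```

Each value measures how well one local region matches the 5×5 pattern.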
3.2 What if the digit is not in the center of the image
In the previous examples, we compared the filters to fixed areas of the image. An intuitive question is: what if the object is not centered? It can be at any position in the image.
The solution is fortunately very simple: we slide the filter across the image.
Let's take a simple example again: the input image is now 10×14. The height is unchanged, but the width is 14.
The filter is still 10×10, and it slides horizontally across the image. We then get 5 cross-correlation values.
We do not know where the digit is, but that is not a problem: we can just take the maximum of the 5 cross-correlation values.
This is what we call a max pooling layer.
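Sliding plus max pooling can be sketched in a few lines, again with hypothetical random data:

```python
import numpy as np

# Hypothetical data: a 10x14 image (digit anywhere horizontally)
# and a 10x10 filter.
rng = np.random.default_rng(3)
image = rng.integers(0, 256, size=(10, 14))
filt  = rng.integers(0, 256, size=(10, 10))

# Slide the filter horizontally: 14 - 10 + 1 = 5 positions.
scores = [np.sum(image[:, c:c + 10] * filt) for c in range(5)]

# Max pooling: keep only the strongest response,
# wherever the digit happens to be.
best = max(scores)
print(len(scores))  # 5
```

The maximum is translation-tolerant: it fires on the best-matching position without knowing where that position is.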

3.3 Other Operations Used in CNNs
We have tried to explain why each component of a CNN is useful.
The most important component is the cross-correlation between the input and the filters. We also explained why small filters can be useful, and how max pooling handles objects that can be anywhere in an image.
There are also other steps commonly used in CNNs, such as stacking several layers in a row or applying non-linear activation functions.
These steps make the model more flexible, more robust, and able to learn richer patterns.
Why are they useful exactly?
I'll leave this question to you as an exercise.
Now that you understand the core idea, try to think about how each of these steps helps a CNN go further, and try to build some concrete examples in Excel.
Conclusion
Simulating a CNN in Excel is a fun and practical way to see how machines recognize images.
By working with small matrices and simple filters, we can understand the main steps of a CNN.
I hope this article gave you some food for thought about what deep learning really is. The difference between machine learning and deep learning is not only about how deep the model is, but about how it works with representations of images and data.
