After the Neural Network Regressor, we now move to the classifier version.
From a mathematical viewpoint, the two models are very similar. In truth, they differ mainly in the interpretation of the output and the choice of the loss function.
However, the classifier version is where intuition often becomes much stronger.
In practice, neural networks are used far more often for classification than for regression. Thinking in terms of probabilities, decision boundaries, and classes makes the role of neurons and layers easier to understand.
In this article, you will:
- learn how to define the structure of a neural network in an intuitive way,
- see why the number of neurons matters,
- and see why a single hidden layer is already sufficient, at least in theory.
At this point, a natural question arises: if a single hidden layer is already enough in theory, why do we build deep networks at all?
The answer is important.
Deep learning is not just about stacking many hidden layers on top of one another. Depth helps, but it is not the whole story. What really matters is how representations are built, reused, and constrained, and why deeper architectures are more efficient to train and generalize better in practice.
We’ll come back to this distinction later. For now, we deliberately keep the network small, so that every computation can be understood, written, and checked by hand.
This is the best way to truly understand how a neural network classifier works.
As with the neural network regressor we built yesterday, we’ll split the work into two parts.
First, we look at forward propagation and define the neural network as a fixed mathematical function that maps inputs to predicted probabilities.
Then, we move to backpropagation, where we train this function by minimizing the log loss using gradient descent.
The principles are exactly the same as before. Only the interpretation of the output and the loss function change.
1. Forward propagation
In this section, we focus on one thing only: the model itself. No training yet. Just the function.
1.1 A simple dataset and the intuition of building a function
We start with a very small dataset:
- 12 observations
- One single feature x
- A binary target y
The dataset is intentionally simple so that every computation can be followed manually. However, it has one important property: the classes are not linearly separable.
This means that a simple logistic regression cannot solve the problem, no matter how well it is trained.
And yet, the right intuition is precisely the opposite of what it might appear at first: the solution is built out of logistic regressions.
What we are going to do is build two logistic regressions first. Each creates a cut in the input space, as illustrated below.
In other words, we start with one single feature, and we transform it into two new features.

Then, we apply another logistic regression, this time on these two features, to obtain the final output probability.
When written as a single mathematical expression, the resulting function is already a bit hard to read. This is exactly why we use a diagram: not because the diagram is more accurate, but because it makes it easier to see how the function is built by composition.
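The composition can also be read directly in code. Here is a minimal Python sketch of the same idea, two logistic "cuts" fed into a final logistic regression; the coefficient names follow the ones used later in this article, and the weight values are illustrative assumptions, not the article's:

```python
import math

def sigmoid(z):
    """Logistic function, the building block of every neuron here."""
    return 1 / (1 + math.exp(-z))

def forward(x, a11, b11, a12, b12, a21, a22, b2):
    """Two logistic 'cuts' on x, then a logistic regression on the cuts."""
    A1 = sigmoid(a11 * x + b11)               # first cut
    A2 = sigmoid(a12 * x + b12)               # second cut
    return sigmoid(a21 * A1 + a22 * A2 + b2)  # final probability

# Illustrative weights: the two cuts face each other, so the network
# outputs a high probability only in the middle of the range.
w = (10, 5, -10, 5, 10, 10, -15)
print(forward(-1.0, *w), forward(0.0, *w), forward(1.0, *w))  # low, high, low
```

With these weights the output forms a "band": high in the middle, low at both edges, exactly the kind of shape a single logistic regression cannot produce.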

1.2 Neural Network Structure
So the diagram represents the following model:
- One hidden layer with two neurons, which allows us to represent the two cuts we observe in the dataset
- One output neuron, which is itself a logistic regression

In our case, the model depends on seven coefficients:
- Weights and biases for the 2 hidden neurons
- Weights and bias for the output neuron
Taken together, these seven numbers fully define the model.
Now, if you already understand how a neural network classifier works, here is a question for you:
How many different solutions can this model have?
In other words, how many distinct sets of seven coefficients can produce the same classification boundary, or almost the same predicted probabilities, on this dataset?
1.3 Implementing forward propagation in Excel
We now implement the model using Excel formulas.
To visualize the output of the neural network, we generate new values of x ranging from −2 to 2 with a step of 0.02.
For each value of x, we compute:
- The outputs of the two hidden neurons (A1 and A2)
- The final output of the network
At this stage, the model is not trained yet. We therefore need to fix the seven parameters of the network. For now, we simply use a set of reasonable values, shown below, which allows us to visualize the forward propagation of the model.
This is only one possible configuration of the parameters. Even before training, it already raises an interesting question: how many different parameter configurations could produce a valid solution for this problem?

We can use the following equations to compute the values of the hidden layer and the output.
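In the notation used later in this article (a11, b11, a12, b12 for the hidden neurons, a21, a22, b2 for the output neuron), the equations take the following form, with σ the logistic function:

```latex
A_1 = \sigma(a_{11}\,x + b_{11}), \qquad
A_2 = \sigma(a_{12}\,x + b_{12})
\]
\[
\hat{y} = \sigma(a_{21}\,A_1 + a_{22}\,A_2 + b_2),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```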

The intermediate values A1 and A2 are displayed explicitly. This avoids large, unreadable formulas and makes the forward propagation easy to follow.

The dataset has been successfully divided into two distinct classes using the neural network.

1.4 Forward propagation: summary and observations
To recap, we started with a simple training dataset and defined a neural network as an explicit mathematical function, implemented using straightforward Excel formulas and a fixed set of coefficients. By feeding new values of x into this function, we were able to visualize the output of the neural network and observe how it separates the data.

Now, if you look closely at the shapes produced by the hidden layer, which contains the two logistic regressions, you can see that there are 4 possible configurations. They correspond to the different possible orientations of the slopes of the two logistic functions.
Each hidden neuron can have either a positive or a negative slope. With two neurons, this leads to 2×2 = 4 possible combinations. These different configurations can produce very similar decision boundaries at the output, even though the underlying parameters are different.
This explains why the model can admit multiple solutions for the same classification problem.
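On top of these slope configurations, there is an even simpler symmetry: swapping the two hidden neurons, together with their output weights, leaves the function completely unchanged. A minimal check, using hypothetical coefficient values:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x, a11, b11, a12, b12, a21, a22, b2):
    A1 = sigmoid(a11 * x + b11)
    A2 = sigmoid(a12 * x + b12)
    return sigmoid(a21 * A1 + a22 * A2 + b2)

# Hypothetical coefficients; the second set swaps hidden neuron 1
# and hidden neuron 2 together with their output weights.
theta      = (10, 5, -10, 3, 7, 12, -9)
theta_swap = (-10, 3, 10, 5, 12, 7, -9)

for x in [-1.5, -0.5, 0.0, 0.7, 1.2]:
    assert abs(forward(x, *theta) - forward(x, *theta_swap)) < 1e-12
print("identical outputs, different coefficients")
```

Two different sets of seven coefficients, one identical function: another reason the training problem has multiple solutions.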

The harder part is now to determine the values of these coefficients.
That is where backpropagation comes into play.
2. Backpropagation: training the neural network with gradient descent
Once the model is defined, training becomes a numerical problem.
Despite its name, backpropagation is not a separate algorithm. It is simply gradient descent applied to a composed function.
2.1 Reminder of the backpropagation algorithm
The principle is the same for all weight-based models.
We first define the model, that is, the mathematical function that maps the input to the output.
Then we define the loss function. Since this is a binary classification task, we use log loss, exactly as in logistic regression.
Finally, in order to learn the coefficients, we compute the partial derivatives of the loss with respect to each coefficient of the model. These derivatives are what allow us to update the parameters using gradient descent.
Below is a screenshot showing the final formulas for these partial derivatives.
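For a single observation with prediction ŷ and label y, applying the chain rule to the log loss with sigmoid activations yields derivatives of the following form (a reconstruction in the notation of this article; the screenshot contains the exact formulas used in the sheet):

```latex
\frac{\partial L}{\partial b_2} = \hat{y} - y, \qquad
\frac{\partial L}{\partial a_{21}} = (\hat{y} - y)\,A_1, \qquad
\frac{\partial L}{\partial a_{22}} = (\hat{y} - y)\,A_2
\]
\[
\frac{\partial L}{\partial a_{11}} = (\hat{y} - y)\,a_{21}\,A_1(1 - A_1)\,x, \qquad
\frac{\partial L}{\partial b_{11}} = (\hat{y} - y)\,a_{21}\,A_1(1 - A_1)
\]
\[
\frac{\partial L}{\partial a_{12}} = (\hat{y} - y)\,a_{22}\,A_2(1 - A_2)\,x, \qquad
\frac{\partial L}{\partial b_{12}} = (\hat{y} - y)\,a_{22}\,A_2(1 - A_2)
```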

The backpropagation algorithm can then be summarized as follows:
- Initialize the weights of the neural network randomly.
- Feed the inputs forward through the neural network to get the predicted output.
- Calculate the error between the predicted output and the actual output.
- Backpropagate the error through the network to compute the gradient of the loss function with respect to the weights.
- Update the weights using the computed gradient and a learning rate.
- Repeat steps 2 to 5 until the model converges.
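The steps above can be sketched end to end in a few dozen lines of Python. This is a minimal sketch, not the Excel sheet: the dataset, initial values, and learning rate are illustrative assumptions, mirroring the article's setup of one input, two hidden sigmoid neurons, and one sigmoid output trained with log loss:

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# A tiny non-linearly-separable dataset in the spirit of the article:
# class 1 in the middle of the range, class 0 on both sides.
X = [-1.5, -1.2, -0.9, -0.6, -0.3, -0.1, 0.1, 0.3, 0.6, 0.9, 1.2, 1.5]
Y = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]

random.seed(0)                          # step 1: random initialization
a11, b11, a12, b12, a21, a22, b2 = [random.uniform(-1, 1) for _ in range(7)]
lr = 0.5
n = len(X)

def log_loss():
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(a21 * sigmoid(a11 * x + b11)
                    + a22 * sigmoid(a12 * x + b12) + b2)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / n

loss_before = log_loss()
for _ in range(2000):                   # step 6: repeat until convergence
    g = [0.0] * 7                       # gradient accumulators
    for x, y in zip(X, Y):
        A1 = sigmoid(a11 * x + b11)     # step 2: forward pass
        A2 = sigmoid(a12 * x + b12)
        p = sigmoid(a21 * A1 + a22 * A2 + b2)
        d = p - y                       # step 3: error at the output
        g[0] += d * a21 * A1 * (1 - A1) * x   # step 4: backpropagate (a11)
        g[1] += d * a21 * A1 * (1 - A1)       # b11
        g[2] += d * a22 * A2 * (1 - A2) * x   # a12
        g[3] += d * a22 * A2 * (1 - A2)       # b12
        g[4] += d * A1                        # a21
        g[5] += d * A2                        # a22
        g[6] += d                             # b2
    a11 -= lr * g[0] / n; b11 -= lr * g[1] / n   # step 5: update weights
    a12 -= lr * g[2] / n; b12 -= lr * g[3] / n
    a21 -= lr * g[4] / n; a22 -= lr * g[5] / n
    b2  -= lr * g[6] / n

loss_after = log_loss()
print(loss_before, "->", loss_after)    # the loss should decrease
```

Depending on the initialization, the descent may fit the central band, settle near the base-rate plateau, or converge slowly; as discussed elsewhere in this article, convergence is not guaranteed.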
2.2 Initialization of the coefficients
The dataset is organized in columns to make the Excel formulas easy to extend.

The coefficients are initialized with specific values here. You can change them, but convergence is not guaranteed. Depending on the initialization, the gradient descent may converge to a different solution, converge very slowly, or fail to converge altogether.

2.3 Forward propagation
In the columns from AG to BP, we implement the forward propagation step. We first compute the two hidden activations A1 and A2, and then the output of the network. These are exactly the same formulas as those used earlier to define the forward propagation of the model.
To keep the computations readable, we process each observation individually. As a result, we have 12 columns for the hidden layer outputs (A1 and A2) and 12 columns for the output layer.
Instead of writing a single summation formula, we compute the values observation by observation. This avoids very large and hard-to-read formulas, and it makes the logic of the computations much clearer.
This column-wise organization also makes it easy to mimic a for loop during gradient descent: the formulas can simply be extended row by row to represent successive iterations.

2.4 Errors and the associated fee function
In the columns from BQ to CN, we compute the error terms and the values of the cost function.
For each observation, we evaluate the log loss based on the predicted output and the true label. These individual losses are then combined to obtain the total cost for each iteration.
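For reference, the total cost is the log loss averaged over the n = 12 observations:

```latex
J = -\frac{1}{n} \sum_{i=1}^{n}
\left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right]
```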

2.5 Partial derivatives
We now move to the computation of the partial derivatives.
The neural network has 7 coefficients, so we need to compute 7 partial derivatives, one for each parameter. For each derivative, the computation is done for all 12 observations, which results in a total of 84 intermediate values.
To keep this manageable, the sheet is carefully organized. The columns are grouped and color-coded so that each derivative can be followed easily.
In the columns from CO to DL, we compute the partial derivatives related to a11 and a12.

In the columns from DM to EJ, we compute the partial derivatives related to b11 and b12.

In the columns from EK to FH, we compute the partial derivatives related to a21 and a22.

In the columns from FI to FT, we compute the partial derivatives related to b2.

And to wrap it up, we sum the partial derivatives across the 12 observations.
The resulting gradients are grouped and shown in the columns from Z to FI.

2.6 Updating weights in a for loop
These partial derivatives allow us to perform gradient descent for each coefficient. The updates are computed in the columns from R to X.
At each iteration, we can observe how the coefficients evolve. The value of the cost function is shown in column Y, which makes it easy to see whether the descent is working and whether the loss is decreasing.
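Each of these updates follows the standard gradient descent rule, applied to every one of the seven coefficients, with α the learning rate:

```latex
\theta \leftarrow \theta - \alpha \, \frac{\partial J}{\partial \theta},
\qquad \theta \in \{a_{11},\, b_{11},\, a_{12},\, b_{12},\, a_{21},\, a_{22},\, b_2\}
```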

After updating the coefficients at each step of the for loop, we recompute the output of the neural network.

If the initial values of the coefficients are poorly chosen, the algorithm may fail to converge or may converge to an undesired solution, even with a reasonable step size.

The GIF below shows the output of the neural network at each iteration of the for loop. It helps visualize how the model evolves during training and how the decision boundary gradually converges toward a solution.

Conclusion
We have now completed the full implementation of a neural network classifier, from forward propagation to backpropagation, using only explicit formulas.
By building everything step by step, we have seen that a neural network is nothing more than a mathematical function, trained by gradient descent. Forward propagation defines what the model computes. Backpropagation tells us how to adjust the coefficients to reduce the loss.
This file allows you to experiment freely: you can change the dataset, modify the initial values of the coefficients, and observe how the training behaves. Depending on the initialization, the model may converge quickly, converge to a different solution, or get stuck in a local minimum.
Through this exercise, the mechanics of neural networks become concrete. Once these foundations are clear, using high-level libraries feels much less opaque, because you know exactly what is happening behind the scenes.
