Welcome to this “Advent Calendar” of machine learning and deep learning in Excel.
For Day 1, we start with the k-NN (k-Nearest Neighbors) regressor algorithm. And as you will see, this is probably the simplest model and a great way to start.
If you already know this model, here are some questions for you. Hopefully, they will make you want to keep reading, because there are some subtle lessons that are not taught in traditional courses.
- Is scaling of continuous features necessary for this model?
- How can categorical features be handled?
- What can be done to the continuous features to improve model performance?
- Which distance measures are more suitable in certain situations, for instance when predicting house prices where geographical location matters?
Spoiler: with a naive k-NN, you cannot get the best scaling automatically.
It is also an opportunity, if you are not familiar with Excel formulas, to practice functions such as RANK, IF, SUMPRODUCT, and other useful Excel functions.
You can use this link to get the Excel/Google Sheets file, and my advice is to follow the article while doing some testing with the file to better understand.
The principle of k-NN
If you had to sell or buy an apartment, how would you estimate its value?
Think about a really realistic approach, not some complex model that you have to spend hours building.
Something that you could do for real.
Well, you would probably ask your neighbors who have an apartment of the same or similar size, and then calculate the average value of those apartments.
Yes, that is exactly the idea of k-NN, for k-Nearest Neighbors: search for the most similar examples, and use their values to estimate the new one.
To illustrate this task with a concrete example of house price estimation, we will use the well-known California Housing dataset. It is built from Census data on California block groups and is used to predict the median house value.

Each observation is not an individual house, but it is still an interesting example to use.
Here’s a quick description of the variables.
The target variable is MedHouseVal, the median house value in units of 100,000 USD (example: 3.2 means 320,000 dollars).
The feature variables are as follows:
1. MedInc: median income (in units of 10,000 USD)
2. HouseAge: median age of homes
3. AveRooms: average number of rooms per household
4. AveBedrms: average number of bedrooms per household
5. Population: number of people living in the block group
6. AveOccup: average number of occupants per household
7. Latitude: geographic latitude
8. Longitude: geographic longitude
k-NN with One Continuous Feature
Before we use multiple features to find the neighbors, let’s first use only one feature and a few observations.
Although the process for one continuous feature is quite simple, we will still follow every step: first explore the dataset, then train the model with a hyperparameter, and finally use the model to predict.
Training dataset
Here is the plot of this simple dataset of 10 observations. The x-axis is the continuous feature, and the y-axis is the target variable.

Now, imagine that we have to predict the value for a new observation x=10. How can we do that?
Model training?
The first step for almost all machine learning models is training.
But for k-NN, your model is the entire dataset. In other words, you don’t have to train the model; you use the original dataset directly.
So in scikit-learn, when you call model.fit on a k-NN estimator, nothing really happens.
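To see this in code, here is a minimal scikit-learn sketch on made-up toy data (not the article’s file): fit essentially stores the training set, and the neighbor search only happens at prediction time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: one continuous feature and a target value
X = np.array([[1], [2], [4], [5], [7], [8], [11], [13], [15], [17]])
y = np.array([1.0, 1.4, 2.1, 2.3, 2.9, 3.2, 3.6, 4.1, 4.4, 5.0])

model = KNeighborsRegressor(n_neighbors=3)  # k is the hyperparameter
model.fit(X, y)                             # no real training: the data is simply stored

print(model.predict([[10]]))                # average of the 3 nearest targets
```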
Some may ask: what about k?
Well, k is the hyperparameter. So you have to choose a value for k, and it can be tuned.
Prediction for one new observation
For the hyperparameter k, we will use k=3, since the dataset is very small.
For one feature variable, the distance can simply be the absolute value of the difference between the new observation and the other observations.
In the sheet “algo1D”, you can change the value of the new observation and use the filter on the distance column C to sort the dataset in increasing order; the 3 nearest neighbors will then be plotted.
To make the calculation more automatic, we can use the RANK function to find the smallest observations in terms of distance.
We can also create a column of indicators (column G), with indicator = 1 if the observation belongs to the k nearest neighbors.
Finally, for the prediction, we can use SUMPRODUCT to calculate the average of all y values with indicator = 1.
In the plot,
- the light blue dots represent the dataset
- the red dot represents the new observation with the predicted y value
- the yellow dots represent the 3 nearest neighbors of the new observation (in red)

Let’s recap: the prediction phase consists of the following steps (a code sketch follows the list):
- For a given new observation, calculate the distance between this new observation and all the observations in the training dataset.
- Identify the k observations that have the shortest distance. In Excel, we can use the filter to sort the training dataset manually, or we can use RANK (and an indicator column) to get the top k observations.
- Calculate the predicted value as the average of the target variable over these neighbors, using SUMPRODUCT.
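Here is a minimal from-scratch sketch of these three steps, on hypothetical toy values, mirroring the Excel columns (distance, indicator, then the average):

```python
import numpy as np

x_train = np.array([1, 2, 4, 5, 7, 8, 11, 13, 15, 17])   # feature column
y_train = np.array([1.0, 1.4, 2.1, 2.3, 2.9, 3.2, 3.6, 4.1, 4.4, 5.0])
x_new, k = 10, 3

# Step 1: distance to every training observation (|difference| in 1D)
distances = np.abs(x_train - x_new)

# Step 2: indicator of the k nearest neighbors (like the RANK-based column G)
indicator = np.zeros_like(y_train)
indicator[np.argsort(distances)[:k]] = 1

# Step 3: prediction = average of y over the neighbors (like SUMPRODUCT / k)
y_pred = np.sum(indicator * y_train) / k
print(y_pred)
```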
Prediction for an interval of new observations
In the sheet “” (f for final), I plotted the prediction for a list of new observations, ranging from 1 to 17.
With a programming language, we could do it easily in a loop, and for a larger number of new observations, so the representation would be denser (see the loop sketch after the steps below).
With Excel, I manually repeated the following steps:
- input a value for x
- sort the distance column
- copy-paste the prediction
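In code, the manual repetition could look like this small loop, again on hypothetical toy values:

```python
import numpy as np

x_train = np.array([1, 2, 4, 5, 7, 8, 11, 13, 15, 17])
y_train = np.array([1.0, 1.4, 2.1, 2.3, 2.9, 3.2, 3.6, 4.1, 4.4, 5.0])
k = 3

x_grid = np.linspace(1, 17, 100)            # denser grid than the manual version
y_pred = []
for x_new in x_grid:
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    y_pred.append(y_train[nearest].mean())  # average of the k nearest targets
```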

Effect of the hyperparameter k
The hyperparameter used in k-NN is the number of neighbors that we take into account for the calculation of the average value.
We often use the following graph to explain how a model can be underfitted or overfitted.

In our case, if k is small, there is a risk of overfitting.
If k is large, there is a risk of underfitting.
The extreme case of a very large k is when k equals the total size of the training dataset. Then the prediction is the same for every new observation: it is the global average.

So, we can say that k-NN refines the idea of predicting with an average, by averaging only a few observations that are close to the new observation.
k-NN with Two Continuous Features
Now, we will study the case of two continuous feature variables x1 and x2, and we will only discuss the differences with the previous one-feature situation.
Two continuous feature variables dataset
With two feature variables, I cannot plot in 3D with Excel, so the plot contains only x1 on the x-axis and x2 on the y-axis.
So don’t confuse it with the previous dataset, for which the y-axis represents the target value y.

Prediction with the Euclidean distance
Now that we have two features, we have to take both of them into account.
One usual distance we can use is the Euclidean distance: the square root of the sum of the squared differences of the features.
Then we can use the same process to keep the top k observations that have the minimum distance to the new observation.
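As a sketch, assuming two made-up feature columns, the calculation could look like this:

```python
import numpy as np

# Hypothetical toy data with two features x1 and x2
X = np.array([[1.0, 2.0], [2.0, 1.5], [4.0, 3.0], [5.0, 4.5],
              [7.0, 5.0], [8.0, 6.5], [11.0, 7.0], [13.0, 8.5]])
y = np.array([1.0, 1.2, 2.0, 2.4, 2.9, 3.3, 3.8, 4.2])
x_new, k = np.array([6.0, 5.0]), 3

# Euclidean distance: square root of the sum of squared feature differences
distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))

# Same process as before: average the targets of the k nearest neighbors
neighbors = np.argsort(distances)[:k]
print(y[neighbors].mean())
```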

To get a visual plot, we can use the same colors:
- Blue for the training dataset
- Red for the new observation
- Yellow for the k nearest neighbors found

Impact of the scale of the variables
When you have two features, one question we can ask is the impact of the scale of the features on the result of the prediction.
First, let’s look at this simple example: I multiplied the feature x2 by 10.

Will this scaling impact the predictions? The answer is of course yes.
And we can easily compare them, as in the following image.

It is easy to understand why: the Euclidean distance sums the squared differences of the features, regardless of their scales.
Because of this, the feature that has a large scale will dominate the distance.
When it comes to feature scaling, common operations are standardization (also called centering and reduction) and min–max scaling. The idea is to put all features on a comparable scale.
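Here is a minimal sketch of both operations on a hypothetical two-feature matrix:

```python
import numpy as np

# Hypothetical two-feature matrix where x2 has a much larger scale than x1
X = np.array([[1.0, 20.0], [2.0, 15.0], [4.0, 30.0], [5.0, 45.0], [7.0, 50.0]])

# Standardization (centering and reduction): mean 0, standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: every feature mapped to the [0, 1] range
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_minmax)
```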
BUT, let’s think about this example: what if one feature is expressed in dollars, and the other in yen?
In the real world, the correct relation between the two scales is about 1 dollar = 156 yen (as of November 2025). We know this because we understand the meaning of the units.
How would the model know this? It DOES NOT.
The only hyperparameter is k, and the model does not adjust anything to correct for differences in units or scales. k-NN has no internal mechanism for understanding that two features have different units.
And this is only the beginning of the issues…
k-NN with the California Housing dataset
Now, let’s finally use the real-world California Housing dataset.
With the one-feature dataset, we got the basic idea of how k-NN works. With the two-feature dataset, we saw that the scale of the features matters.
Now, with this real-world dataset, we will see that the heterogeneous nature of the features makes the Euclidean distance meaningless.
We will also see other important ideas that appear when we use k-NN in practice.
Naive application of k-NN regressor
Since all features in this dataset are continuous, we can easily calculate the Euclidean distance. We then define a number k to calculate the average value of the target variable, here MedHouseVal.
In Excel, you can easily do this yourself, or you can support me here and get all the files.
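For reference, a naive version in scikit-learn (no scaling, plain Euclidean distance; the train/test split and k=3 are my assumptions, not the article’s file) could look like this:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Load the California Housing dataset (downloaded on first use)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R² of the naive, unscaled model
```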

Notion of distance based on different features
I said that the previous application is naive because, if you look closer, you will notice these problems:
MedInc (median income) is expressed in units of 10,000 USD. If we decide to express it in units of 100,000 USD or 1,000 USD instead, the prediction will change, because k-NN is sensitive to the scale of the features. We saw this problem before.
Moreover, each feature has a different nature.
- MedInc is an amount of money (in dollars).
- HouseAge is an age in years.
- AveRooms is a count of rooms.
- Population is a number of individuals.
- Latitude and longitude are geographic coordinates.
Therefore, the Euclidean distance is doomed.
Different types of distances
The most common choice is the Euclidean distance, but it is not the only one.
We can also use the Manhattan distance when features represent grid-like movements, and the cosine distance when only the direction matters (as with text embeddings).
Each distance changes how “nearest” is defined, and therefore can change which neighbors k-NN selects.
Depending on the data, other distances can be more appropriate, as in the sketch below.
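A quick sketch with SciPy (assuming it is available) shows how these distances differ on the same pair of hypothetical vectors:

```python
from scipy.spatial.distance import euclidean, cityblock, cosine

u, v = [1.0, 2.0, 3.0], [2.0, 0.0, 3.5]

print(euclidean(u, v))  # straight-line distance
print(cityblock(u, v))  # Manhattan distance: sum of absolute differences
print(cosine(u, v))     # cosine distance: 1 - cosine similarity (direction only)
```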
For instance, with latitude and longitude, we can use the true geographical distance (in meters) instead of a simple Euclidean distance on degrees.
In the California Housing dataset, this is especially useful because we have the exact latitude and longitude of each district.
However, when we try to combine these geographical distances with other variables (such as median income, number of rooms, or population), the problem becomes more complex, because the variables have very different natures and scales.
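As a sketch, the geographical distance can be computed with the haversine formula; the two coordinate pairs below are hypothetical examples, not taken from the dataset:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Two hypothetical points, roughly Los Angeles and San Francisco
print(haversine_km(34.05, -118.24, 37.77, -122.42))
```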

In the cartography renderings below, I used k-NN as a smoothing function to refine the values associated with different areas of Paris.
On the left, each area has only a single value, so from one quarter to its neighboring quarters, there can be a discontinuity in the variable.
On the right, k-NN allows me to estimate a value for every specific address by smoothing the information based on nearby areas.
Furthermore, for indicators such as the proportion of certain professional categories, I also applied population-based weighting so that larger areas have a stronger influence in the smoothing process.

In conclusion, when the situation allows it, choosing a more specific distance helps us better capture the underlying reality.
By linking the distance to the nature of the data, we can make k-NN far more meaningful: geographical distance for coordinates, cosine distance for embeddings, and so on. The choice of distance is not only a technical detail; it changes how the model “sees” the world and which neighbors it considers relevant.
How Categorical Features Can Be Modeled
You may hear that k-NN cannot handle categorical features.
But this is not completely true.
k-NN can work with categorical variables as long as we can define a distance between two observations.
Many people will say: just use one-hot encoding.
Others mention label encoding, or ordinal encoding.
But these methods behave very differently in a distance-based model.
To make this clear, we will use another dataset: the diamond price dataset (CC BY 4.0 license), which includes several features such as carat, cut, color, and clarity.
For simplicity, we will use only carat (numerical) and clarity (categorical) to show a few results.
Predicting Prices with Carat
First, we will start with carat, since you probably know that the price of a diamond depends mainly on the size (carat) of the stone.
The graphic below shows how k-NN can find diamonds with similar sizes to estimate the price.

One-Hot Encoding for Clarity Feature
Now let us look at clarity.
Below is the table of categories with their meanings; we apply one-hot encoding to transform each category into a binary vector.
| Clarity | Meaning |
| --- | --- |
| IF | Internally Flawless |
| VVS1 | Very Very Slightly Included 1 |
| VVS2 | Very Very Slightly Included 2 |
| VS1 | Very Slightly Included 1 |
| VS2 | Very Slightly Included 2 |
| SI1 | Slightly Included 1 |
| SI2 | Slightly Included 2 |
| I1 | Included 1 |
In this table, we see that for the new diamond with clarity VVS2, the nearest neighbors are all diamonds from the same clarity category.
The numerical feature has very little influence on the distance, even though it is a more important feature, as you can see in the price column.

Key issue 1: all categories are equally distant
When using Euclidean distance on one-hot vectors:
- IF vs VVS1 → distance = √2
- IF vs SI2 → distance = √2
- IF vs I1 → distance = √2
Every different category is at the exact same distance.
This doesn’t reflect the true diamond grading scale.
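A small sketch makes this concrete (the one-hot columns are simplified to four grades for illustration):

```python
import numpy as np

# Simplified one-hot vectors for four clarity grades
onehot = {
    "IF":   np.array([1, 0, 0, 0]),
    "VVS1": np.array([0, 1, 0, 0]),
    "SI2":  np.array([0, 0, 1, 0]),
    "I1":   np.array([0, 0, 0, 1]),
}

for grade in ["VVS1", "SI2", "I1"]:
    d = np.linalg.norm(onehot["IF"] - onehot[grade])  # Euclidean distance
    print(grade, d)  # always sqrt(2) ≈ 1.414, whatever the grade
```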
Key issue 2: scaling problem with continuous variables
Because we combine one-hot clarity with carat (a continuous feature), we face another problem:
- carat values in our example are below 1
- clarity vectors have differences of √2 → clarity dominates the distance calculation
So even small changes in clarity overpower the effect of carat.
This is exactly the same scaling issue we face with multiple continuous features, but even stronger.
Ordinal Encoding for Clarity
Now we can try encoding the clarity feature with numerical labels. But instead of using the classic labels 1, 2, 3… we use expert-based labels that reflect the true grading scale.
The idea is to translate the clarity levels into values that behave more like a continuous feature, similar to carat, even if clarity is not strictly continuous.
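Here is a minimal sketch of this idea; the numerical values assigned to the grades are purely illustrative assumptions, not the encoding used in the article’s file:

```python
import numpy as np

# Hypothetical expert-based values for the clarity grades (illustrative only):
# higher means better clarity, on a scale comparable to carat
clarity_value = {"I1": 0.0, "SI2": 0.1, "SI1": 0.2, "VS2": 0.35,
                 "VS1": 0.5, "VVS2": 0.7, "VVS1": 0.85, "IF": 1.0}

# A few hypothetical diamonds: (carat, clarity)
diamonds = [(0.50, "VS1"), (0.52, "VVS2"), (0.90, "SI2"), (0.55, "IF")]
new_diamond = (0.53, "VVS2")

# The distance now combines carat with the encoded clarity, on comparable scales
for carat, clarity in diamonds:
    d = np.sqrt((carat - new_diamond[0]) ** 2
                + (clarity_value[clarity] - clarity_value[new_diamond[1]]) ** 2)
    print(carat, clarity, round(d, 3))
```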

With this expert-based encoding, the distances become more meaningful.
Carat and clarity are now on comparable scales, so neither feature completely dominates the distance calculation.
So we obtain a better balance between size and clarity when selecting neighbors, which gives more realistic predictions.

Conclusions
In conclusion, the k-NN regressor is a highly non-linear, local estimator. It is so local that only the k closest observations are actually used.
After implementing the k-NN regressor in Excel, I feel that we can really ask this question: is the k-NN regressor really a machine learning model?
- There is no model training
- When predicting, the selection of the neighbor observations does not depend on the value of the target variable
But it is very easy to understand, and with Excel, we can implement the whole algorithm ourselves. Moreover, we can adjust the distance as we wish.
Businesspeople can grasp the idea directly: to predict a value, we look at similar observations.
The real problems with k-NN, and with all distance-based models, are:
- the scale of the features
- the heterogeneous nature of the features, which makes the sum meaningless
- the specific distance that must be defined in concrete situations
- for categorical features, label/ordinal encoding could be improved if we could find the optimal scaling
So in short, the issue is the scaling of the features. We might imagine that the scales could be tuned as hyperparameters, but then the tuning would require too much time.
We will see later that this is exactly the motivation behind another family of models.
Here, the notion of scale is also comparable to the notion of feature importance, because in k-NN the importance of each feature is defined before using the model.
So this is only the start of our journey. We will discover together other models that can do better than this simple one, by improving in several directions: feature scaling, moving from distance to probability, splitting to better model each category…
