
Machine Learning Made Intuitive


ML: everything you need to know, with no overcomplicated math

What you might think ML is… (Photo Taken by Justin Cheigh in Billund, Denmark)

What’s Machine Learning?

Sure, the actual theory behind models like ChatGPT is admittedly very difficult, but the underlying intuition behind Machine Learning (ML) is, well, intuitive! So, what's ML?

Machine Learning allows computers to learn from data.

But what does this mean? How do computers use data? What does it mean for a computer to learn? And, first of all, who cares? Let's start with that last question.

Nowadays, data is all around us. So it's increasingly important to use tools like ML, which can find meaningful patterns in data without ever being explicitly programmed to do so! In other words, ML lets us apply generic algorithms to a wide variety of problems successfully.

There are a few main categories of Machine Learning, the main types being supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL). Today I'll describe only supervised learning, though in subsequent posts I hope to elaborate more on unsupervised learning and reinforcement learning.

1 Minute SL Speedrun

Look, I get that you may not want to read this whole article. In this section I'll teach you the very basics (which for a lot of people is all you need to know!) before going into more depth in the later sections.

Supervised learning involves learning how to predict some label using different features.

Imagine you're trying to figure out a way to predict the price of diamonds using features like carat, cut, clarity, and more. Here, the goal is to learn a function that takes as input the features of a particular diamond and outputs the associated price.

Just as humans learn by example, in this case computers do the same. To be able to learn a prediction rule, this ML agent needs "labeled examples" of diamonds, including both their features and their price. The supervision comes from being given the label (price). In practice, it's important to consider whether your labeled examples are actually true, since it's an assumption of supervised learning that the labeled examples are "ground truth".

Okay, now that we've gone over the most fundamental basics, we can get a bit more in depth about the whole data science/ML pipeline.

Problem Setup

Let's use an extremely relatable example, inspired by this textbook. Imagine you're stranded on an island, where the only food is a rare fruit known as "Justin-Melon". Even though you've never eaten Justin-Melon in particular, you've eaten plenty of other fruits, and you don't want to eat fruit that has gone bad. You also know that you can usually tell whether a fruit has gone bad by its color and firmness, so you extrapolate and assume this holds for Justin-Melon as well.

In ML terms, you used prior domain knowledge to determine two features (color, firmness) that you think will accurately predict the label (whether or not the Justin-Melon has gone bad).

But how do you know which colors and which firmness levels correspond to the fruit being bad? Who knows? You simply have to try it out. In ML terms, we need data. More specifically, we need a labeled dataset consisting of real Justin-Melons and their associated labels.

Data Collection/Processing

So you spend the next couple of days eating melons and recording the color, firmness, and whether or not the melon was bad. After a few painful days of continually eating melons that have gone bad, you have the following labeled dataset:

Code by Justin Cheigh

Each row is a particular melon, and each column is the value of the feature/label for the corresponding melon. But notice we have words, since the features are categorical rather than numerical.

Really we need numbers for our computer to process. There are many techniques to convert categorical features to numerical features, ranging from one-hot encoding to embeddings and beyond.

The simplest thing we can do is turn the column "Label" into a column "Good", which is 1 if the melon is good and 0 if it's bad. For now, assume there's some methodology to map color and firmness onto a scale from -10 to 10, in a way that is sensible. For bonus points, think about the assumptions of putting a categorical feature like color on such a scale. After this preprocessing, our dataset might look something like this:
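To make this preprocessing concrete, here's a minimal sketch in plain Python. The category-to-number mappings below are entirely hypothetical — the article only assumes that *some* sensible -10 to 10 scale exists:

```python
# Hypothetical mappings from categories onto a -10..10 scale.
COLOR_SCALE = {"Green": -8, "Yellow": 0, "Orange": 8}
FIRMNESS_SCALE = {"Mushy": -9, "Soft": -3, "Firm": 4, "Hard": 9}

def preprocess(row):
    """Turn one raw (color, firmness, label) record into numbers."""
    color, firmness, label = row
    return {
        "Color": COLOR_SCALE[color],
        "Firmness": FIRMNESS_SCALE[firmness],
        "Good": 1 if label == "Good" else 0,  # "Label" -> binary "Good" column
    }

raw = [("Green", "Mushy", "Bad"), ("Yellow", "Firm", "Good")]
dataset = [preprocess(r) for r in raw]
```

The same idea scales to real tooling (e.g. one-hot encoders in scikit-learn), but the point here is only that every column ends up numeric.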

Code by Justin Cheigh

We now have a labeled dataset, which means we can employ a supervised learning algorithm. Our algorithm should be a classification algorithm, as we're predicting a category: good (1) or bad (0). Classification stands in contrast to regression algorithms, which predict a continuous value like the price of a diamond.

Exploratory Data Analysis

But what algorithm? There are many supervised classification algorithms, ranging in complexity from basic logistic regression to some hardcore deep learning algorithms. Well, let's first take a look at our data by doing some exploratory data analysis (EDA):

Code by Justin Cheigh

The above image is a plot of the feature space: we have two features, and we simply put each example onto a plot with the two axes being the two features. Additionally, we color a point purple if the associated melon was good, and yellow if it was bad. Clearly, with just a little bit of EDA, there's an obvious answer!
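A plot like this can be sketched in a few lines of matplotlib. The melon coordinates below are made up for illustration; only the plotting pattern matters:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical preprocessed melons: (color, firmness, good?)
melons = [(-2, 1, 1), (0, -1, 1), (1, 2, 1),
          (-8, -9, 0), (7, 8, 0), (9, -6, 0)]

xs = [c for c, _, _ in melons]
ys = [f for _, f, _ in melons]
colors = ["purple" if good else "gold" for _, _, good in melons]

plt.scatter(xs, ys, c=colors)
plt.xlabel("Color")
plt.ylabel("Firmness")
plt.title("Justin-Melon feature space")
plt.savefig("feature_space.png")
```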

Code by Justin Cheigh

We should probably classify all points inside the red circle as good melons, while ones outside the circle should be classified as bad melons. Intuitively, this makes sense! For instance, you don't want a melon that's rock solid, but you also don't want it to be absurdly squishy. Rather, you want something in between, and the same is probably true of color as well.

We determined we'd want a decision boundary that is a circle, but this was based only on preliminary data visualization. How would we determine this systematically? This is especially relevant in larger problems, where the answer is not so simple. Imagine hundreds of features. There's no possible way to visualize a 100-dimensional feature space in any reasonable way.

What are we learning?

The first step is to define your model. There are tons of classification models. Since each has its own set of assumptions, it's important to try to make a good choice. To emphasize this, I'll start by making a really bad choice.

One intuitive idea is to make a prediction by weighting each of the factors:

Formula by Justin Cheigh using Embed Fun

For example, suppose our parameters w1 and w2 are 2 and 1, respectively. Also assume our input Justin-Melon has Color = 4 and Firmness = 6. Then our prediction is Good = (2 x 4) + (1 x 6) = 14.
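As a quick sanity check, the weighted sum can be written directly in Python, using the example numbers above:

```python
def predict(color, firmness, w1=2, w2=1):
    """Weighted sum of the two features: w1*Color + w2*Firmness."""
    return w1 * color + w2 * firmness

pred = predict(4, 6)  # (2 * 4) + (1 * 6)
print(pred)  # 14
```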

Our classification (14) is not even one of the valid options (0 or 1). That's because this is actually a regression algorithm. In fact, it's a simple case of the simplest regression algorithm: linear regression.

So, let's turn this into a classification algorithm. One simple way would be this: use linear regression and classify as 1 if the output is greater than a bias term b. In fact, we can simplify by adding a constant term to our model so that we classify as 1 if the output is greater than 0.

In math, let PRED = w1 * Color + w2 * Firmness + b. Then we get:

Formula by Justin Cheigh using Embed Fun

This is certainly better, as we're at least performing a classification, but let's make a plot with PRED on the x-axis and our classification on the y-axis:
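The thresholded model can be sketched directly; the weights and bias below are placeholder values, not learned ones:

```python
def classify(color, firmness, w1=2, w2=1, b=-20):
    """Classify as good (1) iff PRED = w1*Color + w2*Firmness + b > 0."""
    pred = w1 * color + w2 * firmness + b
    return 1 if pred > 0 else 0

# A small change in the features can flip the classification entirely.
print(classify(4, 6))  # PRED = -6 -> 0
print(classify(7, 7))  # PRED =  1 -> 1
```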

Code by Justin Cheigh

This is a bit extreme. A slight change in PRED could change the classification entirely. One solution is to have the output of our model represent the probability that the Justin-Melon is good, which we can do by smoothing out the curve:

Code by Justin Cheigh

This is a sigmoid curve (or logistic curve). So, instead of taking PRED and applying the piecewise activation (Good if PRED ≥ 0), we can apply this sigmoid activation function to get a smoothed-out curve like the one above. Overall, our logistic model looks like this:

Formula by Justin Cheigh using Embed Fun

Here, the sigma represents the sigmoid activation function. Great, so we have our model, and we just need to figure out which weights and biases are best! This process is known as training.
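The sigmoid and the resulting logistic model fit in a few lines of Python:

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def prob_good(color, firmness, w1, w2, b):
    """Logistic model: P(good) = sigma(w1*Color + w2*Firmness + b)."""
    return sigmoid(w1 * color + w2 * firmness + b)

print(sigmoid(0))  # 0.5: exactly on the decision boundary
```

Large positive PRED values map close to 1, large negative ones close to 0, and the transition near 0 is smooth rather than a hard jump.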

Training the Model

Great, so all we need to do is figure out which weights and biases are best! But this is far easier said than done. There are infinitely many possibilities, and what does "best" even mean?

We start with the latter question: what is best? Here's one simple yet powerful answer: the optimal weights are the ones that achieve the highest accuracy on our training set.

So, we just need an algorithm that maximizes accuracy. However, mathematically it's easier to minimize something. In other words, rather than defining a value function, where a higher value is "better", we prefer to define a loss function, where lower loss is better. Although people typically use something like binary cross entropy for (binary) classification loss, we'll just use a simple example: minimize the number of points classified incorrectly.
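This "count the mistakes" loss (often called zero-one loss) is trivial to write down:

```python
def zero_one_loss(y_true, y_pred):
    """Count how many points are classified incorrectly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)

print(zero_one_loss([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 mistakes
```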

To do this, we use an algorithm known as gradient descent. At a very high level, gradient descent works like a nearsighted skier trying to get down a mountain. An important property of a good loss function (and one that our crude loss function actually lacks) is smoothness. If you were to plot our parameter space (parameter values and associated loss on the same plot), the plot would look like a mountain.

So, we start with random parameters, and we therefore likely start with bad loss. Like a skier trying to get down the mountain as fast as possible, the algorithm looks in every direction, trying to find the steepest way down (i.e. how to change the parameters in order to lower the loss the most). But the skier is nearsighted, so they only look a little in each direction. We iterate this process until we end up at the bottom (keen-eyed individuals may notice we might actually end up at a local minimum). At this point, the parameters we end up with are our trained parameters.
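The skier analogy can be sketched in a few lines. This minimizes a toy one-parameter loss rather than the real logistic-regression loss, but the update rule is the same idea:

```python
def grad_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly take a small step opposite the gradient (steepest descent)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)  # the nearsighted skier's small downhill step
    return w

# Toy smooth loss: L(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w_trained = grad_descent(lambda w: 2 * (w - 3), w0=10.0)
print(round(w_trained, 3))  # close to the minimum at w = 3
```

The learning rate `lr` is how far the skier can see: too small and training crawls, too large and the skier overshoots the valley.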

Once you train your logistic regression model, you realize your performance is still really bad, and your accuracy is only around 60% (barely better than guessing!). This is because we're violating one of the model's assumptions. Logistic regression can mathematically only produce a linear decision boundary, but we knew from our EDA that the decision boundary should be circular!
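One hypothetical fix (not necessarily the "more complex model" meant below) is feature engineering: if we add the squared distance from the center of the feature space as a new feature, the circular boundary becomes a linear threshold in that feature, which a linear model *can* learn. A sketch, assuming the good melons live inside a circle of radius 5 around the origin:

```python
def radius_sq(color, firmness):
    """Squared distance from the origin of the feature space."""
    return color ** 2 + firmness ** 2

# In the new feature r^2, the circular boundary r < 5 becomes the
# linear threshold r^2 < 25 -- exactly what a linear model can express.
def classify_circular(color, firmness, threshold=25):
    return 1 if radius_sq(color, firmness) < threshold else 0

print(classify_circular(1, 2))  # inside the circle  -> 1
print(classify_circular(7, 8))  # outside the circle -> 0
```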

With this in mind, you try different, more complex models, and you get one that achieves 95% accuracy! You now have a fully trained classifier capable of differentiating between good Justin-Melons and bad Justin-Melons, and you can finally eat all the tasty fruit you want!

Conclusion

Let's take a step back. In around 10 minutes, you learned a lot about machine learning, including what is essentially the entire supervised learning pipeline. So, what's next?

Well, that's for you to decide! For some, this article was enough to get a high-level picture of what ML actually is. For others, this article may leave a lot of questions unanswered. That's great! Perhaps this curiosity will help you explore the topic further.

For example, in the data collection step we assumed that you would just eat a ton of melons for a few days, without really taking any specific features into account. This is unrealistic. If you ate a green mushy Justin-Melon and it made you violently ill, you would most likely steer away from those melons. In reality, you'd learn through experience, updating your beliefs as you go. This framework is more similar to reinforcement learning.

And what if you knew that one bad Justin-Melon could kill you instantly, and that it was too dangerous to ever try one without being sure? Without those labels, you couldn't perform supervised learning. But perhaps there's still a way to gain insight without labels. This framework is more similar to unsupervised learning.

In future blog posts, I hope to expand analogously on reinforcement learning and unsupervised learning.

Thanks for Reading!
