Essential Challenges of Machine Learning


In this era of Generative AI, it seems the basics of machine learning are sometimes overlooked. Achieving the level of accuracy and precision seen in popular models such as GPT-4 requires a major investment of time and effort, as well as overcoming numerous obstacles. That's why I'm writing this blog post: to shed light on some of the challenges involved in building machine learning models and to emphasize the tremendous amount of work required to build a reliable model.


For a child to learn what a cat or a dog is, you only need to show them one once, and they can recognize it most of the time, even across different colors and breeds. For machine learning to learn cats and dogs, however, it can require thousands or even millions of images.

In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different machine learning algorithms, including fairly simple ones, performed almost identically once they were given enough data. It should be noted, though, that small and medium-sized datasets are still very common, and it is not always easy or cheap to get additional training data.

Let's take the example of building a model to predict animal classes. Imagine that your training data contains only images of cats and dogs, but your test data includes images of horses and rabbits. In this scenario, your training data is not representative of the real-world problem you are trying to solve. You might wonder why anyone would use a model that only classifies cats and dogs to classify horses and rabbits; the point is that the training data must be diverse and representative of the real-world problem to achieve accurate results.
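To make this concrete, here is a minimal sketch of a nearest-centroid classifier trained only on cats and dogs. The feature values are invented purely for illustration, and this is not any particular library's API. No matter what you feed it, it can only ever answer "cat" or "dog":

```python
# Minimal sketch: a nearest-centroid classifier trained only on "cat" and
# "dog" examples. Feature values are invented for illustration.

def train_centroids(examples):
    """Average the feature vectors of each class into one centroid."""
    grouped = {}
    for label, features in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def predict(centroids, features):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], features)),
    )

# Toy features: (weight in kg, ear length in cm)
training_data = [
    ("cat", [4.0, 6.0]), ("cat", [5.0, 7.0]),
    ("dog", [20.0, 10.0]), ("dog", [30.0, 12.0]),
]
centroids = train_centroids(training_data)

# A horse (~500 kg) is far outside the training distribution, yet the
# model can only ever answer "cat" or "dog" -- here it says "dog".
print(predict(centroids, [500.0, 20.0]))  # -> dog
```

The model isn't "wrong" by its own logic; it was simply never shown the part of the world it is being asked about.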

Let me give a different example that may make more sense. Suppose you are trying to predict crude oil prices using data that includes only crude oil prices from the previous 20 years. Even if you use the best possible model, you may not be able to reduce your errors. The reason is that your training data is not representative of the complex web of factors that affect crude oil prices, such as the balance between supply and demand, geopolitical events, OPEC policies, economic conditions, natural disasters, and currency exchange rates. Therefore, to achieve accurate results, you need to incorporate a more diverse range of data that reflects the real-world factors driving crude oil prices.

Still, the best-known example of nonrepresentative data is the 1936 Literary Digest poll, which predicted a landslide for Alf Landon over Franklin D. Roosevelt: its enormous sample was drawn largely from telephone directories and the magazine's own subscriber lists, skewing it toward wealthier voters.

If your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well.

It is usually well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend close to 80% of their time doing just that.
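As a sketch of what one common cleaning step can look like, here is a simple, hypothetical outlier filter. It uses the median and the median absolute deviation, which are robust to the very outliers being hunted; the sensor readings are made up:

```python
# Hypothetical cleaning step: drop values far from the median, measured
# in units of the median absolute deviation (MAD).
import statistics

def drop_outliers(values, k=3.0):
    """Keep only values within k MADs of the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= k * mad]

readings = [21.0, 22.5, 20.8, 21.7, 22.1, 98.6]  # 98.6 is a bad measurement
print(drop_outliers(readings))
```

A mean-and-standard-deviation rule can fail here, because a single extreme value inflates the standard deviation enough to hide itself; median-based statistics don't have that problem.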

Example of an outlier

As the saying goes, "garbage in, garbage out." The performance of a machine learning model is highly dependent on the relevance of the features present in the dataset. If your data contains too many irrelevant features, your model will learn from those irrelevant features as well, which ultimately hurts its performance.
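One simple way to spot an irrelevant feature is to check how weakly it relates to the target. The sketch below, with invented feature names and synthetic data, scores features by absolute Pearson correlation with the target:

```python
# Sketch: score each candidate feature by its absolute Pearson
# correlation with the target. Names and data are invented.
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
sq_meters = [random.uniform(30, 200) for _ in range(200)]
noise_feature = [random.random() for _ in range(200)]  # pure noise
price = [300 * m + random.gauss(0, 500) for m in sq_meters]

for name, feature in [("sq_meters", sq_meters), ("noise_feature", noise_feature)]:
    print(f"{name}: |r| = {abs(pearson(feature, price)):.2f}")
```

The noise feature scores near zero and can be dropped; keeping it would only give the model something spurious to latch onto.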

A critical part of the success of a machine learning project is coming up with a good set of features to train on. This process, called feature engineering, typically involves feature selection (choosing the most useful features to train on), feature extraction (combining existing features to produce a more useful one), and creating new features by gathering new data.

Say, for example, you join your master's program and meet a person who is not social at all, and you conclude that everyone in the program is unsocial and unhelpful. Here you are overgeneralizing after interacting with only one person; that person might well be an outlier in the class.

Overgeneralizing is something we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In machine learning this is called overfitting: the model performs well on the training data, but it does not generalize well.

Just imagine our model learning from the messy room below that this is the best way to keep everything in a room.

A photo depicting a messy room

Overfitting often happens when there is a lot of noise in the data or when there is not enough data to learn from.
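Here is overfitting in miniature, on synthetic data: a "model" that simply memorizes every training point gets zero error on the training set but a large error on unseen data, while a plain least-squares line generalizes.

```python
# Toy overfitting demo: memorization vs. a least-squares line.
# All data is synthetic: y = 2x plus Gaussian noise.
import random

random.seed(1)
train = [(x, 2 * x + random.gauss(0, 1)) for x in range(10)]
test = [(x, 2 * x + random.gauss(0, 1)) for x in range(10, 20)]

# Overfit model: a lookup table over the training set; for unseen x it
# falls back to the nearest memorized point's y.
table = dict(train)
def memorizer(x):
    nearest = min(table, key=lambda k: abs(k - x))
    return table[nearest]

# Simpler model: least-squares line fit to the training data.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx
def line(x):
    return slope * x + intercept

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print("memorizer  train/test MSE:", mse(memorizer, train), mse(memorizer, test))
print("linear fit train/test MSE:", mse(line, train), mse(line, test))
```

The memorizer's training error is exactly zero, which looks impressive until you see its test error: it learned the noise and the coincidences, not the trend.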

An example of underfitting in real life could be a student studying for an exam. If the student only reads the material once and does not practice any questions or review the material in depth, they may not perform well on the exam. This is because they have not learned the material well enough to apply it to different scenarios or questions.

In this case, the student is underfitting the material by not studying it thoroughly enough to gain a good understanding of it. This is similar to underfitting in machine learning, where the model is not complex enough to accurately capture the relationships between the features and the target variable.

Just as the student must practice and review the material to perform well on the exam, a machine learning model must be trained with sufficient data and an appropriate level of complexity to avoid underfitting and achieve good performance on the task at hand.
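And the mirror image of the earlier memorization trap, again on synthetic data: a constant "predict the mean" model is too simple for data with an obvious linear trend, so it underfits badly.

```python
# Toy underfitting demo: a constant model vs. a least-squares line
# on synthetic data with a clear linear trend (y = 3x + 5 plus noise).
import random

random.seed(2)
data = [(x, 3 * x + 5 + random.gauss(0, 1)) for x in range(30)]

mean_y = sum(y for _, y in data) / len(data)
def constant_model(x):
    return mean_y  # ignores x entirely -- too simple for this data

n = len(data)
mx = sum(x for x, _ in data) / n
slope = sum((x - mx) * (y - mean_y) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
intercept = mean_y - slope * mx
def linear_model(x):
    return slope * x + intercept

def mse(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts) / len(pts)

print("constant model MSE:", round(mse(constant_model, data), 1))
print("linear model MSE:  ", round(mse(linear_model, data), 1))
```

The constant model's error is roughly the variance of the targets, orders of magnitude worse than the line's: no amount of extra training data fixes a model that cannot express the underlying pattern.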

These are only a few of the challenges that a data scientist or machine learning engineer faces when building an effective machine learning model. There are ways to overcome these challenges, but for now I want to focus on understanding the challenges themselves. In my next post, I'll delve into ways to tackle them.

For future updates, please follow me.


