Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning



Introduction

Data science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity throughout the tech community. Every year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as modern deep learning frameworks and state-of-the-art large language models have had a profound impact on our world.

Nevertheless, with the abundance of tools and the continued hype surrounding AI, it can be overwhelming, especially for beginners, to determine which skills to prioritize when aiming for a career in data science. Furthermore, this field is extremely demanding, requiring substantial dedication and perseverance.

The first three parts of this series outlined the essential skills needed to become a data scientist in three key areas: math, software engineering, and machine learning. While knowledge of classical machine learning and neural network algorithms is a great starting point for aspiring data specialists, there are still many important topics in machine learning that must be mastered to work on more advanced projects.

The importance of learning the evolution of methods in machine learning

In contrast to previous articles in this series, I have decided to change the format in which I present the essential skills for aspiring data scientists. Instead of directly listing specific competencies to develop and the motivation behind mastering them, I will briefly outline the most important approaches, presenting them in chronological order as they have been developed and used over the past decades in machine learning.

The reason is that I believe it is crucial to study these algorithms from the very beginning. In machine learning, many newer methods are built upon older approaches, which is especially true for NLP and computer vision.

For example, jumping directly into the implementation details of modern large language models (LLMs) without any preliminary knowledge may make it very difficult for beginners to grasp the motivation and underlying ideas behind specific mechanisms.

# 04. NLP

Natural language processing (NLP) is a broad field that focuses on processing textual information. Machine learning algorithms cannot work directly with raw text, which is why text is usually preprocessed and converted into numerical vectors that are then fed into neural networks.

Before being converted into vectors, words undergo preprocessing, which includes simple techniques such as parsing, stemming, lemmatization, normalization, or removing stop words. After preprocessing, the resulting text is encoded into tokens. Tokens represent the smallest textual elements in a collection of documents. Generally, a token can be a part of a word, a sequence of symbols, or an individual symbol. Ultimately, tokens are converted into numerical vectors.

The bag of words method is the most basic way to encode tokens, focusing on counting the frequency of tokens in each document. In practice, however, this is usually not sufficient, since it is also essential to account for token importance, an idea introduced in the TF-IDF and BM25 methods. While TF-IDF improves upon the naive counting approach of bag of words, researchers have since developed an entirely new approach called embeddings.
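To make the contrast between plain counting and importance weighting concrete, here is a small sketch using scikit-learn (assumed to be installed; the three toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A made-up toy corpus of three short "documents"
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]

# Bag of words: each document becomes a vector of raw token counts
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: tokens that appear everywhere (like "the") are down-weighted
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf_matrix.toarray().round(2))
```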

Embeddings are numerical vectors whose components preserve the semantic meaning of words. For this reason, embeddings play an important role in NLP, enabling input text to be used for model training or inference. Moreover, embeddings can be used to compare text similarity, allowing the most relevant documents to be retrieved from a collection.
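As a minimal illustration of comparing texts through embeddings, the NumPy sketch below computes cosine similarity between hand-made vectors; a real system would obtain these vectors from an embedding model rather than writing them by hand.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means similar direction, close to 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings of a query and two documents
query     = np.array([0.9, 0.1, 0.3, 0.0])
doc_close = np.array([0.8, 0.2, 0.4, 0.1])   # semantically close to the query
doc_far   = np.array([0.0, 0.9, 0.1, 0.7])   # unrelated topic

print(cosine_similarity(query, doc_close))   # high similarity
print(cosine_similarity(query, doc_far))     # low similarity
```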

As a field, NLP has been evolving rapidly over the past 10–20 years to efficiently solve various text-related problems. Complex tasks like text translation and text generation were initially addressed using recurrent neural networks (RNNs), which introduced the concept of memory, allowing neural networks to capture and retain key contextual information in long documents.

Although RNN performance gradually improved, it remained suboptimal for certain tasks. Furthermore, RNNs are relatively slow, and their sequential prediction process does not allow for parallelization during training and inference, making them less efficient. These limitations ultimately motivated the Transformer architecture, which relies on attention mechanisms instead of recurrence and processes tokens in parallel.

Moreover, the original Transformer architecture can be decomposed into two separate modules, which gave rise to BERT and GPT. Both of these form the foundation of the most advanced state-of-the-art models used today to solve various NLP problems. Understanding their principles is valuable knowledge that will help learners advance further when studying or working with other large language models (LLMs).

Transformer architecture

When it comes to LLMs, I strongly recommend studying the evolution of at least the first three GPT models, as they have had a significant impact on the AI world we know today. In particular, I would like to highlight the concepts of few-shot and zero-shot learning, introduced in GPT-2 and GPT-3, which enable LLMs to solve text generation tasks without explicitly receiving any training examples for them.

Another important technique developed in recent years is retrieval-augmented generation (RAG). The limitation of standalone LLMs is that they are only aware of the data used during their training. As a result, they lack knowledge of any information beyond their training data.

To address this limitation, OpenAI researchers developed a RAG pipeline, which includes a constantly updated database containing new information from external sources. When ChatGPT is given a task that requires external knowledge, it queries the database to retrieve the most relevant context and integrates it into the final prompt sent to the machine learning model.

Example of a RAG pipeline

The retriever converts the input prompt into an embedding, which is then used to query a vector database. The database returns the most relevant context based on its similarity to the embedding. This retrieved context is then combined with the original prompt and passed to a generative model. The model processes both the initial prompt and the additional context to generate a more informed and contextually accurate response.
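The sketch below mirrors this retrieve-then-generate flow in plain Python. The embedding and generation functions are stand-in placeholders rather than calls to any real model or vector database; only the overall structure of the pipeline is meant to be illustrative.

```python
import numpy as np

# --- Stand-ins for real components (placeholders, not real APIs) ---
def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

def generate(prompt: str) -> str:
    """Placeholder generator: a real system would call an LLM here."""
    return f"[answer produced from a prompt of {len(prompt)} characters]"

# --- A tiny in-memory "vector database" of documents and their embeddings ---
documents = [
    "The 2024 report shows revenue grew by 12%.",
    "The office cafeteria is closed on weekends.",
    "New employees must complete security training.",
]
db = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, top_k: int = 1) -> list[str]:
    # Score every stored document by cosine similarity to the query embedding
    q = embed(query)
    scored = sorted(
        db,
        key=lambda pair: -np.dot(q, pair[1]) / (np.linalg.norm(q) * np.linalg.norm(pair[1])),
    )
    return [doc for doc, _ in scored[:top_k]]

# --- RAG flow: retrieve context, then combine it with the original prompt ---
question = "How much did revenue grow last year?"
context = "\n".join(retrieve(question))
final_prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(generate(final_prompt))
```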

In the modern era, LLM development has led to models with millions or even billions of parameters. As a consequence, the overall size of these models may exceed the hardware limitations of ordinary computers or small portable devices, which come with many constraints.

This is where optimization techniques become particularly useful, allowing LLMs to be compressed without significantly compromising their performance. The most commonly used techniques today include distillation, quantization, and pruning.

The goal of distillation is to create a smaller model that imitates a larger one. In practice, this means that if the large model makes a prediction, the smaller model is expected to produce a similar result.

Quantization is the process of reducing the memory required to store the numerical values representing a model's weights.

Pruning refers to discarding the least important weights of a model.
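To illustrate the core idea behind quantization (not how any particular library implements it), the snippet below maps 32-bit float weights to 8-bit integers and back, trading a small amount of precision for a roughly fourfold reduction in storage.

```python
import numpy as np

# Pretend these are a layer's weights stored as 32-bit floats
weights = np.random.randn(5).astype(np.float32)

# Symmetric int8 quantization: scale weights into the [-127, 127] integer range
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)    # 1 byte per value instead of 4
dequantized = quantized.astype(np.float32) * scale        # approximate reconstruction

print("original :", weights)
print("quantized:", quantized)
print("recovered:", dequantized)
print("max error:", np.abs(weights - dequantized).max())
```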

Fine-tuning

Regardless of the area in which you would like to specialize, knowledge of fine-tuning is a must-have skill! Fine-tuning is a powerful concept that allows you to efficiently adapt a pre-trained model to a new task.

Fine-tuning is especially useful when working with very large models. For example, imagine you want to use BERT to perform semantic analysis on a specific dataset. While BERT is trained on general data, it might not fully understand the context of your dataset. At the same time, training BERT from scratch on your specific task would require a massive amount of resources.

This is where fine-tuning comes in: it involves taking a pre-trained BERT (or another model) and freezing some of its layers (usually those at the beginning). As a result, BERT is retrained, but this time only on the new dataset provided. Since BERT updates only a subset of its weights and the new dataset is likely much smaller than the original one BERT was trained on, fine-tuning becomes a very efficient technique for adapting BERT's rich knowledge to a specific domain.
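Here is a minimal sketch of this idea with the Hugging Face transformers library; the model name, the number of frozen layers, and the two-example batch are illustrative assumptions, and a real setup would use a proper dataset and training loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT with a fresh classification head (2 classes assumed)
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze the embeddings and the first 8 encoder layers; only the rest gets updated
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

# One illustrative training step on a made-up batch
texts = ["great product, works well", "terrible, broke after a day"]
labels = torch.tensor([1, 0])
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
```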

# 05. Computer vision

As the name suggests, computer vision (CV) involves analyzing images and videos using machine learning. The most common tasks include image classification, object detection, image segmentation, and image generation.

Most CV algorithms are based on neural networks, so it is essential to understand how they work in detail. In particular, CV uses a special kind of network called convolutional neural networks (CNNs). These are similar to fully connected networks, except that they typically begin with a set of specialized mathematical operations called convolutions.
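To make this structure concrete, below is a minimal CNN sketch in PyTorch; the layer sizes are arbitrary and assume 28x28 grayscale inputs, as in MNIST-style images.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutions extract local spatial features from the image
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 14x14 -> 7x7
        )
        # A fully connected head turns the feature maps into class scores
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

# One forward pass on a random batch of 4 single-channel 28x28 images
logits = SimpleCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```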

Computer vision roadmap

The next step is to study the most popular CNN architectures for classification tasks, such as AlexNet, VGG, Inception, and ResNet.

Speaking of the object detection task, the YOLO algorithm is a clear winner. It is not necessary to study all of the dozens of YOLO versions. In reality, going through the original paper of the first YOLO should be sufficient to understand how a relatively difficult problem like object detection is elegantly transformed into both a classification and a regression problem. This approach in YOLO also provides a nice intuition for how more complex CV tasks can be reformulated in simpler terms.

While there are many architectures for performing image segmentation, I would strongly recommend learning about UNet, which introduces an encoder-decoder architecture.

Finally, image generation is probably one of the most difficult tasks in CV. Personally, I consider it an optional topic for learners, as it involves many advanced concepts. Nevertheless, gaining a high-level intuition of how generative adversarial networks (GANs) generate images is a good way to broaden one's horizons.

# 06. Other areas

It would be very hard to present in detail the roadmaps for all existing machine learning domains in a single article. That is why, in this section, I would like to briefly list and explain some of the other most popular areas in data science worth exploring.

First of all, recommender systems (RecSys) have gained a lot of popularity in recent years. They are increasingly implemented in online shops, social networks, and streaming services. The key idea of most algorithms is to take a large initial matrix of all users and items and decompose it into a product of several matrices in a way that associates every user and every item with a high-dimensional embedding. This approach is very flexible, as it then allows different types of comparison operations on embeddings to find the most relevant items for a given user. Moreover, it is much faster to perform analysis on small matrices rather than the original one, which usually has huge dimensions.

Matrix decomposition is one of the most commonly used methods in recommender systems
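The snippet below sketches the decomposition idea with a truncated SVD on a tiny made-up user-item rating matrix. Real recommender systems typically rely on more specialized factorization methods, so this is only meant to show how user and item embeddings arise from one matrix.

```python
import numpy as np

# Made-up ratings: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Truncated SVD: keep k = 2 latent factors
U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
user_embeddings = U[:, :k] * S[:k]     # one k-dimensional vector per user
item_embeddings = Vt[:k, :].T          # one k-dimensional vector per item

# Predicted scores are dot products between user and item embeddings
predictions = user_embeddings @ item_embeddings.T
print(predictions.round(1))
```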

Ranking often goes hand in hand with RecSys. When a RecSys has identified a set of the most relevant items for the user, ranking algorithms are used to sort them and determine the order in which they will be shown or proposed to the user. A good example of their usage is search engines, which rank query results from top to bottom on a web page.

Closely related to ranking, there is also the matching problem, which aims to optimally map objects from two sets, A and B, in such a way that, on average, every object pair is mapped "well" according to a similarity criterion. A typical use case might include distributing a group of students across different university disciplines, where the number of spots in each class is limited.

Clustering is an unsupervised machine learning task whose objective is to split a dataset into several regions (clusters), with each dataset object belonging to one of these clusters. The splitting criteria can vary depending on the task. Clustering is useful because it allows similar objects to be grouped together. Moreover, further analysis can then be applied to treat the objects in each cluster individually.

The goal of clustering is to group dataset objects (on the left) into several categories (on the right) based on their similarity.
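A quick illustration with scikit-learn's KMeans on synthetic two-dimensional points (both the data and the choice of three clusters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic groups of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# Group the points into 3 clusters based on distance to the cluster centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_.round(1))
print(kmeans.labels_[:10])  # cluster index assigned to the first 10 points
```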

Dimensionality reduction is another unsupervised problem, where the goal is to compress an input dataset. When the dimensionality of a dataset is large, it takes more time and resources for machine learning algorithms to analyze it. By identifying and removing noisy features, or those that do not provide much valuable information, the data analysis process becomes considerably easier.
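A common first tool for dimensionality reduction is PCA. The scikit-learn sketch below, with random data standing in for a real dataset, compresses ten correlated features down to two components.

```python
import numpy as np
from sklearn.decomposition import PCA

# A made-up dataset: 200 samples with 10 correlated features
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
data = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

# Keep only the 2 directions that explain most of the variance
pca = PCA(n_components=2)
compressed = pca.fit_transform(data)
print(compressed.shape)                        # (200, 2)
print(pca.explained_variance_ratio_.round(3))  # share of variance kept per component
```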

Similarity search is an area that focuses on designing algorithms and data structures (indexes) to optimize searches in a large database of embeddings (a vector database). More precisely, given an input embedding and a vector database, the goal is to roughly find the most similar embedding in the database relative to the input embedding.

The goal of similarity search is to roughly find the most similar embedding in a vector database relative to a query embedding.

The word "roughly" means that the search is not guaranteed to be 100% precise. Nevertheless, this is the main idea behind similarity search algorithms: sacrificing a bit of accuracy in exchange for significant gains in prediction speed or data compression.
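To ground the idea, here is an exact (brute-force) nearest-neighbor search over random vectors with NumPy. Production systems replace this linear scan with approximate indexes, which is exactly where the speed-for-accuracy trade-off mentioned above comes in.

```python
import numpy as np

rng = np.random.default_rng(42)
database = rng.normal(size=(10_000, 64))   # 10k stored embeddings of dimension 64
query = rng.normal(size=64)

# Normalize so that a dot product equals cosine similarity
database /= np.linalg.norm(database, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Exact search: score every stored vector and keep the 5 most similar ones
scores = database @ query
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k].round(3))
```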

Time series analysis involves studying the behavior of a target variable over time. This problem can be solved using classical tabular algorithms. However, the presence of time introduces new factors that cannot be captured by standard algorithms. For instance:

  • the target variable can have an overall trend, where its values increase or decrease in the long run;
  • the target variable can have a seasonality, which makes its values change based on the current period of time.

Most time series models take both of these factors into account. Generally, time series models are mainly used in financial, stock, or demographic analysis.

Time series data is often decomposed into several components, which include trend and seasonality.
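Assuming the statsmodels library is available, the sketch below decomposes a synthetic monthly series, built from a trend, yearly seasonality, and noise, into its components.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly data: upward trend + yearly seasonality + noise
rng = np.random.default_rng(0)
months = pd.date_range("2015-01-01", periods=96, freq="MS")
values = (
    np.linspace(100, 160, 96)                      # trend
    + 10 * np.sin(2 * np.pi * np.arange(96) / 12)  # seasonality with period 12
    + rng.normal(scale=2, size=96)                 # noise
)
series = pd.Series(values, index=months)

# Additive decomposition into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12).round(1))
```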

Another advanced area I would recommend exploring is reinforcement learning, which fundamentally changes the algorithm design compared to classical machine learning. In simple terms, its goal is to train an agent in an environment to make optimal decisions based on a reward system. By taking an action, the agent receives a reward, which helps it understand whether the chosen action had a positive or negative effect. After that, the agent slightly adjusts its strategy, and the entire cycle repeats.

Reinforcement learning framework. Image adapted by the author. Source: Reinforcement Learning: An Introduction, Second Edition | Richard S. Sutton and Andrew G. Barto
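As a tiny, self-contained example of the action-reward-update loop (a toy corridor problem invented for illustration, not a standard benchmark), the snippet below runs tabular Q-learning on five states where the agent earns a reward for reaching the rightmost cell.

```python
import numpy as np

n_states, n_actions = 5, 2          # corridor cells; actions: 0 = left, 1 = right
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:                 # an episode ends at the rightmost cell
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))

        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0

        # Q-learning update: nudge the estimate toward reward + discounted future value
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state].max() - q_table[state, action]
        )
        state = next_state

print(q_table.round(2))   # the "go right" action should dominate in every state
```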

Reinforcement learning is particularly popular in complex environments where classical algorithms are not capable of solving a problem. Given the complexity of reinforcement learning algorithms and the computational resources they require, this area is not yet fully mature, but it has high potential to gain even more popularity in the future.

Primary applications of reinforcement learning

Currently, the most popular applications are:

  • Games. Existing approaches can design optimal game strategies and outperform humans. The most well-known examples are chess and Go.
  • Robotics. Advanced algorithms can be incorporated into robots to help them move, carry objects, or complete routine tasks at home.
  • Autopilot. Reinforcement learning methods can be developed to automatically drive cars, control helicopters, or pilot drones.

Conclusion

This article was a logical continuation of the previous part and expanded the skill set needed to become a data scientist. While most of the mentioned topics require time to master, they can add significant value to your portfolio. This is especially true for the NLP and CV domains, which are in high demand today.

After reaching a high level of expertise in data science, it is still crucial to stay motivated and consistently push yourself to learn new topics and explore emerging algorithms.

Data science is a constantly evolving field, and in the coming years, we might witness the development of new state-of-the-art approaches that we could not have imagined in the past.

