AutoML — Let Machine Learning Give Your Model Selection a Jump-Start
What Is AutoML?
Implementation
Implementation Notebook
Conclusions
Thanks for Reading!

Human vs. Machine, by DALL.E 2

We use Machine Learning (ML) on a daily basis to solve problems and make predictions. That usually means getting to know the data through exploratory analysis, cleaning it, using our best judgement to decide which ML models to try, and then optimizing hyperparameters and iterating. But what if we could use ML to solve the more meta-level problem of performing all of those steps, including the selection of the best model, instead of manually going through these repetitive and tedious steps ourselves? AutoML is here to oblige!

In this post I will demonstrate how AutoML, with only 3 lines of code and in less than 14 seconds, outperformed a predictive ML model that I had personally developed for a previous post.

My goal in this post is not to propose that AutoML makes scientists and ML practitioners unnecessary. Rather, I want to demonstrate how we can leverage AutoML to make our model selection process more efficient and hence increase our overall productivity. Once AutoML provides us with a comparison of the performance of various ML model families, we can take over and further fine-tune the chosen model to achieve better results.

Let’s get started!

(All images, unless otherwise noted, are by the author.)

Automated Machine Learning, or AutoML, is the process of automating the ML workflow of data cleaning, model selection, training, hyperparameter optimization, and sometimes even model deployment. AutoML was initially developed with the goal of making ML more accessible to non-technical users, and over time it has evolved into a reliable productivity tool even for experienced ML practitioners.
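
To make the model selection part of that definition concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea (fit several candidates, score each on held-out data, rank them), not how AutoGluon works internally. The data are synthetic and the names `fit_mean`, `fit_line` and `select_best` are made up for this example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: simple least-squares line with one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def select_best(candidates, train, valid):
    """Fit every candidate on train, score it on valid, return (name, RMSE) sorted best first."""
    xs_train, ys_train = train
    xs_valid, ys_valid = valid
    board = []
    for name, fit in candidates.items():
        model = fit(xs_train, ys_train)
        score = rmse(ys_valid, [model(x) for x in xs_valid])
        board.append((name, score))
    return sorted(board, key=lambda row: row[1])


# Synthetic data (made up for this sketch): a noisy straight line
random.seed(0)
xs = [float(i) for i in range(40)]
ys = [3.0 * x + 5.0 + random.gauss(0, 1) for x in xs]

# Alternate rows between the training and validation sets
train = (xs[::2], ys[::2])
valid = (xs[1::2], ys[1::2])

leaderboard = select_best({"mean": fit_mean, "line": fit_line}, train, valid)
for name, score in leaderboard:
    print(f"{name}: validation RMSE = {score:.2f}")
```

An AutoML library does the same kind of bookkeeping at a much larger scale, with real model families, hyperparameter search and ensembling on top.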

Now that we understand what AutoML is, let’s move on to seeing it in action.

We will first go through a quick implementation of AutoML using AutoGluon, and then compare the results to a model that I had developed in my post about Linear Regression (linked below).

In order for the comparison to be meaningful, we will be using the same data set of car prices from the UCI Machine Learning Repository (CC BY 4.0). You can download the cleaned-up data from this link and follow the code step by step.

If this is your first time using AutoGluon, you may need to install it in your environment. Installation steps that I followed for Mac using CPU (Python 3.8) are as follows (if you have a different operating system, please visit here for easy instructions):

pip3 install -U pip
pip3 install -U setuptools wheel
pip3 install torch==1.12.1+cpu torchvision==0.13.1+cpu torchtext==0.13.1 -f https://download.pytorch.org/whl/cpu/torch_stable.html
pip3 install autogluon

Now that AutoGluon is ready to use, let’s import the libraries that we will be using.

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularDataset, TabularPredictor

# Show all columns/rows of the dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

Next, we will read the data set into a Pandas data frame.

# Load the data into a dataframe
df = pd.read_csv('auto-cleaned.csv')

Then we will split the data into a train and test set. We will use 30% of the data as the test set and the remainder will be the train set. At this point and for the sake of comparability, I will make sure that we use the same random_state = 1234 that I had used in my other post about Linear Regression so that our train and test sets created here are the same as what I had created in that post.

# Split the data into train and test set
df_train, df_test = train_test_split(df, test_size=0.3, random_state=1234)

print(f"Data includes {df.shape[0]} rows (and {df.shape[1]} columns), broken down into {df_train.shape[0]} rows for training and the balance {df_test.shape[0]} rows for testing.")

The result of running the code above is:

As we see above, the data includes 193 rows across 25 columns. One column is the “price”, which is the target variable that we would like to predict and the remainder are the independent variables used to predict the target variable.
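
As a quick sanity check on the split, the row counts can be worked out by hand. The sketch below assumes that scikit-learn rounds the test share up and gives the remaining rows to the training set, which is my understanding of how `train_test_split` treats a fractional `test_size`. The variable names are made up for this example.

```python
import math

n_rows = 193      # total rows in the data set, as reported above
test_size = 0.3   # same fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(f"{n_train} rows for training, {n_test} rows for testing")
```

Under that assumption, the split works out to 135 training rows and 58 test rows.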

Let’s look at the top five rows of the data just to understand what the data look like.

# Return top five rows of the data frame
df.head()

Results:

Next, let’s talk more about AutoGluon. First, we will create a dictionary of the models that we would like AutoGluon to use and compare for this exercise. Below is a list of these models:

  • GBM: LightGBM
  • CAT: CatBoost
  • XGB: XGBoost
  • RF: Random forest
  • XT: Extremely randomized trees
  • KNN: K-nearest neighbors
  • LR: Linear regression

Then we get to the three lines of code that I promised. These lines correspond to the following steps:

  1. Train (or fit) the model to the training set
  2. Create predictions for the test set using the trained models
  3. Create a leaderboard of the evaluation results of the models

Let’s write the code.

# Run AutoGluon

# Create a dictionary of hyperparameters for the models to be included
hyperparameters_dict = {
'GBM':{},
'CAT':{},
'XGB':{},
'RF':{},
'XT':{},
'KNN':{},
'LR':{},
}

# 1. Fit/train the models
autogluon_predictor = TabularPredictor(label="price").fit(train_data=df_train, presets='best_quality', hyperparameters=hyperparameters_dict)

# 2. Create predictions
predictions = autogluon_predictor.predict(df_test)

# 3. Create the leaderboard
autogluon_predictor.leaderboard(silent=True)

Results:

Leaderboard comparing evaluation results of various ML models

And that is it!

Let’s take a closer look at the leaderboard.

In the final results, the column named “model” shows the names of the models. There are eight of them (note that row numbers range from 0 to 7 for a total of 8): the seven model types that we included in our dictionary, plus “WeightedEnsemble_L2”, which AutoGluon adds as a weighted ensemble of the others. The column named “score_val” is the Root Mean Squared Error (RMSE) on the validation data, multiplied by -1 (AutoGluon does this multiplication by -1 so that a higher number is always better). Models are ranked from the best at the top of the table to the worst at the bottom. In other words, “WeightedEnsemble_L2” is the best model in this exercise, with an RMSE of ~2,142.
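
One thing worth keeping in mind is that “score_val” is computed on validation data that AutoGluon sets aside during training, not on our test set. As far as I know, passing a data set to `leaderboard` adds a “score_test” column scored on that data. The small helper below is a sketch of that call; the function name `leaderboard_on_test` is made up for this example, and `predictor` is assumed to be a fitted `TabularPredictor`.

```python
def leaderboard_on_test(predictor, test_data):
    """Return the leaderboard with models scored on held-out data.

    predictor: a fitted AutoGluon TabularPredictor (assumption)
    test_data: a data frame that still contains the label column
    """
    # Passing data asks AutoGluon to score every model on it
    # in addition to reporting the validation scores.
    return predictor.leaderboard(test_data, silent=True)
```

With the variables from this post, the call would be `leaderboard_on_test(autogluon_predictor, df_test)`.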

Now let’s see how this number compares to the evaluation results of the ML model that I had created in my post about Linear Regression. If you visit that post and search for MSE, you will find an MSE of ~6,725,127, which is equal to an RMSE of ~2,593 (RMSE is just the square root of MSE). Comparing this number to the “score_val” column of the leaderboard shows that my model was better than 4 of the models that AutoGluon tried and worse than the top 4! Remember that I spent quite a bit of time on feature engineering and creating that model in that exercise, while AutoGluon managed to find 4 better models in a little over 13 seconds, using 3 lines of code! That is the power of AutoML in practice.
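
The arithmetic behind this comparison is short enough to write out. The sketch below uses only the two numbers quoted in this post (the MSE of ~6,725,127 and the ensemble’s score of ~-2,142); the variable names and the `rmse` helper are made up for this example.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# MSE of my Linear Regression model, as quoted above
linear_regression_mse = 6_725_127
linear_regression_rmse = math.sqrt(linear_regression_mse)

# Approximate score_val of WeightedEnsemble_L2, as quoted above.
# AutoGluon reports RMSE multiplied by -1, so we flip the sign back.
ensemble_score_val = -2142
ensemble_rmse = -ensemble_score_val

print(f"Linear Regression RMSE: {linear_regression_rmse:.0f}")
print(f"Ensemble RMSE: {ensemble_rmse}")
print(f"Difference: {linear_regression_rmse - ensemble_rmse:.0f}")
```

The same `rmse` helper could also be applied to the predictions from step 2, for example `rmse(df_test["price"], predictions)`, to score the best model directly on the test set.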
