Auto-Sklearn: How To Boost Performance and Efficiency Through Automated Machine Learning
What Is Auto-Sklearn?
Practical Example
Auto-Sklearn 2.0: What’s New?
Conclusion

Lollipop plot showing the various models of the ensemble and their respective weights.
Image by the author.

Many of us are familiar with the challenge of selecting a suitable machine learning model for a specific prediction task, given the vast number of models to choose from. On top of that, we also need to find optimal hyperparameters in order to maximize our model’s performance.

These challenges can largely be overcome through automated machine learning, or AutoML. I say largely because, despite its name, the process is not fully automated and still requires some manual tweaking and decision-making by the user.

Essentially, AutoML frees the user from the daunting and time-consuming tasks of data preprocessing, model selection, hyperparameter optimization, and ensemble building. As a result, this toolkit not only saves valuable time for experts, but also enables non-technical users to break into the field of machine learning. In the words of the authors:

Automated Machine Learning provides methods and processes to make Machine Learning available for non-Machine Learning experts, to enhance efficiency of Machine Learning and to speed up research on Machine Learning.

While there are many AutoML packages out there, such as AutoWEKA, Auto-PyTorch, or MLBoX, this article will focus on Auto-Sklearn, a library built on top of the popular scikit-learn package.

What Is Auto-Sklearn?

Auto-Sklearn is a Python-based, open-source library that automates machine learning processes such as data and feature preprocessing, algorithm selection, hyperparameter optimization, and ensemble building. To achieve this high degree of automation, the library leverages recent advances in Bayesian optimization and also takes into account past performance on similar datasets.

More specifically, it improves upon previous methods in three key ways. First, it introduces the concept of warm-starting Bayesian optimization, which allows for efficient hyperparameter tuning across multiple datasets by leveraging information learned from previous runs. Second, it enables automated ensemble construction from the models considered by Bayesian optimization, further improving model performance. Finally, Auto-Sklearn provides a highly parameterized machine learning framework that includes high-performing classifiers and preprocessors from scikit-learn, allowing for flexible and customizable model building.
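To make the warm-start idea more concrete, here is a minimal sketch in plain Python. It assumes that every previously seen dataset is summarized by a few numeric meta-features and has a known best configuration. The dataset names, meta-feature values, and configurations are made up for illustration, and the plain Euclidean distance is only a stand-in for whatever similarity measure the library actually uses. The configurations of the most similar past datasets would then serve as the first candidates for Bayesian optimization.

```python
import math

# Hypothetical meta-knowledge: meta-features of past datasets
# (log number of samples, log number of features, number of classes)
# and the configuration that worked best on each of them.
PAST_RUNS = {
    "dataset_a": {
        "meta": (7.5, 4.2, 10.0),
        "best_config": {"classifier": "random_forest", "n_estimators": 200},
    },
    "dataset_b": {
        "meta": (11.0, 2.3, 2.0),
        "best_config": {"classifier": "gradient_boosting", "learning_rate": 0.1},
    },
    "dataset_c": {
        "meta": (7.0, 4.0, 8.0),
        "best_config": {"classifier": "svc", "C": 10.0},
    },
}


def warm_start_configs(new_meta, past_runs, k=2):
    """Return the best configs of the k past datasets closest to new_meta."""
    ranked = sorted(
        past_runs.items(),
        key=lambda item: math.dist(new_meta, item[1]["meta"]),
    )
    return [run["best_config"] for _, run in ranked[:k]]


# Meta-features of a digits-like dataset: 1,797 samples, 64 features, 10 classes.
new_dataset_meta = (math.log(1797), math.log(64), 10.0)
initial_configs = warm_start_configs(new_dataset_meta, PAST_RUNS, k=2)
print(initial_configs)
```

Here the two closest past datasets are dataset_a and dataset_c, so their configurations would be tried first instead of starting the search from scratch.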

In total, Auto-Sklearn contains 16 classifiers, 14 feature preprocessing methods, and numerous data preprocessing methods, which collectively give rise to a hypothesis space with 122 hyperparameters. These numbers are constantly evolving with new releases.
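The following toy example illustrates what such a hypothesis space looks like in principle: choosing the algorithm is itself a hyperparameter, and each algorithm brings along hyperparameters that only matter when it is selected. The three classifiers and their value ranges are hypothetical and far smaller than the real search space; this is a sketch of the idea, not Auto-Sklearn's actual configuration space.

```python
import random

# Hypothetical, heavily simplified search space: each classifier brings its
# own hyperparameters, which only matter if that classifier is chosen.
SEARCH_SPACE = {
    "decision_tree": {"max_depth": (1, 20), "min_samples_leaf": (1, 10)},
    "k_nearest_neighbors": {"n_neighbors": (1, 50)},
    "random_forest": {
        "n_estimators": (10, 500),
        "max_depth": (1, 20),
        "min_samples_leaf": (1, 10),
    },
}


def sample_configuration(space, rng):
    """Pick a classifier, then sample only its own hyperparameters."""
    classifier = rng.choice(sorted(space))
    params = {
        name: rng.randint(low, high)
        for name, (low, high) in space[classifier].items()
    }
    return {"classifier": classifier, **params}


# One top-level choice (which classifier) plus every conditional hyperparameter.
total_hyperparameters = 1 + sum(len(params) for params in SEARCH_SPACE.values())

rng = random.Random(0)
config = sample_configuration(SEARCH_SPACE, rng)
print(total_hyperparameters, config)
```

Even this tiny space already has 7 hyperparameters, which gives a sense of how 16 classifiers and 14 feature preprocessors can add up to more than a hundred.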

The implementation of this library is pretty straightforward. In fact, the trickiest part is its installation, since it is incompatible with Windows and reportedly also has some issues on Mac. It’s therefore recommended to run it on a Linux operating system (pro tip: Google Colab runs on Linux, so you can use that as your experimentation playground).

Once installed, Auto-Sklearn can be run with just four lines of code:

import autosklearn.classification

clf = autosklearn.classification.AutoSklearnClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

Nonetheless, some manual tweaking and parameterization is still recommended to align the user’s intent with the model’s output. Let’s now take a look at how Auto-Sklearn can be used in practice.

Practical Example

In this example, we’ll compare the performance of a single classifier with default parameters (in this case, a decision tree classifier) with that of Auto-Sklearn. To do so, we’ll use the publicly available Optical Recognition of Handwritten Digits dataset, in which each sample consists of an 8×8 image of a digit, so the dimensionality is 64. In total, this dataset comprises 1,797 samples that are assigned to 10 unique classes (~180 samples per class).

Here are some samples from this dataset:

Plot showing samples of the digit dataset.
Image by the author. License information for data usage: CC BY 4.0.

The dataset can be loaded into Python and split into train and test sets as follows:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Benchmark: Decision Tree Classifier

First, let’s train a simple decision tree with default parameters on this dataset and see how well it performs under these conditions.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

This simple approach yields an accuracy of 86.67%, which is decent but not exactly extraordinary. Let’s see if we can outperform this result with Auto-Sklearn.

Auto-Sklearn

Before running this, let’s define some parameters first:

  • time_left_for_this_task: time limit in seconds for the total duration of the search. The higher this limit, the higher the chances of finding better models. The default value is 3600, which corresponds to 1 hour.
  • per_run_time_limit: time limit in seconds for a single call to the machine learning model. If the algorithm exceeds this limit, model fitting will be terminated.
  • ensemble_size: number of models added to the ensemble. This can be set to 1 if no ensemble fit is desired.
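To get a rough feel for how the first two parameters interact, it helps to ask how many model fits are guaranteed to fit into the overall budget if every single fit runs into its time limit. The helper below is a hypothetical back-of-the-envelope calculation and not part of Auto-Sklearn; it assumes that fits run one after another and ignores overhead such as data loading and ensemble building.

```python
def min_full_length_runs(time_left_for_this_task, per_run_time_limit):
    """Lower bound on sequential fits if every fit used its full time limit."""
    if per_run_time_limit <= 0:
        raise ValueError("per_run_time_limit must be positive")
    return time_left_for_this_task // per_run_time_limit


# Illustrative values: a 3-minute search with 30 seconds per model fit.
short_search = min_full_length_runs(3 * 60, 30)
# The default overall budget of one hour with the same per-run limit.
default_search = min_full_length_runs(3600, 30)

print(short_search, default_search)  # 6 120
```

In practice, many fits finish well before their limit, so the number of evaluated models is usually higher than this worst-case figure.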

Now we can fit a model using Auto-Sklearn. We’ll let the task run for 3 minutes and limit the time for a single model call to 30 seconds:

import autosklearn.classification
from sklearn.metrics import accuracy_score

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3 * 60,
    per_run_time_limit=30,
)
automl.fit(X_train, y_train)
y_pred = automl.predict(X_test)
accuracy_score(y_test, y_pred)

This gives us an accuracy of 98.67%, a remarkable increase over our simplistic benchmark.
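As a quick sanity check, the two accuracies can be translated into counts of misclassified digits. The sketch below assumes the default 25% test split of train_test_split (no test_size was passed when splitting the data) and that the reported accuracies are rounded to two decimals; both are assumptions rather than something stated explicitly in the text.

```python
import math

n_samples = 1797
test_fraction = 0.25  # train_test_split default when test_size is not given

# scikit-learn rounds the test set size up and gives the rest to training.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test


def correct_predictions(accuracy, n):
    """Number of correct predictions implied by a rounded accuracy."""
    return round(accuracy * n)


tree_correct = correct_predictions(0.8667, n_test)
automl_correct = correct_predictions(0.9867, n_test)

print(n_train, n_test)                                 # 1347 450
print(n_test - tree_correct, n_test - automl_correct)  # 60 6
```

Under those assumptions, the decision tree misclassifies 60 of the 450 test digits, while the Auto-Sklearn ensemble misclassifies only 6.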

We can obtain some further insights into the training process with the sprint_statistics() method:

print(automl.sprint_statistics())
Screenshot showing model statistics, such as best validation score, number of successful target algorithm runs, etc.
Screenshot by the author.

For instance, we can see that our best validation score was 98.88%, that 23 out of 30 algorithm runs were successful, 6 timed out, and 1 exceeded the memory limit. Based on that, we could increase the time limit parameters to see if that would improve performance even further.

Using the leaderboard() method, we can also view a table of results for all evaluated models (FYI: this table was visualized as a lollipop plot, which is the feature image of this article):

print(automl.leaderboard())
Table showing the model IDs, their rank, ensemble weight, type, cost, and duration.
Screenshot by the author.
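To give some intuition for where ensemble weights like the ones in the leaderboard can come from, here is a toy version of greedy ensemble selection in plain Python. Models are added one at a time, with replacement, always picking the one that lowers the validation error of the averaged predictions the most; a model's weight is simply how often it was picked. The three models and their predicted probabilities are invented for illustration, and the code is a sketch of the general technique rather than Auto-Sklearn's own implementation.

```python
from collections import Counter


def error_rate(probabilities, y_true):
    """Fraction of samples whose highest-probability class is wrong."""
    predictions = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return sum(pred != true for pred, true in zip(predictions, y_true)) / len(y_true)


def average(prob_sets):
    """Element-wise mean of several models' class-probability predictions."""
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [sum(model[i][c] for model in prob_sets) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


def greedy_ensemble_selection(model_probs, y_true, ensemble_size):
    """Repeatedly add (with replacement) the model that yields the lowest error."""
    chosen = []
    for _ in range(ensemble_size):
        best_name = min(
            sorted(model_probs),
            key=lambda name: error_rate(
                average([model_probs[m] for m in chosen + [name]]), y_true
            ),
        )
        chosen.append(best_name)
    counts = Counter(chosen)
    return {name: counts[name] / ensemble_size for name in sorted(counts)}


# Hypothetical validation labels and per-model class probabilities.
y_valid = [0, 1, 1, 0]
model_probs = {
    "model_a": [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]],
    "model_b": [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.6, 0.4]],
    "model_c": [[0.05, 0.95], [0.9, 0.1], [0.45, 0.55], [0.1, 0.9]],
}

weights = greedy_ensemble_selection(model_probs, y_valid, ensemble_size=4)
print(weights)
```

In this toy run, model_a and model_b each make one mistake on their own but none when averaged, so both end up in the ensemble, while the much weaker model_c receives no weight at all.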

More details about the individual models that went into the ensemble can be obtained through the show_models() method:

from pprint import pprint
pprint(automl.show_models(), indent=2)
Data showing two examples of model details, including model type, parameters, etc.
Only showing 2 out of 15 models here. Screenshot by the author.

While ensembles can certainly boost model performance and robustness, they do have some downsides such as increased complexity, increased training time, and lack of interpretability. Ensemble fitting can be deactivated as follows: ensemble_size=1.

Auto-Sklearn 2.0: What’s New?

Last year, various improvements to Auto-Sklearn were released in a paper titled “Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning”. Among the most important improvements are (1) early stopping, which increases efficiency and ensures that a result is obtained even if training times out, (2) improved model selection strategies, which include the integration of multiple approaches to approximate the generalization error and the addition of Bayesian Optimization and Hyperband (BOHB), a versatile tool for hyperparameter optimization at scale, and (3) automated policy selection through meta-learning, which relieves the user from choosing the configuration of the AutoML system.
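Hyperband, the bandit component of BOHB, is built around a simple idea called successive halving: evaluate many configurations on a small budget, keep only the best fraction, and give the survivors a larger budget. The sketch below shows that idea in plain Python. The candidate configurations and the toy scoring function are hypothetical, and the code is meant as an illustration of the principle rather than the implementation used in Auto-Sklearn 2.0.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while growing the budget."""
    survivors = list(configs)
    budget = min_budget
    schedule = []  # (number of configs evaluated, budget per config) per round
    while len(survivors) > 1:
        schedule.append((len(survivors), budget))
        # Score every surviving config on the current, still cheap budget.
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0], schedule


def toy_score(config, budget):
    # Hypothetical learning curve: the score approaches the config's
    # final quality as the budget grows.
    return config["quality"] * budget / (budget + 1)


candidates = [{"name": f"config_{i}", "quality": 0.5 + 0.05 * i} for i in range(8)]
winner, schedule = successive_halving(candidates, toy_score)
print(winner["name"], schedule)
```

With eight candidates and eta=2, the schedule runs 8 configurations on budget 1, 4 on budget 2, and 2 on budget 4, so most of the compute goes to the configurations that looked promising early on.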

Conclusion

To summarize, Auto-Sklearn is a powerful and user-friendly library that relieves the user of the rather challenging and time-consuming tasks of data and feature preprocessing, model selection, hyperparameter tuning, and, if desired, ensemble building. This has been shown to significantly increase the performance and efficiency of various machine learning tasks. Despite requiring some user input, Auto-Sklearn is reasonably automated and therefore also allows novices and non-technical users to implement sophisticated machine learning solutions in just a few lines of code.

Interested in trying it yourself? Check out the many examples made available by the AutoML community.
