LightAutoML: AutoML Solution for a Large Financial Services Ecosystem


Although AutoML rose to popularity only a few years ago, early work on AutoML dates back to the early 1990s, when scientists published the first papers on hyperparameter optimization. It was in 2014, when ICML organized the first AutoML workshop, that AutoML caught the attention of ML developers. One of the major focuses of AutoML over the years has been the hyperparameter search problem, where the model applies an array of optimization methods to determine the best-performing hyperparameters in a large hyperparameter space for a particular machine learning model. Another method commonly implemented by AutoML systems is to estimate the probability that a particular hyperparameter is the optimal hyperparameter for a given machine learning model; this is typically achieved with Bayesian methods that use historical data from previously estimated models and other datasets. In addition to hyperparameter optimization, other methods attempt to select the best models from a space of modeling alternatives.

In this article, we will cover LightAutoML, an AutoML system developed primarily for a large European company operating in the finance sector, along with its ecosystem. The LightAutoML framework is deployed across various applications, and the results demonstrate performance comparable to the level of data scientists while building high-quality machine learning models. The LightAutoML framework makes the following contributions. First, the framework was developed primarily for the ecosystem of a large European financial and banking institution. Owing to its design and architecture, LightAutoML is able to outperform state-of-the-art AutoML frameworks across several open benchmarks as well as ecosystem applications. Its performance has also been compared against models tuned manually by data scientists, and the results indicated stronger performance by LightAutoML.

This article aims to cover the LightAutoML framework in depth, exploring its mechanism, methodology, and architecture, along with its comparison against state-of-the-art frameworks. So let's begin.

Although researchers first began working on AutoML in the early-to-mid 1990s, AutoML has attracted a major share of attention over the past few years, with some of the prominent industrial solutions that automatically build machine learning models being Amazon's AutoGluon, DarwinAI, H2O.ai, IBM Watson AI, Microsoft AzureML, and many more. A majority of these frameworks implement a general-purpose AutoML solution that develops ML-based models automatically across different classes of applications in financial services, healthcare, education, and beyond. The key assumption behind this horizontal, generic approach is that the process of developing automated models remains the same across all applications.

The LightAutoML framework, however, implements a vertical approach: an AutoML solution that is not generic, but rather caters to the needs of individual applications, in this case a large financial institution. LightAutoML is a vertical AutoML solution that focuses on the requirements of that complex ecosystem along with its characteristics. First, the LightAutoML framework provides fast and near-optimal hyperparameter search. Although the model does not optimize these hyperparameters directly, it does manage to deliver satisfactory results. Moreover, it keeps the balance between speed and hyperparameter optimization dynamic, to ensure the model is optimal on small problems and fast enough on large ones. Second, the LightAutoML framework purposefully limits the range of machine learning models to only two types, linear models and GBMs (gradient boosted decision trees), instead of implementing large ensembles of different algorithms. The primary reason for limiting the range of models is to speed up the execution time of the framework without negatively affecting performance for the given type of problem and data. Third, the LightAutoML framework presents a unique method of selecting preprocessing schemes for the different features used in the models, on the basis of certain selection rules and meta-statistics. The LightAutoML framework has been evaluated on a wide range of open data sources across a wide range of applications.

LightAutoML: Methodology and Architecture

The LightAutoML framework consists of modules called Presets that are dedicated to end-to-end model development for typical machine learning tasks. At present, the LightAutoML framework supports four Preset modules. First, the TabularAutoML Preset focuses on solving classical machine learning problems defined on tabular datasets. Second, the White-Box Preset implements simple interpretable algorithms, such as Logistic Regression over WoE (Weight of Evidence) encoding and discretized features, to solve binary classification tasks on tabular data. Implementing simple interpretable algorithms is a common practice when modeling the probability of an application, owing to the interpretability constraints posed by various factors. Third, the NLP Preset is capable of combining tabular data with NLP (Natural Language Processing) tools, including pre-trained deep learning models and specific feature extractors. Finally, the CV Preset works with image data with the help of some basic tools. It is important to note that although the LightAutoML model supports all four Presets, the framework only uses TabularAutoML in its production-level system.
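As a concrete illustration, here is a minimal sketch of running the TabularAutoML Preset with the open-source lightautoml package; the file names and the 'TARGET' column are hypothetical, and argument names may vary between package versions.

```python
import pandas as pd

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

# Hypothetical train/test files with a binary 'TARGET' column.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# One-hour time budget; the Preset handles auto-typing, preprocessing,
# model selection, tuning, and ensembling internally.
automl = TabularAutoML(task=Task('binary'), timeout=3600)
oof_preds = automl.fit_predict(train, roles={'target': 'TARGET'})
test_preds = automl.predict(test)
```

The roles dictionary is the only mandatory hint: everything else is inferred from the data, which is exactly the end-to-end behavior the Preset concept is meant to provide.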

The typical pipeline of the LightAutoML framework is shown in the following image.

Each pipeline consists of three components. First, the Reader is an object that receives the task type and raw data as input, performs the necessary metadata calculations, cleans the initial data, and figures out the data manipulations to be performed before fitting different models. Second, LightAutoML inner datasets contain CV iterators and metadata that implement validation schemes for the datasets. Third, multiple machine learning pipelines are stacked and/or blended to obtain a single prediction. A machine learning pipeline within the architecture of the LightAutoML framework is one of multiple machine learning models that share a single data validation and preprocessing scheme. The preprocessing step may have up to two feature selection steps and a feature engineering step, or may be empty if no preprocessing is required. The ML pipelines can be computed independently on the same datasets and then blended together using averaging (or weighted averaging). Alternatively, a stacking ensemble scheme can be used to build multi-level ensemble architectures.
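To make the blending step concrete, here is a minimal sketch of weighted averaging over out-of-fold predictions from two hypothetical pipelines; the prediction values and weights are invented for illustration.

```python
import numpy as np

# Hypothetical out-of-fold predictions from two independent ML pipelines
# (say, a linear model and a GBM) on the same validation rows.
linear_oof = np.array([0.20, 0.70, 0.40, 0.90])
gbm_oof = np.array([0.30, 0.90, 0.10, 0.80])

def blend(predictions, weights):
    """Weighted average of pipeline predictions; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return np.tensordot(w, np.stack(predictions), axes=1)

blended = blend([linear_oof, gbm_oof], weights=[0.4, 0.6])
print(blended)  # a single prediction vector built from both pipelines
```

Because each pipeline is computed independently on the same dataset, blending like this is cheap; stacking differs in that a higher-level model is trained on these out-of-fold predictions instead of a fixed weighted sum.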

LightAutoML Tabular Preset

Within the LightAutoML framework, TabularAutoML is the default pipeline, and it is implemented to solve three types of tasks on tabular data: binary classification, regression, and multi-class classification, for a wide range of performance metrics and loss functions. A table with the following four types of columns is fed to the TabularAutoML component as input: categorical features, numerical features, timestamps, and a single target column with class labels or continuous values. One of the primary objectives behind the design of the LightAutoML framework was to build a tool for fast hypothesis testing, which is a major reason why the framework avoids brute-force methods for pipeline optimization and focuses only on efficient techniques and models that work across a wide range of datasets.
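For illustration, the three task types can be declared with the lightautoml Task object as sketched below; the specific loss and metric names shown are assumptions and may differ by package version.

```python
from lightautoml.tasks import Task

# The three task types TabularAutoML solves; loss and metric are customizable.
binary_task = Task('binary')          # binary classification
multiclass_task = Task('multiclass')  # multi-class classification
regression_task = Task('reg', loss='mae', metric='mae')  # assumed loss/metric names
```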

Auto-Typing and Data Preprocessing

To handle different types of features in different ways, the model needs to know each feature's type. In a situation where there is a single task with a small dataset, the user can manually specify each feature type. However, specifying each feature type manually is no longer a viable option in situations that involve hundreds of tasks with datasets containing thousands of features. For the TabularAutoML Preset, the LightAutoML framework needs to map features into three classes: numeric, category, and datetime. One simple and obvious solution is to use column data types as the actual feature types, that is, to map float/int columns to numeric features, timestamps or strings that can be parsed as timestamps to datetime, and everything else to category. However, this mapping is not the best due to the frequent occurrence of numeric data types in category columns.
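A simplified sketch of that naive dtype-based mapping is shown below; the actual LightAutoML auto-typing also uses meta-statistics beyond dtypes, so this is only the baseline the paragraph describes.

```python
import pandas as pd

def parses_as_datetime(series: pd.Series) -> bool:
    """Heuristic: try to parse a sample of the column as timestamps."""
    try:
        pd.to_datetime(series.dropna().head(100), errors='raise')
        return True
    except (ValueError, TypeError):
        return False

def naive_auto_type(df: pd.DataFrame) -> dict:
    """Map each column to 'numeric', 'datetime', or 'category' from its dtype alone."""
    roles = {}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            roles[col] = 'numeric'
        elif pd.api.types.is_datetime64_any_dtype(series) or parses_as_datetime(series):
            roles[col] = 'datetime'
        else:
            roles[col] = 'category'
    return roles
```

The weakness the text points out is visible here: an integer column holding category codes (say, branch IDs) would be classified as numeric by this rule, which is why dtype alone is not a reliable signal.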

Validation Schemes

Validation schemes are a major component of AutoML frameworks, since data in industry is subject to change over time, and this element of change makes IID (Independent and Identically Distributed) assumptions unreliable when developing a model. AutoML models employ validation schemes to estimate their performance, search for hyperparameters, and generate out-of-fold predictions. The TabularAutoML pipeline implements three validation schemes, illustrated in the generic sketch after this list:

  • KFold Cross Validation: KFold cross validation is the default validation scheme for the TabularAutoML pipeline, including GroupKFold for behavioral models and stratified KFold for classification tasks. 
  • Holdout Validation: The holdout validation scheme is implemented if a holdout set is specified. 
  • Custom Validation Schemes: Custom validation schemes can be created by users depending on their individual requirements, for example cross-validation and time-series split schemes. 
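The following sketch shows what these splitting strategies look like in plain scikit-learn; it is a generic illustration, not LightAutoML's internal implementation, and the client-grouping column is invented.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)          # binary labels -> stratified folds
clients = rng.integers(0, 20, size=100)   # invented client ids for grouping

# Stratified KFold: preserves the class balance in every fold (classification default).
for train_idx, valid_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    pass  # fit on X[train_idx], validate on X[valid_idx]

# GroupKFold: keeps all rows of one client in the same fold (behavioral models).
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups=clients):
    pass

# Holdout: a single fixed validation split, used when a holdout set is specified.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```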

Feature Selection

Although feature selection is a crucial aspect of developing models by industry standards, since it facilitates reduction in inference time and model implementation costs, a majority of AutoML solutions do not focus much on this problem. On the contrary, the TabularAutoML pipeline implements three feature selection strategies: no selection, importance cutoff selection, and importance-based forward selection. Of the three, importance cutoff selection is the default. Moreover, there are two primary ways to estimate feature importance: split-based tree importance, and permutation importance of a GBM (gradient boosted decision tree) model. The primary aim of importance cutoff selection is to reject features that are not helpful to the model, allowing the model to reduce the number of features without negatively impacting performance, an approach that can speed up both model training and inference.

The image above compares the different feature selection strategies on binary bank datasets. 
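To make the default strategy concrete, here is a minimal sketch of importance cutoff selection using scikit-learn's GBM and its split-based importances; the threshold value and the synthetic dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data standing in for a tabular task.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

# Fit a GBM and read its split-based feature importances.
gbm = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbm.fit(X, y)

# Importance cutoff selection: keep only features above an (assumed) threshold.
threshold = 0.01
selected = np.where(gbm.feature_importances_ > threshold)[0]
X_selected = X[:, selected]
print(f"kept {len(selected)} of {X.shape[1]} features")
```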

Hyperparameter Tuning

The TabularAutoML pipeline implements different approaches to tune hyperparameters, based on what is being tuned:

  • Early Stopping Hyperparameter Tuning selects the number of iterations for all models during the training phase. 
  • Expert System Hyperparameter Tuning is a simple way to set hyperparameters for models in a satisfactory fashion. It prevents the final model from a steep drop in score compared to hard-tuned models.
  • Tree-Structured Parzen Estimation (TPE) is used for GBM (gradient boosted decision tree) models. TPE is part of a mixed tuning strategy that is the default choice in the LightAutoML pipeline: for each GBM framework, LightAutoML trains two models, where the first receives expert hyperparameters and the second is fine-tuned to fit into the time budget (see the sketch after this list). 
  • Grid Search Hyperparameter Tuning is implemented in the TabularAutoML pipeline to fine-tune the regularization parameters of a linear model, alongside early stopping and warm start. 

The model tunes all parameters by maximizing the metric function, either defined by the user or the default for the task being solved. 
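As a rough illustration of TPE-based tuning under a time budget, here is a sketch using Optuna's TPE sampler with LightGBM; this stands in for LightAutoML's internal tuner rather than reproducing it, and the search space and budget are assumptions.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

def objective(trial):
    # An assumed search space for a LightGBM model.
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 16, 255),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    model = LGBMClassifier(n_estimators=200, **params)
    # Maximize the metric function, here ROC AUC via cross-validation.
    return cross_val_score(model, X, y, scoring='roc_auc', cv=5).mean()

study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, timeout=300)  # stop when the 5-minute budget runs out
print(study.best_params)
```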

LightAutoML: Experiments and Performance

To evaluate its performance, the TabularAutoML Preset within the LightAutoML framework is compared against existing open-source solutions across various tasks, and the comparison cements the superior performance of the LightAutoML framework. First, the comparison is carried out on the OpenML benchmark, which is evaluated on 35 binary and multiclass classification datasets. The following table summarizes the comparison of the LightAutoML framework against existing AutoML systems. 

As can be seen, the LightAutoML framework outperforms all other AutoML systems on 20 datasets within the benchmark. The following table contains a detailed per-dataset comparison, indicating that LightAutoML delivers different performance on different classes of tasks. For binary classification tasks, LightAutoML falls short in performance, whereas for tasks with a large amount of data, the framework delivers superior performance.

The following table compares the performance of the LightAutoML framework against AutoML systems on 15 bank datasets containing a set of various binary classification tasks. As can be observed, LightAutoML outperforms all AutoML solutions on 12 of the 15 datasets, a win rate of 80%. 

Final Thoughts

In this text we’ve got talked about LightAutoML, an AutoML system developed primarily for a European company operating within the finance sector together with its ecosystem. The LightAutoML framework is deployed across various applications, and the outcomes demonstrated superior performance, comparable to the extent of information scientists, even while constructing high-quality machine learning models. The LightAutoML framework attempts to make the next contributions. First, the LightAutoML framework was developed primarily for the ecosystem of a big European financial and banking institution. Owing to its framework and architecture, the LightAutoML framework is capable of outperform state-of-the-art AutoML frameworks across several open benchmarks in addition to ecosystem applications. The performance of the LightAutoML framework can also be compared against models which can be tuned manually by data scientists, and the outcomes indicated stronger performance by the LightAutoML framework. 
