FormulaFeatures: A Tool to Generate Highly Predictive Features for Interpretable Models


Create more interpretable models through the use of concise, highly predictive features, automatically engineered from arithmetic combinations of numeric features

In this article, we examine a tool called FormulaFeatures. It is intended to be used primarily with interpretable models, such as shallow decision trees, where having a small number of concise and highly predictive features can greatly improve both the interpretability and the accuracy of the models.

This article continues my series on interpretable machine learning, following articles on ikNN, Additive Decision Trees, Genetic Decision Trees, and PRISM rules.

As indicated in the previous articles (and covered there in more detail), there is often a strong incentive to use interpretable predictive models: each prediction can be well understood, and we can be confident the model will behave sensibly on future, unseen data.

There are a number of models available for interpretable ML, although, unfortunately, far fewer than we would likely wish. There are the models described in the articles linked above, as well as a small number of others, for example decision trees, decision tables, rule sets and rule lists (created, for example, by imodels), Optimal Sparse Decision Trees, GAMs (Generalized Additive Models, such as Explainable Boosting Machines), and a handful of other options.

In general, creating predictive machine learning models that are both accurate and interpretable is difficult. To improve the options available for interpretable ML, four of the main approaches are to:

  1. Develop additional model types
  2. Improve the accuracy or interpretability of existing model types. By this, I mean creating variations on existing model types, or on the algorithms used to create the models, as opposed to completely novel model types. For example, Optimal Sparse Decision Trees and Genetic Decision Trees seek to create stronger decision trees, but ultimately are still decision trees.
  3. Provide visualizations of the data, the model, and the predictions made by the model. This is the approach taken, for example, by ikNN, which works by creating an ensemble of 2D kNN models (that is, an ensemble of kNN models that each use only a single pair of features). The 2D spaces can be visualized, which provides a high degree of visibility into how the model works and why it made each prediction as it did.
  4. Improve the quality of the features used by the models, so that the models can be either more accurate or more interpretable.

FormulaFeatures supports the last of these approaches. I developed it to address a common issue with decision trees: they can often achieve a high level of accuracy, but only when grown to a large depth, which then precludes any interpretability. Creating new features that capture part of the function linking the original features to the target can allow for much more compact (and therefore interpretable) decision trees.

The underlying idea is: for any labelled dataset, there is some true function, f(x), that maps the records to the target column. This function may take any number of forms, may be simple or complex, and may use any set of features in x. But whatever the nature of f(x), by creating a model, we hope to approximate f(x) as well as we can given the data available. To create an interpretable model, we also need to do this clearly and concisely.

If the features themselves can capture a large part of the function, this can be very helpful. For example, we may have a model that predicts client churn, with features for each client including: their number of purchases in the last year, and the average value of their purchases in the last year. The true f(x), though, may be based primarily on the product of these (the total value of their purchases in the last year, which is found by multiplying these two features).

In practice, we will generally never know the true f(x), but in this case, let's assume that whether a client churns in the next year is related strongly to their total purchases in the prior year, and not strongly to their number of purchases or their average purchase size.

We can likely build an accurate model using just the two original features, but a model using just the product feature will be clearer and more interpretable. And possibly more accurate.

If we have only two features, we can view them in a 2D plot. In this case, we can look at just num_purc and avg_purc: the number of purchases in the last year per client, and their average dollar value. Assuming the true f(x) is based primarily on their product, the space may look like the plot below, where the light blue area represents clients who will churn in the next year, and the dark blue those who will not.

If using a decision tree to model this, we can create a model by dividing the data space recursively. The orange lines on the plot show a plausible set of splits a decision tree might use (for the first set of nodes) to try to predict churn. It might, as shown, first split on num_purc at a value of 250, then avg_purc at 24, and so on. It would continue to make splits in order to fit the curved shape of the true function.

Doing this would create a decision tree something like the tree below, where the circles represent internal nodes, the rectangles represent leaf nodes, and the ellipses represent sub-trees that would likely need to be grown several more levels deep to achieve decent accuracy. That is, this shows only a fraction of the full tree that would need to be grown to model this using these two features. We can see this in the plot above as well: using axis-parallel splits, we will need a large number of splits to fit the boundary between the two classes well.

If the tree is grown sufficiently, we can likely get a strong tree in terms of accuracy. But the tree will be far from interpretable.

It is possible to view the decision space, as in the plot above (and this does make the behaviour of the model clear), but this is only feasible here because the space is limited to two dimensions. Normally this is impossible, and our best means of interpreting a decision tree is to examine the tree itself. But where the tree has many dozens of nodes or more, it becomes impossible to see the patterns it is working to capture.

In this case, if we engineered a feature for num_purc * avg_purc, we could have a very simple decision tree, with only a single internal node, with the split point: num_purc * avg_purc > 25000.
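As a quick illustration, here is a minimal sketch using synthetic data (the feature names and the 25000 threshold follow the example above; the data itself is an assumption for illustration):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic churn data: churn is assumed to be driven by total spend
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_purc": rng.integers(1, 500, 1000),
    "avg_purc": rng.uniform(1, 100, 1000),
})
y = (df["num_purc"] * df["avg_purc"] < 25000).astype(int)  # 1 = churn

# With the engineered product feature, a single split is enough
df["num_purc_x_avg_purc"] = df["num_purc"] * df["avg_purc"]
dt = DecisionTreeClassifier(max_depth=1, random_state=0)
dt.fit(df[["num_purc_x_avg_purc"]], y)
print(export_text(dt, feature_names=["num_purc_x_avg_purc"]))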

In practice, it is never possible to produce features that are this close to the true function, and it is never possible to create fully accurate decision trees with very few nodes. But it is often quite possible to engineer features that are closer to the true f(x) than the original features.

Whenever there are interactions between features, if we can capture these with engineered features, this allows for more compact models.

So, with FormulaFeatures, we attempt to create features such as num_purchases * avg_value_of_purchases, and they can very often be used in models such as decision trees to capture the true function reasonably well.

As well, simply knowing that num_purchases * avg_value_of_purchases is predictive of the target (and that higher values are associated with lower risk of churn) is informative in itself. But the new feature is most useful in the context of seeking to make interpretable models more accurate and more interpretable.

As described below, FormulaFeatures also does this in a way that minimizes the creation of other features, so that only a small set of features, all relevant, is returned.

With tabular data, the top-performing models for prediction problems are typically boosted tree-based ensembles, particularly LGBM, XGBoost, and CatBoost. It varies from one prediction problem to another, but most of the time these three models tend to do better than other models (and are considered, at least outside of AutoML approaches, the current state of the art). Other strong model types such as kNNs, neural networks, Bayesian Additive Regression Trees, SVMs, and others will also occasionally perform best. All of these model types, though, are quite uninterpretable, and are effectively black boxes.

Unfortunately, interpretable models tend to be weaker than these with respect to accuracy. Sometimes the drop in accuracy is fairly small (for example, in the third decimal place), and it is worth sacrificing some accuracy for interpretability. In other cases, though, interpretable models may do substantially worse than the black-box alternatives. It is difficult, for example, for a single decision tree to compete with an ensemble of many decision trees.

So, it is common to be able to create a strong black-box model, while at the same time finding it difficult (or impossible) to create a strong interpretable model. This is the problem FormulaFeatures was designed to address. It seeks to capture some of the logic that black-box models can represent, but in a simple, comprehensible way.

Much of the research done in interpretable AI focuses on decision trees, and relates to making decision trees more accurate and more interpretable. This is fairly natural, as decision trees are a model type that is inherently straightforward to understand (when small enough, they are arguably as interpretable as any other model) and often reasonably accurate (though this is very often not the case).

Other interpretable model types (e.g. logistic regression, rules, GAMs, etc.) are used as well, but much of the research is focused on decision trees, and so this article works, for the most part, with decision trees. Nevertheless, FormulaFeatures is not specific to decision trees, and can be useful for other interpretable models. In fact, it is fairly easy to see, once we explain FormulaFeatures below, how it can be applied as well to ikNN, Genetic Decision Trees, Additive Decision Trees, rule lists, rule sets, and so on.

To be more precise with respect to decision trees, when using these for interpretable ML, we are looking specifically at shallow decision trees: trees that have relatively small depths, with the deepest nodes restricted to perhaps 3, 4, or 5 levels. This ensures that shallow decision trees can provide both what are called local explanations and what are called global explanations. These are the two main concerns with interpretable ML. I'll explain these here.

With local interpretability, we want to ensure that each individual prediction made by the model is understandable. Here, we can examine the decision path taken through the tree by each record for which we generate a prediction. If a path includes the feature num_purc * avg_purc, and the path is very short, it can be reasonably clear. On the other hand, a path that includes: num_purc > 250 AND avg_purc > 24 AND num_purc < 500 AND avg_purc > 50, and so on (as in the tree generated above without the benefit of the num_purc * avg_purc feature) can become very difficult to interpret.

With global interpretability, we want to ensure that the model as a whole is understandable. This allows us to see the predictions that would be made under any circumstances. Again, using more compact trees, where the features themselves are informative, can aid with this. It is much easier, in this case, to see the big picture of how the decision tree produces predictions.

We should qualify this, though, by noting that shallow decision trees (which we focus on in this article) are very difficult to create in a way that is accurate for regression problems. Each leaf node can predict only a single value, and so a tree with n leaf nodes can only output, at most, n unique predictions. For regression problems, this often results in high error rates: usually decision trees need to create a large number of leaf nodes in order to cover the full range of values that may potentially be predicted, with each node having reasonable precision.

Consequently, shallow decision trees tend to be practical only for classification problems (if there is only a small number of classes to predict, it is quite possible to create a decision tree with not too many leaf nodes that predicts these accurately). FormulaFeatures can be useful with other interpretable regression models, but not typically with decision trees.

Now that we've seen some of the motivation behind FormulaFeatures, we'll take a look at how it works.

FormulaFeatures is a form of supervised feature engineering, which is to say that it considers the target column when producing features, and so can generate features specifically useful for predicting that target. FormulaFeatures supports both regression and classification targets (though, as indicated, when using decision trees, it may be that only classification targets are feasible).

Taking advantage of the target column allows it to generate only a small number of engineered features, each as simple or complex as necessary.

Unsupervised methods, on the other hand, do not take the target feature into consideration, and simply generate all possible combinations of the original features using some scheme for generating features.

An example of this is scikit-learn's PolynomialFeatures, which will generate all polynomial combinations of the features. If the original features are, say, [a, b, c], then PolynomialFeatures can create (depending on the parameters specified) a set of engineered features such as [ab, ac, bc, a², b², c²]: that is, it will generate all combinations of pairs of features (using multiplication), as well as all original features raised to the second degree.
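For reference, here is a minimal example of this behaviour (a small sketch; the column names a, b, and c are just placeholders):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(5, 3)  # three original features: a, b, c
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Original features plus all degree-2 combinations
print(poly.get_feature_names_out(["a", "b", "c"]))
# ['a' 'b' 'c' 'a^2' 'a b' 'a c' 'b^2' 'b c' 'c^2']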

Using unsupervised methods, there is very often an explosion in the number of features created. If we start with 20 features, returning just the features created by multiplying each pair of features would generate (20 * 19) / 2, or 190, features (that is, 20 choose 2). If allowed to create features based on multiplying sets of three features, there are 20 choose 3, or 1140, of these. Allowing features such as a²bc, a²bc², and so on results in even larger numbers of features (though with a small set of useful features being, quite possibly, among them).

Supervised feature engineering methods tend to return only a much smaller (and more relevant) subset of these.

However, even in the context of supervised feature engineering (depending on the specific approach used), an explosion in features can still occur to some extent, resulting in a time-consuming feature engineering process, as well as producing more features than can reasonably be used by any downstream task, such as prediction, clustering, or outlier detection. FormulaFeatures is optimized to keep both the engineering time and the number of features returned tractable, and its algorithm is designed to limit the number of features generated.

The tool operates on the numeric features of a dataset. In the first iteration, it examines each pair of original numeric features. For each, it considers four potential new features based on the four basic arithmetic operations (+, -, *, and /). For the sake of performance and interpretability, the process is limited to these four operations.

If any perform better than both parent features (in terms of their ability to predict the target, described shortly), then the strongest of these is added to the set of features. For example, if A + B and A * B are both strong features (both stronger than either A or B), only the stronger of these will be included.

Subsequent iterations then consider combining all features generated in the previous iteration with all other features, again taking the strongest of these, if any outperformed their two parent features. In this way, a practical number of new features is generated, all stronger than the previous features. A simplified sketch of this loop is shown below.
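This sketch is not the actual FormulaFeatures implementation: the helper score_feature() is assumed to be the 1D decision-tree scoring described next, and the correlation-based pruning described later is omitted.

import operator
import numpy as np

OPS = {"add": operator.add, "subtract": operator.sub,
       "multiply": operator.mul, "divide": operator.truediv}

def engineer_features(df, y, score_feature, max_iterations=2):
    # score_feature(series, y) -> float (e.g. R2 or macro F1 from a 1D tree)
    df = df.copy()
    scores = {c: score_feature(df[c], y) for c in df.columns}
    last_added = list(df.columns)
    for _ in range(max_iterations):
        added = []
        existing = list(df.columns)
        for a in last_added:              # features from the previous iteration
            for b in existing:            # combined with all other features
                if a == b:
                    continue
                best, best_score = None, max(scores[a], scores[b])
                for op_name, op in OPS.items():
                    cand = op(df[a], df[b]).replace([np.inf, -np.inf], np.nan).fillna(0)
                    s = score_feature(cand, y)
                    if s > best_score:    # must outperform both parent features
                        best, best_score = (f"({a} {op_name} {b})", cand), s
                if best is not None:
                    name, vals = best
                    df[name], scores[name] = vals, best_score
                    added.append(name)
        if not added:
            break
        last_added = added
    return df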

Assume we start with a dataset with features A, B, and C, that Y is the target, and that Y is numeric (this is a regression problem).

We start by determining how predictive of the target each feature is on its own. The currently-available version uses R2 for regression problems and F1 (macro) for classification problems. We create a simple model (a classification or regression decision tree) using only a single feature, determine how well it predicts the target column, and measure this with either the R2 or F1 score.

Using a decision tree allows us to capture the relationships between the feature and the target reasonably well, even fairly complex, non-monotonic relationships, where they exist.
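A sketch of how such a per-feature score might be computed is below (the depth limit and the use of cross-validation are assumptions here, not necessarily what FormulaFeatures does internally):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def score_feature(feature, y, target_type="regression"):
    # Train a 1D decision tree on just this feature and score it against the target
    X = pd.DataFrame({"f": feature})
    if target_type == "regression":
        model, metric = DecisionTreeRegressor(max_depth=4), "r2"
    else:
        model, metric = DecisionTreeClassifier(max_depth=4), "f1_macro"
    return cross_val_score(model, X, y, cv=3, scoring=metric).mean()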

Future versions will support more metrics. Using strictly R2 and F1, however, is not a significant limitation. While other metrics may be more relevant to your projects, using these metrics internally when engineering features will identify well the features that are strongly associated with the target, even if the strength of the association is not identical to what would be found using other metrics.

In this example, we start by calculating the R2 for each original feature, training a decision tree using only feature A, then another using only B, and then another using only C. This may give the following R2 scores:

A   0.43
B 0.02
C -1.23

We then consider the combinations of pairs of these, which are: A & B, A & C, and B & C. For each we try the four arithmetic operations: +, *, -, and /.

Where there are feature interactions in f(x), it will often be the case that a new feature incorporating the relevant original features can represent the interactions well, and so outperform either parent feature.

When examining A & B, assume we get the following R2 scores:

A + B  0.54
A * B 0.44
A - B 0.21
A / B -0.01

Here there are two operations that produce a higher R2 score than either parent feature (A or B), which are + and *. We take the higher of these, A + B, and add this to the set of features. We do the same for A & C and B & C. In many cases, no feature will be added, but often one is.

After the first iteration we may have:

A       0.43
B 0.02
C -1.23
A + B 0.54
B / C 0.32

We then, in the next iteration, take the two features just added and try combining them with all other features, including each other.

After this we may have:

A                   0.43
B 0.02
C -1.23
A + B 0.54
B / C 0.32
(A + B) - C 0.56
(A + B) * (B / C) 0.66

This continues until there is no longer any improvement, or a limit specified by a hyperparameter, max_iterations, is reached.

At the end of each iteration, further pruning of the features is performed, based on correlations. The correlations among the features created during the current iteration are examined, and where two or more features that are highly correlated were created, only the strongest is kept and the others removed. This limits the creation of near-redundant features, which becomes more likely as the features become more complex.

For example: (A + B + C) / E and (A + B + D) / E may both be strong, but quite similar, and if so, only the stronger of these will be kept.
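A minimal sketch of this kind of correlation-based pruning (the 0.95 threshold and the greedy keep-strongest-first strategy are assumptions for illustration):

def prune_correlated(df, new_cols, scores, threshold=0.95):
    # Among the features created this iteration, keep the strongest of any
    # group of highly-correlated features and drop the rest.
    kept = []
    for col in sorted(new_cols, key=lambda c: scores[c], reverse=True):
        if all(abs(df[col].corr(df[other])) < threshold for other in kept):
            kept.append(col)
    dropped = [c for c in new_cols if c not in kept]
    return df.drop(columns=dropped), kept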

One allowance for correlated features is made, though. In general, as the algorithm proceeds, more complex features are created, and these features more accurately capture the true relationship between the features in x and the target. But the new features created may be correlated with the features they build upon, which are simpler, and FormulaFeatures also seeks to favour simpler features over more complex ones, everything else being equal.

For example, if (A + B + C) is correlated with (A + B), both may be kept even if (A + B + C) is stronger, so that the simpler (A + B) can be combined with other features in subsequent iterations, possibly creating features that are stronger still.

In the example above, we have features A, B, and C, and say that part of the true f(x) can be approximated with (A + B) - C.

We initially have only the original features. After the first iteration, we may generate (again, as in the example above) A + B and B / C, so now have five features.

In the next iteration, we may generate (A + B) - C.

This process is, in general, a combination of: 1) combining weak features to make them stronger (and more likely useful in a downstream task); and 2) combining strong features to make these even stronger, creating what are most likely the most predictive features.

What is important, though, is that this combining is done only after it is confirmed that A + B is a predictive feature in itself, more so than either A or B. That is, we do not create (A + B) - C until we confirm that A + B is predictive. This ensures that, for any complex features created, each component within them is useful.

In this way, each iteration creates a more powerful set of features than the previous one, and does so in a way that is reliable and stable. It minimizes the effects of simply trying many complex combinations of features, which can easily overfit.

So, FormulaFeatures executes in a principled, deliberate manner, creating only a small number of engineered features at each step, and typically creates fewer features with each iteration. As such, it overall favours creating features with low complexity. And, where complex features are generated, this can be shown to be justified.

With most datasets, in the end, the features engineered are combinations of just two or three original features. That is, it will usually create features more like A * B than, say, (A * B) / (C * D).

In fact, to generate a feature such as (A * B) / (C * D), it would need to demonstrate that A * B is more predictive than either A or B, that C * D is more predictive than C or D, and that (A * B) / (C * D) is more predictive than either (A * B) or (C * D). As that is a lot of conditions, relatively few features as complex as (A * B) / (C * D) will tend to be created; many more will be like A * B.

We'll look more closely here at using decision trees internally to evaluate each feature, both the original and the engineered features.

To evaluate the features, other methods are available, such as simple correlation tests. But creating simple, non-parametric models, and specifically decision trees, has a number of advantages:

  • 1D models are fast, both to train and to test, which allows the evaluation process to execute very quickly. We can quickly determine which engineered features are predictive of the target, and how predictive they are.
  • 1D models are simple and so can reasonably be trained on small samples of the data, further improving efficiency.
  • While 1D decision tree models are relatively simple, they can capture non-monotonic relationships between the features and the target, so can detect where features are predictive even where the relationships are complex enough to be missed by simpler tests, such as tests for correlation.
  • This ensures all features are useful in themselves, and so supports the features being a form of interpretability in themselves.

There are also some limitations of using 1D models to evaluate each feature, in particular: using single features precludes identifying effective combinations of features. This may result in missing some useful features (features that are not useful on their own but are useful in combination with other features), but does allow the process to execute very quickly. It also ensures that all features produced are predictive on their own, which does aid interpretability.

The goal is that, where features are useful only in combination with other features, a new feature is created to capture this.

Another limitation associated with this form of feature engineering is that almost all engineered features will have global significance, which is often desirable, but it does mean the tool can miss generating features that are useful only in specific sub-spaces. However, given that the features will be used by interpretable models, such as shallow decision trees, the value of features that are predictive in only specific sub-spaces is much lower than where more complex models (such as large decision trees) are used.

FormulaFeatures does create features that are inherently more complex than the original features, which does lower the interpretability of the trees (assuming the engineered features are used by the trees one or more times).

At the same time, using these features can allow substantially smaller decision trees, resulting in a model that is, overall, more accurate and more interpretable. That is, even though the features used in a tree may be complex, the tree may be substantially smaller (or substantially more accurate when keeping the size to a reasonable level), resulting in a net gain in interpretability.

When FormulaFeatures is used with shallow decision trees, the engineered features generated tend to be placed at the top of the trees (as these are the most powerful features, best able to maximize information gain). No single feature can ever split the data perfectly at any step, which means further splits are almost always necessary. Other features are used lower in the tree, and these tend to be simpler engineered features (based on only two, or sometimes three, original features), or the original features. On the whole, this can produce fairly interpretable decision trees, and tends to limit the use of the more complex engineered features to a useful level.
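One easy way to see this is to train a shallow tree on the extended features and print it, for example with scikit-learn's export_text (here, x_train_extended and y_train refer to the output of FormulaFeatures' transform(), as in the code examples later in this article):

from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
dt.fit(x_train_extended, y_train)

# The engineered features tend to appear in the top splits
print(export_text(dt, feature_names=[str(c) for c in x_train_extended.columns]))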

To explain more of the context for FormulaFeatures, I'll describe another tool, also developed by myself, called ArithmeticFeatures, which is similar but somewhat simpler. We'll then look at some of the limitations associated with ArithmeticFeatures that FormulaFeatures was designed to address.

ArithmeticFeatures is a simple tool, but one I've found useful in a number of projects. I initially created it because it was a recurring theme that it was useful to generate a set of simple arithmetic combinations of the numeric features available for the various projects I was working on. I then hosted it on github.

Its purpose, and its signature, are similar to scikit-learn's PolynomialFeatures. It is also an unsupervised feature engineering tool.

Given a set of numeric features in a dataset, it generates a collection of new features. For each pair of numeric features, it generates four new features: the results of the +, -, * and / operations.

This can generate a set of features that are useful, but it also generates a very large set of features, potentially including redundant features, which means feature selection is necessary afterwards.

FormulaFeatures was designed to address the issue that, as indicated above, frequently occurs with unsupervised feature engineering tools, including ArithmeticFeatures: an explosion in the number of features created. With no target to guide the process, they simply combine the numeric features in as many ways as possible.

To quickly list the differences:

  • FormulaFeatures will generate far fewer features, but each feature it generates will be known to be useful. ArithmeticFeatures provides no check as to which features are useful. It will generate features for every combination of original features and arithmetic operation.
  • FormulaFeatures will only generate features that are more predictive than either parent feature.
  • For any given pair of features, FormulaFeatures will include at most one combination, which is the one that is most predictive of the target.
  • FormulaFeatures will continue looping for either a specified number of iterations, or for as long as it is able to create more powerful features, and so can create more powerful features than ArithmeticFeatures, which is limited to features based on pairs of original features.

ArithmeticFeatures, as it executes only one iteration (in order to manage the number of features produced), is often quite limited in what it can create.

Imagine a case where the dataset describes houses and the target feature is the house price. This may be related to features such as num_bedrooms, num_bathrooms and num_common_rooms. It is likely strongly related to the total number of rooms, which, let's say, is: num_bedrooms + num_bathrooms + num_common_rooms. ArithmeticFeatures, however, is only able to produce engineered features based on pairs of original features, so can produce:

  • num_bedrooms + num_bathrooms
  • num_bedrooms + num_common_rooms
  • num_bathrooms + num_common_rooms

These may be informative, but producing num_bedrooms + num_bathrooms + num_common_rooms (as FormulaFeatures is able to do) is both clearer as a feature and allows more concise trees (and other interpretable models) than using features based on only pairs of original features, as the sketch below suggests.
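To make this concrete, here is a small synthetic sketch (the data and feature names are assumptions for illustration); the full sum correlates with the price more strongly than any pairwise combination:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "num_bedrooms": rng.integers(1, 6, 500),
    "num_bathrooms": rng.integers(1, 4, 500),
    "num_common_rooms": rng.integers(1, 4, 500),
})
# Assume the price is driven by the total number of rooms, plus noise
price = 50_000 * df.sum(axis=1) + rng.normal(0, 20_000, 500)

# Pairwise sums (what ArithmeticFeatures can produce) vs. the full sum
df["bed_plus_bath"] = df["num_bedrooms"] + df["num_bathrooms"]
df["total_rooms"] = df["num_bedrooms"] + df["num_bathrooms"] + df["num_common_rooms"]
print(df.corrwith(price).round(2))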

Another popular feature engineering tool based on arithmetic operations is AutoFeat, which works similarly to ArithmeticFeatures, and likewise executes in an unsupervised manner, so will create a very large number of features. AutoFeat is able to execute for multiple iterations, creating progressively more complex features with each iteration, but with increasingly large numbers of them. As well, AutoFeat supports unary operations, such as square, square root, and log, which allows for features such as A²/log(B).

So, I've gone over the motivations to create, and to use, FormulaFeatures over unsupervised feature engineering, but I should also say: unsupervised methods such as PolynomialFeatures, ArithmeticFeatures, and AutoFeat are also often useful, particularly where feature selection will be performed in any case.

FormulaFeatures focuses more on interpretability (and to some extent on memory efficiency, but the primary motivation was interpretability), and so has a different purpose.

Using unsupervised feature engineering tools such as PolynomialFeatures, ArithmeticFeatures, and AutoFeat increases the need for feature selection, but feature selection is generally performed in any case.

That is, even when using a supervised feature engineering method such as FormulaFeatures, it will generally be useful to perform some feature selection after the feature engineering process. In fact, even if the feature engineering process produces no new features, feature selection is likely still useful simply to reduce the number of the original features used in the model.

While FormulaFeatures seeks to minimize the number of features created, it does not perform feature selection per se, so it can generate more features than will be necessary for any given task. We assume the engineered features will typically be used for a prediction task, but the relevant features will still depend on the specific model used, hyperparameters, evaluation metrics, and so on, which FormulaFeatures cannot predict.

What may be relevant is that, with FormulaFeatures, as compared to many other feature engineering processes, the feature selection work, if performed, can be a much simpler process, as there will be far fewer features to consider. Feature selection can become slow and difficult when working with many features; for example, wrapper methods to select features can become intractable.
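For example, with the smaller feature set that FormulaFeatures produces, even a wrapper-style method such as scikit-learn's RFE remains tractable (a sketch; x_train_extended and y_train are as in the code examples below, and n_features_to_select=5 is arbitrary):

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Recursively drop the least important features according to a shallow tree
selector = RFE(DecisionTreeClassifier(max_leaf_nodes=10, random_state=0),
               n_features_to_select=5)
selector.fit(x_train_extended, y_train)
print([str(c) for c in x_train_extended.columns[selector.support_]])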

The tool uses the fit-transform pattern, the same as that used by scikit-learn's PolynomialFeatures and many other feature engineering tools (including ArithmeticFeatures). As such, it is easy to substitute this tool for others to determine which is the most useful for any given project.

In this example, we load the iris data set (a toy dataset provided by scikit-learn), split the data into train and test sets, use FormulaFeatures to engineer a set of additional features, and fit a decision tree using these.

This is a fairly typical example. Using FormulaFeatures requires only creating a FormulaFeatures object, fitting it, and transforming the available data. This produces a new dataframe that can be used for any subsequent tasks, in this case to train a classification model.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from formula_features import FormulaFeatures

# Load the data
iris = load_iris()
x, y = iris.data, iris.target
x = pd.DataFrame(x, columns=iris.feature_names)

# Split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Engineer new features
ff = FormulaFeatures()
ff.fit(x_train, y_train)
x_train_extended = ff.transform(x_train)
x_test_extended = ff.transform(x_test)

# Train a decision tree and make predictions
dt = DecisionTreeClassifier(max_depth=4, random_state=0)
dt.fit(x_train_extended, y_train)
y_pred = dt.predict(x_test_extended)

Setting the tool to execute with verbose=1 or verbose=2 allows viewing the process in greater detail.

The github page also provides a file called demo.py, which contains some examples using FormulaFeatures, though the signature is quite simple.

Getting the feature scores, which we show in this example, can be useful for understanding the features generated and for feature selection.

In this example, we use the gas-drift dataset from OpenML (https://www.openml.org/search?type=data&sort=runs&id=1476&status=active, licensed under Creative Commons).

It largely works the same as the previous example, but also makes a call to the display_features() API, which provides information about the features engineered.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from formula_features import FormulaFeatures

def test_f1(x_train, x_test, y_train, y_test):
    # Stand-in for the test_f1 helper from demo.py (assumed here: a shallow
    # decision tree evaluated with the macro F1 score, as in the testing below)
    dt = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
    dt.fit(x_train, y_train)
    return f1_score(y_test, dt.predict(x_test), average='macro')

data = fetch_openml('gas-drift')
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Drop all non-numeric columns. This is not necessary, but is done here
# for simplicity.
x = x.select_dtypes(include=np.number)

# Divide the data into train and test splits. For a more reliable measure
# of accuracy, cross validation may be used. This is done here for
# simplicity.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)

ff = FormulaFeatures(
    max_iterations=2,
    max_original_features=10,
    target_type='classification',
    verbose=1)
ff.fit(x_train, y_train)
x_train_extended = ff.transform(x_train)
x_test_extended = ff.transform(x_test)

display_df = x_test_extended.copy()
display_df['Y'] = y_test.values
print(display_df.head())

# Test using the extended features
extended_score = test_f1(x_train_extended, x_test_extended, y_train, y_test)
print(f"F1 (macro) score on extended features: {extended_score}")

# Get a summary of the features engineered and their scores based
# on 1D models
ff.display_features()

This will produce the following report, listing each feature index, F1 macro score, and feature name:

0:    0.438, V9
1: 0.417, V65
2: 0.412, V67
3: 0.412, V68
4: 0.412, V69
5: 0.404, V70
6: 0.409, V73
7: 0.409, V75
8: 0.409, V76
9: 0.414, V78
10: 0.447, ('V65', 'divide', 'V9')
11: 0.465, ('V67', 'divide', 'V9')
12: 0.422, ('V67', 'subtract', 'V65')
13: 0.424, ('V68', 'multiply', 'V65')
14: 0.489, ('V70', 'divide', 'V9')
15: 0.477, ('V73', 'subtract', 'V65')
16: 0.456, ('V75', 'divide', 'V9')
17: 0.45, ('V75', 'divide', 'V67')
18: 0.487, ('V78', 'divide', 'V9')
19: 0.422, ('V78', 'divide', 'V65')
20: 0.512, (('V67', 'divide', 'V9'), 'multiply', ('V65', 'divide', 'V9'))
21: 0.449, (('V67', 'subtract', 'V65'), 'divide', 'V9')
22: 0.45, (('V68', 'multiply', 'V65'), 'subtract', 'V9')
23: 0.435, (('V68', 'multiply', 'V65'), 'multiply', ('V67', 'subtract', 'V65'))
24: 0.535, (('V73', 'subtract', 'V65'), 'multiply', 'V9')
25: 0.545, (('V73', 'subtract', 'V65'), 'multiply', 'V78')
26: 0.466, (('V75', 'divide', 'V9'), 'subtract', ('V67', 'divide', 'V9'))
27: 0.525, (('V75', 'divide', 'V67'), 'divide', ('V73', 'subtract', 'V65'))
28: 0.519, (('V78', 'divide', 'V9'), 'multiply', ('V65', 'divide', 'V9'))
29: 0.518, (('V78', 'divide', 'V9'), 'divide', ('V75', 'divide', 'V67'))
30: 0.495, (('V78', 'divide', 'V65'), 'subtract', ('V70', 'divide', 'V9'))
31: 0.463, (('V78', 'divide', 'V65'), 'add', ('V75', 'divide', 'V9'))

This includes the original features (features 0 through 9) for context. In this example, there is a steady increase in the predictive power of the features engineered.

Plotting is also provided. In the case of regression targets, the tool presents a scatter plot mapping each feature to the target. In the case of classification targets, the tool presents a boxplot, giving the distribution of a feature broken down by class label. It is often the case that the original features show little difference in distributions per class, while engineered features can show a distinct difference. For example, one feature generated, (V99 / V47) - (V81 / V5), shows a strong separation:

The separation isn't perfect, but is cleaner than with any of the original features.

This is typical of the features engineered; while each has an imperfect separation, each is strong, often far more so than the original features.

Testing was performed on synthetic and real data. The tool performed very well on the synthetic data, though this provides more debugging and testing than meaningful evaluation. For real data, a set of 80 random classification datasets from OpenML was chosen, though only those having at least two numeric features could be included, leaving 69 files. Testing consisted of performing a single train-test split on the data, then training and evaluating a model on the numeric features, both before and after engineering additional features.

Macro F1 was used as the evaluation metric, evaluating a scikit-learn DecisionTreeClassifier with and without the engineered features, setting max_leaf_nodes = 10 (corresponding to 10 induced rules) to ensure an interpretable model.

In many cases, the tool provided no improvement, or only slight improvements, in the accuracy of the shallow decision trees, as is expected. No feature engineering technique will work in all cases. More important is that the tool led to significant increases in accuracy an impressive number of times. This is without tuning or feature selection, which can further improve the utility of the tool.

Using other interpretable models will give different results, possibly stronger or weaker than was found with shallow decision trees, which did show quite strong results.

In these tests we found better results limiting max_iterations to 2 compared to 3. This is a hyperparameter, and it must be tuned for different datasets. For most datasets, using 2 or 3 works well, while with others, setting it higher, even much higher (setting it to None allows the process to continue for as long as it can produce more effective features), can work well.

In general, the time spent engineering the new features was just seconds, and in all cases it was under two minutes, even with many of the test files having hundreds of columns and many thousands of rows.

The results were:

Dataset                              Original Score    Extended Score    Improvement
isolet 0.248 0.256 0.0074
bioresponse 0.750 0.752 0.0013
micro-mass 0.750 0.775 0.0250
mfeat-karhunen 0.665 0.765 0.0991
abalone 0.127 0.122 -0.0059
cnae-9 0.718 0.746 0.0276
semeion 0.517 0.554 0.0368
vehicle 0.674 0.726 0.0526
satimage 0.754 0.699 -0.0546
analcatdata_authorship 0.906 0.896 -0.0103
breast-w 0.946 0.939 -0.0063
SpeedDating 0.601 0.608 0.0070
eucalyptus 0.525 0.560 0.0349
vowel 0.431 0.461 0.0296
wall-robot-navigation 0.975 0.975 0.0000
credit-approval 0.748 0.710 -0.0377
artificial-characters 0.289 0.322 0.0328
har 0.870 0.870 -0.0000
cmc 0.492 0.402 -0.0897
segment 0.917 0.934 0.0174
JapaneseVowels 0.573 0.686 0.1128
jm1 0.534 0.544 0.0103
gas-drift 0.741 0.833 0.0918
irish 0.659 0.610 -0.0486
profb 0.558 0.544 -0.0140
adult 0.588 0.588 0.0000
anneal 0.609 0.619 0.0104
credit-g 0.528 0.488 -0.0396
blood-transfusion-service-center 0.639 0.621 -0.0177
qsar-biodeg 0.778 0.804 0.0259
wdbc 0.936 0.947 0.0116
phoneme 0.756 0.743 -0.0134
diabetes 0.716 0.661 -0.0552
ozone-level-8hr 0.575 0.591 0.0159
hill-valley 0.527 0.743 0.2160
kc2 0.683 0.683 0.0000
eeg-eye-state 0.664 0.713 0.0484
climate-model-simulation-crashes 0.470 0.643 0.1731
spambase 0.891 0.912 0.0217
ilpd 0.566 0.607 0.0414
one-hundred-plants-margin 0.058 0.055 -0.0026
banknote-authentication 0.952 0.995 0.0430
mozilla4 0.925 0.924 -0.0009
electricity 0.778 0.787 0.0087
madelon 0.712 0.760 0.0480
scene 0.669 0.710 0.0411
musk 0.810 0.842 0.0326
nomao 0.905 0.911 0.0062
bank-marketing 0.658 0.645 -0.0134
MagicTelescope 0.780 0.807 0.0261
Click_prediction_small 0.494 0.494 -0.0001
page-blocks 0.669 0.816 0.1469
hypothyroid 0.924 0.907 -0.0161
yeast 0.445 0.487 0.0419
CreditCardSubset 0.785 0.803 0.0184
shuttle 0.651 0.514 -0.1368
Satellite 0.886 0.902 0.0168
baseball 0.627 0.701 0.0738
mc1 0.705 0.665 -0.0404
pc1 0.473 0.550 0.0770
cardiotocography 1.000 0.991 -0.0084
kr-vs-k 0.097 0.116 0.0187
volcanoes-a1 0.366 0.327 -0.0385
wine-quality-white 0.252 0.251 -0.0011
allbp 0.555 0.553 -0.0028
allrep 0.279 0.288 0.0087
dis 0.696 0.563 -0.1330
steel-plates-fault 1.000 1.000 0.0000

The model performed better with than without FormulaFeatures feature engineering in 49 out of 69 cases. Some noteworthy examples are:

  • Japanese Vowels improved from .57 to .68
  • gas-drift improved from .74 to .83
  • hill-valley improved from .52 to .74
  • climate-model-simulation-crashes improved from .47 to .64
  • banknote-authentication improved from .95 to .99
  • page-blocks improved from .66 to .81

We have looked so far primarily at shallow decision trees in this article, and have indicated that FormulaFeatures can also generate features useful for other interpretable models. But this leaves the question of their utility with more powerful predictive models. On the whole, FormulaFeatures is not as useful in combination with these tools.

For the most part, strong predictive models such as boosted tree models (e.g., CatBoost, LGBM, XGBoost) will be able to infer the patterns that FormulaFeatures captures in any case. Though they will capture these patterns in the form of large numbers of decision trees combined in an ensemble, as opposed to single features, the effect will be the same, and may often be stronger, as the trees are not limited to simple, interpretable operators (+, -, *, and /).

So, there may not be an appreciable gain in accuracy from using engineered features with strong models, even where they match the true f(x) closely. It can be worth trying FormulaFeatures in this case, and I've found it helpful with some projects, but most often the gain is minimal.

It is really with smaller (interpretable) models that tools such as FormulaFeatures become most useful.

One limitation of feature engineering based on arithmetic operations is that it can be slow where there is a very large number of original features, and it is relatively common in data science to encounter tables with hundreds of features, or more. This affects unsupervised feature engineering methods much more severely, but supervised methods can also be significantly slowed down.

In these cases, creating even pairwise engineered features can invite overfitting, as an enormous number of features can be produced, with some performing very well simply by chance.

To address this, FormulaFeatures limits the number of original columns considered when the input data has many columns. So, where datasets have large numbers of columns, only the most predictive are considered after the first iteration. Subsequent iterations perform as normal; there is simply some pruning of the original features used during this first iteration.

By default, FormulaFeatures does not incorporate unary functions, such as square, square root, or log (though it can do so if the relevant parameters are specified). As indicated above, some tools, such as AutoFeat, also optionally support these operations, and they can be valuable at times.

In some cases, a feature such as A² / B may predict the target better than the equivalent form without the square operator, A / B. However, including unary operators can lead to misleading features if they are not substantially correct, and may not significantly increase the accuracy of any models using them.

When working with decision trees, so long as there is a monotonic relationship between the features with and without the unary functions, there will not be any change in the final accuracy of the model. And most unary functions maintain the rank order of the values (with exceptions such as sin and cos, which may reasonably be used where cyclical patterns are strongly suspected). For example, the values in A will have the same rank order as those in A² (assuming all values in A are positive), so squaring will not add any predictive power: decision trees will treat the features equivalently.
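A quick sketch demonstrating this point (synthetic data; A is strictly positive, so squaring preserves the rank order):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
A = rng.uniform(1, 100, 500)
y = (A > 40).astype(int)

dt_a = DecisionTreeClassifier(max_depth=2, random_state=0).fit(A.reshape(-1, 1), y)
dt_a2 = DecisionTreeClassifier(max_depth=2, random_state=0).fit((A ** 2).reshape(-1, 1), y)

# Identical predictions: the tree only cares about the ordering of the values
print((dt_a.predict(A.reshape(-1, 1)) == dt_a2.predict((A ** 2).reshape(-1, 1))).all())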

As well, in terms of explanatory power, simpler functions can often capture nearly as much of the pattern as more complex functions: a simpler function such as A / B is generally more comprehensible than a formula such as A² / B, but still conveys the same idea, that it is the ratio of the two features that is relevant.

Limiting the set of operators used by default also allows the process to execute faster and in a more regularized manner.

A similar argument can be made for including coefficients in engineered features. A feature such as 5.3A + 1.4B may capture the relationship A and B have with Y better than the simpler A + B, but the coefficients are often unnecessary, liable to be calculated incorrectly, and inscrutable even where approximately correct.

And, in the case of multiplication and division operations, the coefficients are most likely irrelevant (at least when used with decision trees). For example, 5.3A * 1.4B will be functionally equivalent to A * B for most purposes, as the difference is a constant which can be divided out. Again, there is a monotonic relationship with and without the coefficients, and thus the features are equivalent when used with models, such as decision trees, that are concerned only with the ordering of feature values, not their specific values.

Scaling the features generated by FormulaFeatures is not necessary if they are used with decision trees (or similar model types such as Additive Decision Trees, rules, or decision tables). But for some model types, such as SVMs, kNN, ikNN, logistic regression, and others (including any that work based on distance calculations between points), the features engineered by FormulaFeatures may be on quite different scales than the original features, and will need to be scaled. This is easy to do, and is simply a point to remember.
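For example, a scaling step can simply be added in a pipeline ahead of the distance-based model (a sketch; x_train_extended and related variables are as in the examples above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the engineered features before a distance-based model such as kNN
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
knn.fit(x_train_extended, y_train)
print(knn.score(x_test_extended, y_test))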

In this article, we have looked at interpretable models, but it is worth pointing out, at least quickly, that FormulaFeatures can also be useful for what are called explainable models, and it may be that this is actually a more important application.

To explain the idea of explainability: where it is difficult or impossible to create interpretable models with sufficient accuracy, we often instead develop black-box models (e.g. boosted models or neural networks), and then create post-hoc explanations of the model. Doing this is known as explainable AI (or XAI). These explanations attempt to make the black boxes more understandable. Techniques for this include: feature importances, ALE plots, proxy models, and counterfactuals.

These can be important tools in many contexts, but they are limited, in that they can provide only an approximate understanding of the model. As well, they may not be permissible in all environments: in some situations (for example, for safety, or for regulatory compliance), it can be necessary to strictly use interpretable models: that is, to use models where there are no questions about how the model behaves.

And, even where not strictly required, it is very often preferable to use an interpretable model where possible: it is often very useful to have a good understanding of the model and of the predictions made by the model.

Having said that, using black-box models and post-hoc explanations is very often the most suitable option for prediction problems. As FormulaFeatures produces valuable features, it can support XAI, potentially making feature importances, plots, proxy models, or counterfactuals more interpretable.

For example, it may not be feasible to use a shallow decision tree as the actual model, but it may be used as a proxy model: a simple, interpretable model that approximates the actual model. In these cases, as much as with interpretable models, having a good set of engineered features can make the proxy models more interpretable and better able to capture the behaviour of the actual model.

The tool uses a single .py file, which may simply be downloaded and used. It has no dependencies other than numpy, pandas, matplotlib, and seaborn (the latter two are used to plot the features generated).

FormulaFeatures is a tool to engineer features based on arithmetic relationships between numeric features. The features can be informative in themselves, but are particularly useful when used with interpretable ML models.

While this tends not to improve the accuracy of all models, it does very often improve the accuracy of interpretable models such as shallow decision trees.

Consequently, it can be a useful tool for making it more feasible to use interpretable models for prediction problems: it can allow the use of interpretable models for problems that would otherwise be limited to black-box models. And where interpretable models are used, it can allow these to be more accurate or more interpretable. For example, with a classification decision tree, we may be able to achieve similar accuracy using fewer nodes, or be able to achieve higher accuracy using the same number of nodes.

FormulaFeatures can quite often support interpretable ML well, but there are some limitations. It does not work with categorical or other non-numeric features. And, even with numeric features, some interactions may be difficult to capture using arithmetic functions. Where there is a more complex relationship between pairs of features and the target column, it may be more appropriate to use ikNN. This works based on nearest neighbours, so can capture relationships of arbitrary complexity between features and the target.

We focused on standard decision trees in this article, but for the most effective interpretable ML, it can be useful to try other interpretable models. It is straightforward to see, for example, how the ideas here apply directly to Genetic Decision Trees, which are similar to standard decision trees, simply created using bootstrapping and a genetic algorithm. The same holds for most other interpretable models.

All images are by the author.
