
Interpreting Random Forests


There's a lot of hype about Large Language Models nowadays, but it doesn't mean that old-school ML approaches now deserve extinction. I doubt that ChatGPT would be helpful if you give it a dataset with a bunch of numeric features and ask it to predict a target value.

Neural Networks are often the best solution for unstructured data (for example, texts, images or audio). But for tabular data, we can still benefit from the good old Random Forest.

The most significant benefits of Random Forest algorithms are the following:

  • You need to do very little data preprocessing.
  • It's rather difficult to screw up with Random Forests. You won't face overfitting issues if you have enough trees in your ensemble, since adding more trees decreases the error.
  • It's easy to interpret the results.

That's why Random Forest could be a good candidate for your first model when starting a new task with tabular data.

In this article, I would like to cover the basics of Random Forests and go through approaches to interpreting model results.

We'll learn how to find answers to the following questions:

  • What features are important, and which ones are redundant and can be removed?
  • How does each feature value affect our target metric?
  • What are the factors behind each prediction?
  • How to estimate the confidence of each prediction?

We will be using the Wine Quality dataset. It shows the relation between wine quality and physicochemical tests for the different Portuguese "Vinho Verde" wine variants. We will try to predict wine quality based on wine characteristics.
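A minimal loading sketch (assuming the red and white wine files from the UCI repository have already been merged into a single wine_quality.csv with an extra type column — the file name here is a placeholder, adjust it to your setup):

import pandas as pd

# hypothetical path: the UCI red and white wine CSVs merged with a 'type' column
df = pd.read_csv('wine_quality.csv')

print(df.shape)
df.head()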

With decision trees, we don't need to do a lot of preprocessing:

  • We don't need to create dummy variables since the algorithm handles them automatically.
  • We don't need to do normalisation or get rid of outliers because only the ordering matters. So, Decision Tree based models are robust to outliers.

However, the scikit-learn implementation of Decision Trees can't work with categorical variables or Null values, so we have to handle them ourselves.

Fortunately, there aren’t any missing values in our dataset.

df.isna().sum().sum()

0

And we only need to transform the type variable ('red' or 'white') from string to integer. We can use the pandas Categorical transformation for it.

categories = {}
cat_columns = ['type']
for p in cat_columns:
    df[p] = pd.Categorical(df[p])
    categories[p] = df[p].cat.categories

df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
print(categories)

{'type': Index(['red', 'white'], dtype='object')}

Now, df['type'] equals 0 for red wines and 1 for white wines.

The other crucial part of preprocessing is to split our dataset into train and validation sets, so we can use the validation set to assess our model's quality.

import sklearn.model_selection

train_df, val_df = sklearn.model_selection.train_test_split(df,
                                                            test_size=0.2)

train_X, train_y = train_df.drop(['quality'], axis = 1), train_df.quality
val_X, val_y = val_df.drop(['quality'], axis = 1), val_df.quality

print(train_X.shape, val_X.shape)

(5197, 12) (1300, 12)

We've finished the preprocessing step and are ready to move on to the most exciting part — training models.

Before jumping into training, let's spend some time understanding how Random Forests work.

Random Forest is an ensemble of Decision Trees. So, we should start with the elementary building block — the Decision Tree.

In our example of predicting wine quality, we will be solving a regression task, so let's start with it.

Decision Tree: Regression

Let’s fit a default decision tree model.

import sklearn.tree
import graphviz

model = sklearn.tree.DecisionTreeRegressor(max_depth=3)
# I've limited max_depth mostly for visualization purposes

model.fit(train_X, train_y)

One of the most significant benefits of Decision Trees is that we can easily interpret these models — it's just a set of questions. Let's visualise the tree.


dot_data = sklearn.tree.export_graphviz(model, out_file=None,
                                        feature_names=train_X.columns,
                                        filled=True)

graph = graphviz.Source(dot_data)

# saving tree to png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png', 'wb') as f:
    f.write(png_bytes)

Graph by author

As you can see, the Decision Tree consists of binary splits. At each node, we split our dataset in two.

Finally, we calculate predictions for the leaf nodes as an average of all data points in the node.

Side note: Because a Decision Tree returns an average of all data points for a leaf node, Decision Trees are pretty bad at extrapolation. So, you need to keep an eye on the feature distributions during training and inference.
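A quick toy illustration of this limitation (not part of the wine example): a tree trained on a linear relation cannot predict targets above the maximum it has seen during training.

import numpy as np
import sklearn.tree

# toy data: a perfectly linear relation y = x on the range [0, 100)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

toy_tree = sklearn.tree.DecisionTreeRegressor().fit(X_toy, y_toy)

# the prediction stays near 99 instead of 200 - trees can't extrapolate
print(toy_tree.predict([[200]]))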

Let's brainstorm how to find the best split for our dataset. We can start with one variable and define the optimal split for it.

Suppose we have a feature with four unique values: 1, 2, 3 and 4. Then, there are three possible thresholds between them.

Graph by author

We can take each threshold in turn and calculate predicted values for our data as the average value of the leaf nodes. Then, we can use these predicted values to get the MSE (Mean Square Error) for each threshold. The best split will be the one with the lowest MSE. By default, DecisionTreeRegressor from scikit-learn works similarly and uses MSE as a criterion.

Let's calculate the best split for the sulphates feature manually to better understand how it works.

def get_binary_split_for_param(param, X, y):
    uniq_vals = list(sorted(X[param].unique()))

    tmp_data = []

    for i in range(1, len(uniq_vals)):
        threshold = 0.5 * (uniq_vals[i-1] + uniq_vals[i])

        # split dataset by threshold
        split_left = y[X[param] <= threshold]
        split_right = y[X[param] > threshold]

        # calculate predicted values for each split
        pred_left = split_left.mean()
        pred_right = split_right.mean()

        num_left = split_left.shape[0]
        num_right = split_right.shape[0]

        mse_left = ((split_left - pred_left) * (split_left - pred_left)).mean()
        mse_right = ((split_right - pred_right) * (split_right - pred_right)).mean()
        mse = mse_left * num_left / (num_left + num_right) \
            + mse_right * num_right / (num_left + num_right)

        tmp_data.append(
            {
                'param': param,
                'threshold': threshold,
                'mse': mse
            }
        )

    return pd.DataFrame(tmp_data).sort_values('mse')

get_binary_split_for_param('sulphates', train_X, train_y).head(5)

| param | threshold | mse |
|:----------|------------:|---------:|
| sulphates | 0.685 | 0.758495 |
| sulphates | 0.675 | 0.758794 |
| sulphates | 0.705 | 0.759065 |
| sulphates | 0.715 | 0.759071 |
| sulphates | 0.635 | 0.759495 |

We can see that for sulphates, the best threshold is 0.685, as it gives the lowest MSE.

Now, we can use this function for all the features to find the best overall split.

def get_binary_split(X, y):
    tmp_dfs = []
    for param in X.columns:
        tmp_dfs.append(get_binary_split_for_param(param, X, y))

    return pd.concat(tmp_dfs).sort_values('mse')

get_binary_split(train_X, train_y).head(5)

| param | threshold | mse |
|:--------|------------:|---------:|
| alcohol | 10.625 | 0.640368 |
| alcohol | 10.675 | 0.640681 |
| alcohol | 10.85 | 0.641541 |
| alcohol | 10.725 | 0.641576 |
| alcohol | 10.775 | 0.641604 |

We got exactly the same result as our initial decision tree, with the first split on alcohol <= 10.625.

To build the whole Decision Tree, we could recursively calculate the best splits for each of the datasets alcohol <= 10.625 and alcohol > 10.625 to get the next level of the Decision Tree. Then, repeat.
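Here's a rough sketch of that recursion, reusing the get_binary_split function from above (illustrative only: it assumes every node still has more than one unique value per feature and skips many details the scikit-learn implementation handles).

def build_tree(X, y, max_depth=3, depth=0):
    # stop when we reach the maximum depth or the node is too small to split
    if depth >= max_depth or X.shape[0] < 2:
        return {'prediction': y.mean()}

    # the best split overall is the first row of the sorted DataFrame
    best = get_binary_split(X, y).iloc[0]
    mask = X[best.param] <= best.threshold

    return {
        'param': best.param,
        'threshold': best.threshold,
        'left': build_tree(X[mask], y[mask], max_depth, depth + 1),
        'right': build_tree(X[~mask], y[~mask], max_depth, depth + 1),
    }

tree = build_tree(train_X, train_y, max_depth=2)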

The stopping criterion for the recursion could be either the depth or the minimal size of the leaf node. Here's an example of a Decision Tree with at least 420 items in the leaf nodes.

model = sklearn.tree.DecisionTreeRegressor(min_samples_leaf=420)
model.fit(train_X, train_y)

Graph by author

Let's calculate the mean absolute error on the validation set to understand how good our model is. I prefer MAE over MSE (Mean Squared Error) because it's less affected by outliers.

import sklearn.metrics
print(sklearn.metrics.mean_absolute_error(model.predict(val_X), val_y))
0.5890557338155006

Decision Tree: Classification

We've looked at the regression example. In the case of classification, it's a bit different. Even though we won't go deep into classification examples in this article, it's still worth discussing the basics.

For classification, instead of the average value, we use the most common class as the prediction for each leaf node.

We usually use the Gini coefficient to estimate the quality of a binary split for classification. Imagine picking one random item from the sample and then another. The Gini coefficient equals the probability that these two items belong to different classes.

Let's say we have only two classes, and the share of items from the first class equals p. Then we can calculate the Gini coefficient using the following formula:

gini = 1 - p² - (1 - p)² = 2p(1 - p)

If our classification model is perfect, the Gini coefficient equals 0. In the worst case (p = 0.5), the Gini coefficient equals 0.5.

To calculate the metric for a binary split, we calculate the Gini coefficients for both parts (left and right) and weight them by the number of samples in each partition.

Then, we can similarly calculate our optimisation metric for different thresholds and choose the best option.
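As a sketch, here's how the weighted Gini for one candidate threshold could be computed (assuming a hypothetical binary target y_class with values 0/1 — our wine task is regression, so this is purely illustrative):

def gini(y_class):
    # probability that two randomly picked items belong to different classes
    p = y_class.mean()  # share of the first class
    return 2 * p * (1 - p)

def weighted_gini(param, threshold, X, y_class):
    left = y_class[X[param] <= threshold]
    right = y_class[X[param] > threshold]
    n = len(left) + len(right)
    return gini(left) * len(left) / n + gini(right) * len(right) / n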

We've trained a simple Decision Tree model and discussed how it works. Now, we're ready to move on to Random Forests.

Random Forests are based on the concept of Bagging. The idea is to fit a bunch of independent models and average their predictions. Since the models are independent, their errors are not correlated. We assume that our models have no systematic errors, so the average of many errors should be close to zero.

How can we get a lot of independent models? It's pretty straightforward: we can train Decision Trees on random subsets of rows and features. That will be a Random Forest.
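To make the idea more concrete, here's a hand-rolled sketch of bagging (unlike scikit-learn's RandomForestRegressor, it only subsamples rows and doesn't subsample features at each split):

import numpy as np
import sklearn.tree

n_trees = 100
trees = []
for _ in range(n_trees):
    # bootstrap sample: random rows drawn with replacement
    idx = np.random.choice(len(train_X), size=len(train_X), replace=True)
    dt = sklearn.tree.DecisionTreeRegressor(min_samples_leaf=100)
    dt.fit(train_X.iloc[idx], train_y.iloc[idx])
    trees.append(dt)

# the ensemble prediction is the average of the individual trees
bagging_preds = np.mean([dt.predict(val_X) for dt in trees], axis=0)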

Let's train a basic Random Forest with 100 trees and a minimum leaf node size of 100.

import sklearn.ensemble
import sklearn.metrics

model = sklearn.ensemble.RandomForestRegressor(100, min_samples_leaf=100)
model.fit(train_X, train_y)

print(sklearn.metrics.mean_absolute_error(model.predict(val_X), val_y))
0.5592536196736408

With a Random Forest, we've achieved significantly better quality than with a single Decision Tree: 0.5592 vs. 0.5891.

Overfitting

The meaningful question is whether a Random Forest can overfit.

Actually, no. Since we are averaging uncorrelated errors, we cannot overfit the model by adding more trees. Quality improves asymptotically as the number of trees increases.
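A curve like the one below can be reproduced with a quick sketch along these lines (retraining from scratch for each ensemble size is wasteful, but it keeps the code simple):

import sklearn.ensemble
import sklearn.metrics

for n_trees in [1, 5, 10, 50, 100, 200]:
    m = sklearn.ensemble.RandomForestRegressor(n_trees, min_samples_leaf=100)
    m.fit(train_X, train_y)
    mae = sklearn.metrics.mean_absolute_error(m.predict(val_X), val_y)
    print(n_trees, round(mae, 4))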

Graph by author

However, you may face overfitting if you have deep trees and not enough of them. It's easy to overfit a single Decision Tree.

Out-of-bag error

Since only part of the rows is used for each tree in a Random Forest, we can use the rest to estimate the error. For each row, we can select only the trees where this row wasn't used and make predictions with them. Then, we can calculate errors based on these predictions. Such an approach is called the "out-of-bag error".

We can see that the OOB error is much closer to the error on the validation set than the training error, which means it's a good approximation.

# we need to specify oob_score=True to be able to calculate the OOB error
model = sklearn.ensemble.RandomForestRegressor(100, min_samples_leaf=100,
                                               oob_score=True)

model.fit(train_X, train_y)

# error for validation set
print(sklearn.metrics.mean_absolute_error(model.predict(val_X), val_y))
0.5592536196736408

# error for training set
print(sklearn.metrics.mean_absolute_error(model.predict(train_X), train_y))
0.5430398596179975

# out-of-bag error
print(sklearn.metrics.mean_absolute_error(model.oob_prediction_, train_y))
0.5571191870008492

As I mentioned at the beginning, the big advantage of Decision Trees is that they're easy to interpret. Let's try to understand our model better.

Feature importances

The calculation of feature importance is pretty straightforward. We look at each binary split of each decision tree in the ensemble and calculate its impact on our metric (squared_error in our case).

Let's look at the first split by alcohol for one of our initial decision trees.

Then, we can do the same calculation for all binary splits in all decision trees, add everything up, normalize and get the relative importance of each feature.
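For the curious, this is roughly what that calculation looks like against scikit-learn's internal tree arrays (a sketch; scikit-learn additionally normalizes the importances per tree before averaging, so the numbers can differ slightly from model.feature_importances_):

import numpy as np

def manual_importances(forest, n_features):
    totals = np.zeros(n_features)
    for dt in forest.estimators_:
        t = dt.tree_
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left == -1:  # leaf node, no split here
                continue
            # impurity decrease of this split, weighted by the number of samples
            decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                        - t.weighted_n_node_samples[left] * t.impurity[left]
                        - t.weighted_n_node_samples[right] * t.impurity[right])
            totals[t.feature[node]] += decrease
    return totals / totals.sum()

print(manual_importances(model, train_X.shape[1]))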

If you use scikit-learn, you don't need to calculate feature importance manually. You can just take model.feature_importances_.

import plotly.express as px

def plot_feature_importance(model, names, threshold=None):
    feature_importance_df = pd.DataFrame.from_dict({'feature_importance': model.feature_importances_,
                                                    'feature': names}) \
        .set_index('feature').sort_values('feature_importance', ascending=False)

    if threshold is not None:
        feature_importance_df = feature_importance_df[feature_importance_df.feature_importance > threshold]

    fig = px.bar(
        feature_importance_df,
        text_auto='.2f',
        labels={'value': 'feature importance'},
        title='Feature importances'
    )

    fig.update_layout(showlegend=False)
    fig.show()

plot_feature_importance(model, train_X.columns)

We can see that the most important features overall are alcohol and volatile acidity.

Graph by author

Understanding how each feature affects our target metric is exciting and often useful. For example, does quality increase or decrease with higher alcohol, or is there a more complex relation?

We could just take the data from our dataset and plot averages by alcohol, but it wouldn't be correct since there might be some correlations. For example, higher alcohol in our dataset could also correspond to higher sugar and higher quality.

To estimate the impact of alcohol alone, we can take all the rows in our dataset and, using the ML model, predict the quality for each row for different values of alcohol: 9, 9.1, 9.2, etc. Then, we can average the results and get the actual relation between the alcohol level and wine quality. So, all the data stays the same, and we only vary the alcohol level.

This approach can be used with any ML model, not just Random Forest.
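Here's a sketch of this "vary one feature, keep everything else fixed" idea for alcohol (the sklearn.inspection call below does essentially the same):

import numpy as np

alcohol_grid = np.arange(9, 14, 0.1)
avg_predictions = []

for value in alcohol_grid:
    tmp = train_X.copy()
    tmp['alcohol'] = value  # set the same alcohol level for every row
    avg_predictions.append(model.predict(tmp).mean())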

We can use the sklearn.inspection module to easily plot these relations.

import sklearn.inspection

sklearn.inspection.PartialDependenceDisplay.from_estimator(model, train_X,
                                                           range(12))

We can gain quite a lot of insights from these graphs, for example:

  • wine quality increases with the growth of free sulfur dioxide up to 30, but it's stable after this threshold;
  • with alcohol, the higher the level, the better the quality.

We can also look at the relation between two variables. It can be pretty complex. For example, if the alcohol level is above 11.5, volatile acidity has no effect. But for lower alcohol levels, volatile acidity significantly impacts quality.

sklearn.inspection.PartialDependenceDisplay.from_estimator(model, train_X,
                                                           [(1, 10)])

Confidence of predictions

Using Random Forests, we can also assess how confident each prediction is. For that, we can calculate predictions from each tree in the ensemble and look at their variance or standard deviation.

import numpy as np

val_df['predictions_mean'] = np.stack([dt.predict(val_X.values)
                                       for dt in model.estimators_]).mean(axis=0)
val_df['predictions_std'] = np.stack([dt.predict(val_X.values)
                                      for dt in model.estimators_]).std(axis=0)

ax = val_df.predictions_std.hist(bins=10)
ax.set_title('Distribution of predictions std')

We can see that there are predictions with a low standard deviation (i.e. below 0.15) and ones with a std above 0.3.

If we use the model for business purposes, we can treat such cases differently. For example, we could ignore a prediction if its std is above X, or show the customer an interval instead (e.g. the 25th and 75th percentiles).
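For example, per-row intervals can be built from the same per-tree predictions (a sketch; the 0.3 cut-off is arbitrary):

import numpy as np

tree_preds = np.stack([dt.predict(val_X.values) for dt in model.estimators_])

# 25th and 75th percentiles across trees as a rough prediction interval
val_df['predictions_p25'] = np.percentile(tree_preds, 25, axis=0)
val_df['predictions_p75'] = np.percentile(tree_preds, 75, axis=0)

# flag predictions we are not confident about
val_df['low_confidence'] = val_df.predictions_std > 0.3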

How was each prediction made?

We can also use the treeinterpreter and waterfallcharts packages to understand how each prediction was made. This can be handy in some business cases, for example, when you need to tell customers why their credit application was rejected.

We will look at one of the wines as an example. It has relatively low alcohol and high volatile acidity.

from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

row = val_X.iloc[[7]]
prediction, bias, contributions = treeinterpreter.predict(model, row.values)

waterfall(val_X.columns, contributions[0], threshold=0.03,
          rotation_value=45, formatting='{:,.3f}');

The graph shows that this wine is better than average. The main factor that increases quality is a low level of volatile acidity, while the main drawback is a low level of alcohol.

Graph by author

So, there are many handy tools that can help you understand your data and model much better.

The other cool feature of Random Forest is that we can use it to reduce the number of features for any tabular data. You can quickly fit a Random Forest and get a list of meaningful columns in your data.

More data doesn't always mean better quality. Also, it can affect your model's performance during training and inference.

Since our initial wine dataset has only 12 features, in this case we will use a somewhat bigger dataset — Online News Popularity.

Feature importance

First, let's build a Random Forest and look at the feature importances. 34 out of 59 features have an importance lower than 0.01.

Let's try to remove them and look at the accuracy.
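The filtering below relies on a feature_importance_df built the same way as inside plot_feature_importance (assuming a Random Forest has already been fitted on the news dataset as model, with the features in train_X):

feature_importance_df = pd.DataFrame({
    'feature_importance': model.feature_importances_,
    'feature': train_X.columns,
}).set_index('feature')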

low_impact_features = feature_importance_df[feature_importance_df.feature_importance <= 0.01].index.values

train_X_imp = train_X.drop(low_impact_features, axis=1)
val_X_imp = val_X.drop(low_impact_features, axis=1)

model_imp = sklearn.ensemble.RandomForestRegressor(100, min_samples_leaf=100)
model_imp.fit(train_X_imp, train_y)

  • MAE on the validation set for all features: 2969.73
  • MAE on the validation set for the 25 important features: 2975.61

The difference in quality is not that big, but we can make our model faster in the training and inference stages. We've already removed almost 60% of the initial features — good job.

Redundant features

For the remaining features, let's see whether there are redundant (highly correlated) ones. For that, we will use a Fast.AI tool:

import fastbook
fastbook.cluster_columns(train_X_imp)

We can see that the following features are close to each other:

  • self_reference_avg_sharess and self_reference_max_shares
  • kw_min_avg and kw_min_max
  • n_non_stop_unique_tokens and n_unique_tokens.

Let’s remove them as well.

non_uniq_features = ['self_reference_max_shares', 'kw_min_max',
                     'n_unique_tokens']
train_X_imp_uniq = train_X_imp.drop(non_uniq_features, axis=1)
val_X_imp_uniq = val_X_imp.drop(non_uniq_features, axis=1)

model_imp_uniq = sklearn.ensemble.RandomForestRegressor(100,
                                                        min_samples_leaf=100)
model_imp_uniq.fit(train_X_imp_uniq, train_y)
sklearn.metrics.mean_absolute_error(model_imp_uniq.predict(val_X_imp_uniq),
                                    val_y)
2974.853274034488

The quality even improved a little bit. So, we've reduced the number of features from 59 to 22 and increased the error by only 0.17%. It proves that this approach works.

You can find the full code on GitHub.

In this article, we've discussed how the Decision Tree and Random Forest algorithms work. Also, we've learned how to interpret Random Forests:

  • How to use feature importance to get the list of the most significant features and reduce the number of parameters in your model.
  • How to define the effect of each feature value on the target metric using partial dependence.
  • How to estimate the impact of different features on each prediction using the treeinterpreter library.

Thanks a lot for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.

Datasets

  • Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Wine Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T
  • Fernandes, K., Vinagre, P., Cortez, P., and Sernadela, P. (2015). Online News Popularity. UCI Machine Learning Repository. https://doi.org/10.24432/C5NS3V

Sources

This article was inspired by the Fast.AI Deep Learning Course.
