
Introduction to Ensemble Methods


An ensemble method is a powerful technique that allows you to combine multiple machine learning models into a single model, which often performs better than any of the individual models alone.

This article explains the rationale behind ensemble methods, reviews the different types of ensemble methods, and provides a few examples.

One might wonder why a combination of several models produces a model that is better than each of the individual models.

For example, if you ask two people the same question, and each has only a 70% chance of giving you the correct answer, can you, by combining their answers, get the correct answer more than 70% of the time? The short answer is: it depends on whether or not their answers are correlated.

Assume that we have an ensemble of n classifiers, each of which has an error rate of ε, i.e., the probability that a classifier will make a wrong prediction on a given sample is ε. We also assume that the ensemble uses majority voting on the classifiers' predictions to make its final prediction.

Let's try to compute the error rate of the ensemble in this case. First, we denote the number of classifiers that made a wrong prediction by k (out of n classifiers).

If the base classifiers are independent (i.e., their errors are uncorrelated), then the variable k has a binomial distribution k ~ B(n, ε), since each prediction can be treated as the outcome of an experiment with two possible results (correct or incorrect prediction) out of n independent experiments.

The ensemble will make a wrong prediction only if at least half of the base classifiers are wrong (since it uses majority voting). Therefore, according to the binomial probability distribution, the error rate of the ensemble is:

ε_ensemble = P(k ≥ ⌈n/2⌉) = Σ_{i = ⌈n/2⌉}^{n} C(n, i) · ε^i · (1 − ε)^(n − i)

For example, if we have 25 base classifiers, each with an error rate of ε = 0.25, then the error rate of the ensemble is:

ε_ensemble = Σ_{i = 13}^{25} C(25, i) · 0.25^i · 0.75^(25 − i) ≈ 0.0034

The ensemble’s error rate becomes much smaller than 0.25!
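To double-check this number, here is a minimal sketch (assuming SciPy is available) that computes the same binomial tail probability directly:

from scipy.stats import binom

n = 25      # number of base classifiers
eps = 0.25  # error rate of each base classifier

# The ensemble errs only when at least 13 of the 25 classifiers err,
# which is the survival function of Binomial(25, 0.25) evaluated at k = 12
ensemble_error = binom.sf(12, n, eps)
print(f'{ensemble_error:.4f}')  # ~0.0034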

We can actually plot a graph of the ensemble error rate as a function of the base error rate ε (assuming that we have 25 base classifiers):
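A minimal sketch of how such a plot could be generated (again assuming SciPy and Matplotlib) is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

n = 25
base_error = np.linspace(0, 1, 101)

# Error rate of a majority-voting ensemble of 25 independent base classifiers
ensemble_error = binom.sf(12, n, base_error)

plt.plot(base_error, base_error, '--', label='Base classifier')
plt.plot(base_error, ensemble_error, label='Ensemble')
plt.xlabel(r'Base error rate $\epsilon$')
plt.ylabel('Error rate')
plt.legend()
plt.show()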

We can see that the turning point is ε = 0.5: when the error rate ε of each classifier is lower than 0.5, the ensemble performs better than the individual classifiers. However, when the error rate ε is higher than 0.5, the ensemble performs worse than the base classifiers.

From the above discussion we can learn that there are two conditions under which the ensemble performs better than the individual classifiers:

  1. The base classifiers should be independent of one another. In practice, it is difficult to ensure total independence between the classifiers. However, practice has shown that even when the classifiers are partially correlated, the ensemble can still perform better than any one of them.
  2. Each base classifier must have an error rate lower than 0.5, i.e., it should perform better than a random guesser. A model that performs only slightly better than a random guesser is called a weak learner. As we have just shown, an ensemble of many weak learners can become a strong learner.

There are various ways in which we can create the ensemble:

  1. By using different learning algorithms to train each base model.
  2. By manipulating the training set, i.e., each model is trained on a different part of the training set.
  3. By manipulating the input features, i.e., each model is trained on a different subset of the input features.

A straightforward way to aggregate the predictions of the base models in the ensemble is by using voting. Each base model makes a prediction and casts a vote for each sample. Then the ensemble returns the class with the highest number of votes.

There are two types of voting:

  1. Majority voting (hard voting) — returns the label that represents the majority of the labels predicted by the base classifiers.
  2. Soft voting — returns the label with the highest total probability, computed as the sum of the prediction probabilities of the base classifiers.

To demonstrate soft voting, let's assume that we have 3 classes and an ensemble of three classifiers, and consider the probabilities each classifier predicts for the three classes.

In this case, the ensemble will predict that the label belongs to class 2, since it has the highest total probability.
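As an illustration, here is a minimal sketch of the soft-voting computation; the probability values are hypothetical (not the exact numbers from the original example), but they produce the same outcome:

import numpy as np

# Hypothetical predicted probabilities for classes 1, 2, and 3
# (one row per base classifier)
probs = np.array([
    [0.4, 0.5, 0.1],   # classifier 1
    [0.2, 0.6, 0.2],   # classifier 2
    [0.5, 0.3, 0.2],   # classifier 3
])

# Soft voting: sum the probabilities per class and pick the largest total
totals = probs.sum(axis=0)
print(totals)                 # [1.1 1.4 0.5]
print(np.argmax(totals) + 1)  # class 2 wins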

Soft voting often achieves better results than hard voting, since it gives more weight to highly confident votes. However, soft voting is only possible if all the base classifiers provide probability estimates for the class labels (some algorithms, such as SVMs and perceptrons, only provide hard labels).

Scikit-Learn provides two classes for creating a voting ensemble:

  1. VotingClassifier is used for classification problems. It supports both hard (majority) and soft voting.
  2. VotingRegressor is used for regression problems. It averages the predictions of the base regressors to form the final prediction (see the sketch below).
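For completeness, here is a minimal, hypothetical sketch of VotingRegressor usage on a toy data set generated with make_regression (this is not part of the original example):

from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy regression data set
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# The ensemble averages the predictions of the base regressors
reg_ensemble = VotingRegressor([('lr', LinearRegression()),
                                ('knn', KNeighborsRegressor()),
                                ('dt', DecisionTreeRegressor(random_state=0))])
reg_ensemble.fit(X_reg, y_reg)
print(reg_ensemble.predict(X_reg[:3]))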

For example, let's demonstrate how to use the VotingClassifier.

First, we'll use the function make_moons() to generate a toy data set for the classification task. This data set consists of two interleaving half circles:

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

Let's visualize the data:
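A minimal sketch of how the data could be visualized (assuming Matplotlib and Seaborn, which are also used later for the decision-boundary plots):

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, style=y,
                palette=['r', 'b'], markers=('s', 'o'), edgecolor='black')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.show()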

Data set for the classification task

Next, we define our base classifiers. In this example we'll use three classifiers: Logistic Regression, Gaussian Naive Bayes, and Decision Tree.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

clf1 = LogisticRegression()
clf2 = GaussianNB()
clf3 = DecisionTreeClassifier()

We now create our VotingClassifier, which receives in its constructor a list of the base classifiers and the voting method (the default is 'hard'):

from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier([('log', clf1), ('gnb', clf2), ('dt', clf3)],
                            voting='soft')

Next, we use cross-validation with 5 folds to evaluate both the base classifiers and the ensemble:

from sklearn.model_selection import cross_val_score

names = ['Logistic Regression', 'Gaussian NB', 'Decision Tree', 'Ensemble']
classifiers = [clf1, clf2, clf3, ensemble]

for clf, name in zip(classifiers, names):
    scores = cross_val_score(clf, X, y, cv=5)
    print(f'Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f}) [{name}]')

And the results are:

Accuracy: 0.830 (+/- 0.051) [Logistic Regression]
Accuracy: 0.840 (+/- 0.046) [Gaussian NB]
Accuracy: 0.840 (+/- 0.041) [Decision Tree]
Accuracy: 0.885 (+/- 0.037) [Ensemble]

We can see that the ensemble achieved a better result than all three base classifiers. In addition, it had less variance across the 5 different training folds (and thus overfit the training set less).

Let's also plot the decision boundaries of the base classifiers and the ensemble:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

def plot_decision_boundaries(clf, X, y, feature_names, ax, title, h=0.02):
    colors = ['r', 'b']
    cmap = ListedColormap(colors)

    # Assign a color to every point in the mesh [x_min, x_max]x[y_min, y_max]
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4, cmap=cmap)

    # Plot also the sample points
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, style=y,
                    palette=colors, markers=('s', 'o'), edgecolor='black', ax=ax)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    ax.set_title(title)
    ax.legend()

feature_names = ['$x_1$', '$x_2$']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flat
k = 0

for clf, name in zip(classifiers, names):
    clf.fit(X, y)
    plot_decision_boundaries(clf, X, y, feature_names, ax=axes[k], title=name)
    k += 1

And the result is:

The decision boundaries of the three base classifiers and the ensemble

We can see that the ensemble combined the decisions of the three base classifiers, which caused its decision boundary to become smoother and less overfit to the data.

Another way to build an ensemble is bagging. In bagging, we train the same base model K times, but each time on a different random subset of the training set. Then we use voting to aggregate the predictions of the K models to form the ensemble's prediction.

Since we're using the same algorithm to train all K models, this allows us to create much larger ensembles than in simple voting, where each model uses a different algorithm.

Random Forest is a popular machine learning model based on bagging, where the base classifiers are decision trees.
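As a brief, hypothetical illustration (not part of the original example), here is a minimal sketch of a bagging ensemble and a random forest evaluated on the same X and y from the moons data set above:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bagging: 100 decision trees, each trained on a bootstrap sample of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())

# Random forest: bagging of decision trees with random feature subsets at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())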

In boosting, the base models are trained sequentially. Samples that are misclassified by one model are assigned a greater weight when used to train the next model. Each model is thereby focused on examples that were misclassified by the previous ones.

The predictions of the models are combined through a weighted majority vote, where the weights are based on how well each model performed on the training set.

Popular boosting methods include AdaBoost and XGBoost.
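As a quick, hypothetical taste before those future articles, here is a minimal sketch that evaluates Scikit-Learn's AdaBoostClassifier on the same X and y from above:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# AdaBoost with decision stumps (the default base estimator) as weak learners
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())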

Bagging and boosting methods will be covered in more detail in future articles.

You can find the code examples of this article on my GitHub: https://github.com/roiyeho/medium/tree/main/ensembles

Thanks for reading!
