ENSEMBLE LEARNING
Everyone makes mistakes, even the simplest decision trees in machine learning. Instead of ignoring them, the AdaBoost (Adaptive Boosting) algorithm does something different: it learns (or adapts) from these mistakes to get better.
Unlike Random Forest, which builds many trees at once, AdaBoost starts with a single, simple tree and identifies the instances it misclassifies. It then builds new trees to fix those errors, learning from its mistakes and improving with each step.
Here, we'll illustrate exactly how AdaBoost makes its predictions, building strength by combining targeted weak learners, just like a workout routine that turns focused exercises into full-body power.
AdaBoost is an ensemble machine learning model that creates a sequence of weighted decision trees, typically shallow ones (often just single-level "stumps"). Each tree is trained on the entire dataset, but with adaptive sample weights that give more importance to previously misclassified examples.
For classification tasks, AdaBoost combines the trees through a weighted voting system, where better-performing trees get more influence in the final decision.
The model's strength comes from its adaptive learning process: while each simple tree might be a "weak learner" that performs only slightly better than random guessing, the weighted combination of trees creates a "strong learner" that progressively focuses on and corrects mistakes.
Throughout this article, we'll focus on the classic golf dataset as an example for classification.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Create and prepare dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,
True, False, True, True, False, False, True, False, True, True, False,
True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]
# Prepare features and target
X,y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Main Mechanism
Here’s how AdaBoost works:
- Initialize Weights: Assign equal weight to every training example.
- Iterative Learning: In each step, a simple decision tree is trained and its performance is checked. Misclassified examples get more weight, making them a priority for the next tree. Correctly classified examples keep their weights, and all weights are then normalized so they add up to 1.
- Build Weak Learners: Each new, simple tree targets the mistakes of the previous ones, creating a sequence of specialized weak learners.
- Final Prediction: Combine all trees through weighted voting, where each tree's vote is based on its importance value, giving more influence to more accurate trees.
Here, we'll follow the SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function) algorithm, the standard approach in scikit-learn that handles both binary and multi-class classification.
Step 1: Initial Setup
1.1. Decide on the weak learner to be used. A one-level decision tree (or "stump") is the default choice.
1.2. Decide how many weak learners (in this case, the number of trees) you want to build (the default is 50 trees).
1.3. Start by giving each training example equal weight:
· Each sample gets weight = 1/N (N is the total number of samples)
· All weights together sum to 1
For the First Tree
2.1. Build a decision stump while considering sample weights (a small sketch of the weighted impurity calculation follows these steps)
a. Calculate the initial weighted Gini impurity for the root node
b. For each feature:
· Sort data by feature values (exactly as in the Decision Tree classifier)
· For each possible split point:
·· Split samples into left and right groups
·· Calculate weighted Gini impurity for both groups
·· Calculate the weighted Gini impurity reduction for this split
c. Pick the split that gives the largest Gini impurity reduction
d. Create a simple one-split tree using this decision
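To make the weighted impurity concrete, here is a minimal sketch of what a weighted Gini calculation could look like for binary labels. The helper names (weighted_gini, split_gain) are illustrative, not scikit-learn internals; in practice the library handles this for you via the sample_weight argument.
import numpy as np

def weighted_gini(y, sample_weight):
    # Weighted Gini impurity of one node (assumes binary labels 0/1)
    total = sample_weight.sum()
    if total == 0:
        return 0.0
    p1 = sample_weight[y == 1].sum() / total  # weighted share of class 1
    return 1.0 - p1**2 - (1.0 - p1)**2

def split_gain(y, sample_weight, left_mask):
    # Reduction in weighted Gini impurity achieved by a candidate split
    w_left, w_right = sample_weight[left_mask], sample_weight[~left_mask]
    parent = weighted_gini(y, sample_weight)
    children = (w_left.sum() * weighted_gini(y[left_mask], w_left) +
                w_right.sum() * weighted_gini(y[~left_mask], w_right)) / sample_weight.sum()
    return parent - children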
2.2. Evaluate how good this tree is
a. Use the tree to predict the labels of the training set.
b. Add up the weights of all misclassified samples to get the error rate
c. Calculate tree importance (α) using:
α = learning_rate × log((1-error)/error)
2.3. Update sample weights
a. Keep the original weights for correctly classified samples
b. Multiply the weights of misclassified samples by e^(α).
c. Divide each weight by the sum of all weights. This normalization ensures all weights still sum to 1 while maintaining their relative proportions.
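Putting steps 1.3 through 2.3 together, here is a minimal sketch of one boosting round, assuming a depth-1 DecisionTreeClassifier as the weak learner and a learning rate of 1.0. The variable names are illustrative, and this mirrors rather than reproduces scikit-learn's internal code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n = len(X_train)
w = np.full(n, 1 / n)                                  # 1.3: equal starting weights

stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train, sample_weight=w)           # 2.1: stump trained with sample weights

miss = np.asarray(stump.predict(X_train) != y_train)   # 2.2a: which samples are misclassified
error = w[miss].sum()                                  # 2.2b: weighted error rate (assumes 0 < error < 1)
learning_rate = 1.0
alpha = learning_rate * np.log((1 - error) / error)    # 2.2c: tree importance

w[miss] *= np.exp(alpha)                               # 2.3b: boost the misclassified weights
w /= w.sum()                                           # 2.3c: renormalize so weights sum to 1
print(f"error={error:.3f}, alpha={alpha:.3f}")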
For the Second Tree
2.1. Build a new stump, but now using the updated weights
a. Calculate the new weighted Gini impurity for the root node:
· It will be different because misclassified samples now have greater weights
· Correctly classified samples now have smaller weights
b. For each feature:
· Same process as before, but the weights have changed
c. Pick the split with the best weighted Gini impurity reduction
· Often different from the first tree's split
· Focuses on samples the first tree got wrong
d. Create the second stump
2.2. Evaluate this new tree
a. Calculate the error rate with the current weights
b. Calculate its importance (α) using the same formula as before
2.3. Update the weights again, using the same process: increase the weights for mistakes, then normalize.
For the Third Tree onwards
Repeat Steps 2.1–2.3 for all remaining trees.
Step 3: Final Ensemble
3.1. Keep all trees and their importance scores
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Train AdaBoost
np.random.seed(42) # For reproducibility
clf = AdaBoostClassifier(algorithm='SAMME', n_estimators=50, random_state=42)
clf.fit(X_train, y_train)
# Create visualizations for trees 1, 2, and 50
trees_to_show = [0, 1, 49]
feature_names = X_train.columns.tolist()
class_names = ['No', 'Yes']
# Set up the plot
fig, axes = plt.subplots(1, 3, figsize=(14,4), dpi=300)
fig.suptitle('Decision Stumps from AdaBoost', fontsize=16)
# Plot each tree
for idx, tree_idx in enumerate(trees_to_show):
    plot_tree(clf.estimators_[tree_idx],
              feature_names=feature_names,
              class_names=class_names,
              filled=True,
              rounded=True,
              ax=axes[idx],
              fontsize=12)
    axes[idx].set_title(f'Tree {tree_idx + 1}', fontsize=12)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Testing Step
To make a prediction:
a. Get each tree's prediction
b. Multiply each by its importance score (α)
c. Add them all up
d. The class with the higher total weight will be the final prediction
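As a sanity check, here is a rough sketch of that weighted vote done by hand with the fitted clf from above, using its estimators_ and estimator_weights_ attributes. The class labels are 0/1 here, so they double as column indices; this mirrors, but is not, scikit-learn's internal decision function.
import numpy as np

votes = np.zeros((len(X_test), 2))                     # one column of vote totals per class
for stump, alpha in zip(clf.estimators_, clf.estimator_weights_):
    pred = stump.predict(X_test).astype(int)           # each stump votes 0 or 1 per sample
    votes[np.arange(len(X_test)), pred] += alpha       # add the tree's importance to that class

manual_pred = votes.argmax(axis=1)                     # class with the larger weighted total
print((manual_pred == clf.predict(X_test)).all())      # should agree with clf.predict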
Evaluation Step
After building all the trees, we can evaluate the model on the test set.
# Get predictions
y_pred = clf.predict(X_test)

# Create DataFrame with actual and predicted values
results_df = pd.DataFrame({
'Actual': y_test,
'Predicted': y_pred
})
print(results_df) # Display results DataFrame
# Calculate and display accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"nModel Accuracy: {accuracy:.4f}")
Here are the key parameters for AdaBoost, particularly in scikit-learn:
estimator: This is the base model that AdaBoost uses to build its final solution. The three most common weak learners are:
a. Decision Tree with depth 1 (decision stump): This is the default and most popular choice. Because it only has one split, it is considered a very weak learner that is just a bit better than random guessing, which is exactly what the boosting process needs.
b. Logistic Regression: Logistic regression (especially with a high penalty) can also be used here even though it is not really a weak learner. It can be useful for data with a linear relationship.
c. Decision Trees with small depth (e.g., depth 2 or 3): These are slightly more complex than decision stumps. They are still fairly simple, but can handle slightly more complex patterns than a stump.
n_estimators: The number of weak learners to combine, typically around 50–100. Using more than 100 rarely helps.
learning_rate: Controls how much each classifier affects the final result. Common starting values are 0.1, 0.5, or 1.0. Lower values (like 0.1) combined with a somewhat higher n_estimators usually work better.
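As a quick, illustrative way to compare these settings on the golf data (results will vary a lot with such a tiny dataset, so treat this as a sketch rather than a tuning recipe):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Compare a default-style setup with a lower learning rate plus more trees
for n_est, lr in [(50, 1.0), (200, 0.1)]:
    model = AdaBoostClassifier(algorithm='SAMME', n_estimators=n_est,
                               learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"n_estimators={n_est}, learning_rate={lr}: accuracy={acc:.3f}")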
Key differences from Random Forest
Since both Random Forest and AdaBoost work with multiple trees, it is easy to confuse the parameters involved. The key difference is that Random Forest combines many trees independently (bagging) while AdaBoost builds trees one after another to fix mistakes (boosting). Here are some other details about their differences, followed by a short side-by-side sketch:
- No bootstrap parameter, because AdaBoost uses all the data but with changing weights
- No oob_score, because AdaBoost doesn't use bootstrap sampling
- learning_rate becomes crucial (it is not present in Random Forest)
- Tree depth is usually kept very shallow (often just stumps), unlike Random Forest's deeper trees
- The focus shifts from parallel independent trees to sequential dependent trees, making parameters like n_jobs less relevant
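A minimal side-by-side sketch of the two constructors makes these parameter differences concrete (the values shown are just the common settings discussed above, not recommendations):
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Forest: parallel, independent deep trees built on bootstrap samples
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, n_jobs=-1, random_state=42)

# AdaBoost: sequential, dependent stumps trained on reweighted samples
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, learning_rate=1.0,
                         algorithm='SAMME', random_state=42)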
Pros:
- Adaptive Learning: AdaBoost gets better by giving more weight to the mistakes it made. Each new tree pays more attention to the hard cases the previous ones got wrong.
- Resists Overfitting: Even though it keeps adding trees one after another, AdaBoost usually doesn't overfit the training data too badly. This is because it uses weighted voting, so no single tree can dominate the final answer.
- Built-in Feature Selection: AdaBoost naturally finds which features matter most. Each simple tree picks the most useful feature for that round, which means it automatically selects important features as it trains.
Cons:
- Sensitive to Noise: Because it gives more weight to mistakes, AdaBoost can struggle with messy or mislabeled data. If some training examples have wrong labels, it may focus too much on those bad examples and make the whole model worse.
- Must Be Sequential: Unlike Random Forest, which can train many trees at once, AdaBoost must train one tree at a time, because each new tree needs to know how the previous trees did. This makes it slower to train.
- Learning Rate Sensitivity: While it has fewer settings to tune than Random Forest, the learning rate really affects how well it works. If it is too high, the model may fit the training data too exactly. If it is too low, it needs many more trees to work well.
AdaBoost is a key boosting algorithm that many newer methods learned from. Its main idea of improving by focusing on mistakes has helped shape many modern machine learning tools. While other methods try to be perfect from the start, AdaBoost shows that sometimes the best way to solve a problem is to learn from your errors and keep improving.
AdaBoost also works best for binary classification problems and when your data is clean. While Random Forest might be better for more general tasks (like predicting numbers) or messier data, AdaBoost can give really good results when used in the right way. The fact that people still use it after so many years shows just how well the core idea works!
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,
True, False, True, True, False, False, True, False, True, True, False,
True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
# Prepare data
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Split features and target
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Train AdaBoost
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Base estimator (decision stump)
    n_estimators=50,        # Typically fewer trees than Random Forest
    learning_rate=1.0,      # Default learning rate
    algorithm='SAMME',      # The only currently available algorithm (will be removed in future scikit-learn updates)
    random_state=42
)
ada.fit(X_train, y_train)
# Predict and evaluate
y_pred = ada.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")