Decision trees are a popular supervised learning algorithm with advantages that include being usable for both regression and classification as well as being easy to interpret. However, decision trees aren't the most performant algorithm and are prone to overfitting because small variations in the training data can produce a very different tree. For this reason, people often turn to ensemble models like Bagged Trees and Random Forests. These consist of multiple decision trees trained on bootstrapped data and aggregated to achieve better predictive performance than any single tree could offer. This tutorial covers the following:
- What’s Bagging
- What Makes Random Forests Different
- Training and Tuning a Random Forest using Scikit-Learn
- Calculating and Interpreting Feature Importance
- Visualizing Individual Decision Trees in a Random Forest
As always, the code used in this tutorial is available on my GitHub. A video version of this tutorial is also available on my YouTube channel for those who prefer to follow along visually. With that, let's get started!
What’s Bagging (Bootstrap Aggregating)
Random forests can be categorized as bagging algorithms (bootstrap aggregating). Bagging consists of two steps:
1.) Bootstrap sampling: Create multiple training sets by randomly drawing samples with replacement from the original dataset. These new training sets, called bootstrapped datasets, typically contain the same number of rows as the original dataset, but individual rows may appear multiple times or not at all. On average, each bootstrapped dataset contains about 63.2% of the unique rows from the original data; the short sketch at the end of this section illustrates this. The remaining ~36.8% of rows are left out and can be used for out-of-bag (OOB) evaluation. For more on this concept, see my sampling with and without replacement blog post.
2.) Aggregating predictions: Each bootstrapped dataset is used to train a separate decision tree model. The final prediction is made by combining the outputs of all individual trees. For classification, this is typically done through majority voting. For regression, predictions are averaged.
Training each tree on a different bootstrapped sample introduces variation across trees. While this doesn't fully eliminate correlation (especially when certain features dominate), it helps reduce overfitting when combined with aggregation. Averaging the predictions of many such trees reduces the overall variance of the ensemble, improving generalization.
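To make the ~63.2% figure concrete, here is a minimal sketch (not part of the original tutorial) that draws one bootstrap sample with NumPy and counts how many of the original rows appear at least once; the row count and random seed are arbitrary assumptions.
import numpy as np
rng = np.random.default_rng(0)
n_rows = 10_000
# Bootstrap sample: draw n_rows row indices with replacement
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)
# Fraction of original rows that appear at least once (~63.2% on average)
unique_fraction = len(np.unique(bootstrap_idx)) / n_rows
print(f"Unique rows in bootstrap sample: {unique_fraction:.1%}")
# Rows never drawn form the out-of-bag (OOB) set (~36.8% on average)
print(f"Out-of-bag rows: {1 - unique_fraction:.1%}")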
What Makes Random Forests Different

Suppose there is a single strong feature in your dataset. In bagged trees, each tree may repeatedly split on that feature, leading to correlated trees and less benefit from aggregation. Random Forests reduce this issue by introducing additional randomness. Specifically, they change how splits are chosen during training:
1). Create N bootstrapped datasets. Note that while bootstrapping is commonly used in Random Forests, it is not strictly necessary because step 2 (random feature selection) introduces sufficient diversity among the trees.
2). For each tree, at each node, a random subset of features is chosen as candidates, and the best split is chosen from that subset. In scikit-learn, this is controlled by the max_features parameter, which defaults to 'sqrt' for classifiers and 1.0 for regressors (equivalent to bagged trees). A short sketch after the note below illustrates this candidate selection.
3). Aggregating predictions: vote for classification and average for regression.
Note: Random Forests use sampling with replacement for bootstrapped datasets and sampling without replacement for selecting the subset of candidate features at each split.
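To illustrate step 2, here is a minimal sketch (an illustration added to the tutorial, with made-up feature names) of how a candidate subset might be drawn at a single split when max_features='sqrt'; note that the features are sampled without replacement.
import numpy as np
rng = np.random.default_rng(0)
feature_names = np.array(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
                          'floors', 'waterfront', 'view', 'condition', 'grade'])
# max_features='sqrt': number of candidate features considered at each split
n_candidates = int(np.sqrt(len(feature_names)))
# Sampled without replacement, so each feature appears at most once per split
candidates = rng.choice(feature_names, size=n_candidates, replace=False)
print(f"Candidate features at this split: {candidates}")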

Out-of-Bag (OOB) Score
Because ~36.8% of the training data is excluded from any given tree, you can use this held-out portion to evaluate that tree's predictions. Scikit-learn enables this via the oob_score=True parameter, providing an efficient way to estimate generalization error. You'll see this parameter used in the training example later in the tutorial.
Training and Tuning a Random Forest in Scikit-Learn
Random Forests remain a strong baseline for tabular data thanks to their simplicity, interpretability, and ability to parallelize, since each tree is trained independently. This section demonstrates how to load data, perform a train/test split, train a baseline model, tune hyperparameters using grid search, and evaluate the final model on the test set.
Step 1: Train a Baseline Model
Before tuning, it's good practice to train a baseline model using reasonable defaults. This gives you an initial sense of performance and lets you validate generalization using the out-of-bag (OOB) score, which is built into bagging-based models like Random Forests. This example uses the House Sales in King County dataset (CC0 1.0 Universal License), which contains property sales from the Seattle area between May 2014 and May 2015. Validating with the OOB score allows us to reserve the test set for final evaluation after tuning.
# Import libraries
# Some imports are only used later in the tutorial
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Dataset: Breast Cancer Wisconsin (Diagnostic)
# Source: UCI Machine Learning Repository
# License: CC BY 4.0
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import tree
# Load dataset
# Dataset: House Sales in King County (May 2014–May 2015)
# License CC0 1.0 Universal
url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'
df = pd.read_csv(url)
columns = ['bedrooms',
           'bathrooms',
           'sqft_living',
           'sqft_lot',
           'floors',
           'waterfront',
           'view',
           'condition',
           'grade',
           'sqft_above',
           'sqft_basement',
           'yr_built',
           'yr_renovated',
           'lat',
           'long',
           'sqft_living15',
           'sqft_lot15',
           'price']
df = df[columns]
# Define features and target
X = df.drop(columns='price')
y = df['price']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Train baseline Random Forest
reg = RandomForestRegressor(
    n_estimators=100,    # number of trees
    max_features=1/3,    # fraction of features considered at each split
    oob_score=True,      # enables out-of-bag evaluation
    random_state=0
)
reg.fit(X_train, y_train)
# Evaluate baseline performance using OOB score
print(f"Baseline OOB score: {reg.oob_score_:.3f}")

Step 2: Tune Hyperparameters with Grid Search
While the baseline model gives a strong starting point, performance can often be improved by tuning key hyperparameters. Grid search cross-validation, as implemented by GridSearchCV, systematically explores combinations of hyperparameters and uses cross-validation to evaluate each one, selecting the configuration with the best validation performance. The most commonly tuned hyperparameters include:
- n_estimators: The number of decision trees in the forest. More trees can improve accuracy but increase training time.
- max_features: The number of features to consider when looking for the best split. Lower values reduce correlation between trees.
- max_depth: The maximum depth of each tree. Shallower trees are faster but may underfit.
- min_samples_split: The minimum number of samples required to split an internal node. Higher values can reduce overfitting.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps control tree size.
- bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
param_grid = {
    'n_estimators': [100],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
# Initialize model
rf = RandomForestRegressor(random_state=0, oob_score=True)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,       # 5-fold cross-validation
    scoring='r2',   # evaluation metric
    n_jobs=-1     # use all available CPU cores
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best R^2 rating: {grid_search.best_score_:.3f}")

Step 3: Evaluate Final Model on Test Set
Now that we've selected the best-performing model based on cross-validation, we can evaluate it on the held-out test set to estimate its generalization performance.
# Evaluate final model on test set
best_model = grid_search.best_estimator_
print(f"Test R^2 rating (final model): {best_model.rating(X_test, y_test):.3f}")

Calculating Random Forest Feature Importance
One of the key advantages of Random Forests is their interpretability, something that large language models (LLMs) often lack. While LLMs are powerful, they typically function as black boxes and can exhibit biases that are difficult to identify. In contrast, scikit-learn supports two main methods for measuring feature importance in Random Forests: Mean Decrease in Impurity and Permutation Importance.
1). Mean Decrease in Impurity (MDI): Also known as Gini importance, this method calculates the total reduction in impurity contributed by each feature across all trees. It is fast and built into the model via reg.feature_importances_. However, impurity-based feature importances can be misleading, especially for features with high cardinality (many unique values), as these features are more likely to be chosen simply because they provide more potential split points.
importances = reg.feature_importances_
feature_names = X.columns
sorted_idx = np.argsort(importances)[::-1]
for i in sorted_idx:
    print(f"{feature_names[i]}: {importances[i]:.3f}")

2). Permutation Importance: This method measures the decrease in model performance when a single feature's values are randomly shuffled. Unlike MDI, it accounts for feature interactions and correlation. It is more reliable but also more computationally expensive.
# Perform permutation importance on the test set
perm_importance = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=0)
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
for i in sorted_idx:
    print(f"{X.columns[i]}: {perm_importance.importances_mean[i]:.3f}")
It is worth noting that our geographic features lat and long are also useful for visualization, as the plot below shows. It's likely that companies like Zillow leverage location information extensively in their valuation models.
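The original plot is not reproduced here, but a scatter plot along these lines (marker size, colormap, and filename are assumptions) shows how price varies with latitude and longitude:
# Sketch: map of house sale locations, colored by price
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(df['long'], df['lat'], c=df['price'], s=2, cmap='viridis')
fig.colorbar(points, ax=ax, label='price')
ax.set_xlabel('long')
ax.set_ylabel('lat')
ax.set_title('King County house prices by location')
fig.savefig('king_county_price_map.png')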

Visualizing Individual Decision Trees in a Random Forest
A Random Forest consists of multiple decision trees, one for each estimator specified via the n_estimators parameter. After training the model, you can access these individual trees through the .estimators_ attribute. Visualizing a few of these trees can help illustrate how differently each one splits the data due to bootstrapped training samples and random feature selection at each split. While the earlier example used a RandomForestRegressor, here we demonstrate this visualization using a RandomForestClassifier trained on the Breast Cancer Wisconsin dataset (CC BY 4.0 license) to highlight Random Forests' versatility for both regression and classification tasks. This short video demonstrates what 100 trained estimators from this dataset look like.
Fit a Random Forest Model using Scikit-Learn
# Load the Breast Cancer (Diagnostic) Dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Arrange Data into Features Matrix and Target Vector
X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values
# Split the info into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)
# Random Forests in `scikit-learn` (with N = 100)
rf = RandomForestClassifier(n_estimators=100,
                            random_state=0)
rf.fit(X_train, Y_train)
Plotting Individual Estimators (decision trees) from a Random Forest using Matplotlib
You can now view all of the individual trees from the fitted model.
rf.estimators_

You can now visualize individual trees. The code below visualizes the first decision tree.
fn=data.feature_names
cn=data.target_names
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names = fn,
               class_names=cn,
               filled = True);
fig.savefig('rf_individualtree.png')

Although plotting many trees can be difficult to interpret, you may want to explore the variability across estimators. The following example shows how to visualize the first five decision trees in the forest:
# This may not be the best way to view each estimator, since each tree is small
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 2), dpi=3000)
for index in range(5):
    tree.plot_tree(rf.estimators_[index],
                   feature_names=fn,
                   class_names=cn,
                   filled=True,
                   ax=axes[index])
    axes[index].set_title(f'Estimator: {index}', fontsize=11)
fig.savefig('rf_5trees.png')

Conclusion
Random forests consist of multiple decision trees trained on bootstrapped data in order to achieve better predictive performance than could be obtained from any of the individual decision trees. If you have questions or thoughts on the tutorial, feel free to reach out through YouTube or X.