Stepwise Selection Made Easy: Improve Your Regression Models in Python


To get the most out of this tutorial, you should already have a solid understanding of how linear regression works and the assumptions behind it. You should also be aware that, in practice, multicollinearity is addressed using the Variance Inflation Factor (VIF). In addition, you need to understand what prediction risk means, and be familiar with the fundamentals of Python as well as its core functions.

At the end of this article, you will find the code for the stepwise selection procedure used here. The implementation follows two key principles: orthogonality and Don’t Repeat Yourself (DRY), ensuring clean, modular, and reusable code.

Reducing the number of variables in a regression model is not only a technical exercise; it is a strategic choice that should be guided by the objectives of the analysis. In a previous work, we demonstrated how simple tools, such as correlation analysis and the Variance Inflation Factor (VIF), can already shrink a dataset with hundreds of predictors into a much more compact model. Yet, even after this initial reduction, models often still contain too many variables to work with effectively. A smaller model with fewer predictors offers several benefits: it can yield better predictions than a larger model, it is more parsimonious and hence easier to interpret, and it often generalizes better. As more variables are added, the model’s bias decreases but its variance increases. That is the essence of the bias–variance trade-off: too few variables lead to high bias (underfitting), whereas too many lead to high variance (overfitting). Good predictive performance requires a balance between the two.

This raises a fundamental question for anyone working with regression models: how do we determine which variables should be included in the model? In other words, how can we reduce the dimensionality of our data without losing essential information?

The answer depends on the purpose of the analysis. Should the model provide precise estimates of the coefficients? Should it identify which predictors are important? Or should it maximize predictive accuracy? Each of these objectives calls for a different approach to model selection, and ignoring this distinction can lead to misleading conclusions.

In this article, we address the challenge of model selection in regression. We start by outlining the general framework of linear regression (readers already familiar with it may skip this section). We then review the main scoring criteria used to evaluate competing models, followed by a discussion of the procedures that allow us to explore subsets of the possible model space. Finally, we illustrate these methods with a Python application using the Communities and Crime dataset.

1. Framework of Linear Regression

In this section, we provide a brief overview of the linear regression model. We begin by describing the dataset, including the number of observations and the number of covariates. We then introduce the model itself and outline the assumptions made about the data.

We assume that we have a dataset with n observations and p covariates. The response variable, denoted by Y, is continuous, and the covariates are denoted by X1, …, Xp. We assume that the relationship between the response variable and the covariates is linear, that is:

Yᵢ = β₀ + β₁Xᵢ₁ + … + βₚXᵢₚ + εᵢ

for i = 1, …, 𝑛, where β0 is the intercept, βj is the coefficient of the 𝑗-th covariate, and εᵢ is the error term. We assume that the error terms are independent and identically distributed (i.i.d.) with mean zero and variance σ².
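As a quick illustration, here is a minimal sketch of this framework in Python with statsmodels, on simulated data (the coefficients and sizes are purely illustrative):

import numpy as np
import statsmodels.api as sm

# Simulated data: n observations, p covariates (purely illustrative values)
rng = np.random.default_rng(42)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.5, -2.0, 0.0])                    # true coefficients
y = 0.5 + X @ beta + rng.normal(scale=1.0, size=n)   # intercept 0.5, i.i.d. Gaussian errors

# Fit Y = beta0 + beta1*X1 + ... + betap*Xp + epsilon by ordinary least squares
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)   # estimated intercept and coefficients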

With the regression framework in place, the next step is to face the central challenge of model selection: how do we compare different subsets of variables?

2. Scoring Criteria for Evaluating Competing Models

In model selection, the first challenge is to assign a score to each model, where a model is defined by a specific subset of covariates. This section explains how models can be scored.

Let us first discuss the problem of scoring models. Let S ⊂ {1, …, p} and let 𝓧S = {Xⱼ : j ∈ S} denote a subset of the covariates. Let βS denote the coefficients of the corresponding set of covariates and let β̂S denote the least squares estimate of βS. Also, let XS denote the X matrix for this subset of covariates and define r̂S(x) to be the estimated regression function. The predicted values from model S are denoted by Ŷᵢ(S) = r̂S(Xᵢ). The prediction risk is defined to be

R(S) = ∑ᵢ₌₁ⁿ E[(Ŷᵢ(S) − Yᵢ*)²]

where Yᵢ* is a future observation of Yᵢ at the covariate value Xᵢ.

The goal of model selection is to find the subset S that minimizes the prediction risk R(S).

In practice, we cannot compute the prediction risk R(S) directly from the data. Instead, we use an estimate of R(S) based on the available data, and this estimate of the prediction risk serves as our scoring criterion.

The naive estimate of the prediction risk is the training error, defined as:

R̂ₜᵣ(S) = ∑ᵢ₌₁ⁿ (Ŷᵢ(S) − Yᵢ)²

where Yᵢ is the observed value of the response variable for the i-th observation.

However, the training error is a very biased estimate of the prediction risk: it is always smaller than the prediction risk. In fact,

E(R̂ₜᵣ(S)) = R(S) − 2 ∑ᵢ₌₁ⁿ Cov(Ŷᵢ(S), Yᵢ)

What explains this bias is that the data is used twice: once to fit the model and once to compute the training error. When we fit a complex model with many parameters, the covariance Cov(Ŷᵢ(S), Yᵢ) will be large and the bias of the training error becomes worse. This is why we need a more reliable estimate of the prediction risk.

2.1 Mallows’ Cp statistic

Mallows’ Cp statistic is a popular criterion for model selection. It is defined as:

𝐶ₚ(S) = R̂ₜᵣ(S) + 2|S|·σ̂²

Here, |𝑆| is the number of terms in 𝑆, and σ̂² is the estimate of σ², the variance of the error term, obtained from the full model containing all 𝑘 variables. This value is the training error plus a bias correction. The first term measures how well the model fits the data, while the second term measures the model’s complexity. The more complex the model, the larger the second term will be, and consequently, the larger Mallows’ 𝐶ₚ statistic will be.

Mallows’ 𝐶ₚ statistic therefore represents a trade-off between model fit and complexity. Finding a good model requires balancing these two aspects, and the goal is to identify the model that minimizes Mallows’ 𝐶ₚ.
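As a rough sketch, Mallows’ 𝐶ₚ can be computed for any candidate subset with statsmodels, following the definition above (training error of the submodel plus 2|𝑆|·σ̂², with σ̂² estimated from the full model). The column names in the commented usage are only examples:

import numpy as np
import statsmodels.api as sm

def mallows_cp(y, X_full, subset):
    """Mallows' Cp of the model built on the columns in `subset`.
    sigma^2 is estimated from the full model containing all covariates."""
    n, k = X_full.shape
    full = sm.OLS(y, sm.add_constant(X_full)).fit()
    sigma2_hat = np.sum(full.resid ** 2) / (n - k - 1)       # sigma^2 from the full model
    sub = sm.OLS(y, sm.add_constant(X_full[subset])).fit()
    rss = np.sum(sub.resid ** 2)                             # training error of the submodel
    return rss + 2 * (len(subset) + 1) * sigma2_hat          # +1 counts the intercept, as in the appendix

# Hypothetical usage, with X a DataFrame of covariates and y the response:
# cp = mallows_cp(y, X, ["PctKids2Par", "PctWorkMom"])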

Beyond Mallows’ 𝐶ₚ, model selection criteria can also be derived from likelihood-based estimation with penalty terms. This leads us to the following family of methods.

2.2 Likelihood and penalization

The approach below to estimating the prediction risk relies on maximum likelihood estimation of the parameters.

Under the assumption that the error term is normally distributed, the likelihood function is given by:

L(β, σ²) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp(−(Yᵢ − β₀ − ∑ⱼ∈S βⱼXᵢⱼ)² / (2σ²))

If you compute the maximum likelihood estimates of the parameters β and σ² for model 𝑆, which has |𝑆| variables, you obtain respectively:
β̂(𝑆)ML = β̂(𝑆)OLS and σ̂(𝑆)²ML = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̂ᵢ(𝑆))².

The log-likelihood of model S, which has |S| variables, is then given by:

𝓁(S) = −(𝑛/2)·log(2π) − (𝑛/2)·log(σ̂²(S)) − 𝑛/2

Selecting the model that maximizes the log-likelihood is equivalent to selecting the model with the smallest residual sum of squares (RSS), that is:

RSS(S) = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̂ᵢ(𝑆))²

In order to minimize a criterion, we use the negative log-likelihood. The criterion is generally defined as:

−2𝓁(𝑆) + 2|𝑆|·𝑓(𝑛)

where 𝑓(𝑛) is a penalty function that depends on the sample size 𝑛.
This formulation allows us to define the AIC and BIC criteria, given as follows:

2.2.1 Akaike Information Criterion (AIC)

Another approach to model selection is the Akaike Information Criterion (AIC). The goal of the AIC is to identify the model that minimizes information loss. In practice, this means selecting the subset S that minimizes the AIC, defined as:

AIC(𝑆) = −2𝓁ₛ + 2|𝑆|

where 𝓁ₛ is the log-likelihood of model 𝑆 evaluated at the maximum likelihood estimates of its parameters. Here, 𝑓(𝑛) = 1, so the penalty term 2|𝑆|·𝑓(𝑛) reduces to 2|𝑆|.

This criterion can be viewed as a combination of goodness of fit and model complexity.
When comparing two models, the one with the lower AIC value is preferred.
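In practice, statsmodels exposes the AIC directly on a fitted model, so comparing candidate subsets only requires fitting each one. A minimal sketch (the column names in the commented usage are hypothetical):

import statsmodels.api as sm

def fit_subset(y, X, cols):
    """Fit an OLS model on the given subset of columns of the DataFrame X."""
    return sm.OLS(y, sm.add_constant(X[cols])).fit()

# Hypothetical usage: compare two candidate subsets by AIC (lower is better)
# m1 = fit_subset(y, X, ["PctKids2Par"])
# m2 = fit_subset(y, X, ["PctKids2Par", "PctWorkMom"])
# best = min((m1, m2), key=lambda m: m.aic)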

2.2.2 Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) is another criterion for model selection. It is similar to the AIC and is defined as:

BIC(𝑆) = −2𝓁ₛ + |𝑆|·log(𝑛)

where 𝓁ₛ is the log-likelihood of model 𝑆 evaluated at the maximum likelihood estimates of its parameters.

It is called the Bayesian Information Criterion because it can be derived from a Bayesian perspective. Let 𝑆 = {𝑆₁, …, 𝑆ₘ} denote a set of models. If we assign each model 𝑆ᵢ a prior probability π(𝑆ᵢ) = 1/𝑚, then the posterior probability of model 𝑆ᵢ given the data can be approximated using the BIC. This leads to the following expression:

P(𝑆ᵢ | data) ≈ exp(−BIC(𝑆ᵢ)/2) / ∑ⱼ₌₁ᵐ exp(−BIC(𝑆ⱼ)/2)

Thus, selecting the model that minimizes the BIC is equivalent to selecting the model with the highest (approximate) posterior probability given the data.

BIC also has an interpretation in terms of minimum description length: it balances model fit against complexity. Because its penalty term is 𝑓(𝑛) = ½·log(𝑛), the BIC applies a stronger penalty than the AIC when 𝑛 > 7. As a result, BIC typically selects more parsimonious models than AIC, especially as the sample size grows.

As with the AIC, when comparing two models, the preferred model is the one with the lower BIC.
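The same fitted statsmodels object also exposes the BIC, which makes the difference between the two penalties easy to see. A small sketch on simulated data (purely illustrative):

import numpy as np
import statsmodels.api as sm

# Illustrative comparison of the AIC and BIC penalties
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # only two covariates matter

model = sm.OLS(y, sm.add_constant(X)).fit()
print("AIC:", model.aic)   # -2*loglik + 2 * (number of estimated coefficients)
print("BIC:", model.bic)   # -2*loglik + log(n) * (number of estimated coefficients)
# With n = 500, log(n) is about 6.2 > 2: BIC penalizes each extra coefficient more strongly.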

2.3 Leave-One-Out Cross-Validation (LOOCV) and k-Fold Cross-Validation

Another widely used method for model selection is leave-one-out cross-validation (LOOCV). In this approach, the risk estimator is defined as:

R̂CV(𝑆) = ∑ᵢ₌₁ⁿ (𝑌ᵢ − Ŷ₋ᵢ(𝑆))²

where Ŷ₋ᵢ(𝑆) is the prediction for 𝑌ᵢ obtained from model 𝑆 fitted on all observations except the i-th one, and 𝑌ᵢ is the actual response for the i-th observation.

It can be shown that:

R̂CV(𝑆) = ∑ᵢ₌₁ⁿ ((𝑌ᵢ − Ŷᵢ(𝑆)) / (1 − hᵢᵢ(𝑆)))²

where hᵢᵢ(𝑆) is the i-th diagonal element of the hat matrix HS = XS (XSᵀ XS)⁻¹ XSᵀ for model S.

This formulation shows that it is unnecessary to refit the model repeatedly, leaving out one observation at a time. Instead, the LOOCV risk can be computed directly from the fitted values and the hat matrix of a single fit.
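A minimal sketch of this shortcut with statsmodels, using the hat-matrix diagonal of a single fit rather than n refits (the column names in the commented usage are only examples):

import numpy as np
import statsmodels.api as sm

def loocv_risk(y, X, cols):
    """LOOCV risk of the OLS model on `cols`, computed without refitting,
    via the identity sum_i ((Y_i - Yhat_i) / (1 - h_ii))^2."""
    model = sm.OLS(y, sm.add_constant(X[cols])).fit()
    h = model.get_influence().hat_matrix_diag        # diagonal of the hat matrix
    return np.sum((model.resid / (1.0 - h)) ** 2)

# Hypothetical usage, with X a DataFrame of covariates and y the response:
# risk = loocv_risk(y, X, ["PctKids2Par", "PctWorkMom", "LandArea"])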

A natural extension of LOOCV is k-fold cross-validation, where the data is partitioned into k folds, and the model is trained on k − 1 folds and validated on the remaining fold. This process is repeated across all folds, and the results are averaged to estimate the prediction error.

K-Fold Cross-Validation

In this approach, the data is split into 𝑘 groups, or folds (commonly 𝑘 = 5 or 𝑘 = 10). One fold is left out, and the model is fitted on the remaining 𝑘 − 1 folds. The fitted model is then used to predict the responses in the omitted fold. The risk for that fold is estimated as:

R̂fold(𝑆) = ∑ᵢ∈fold (𝑌ᵢ − Ŷᵢ(𝑆))²

where the sum is taken over all observations in the omitted fold. This procedure is repeated for each of the 𝑘 folds, and the overall risk estimate is obtained by averaging the 𝑘 individual risk values.
This method is particularly suitable when the primary goal of the regression is prediction. In this setting, alternative performance measures [such as the Mean Absolute Error (MAE) or the Root Mean Squared Error (RMSE)] can also be used to evaluate predictive accuracy.
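As one possible sketch (using scikit-learn here purely for convenience; any k-fold utility would do), the 10-fold RMSE of a candidate model can be estimated as follows on simulated data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data for illustration; in practice X would hold the candidate covariates
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 2.0 + X[:, 0] - X[:, 2] + rng.normal(size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print("10-fold RMSE:", -scores.mean())   # lower is better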

2.4 Other criteria

In addition to the criteria discussed above, several other measures are commonly used for model selection in the literature. One widely used option is the adjusted coefficient of determination, defined as:

R²adj(𝑆) = 1 − (1 − R²(𝑆)) · (𝑛 − 1) / (𝑛 − |𝑆| − 1)

Another approach is to use nested model tests, such as the F-test. The F-test compares two nested models: a smaller model 𝑆₁, whose covariates form a subset of those in a larger model 𝑆₂. The null hypothesis states that the additional variables in 𝑆₂ do not significantly improve the fit relative to 𝑆₁.
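Both quantities are directly available in statsmodels: rsquared_adj for the adjusted coefficient of determination and compare_f_test for the nested F-test. A small sketch on simulated data (purely illustrative):

import numpy as np
import statsmodels.api as sm

# Simulated illustration: the larger model adds one extra covariate to the smaller one
rng = np.random.default_rng(2)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.3 * x2 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("Adjusted R2:", small.rsquared_adj, big.rsquared_adj)
f_stat, p_value, df_diff = big.compare_f_test(small)   # F-test of the two nested models
print("F-test p-value:", p_value)   # a small p-value suggests the extra variable improves the fit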

Overall, the methods presented above primarily address the two central objectives of linear regression: parameter estimation and variable selection.

Having defined several ways to score a model, the remaining question is how to search the set of candidates to find the model with the best score.

3. Selection Procedure

Once models can be scored, the next step is to search either the entire space of possible models or a specific subset of it to identify the one with the best score. With 𝑘 covariates, there are 2ᵏ − 1 possible models, a number that quickly becomes impractical for large 𝑘 (for instance, more than one million models when 𝑘 = 20). In such cases, exhaustive search is computationally infeasible, and heuristic methods are preferred. Broadly, model selection strategies fall into two categories: exhaustive search and stepwise search.

3.1 Exhaustive Search

This approach evaluates every possible model and selects the one with the best score. It is feasible only when 𝑘 is small, because the computational burden becomes prohibitive with a large number of covariates.

3.2 Stepwise Search

Stepwise methods aim to identify a local optimum, that is, a model that performs better than its immediate neighbors. These methods are generally recommended only when exhaustive search is not feasible (e.g., when both 𝑛 and 𝑝 are large).

3.2.1 Forward Stepwise Selection

  • Select a scoring criterion (e.g., AIC, BIC, Mallows’ 𝐶ₚ).
  • Start with an empty model.
  • At each step, add the variable that provides the greatest improvement in the criterion.
  • Continue until no variable improves the score or all variables are included in the model.

3.2.2 Backward Stepwise Selection

  • Select a scoring criterion (e.g., AIC, BIC, Mallows’ 𝐶ₚ).
  • Start with the full model containing all variables.
  • At each step, remove the variable whose removal yields the greatest improvement in the criterion.
  • Continue until no further improvement is possible or only essential variables remain.

3.2.3 Stepwise Selection (Mixed Method)

  • Select a scoring criterion (e.g., AIC, BIC, Mallows’ 𝐶ₚ).
  • Start with an empty model and add variables one at a time, as in forward selection, until no variable further improves the score.
  • Then proceed as in backward selection, removing variables one at a time if doing so improves the criterion.
  • Stop when no additional improvement can be achieved or when all variables are included.

The next section shows how to apply these methods to real data.

4. Application

In practice, before applying model selection techniques, it is important to ensure that the covariates are not highly correlated. The following procedure can be applied:

  1. Preliminary filtering: Remove covariates that are clearly irrelevant to the response variable (based on expert judgment, treatment of missing values, etc.).
  2. Correlation with the response variable: Define a threshold for the correlation between each covariate and the response variable (e.g., 0.6). Covariates below this threshold may be excluded.
  3. Correlation among covariates: Define a threshold for pairwise correlations between covariates (e.g., 0.7). Compute the correlation matrix; if two covariates exceed the threshold, keep the one with the strongest correlation with the response variable or the one with greater interpretability from a domain perspective.
  4. Variance Inflation Factor (VIF): Compute the VIF for all remaining covariates. If a covariate’s VIF exceeds 5 or 10, it is considered highly collinear with the others and should be removed (a code sketch for this step is given after the list).
  5. Model selection: Apply the chosen model selection techniques. In this case, we will use Mallows’ 𝐶ₚ as the scoring criterion and backward stepwise selection as the variable selection method.

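Here is a minimal sketch of the VIF step mentioned in point 4, based on statsmodels’ variance_inflation_factor; the removal loop in the comments is just one possible way to apply the threshold:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance Inflation Factor of each covariate in X (a constant is added for the fit)."""
    X_const = sm.add_constant(X)
    vifs = {col: variance_inflation_factor(X_const.values, i)
            for i, col in enumerate(X_const.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Hypothetical usage: drop the covariate with the largest VIF until all VIFs are below 5
# vifs = vif_table(X)
# while vifs.iloc[0] > 5:
#     X = X.drop(columns=[vifs.index[0]])
#     vifs = vif_table(X)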
Finally, we implement a stepwise selection procedure that can incorporate all the criteria discussed above (AIC, BIC, Mallows’ 𝐶ₚ, etc.) under either forward or backward strategies. This unified approach allows us to compare models and select the one that best balances goodness of fit and complexity.

To illustrate the procedure, let us now present the dataset that will be used for the analysis.

4.1 Presentation of the Dataset

We use the Communities and Crime dataset from the UCI Machine Learning Repository, which contains socio-economic and demographic information about U.S. communities. The dataset includes more than 100 variables. The response variable is the number of violent crimes per population (violentCrimesPerPop). Our goal is to apply the model selection techniques discussed above to identify the covariates most strongly associated with this response.

4.2 Handling Missing Values

For this analysis, we remove all rows containing missing values.
An alternative strategy would be to:

  • Drop variables with a high proportion of missingness (e.g., >10%), and
  • Assess whether the remaining missing values are Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR), applying an appropriate imputation method if necessary.

Here, however, we adopt the simpler approach of discarding all incomplete rows. After this step, the dataset contains no missing values and includes 103 variables in total: the response variable violentCrimesPerPop plus the covariates.
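A tiny sketch of this step with pandas (the loading command in the comment assumes the raw UCI file, in which missing values are encoded as "?"):

import numpy as np
import pandas as pd

# With the real data you would first load the UCI file, e.g.:
# df = pd.read_csv("communities.data", header=None, na_values="?")
# Tiny illustration of the strategy on a toy DataFrame:
df = pd.DataFrame({"y": [0.2, 0.5, 0.1], "x1": [1.0, np.nan, 3.0], "x2": [4.0, 5.0, 6.0]})

df_complete = df.dropna(axis=0, how="any")   # keep only complete rows
print(df_complete.shape)                     # (2, 3): the one incomplete row is dropped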

4.3 Number of Relevant Variables Using Expert Judgment

We then apply expert judgment to assess the relevance of each variable and determine whether its correlation with the response is meaningful. This requires consultation between the statistician and domain experts to understand the context and importance of each covariate.

For this dataset, we remove:

  • communityname (a categorical variable with many levels), and
  • fold (a technical variable used only for cross-validation).

After this filtering step, we retain 101 variables: the response violentCrimesPerPop and 100 covariates.

4.4 Reducing Covariates Using a Correlation Threshold

To further reduce dimensionality, we compute the correlation matrix of the covariates and the response. When several covariates are highly correlated with one another (correlation > 0.6), we retain only the one with the strongest correlation to the response. This procedure reduces redundancy while mitigating multicollinearity.
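One possible sketch of this correlation-based filter (the exact implementation used for the article may differ; the threshold value and the names X and y are assumptions):

import pandas as pd

def filter_by_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.6) -> list:
    """Among pairs of covariates whose absolute correlation exceeds `threshold`,
    keep only the covariate most correlated with the response."""
    corr_with_y = X.corrwith(y).abs()
    corr_matrix = X.corr().abs()
    keep = list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if a in keep and b in keep and corr_matrix.loc[a, b] > threshold:
                keep.remove(a if corr_with_y[a] < corr_with_y[b] else b)
    return keep

# Hypothetical usage:
# kept_columns = filter_by_correlation(X, y, threshold=0.6)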

After applying this filtering and computing the Variance Inflation Factor (VIF), we retain a final set of 19 covariates with VIF values below 5.

Table 1 : Variance inflation factor of covariates.

These preprocessing steps are explained in greater detail in my article Feature Selection. Now, let us apply our selection procedure to identify the most relevant variables.

4.5 Model Selection with Stepwise Selection

With 19 variables, the total number of possible models is 2¹⁹ − 1 = 524,287, which can be computationally infeasible for many systems. To reduce the search space, we use a stepwise selection procedure. We implement a function, stepwise_selection, that identifies the most relevant variables based on a chosen selection criterion and method (forward, backward, or mixed). In this example, we use Mallows’ 𝐶ₚ as the selection criterion and apply both forward and backward stepwise selection methods.
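For example, the backward run described in the next subsection can be launched with the stepwise_selection function given in the appendix; here df_selected is an assumed name for a DataFrame holding the response and the 19 retained covariates, with no missing values:

# df_selected: response + the 19 retained covariates, no missing values (see the steps above)
selected_vars, best_model, history = stepwise_selection(
    df_selected,
    target="violentCrimesPerPop",
    strategy="backward",
    metric="Cp",
    verbose=True,
)
print(selected_vars)          # final subset of covariates
print(best_model.summary())   # OLS fit on the selected variables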

4.5.1 Backward Stepwise Selection Using Mallows’ 𝐶ₚ

Applying backward selection with Mallows’ 𝐶ₚ, we proceed as follows:

  • Step 1: Remove pctWFarmSelf. Its exclusion reduces the criterion to 𝐶ₚ = 41.74, lower than that of the full model.
  • Step 2: Remove PctWOFullPlumb. This further decreases 𝐶ₚ to 41.69895.
  • Step 3: Remove indianPerCap. The criterion is reduced again to 𝐶ₚ = 41.66073.

In total, three variables are removed, yielding the final model.

4.5.2 Forward Stepwise Selection Using Mallows’ 𝐶ₚ

Forward stepwise selection is often recommended when the number of variables is large, as it is less computationally demanding than backward selection. Starting from an empty model, variables are added sequentially, one at a time, according to the improvement in the criterion.

In this example, forward selection identifies the same set of variables as backward selection. Figure 1 below illustrates the sequence of variables added to the model, together with their corresponding 𝐶ₚ values. The process begins with PctKids2Par, followed by PctWorkMom and LandArea, and continues until the final model is reached, achieving a criterion value of 𝐶ₚ = 41.66.

Figure 1. Variable selection based on Mallows’ 𝐶ₚ criterion. The corresponding Python implementation is provided in the appendix.
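The forward run and a figure similar to Figure 1 can be reproduced with the same function together with the plot_stepwise_crosses helper from the appendix (again, df_selected is an assumed name):

selected_vars, best_model, history = stepwise_selection(
    df_selected,
    target="violentCrimesPerPop",
    strategy="forward",
    metric="Cp",
)
# Plot the Cp value at each step against the variable added at that step
plot_stepwise_crosses(history,
                      all_vars=[c for c in df_selected.columns if c != "violentCrimesPerPop"],
                      metric="Cp",
                      title="Forward stepwise selection (Mallows' Cp)")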

Warning! This does not yet address the question of which variables are causes of the response variable.

Conclusion

In this article, we addressed the question of model selection. The core principle of the procedure is to assign a score to each model in order to measure its quality, and then to search through the set of possible models to identify the one with the best score. This score is defined by balancing both the quality of fit and the complexity of the model.

Among the available procedures, we presented the stepwise forward and backward methods, which we implemented in Python. We applied them using different evaluation criteria: AIC, BIC, and Mallows’ 𝐶ₚ.

These methods, however, have a limitation: they explore only a subset of all possible models. As a result, the selected models may sometimes be oversimplifications of reality. Nevertheless, they remain very useful when the number of variables is large and exhaustive approaches become computationally too expensive.

Finally, when using regression for predictive purposes, it is important to split the dataset into two parts: a training set and a test set. Variable selection must be carried out exclusively on the training set, never on the test set, in order to ensure an honest evaluation of the model’s predictive performance.
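A minimal sketch of this split (using scikit-learn’s train_test_split here as one convenient option; the names are assumptions):

from sklearn.model_selection import train_test_split

# Hypothetical usage: hold out 20% of the data before any variable selection
# train_df, test_df = train_test_split(df_selected, test_size=0.2, random_state=0)
# selected_vars, best_model, history = stepwise_selection(
#     train_df, target="violentCrimesPerPop", strategy="backward", metric="Cp")
# The test set is used only once, to evaluate the final model's predictions.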

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.

References

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.

Redmond, M. (2002). Communities and Crime [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53W3X.

Cornillon, P. A., Hengartner, N., Matzner-Løber, E., & Rouvière, L. (2023). Régression avec R (3ème édition). EDP Sciences.

Data & Licensing

The dataset utilized in this text is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This license allows anyone to share and adapt the dataset for any purpose, including business use, provided that proper attribution is given to the source.

For more details, see the official license text: CC BY 4.0.

Disclaimer

Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.

Appendix: Code

import numpy as np
import statsmodels.api as sm

def compute_score(y, X, vars_to_test, metric, full_model_mse=None):
    X_train = sm.add_constant(X[vars_to_test])
    model = sm.OLS(y, X_train).fit()
    n = len(y)
    p = len(vars_to_test) + 1  # +1 for the intercept

    if metric == 'AIC':
        return model.aic

    elif metric == 'BIC':
        return model.bic

    elif metric == 'Cp':
        if full_model_mse is None:
            raise ValueError("full_model_mse must be provided to compute Mallows' Cp.")
        rss = sum(model.resid ** 2)
        return rss + 2 * p * full_model_mse

    elif metric == 'R2_adj':
        return -model.rsquared_adj  # negative so that minimizing the score maximizes adjusted R²

    else:
        raise ValueError("Unknown metric. Use 'AIC', 'BIC', 'Cp' or 'R2_adj'.")

def get_best_candidate(y, X, chosen, candidates, metric, strategy, full_model_mse=None):
    scores_with_candidates = []
    for candidate in candidates:
        vars_to_test = chosen + [candidate] if strategy == 'forward' else [var for var in chosen if var != candidate]
        score = compute_score(y, X, vars_to_test, metric, full_model_mse)
        scores_with_candidates.append((score, candidate, vars_to_test))

    scores_with_candidates.sort()
    print("Candidates tested:", [(v, round(s, 2)) for s, v, _ in scores_with_candidates])
    return scores_with_candidates[0] if scores_with_candidates else (None, None, None)

def stepwise_selection(df, target, strategy='forward', metric='AIC', verbose=True):
    if df.isnull().values.any():
        raise ValueError("Missing values are present in the DataFrame.")

    X = df.drop(columns=[target])
    y = df[target]
    variables = list(X.columns)

    chosen = [] if strategy == 'forward' else variables.copy()
    remaining = variables.copy() if strategy == 'forward' else []

    # Precompute the MSE of the full model, needed for Mallows' Cp
    if metric == 'Cp':
        X_full = sm.add_constant(X)
        full_model = sm.OLS(y, X_full).fit()
        full_model_mse = sum(full_model.resid ** 2) / (len(y) - len(variables) - 1)
    else:
        full_model_mse = None

    current_score = np.inf
    history = []
    step = 0

    while True:
        step += 1
        candidates = remaining if strategy == 'forward' else chosen
        best_score, best_candidate, vars_to_test = get_best_candidate(y, X, chosen, candidates, metric, strategy, full_model_mse)

        if best_candidate is None:
            if verbose:
                print("No candidate available.")
            break

        if verbose:
            action = "add" if strategy == 'forward' else "remove"
            print(f"\nStep {step}: Best variable to {action}: {best_candidate} (score={round(best_score, 5)})")


        improvement = best_score < current_score - 1e-6

        if improvement:
            if strategy == 'forward':
                chosen.append(best_candidate)
                remaining.remove(best_candidate)
            else:
                chosen.remove(best_candidate)

            current_score = best_score
            history.append({
                'step': step,
                'chosen': chosen.copy(),
                'score': current_score,
                'modified': best_candidate
            })
        else:
            if verbose:
                print("No further improvement in the score.")
            break

    X_final = sm.add_constant(X[chosen])
    best_model = sm.OLS(y, X_final).fit()

    if verbose:
        print("\nSelected variables:", chosen)
        final_score = best_model.aic if metric == 'AIC' else best_model.bic
        if metric == 'Cp':
            final_score = compute_score(y, X, chosen, metric, full_model_mse)
        elif metric == 'R2_adj':
            final_score = -compute_score(y, X, chosen, metric)
        print(f"Final score ({metric}): {round(final_score, 5)}")

    return chosen, best_model, history
import matplotlib.pyplot as plt


def plot_stepwise_crosses(history, all_vars, metric="AIC", title=None):
    """
    Plot the stepwise path as a cross chart:
    - X axis: explanatory variables modified at some step (in order of appearance)
    - Y axis: score (AIC/BIC/Cp) at each step (taken from the history)
    - A black cross marks the variable modified at each step
    - A gray curve shows the evolution of the score
    """
    n_steps = len(history)
    scores = [h['score'] for h in history]
    
    # Extract the ordered list of variables that were actually modified
    modified_vars = []
    for h in history:
        var = h['modified']
        if var not in modified_vars and var is not None:
            modified_vars.append(var)
    
    n_mod_vars = len(modified_vars)
    
    # X positions of the crosses (according to modified_vars)
    mod_pos = [modified_vars.index(h['modified']) if h['modified'] in modified_vars else None for h in history]

    fig, ax = plt.subplots(figsize=(min(1.3 * n_mod_vars, 8), 6))
    # Place a black cross at each step
    for i, x in enumerate(mod_pos):
        if x is not None:
            ax.scatter(x, scores[i], color='black', marker='x', s=100, zorder=3)
    # Plot the score curve
    ax.plot(range(n_steps), scores, color='gray', alpha=0.7, linewidth=2, zorder=1)
    # X axis: vertical labels, reduced font size (modified variables only)
    ax.set_xticks(range(n_mod_vars))
    ax.set_xticklabels(modified_vars, rotation=90, fontsize=10)
    ax.set_xlabel("Modified variables")
    ax.set_ylabel(metric)
    ax.set_title(title or f"Stepwise ({metric}) – variable modified at each step")
    ax.grid(True, axis='y', alpha=0.2)
    plt.tight_layout()
    plt.show()