Why MissForest Fails in Prediction Tasks: A Key Limitation You Have to Keep in Mind


The objective of this article is to show that, in predictive settings, imputations must always be estimated on the training set and the resulting parameters or models saved. These should then be applied unchanged to the test, out-of-time, or application data, in order to avoid data leakage and ensure an unbiased assessment of generalization performance.

I would like to thank everyone who took the time to read and engage with my article. Your support and feedback are greatly appreciated.

In practice, most real-world datasets contain missing values, making missing data one of the most common challenges in statistical modeling. If it is not handled properly, it can lead to biased coefficient estimates, reduced statistical power, and ultimately incorrect conclusions (Van Buuren, 2018). In predictive modeling, ignoring missing data by performing complete case analysis or by excluding predictor variables with missing values can limit the applicability of the model and result in biased or suboptimal performance.

The Three Missing-Data Mechanisms

To deal with this issue, statisticians classify missing data into three mechanisms that describe how and why values go missing. MCAR (Missing Completely at Random) refers to cases where the missingness occurs entirely at random and is independent of both observed and unobserved variables. MAR (Missing at Random) means that the probability of missingness depends on the observed variables but not on the missing value itself. MNAR (Missing Not at Random) describes the most complex case, in which the probability of missingness depends on the unobserved value itself.

Classical Approaches to Missing Data and Their Limits

Under the MAR assumption, it is possible to use the information contained in the observed variables to predict the missing values. Classical approaches based on this idea include regression-based imputation, k-nearest neighbors (kNN) imputation, and multiple imputation by chained equations (MICE). These methods are considered multivariate because they explicitly condition the imputation on the observed variables. However, they share a major limitation: they do not handle mixed data (continuous + categorical) well and have difficulty capturing nonlinear relationships and complex interactions.

The Rise of MissForest

It is to overcome these limitations that MissForest (Stekhoven & Bühlmann, 2012) has established itself as a benchmark method. Based on random forests, MissForest can capture nonlinear relationships and complex interactions between variables, often outperforming traditional imputation techniques. However, when working on a project that required a generalizable modeling process, with a proper train/test split and out-of-time validation, we encountered a major limitation: the R implementation of the missForest package does not store the imputation model parameters once fitted.

A Critical Limitation of MissForest in Prediction Settings

This creates a practical challenge: it is impossible to train the imputation model on the training set and then apply the very same parameters to the test set. This limitation introduces a risk of data leakage during model evaluation, or a degradation in the quality and consistency of the imputations.

Existing Solutions and Their Risks

While searching for an alternative that would allow consistent imputation in a predictive modeling setting, we asked ourselves a simple but critical question:

How can we impute the test data in a way that remains fully consistent with the imputations learned on the training data?

Exploring this question led us to a discussion on CrossValidated, where another user was facing exactly the same issue.

Two main solutions were suggested to overcome this limitation. The first consists of merging the training and test data before running the imputation. This approach often improves the quality of the imputations because the algorithm has more data to learn from, but it introduces data leakage, since the test set influences the imputation model. The second approach imputes the test set separately from the training set, which prevents information leakage but forces the algorithm to build an entirely new imputation model using only the test data, which is often much smaller. This can lead to less stable imputations and a potential drop in predictive performance.

Even the well-known tutorial by Liam Morgan arrives at a similar workaround. His proposed solution involves imputing the training set, fitting a predictive model, then combining the training and test data for a final imputation step:

library(missForest)
library(randomForest)

# 1) Impute the training set
imp_train_X <- missForest(train_X)$ximp

# 2) Build the predictive model
rf <- randomForest(x = imp_train_X, y = train$creditability)

# 3) Combine train and test, then re-impute
train_test_X <- rbind(test_X, imp_train_X)
imp_test_X <- missForest(train_test_X)$ximp[1:nrow(test_X), ]

Although this approach may improve imputation quality, it suffers from the same weakness as the first method: the test data indirectly participate in the learning process, which can inflate model performance metrics and create an overly optimistic estimate of generalization.

These examples highlight a fundamental dilemma:

  • How can we impute missing values without biasing model evaluation?
  • How can we ensure that the imputations applied to the test set are consistent with those learned on the training set?

Research Question and Motivation

These questions motivated our exploration of a more robust solution that preserves generalization, avoids data leakage, and produces stable imputations suitable for predictive modeling pipelines.

This article is organized into four main sections:

  • Section 1 introduces the process of identifying and characterizing missing values, including methods to detect, quantify, and describe them.
  • Section 2 discusses the MCAR (Missing Completely at Random) mechanism and presents methods for handling missing data under this assumption.
  • Section 3 focuses on the MAR (Missing at Random) mechanism, outlining appropriate imputation strategies and addressing the critical question of how to keep test-set imputations consistent with those learned on the training set.
  • Section 4 examines the MNAR (Missing Not at Random) mechanism and explores strategies for dealing with missing data when the mechanism depends on the unobserved values themselves.

1. Identification and Characterization of Missing Values

This step is critical and should be carried out in close collaboration with all stakeholders: model developers, domain experts, and future users of the model. The goal is to identify all missing values and mark them.

In Python, and particularly when using libraries such as Pandas, NumPy, and Scikit-Learn, missing values are represented as NaN. Values marked as NaN are ignored by many operations such as sum() and count(). You can mark missing values using the replace() function on the relevant subset of columns in a Pandas DataFrame.

Once the missing values have been marked, the next step is to evaluate their distribution for each variable. The isnull() function can be used to flag all NaN values as True, and combined with sum() to count the number of missing values per column.
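As a minimal sketch (the sentinel codes -999 and "?" are hypothetical), marking and counting missing values might look like this:

import numpy as np
import pandas as pd

# Hypothetical toy data where missingness is coded as -999 and "?"
df_raw = pd.DataFrame({
    "income": [2500, -999, 3200, 4100, -999],
    "region": ["north", "?", "south", "south", "north"],
})

# Mark the sentinel codes as NaN on the relevant subset of columns
df_raw[["income", "region"]] = df_raw[["income", "region"]].replace(
    {-999: np.nan, "?": np.nan}
)

# isnull() flags NaN as True; sum() counts missing values per column
print(df_raw.isnull().sum())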

Understanding the distribution of missing values is crucial. With this information, stakeholders can assess whether the patterns of missingness are plausible. It also allows you to define acceptable thresholds of missingness depending on the nature of each variable. For instance, you might decide that up to 10% missing values is acceptable for continuous variables, while the threshold for categorical variables should remain at 0%.

After selecting the relevant variables for modeling, including those containing missing values when they are essential for prediction, it is important to split the dataset into three samples:

  • Training set to estimate parameters and train the models,
  • Test set to guage model performance on unseen data,
  • Out-of-Time (OOT) set to validate the temporal robustness of the model.

This split should be performed in a way that preserves the statistical representativeness of each subsample, for example by using stratified sampling if the target variable is imbalanced.

The analysis of missing values should then be conducted exclusively on the training set:

  • Identify their mechanism (MCAR, MAR, MNAR) using statistical tests,
  • Select the suitable imputation method,
  • Train the imputation models on the training set.

The imputation parameters and models obtained in this step must then be applied to the test set and to the Out-of-Time set. This is essential to avoid information leakage and to ensure an accurate evaluation of the model's generalization performance.

In the next section, we will examine the MCAR mechanism in detail and present the imputation methods that are best suited to this type of missing data.

2. Understanding MCAR and Selecting the Right Imputation Methods

In simple terms, MCAR (Missing Completely at Random) describes a situation where the fact that a value is missing is entirely unrelated to either the value itself or any other variables in the dataset. In mathematical terms, this means that the probability of a data point being missing does not depend on the variable's value or on the values of any other variables: the missingness is completely random.

Before formally defining the MCAR mechanism, let us introduce the notation that will be used in this section and throughout the article:

  • Consider an independent and identically distributed sample of n observations:

yi = (yi1, ..., yip)T,  i = 1, 2, ..., n

where p is the number of variables with missing values and n is the sample size.

  • Y ∈ R^(n×p) represents the variables that may contain missing values. This is the set on which we wish to perform imputation.
  • We denote the observed and missing entries of Y as Yo and Ym, respectively.
  • X ∈ R^(n×q) represents the fully observed variables, meaning they contain no missing values.
  • To indicate which components of yi are observed or missing, we define the indicator vector:

ri = (ri1, ..., rip)T,  i = 1, 2, ..., n

with rik = 1 if yik is observed, and 0 otherwise.

  • Stacking these vectors yields the full matrix of presence/absence indicators:

R = (r1, ..., rn)T
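As a quick numpy sketch (toy values), the indicator matrix R can be computed directly from a data matrix Y whose missing entries are NaN:

import numpy as np

# Minimal sketch: Y is an n x p matrix with NaN marking missing entries
Y = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# r_ik = 1 if y_ik is observed, 0 otherwise; stacking the rows gives R
R = (~np.isnan(Y)).astype(int)
print(R)
# [[1 0]
#  [1 1]
#  [0 1]]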

The MCAR assumption is then defined as:

Pr(R | Ym, Yo, X) = Pr(R) (1)

This means that the missingness indicators are completely independent of both the missing data Ym and the observed data Yo. Note that here R is also independent of the covariates X. Before presenting methods for handling missing values under the MCAR assumption, we will first introduce a few simple techniques to assess whether the MCAR assumption is likely to hold.

2.1 Assessing the MCAR Assumption

In this section, we will simulate a dataset with 10,000 observations and four variables under the MCAR assumption:

  • One continuous variable containing 20% missing values and one categorical variable with two levels (0 and 1) containing 10% missing values.
  • One continuous variable and one categorical variable that are fully observed, with no missing values.
  • Finally, a binary target variable named target, taking values 0 and 1.
import numpy as np
import pandas as pd

# --- Reproducibility ---
np.random.seed(42)

# --- Parameters ---
n = 10000

# --- Utility Functions ---
def generate_continuous(mean, std, size, missing_rate=0.0):
    """Generate a continuous variable with optional MCAR missingness."""
    values = np.random.normal(loc=mean, scale=std, size=size)
    if missing_rate > 0:
        mask = np.random.rand(size) < missing_rate
        values[mask] = np.nan
    return values

def generate_categorical(levels, probs, size, missing_rate=0.0):
    """Generate a categorical variable with optional MCAR missingness."""
    values = np.random.choice(levels, size=size, p=probs).astype(float)
    if missing_rate > 0:
        mask = np.random.rand(size) < missing_rate
        values[mask] = np.nan
    return values

# --- Variable Generation ---
variables = {
    "cont_mcar": generate_continuous(mean=100, std=20, size=n, missing_rate=0.20),
    "cat_mcar": generate_categorical(levels=[0, 1], probs=[0.7, 0.3], size=n, missing_rate=0.10),
    "cont_full": generate_continuous(mean=50, std=10, size=n),
    "cat_full": generate_categorical(levels=[0, 1], probs=[0.6, 0.4], size=n),
    "goal": np.random.selection([0, 1], size=n, p=[0.5, 0.5])
}

# --- Construct DataFrame ---
df = pd.DataFrame(variables)

# --- Display Summary ---
print(df.head())
print("nMissing value counts:")
print(df.isnull().sum())

Before performing any analysis, it is essential to split the dataset into two parts: a training set and a test set.

2.1.1 Preparing Train and Test Data for Assessing the MCAR Assumption

It is essential to split the dataset into training and test sets while ensuring representativeness. This guarantees that both the model and the imputation methods are learned exclusively on the training set and then evaluated on the test set. Doing so prevents data leakage and provides an unbiased estimate of the model's ability to generalize to unseen data.

from sklearn.model_selection import train_test_split
import pandas as pd

def stratified_split(df, strat_vars, test_size=0.3, random_state=None):
    """
    Split a DataFrame into train and test sets with stratification
    based on one or multiple variables.

    Parameters
    ----------
    df : pandas.DataFrame
        The input dataset.
    strat_vars : list or str
        Column name(s) used for stratification.
    test_size : float, default=0.3
        Proportion of the dataset to include in the test split.
    random_state : int, optional
        Random seed for reproducibility.

    Returns
    -------
    train_df : pandas.DataFrame
        Training set.
    test_df : pandas.DataFrame
        Test set.
    """
    # Ensure strat_vars is a list
    if isinstance(strat_vars, str):
        strat_vars = [strat_vars]

    # Create a combined stratification key (fill NaN first so that missing
    # values form their own stratum, then cast to string)
    strat_key = df[strat_vars].fillna("MISSING").astype(str).agg("_".join, axis=1)

    # Perform stratified split
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        stratify=strat_key,
        random_state=random_state
    )

    return train_df, test_df


# --- Application ---
# Stratify on cat_mcar, cat_full, and target
train_df, test_df = stratified_split(df, strat_vars=["cat_mcar", "cat_full", "target"], test_size=0.3, random_state=42)

print(f"Train size: {train_df.shape[0]}  ({len(train_df)/len(df):.1%})")
print(f"Test size:  {test_df.shape[0]}  ({len(test_df)/len(df):.1%})")

2.1.2 Assessing the MCAR Assumption for Continuous Variables with Missing Values

The first step is to create a binary indicator R (where 1 indicates an observed value and 0 indicates a missing value) and compare the distributions of Yo, Ym, and X across the two groups (observed vs. missing).

Let us illustrate this process using the variable cont_mcar as an example. We will compare the distribution of cont_full between observations where cont_mcar is missing and where it is observed, using both a boxplot and a Kolmogorov–Smirnov test. We will then perform a similar analysis for the categorical variable cat_full, comparing proportions across the two groups with a bar plot and a chi-squared test.

import matplotlib.pyplot as plt
import seaborn as sns

# --- Step 1: Train/Test Split with Stratification ---
train_df, test_df = stratified_split(
    df,
    strat_vars=["cat_mcar", "cat_full", "target"],
    test_size=0.3,
    random_state=42
)

# --- Step 2: Create the R indicator on the training set ---
train_df = train_df.copy()
train_df["R_cont_mcar"] = np.where(train_df["cont_mcar"].isnull(), 0, 1)

# --- Step 3: Prepare the data for comparison ---
df_obs = pd.DataFrame({
    "cont_full": train_df.loc[train_df["R_cont_mcar"] == 1, "cont_full"],
    "Group": "Observed (R=1)"
})

df_miss = pd.DataFrame({
    "cont_full": train_df.loc[train_df["R_cont_mcar"] == 0, "cont_full"],
    "Group": "Missing (R=0)"
})

df_all = pd.concat([df_obs, df_miss])

# --- Step 4: KS Test before plotting ---
from scipy.stats import ks_2samp
stat, p_value = ks_2samp(
    train_df.loc[train_df["R_cont_mcar"] == 1, "cont_full"],
    train_df.loc[train_df["R_cont_mcar"] == 0, "cont_full"]
)

# --- Step 5: Visualization with KS result ---
plt.figure(figsize=(8, 6))
sns.boxplot(
    x="Group", 
    y="cont_full", 
    data=df_all,
    palette="Set2",
    width=0.6,
    fliersize=3
)

# Add red diamonds for means
means = df_all.groupby("Group")["cont_full"].mean()
for i, m in enumerate(means):
    plt.scatter(i, m, color="red", marker="D", s=50, zorder=3, label="Mean" if i == 0 else "")

# Title and KS test result
plt.title("Distribution of cont_full by Missingness of cont_mcar (Train Set)",
          fontsize=14, weight="daring")

# Add KS result as text box
textstr = f"KS Statistic = {stat:.3f}nP-value = {p_value:.3f}"
plt.gca().text(
    0.05, 0.95, textstr,
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment='top',
    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)
)

plt.ylabel("cont_full", fontsize=12)
plt.xlabel("")
sns.despine()
plt.legend()
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# --- Step 1: Construct contingency table on the TRAIN set ---
contingency_table = pd.crosstab(train_df["R_cont_mcar"], train_df["cat_full"])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# --- Step 2: Compute row-wise proportions for each group (R = 0/1) ---
props = contingency_table.div(contingency_table.sum(axis=1), axis=0)

# Transform for plotting: Group (R) on x-axis, Category as hue
df_props = props.reset_index().melt(
    id_vars="R_cont_mcar",
    var_name="Category",
    value_name="Proportion"
)

# Map R values to clear labels
df_props["Group"] = df_props["R_cont_mcar"].map({1: "Observed (R=1)", 0: "Missing (R=0)"})

# --- Plot: Group on x-axis, bars show proportions of every category ---
sns.set_theme(style="whitegrid")
plt.figure(figsize=(8,6))

sns.barplot(
    x="Group", y="Proportion", hue="Category",
    data=df_props, palette="Set2"
)

# Title and Chi² result
plt.title("Proportion of cat_full by Observed/Missing Status of cont_mcar (Train Set)",
          fontsize=14, weight="daring")

# Add Chi² result as a text box
textstr = f"Chi² = {chi2:.3f}, p = {p_value:.3f}"
plt.gca().text(
    0.05, 0.95, textstr,
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment='top',
    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)
)

plt.xlabel("Observed / Missing Group (R)")
plt.ylabel("Proportion")
plt.legend(title="cat_full Category")
sns.despine()
plt.show()

The two figures above show that, under the MCAR assumption, the distributions of Yo, Ym, and X remain unchanged regardless of the value of R (1 = observed, 0 = missing). These results are further supported by the Kolmogorov–Smirnov and chi-squared tests, which confirm the absence of significant differences between the observed and missing groups.

For categorical variables with missing values, the same analyses can be performed as described above. While these univariate checks can be time-consuming, they are useful when the number of variables is small, as they provide a fast and intuitive first look at the missing data mechanism. For larger datasets, however, multivariate methods should be considered.
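For small variable sets, a helper along these lines (the function name and arguments are ours) can automate the univariate checks, applying the KS test to continuous variables and the chi-squared test to categorical ones:

import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency

def univariate_mcar_checks(data, miss_var, cont_vars, cat_vars):
    """Compare each fully observed variable between the observed (R=1) and
    missing (R=0) groups of `miss_var`: KS test for continuous variables,
    chi-squared test for categorical ones."""
    r = data[miss_var].notna()
    results = {}
    for col in cont_vars:
        stat, p = ks_2samp(data.loc[r, col].dropna(), data.loc[~r, col].dropna())
        results[col] = ("KS", stat, p)
    for col in cat_vars:
        chi2, p, _, _ = chi2_contingency(pd.crosstab(r, data[col]))
        results[col] = ("Chi2", chi2, p)
    return results

# Example on the simulated training set
print(univariate_mcar_checks(train_df, "cont_mcar",
                             cont_vars=["cont_full"], cat_vars=["cat_full"]))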

2.1.3 Multivariate Assessment of the MCAR Assumption

To the best of my knowledge, only one multivariate statistical test is widely used to assess the MCAR assumption at the dataset level: Little's chi-squared test for MCAR, available in R as mcartest. This test compares the distributions of observed variables across the different missing-data patterns and computes a global test statistic that follows a chi-squared distribution.

However, its main limitation is that it is not well suited to categorical variables, because it relies on the strong assumption that the variables are normally distributed. We now turn to the methods for imputing missing values under the MCAR assumption.

2.2 Methods for Handling Missing Data under MCAR

Under the MCAR assumption, the missingness indicators R are independent of Yo, Ym, and X. Since the data are missing completely at random, dropping incomplete observations does not introduce bias. However, this approach becomes inefficient when the proportion of missing values is high.

In such cases, simple imputation methods, which replace missing values with the mean, median, or most frequent category, are often preferred. They are easy to implement, require little computational effort, and can be maintained over time without adding complexity for modelers. While these methods do not introduce bias under MCAR, they tend to underestimate variance and may distort relationships between variables.

By contrast, advanced methods such as regression-based imputation, kNN, or multiple imputation can improve statistical efficiency and help preserve information when the proportion of missing data is substantial. Their main drawback lies in their algorithmic complexity, higher computational cost, and the greater effort required to maintain them in production settings.

To impute missing values under the MCAR assumption for prediction purposes, proceed as follows:

  1. Learn the imputation values from the training set only, using the mean for continuous variables and the most frequent category for categorical variables.
  2. Apply these values to replace missing data in both the training and the test sets.
  3. Evaluate the model on the test set, ensuring that no information from the test set was used during the imputation process.
import pandas as pd

def compute_impute_values(df, cont_vars, cat_vars):
    """
    Compute imputation values (mean for continuous, mode for categorical)
    from the training set only.
    """
    impute_values = {}
    for col in cont_vars:
        impute_values[col] = df[col].mean()
    for col in cat_vars:
        impute_values[col] = df[col].mode().iloc[0]
    return impute_values

def apply_imputation(train_df, test_df, impute_values, vars_to_impute):
    """
    Apply the learned imputation values to each train and test sets.
    """
    train_df[vars_to_impute] = train_df[vars_to_impute].fillna(value=impute_values)
    test_df[vars_to_impute] = test_df[vars_to_impute].fillna(value=impute_values)
    return train_df, test_df

# --- Example usage ---
train_df, test_df = stratified_split(
    df,
    strat_vars=["cat_mcar", "cat_full", "target"],
    test_size=0.3,
    random_state=42
)

# Variables to impute
cont_vars = ["cont_mcar"]
cat_vars = ["cat_mcar"]
vars_to_impute = cont_vars + cat_vars

# 1. Learn imputation values on TRAIN
impute_values = compute_impute_values(train_df, cont_vars, cat_vars)
print("Imputation values learned from train:", impute_values)

# 2. Apply them consistently to TRAIN and TEST
train_df, test_df = apply_imputation(train_df, test_df, impute_values, vars_to_impute)

# 3. Check
print("Remaining missing values in train:n", train_df[vars_to_impute].isnull().sum())
print("Remaining missing values in test:n", test_df[vars_to_impute].isnull().sum())

This section on understanding MCAR and selecting the appropriate imputation method provides a clear foundation for approaching similar strategies under the MAR assumption.

3. Understanding MAR and Selecting the Right Imputation Methods

The MAR assumption is defined as:

Pr(R | Ym, Yo, X) = Pr(R | Yo, X) (2)

In other words, the distribution of the missingness indicators depends only on the observed data. Even the case where R depends only on the covariates X,

Pr(R | Ym, Yo, X) = Pr(R | X) (3)

still falls under the MAR assumption.
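To make the mechanism concrete, here is a minimal simulation sketch of MAR missingness, with illustrative parameters: the probability that Y is missing depends only on the fully observed covariate X, never on Y itself.

import numpy as np

rng = np.random.default_rng(42)
n = 10000

# X is fully observed and drives the missingness of Y
X = rng.normal(50, 10, n)
Y = 2 * X + rng.normal(0, 5, n)

# P(R=0) is a logistic function of the observed X only
p_missing = 0.3 / (1 + np.exp(-(X - 50) / 5))
Y[rng.random(n) < p_missing] = np.nan

print(f"Share of missing Y: {np.isnan(Y).mean():.1%}")  # roughly 15%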

3.1 Assessing the MAR Assumption for Variables with Missing Values

Under the MAR assumption, the missingness indicators R depend only on the observed variables Yo and X, but not on the missing data Ym.
To indirectly assess the plausibility of this assumption, common statistical tests (Student's t-test, Kolmogorov–Smirnov, chi-squared, etc.) can be applied by comparing the distributions of observed variables between groups with and without missing values.

For a multivariate assessment, one may also use the mcartest implemented in R, which extends Little's MCAR test to evaluate assumption (3), namely Pr(R | Ym, Yo, X) = Pr(R | X), under the assumption of multivariate normality of the variables.

If this test is not rejected, the missing-data mechanism can reasonably be considered MAR (assumption 3) given the auxiliary variables X.

We can now turn to the question of how to impute this type of missing data.

3.2 Methods for Handling Missing Data under MAR

Under the MAR assumption, the probability of missingness R depends only on the observed variables Yo and the covariates X. In this setting, a variable Yk with missing values can be explained using the other available variables Yo and X, which motivates the use of advanced imputation methods based on supervised learning.

These approaches involve building a predictive model in which the incomplete variable Yk serves as the target, while the other observed variables Yo and X act as predictors. The model is trained on the complete cases [Yk]o of Yk and then applied to estimate the missing values [Yk]m of Yk.
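A minimal sketch of this idea, assuming the predictor columns are fully observed and using a random forest regressor from scikit-learn (the helper name is ours):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def supervised_impute(train_df, test_df, target_col, predictor_cols):
    """Fit a model for the incomplete variable Y_k on the complete cases
    of the training set, then reuse it to impute both sets.
    Assumes the predictor columns contain no missing values."""
    obs = train_df[target_col].notna()
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train_df.loc[obs, predictor_cols], train_df.loc[obs, target_col])
    for frame in (train_df, test_df):
        miss = frame[target_col].isna()
        if miss.any():
            frame.loc[miss, target_col] = model.predict(frame.loc[miss, predictor_cols])
    return train_df, test_df

# e.g. supervised_impute(train_df, test_df, "cont_mcar", ["cont_full", "cat_full"])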

The most commonly used imputation methods in the literature include:

  • k-nearest neighbors (KNNimpute, Troyanskaya et al., 2001), primarily applied to continuous data;
  • the saturated multinomial model (Schafer, 1997), designed for categorical data;
  • multivariate imputation by chained equations (MICE, Van Buuren & Oudshoorn, 1999), suitable for mixed datasets but dependent on tuning parameters and the specification of a parametric model.

All of these approaches rely on assumptions about the underlying data distribution or on the ability of the chosen model to adequately capture relationships between variables.

More recently, MissForest (Stekhoven & Bühlmann, 2012) has emerged as a nonparametric alternative based on random forests, well suited to mixed data types and able to capture both interactions and nonlinear relationships.

The MissForest algorithm relies on random forests (RF) to impute missing values. The authors propose the following procedure:

Figure: The MissForest algorithm
Source: Stekhoven & Bühlmann (2012) [2]

As defined, the MissForest algorithm cannot be used directly for prediction purposes. For each variable, between steps 6 and 7, the random forest model Ms used to predict ymis(s) from xmis(s) is not saved. Consequently, it is neither feasible nor advisable for practitioners to rely on MissForest as a predictive model in production.

The absence of stored models Ms or imputation parameters (here, learned on the training set) makes it difficult to evaluate generalization performance on new data. Although some have attempted to work around this issue by following Liam Morgan's approach, the challenge remains unresolved.

Moreover, this limitation increases algorithmic complexity and computational cost, because the entire algorithm must be rerun from scratch for each new dataset (for instance, when working with separate training and test sets).

What should be done? Should the MissForest algorithm still be used?

If the goal is to develop a model for classification or analysis solely on the available dataset, with no intention of applying it to new data, then MissForest is strongly recommended, as it offers high accuracy and robustness.

However, if the aim is to build a predictive model that will be applied to new datasets, MissForest should be avoided for the reasons discussed above. In such cases, it is preferable to use an algorithm that explicitly stores the imputation models or the parameters estimated from the training set.

Fortunately, an adapted version now exists: MissForestPredict, available since 2024 in both R and Python and specifically designed for predictive tasks. For further details, we refer the reader to Albu et al. (2024).

Using the MissForestPredict algorithm for prediction consists of applying the standard MissForest procedure to the training data. Unlike the original MissForest, however, this version returns and stores the individual models Ms associated with each variable, which makes it possible to reuse them for imputing missing values in new datasets.

Figure: MissForestPredict-based imputation with model saving
Source: Albu et al. (2024) [4]

The algorithm below illustrates how to apply MissForestPredict to new observations, whether they come from the test set, an out-of-time sample, or an application dataset.

Figure: Illustration of MissForestPredict applied to a new observation
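We do not reproduce the missForestPredict API here; as an illustration of the same fit-then-apply design, the sketch below uses scikit-learn's IterativeImputer wrapped around a random forest estimator (categorical columns are treated as numeric in this simplification). The fitted per-variable models are stored in the imputer object and reused unchanged on new data.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# The fitted per-variable models live inside `imputer` and are reused on
# new data, which is exactly the property the original missForest lacks.
features = ["cont_mcar", "cat_mcar", "cont_full", "cat_full"]
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
train_imp = imputer.fit_transform(train_df[features])  # learn on train only
test_imp = imputer.transform(test_df[features])        # apply unchanged to test

This mirrors the pattern recommended throughout this article: fit on the training set, then transform the test and out-of-time sets with the stored models.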

We now have all the elements needed to address the issues raised in the introduction. Let us turn to the final mechanism, MNAR, before moving on to the conclusion.

4. Understanding MNAR

Missing Not At Random (MNAR) occurs when the missing-data mechanism depends directly on the unobserved values themselves. In other words, if a variable Y contains missing values, then the indicator variable R (with R=1 if Y is observed and R=0 otherwise) depends on the missing component Ym.
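A minimal simulation sketch of such a mechanism (illustrative parameters), where, for example, high values of y are less likely to be reported:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(100, 20, 10000)

# The probability of missingness grows with the unobserved value of y itself
p_missing = 1 / (1 + np.exp(-(y - 120) / 10))
y[rng.random(y.size) < p_missing] = np.nan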

There is no universal statistical method for handling this type of mechanism, because the information needed to model the dependency is precisely what is missing. In such cases, the recommended approach is to rely on domain expertise to understand the reasons behind the nonresponse and to define context-specific strategies for analyzing and addressing the missing values.

It is important to emphasize, however, that MAR and MNAR cannot generally be distinguished empirically based on the observed data alone.

Conclusion

The objective of this article was to show how to impute missing values for predictive purposes without biasing the evaluation of model performance. To this end, we presented the main mechanisms that generate missing data (MCAR, MAR, MNAR), the statistical tests used to assess their plausibility, and the imputation methods best suited to each.

Our analysis highlights that, under MCAR, simple imputation methods are generally preferable, as they provide substantial time savings without introducing bias. In practice, however, missing-data mechanisms are most often MAR. In this setting, advanced imputation approaches such as MissForest, based on machine learning models, are particularly appropriate.

Nevertheless, when the goal is to build predictive models, it is essential to use methods that store the imputation parameters or models learned from the training data and then apply them unchanged to the test, out-of-time, or application datasets. This is precisely the contribution of MissForestPredict (introduced in 2024 and available in both R and Python), which addresses the limitation of the original MissForest (2012), a method not originally designed for predictive tasks.

Using MissForest for prediction without adaptation may therefore lead to biased results, unless corrective measures are implemented. It would be highly valuable for practitioners who have deployed MissForest in production to share the strategies they developed to overcome this limitation.

References

[1] Audigier, V., White, I. R., Jolani, S., Debray, T. P., Quartagno, M., Carpenter, J., … & Resche-Rigon, M. (2018). Multiple imputation for multilevel data with continuous and binary variables.

[2] Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

[3] Li, C. (2013). Little's test of missing completely at random. The Stata Journal, 13(4), 795-809.

[4] Albu, E., Gao, S., Wynants, L., & Van Calster, B. (2024). missForestPredict–Missing data imputation for prediction settings.

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
