This article is the third part of a series I decided to write on how to build a robust and stable credit scoring model over time.
The first article focused on how to build a credit scoring dataset, while the second explored exploratory data analysis (EDA) and how to better understand borrower and loan characteristics before modeling.
The project dates back to my final year at an engineering school. As part of a credit scoring project, a bank provided us with data about individual customers. In a previous article, I explained how this type of dataset is usually constructed.
The goal of the project was to develop a scoring model that could predict a borrower’s credit risk over a one-month horizon. As soon as we received the data, the first step was to perform an exploratory data analysis. In my previous article, I briefly explained why exploratory data analysis is essential for understanding the structure and quality of a dataset.
The dataset provided by the bank contained more than 300 variables and over a million observations, covering two years of historical data. The variables were both continuous and categorical. As is common with real-world datasets, some variables contained missing values, some had outliers, and others showed strongly imbalanced distributions.
Since we had little experience with modeling at the time, several methodological questions quickly came up.
The first question was about the data preparation process. Should we apply preprocessing steps to the entire dataset first and then split it into training, test, and OOT (out-of-time) sets? Or should we split the data first and then apply all preprocessing steps separately?
This question matters. A scoring model is built for prediction, which means it must be able to generalize to new observations, such as new bank customers. For this reason, every step in the data preparation pipeline, including variable preselection, must be designed with this objective in mind.
Another question was about the role of domain experts. At what stage should they be involved in the process? Should they participate early during data preparation, or only later when interpreting the results? We also faced more technical questions. For example, should missing values be imputed before treating outliers, or the other way around?
In this article, we focus on a key step in the modeling process: handling extreme values (outliers) and missing values. This step can sometimes also contribute to reducing the dimensionality of the problem, especially when variables with poor data quality are removed or simplified during preprocessing.
I previously described a related process in another article on variable preprocessing for linear regression. In practice, the way variables are processed often depends on the type of model used for training. Some methods, such as regression models, are sensitive to outliers and generally require explicit treatment of missing values. Other approaches can handle these issues more naturally.
To illustrate the steps presented here, we use the same dataset introduced in the previous article on exploratory data analysis. This open-source dataset is available on Kaggle: the Credit Scoring Dataset. It contains 32,581 observations and 12 variables describing loans issued by a bank to individual borrowers.
Although this example involves a relatively small number of variables, the preprocessing approach described here can easily be applied to much larger datasets, including those with several hundred variables.
Finally, it is important to remember that this type of analysis only makes sense if the dataset is of high quality and representative of the problem being studied. In practice, data quality is one of the most critical factors for building robust and reliable credit scoring models.
In the following sections, we turn to a practical and essential step: handling outliers and missing values using a real credit scoring dataset.
Creating a Time Variable
Our dataset does not contain a variable that directly captures the time dimension of the observations. This is problematic because the goal is to build a prediction model that can estimate whether new borrowers will default. Without a time variable, it becomes difficult to clearly illustrate how to split the data into training, test, and out-of-time (OOT) samples. In addition, we cannot easily assess the stability or monotonic behavior of variables over time.
To address this limitation, we create an artificial time variable, which we call year.
We construct this variable using cb_person_cred_hist_length, which represents the length of a borrower’s credit history. This variable has 29 distinct values, ranging from 2 to 30 years. In the previous article, when we discretized it into quartiles, we observed that the default rate remained relatively stable across intervals, around 21%.
This is exactly the behavior we want for our year variable: a relatively stationary default rate, meaning that the default rate stays stable across different time periods.
To construct this variable, we make the following assumption. We arbitrarily suppose that borrowers with a 2-year credit history entered the portfolio in 2022, those with 3 years of history in 2021, and so on. For example, a value of 10 years corresponds to an entry in 2014. Finally, all borrowers with a credit history greater than or equal to 11 years are grouped into a single category corresponding to an entry in 2013.
This approach gives us a dataset covering an approximate historical period from 2013 to 2022, providing about ten years of historical data. This reconstructed timeline enables more meaningful train, test, and out-of-time splits when developing the scoring model, and also lets us check the stability of the risk driver distribution over time.
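A minimal sketch of this mapping, assuming a reference year of 2024 (so that a 2-year history corresponds to an entry in 2022, as described above):

```python
import pandas as pd

# Toy sample of credit history lengths (2 to 30 years in the real dataset)
df = pd.DataFrame({"cb_person_cred_hist_length": [2, 3, 10, 11, 25]})

# 2 years -> 2022, 3 years -> 2021, ..., and histories of 11 years
# or more are grouped into a single 2013 cohort.
df["year"] = (2024 - df["cb_person_cred_hist_length"]).clip(lower=2013)

print(df["year"].tolist())  # [2022, 2021, 2014, 2013, 2013]
```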
Training and Validation Datasets
This section addresses an important methodological question: should we split the data before performing data treatment and variable preselection, or after?
In practice, machine learning methods are commonly used to develop credit scoring models, especially when a sufficiently large dataset is available and covers the full scope of the portfolio. The methodology used to estimate model parameters must be statistically justified and based on sound evaluation criteria. In particular, we must account for potential estimation biases caused by overfitting or underfitting, and choose an appropriate level of model complexity.
Model estimation should ultimately depend on its ability to generalize, meaning its capacity to accurately score new borrowers who were not part of the training data. To properly evaluate this ability, the dataset used to measure model performance must be independent of the dataset used to train the model.
In statistical modeling, three types of datasets are typically used to achieve this objective:
- Training (or development) dataset used to estimate and fit the parameters of the model.
- Validation / Test dataset (in-time) used to evaluate the quality of the model fit on data that were not used during training.
- Out-of-time (OOT) validation dataset used to assess the model’s performance on data from a different time period, which helps evaluate whether the model remains stable over time.
Other validation strategies are also commonly used in practice, such as k-fold cross-validation or leave-one-out validation.
Dataset Definition
In this section, we present an example of how to create the datasets used in our analysis: train, test, and OOT.
The development dataset (train + test) covers the period from 2013 to 2021. Within this dataset:
- 70% of the observations are assigned to the training set
- 30% are assigned to the test set
The OOT dataset corresponds to 2022.

# Development sample (train + test): observations up to 2021
train_test_df = df[df["year"] <= 2021].copy()
# Out-of-time sample: the most recent year only
oot_df = df[df["year"] == 2022].copy()
train_test_df.to_csv("train_test_data.csv", index=False)
oot_df.to_csv("oot_data.csv", index=False)
Preserving Model Generalization
To preserve the model’s ability to generalize, once the dataset has been split into train, test, and OOT, the test and OOT datasets must remain completely untouched during model development.
In practice, they should be treated as if they were locked away, and only used after the modeling strategy has been defined and the candidate models have been trained. These datasets will later allow us to compare model performance and select the final model.
One important point to keep in mind is that all preprocessing steps applied to the training dataset must be replicated exactly on the test and OOT datasets. This includes:
- handling outliers
- imputing missing values
- discretizing variables
- and applying any other preprocessing transformations.
Splitting the Development Dataset into Train and Test
To train and evaluate the different models, we split the development dataset (2013–2021) into two parts:
- a training set (70%)
- a test set (30%)
To ensure that the distributions remain comparable across these two datasets, we perform a stratified split. The stratification variable combines the default indicator and the year variable:
def_year = def + year
This variable allows us to preserve both the default rate and the temporal structure of the data when splitting the dataset.
Before performing the stratified split, it is important to first examine the distribution of the new variable def_year to verify that stratification is feasible. If some groups contain too few observations, stratification may not be possible or may require adjustments.
In our case, the smallest group defined by def_year contains more than 300 observations, which means that stratification is perfectly feasible. We can therefore split the dataset into train and test sets, save them, and continue the preprocessing steps using only the training dataset. The same transformations will later be replicated on the test and OOT datasets.
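This sanity check can be sketched as follows on toy data (in the article it would run on train_test_df before the split):

```python
import pandas as pd

# Toy development sample: default flag and artificial year
toy = pd.DataFrame({
    "def":  [0, 0, 1, 0, 1, 0, 0, 1],
    "year": [2013, 2013, 2013, 2014, 2014, 2014, 2014, 2013],
})

# Stratification variable combining default status and year
toy["def_year"] = toy["def"].astype(str) + "_" + toy["year"].astype(str)

# If the smallest group is too small, stratified splitting will fail
group_sizes = toy["def_year"].value_counts()
print(group_sizes.min())
```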
from sklearn.model_selection import train_test_split

# Stratification variable combining the default indicator and the year
train_test_df["def_year"] = train_test_df["def"].astype(str) + "_" + train_test_df["year"].astype(str)
# 70/30 stratified split of the development dataset
train_df, test_df = train_test_split(train_test_df, test_size=0.3, random_state=42, stratify=train_test_df["def_year"])
# Save the datasets
train_df.to_csv("train_data.csv", index=False)
test_df.to_csv("test_data.csv", index=False)
oot_df.to_csv("oot_data.csv", index=False)

In the next sections, all analyses are performed using the training data.
Outlier Treatment
We start by identifying and treating outliers, and we validate these treatments with domain experts. In practice, this step is easier for experts to assess than missing value imputation: experts often know the plausible ranges of variables, but they may not always know why a value is missing. Performing this step first also helps reduce the bias that extreme values could introduce during the imputation process.
To treat extreme values, we use the IQR (interquartile range) method. This method is commonly used for variables that roughly follow a normal distribution. Before applying any treatment, it is important to visualize the distributions using boxplots and density plots.
In our dataset, we have six continuous variables. Their boxplots and density plots are shown below.


The table below presents, for each variable, the lower bound and upper bound, defined as:
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
where IQR = Q3 − Q1, and Q1 and Q3 correspond to the first and third quartiles, respectively.

In this study, this treatment method is reasonable because it does not significantly alter the central tendency of the variables. To further validate this approach, we can refer to the previous article and examine which quantile ranges the lower and upper bounds fall into, and analyze the default rate of borrowers within these intervals.
When treating outliers, it is important to proceed carefully. The objective is to reduce the influence of extreme values without changing the scope of the study.
From the table above, we observe that the IQR method would cap the age of borrowers at 51 years. This result is acceptable only if the study population was originally defined with a maximum age of 51. If this restriction was not part of the initial scope, the threshold should be discussed with domain experts to determine a reasonable upper bound for the variable.
Suppose, for example, that borrowers up to 60 years old are considered part of the portfolio. In that case, the IQR method would not be appropriate for treating outliers in the person_age variable, because it would artificially truncate valid observations.
Two alternatives can then be considered. First, domain experts may specify a maximum plausible age, such as 100 years, which would define the acceptable range of the variable. Another approach is to use a technique called winsorization.
Winsorization follows a similar idea to the IQR method: it limits the range of a continuous variable, but the bounds are typically defined using extreme quantiles or expert-defined thresholds, for example the 1st and 99th percentiles.
Observations falling outside this restricted range are then replaced by the nearest boundary value (the corresponding quantile or a value determined by experts).
This approach can be applied in two ways:
- Unilateral winsorization, where only one side of the distribution is capped.
- Bilateral winsorization, where both the lower and upper tails are capped.

In this example, all observations with values below €6 are replaced with €6 for the variable of interest. Similarly, all observations with values above €950 are replaced with €950.
We compute the 90th, 95th, and 99th percentiles of the person_age variable to check whether the IQR method is appropriate. If not, we could use the 99th percentile as the upper bound for a winsorization approach.
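A sketch of such a bilateral winsorization, using the €6 and €950 bounds from the example on toy data:

```python
import pandas as pd

# Toy amounts with an extreme value at each tail
s = pd.Series([2.0, 6.0, 50.0, 120.0, 800.0, 950.0, 3000.0])

# Bilateral winsorization with expert-defined bounds; quantile-based
# bounds would use s.quantile(0.01) and s.quantile(0.99) instead
winsorized = s.clip(lower=6, upper=950)

print(winsorized.tolist())  # [6.0, 6.0, 50.0, 120.0, 800.0, 950.0, 950.0]
```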

In this case, the 99th percentile is equal to the IQR upper bound (51). This confirms that the IQR method is appropriate for treating outliers on this variable.
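The percentile check itself is a one-liner; a sketch on toy age values (in the article it would run on train_df["person_age"]):

```python
import pandas as pd

# Toy age values standing in for train_df["person_age"]
ages = pd.Series([22, 23, 24, 25, 26, 27, 30, 35, 40, 51])

# Upper-tail percentiles to compare with the IQR upper bound
upper_tail = ages.quantile([0.90, 0.95, 0.99])
print(upper_tail)
```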
import pandas as pd

# Continuous variables to treat (assumed names for the six continuous
# variables of the dataset)
variables = [
    "person_age", "person_income", "person_emp_length",
    "loan_amnt", "loan_int_rate", "loan_percent_income",
]

def apply_iqr_bounds(train, test, oot, variables):
    """Compute IQR bounds on the training set and clip all three datasets."""
    train = train.copy()
    test = test.copy()
    oot = oot.copy()
    bounds = []
    for var in variables:
        # Bounds are computed on the training data only
        Q1 = train[var].quantile(0.25)
        Q3 = train[var].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        bounds.append({
            "Variable": var,
            "Lower Bound": lower,
            "Upper Bound": upper,
        })
        # The same bounds are then applied to train, test, and OOT
        for df in [train, test, oot]:
            df[var] = df[var].clip(lower, upper)
    bounds_table = pd.DataFrame(bounds)
    return bounds_table, train, test, oot

bounds_table, train_clean_outlier, test_clean_outlier, oot_clean_outlier = apply_iqr_bounds(
    train_df,
    test_df,
    oot_df,
    variables,
)
Another approach that can often be useful when dealing with outliers in continuous variables is discretization, which I will discuss in a future article.
Imputing Missing Values
The dataset contains two variables with missing values: loan_int_rate and person_emp_length. In the training dataset, the distribution of missing values is summarized in the table below.

The fact that only two variables contain missing values allows us to analyze them more carefully. Instead of immediately imputing them with a simple statistic such as the mean or the median, we first try to understand whether there is a pattern behind the missing observations.
In practice, when dealing with missing data, the first step is often to consult domain experts. They may provide insights into why certain values are missing and suggest reasonable ways to impute them. This helps us better understand the mechanism generating the missing values before applying statistical tools.
A simple way to explore this mechanism is to create indicator variables that take the value 1 when a variable is missing and 0 otherwise. The idea is to check whether the probability that a value is missing depends on the other observed variables.
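A minimal sketch of this indicator approach on toy data (in the article it would be applied to train_df):

```python
import pandas as pd

# Toy data: income observed everywhere, employment length partly missing
toy = pd.DataFrame({
    "person_income":     [20_000, 25_000, 90_000, 85_000, 30_000, 95_000],
    "person_emp_length": [None,   1.0,    8.0,    10.0,   None,   7.0],
})

# Indicator: 1 if the value is missing, 0 otherwise
toy["emp_missing"] = toy["person_emp_length"].isnull().astype(int)

# Compare an observed variable across the missing / non-missing groups
means = toy.groupby("emp_missing")["person_income"].mean()
print(means)
```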
Case of the Variable person_emp_length
The figure below shows the boxplots of the continuous variables depending on whether person_emp_length is missing or not.

Several differences can be observed. For example, observations with missing values tend to have:
- lower income compared with observations where the variable is observed,
- smaller loan amounts,
- lower interest rates,
- and higher loan-to-income ratios.
These patterns suggest that the missing observations are not randomly distributed across the dataset. To confirm this intuition, we can complement the graphical analysis with statistical tests, such as:
- Kolmogorov–Smirnov or Kruskal–Wallis tests for continuous variables,
- Cramér’s V test for categorical variables.
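The Kolmogorov–Smirnov test is available in scipy; a sketch on toy data, assuming a missing-indicator split like the one above (for categorical variables, recent scipy versions also expose Cramér's V via scipy.stats.contingency.association):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy setting: a continuous variable split by a missingness indicator
income_when_missing = rng.normal(30_000, 5_000, 200)  # rows where the value is missing
income_when_present = rng.normal(60_000, 5_000, 200)  # rows where it is observed

# Two-sample Kolmogorov-Smirnov test: same distribution in both groups?
res = stats.ks_2samp(income_when_missing, income_when_present)
print(res.pvalue < 0.05)  # True here: the two distributions clearly differ
```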
These analyses would typically show that the probability of a missing value depends on the observed variables. This mechanism is referred to as MAR (Missing At Random).
Under MAR, several imputation methods can be considered, including machine learning approaches such as k-nearest neighbors (KNN).
However, in this article, we adopt a conservative imputation strategy, which is commonly used in credit scoring. The idea is to assign missing values to a category associated with a higher probability of default.
In our previous analysis, we observed that borrowers with the highest default rate belong to the first quartile of employment length, corresponding to customers with less than two years of employment history. To remain conservative, we therefore assign missing values for person_emp_length to 0, meaning no employment history.
Case of the Variable loan_int_rate
When we analyze the relationship between loan_int_rate and the other continuous variables, the graphical analysis suggests no clear differences between observations with missing values and those without.

In other words, borrowers with missing interest rates appear to behave similarly to the rest of the population in terms of the other variables. This observation can also be confirmed using statistical tests.
This type of mechanism is usually referred to as MCAR (Missing Completely At Random). In this case, the missingness is independent of both the observed and unobserved variables.
When the missing data mechanism is MCAR, a simple imputation strategy is generally sufficient. In this study, we choose to impute the missing values of loan_int_rate using the median, which is robust to extreme values.
If you would like to explore missing value imputation techniques in more depth, I recommend reading this article.
The code below shows how to impute the train, test, and OOT datasets while preserving the independence between them. This approach ensures that all imputation parameters are computed using the training dataset only and then applied to the other datasets. By doing so, we limit potential biases that could otherwise affect the model’s ability to generalize to new data.
def impute_missing_values(train, test, oot,
                          emp_var="person_emp_length",
                          rate_var="loan_int_rate",
                          emp_value=0):
    """
    Impute missing values using statistics computed on the training dataset.

    Parameters
    ----------
    train, test, oot : pandas.DataFrame
        Datasets to process.
    emp_var : str
        Variable representing employment length.
    rate_var : str
        Variable representing interest rate.
    emp_value : int or float
        Value used to impute employment length (conservative strategy).

    Returns
    -------
    train_imp, test_imp, oot_imp : pandas.DataFrame
        Imputed datasets.
    """
    # Copy datasets to avoid modifying the originals
    train_imp = train.copy()
    test_imp = test.copy()
    oot_imp = oot.copy()

    # Compute statistics on TRAIN only
    rate_median = train_imp[rate_var].median()

    # Create missing indicators
    for df in [train_imp, test_imp, oot_imp]:
        df[f"{emp_var}_missing"] = df[emp_var].isnull().astype(int)
        df[f"{rate_var}_missing"] = df[rate_var].isnull().astype(int)

    # Apply imputations
    for df in [train_imp, test_imp, oot_imp]:
        df[emp_var] = df[emp_var].fillna(emp_value)
        df[rate_var] = df[rate_var].fillna(rate_median)

    return train_imp, test_imp, oot_imp

# Apply the imputation
train_imputed, test_imputed, oot_imputed = impute_missing_values(
    train=train_clean_outlier,
    test=test_clean_outlier,
    oot=oot_clean_outlier,
    emp_var="person_emp_length",
    rate_var="loan_int_rate",
    emp_value=0,
)
We have now treated both outliers and missing values. To keep the article focused and avoid making it too long, we will stop here and move on to the conclusion. At this stage, the train, test, and OOT datasets can be safely saved.
train_imputed.to_csv("train_imputed.csv", index=False)
test_imputed.to_csv("test_imputed.csv", index=False)
oot_imputed.to_csv("oot_imputed.csv", index=False)
In the next article, we will analyze correlations among variables to perform robust variable selection. We will also introduce the discretization of continuous variables and study two important properties for credit scoring models: monotonicity and stability over time.
Conclusion
This article is part of a series dedicated to building credit scoring models that are both robust and stable over time.
In this article, we highlighted the importance of handling outliers and missing values during the preprocessing stage. Properly treating these issues helps prevent biases that could otherwise distort the model and reduce its ability to generalize to new borrowers.
To preserve this generalization capability, all preprocessing steps must be calibrated using only the training dataset, while maintaining strict independence from the test and out-of-time (OOT) datasets. Once the transformations are defined on the training data, they must then be replicated exactly on the test and OOT datasets.
In the next article, we will analyze the relationships between the target variable and the explanatory variables, following the same methodological principle, that is, preserving the independence between the train, test, and OOT datasets.
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Data & Licensing
The dataset used in this article is distributed under the CC0: Public Domain license, which allows anyone to share and adapt it for any purpose, including commercial use.
For more details, see the official license text: CC0: Public Domain.
Disclaimer
Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.
