Exploratory Data Evaluation for Credit Scoring with Python

At the start of a credit scoring project, it is often tempting to jump straight to modeling. Yet the first, and most crucial, step is to understand the data.

In our previous post, we described how the databases used to build credit scoring models are constructed. We also highlighted the importance of asking the right questions:

  • Who are the customers?
  • What kinds of loans are they granted?
  • What characteristics appear to explain default risk?

In this article, we illustrate this foundational step using an open-source dataset available on Kaggle: the Credit Scoring Dataset. This dataset contains 32,581 observations and 12 variables describing loans issued by a bank to individual borrowers.

These loans cover a variety of financing needs — medical, personal, educational, and professional — as well as debt consolidation operations. Loan amounts range from $500 to $35,000.

The variables capture two dimensions:

  • contract characteristics (loan amount, interest rate, purpose of financing, credit grade, and time elapsed since loan origination),
  • borrower characteristics (age, income, years of professional experience, and housing status).

The model’s target variable is default, which takes the value 1 if the customer is in default and 0 otherwise.

Today, many tools and a growing number of AI agents are able to automatically generate statistical descriptions of datasets. Nevertheless, performing this analysis manually remains a good exercise for beginners. It builds a deeper understanding of the data structure, helps highlight potential anomalies, and supports the identification of variables that may be predictive of risk.

In this article, we take a straightforward instructional approach to statistically describing each variable in the dataset.

  • For categorical variables, we analyze the number of observations and the default rate for each category.
  • For continuous variables, we discretize them into four intervals defined by the quartiles:
    • ]min; Q1], ]Q1; Q2], ]Q2; Q3] and ]Q3; max]

We then apply the same descriptive analysis to these intervals as for categorical variables. This segmentation is arbitrary and could be replaced by other discretization methods. The goal is simply to get an initial read on how risk behaves across the different loan and borrower characteristics.
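As a minimal sketch of this approach — on synthetic data rather than the actual Kaggle file, with column names following the article's conventions — `pd.qcut` produces the four quartile intervals, after which the per-interval counts and default rates follow from a single groupby:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (the real file has 32,581 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "person_income": rng.lognormal(mean=10.9, sigma=0.5, size=1_000),
    "loan_status": rng.integers(0, 2, size=1_000),
})

# Bin the continuous variable into ]min; Q1], ]Q1; Q2], ]Q2; Q3], ]Q3; max].
df["income_quartile"] = pd.qcut(df["person_income"], q=4)

# Same descriptive analysis as for a categorical variable:
# number of observations and default rate per interval.
summary = (
    df.groupby("income_quartile", observed=True)["loan_status"]
      .agg(count="size", default_rate="mean")
)
print(summary)
```

The same pattern applies unchanged to any of the continuous variables discussed below.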

Descriptive Statistics of the Modeling Dataset

Distribution of the Target Variable (loan_status)

This variable indicates whether the loan granted to a counterparty has resulted in a repayment default. It takes two values: 0 if the customer is not in default, and 1 if the customer is in default.

Over 78% of customers have not defaulted. The dataset is imbalanced, and it is important to account for this imbalance during modeling.
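This check is a one-liner with `value_counts(normalize=True)`. The sketch below uses a hypothetical 78/22 split rather than the real column:

```python
import pandas as pd

# Hypothetical 0/1 target with the imbalance described in the article.
loan_status = pd.Series([0] * 78 + [1] * 22, name="loan_status")

# Share of each class; on the real dataset, class 0 exceeds 78%.
distribution = loan_status.value_counts(normalize=True)
print(distribution)
```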

The next relevant variable to analyze would be a temporal one. It would allow us to examine how the default rate evolves over time, check its stationarity, and assess its stability and predictability.

Unfortunately, the dataset contains no temporal information. We do not know when each observation was recorded, which makes it impossible to determine whether the loans were issued during a period of economic stability or during a downturn.

This information is nevertheless essential in credit risk modeling. Borrower behavior can vary significantly depending on the macroeconomic environment. For instance, during financial crises — such as the 2008 subprime crisis or the COVID-19 pandemic — default rates typically rise sharply compared with more favorable economic periods.

The absence of a temporal dimension in this dataset therefore limits the scope of our analysis. In particular, it prevents us from studying how risk dynamics evolve over time and from evaluating the potential robustness of a model against economic cycles.

We do, however, have access to the variable cb_person_cred_hist_length, which represents the length of a customer’s credit history, expressed in years.

Distribution by Credit History Length (cb_person_cred_hist_length)

This variable has 29 distinct values, ranging from 2 to 30 years. We will treat it as a continuous variable and discretize it using quantiles.

Several observations can be drawn from the table above. First, more than 56% of borrowers have a credit history of 4 years or less, indicating that a large proportion of clients in the dataset have relatively short credit histories.

Second, the default rate appears fairly stable across intervals, hovering around 21%. That said, borrowers with shorter credit histories tend to exhibit slightly riskier behavior than those with longer ones, as reflected in their higher default rates.

Distribution by Previous Default (cb_person_default_on_file)

This variable indicates whether the borrower has previously defaulted on a loan. It therefore provides useful information about the client’s past credit behavior.

It has two possible values:

  • Y: the borrower has defaulted in the past
  • N: the borrower has never defaulted

In this dataset, more than 80% of borrowers have no history of default, suggesting that the majority of clients have maintained a satisfactory repayment record.

However, a clear difference in risk emerges between the two groups. Borrowers with a previous default history are significantly riskier, with a default rate of about 38%, compared with around 18% for borrowers who have never defaulted.

This result is consistent with what is typically observed in credit risk modeling: past repayment behavior is often one of the strongest predictors of future default.
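Comparing the two groups is again a one-line groupby: the mean of the 0/1 target per category is the default rate. A sketch on synthetic data calibrated to the approximate 38% and 18% figures above:

```python
import pandas as pd

# Synthetic sketch: 100 borrowers with a past default (Y), 400 without (N),
# with default rates matching the ~38% and ~18% reported in the article.
df = pd.DataFrame({
    "cb_person_default_on_file": ["Y"] * 100 + ["N"] * 400,
    "loan_status": [1] * 38 + [0] * 62 + [1] * 72 + [0] * 328,
})

# Default rate per category = mean of the binary target.
rates = df.groupby("cb_person_default_on_file")["loan_status"].mean()
print(rates)
```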

Distribution by Age

The presence of the age variable in this dataset indicates that the loans are granted to individual borrowers (retail clients) rather than corporate entities. To better analyze this variable, we group borrowers into age intervals based on quartiles.

The dataset includes borrowers across a wide range of ages. However, the distribution is strongly skewed toward younger individuals: more than 70% of borrowers are under 30 years old.

The analysis of default rates across the age groups reveals that the highest risk is concentrated in the first quartile, followed by the second quartile. In other words, younger borrowers appear to be the riskiest segment in this dataset.

Distribution by Annual Income

Borrowers’ annual income in this dataset ranges from $4,000 to $6,000,000. To analyze its relationship with default risk, we divide income into four intervals based on quartiles.

The results show that the highest default rates are concentrated among borrowers with the lowest incomes, particularly in the first quartile ($4,000–$38,500) and the second quartile ($38,500–$55,000).

As income increases, the default rate gradually decreases. Borrowers in the third quartile ($55,000–$79,200) and the fourth quartile ($79,200–$6,000,000) exhibit noticeably lower default rates.

Overall, this pattern suggests an inverse relationship between annual income and default risk, which is consistent with standard credit risk expectations: borrowers with higher incomes typically have greater repayment capacity and financial stability, making them less likely to default.

Distribution by Home Ownership

This variable describes the borrower’s housing status. The categories include RENT (tenant), MORTGAGE (homeowner with a mortgage), OWN (homeowner with no mortgage), and OTHER (other housing arrangements).

In this dataset, roughly 50% of borrowers are renters, 40% are homeowners with a mortgage, 8% own their home outright, and about 2% fall into the “OTHER” category.

The analysis reveals that the highest default rates are observed among renters (RENT) and borrowers classified as “OTHER.” In contrast, homeowners without a mortgage (OWN) exhibit the lowest default rates, followed by borrowers with a mortgage (MORTGAGE).

Distribution by Employment Length (person_emp_length)

This variable measures the borrower’s employment length in years. To analyze its relationship with default risk, borrowers are grouped into four intervals based on quartiles: the first quartile (0–2 years), the second quartile (2–4 years), the third quartile (4–7 years), and the fourth quartile (7 years or more).

The analysis shows that the highest default rates are concentrated among borrowers with the shortest employment histories, particularly those in the first quartile (0–2 years) and the second quartile (2–4 years).

As employment length increases, the default rate tends to decline. Borrowers in the third quartile (4–7 years) and the fourth quartile (7 years or more) exhibit lower default rates.

Overall, this pattern suggests an inverse relationship between employment length and default risk, indicating that borrowers with longer employment histories may benefit from greater income stability and financial security, which reduces their likelihood of default.

Distribution by Loan Intent (loan_intent)

This categorical variable describes the purpose of the loan requested by the borrower. The categories include EDUCATION, MEDICAL, VENTURE (entrepreneurship), PERSONAL, DEBTCONSOLIDATION, and HOMEIMPROVEMENT.

The number of borrowers is fairly balanced across the different loan purposes, with a slightly higher share of loans used for education (EDUCATION) and medical expenses (MEDICAL).

However, the analysis reveals notable differences in risk across categories. Borrowers seeking loans for debt consolidation (DEBTCONSOLIDATION) and medical purposes (MEDICAL) exhibit higher default rates. In contrast, loans intended for education (EDUCATION) and entrepreneurial activities (VENTURE) are associated with lower default rates.

Overall, these results suggest that the purpose of the loan may be an important risk indicator, as different financing needs can reflect varying levels of financial stability and repayment capacity.

Distribution by Loan Grade (loan_grade)

This categorical variable represents the loan grade assigned to each borrower, typically based on an assessment of their credit risk profile. The grades range from A to G, where A corresponds to the lowest-risk loans and G to the highest-risk loans.

In this dataset, more than 80% of borrowers are assigned grades A, B, or C, indicating that the majority of loans are considered relatively low risk. In contrast, grades D, E, F, and G correspond to borrowers with higher credit risk, and these categories account for a much smaller share of the observations.

The distribution of default rates across the grades shows a clear pattern: the default rate increases as the loan grade deteriorates. In other words, borrowers with lower credit grades tend to exhibit higher probabilities of default.

This result is consistent with the purpose of the grading system itself, as loan grades are designed to summarize the borrower’s creditworthiness and associated risk level.
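One way to verify this pattern programmatically is to treat loan_grade as an ordered categorical and check that the default rate is monotonically increasing from A to G. A sketch on illustrative counts (not the dataset's real figures):

```python
import pandas as pd

# Illustrative default counts per 100 borrowers, worsening from A to G.
defaults_per_100 = {"A": 6, "B": 10, "C": 14, "D": 30, "E": 45, "F": 60, "G": 80}

rows = []
for grade, k in defaults_per_100.items():
    rows += [(grade, 1)] * k + [(grade, 0)] * (100 - k)
df = pd.DataFrame(rows, columns=["loan_grade", "loan_status"])

# Ordered categorical so summaries sort A -> G by grade, not by frequency.
df["loan_grade"] = pd.Categorical(
    df["loan_grade"], categories=list("ABCDEFG"), ordered=True
)

rates = df.groupby("loan_grade", observed=True)["loan_status"].mean()
print(rates.is_monotonic_increasing)  # True: risk deteriorates with the grade
```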

Distribution by Loan Amount

This variable represents the loan amount requested by the borrower. In this dataset, loan amounts range from $500 to $35,000, which corresponds to relatively small consumer loans.

The analysis of default rates across the quartiles shows that the highest risk is concentrated among borrowers in the upper range of loan amounts, particularly in the fourth quartile ($20,000–$35,000), where default rates are higher.

Distribution by Loan Interest Rate (loan_int_rate)

This variable represents the interest rate applied to the loan granted to the borrower. In this dataset, interest rates range from 5% to 24%.

To analyze the relationship between interest rates and default risk, we group the observations into quartiles. The results show that the highest default rates are concentrated in the upper range of interest rates, particularly in the fourth quartile (roughly 13%–24%).

Distribution by Loan Percent Income (loan_percent_income)

This variable measures the proportion of a borrower’s annual income allocated to loan repayment. It indicates the financial burden associated with the loan relative to the borrower’s income.

The analysis shows that the highest default rates are concentrated in the upper quartile, where borrowers allocate between 20% and 100% of their income to loan repayment.
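Assuming the usual definition of this ratio — loan amount divided by annual income, which is how such a field is commonly derived (the column names below are assumptions, and the values are made up) — the computation is a single vectorized division:

```python
import pandas as pd

# Made-up values; assumed definition: loan amount / annual income.
df = pd.DataFrame({
    "loan_amnt": [5_000, 10_000, 35_000],
    "person_income": [50_000, 40_000, 35_000],
})

df["loan_percent_income"] = df["loan_amnt"] / df["person_income"]
print(df["loan_percent_income"].tolist())  # [0.1, 0.25, 1.0]
```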

Conclusion

In this analysis, we have described each of the 12 variables in the dataset. This exploratory step allowed us to build a clear understanding of the data and quickly summarize its key characteristics in the introduction.

In the past, this type of analysis was often time-consuming and typically required the collaboration of several data scientists to perform the statistical exploration and produce the final reporting. While the interpretations of the different variables may sometimes appear repetitive, such detailed documentation is often required in regulated environments, particularly in fields like credit risk modeling.

Today, however, the rise of artificial intelligence is transforming this workflow. Tasks that previously required several days of work can now be completed in less than thirty minutes, under the supervision of a statistician or data scientist. In this setting, the expert’s role shifts from manually performing the analysis to guiding the process, validating the results, and ensuring their reliability.

In practice, it is possible to design two specialized AI agents at this stage of the workflow. The first agent assists with data preparation and dataset construction, while the second performs the exploratory analysis and generates the descriptive reporting presented in this article.

Several years ago, it was already recommended to automate these tasks whenever possible. In this post, the tables used throughout the analysis were generated automatically using the Python functions presented at the end of this article.

In the next article, we will take the analysis a step further by exploring variable treatment, detecting and handling outliers, analyzing relationships between variables, and performing an initial feature selection.

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.


Data & Licensing

The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.

For more details, see the official license text: CC0: Public Domain.

Disclaimer

Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.

Code

import pandas as pd
from typing import Optional, Union


def build_default_summary(
    df: pd.DataFrame,
    category_col: str,
    default_col: str,
    category_label: Optional[str] = None,
    include_na: bool = False,
    sort_by: str = "count",
    ascending: bool = False,
) -> pd.DataFrame:
    """
    Construit un tableau de synthèse pour une variable catégorielle.

    Paramètres
    ----------
    df : pd.DataFrame
        DataFrame source.
    category_col : str
        Nom de la variable catégorielle.
    default_col : str
        Colonne binaire indiquant le défaut (0/1 ou booléen).
    category_label : str, optionnel
        Libellé à afficher pour la première colonne.
        Par défaut : category_col.
    include_na : bool, default=False
        Si True, conserve les valeurs manquantes comme catégorie.
    sort_by : str, default="count"
        Colonne de tri logique parmi {"count", "defaults", "prop", "default_rate", "category"}.
    ascending : bool, default=False
        Sens du tri.

    Retour
    ------
    pd.DataFrame
        Tableau prêt à exporter.
    """

    if category_col not in df.columns:
        raise KeyError(f"Categorical column '{category_col}' not found.")
    if default_col not in df.columns:
        raise KeyError(f"Default column '{default_col}' not found.")

    data = df[[category_col, default_col]].copy()

    # Minimal validation of the target:
    # convert bool -> int; otherwise assume a documented 0/1 encoding
    if pd.api.types.is_bool_dtype(data[default_col]):
        data[default_col] = data[default_col].astype(int)

    # Handle missing values in the categorical variable
    if include_na:
        data[category_col] = data[category_col].astype("object").fillna("Missing")
    else:
        data = data[data[category_col].notna()].copy()

    grouped = (
        data.groupby(category_col, dropna=False)[default_col]
        .agg(count="size", defaults="sum")
        .reset_index()
    )

    total_obs = grouped["count"].sum()
    total_def = grouped["defaults"].sum()

    grouped["prop"] = grouped["count"] / total_obs if total_obs > 0 else 0.0
    grouped["default_rate"] = grouped["defaults"] / grouped["count"]

    sort_mapping = {
        "count": "count",
        "defaults": "defaults",
        "prop": "prop",
        "default_rate": "default_rate",
        "category": category_col,
    }
    if sort_by not in sort_mapping:
        raise ValueError(
            "sort_by must be one of {'count', 'defaults', 'prop', 'default_rate', 'category'}."
        )

    grouped = grouped.sort_values(sort_mapping[sort_by], ascending=ascending).reset_index(drop=True)

    total_row = pd.DataFrame(
        {
            category_col: ["Total"],
            "count": [total_obs],
            "defaults": [total_def],
            "prop": [1.0 if total_obs > 0 else 0.0],
            "default_rate": [total_def / total_obs if total_obs > 0 else 0.0],
        }
    )

    summary = pd.concat([grouped, total_row], ignore_index=True)


    summary = summary.rename(
        columns={
            category_col: category_label or category_col,
            "count": "Nb of obs",
            "defaults": "Nb def",
            "prop": "Prop",
            "default_rate": "Default rate",
        }
    )
    summary = summary[[category_label or category_col, "Nb of obs", "Prop", "Nb def", "Default rate"]]
    return summary


def export_summary_to_excel(
    summary: pd.DataFrame,
    output_path: str,
    sheet_name: str = "Summary",
    title: str = "All perimeters",
) -> None:
    """
    Exporte le tableau de synthèse dans un fichier Excel avec mise en forme.
    Nécessite le moteur xlsxwriter.
    """

    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        workbook = writer.book
        worksheet = workbook.add_worksheet(sheet_name)

        nrows, ncols = summary.shape
        # Layout (0-based xlsxwriter rows):
        # row 0 : merged title
        # row 1 : header
        # data starts at row 2

        # -------- Formats --------
        border_color = "#4F4F4F"
        header_bg = "#D9EAF7"
        title_bg = "#CFE2F3"
        total_bg = "#D9D9D9"
        white_bg = "#FFFFFF"

        title_fmt = workbook.add_format({
            "daring": True,
            "align": "center",
            "valign": "vcenter",
            "font_size": 14,
            "border": 1,
            "bg_color": title_bg,
        })

        header_fmt = workbook.add_format({
            "daring": True,
            "align": "center",
            "valign": "vcenter",
            "border": 1,
            "bg_color": header_bg,
        })

        text_fmt = workbook.add_format({
            "border": 1,
            "align": "left",
            "valign": "vcenter",
            "bg_color": white_bg,
        })

        int_fmt = workbook.add_format({
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "# ##0",
            "bg_color": white_bg,
        })

        pct_fmt = workbook.add_format({
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "0.00%",
            "bg_color": white_bg,
        })

        total_text_fmt = workbook.add_format({
            "daring": True,
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "bg_color": total_bg,
        })

        total_int_fmt = workbook.add_format({
            "daring": True,
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "# ##0",
            "bg_color": total_bg,
        })

        total_pct_fmt = workbook.add_format({
            "daring": True,
            "border": 1,
            "align": "center",
            "valign": "vcenter",
            "num_format": "0.00%",
            "bg_color": total_bg,
        })

        # -------- Merged title --------
        worksheet.merge_range(0, 0, 0, ncols - 1, title, title_fmt)

        # -------- Header --------
        worksheet.set_row(1, 28)  # header row height
        for col_idx, col_name in enumerate(summary.columns):
            worksheet.write(1, col_idx, col_name, header_fmt)

        # -------- Column widths --------
        column_widths = {
            0: 24,  # category
            1: 14,  # Nb of obs
            2: 12,  # Prop
            3: 10,  # Nb def
            4: 14,  # Default rate
        }
        for col_idx in range(ncols):
            worksheet.set_column(col_idx, col_idx, column_widths.get(col_idx, 15))

        # -------- Cell-by-cell formatting --------
        last_row_idx = nrows - 1

        for row_idx in range(nrows):
            excel_row = 2 + row_idx  # data rows start at worksheet row 2 (0-based)

            is_total = row_idx == last_row_idx

            for col_idx, col_name in enumerate(summary.columns):
                value = summary.iloc[row_idx, col_idx]

                if col_idx == 0:
                    fmt = total_text_fmt if is_total else text_fmt
                elif col_name in ["Nb of obs", "Nb def"]:
                    fmt = total_int_fmt if is_total else int_fmt
                elif col_name in ["Prop", "Default rate"]:
                    fmt = total_pct_fmt if is_total else pct_fmt
                else:
                    fmt = total_text_fmt if is_total else text_fmt

                worksheet.write(excel_row, col_idx, value, fmt)

        # Optional: freeze the title and header rows
        # worksheet.freeze_panes(2, 1)

        worksheet.set_default_row(24)


def generate_categorical_report_excel(
    df: pd.DataFrame,
    category_col: str,
    default_col: str,
    output_path: str,
    sheet_name: str = "Summary",
    title: str = "All perimeters",
    category_label: Optional[str] = None,
    include_na: bool = False,
    sort_by: str = "count",
    ascending: bool = False,
) -> pd.DataFrame:
    """
    
    1. calcule le tableau
    2. l'exporte vers Excel
    3. renvoie aussi le DataFrame récapitulatif
    """
    summary = build_default_summary(
        df=df,
        category_col=category_col,
        default_col=default_col,
        category_label=category_label,
        include_na=include_na,
        sort_by=sort_by,
        ascending=ascending,
    )

    export_summary_to_excel(
        summary=summary,
        output_path=output_path,
        sheet_name=sheet_name,
        title=title,
    )

    return summary

def discretize_variable_by_quartiles(
    df: pd.DataFrame,
    variable: str,
    new_var: Optional[str] = None
) -> pd.DataFrame:
    """
    Discretize a continuous variable into 4 intervals based on its quartiles.

    The function computes Q1, Q2 (median), and Q3 of the chosen variable and
    creates 4 bins corresponding to the following intervals:

        ]min ; Q1], ]Q1 ; Q2], ]Q2 ; Q3], ]Q3 ; max]

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe containing the variable to discretize.

    variable : str
        Name of the continuous variable to be discretized.

    new_var : str, optional
        Name of the new categorical variable created. If None,
        the name "<variable>_quartile" is used.

    Returns
    -------
    pd.DataFrame
        A copy of the dataframe with the new quartile-based categorical variable.
    """

    # Work on a copy of the dataframe to avoid modifying the original dataset
    data = df.copy()

    # If no name is provided for the new variable, derive one automatically
    if new_var is None:
        new_var = f"{variable}_quartile"

    # Compute the quartiles of the variable
    q1, q2, q3 = data[variable].quantile([0.25, 0.50, 0.75])

    # Retrieve the minimum and maximum values of the variable
    vmin = data[variable].min()
    vmax = data[variable].max()

    # Define the bin edges
    # A small epsilon is subtracted from the minimum value to ensure it is included
    bins = [vmin - 1e-9, q1, q2, q3, vmax]

    # Define human-readable labels for every interval
    labels = [
        f"]{vmin:.2f} ; {q1:.2f}]",
        f"]{q1:.2f} ; {q2:.2f}]",
        f"]{q2:.2f} ; {q3:.2f}]",
        f"]{q3:.2f} ; {vmax:.2f}]",
    ]

    # Use pandas.cut to assign each observation to a quartile-based interval
    data[new_var] = pd.cut(
        data[variable],
        bins=bins,
        labels=labels,
        include_lowest=True
    )

    # Return the dataframe with the new discretized variable
    return data

Example of application for a continuous variable

# Distribution by age (person_age)
# Discretize the variable into quartiles

df_with_age_bins = discretize_variable_by_quartiles(
    df,
    variable="person_age",
    new_var="age_quartile"
)

summary = generate_categorical_report_excel(
    df=df_with_age_bins,
    category_col="age_quartile",
    default_col="def",
    output_path="age_quartile_report.xlsx",
    sheet_name="Age Quartiles",
    title="Distribution by Age (Quartiles)",
    category_label="Age Quartiles",
    sort_by="default_rate",
    ascending=False
)