Is Your Training Data Representative? A Guide to Checking with PSI in Python


To get the most out of this tutorial, you should have a solid understanding of how to compare two distributions. If you don’t, I recommend checking out this excellent article by @matteo-courthoud.

We automate the analysis and export the results to an Excel file using Python. If you already know the fundamentals of Python and how to write to Excel, that will make things even easier.

I would love to thank everyone who took the time to read and engage with my article. Your support and feedback mean a lot.

In data science projects, whether academic or professional, the question of data representativeness between two samples arises frequently.

By representativeness, we mean the degree to which two samples resemble each other or share the same characteristics. This notion is crucial, as it directly determines the accuracy of statistical conclusions and the performance of a predictive model.

At each stage of a model’s life cycle, the question of data representativeness takes specific forms:

  • During the development phase: this is where it all begins. You gather the data, clean it, split it into training, test, and out-of-time samples, estimate the parameters, and carefully document every decision. You make sure that the test and out-of-time samples are representative of the training data.
  • During the application phase: once the model is built, it must be confronted with reality. Here a crucial question arises: do the new datasets truly resemble those used during development? If not, much of the previous work may quickly lose its value.
  • During the monitoring phase, or backtesting: over time, populations evolve. The model must therefore be regularly challenged. Do its predictions remain valid? Is the representativeness of the target portfolio still ensured?

Representativeness is therefore not a one-off constraint, but a concern that accompanies the model throughout its life cycle.

To answer the question of representativeness between two samples, the most common approach is to compare their distributions, proportions, and structures. This involves visual tools such as density functions, histograms, and boxplots, supplemented by statistical tests such as Student’s t-test, the Kruskal-Wallis test, the Wilcoxon test, or the Kolmogorov-Smirnov test. On this subject, @matteo-courthoud has published an excellent article, complete with practical code, to which we refer the reader for further details.

In this article, we will focus on two practical tools often used in credit risk management to check whether two datasets are comparable:

  • The Population Stability Index (PSI) shows how much a distribution shifts, either over time or between two samples.
  • Cramér’s V measures the strength of association between categories, helping us see if two populations share the same structure.

We will then explore how these tools can help engineers and decision-makers by turning statistical comparisons into clear insights for faster and more reliable decisions.

In Section 1 of this article, we present two concrete examples where questions of representativeness between samples may arise. In Section 2, we assess representativeness between two datasets using PSI and Cramér’s V. Finally, in Section 3, we demonstrate how to implement and automate these analyses in Python, exporting the results into an Excel file.

1. Two real-world examples of the representativeness challenge

The question of representativeness becomes particularly important when a model is applied to a domain other than the one for which it was developed. Two typical situations illustrate this challenge:

1.1 When a model is applied to a new scope of clients

Imagine a bank developing a scoring model for small businesses. The model performs well and is recognized internally. Encouraged by this success, the leadership decides to extend its use to large corporations. Your manager asks for your opinion on the approach. What steps do you take before responding?

Since the development and application populations differ, using the model on the new population extends its scope. It is therefore crucial to confirm that this application is valid.

The statistician has several tools to address this question, notably a representativeness analysis comparing the development population with the application population. This can be done by examining their characteristics variable by variable, for example through tests of mean equality, tests of distribution equality, or by comparing the distributions of categorical variables.

1.2 When two banks merge and wish to align their risk models

Now consider Bank A, a large institution with a substantial balance sheet and a proven model for assessing client default risk. Bank A is studying the possibility of merging with Bank B. Bank B, however, operates in a weaker economic environment and has not developed its own internal model.

Suppose Bank A’s management approaches you, as the statistician responsible for its internal models. The strategic question is: would it be appropriate to apply Bank A’s internal models to Bank B’s portfolio in the event of a merger?

Before applying Bank A’s internal model to Bank B’s portfolio, it is crucial to compare the distributions of key variables across both portfolios. The model can only be transferred with confidence if the two populations are truly representative of each other.

We have just presented two concrete cases where verifying representativeness is essential for sound decision-making. In the next section, we address how to analyze representativeness between two portfolios by introducing two statistical tools: the Population Stability Index (PSI) and Cramér’s V.

2. Comparing Distributions to Assess Representativeness Between Two Populations Using the Population Stability Index (PSI) and Cramér’s V

In practice, the study of representativeness between two datasets consists of comparing the characteristics of the observed variables in both samples. This comparison relies on both statistical measures and visual tools.

From a statistical perspective, analysts often examine measures of central tendency (mean, median) and dispersion (variance, standard deviation), as well as more granular indicators such as quantiles.

On the visual side, common tools include histograms, boxplots, cumulative distribution functions, density curves, and QQ-plots. These visualizations help detect potential differences in shape, location, or dispersion between two distributions.

Such graphical analyses provide an essential first step: they guide the investigation and help formulate hypotheses. However, they must be complemented by statistical tests to confirm observations and reach rigorous conclusions. These tests include:

  • Parametric tests, such as Student’s t-test (comparison of means) and Welch’s test (comparison of means under unequal variances),
  • Nonparametric tests, such as the Kolmogorov–Smirnov test (comparison of distributions) and the chi-squared test (for categorical variables).

These approaches are well presented in the article by @matteo-courthoud. Beyond them, two indicators are particularly relevant in credit risk analysis for assessing distributional drift between populations and supporting decision-making: the Population Stability Index (PSI) and Cramér’s V.

2.1. The Population Stability Index (PSI)

The PSI is a fundamental tool in the credit industry. It measures the difference between two distributions of the same variable:

  • for example, between the training dataset and a more recent application dataset,
  • or between a reference dataset at time T0 and another at time T1.

In other words, the PSI quantifies how much a population has drifted over time or across different scopes.

Here is how it works in practice:

  • For a categorical variable, we compute the proportion of observations in each category for both datasets.
  • For a continuous variable, we first discretize it into bins. In practice, deciles are often used to obtain a balanced distribution.

The PSI then compares, bin by bin, the proportions observed in the reference dataset versus the target dataset. The final indicator aggregates these differences using a logarithmic formula:

$$\mathrm{PSI} = \sum_{i=1}^{K} \left(p_i^{\text{ref}} - p_i^{\text{tgt}}\right)\,\ln\!\left(\frac{p_i^{\text{ref}}}{p_i^{\text{tgt}}}\right)$$

Here, $p_i^{\text{ref}}$ and $p_i^{\text{tgt}}$ represent the proportions in bin $i$ for the reference dataset and the target dataset, respectively. The PSI can easily be computed in an Excel file:

Computation Framework for the Population Stability Index (PSI).
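For illustration, here is a hypothetical worked example of such a sheet (the proportions are made up; five quintile segments, 20% each in the reference):

Segment | % Reference | % Target | PSI by segment
Q1 | 20% | 22% | 0.0019
Q2 | 20% | 19% | 0.0005
Q3 | 20% | 20% | 0.0000
Q4 | 20% | 18% | 0.0021
Q5 | 20% | 21% | 0.0005
Global PSI | | | 0.0050

With a global PSI of roughly 0.005, far below the 0.1 threshold discussed below, these two distributions would be considered very close.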

The interpretation is extremely intuitive:

  • A smaller PSI means the two distributions are closer.
  • A PSI of 0 means the distributions are identical.
  • A very large PSI (tending toward infinity) means the two distributions are fundamentally different.

In practice, industry guidelines often use the following thresholds:

  • PSI < 0.1: the population is stable,
  • 0.1 ≤ PSI < 0.25: the shift is noticeable—monitor closely,
  • PSI ≥ 0.25: the shift is critical—the model may not be reliable.
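To make these thresholds concrete, here is a minimal Python sketch of the PSI computation; the decile proportions below are made up, and a fuller implementation appears in Section 3:

import numpy as np

def psi(p_ref, p_tgt, eps=1e-12):
    """Population Stability Index from two vectors of bin proportions."""
    p_ref = np.clip(np.asarray(p_ref, dtype=float), eps, 1.0)
    p_tgt = np.clip(np.asarray(p_tgt, dtype=float), eps, 1.0)
    return float(np.sum((p_ref - p_tgt) * np.log(p_ref / p_tgt)))

# Hypothetical proportions per decile for a reference and a target sample
p_ref = np.full(10, 0.10)
p_tgt = np.array([0.12, 0.11, 0.10, 0.09, 0.08, 0.10, 0.10, 0.10, 0.10, 0.10])
print(psi(p_ref, p_tgt))  # about 0.01 -> the population is stable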

2.2. Cramér’s V

When assessing the representativeness of a categorical variable (or a discretized continuous variable) between two datasets, a natural starting point is the Chi-square test of independence.

We construct a contingency table crossing:

  • the categories (modalities) of the variable of interest, and
  • an indicator variable for dataset membership (Dataset 1 / Dataset 2).

The test is based on the following statistic:

$$\chi^2 = \sum_{i}\sum_{j}\frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}$$

where $O_{ij}$ are the observed counts and $E_{ij}$ are the expected counts under the assumption of independence.

  • Null hypothesis H0: the variable has the same distribution in both datasets (independence).
  • Alternative hypothesis H1: the distributions differ.

If H0 is rejected, we conclude that the variable does not follow the same distribution across the two datasets.

However, the Chi-square test has a major limitation: it only provides a binary answer (reject / do not reject), and its power is highly sensitive to sample size. With very large datasets, even tiny differences can appear statistically significant.

To address this limitation, we use Cramér’s V, which rescales the Chi-square statistic to produce a normalized measure of association bounded between 0 and 1:

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\, c-1)}}$$

where $n$ is the total sample size, $r$ is the number of rows, and $c$ is the number of columns in the contingency table.

The interpretation is intuitive:

  • V ≈ 0 ⇒ the distributions are very similar; representativeness is strong.
  • V → 1 ⇒ the difference between distributions is large; the datasets are structurally different.

Unlike the Chi-square test, which simply answers “yes” or “no,” Cramér’s V provides a graded measure of the strength of the difference. This allows us to assess whether the difference is negligible, moderate, or substantial.

We use the same thresholds as those applied for the PSI to draw our conclusions. For both indicators, if the distribution of one or more variables differs significantly between the two datasets, we conclude that they are not representative.
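As a minimal sketch of the computation (the contingency table below is hypothetical), Cramér’s V can be obtained directly from SciPy’s chi-square test:

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(contingency):
    """Cramér's V from a contingency table of counts (rows = datasets, columns = categories)."""
    chi2, _, _, _ = chi2_contingency(contingency, correction=False)
    n = contingency.sum()
    r, c = contingency.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

# Hypothetical 2 x 4 table: category counts in Dataset 1 vs. Dataset 2
table = np.array([[500, 480, 510, 490],
                  [120, 130, 115, 135]])
print(cramers_v(table))  # close to 0 -> similar structures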

3. Measuring Representativeness with PSI and Cramér’s V in Python

In a previous article, we applied different variable selection methods to reduce the Communities and Crime dataset to just 16 explanatory variables. This step was essential to simplify the model while keeping the most relevant information.
This dataset also includes a variable called fold, which splits the data into 10 subsamples. These folds are commonly used in cross-validation: they allow us to test the robustness of a model by training it on one part of the data and validating it on another. For cross-validation to be reliable, each fold should be representative of the global dataset:

  1. To ensure valid performance estimates.
  2. To prevent bias: a non-representative fold can distort model results.
  3. To support generalization: representative folds give a better indication of how the model will perform on new data.

In this example, we will focus on checking whether fold 1 is representative of the global dataset, using our two indicators (PSI and Cramér’s V) to compare the distribution of the 16 variables across the two samples. We will proceed in two steps:

Step 1: Start with the Target Variable

We start with the target variable. The idea is simple: compare its distribution between fold 1 and the complete dataset. To quantify this difference, we will use two complementary indicators:

  • the Population Stability Index (PSI), which measures distributional shifts,
  • Cramér’s V, which measures the strength of association between two categorical variables.

Step 2: Automating the Analysis for All Variables

After illustrating the approach with the target variable, we extend it to all features. We will build a Python function that computes PSI and Cramér’s V for each of the 16 explanatory variables, as well as for the target variable.

To make the results easy to interpret, we will export everything into an Excel file with:

  • one sheet per variable, showing the detailed comparison by segment,
  • a summary tab, aggregating results across all variables.

3.1 Comparing the target variable ViolentCrimesPerPop between the global dataset (reference) and fold 1 (target)

Before applying statistical tests or building decision indicators, it is essential to conduct a descriptive and graphical analysis. These are not just formalities; they provide early intuition about the differences between populations and help interpret the results. In practice, a well-chosen chart often reveals the conclusions that indicators like PSI or Cramér’s V will later confirm (or challenge).

For visualization, we proceed in three steps:

1. Comparing continuous distributions. We start with graphical tools such as boxplots, cumulative distribution functions, and probability density plots. These visualizations provide an intuitive way to examine differences in the target variable’s distribution between the two datasets.

2. Discretization into quantiles. Next, we discretize the variable in the reference dataset using quintile cut-off points, which creates five classes (Q1 through Q5). We then apply the exact same cut-off points to the target dataset, ensuring that each observation is mapped to intervals defined from the reference. This guarantees comparability between the two distributions.

3. Comparing categorical distributions. Finally, once the variable has been discretized, we can use visualization methods suited to categorical data, such as bar charts, to compare how frequencies are distributed across the two datasets.

The method depends on the type of variable:

For a continuous variable:

  • Start with standard visualizations (boxplots, cumulative distributions, and density plots).
  • Next, split the variable into segments (Q1 to Q5) based on the reference dataset’s quantiles.
  • Finally, treat these segments as categories and compare their distributions.

For a categorical variable:

  • No discretization is required — it’s already in categorical form.
  • Go straight to comparing category distributions, for instance with a bar chart.
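Before the full helper function shown later, here is the core of the discretization step as a minimal sketch; ref and tgt are stand-ins for the reference and target series of a numeric variable:

import numpy as np
import pandas as pd

# Stand-in data; in the article these would be data_ref / data_target columns
ref = pd.Series(np.random.rand(1000))
tgt = pd.Series(np.random.rand(200))

edges = ref.quantile([0, .2, .4, .6, .8, 1]).values  # quintile cut points from the reference only
edges[0], edges[-1] = -np.inf, np.inf                # widen the outer bins to cover any target value
labels = [f"Q{i}" for i in range(1, 6)]
ref_bins = pd.cut(ref, bins=edges, labels=labels)
tgt_bins = pd.cut(tgt, bins=edges, labels=labels)    # same edges -> directly comparable segments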

The code below prepares the two datasets we want to compare and then visualizes the target variable with a boxplot, showing its distribution in both the global dataset and in fold 1.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, ks_2samp

data = pd.read_csv("communities_data.csv")

# Reference = global dataset; target = fold 1 only
data_ref = data
data_target = data[data["fold"] == 1]

# Compare the distribution of "ViolentCrimesPerPop" in the reference and target datasets with boxplots



# Construct datasets with a "Group" column
df_ref = pd.DataFrame({
    "ViolentCrimesPerPop": data_ref["ViolentCrimesPerPop"],
    "Group": "Reference"
})

df_target = pd.DataFrame({
    "ViolentCrimesPerPop": data_target["ViolentCrimesPerPop"],
    "Group": "Goal"
})

# Merge them
df_all = pd.concat([df_ref, df_target])


plt.figure(figsize=(8, 6))

# Boxplot comparing the two distributions side by side
sns.boxplot(
    x="Group", 
    y="ViolentCrimesPerPop", 
    data=df_all,
    palette="Set2",
    width=0.6,
    fliersize=3
)


# Add mean points
means = df_all.groupby("Group")["ViolentCrimesPerPop"].mean()
for i, m in enumerate(means):
    plt.scatter(i, m, color="red", marker="D", s=50, zorder=3, label="Mean" if i == 0 else "")

# Title tells the story
plt.title("Violent Crimes Per Population by Group", fontsize=14, weight="daring")
plt.suptitle("Each groups show nearly an identical distributions", 
             fontsize=10, color="gray")

plt.ylabel("Violent Crimes (Per Pop)", fontsize=12)
plt.xlabel("")

# Cleaner look
sns.despine()
plt.grid(axis="y", linestyle="--", alpha=0.5, visible=False)
plt.legend()

plt.show()



The figure above suggests that both groups share similar distributions of the ViolentCrimesPerPop variable. To take a closer look, we can use Kernel Density Estimation (KDE) plots, which provide a smooth view of the underlying distribution and make it easier to spot subtle differences.

plt.figure(figsize=(8, 6))

# KDE plots with better styling
sns.kdeplot(
    data=df_all,
    x="ViolentCrimesPerPop",
    hue="Group",
    fill=True,         # use shading for overlap
    alpha=0.4,         # transparency to show overlap
    common_norm=False,
    palette="Set2",
    linewidth=2
)

# KS-test for distribution difference
g1 = df_all[df_all["Group"] == df_all["Group"].unique()[0]]["ViolentCrimesPerPop"]
g2 = df_all[df_all["Group"] == df_all["Group"].unique()[1]]["ViolentCrimesPerPop"]
stat, pval = ks_2samp(g1, g2)

# Add annotation
plt.text(df_all["ViolentCrimesPerPop"].mean(),
         plt.ylim()[1]*0.9,
         f"KS-test p-value = {pval:.3f}nNo significant difference observed",
         ha="center", fontsize=10, color="black")

# Titles with story
plt.title("Kernel Density Estimation of Violent Crimes Per Population", fontsize=14, weight="daring")
plt.suptitle("Distributions overlap almost completely between groups", fontsize=10, color="gray")

plt.xlabel("Violent Crimes (Per Pop)")
plt.ylabel("Density")

sns.despine()
plt.grid(False)
plt.show()

The KDE plot confirms that the two distributions are very similar, showing a high degree of overlap. The Kolmogorov-Smirnov (KS) test p-value of 0.976 also indicates that there is no significant difference between the two groups. To extend the analysis, we can now examine the cumulative distribution of the target variable.

# Cumulative distribution
plt.figure(figsize=(9, 6))
sns.histplot(
    data=df_all,
    x="ViolentCrimesPerPop",
    hue="Group",
    stat="density",
    common_norm=False,
    fill=False,
    element="step",
    bins=len(df_all),
    cumulative=True,
)

# Titles tell the story
plt.title("Cumulative Distribution of Violent Crimes Per Population", fontsize=14, weight="daring")
plt.suptitle("ECDFs overlap extensively; central tendencies are nearly an identical", fontsize=10)

# Labels & cleanup
plt.xlabel("Violent Crimes (Per Pop)")
plt.ylabel("Cumulative proportion")
plt.grid(visible=False)
plt.show()

The cumulative distribution plot provides additional evidence that the two groups are very similar. The curves overlap almost completely, suggesting that their distributions are nearly identical in both central tendency and spread.

As a next step, we will discretize the variable into quantiles in the reference dataset and then apply the same cut-off points to the target dataset (fold 1). The code below demonstrates how to do this. Finally, we will compare the resulting distributions using a bar chart.

def bin_numeric(ref, tgt, n_bins=5):
    """
    Discretize a numeric variable into quantile bins (ex: quintiles).
    - Quantile thresholds are computed only on the reference dataset.
    - Extend bins with -inf and +inf to cover all possible values.
    - Returns:
        * ref binned
        * tgt binned
        * bin labels (Q1, Q2, ...)
    """
    edges = np.unique(ref.dropna().quantile(np.linspace(0, 1, n_bins + 1)).values)
    if len(edges) < 3:  # if variable is nearly constant
        edges = np.array([-np.inf, np.inf])
    else:
        edges[0], edges[-1] = -np.inf, np.inf
    labels = [f"Q{i}" for i in range(1, len(edges))]
    return (
        pd.cut(ref, bins=edges, labels=labels, include_lowest=True),
        pd.cut(tgt, bins=edges, labels=labels, include_lowest=True),
        labels
    )

# Apply binning
ref_binned, tgt_binned, bin_labels = bin_numeric(data_ref["ViolentCrimesPerPop"], data_target["ViolentCrimesPerPop"], n_bins=5)




# Counts per segment for Reference and Target
ref_counts = ref_binned.value_counts().reindex(bin_labels, fill_value=0)
tgt_counts = tgt_binned.value_counts().reindex(bin_labels, fill_value=0)

# Convert counts to proportions
ref_props = ref_counts / ref_counts.sum()
tgt_props = tgt_counts / tgt_counts.sum()

# Build a DataFrame for seaborn
df_props = pd.DataFrame({
    "Segment": bin_labels,
    "Reference": ref_props.values,
    "Target": tgt_props.values
})

# Reshape to long format
df_long = df_props.melt(id_vars="Segment",
                        value_vars=["Reference", "Target"],
                        var_name="Source",
                        value_name="Proportion")

# Clean style
sns.set_theme(style="whitegrid")

# Bar plot of proportions by segment
plt.figure(figsize=(8,6))
sns.barplot(
    x="Segment", y="Proportion", hue="Source",
    data=df_long, palette=["#4C72B0", "#55A868"]  # muted blue & green
)

# Titles tell the story
plt.title("Proportion Comparison by Segment (ViolentCrimesPerPop)", fontsize=14, weight="bold")
plt.suptitle("Across all quantile segments (Q1–Q5), proportions are nearly identical", fontsize=10, color="gray")

plt.xlabel("Quantile Segment (Q1 - Q5)")
plt.ylabel("Proportion")
plt.legend(title="Dataset", loc="upper right")
plt.grid(False)
plt.show()

As before, we reach the same conclusion: the distributions in the reference and target datasets are very similar. To move beyond visual inspection, we will now compute the Population Stability Index (PSI) and Cramér’s V. These metrics allow us to quantify the differences between distributions, both for all variables in general and for the target variable ViolentCrimesPerPop specifically.

3.2 Automating the Analysis for All Variables

As mentioned earlier, the results of the distribution comparisons for each variable between the two datasets, computed with PSI and Cramér’s V, are presented in separate sheets within a single Excel file.

For example, we start by examining the results for the target variable when comparing the global dataset (reference) with fold 1 (target). Table 1 below summarizes how both PSI and Cramér’s V are computed.

Table 1: PSI and Cramér’s V for ViolentCrimesPerPop: Global Dataset (Reference) vs. Fold 1 (Target)

Since both PSI and Cramér’s V are below 0.1, we can conclude that the target variable follows the same distribution in both datasets.

The code that generated this table is shown below. The same code is also used to produce results for all variables and export them into an Excel file called representativity.xlsx.

EPS = 1e-12  # A really small constant to avoid division by zero or log(0)

# ============================================================
# 1. Basic functions
# ============================================================

def safe_proportions(counts):
    """
    Convert raw counts into proportions in a safe way.
    - If the total count = 0, return all zeros (to avoid division by zero).
    - Clip values so no proportion is exactly 0 or 1 (numerical stability).
    """
    total = counts.sum()
    if total == 0:
        return np.zeros_like(counts, dtype=float)
    p = counts / total
    return np.clip(p, EPS, 1.0)

def calculate_psi(p_ref, p_tgt):
    """
    Compute the Population Stability Index (PSI) between two distributions.

    PSI = sum( (p_ref - p_tgt) * log(p_ref / p_tgt) )

    Interpretation:
    - PSI < 0.1  → stable
    - 0.1–0.25   → moderate shift
    - > 0.25     → major shift
    """
    p_ref = np.clip(p_ref, EPS, 1.0)
    p_tgt = np.clip(p_tgt, EPS, 1.0)
    return float(np.sum((p_ref - p_tgt) * np.log(p_ref / p_tgt)))

def calculate_cramers_v(contingency):
    """
    Compute Cramér's V statistic for association between two categorical variables.
    - Input: a 2 x K contingency table (counts).
    - Uses Chi² test.
    - Normalizes the result to [0, 1].
      * 0   → no association
      * 1   → perfect association
    """
    chi2, _, _, _ = chi2_contingency(contingency, correction=False)
    n = contingency.sum()
    r, c = contingency.shape
    if n == 0 or min(r - 1, c - 1) == 0:
        return 0.0
    return np.sqrt(chi2 / (n * (min(r - 1, c - 1))))

# ============================================================
# 2. Preparing variables
# ============================================================

def bin_numeric(ref, tgt, n_bins=5):
    """
    Discretize a numeric variable into quantile bins (ex: quintiles).
    - Quantile thresholds are computed only on the reference dataset.
    - Extend bins with -inf and +inf to cover all possible values.
    - Returns:
        * ref binned
        * tgt binned
        * bin labels (Q1, Q2, ...)
    """
    edges = np.unique(ref.dropna().quantile(np.linspace(0, 1, n_bins + 1)).values)
    if len(edges) < 3:  # if variable is nearly constant
        edges = np.array([-np.inf, np.inf])
    else:
        edges[0], edges[-1] = -np.inf, np.inf
    labels = [f"Q{i}" for i in range(1, len(edges))]
    return (
        pd.cut(ref, bins=edges, labels=labels, include_lowest=True),
        pd.cut(tgt, bins=edges, labels=labels, include_lowest=True),
        labels
    )

def prepare_counts(ref, tgt, n_bins=5):
    """
    Prepare frequency counts for one variable.
    - If numeric: discretize into quantile bins.
    - If categorical: take all categories present in either dataset.
    Returns:
      segments, counts in reference, counts in target
    """
    if pd.api.types.is_numeric_dtype(ref) and pd.api.types.is_numeric_dtype(tgt):
        ref_b, tgt_b, labels = bin_numeric(ref, tgt, n_bins)
        segments = labels
    else:
        segments = sorted(set(ref.dropna().unique()) | set(tgt.dropna().unique()))
        ref_b, tgt_b = ref.astype(str), tgt.astype(str)

    ref_counts = ref_b.value_counts().reindex(segments, fill_value=0)
    tgt_counts = tgt_b.value_counts().reindex(segments, fill_value=0)
    return segments, ref_counts, tgt_counts

# ============================================================
# 3. Evaluation per variable
# ============================================================

def analyze_variable(ref, tgt, n_bins=5):
    """
    Analyze a single variable between two datasets.
    Steps:
    - Construct counts by segment (bin for numeric, category for categorical).
    - Compute PSI by segment and Global PSI.
    - Compute Cramér's V from the contingency table.
    - Return:
        DataFrame with details
        Summary dictionary (psi, v_cramer)
    """
    segments, ref_counts, tgt_counts = prepare_counts(ref, tgt, n_bins)
    p_ref, p_tgt = safe_proportions(ref_counts.values), safe_proportions(tgt_counts.values)

    # PSI
    psi_global = calculate_psi(p_ref, p_tgt)
    psi_by_segment = (p_ref - p_tgt) * np.log(p_ref / p_tgt)

    # Cramér's V
    contingency = np.vstack([ref_counts.values, tgt_counts.values])
    v_cramer = calculate_cramers_v(contingency)

    # Construct detailed results table
    df = pd.DataFrame({
        "Segment": segments,
        "Count Reference": ref_counts.values,
        "Count Goal": tgt_counts.values,
        "Percent Reference": p_ref,
        "Percent Goal": p_tgt,
        "PSI by Segment": psi_by_segment
    })

    # Add summary rows at the bottom of the table
    df.loc[len(df)] = ["Global PSI", np.nan, np.nan, np.nan, np.nan, psi_global]
    df.loc[len(df)] = ["Cramer's V", np.nan, np.nan, np.nan, np.nan, v_cramer]

    return df, {"psi": psi_global, "v_cramer": v_cramer}

# ============================================================
# 4. Excel reporting utilities
# ============================================================

def apply_traffic_light(ws, wb, first_row, last_row, col, low, high):
    """
    Apply conditional formatting (traffic light colours) to a numeric column in Excel:
    - green  if value < low
    - orange if low <= value <= high
    - red    if value > high

    Note: first_row, last_row, and col are zero-based indices (xlsxwriter convention).
    """
    green  = wb.add_format({"bg_color": "#C6EFCE", "font_color": "#006100"})
    orange = wb.add_format({"bg_color": "#FCD5B4", "font_color": "#974706"})
    red    = wb.add_format({"bg_color": "#FFC7CE", "font_color": "#9C0006"})

    if last_row < first_row:
        return  # nothing to color

    ws.conditional_format(first_row, col, last_row, col,
        {"type": "cell", "criteria": "<", "value": low, "format": green})
    ws.conditional_format(first_row, col, last_row, col,
        {"type": "cell", "criteria": "between", "minimum": low, "maximum": high, "format": orange})
    ws.conditional_format(first_row, col, last_row, col,
        {"type": "cell", "criteria": ">", "value": high, "format": red})

def representativity_report(ref_df, tgt_df, variables, output="representativity.xlsx",
                            n_bins=5, psi_thresholds=(0.10, 0.25),
                            v_thresholds=(0.10, 0.25), color_summary=True):
    """
    Construct a representativity report across multiple variables and export to Excel.

    For every variable:
      - Create a sheet with detailed PSI by segment, Global PSI, and Cramer's V.
      - Apply traffic light colours for easier interpretation.

    Create one "Résumé" sheet with overall Global PSI and Cramer's V for all variables.
    """
    summary = []

    with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
        wb = writer.book
        fmt_header = wb.add_format({"bold": True, "bg_color": "#0070C0",
                                    "font_color": "white", "align": "center"})
        fmt_pct   = wb.add_format({"num_format": "0.00%"})
        fmt_ratio = wb.add_format({"num_format": "0.000"})
        fmt_int   = wb.add_format({"num_format": "0"})

        for var in variables:
            # Analyze variable
            df, meta = analyze_variable(ref_df[var], tgt_df[var], n_bins)
            sheet = var[:31]  # Excel sheet names are limited to 31 characters
            df.to_excel(writer, sheet_name=sheet, index=False)
            ws = writer.sheets[sheet]

            # Format headers and columns
            for j, col in enumerate(df.columns):
                ws.write(0, j, col, fmt_header)
            ws.set_column(0, 0, 18)
            ws.set_column(1, 2, 16, fmt_int)
            ws.set_column(3, 4, 20, fmt_pct)
            ws.set_column(5, 5, 18, fmt_ratio)

            nrows = len(df)   # number of data rows (excluding header)
            col_psi = 5       # "PSI by Segment" column index

            # PSI by Segment rows
            apply_traffic_light(ws, wb, first_row=1, last_row=max(1, nrows-2),
                                col=col_psi, low=psi_thresholds[0], high=psi_thresholds[1])

            # Global PSI row (second to last)
            apply_traffic_light(ws, wb, first_row=nrows-1, last_row=nrows-1,
                                col=col_psi, low=psi_thresholds[0], high=psi_thresholds[1])

            # Cramer's V row (last row) 
            apply_traffic_light(ws, wb, first_row=nrows, last_row=nrows,
                                col=col_psi, low=v_thresholds[0], high=v_thresholds[1])

            # Add summary info for Résumé sheet
            summary.append({"Variable": var,
                            "Global PSI": meta["psi"],
                            "Cramer's V": meta["v_cramer"]})

        # Résumé sheet
        df_sum = pd.DataFrame(summary)
        df_sum.to_excel(writer, sheet_name="Résumé", index=False)
        ws = writer.sheets["Résumé"]
        for j, col in enumerate(df_sum.columns):
            ws.write(0, j, col, fmt_header)
        ws.set_column(0, 0, 28)
        ws.set_column(1, 2, 16, fmt_ratio)

        # Apply traffic light to summary sheet
        if color_summary and len(df_sum) > 0:
            last = len(df_sum)
            # PSI column
            apply_traffic_light(ws, wb, 1, last, 1, psi_thresholds[0], psi_thresholds[1])
            # Cramer's V column
            apply_traffic_light(ws, wb, 1, last, 2, v_thresholds[0], v_thresholds[1])

    return output

# ============================================================
# Example
# ============================================================

if __name__ == "__main__":
    # Column names, excluding "fold"
    columns = [x for x in data.columns if x != "fold"]

    # Generate the report
    path = representativity_report(data_ref, data_target, columns, output="representativity.xlsx")
    print(f" Report generated: {path}")

Finally, Table 2 shows the last sheet of the file, titled "Résumé", which brings together the results for all variables of interest.

Table 2: PSI and Cramér’s V summary for all variables: Global Dataset vs. Fold 1

This synthesis provides an overall view of representativeness between the two datasets, making interpretation and decision-making much easier. Since both PSI and Cramér’s V are below 0.1 for every variable, we can conclude that all variables follow the same distribution in the global dataset and in fold 1. Therefore, fold 1 can be considered representative of the global dataset.
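If you prefer to consume these results programmatically rather than opening Excel, a short sketch (assuming the representativity.xlsx file generated above) can flag any variables that breach the thresholds:

import pandas as pd

summary = pd.read_excel("representativity.xlsx", sheet_name="Résumé")
flagged = summary[(summary["Global PSI"] >= 0.10) | (summary["Cramer's V"] >= 0.10)]
if flagged.empty:
    print("All variables are representative (PSI and Cramér's V < 0.1).")
else:
    print(flagged)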

Conclusion

In this post, we explored how to study representativeness between two datasets by comparing the distributions of their variables. We introduced two key indicators, the Population Stability Index (PSI) and Cramér’s V, which are both easy to use, easy to interpret, and highly valuable for decision-making.

We also showed how these analyses can be automated, with results exported directly into an Excel file.

The main takeaway is this: if you build a model and end up with overfitting, one possible reason may be that your training and test sets are not representative of each other. A simple way to prevent this is to always run a representativeness analysis between datasets. Variables that show representativeness issues can then guide you in stratifying your data when splitting it into training and test sets. What about you? In what situations do you study representativeness between two datasets, for what reasons, and using what methods?

References

Yurdakul, B. (2018). Statistical Properties of the Population Stability Index (Doctoral dissertation). Western Michigan University.

Redmond, M. (2002). Communities and Crime [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53W3X.

Data & Licensing

The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.

For more details, see the official license text: CC BY 4.0.
