going through a deep transformation driven by technological progress. These changes affect all sectors, especially the banking industry. Data professionals must adapt quickly to become more efficient, productive, and competitive.
For experienced professionals with strong foundations in mathematics, statistics, and operational practice, this transition may be natural. However, it can be harder for beginners who haven’t yet fully mastered these fundamental skills.
In the field of credit risk, developing these skills requires a clear understanding of bank exposures and the mechanisms used to manage the associated risks.
My next articles will focus mainly on credit risk management within a regulatory framework. The European Central Bank (ECB) allows banks to use internal models to assess the credit risk of their different exposures. These exposures may include loans granted to corporations to finance long-term projects or loans granted to households to finance real estate projects.
These models aim to estimate several key parameters:
- PD (Probability of Default): the probability that a borrower will be unable to meet its payment obligations.
- EAD (Exposure at Default): the exposure amount at the time of default.
- LGD (Loss Given Default): the severity of the loss in the event of default.
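Taken together, these three parameters determine the expected loss on an exposure: EL = PD × EAD × LGD. A minimal sketch with purely illustrative figures (not outputs of a real internal model):

```python
# Expected loss for a single exposure: EL = PD * EAD * LGD.
# All figures below are illustrative, not outputs of a real internal model.
prob_default = 0.02   # PD: probability of default over one year
ead = 1_000_000       # EAD: exposure at default, in euros
lgd = 0.45            # LGD: share of the exposure lost if default occurs

expected_loss = prob_default * ead * lgd
print(expected_loss)  # 9000.0
```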
We can therefore distinguish between PD models, EAD models, and LGD models. In this series, I will mainly focus on PD models. These models are used to assign ratings to borrowers and to contribute to the calculation of regulatory capital requirements, which protect banks against unexpected losses.
In this first article, I will focus on defining and constructing the modeling scope.
Definition of default
Building a model requires a clear understanding of the modeling objective and a precise definition of default. Assessing the probability of default of a counterparty involves observing the transition from a healthy state to a state of default over a given horizon h. In what follows, we will assume that this horizon is set to one year (h = 1).
The definition of default was harmonized and brought under regulatory supervision following the 2008 financial crisis. The objective was to establish a standardized definition applicable to all banking institutions.
This definition relies on several criteria, including:
- a significant deterioration in the counterparty’s financial situation,
- the existence of past-due amounts,
- situations of forbearance,
- contagion effects within a group of exposures.
Historically, there was the old definition of default (ODOD), which gradually evolved into the new definition of default (NDOD) that is currently in place.
For example, a counterparty is considered in default when the debtor is more than 90 days past due on a material credit obligation.
Once the definition of default has been clearly established, the institution can apply it to all of its clients. It may then face a potentially heterogeneous portfolio composed of large corporations, small and medium-sized enterprises (SMEs), retail clients, and sovereign entities.
To manage risk more effectively, it is important to identify these different categories and create homogeneous sub-portfolios. This segmentation then allows each portfolio to be modeled in a more relevant and accurate way.
Definition of filters
Defining filters makes it possible to determine the modeling scope and retain only homogeneous counterparties for analysis. Filters are variables used to delimit this scope.
These variables can be identified through statistical methods, such as clustering techniques, or defined by subject matter experts based on business knowledge.
For example, when focusing on large corporations, revenue can serve as a relevant size variable for setting a threshold. One may choose to include only counterparties with annual revenue above €30 million.
Additional variables can then be used to further characterize this segment, such as industry sector, geographic region, financial ratios, or ESG indicators.
Another modeling scope may focus exclusively on retail clients who have taken out loans to finance personal projects. In this case, income can be used as a filtering variable, while other relevant characteristics may include employment status, type of collateral, and loan type.
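As a sketch of how such filters translate into code, the snippet below applies a segment and revenue threshold to a small hypothetical portfolio (all column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical portfolio snapshot; column names and values are invented
# purely for illustration.
portfolio = pd.DataFrame({
    "counterparty_id": [1, 2, 3, 4],
    "segment": ["corporate", "corporate", "retail", "corporate"],
    "annual_revenue_eur": [45e6, 12e6, None, 80e6],
})

# Filter: keep large corporates with annual revenue above EUR 30 million.
large_corporates = portfolio[
    (portfolio["segment"] == "corporate")
    & (portfolio["annual_revenue_eur"] > 30e6)
]
print(large_corporates["counterparty_id"].tolist())  # [1, 4]
```

Note that the comparison silently drops rows with missing revenue, which is why documenting each filter variable and its source matters in practice.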
Once the objective is clearly defined, the definition of default is well specified, and the scope has been properly structured through appropriate filters, constructing the modeling dataset becomes a natural next step.
Construction of the Modeling Dataset
Since the objective is to predict the probability of default over a one-year horizon, for each year N we must retain all healthy counterparties, meaning those that did not default at any time during year N (from 01/01/N to 12/31/N).
On December 31, N, the characteristics of these healthy counterparties are observed and recorded. For example, if we focus on corporate entities, then as of 12/31/N, the values of the following variables are collected for each counterparty: turnover, industry sector, and financial ratios.
To construct the default variable for each of these counterparties, we then look at year N+1. The variable takes the value 1 if the counterparty defaults at least once during year N+1, and 0 otherwise.
This variable, denoted Y or def, is the target variable of the model. The chart below illustrates the process described above.
In summary, for each fixed year N, we obtain a rectangular dataset where:
- Each row corresponds to a counterparty that was healthy as of 12/31/N,
- The columns include all explanatory variables measured at that date, denoted (Xi) for counterparty (i),
- The final column corresponds to the target variable (Yi), which indicates whether counterparty (i) defaults at least once during year N+1 (1) or not (0).
For example, if N = 2015, the explanatory variables are measured as of 12/31/2015, and the target variable is observed over the year 2016.
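The construction of the target variable can be sketched in pandas as follows, using a hypothetical snapshot taken as of 12/31/N and an invented set of counterparties observed in default during year N+1:

```python
import pandas as pd

# Hypothetical snapshot of healthy counterparties as of 12/31/N; the
# identifiers, values, and the set of defaulters are invented.
snapshot = pd.DataFrame({
    "counterparty_id": [101, 102, 103],
    "turnover_eur": [40e6, 55e6, 90e6],
    "industry_sector": ["energy", "retail_trade", "construction"],
})

# Counterparties that default at least once during year N+1.
defaulted_in_n_plus_1 = {102}

# Target variable Y: 1 if the counterparty defaults during year N+1, else 0.
snapshot["Y"] = snapshot["counterparty_id"].isin(defaulted_in_n_plus_1).astype(int)
print(snapshot["Y"].tolist())  # [0, 1, 0]
```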
The regulator requires modeling datasets to be built using at least five years of historical data in order to capture different economic cycles. Since the models are calibrated over multiple periods, the regulator also requires regulatory models to be Through-the-Cycle (TTC), meaning they should be relatively insensitive to short-term macroeconomic fluctuations.
Suppose we have client data covering six years, from 01/01/2015 to 12/31/2020. By applying the procedure described above for each year N between 2015 and 2019, five successive datasets can be constructed.
The first dataset, corresponding to the year 2015, includes all counterparties that remained performing from 01/01/2015 to 12/31/2015. Their explanatory variables (X1, …, Xk) are measured as of 12/31/2015, while the default variable (Y) is observed over the year 2016. It takes the value 1 if the counterparty defaults at least once during 2016, and 0 otherwise.
The same process is repeated for the following years, up to the 2019 dataset. This final dataset includes all counterparties that remained performing from 01/01/2019 to 12/31/2019. Their explanatory variables (X1, …, Xk) are measured as of 12/31/2019, and the default variable (Y) is observed in 2020. It takes the value 1 if the counterparty defaults at any point during 2020, and 0 otherwise.
The final modeling scope corresponds to the vertical concatenation of all datasets constructed as of 12/31/N. In our example, N ranges from 2015 to 2019. The resulting dataset can be illustrated by the rectangular table below.

Each statistical observation is identified by a pair consisting of the counterparty identifier and the year (ID x Year) in which the explanatory variables are measured (as of 12/31/N). The number of rows corresponds to the number of observations.
For example, the counterparty with identifier (ID = 1) may appear in both 2015 and 2018. These correspond to two distinct and independent observations in the dataset, identified respectively by the pairs (1 x 2015) and (1 x 2018).
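Under these conventions, the final dataset can be sketched as a vertical concatenation of the yearly tables, each tagged with its observation year (all names and values below are illustrative):

```python
import pandas as pd

# Build one hypothetical dataset per observation year (N = 2015 ... 2019),
# then stack them vertically; all values are invented for illustration.
yearly_frames = []
for year in range(2015, 2020):
    df = pd.DataFrame({
        "counterparty_id": [1, 2],
        "turnover_eur": [35e6, 60e6],
        "Y": [0, 0],
    })
    df["obs_year"] = year  # keeps the (ID x Year) pair unique
    yearly_frames.append(df)

modeling_dataset = pd.concat(yearly_frames, ignore_index=True)

# 2 counterparties x 5 years = 10 observations, each uniquely identified
# by the (counterparty_id, obs_year) pair.
print(len(modeling_dataset))  # 10
```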
This approach offers several advantages. In particular, it prevents temporal overlap among obligors and reduces implicit autocorrelation between observations, since each record is uniquely identified by the (ID x Year) pair.
In addition, it increases the likelihood of building a more robust and representative dataset. By pooling observations across multiple years, the number of default events becomes sufficiently large to support reliable model estimation. This is particularly important when analyzing portfolios of large corporations, where default events are often relatively rare.
Finally, the financial institution must implement appropriate organizational measures to ensure effective data management and security throughout the entire data lifecycle. To this end, the ECB requires financial entities to comply with common regulatory standards, such as the Digital Operational Resilience Act (DORA).
Institutions should establish a comprehensive strategic framework for information security management, as well as a dedicated data security framework specifically covering data used in internal models.
Furthermore, human oversight must remain central to these processes. Procedures should therefore be thoroughly documented, and clear guidelines must be established to clarify how and when human judgment should be applied.
Conclusion
Defining the model development and application scopes, as well as properly documenting them, are essential steps in reducing model risk, not only at the design stage, but throughout the entire model lifecycle.
The key objective is to ensure that the development scope is representative of the intended portfolio and, when needed, to clearly identify any extensions, restrictions, or approximations made when applying the model compared to its original design.
Preparing a standardized document that clearly defines the variables used to establish the scope is considered good practice. At a minimum, the following information should be easily identifiable: the technical name of the variable, its format, and its source.
In my next article, I will use a credit risk dataset to illustrate how to predict the probability of default for different counterparties. I will explain the steps required to properly understand the available dataset and, where possible, describe how to handle and process the different variables.
References
European Central Bank. (2025). https://www.bankingsupervision.europa.eu/ecb/pub/pdf/ssm.supervisory_guide202507.en.pdf
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Disclaimer
