Machine Learning Meets Panel Data: What Practitioners Must Know


Authors: Augusto Cerqua, Marco Letta, Gabriele Pinto

Machine learning (ML) has gained a central role in economics, the social sciences, and business decision-making. In the public sector, ML is increasingly used for so-called prediction policy problems: settings where policymakers aim to identify the units most at risk of a negative outcome and intervene proactively, for example by targeting public subsidies, predicting local recessions, or anticipating migration patterns. In the private sector, similar predictive tasks arise when firms seek to forecast customer churn or optimize credit risk assessment. In both domains, better predictions translate into a more efficient allocation of resources and more effective interventions.

To pursue these goals, ML algorithms are increasingly applied to panel data, characterised by repeated observations of the same units over multiple time periods. However, ML models were not originally designed for panel data, which feature distinct cross-sectional and longitudinal dimensions. When ML is applied to panel data, there is a high risk of a subtle but significant issue: data leakage. This happens when information that would be unavailable at prediction time inadvertently enters the model training process, inflating predictive performance. In our paper “On the (Mis)Use of Machine Learning With Panel Data” (Cerqua, Letta, and Pinto, 2025), recently published in the Oxford Bulletin of Economics and Statistics, we offer the first systematic assessment of data leakage in ML with panel data, propose clear guidelines for practitioners, and illustrate the implications through an empirical application with publicly available U.S. county data.

The Leakage Problem

Panel data combine two structures: a temporal dimension (units observed over time) and a cross-sectional dimension (multiple units, such as regions or firms). Standard ML practice, splitting the sample randomly into training and testing sets, implicitly assumes independent and identically distributed (i.i.d.) data. This assumption is violated when default ML procedures (such as a random split) are applied to panel data, creating two primary forms of leakage:

  • Temporal leakage: future information leaks into the model during the training phase, making forecasts look unrealistically accurate. Moreover, past information can end up in the testing set, making ‘forecasts’ retrospective.
  • Cross-sectional leakage: the same or very similar units appear in both training and testing sets, meaning the model has already “seen” much of the cross-sectional dimension of the data.

Figure 1 shows how different splitting strategies affect the risk of leakage. A random split at the unit–time level (Panel A) is the most problematic, as it introduces both temporal and cross-sectional leakage. Alternatives such as splitting by units (Panel B), by groups (Panel C), or by time (Panel D) mitigate one type of leakage but not the other. Consequently, no strategy completely eliminates the issue: the appropriate choice depends on the task at hand (see below), since in some cases one type of leakage may not be a real concern.
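The splitting rules can be made concrete on a toy panel. The sketch below (a minimal illustration, with hypothetical unit and year labels) builds a small balanced panel and applies a random unit–time split, a unit-based split, and a time-based split, then checks which dimensions overlap between training and testing sets.

```python
import pandas as pd

# Toy balanced panel: 6 hypothetical units observed over 2000-2009.
panel = pd.DataFrame(
    [(u, t) for u in range(6) for t in range(2000, 2010)],
    columns=["unit", "year"],
)

# Random split at the unit-time level: leaks in both dimensions.
test_a = panel.sample(frac=0.3, random_state=0)
train_a = panel.drop(test_a.index)

# Split by units: whole units go to either training or testing.
test_units = {4, 5}
train_b = panel[~panel["unit"].isin(test_units)]
test_b = panel[panel["unit"].isin(test_units)]

# Split by time: earlier years train, later years test.
train_d = panel[panel["year"] <= 2006]
test_d = panel[panel["year"] > 2006]

# Under the random split, the same units (and years) appear on both
# sides; under the unit- and time-based splits, one dimension is disjoint.
overlap_a_units = set(train_a["unit"]) & set(test_a["unit"])
overlap_b_units = set(train_b["unit"]) & set(test_b["unit"])
overlap_d_years = set(train_d["year"]) & set(test_d["year"])
```

With the random split, every unit typically ends up on both sides of the split; the unit-based split keeps units disjoint, and the time-based split keeps years disjoint, each removing one form of leakage.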

Figure 1  |  Training and testing sets under different splitting rules

Two Kinds of Prediction Policy Problems

A key insight of the study is that researchers must clearly define their prediction goal ex-ante. We distinguish two broad classes of prediction policy problems:

1. Cross-sectional prediction: The task is to map outcomes across units in the same period. For example, imputing missing data on GDP per capita across regions when only some regions have reliable measurements. The best split here is at the unit level: different units are assigned to the training and testing sets, while all time periods are kept. This eliminates cross-sectional leakage, although temporal leakage remains. But since forecasting is not the goal, this is not a real issue.

2. Sequential forecasting: The goal is to predict future outcomes based on historical data, for example predicting county-level income declines one year ahead to trigger early interventions. Here, the correct split is by time: earlier periods for training, later periods for testing. This avoids temporal leakage but not cross-sectional leakage, which is not a real concern here, since the same units are being forecasted over time.
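A sequential-forecasting setup like the one just described can be sketched as follows. This is a minimal illustration with simulated data and hypothetical column names (`county`, `income`), not the paper's actual pipeline: it constructs the one-year-ahead decline target and then splits by time, so no testing-period year ever enters training.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical county-year panel with simulated per capita income.
df = pd.DataFrame(
    [(c, t) for c in range(100) for t in range(2000, 2020)],
    columns=["county", "year"],
)
df["income"] = rng.normal(30_000, 5_000, len(df))

# Target: does income decline in the *following* year?
df = df.sort_values(["county", "year"])
df["income_next"] = df.groupby("county")["income"].shift(-1)
df["decline_next"] = (df["income_next"] < df["income"]).astype(int)
df = df.dropna(subset=["income_next"])  # final year: outcome not yet known

# Time-based split: earlier periods for training, later ones for testing.
train = df[df["year"] <= 2014]
test = df[df["year"] > 2014]
```

Note that the same counties appear in both sets by design: cross-sectional overlap is harmless here, because those are exactly the units we want to forecast.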

The wrong approach in both cases is a random split at the unit–time level (Panel A of Figure 1), which contaminates results with both forms of leakage and produces misleadingly high performance metrics.

Practical Guidelines

To assist practitioners, we summarize a set of do’s and don’ts for applying ML to panel data:

  • Select the sample split based on the research question: unit-based for cross-sectional problems, time-based for forecasting.
  • Temporal leakage can occur not only through observations, but also through predictors. For forecasting, use only lagged or time-invariant predictors. Using contemporaneous variables (e.g., unemployment in 2014 to predict income in 2014) is conceptually wrong and creates temporal data leakage.
  • Adapt cross-validation to panel data. The random k-fold CV found in most ready-to-use software packages is inappropriate, as it mixes future and past information. Instead, use rolling or expanding windows for forecasting, or stratified CV by units/groups for cross-sectional prediction.
  • Ensure that out-of-sample performance is tested on truly unseen data, not on data already encountered during training.
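Two of these guidelines, lagging the predictors and using an expanding-window CV, can be sketched together. This is a hand-rolled illustration under assumed column names (`unit`, `unemployment`), not a prescribed implementation; ready-made alternatives exist (e.g., scikit-learn's `TimeSeriesSplit` for a single series), but writing the folds out makes the temporal logic explicit.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical panel with a contemporaneous predictor.
df = pd.DataFrame(
    [(u, t) for u in range(50) for t in range(2005, 2020)],
    columns=["unit", "year"],
)
df["unemployment"] = rng.uniform(2, 12, len(df))

# For forecasting, only *lagged* predictors are admissible:
df = df.sort_values(["unit", "year"])
df["unemployment_lag1"] = df.groupby("unit")["unemployment"].shift(1)
df = df.dropna(subset=["unemployment_lag1"])  # first year has no lag

# Expanding-window cross-validation: each fold trains on all years up
# to a cutoff and validates on the single following year.
def expanding_window_folds(data, first_cutoff):
    for cutoff in range(first_cutoff, data["year"].max()):
        train_idx = data.index[data["year"] <= cutoff]
        valid_idx = data.index[data["year"] == cutoff + 1]
        yield train_idx, valid_idx

folds = list(expanding_window_folds(df, first_cutoff=2015))
```

Every validation year lies strictly after its training window, so the tuning step never peeks at the future, unlike random k-fold CV.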

Empirical Application

To illustrate these issues, we analyze a balanced panel of 3,058 U.S. counties from 2000 to 2019, focusing exclusively on sequential forecasting. We consider two tasks: a regression problem (forecasting per capita income) and a classification problem (forecasting whether income will decline in the following year).

We run hundreds of models, varying split strategies, the use of contemporaneous predictors, the inclusion of lagged outcomes, and algorithms (Random Forest, XGBoost, Logit, and OLS). This comprehensive design allows us to quantify how leakage inflates performance. Figure 2 below reports our main findings.

Panel A of Figure 2 shows forecasting performance for classification tasks. Random splits yield very high accuracy, but this is illusory: the model has already seen similar data during training.

Panel B shows forecasting performance for regression tasks. Once again, random splits make models look much better than they really are, while correct time-based splits show much lower, yet realistic, accuracy.

Figure 2  |  Temporal leakage in the forecasting problem

      Panel A – Classification task

      Panel B – Regression task

In the paper, we also show that the overestimation of model accuracy becomes significantly more pronounced in years marked by distribution shifts and structural breaks, such as the Great Recession, making the results particularly misleading for policy purposes.

Why It Matters

Data leakage is more than a technical pitfall; it has real-world consequences. In policy applications, a model that seems highly accurate during validation may collapse once deployed, resulting in misallocated resources, missed crises, or misguided targeting. In business settings, the same issue can translate into poor investment decisions, inefficient customer targeting, or false confidence in risk assessments. The risk is particularly acute when machine learning models are intended to function as early-warning systems, where misplaced trust in inflated performance can lead to costly failures.

By contrast, properly designed models, even when less accurate on paper, provide honest and reliable predictions that can meaningfully inform decision-making.

Takeaway

ML has the potential to transform decision-making in both policy and business, but only if applied appropriately. Panel data offer rich opportunities, yet they are especially vulnerable to data leakage. To generate reliable insights, practitioners should align their ML workflow with the prediction objective, account for both the temporal and cross-sectional structures, and use validation strategies that prevent overoptimistic assessments and an illusion of high accuracy. When these principles are followed, models avoid the trap of inflated performance and instead provide guidance that genuinely helps policymakers allocate resources and businesses make sound strategic decisions. Given the rapid adoption of ML with panel data in both public and private domains, addressing these pitfalls is now a pressing priority for applied research.

References

A. Cerqua, M. Letta, and G. Pinto, “On the (Mis)Use of Machine Learning With Panel Data”, Oxford Bulletin of Economics and Statistics (2025): 1–13, https://doi.org/10.1111/obes.70019.
