I have worked on real-time fraud detection systems and recommendation models for product companies. These systems looked excellent during development. Offline metrics were strong. AUC curves were stable across validation windows. Feature importance plots told a clean, intuitive story. We shipped with confidence.
Just a few weeks later, our metrics began to drift.
Click-through rates on recommendations began to slip. Fraud models behaved inconsistently during peak hours. Some decisions felt overly confident, others oddly blind. The models themselves had not degraded. There were no sudden data outages or broken pipelines. What failed was our understanding of how the system behaved once it met time, latency, and delayed truth in the real world.
This article is about those failures. The quiet, unglamorous problems that show up only when machine learning systems collide with reality. Not optimizer choices or the newest architecture. The issues that don’t appear in notebooks, but surface on 3 a.m. dashboards.
My message is straightforward: most production ML failures are data and time problems, not modeling problems. If you don’t design explicitly for how data arrives, matures, and changes, the system will quietly make those assumptions for you.
Time Travel: An Assumption Leak
Time travel is probably the most common production ML failure I have seen, and also the least discussed in concrete terms. Everyone nods when you mention leakage. Very few teams can point to the exact row where it happened.
Let me make it explicit.
Imagine a fraud dataset with two tables:
- transactions: when the payment happened
- chargebacks: when the fraud outcome was reported

The feature we want is user_chargeback_count_last_30_days.
The batch job runs at the end of the day, just before midnight, and computes chargeback counts for the last 30 days. For user U123, the count is 1. As of midnight, that’s factually correct.

Now look at the final joined training dataset.
Morning transactions at 9:10 AM and 11:45 AM already carry a chargeback count of 1. At the time those payments were made, the chargeback had not yet been reported. But the training data doesn’t know that. Time has been flattened.
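To make the flattening concrete, here is a minimal, hypothetical reconstruction of that pipeline. The table layouts and the 4:30 PM report time are assumptions for illustration; only the 9:10 AM and 11:45 AM transactions and the end-of-day count of 1 come from the scenario above.

import pandas as pd

# Hypothetical data: user U123 transacts in the morning, and a chargeback
# against them is reported later that same day (report time assumed here).
transactions = pd.DataFrame({
    "user_id": ["U123", "U123"],
    "txn_time": pd.to_datetime(["2024-03-01 09:10", "2024-03-01 11:45"]),
})
chargebacks = pd.DataFrame({
    "user_id": ["U123"],
    "reported_at": pd.to_datetime(["2024-03-01 16:30"]),  # assumed report time
})

# End-of-day batch: count chargebacks reported in the trailing 30 days, per user.
cutoff = pd.Timestamp("2024-03-01 23:59")
window = chargebacks[chargebacks["reported_at"] >= cutoff - pd.Timedelta(days=30)]
daily_feature = (
    window.groupby("user_id").size()
    .rename("user_chargeback_count_last_30_days")
    .reset_index()
)

# Naive join: every transaction from that day inherits the end-of-day count,
# including the 9:10 AM and 11:45 AM payments made before the chargeback existed.
train = transactions.merge(daily_feature, on="user_id", how="left")
print(train)  # both rows show user_chargeback_count_last_30_days == 1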
This is where the model cheats.

From the model’s perspective, risky-looking transactions already carry confirmed fraud signals. Offline recall improves dramatically. Nothing looks wrong at this point.
But in production, the model never sees the future.
When deployed, those early transactions do not have a chargeback count yet. The signal disappears and performance collapses.
This is not a modeling mistake. It’s an assumption leak.
The hidden assumption is that a daily batch feature is valid for all events on that day. It is not. A feature is only valid if it could have existed at the exact moment the prediction was made.
Every feature must answer one question:
“Could this value have existed at the exact moment the prediction was made?”
If the answer is not a confident yes, the feature is invalid.
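A point-in-time correct version of the feature encodes that question directly: for each transaction, count only chargebacks that had already been reported at prediction time. A minimal sketch, reusing the hypothetical tables from the snippet above:

def chargeback_count_asof(txn, chargebacks, window_days=30):
    """Count chargebacks visible at the moment this transaction was scored."""
    window_start = txn["txn_time"] - pd.Timedelta(days=window_days)
    visible = chargebacks[
        (chargebacks["user_id"] == txn["user_id"])
        & (chargebacks["reported_at"] < txn["txn_time"])   # no future knowledge
        & (chargebacks["reported_at"] >= window_start)
    ]
    return len(visible)

transactions["user_chargeback_count_last_30_days"] = transactions.apply(
    chargeback_count_asof, axis=1, chargebacks=chargebacks
)
# The 9:10 AM and 11:45 AM rows now correctly get 0, matching what the
# model would actually see in production.

In practice this rule is usually enforced with an as-of join (for example pd.merge_asof) or a feature store with point-in-time semantics; the row-wise version above just makes the rule explicit.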
Feature Defaults That Become Signals
After time travel, this is one of the most common failure modes I have seen in production systems. Unlike leakage, this one doesn’t depend on the future. It relies on silence.
Most engineers treat missing values as a hygiene problem. Fill them with the mean, the median, or some other imputation technique, then move on.
These defaults feel harmless. Something safe enough to keep the model running.
That assumption turns out to be expensive.
In real systems, missing rarely means random. Missing often means new, unknown, not yet observed, or not yet trusted. When we collapse all of that into a single default value, the model doesn’t see a gap. It sees a pattern.
Let me make this concrete.
I first ran into this in a real-time fraud system where we used a feature called avg_transaction_amount_last_7_days. For active users, this value was well behaved. For new or inactive users, the feature pipeline returned a default value of zero.

To illustrate how the default value became a strong proxy for user status, I computed the observed fraud rate grouped by the feature’s value:
# Observed fraud rate for each value of the 7-day average amount feature
data.groupby("avg_transaction_amount_last_7_days")["is_fraud"].mean()
As shown, users with a value of zero exhibit a markedly lower fraud rate. Not because zero spending is inherently safe, but because those users are new or inactive. The model doesn’t learn “low spending is safe”. It learns “missing history means safe”.
The default has become a signal.
During training, this looks good: precision improves. Then production traffic changes.
A downstream service starts timing out during peak hours. Suddenly, active users temporarily lose their history features. Their avg_transaction_amount_last_7_days flips to zero. The model confidently marks them as low risk.
Experienced teams handle this differently. They separate absence from value and track feature availability explicitly. Most importantly, they never allow silence to masquerade as information.
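One concrete way to do that is to make missingness a first-class input rather than a silent fill value. A minimal sketch of the idea, with hypothetical column names and assuming the upstream pipeline emits NaN (not zero) when a lookup fails:

import pandas as pd

def add_history_features(df: pd.DataFrame) -> pd.DataFrame:
    """Separate 'we have no history' from 'the history is genuinely zero'."""
    out = df.copy()
    # Explicit availability flag the model can learn from directly.
    out["has_txn_history_7d"] = out["avg_transaction_amount_last_7_days"].notna()
    # Impute only after recording that the value was imputed.
    out["avg_transaction_amount_last_7_days"] = (
        out["avg_transaction_amount_last_7_days"].fillna(0.0)
    )
    return out

# A peak-hour timeout now surfaces as has_txn_history_7d == False instead of
# silently turning an active user's average into a "safe-looking" zero.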
Population Shift Without Distribution Shift
This failure mode took me much longer to recognize, mostly because all the standard alarms stayed silent.
When people talk about data drift, they usually mean distribution shift. Feature histograms move. Percentiles change. KS tests light up dashboards. Everyone understands what to do next: investigate upstream data, retrain, recalibrate.
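For reference, the check behind those dashboards is usually something like a two-sample Kolmogorov–Smirnov test between a training-time reference window and recent traffic. A self-contained sketch with synthetic stand-in data:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Stand-ins for a feature sampled from the training window vs. recent traffic.
reference_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
recent_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)

stat, p_value = ks_2samp(reference_amounts, recent_amounts)
if p_value < 0.01:
    print(f"Distribution shift detected (KS statistic = {stat:.3f})")
else:
    print("Feature distribution looks stable")  # the alarm stays silent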
Population shift without distribution shift is different. Here, the feature distributions remain stable. Summary statistics barely move. Monitoring dashboards look reassuring. And yet, model behavior degrades steadily.
I first encountered this in a large-scale payments risk system that operated across multiple user segments. The model consumed transaction-level features like amount, time of day, device signals, velocity counters, and merchant category codes. All of those features were heavily monitored. Their distributions barely changed month over month.
Still, fraud rates began creeping up in a very specific slice of traffic. What changed was not the data. It was who the data represented.
Over time, the product expanded into new user cohorts. New geographies with different payment habits. New merchant categories with unfamiliar transaction patterns. Promotional campaigns that brought in users who behaved differently but still fell within the same numeric ranges. From a distribution perspective, nothing looked unusual. But the underlying population had shifted.
The model had been trained mostly on mature users with long behavioral histories. As the user base grew, a larger fraction of traffic came from newer users whose behavior looked statistically similar but semantically different. A transaction amount of 2,000 meant something very different for a long-tenured user than for someone on their first day. The model didn’t know that, because we had not taught it to care.

The figure above shows why this failure mode is difficult to detect in practice. The first two plots show transaction amount and short-term velocity distributions for mature and new users. From a monitoring perspective, these features appear stable: the two groups overlap almost completely. If this were the only signal available, most teams would conclude that the data pipeline and model inputs remain healthy.
The third plot reveals the real problem. Even though the feature distributions are nearly identical, the fraud rate differs substantially across populations. The model applies the same decision boundaries to both groups because the inputs look familiar, but the underlying risk is not the same. What has changed is not the data itself, but who the data represents.
As traffic composition changes through growth or expansion, those assumptions stop holding, even though the data continues to look statistically normal. Without explicitly modeling population context or evaluating performance across cohorts, these failures remain invisible until business metrics begin to degrade.
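The practical countermeasure is to slice evaluation by cohort, not only by feature. A minimal sketch with synthetic data (the tenure-based cohorts, fraud rates, and column names are hypothetical) that reproduces the pattern in the figure: nearly identical feature distributions, very different risk:

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
n = 10_000
eval_df = pd.DataFrame({
    "cohort": rng.choice(["mature", "new"], size=n),
    "transaction_amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),
})
# Same feature distribution, different underlying fraud rate per cohort.
base_rate = eval_df["cohort"].map({"mature": 0.01, "new": 0.04})
eval_df["is_fraud"] = rng.random(n) < base_rate

mature = eval_df.loc[eval_df["cohort"] == "mature", "transaction_amount"]
new = eval_df.loc[eval_df["cohort"] == "new", "transaction_amount"]
print(ks_2samp(mature, new))                         # no distribution shift
print(eval_df.groupby("cohort")["is_fraud"].mean())  # but very different risk

Monitoring the second print alongside the first is what makes this failure mode visible before the business metrics do.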
Before You Go
None of the failures in this article were caused by bad models.
The architectures were reasonable. The features were thoughtfully designed. What failed was the system around the model, specifically the assumptions we made about time, absence, and who the data represented.
Time is not a static index. Labels arrive late. Features mature unevenly. Batch boundaries rarely align with decision moments. When we ignore that, models learn from information they will never see again.
If there is one takeaway, it is this: strong offline metrics are not proof of correctness. They are proof that the model fits the assumptions you gave it. The real work of machine learning begins when those assumptions meet reality.
Design for that moment.
References & Further Reading
[1] ROC Curves and AUC (Google Machine Learning Crash Course)
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
[2] Kolmogorov–Smirnov Test (Wikipedia)
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
[3] Data Distribution Shifts and Monitoring (Huyen Chip)
https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html
