
How to Detect Data Drift with Hypothesis Testing


p-value

Enter the infamous p-value. It is a number that answers the question: what is the probability of observing the chi-2 value we got, or an even more extreme one, given that the null hypothesis is true? Or, using some notation, the p-value represents the probability of observing the data assuming the null hypothesis is true: P(data|H₀). (To be precise, the p-value is defined as P(test_statistic(data) ≥ T | H₀), where T is the value of the test statistic we actually observed.) Notice how this is different from what we are actually interested in, which is the probability that our hypothesis is true given the data we have observed: P(H₀|data).

what the p-value represents: P(data|H₀)
what we often want: P(H₀|data)
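
The two quantities are related through Bayes' theorem, which also shows why the one we actually want is harder to obtain: it requires a prior belief in the hypothesis, P(H₀), and the overall probability of the data, P(data):

P(H₀|data) = P(data|H₀) × P(H₀) / P(data)

A small p-value alone, therefore, does not tell us how probable the null hypothesis is.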

Graphically speaking, the p-value is the blue probability density summed to the right of the red line. The easiest way to compute it is one minus the cumulative distribution function evaluated at the observed value, that is, one minus the probability mass to the left of it.

from scipy.stats import chi2
1 - chi2.cdf(chisq, df=1)
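
Equivalently, we could use scipy's survival function, which is defined as one minus the CDF and is numerically more robust in the far tail:

chi2.sf(chisq, df=1)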

Either way, we get 0.0396. If there was no data drift, we would get the test statistic we got, or an even larger one, in roughly 4% of cases. Not that rare, after all. In most use cases, the p-value is conventionally compared to a significance level of 1% or 5%: if it is lower than that, one rejects the null. Let's be conservative and stick to the 1% significance threshold. In our case, with a p-value of almost 4%, there is not enough evidence to reject the null hypothesis. Hence, no data drift was detected.
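
Expressed in code, the drift decision boils down to a simple comparison (the variable names below are just for illustration, not from the original post):

p_value = 1 - chi2.cdf(chisq, df=1)
significance_level = 0.01
drift_detected = p_value < significance_level  # False: 0.0396 is above the 1% threshold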

To make sure our calculations were correct, let's verify them with scipy's built-in test function.

from scipy.stats import chi2_contingency

# chi2_contingency returns the chi-2 statistic, the p-value, the degrees of
# freedom, and the table of expected frequencies.
chisq, pvalue, df, expected = chi2_contingency(cont_table)
print(chisq, pvalue)

4.232914541135393 0.03964730311588313

That is how hypothesis testing works. But how relevant is it for data drift detection in a production machine learning system?

Statistics, in its broadest sense, is the science of making inferences about entire populations based on small samples. When the famous t-test was first published at the beginning of the twentieth century, all calculations were made with pen and paper. Even today, students in STATS101 courses learn that a “large sample” starts at 30 observations.

Back in the days when data was hard to collect and store, and manual calculations were tedious, statistically rigorous tests were a great way to answer questions about broader populations. Nowadays, however, with data often abundant, many of these tests lose much of their usefulness.

The catch is that many statistical tests treat the amount of data as evidence. With little data, the observed effect is more prone to random variation caused by sampling error; with a lot of data, its variance decreases. Consequently, the very same observed effect constitutes stronger evidence against the null hypothesis when it comes from more data than from less.

To illustrate this phenomenon, consider comparing two companies, A and B, in terms of the gender ratio among their employees. Let's imagine two scenarios. First, we take random samples of 10 employees from each company. At company A, 6 out of 10 are women, while at company B, 4 out of 10 are women. Second, we increase the sample size to 1000: at company A, 600 out of 1000 are women, and at company B, it's 400. In both scenarios, the gender ratios are the same. However, more data seems to provide stronger evidence for the fact that company A employs proportionally more women than company B, doesn't it?
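
To make this concrete, here is a quick sketch (not from the original post, just an illustration): running the same chi-2 test on both scenarios yields a large p-value for the small samples and a vanishingly small one for the large samples, even though the proportions are identical.

import numpy as np
from scipy.stats import chi2_contingency

# Women vs. men at companies A and B; the split is 60%/40% vs. 40%/60% in both scenarios.
small_samples = np.array([[6, 4], [4, 6]])          # 10 employees sampled per company
large_samples = np.array([[600, 400], [400, 600]])  # 1000 employees sampled per company

for table in (small_samples, large_samples):
    _, pvalue, _, _ = chi2_contingency(table)
    print(pvalue)

# The small samples give a p-value far above any common significance level,
# while the large samples give one that is practically zero.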

This phenomenon often manifests itself in hypothesis testing with large data samples. The more data, the lower the p-value, and so the more likely we are to reject the null hypothesis and declare the detection of some kind of statistical effect, such as data drift.

Let's see whether this holds for our chi-2 test for the difference in frequencies of a categorical variable. In the original example, the serving set was roughly ten times smaller than the training set. Let's multiply the frequencies in the serving set by a set of scaling factors between 1/100 and 10 and calculate the chi-2 statistic and the test's p-value each time. Notice that multiplying all frequencies in the serving set by the same constant does not impact their distribution: the only thing we are changing is the size of one of the sets.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

training_freqs = np.array([10_322, 24_930, 30_299])
serving_freqs = np.array([1_015, 2_501, 3_187])

p_values, chi_sqs = [], []
multipliers = [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

for serving_size_multiplier in multipliers:
    # Scale the serving set's size while keeping its distribution unchanged.
    augmented_serving_freqs = serving_freqs * serving_size_multiplier
    cont_table = pd.DataFrame([
        training_freqs,
        augmented_serving_freqs,
    ])
    chi_sq, pvalue, _, _ = chi2_contingency(cont_table)
    p_values.append(pvalue)
    chi_sqs.append(chi_sq)
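
The original post shows these results as a plot; the figure is not reproduced here, but a minimal matplotlib sketch along these lines (the styling is my assumption, not the original figure) would be:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, values, label in ((ax1, chi_sqs, "chi-2 statistic"), (ax2, p_values, "p-value")):
    ax.plot(multipliers, values, marker="o")
    ax.axvline(x=3, linestyle="--", color="gray")  # serving set scaled to 3x its original size
    ax.set_xscale("log")
    ax.set_xlabel("serving set size multiplier")
    ax.set_ylabel(label)
plt.tight_layout()
plt.show()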

The values at the multiplier equal to one are the ones we have calculated before. Notice how, with a serving set just 3 times larger (marked with a vertical dashed line), our conclusion changes completely: we get a chi-2 statistic of 11 and a p-value of almost zero, which in our case corresponds to signaling data drift.

The consequence of this is an increasing number of false alarms. Although these effects may be statistically significant, they are not necessarily significant from the performance monitoring perspective. With a large enough data set, even the tiniest data drift will be flagged, even if it is so weak that it does not deteriorate the model's performance at all.

Having learned this, you might be tempted to suggest dividing the serving data into a number of chunks and running multiple tests on smaller data sets. Unfortunately, this is not a good idea either. To understand why, we need to take a closer look at what the p-value really means.

We have already defined the p-value as the probability of observing a test statistic at least as extreme as the one we have actually observed, given that the null hypothesis is true. Let's try to unpack this mouthful.

The null hypothesis means no effect, in our case: no data drift. This means that whatever differences there are between the training and serving data, they have emerged as a result of random sampling. The p-value can therefore be seen as the probability of getting the differences we got, given that they come from randomness alone.

Hence, our p-value of roughly 0.1 means that, in the complete absence of data drift, 10% of tests will erroneously signal data drift due to random chance alone. This is consistent with the notation for what the p-value represents that we introduced earlier: P(data|H₀). If this probability is 0.1, then, given that H₀ is true (no drift), we have a 10% chance of observing data at least as different (according to the test statistic) as what we actually observed.

This is the reason why running more tests on smaller data samples is not a good idea: if, instead of testing the serving data from the whole day once a day, we split it into 10 chunks and ran 10 tests each day, we would end up with one false alarm every day, on average! This can lead to so-called alert fatigue, a situation in which you are bombarded with alerts to the point that you stop paying attention to them. And when data drift really does occur, you might miss it.
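
A small simulation makes this tangible (not from the original post; the chunk size, number of tests, and significance level are arbitrary choices): we repeatedly draw a "serving" chunk from the training distribution itself, so there is no drift by construction, and count how often the chi-2 test still raises an alarm.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

training_freqs = np.array([10_322, 24_930, 30_299])
training_probs = training_freqs / training_freqs.sum()

n_tests = 10_000     # e.g. 1000 days with 10 chunks tested per day
chunk_size = 670     # roughly one tenth of the original serving set
significance = 0.05  # a conventional threshold; the text above uses 10% for illustration

false_alarms = 0
for _ in range(n_tests):
    # Sample a serving chunk from the training distribution: no drift by construction.
    serving_chunk = rng.multinomial(chunk_size, training_probs)
    _, pvalue, _, _ = chi2_contingency(np.stack([training_freqs, serving_chunk]))
    false_alarms += pvalue < significance

# The false alarm rate approaches the significance level: with 10 tests a day at a 5%
# threshold, expect a false alarm every other day; at 10%, roughly one per day.
print(false_alarms / n_tests)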

We have seen that detecting data drift based on a test's p-value can be unreliable, leading to many false alarms. How can we do better? One solution is to turn 180 degrees and resort to Bayesian testing, which allows us to directly estimate what we actually need, P(H₀|data), rather than the p-value, P(data|H₀).
