Data drift occurs when the statistical properties of the input data change over time, resulting in a shift in the data distribution.
The need for data drift monitoring in industry arises from the fact that machine learning models depend on accurate and up-to-date data in order to make accurate predictions or decisions. However, in many industries, the data can change over time, resulting in data drift.
Zillow, a real estate company based out of Seattle, suffered a loss of around $300 million because of faulty price predictions from its ML model (as a result of data drift) after the drastic changes in the real estate market caused by the global pandemic.
As the saying goes:
There are several statistical tests available to detect data drift. Some of the widely used tests include the following (a quick scipy sketch of the Kolmogorov-Smirnov test follows the list):
- Kolmogorov-Smirnov test
- Population Stability Index
- Kullback-Leibler or KL Divergence
- Jensen-Shannon or JS Divergence
- Wasserstein Metric or Earth Mover Distance
- Chi-squared test
- Z test
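As a quick illustration (a minimal sketch using plain scipy on made-up data, not Evidently's implementation), the two-sample Kolmogorov-Smirnov test compares a reference sample against a current sample and returns a p-value; a small p-value suggests the two distributions differ:
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, size=500)    # sample from training time
current = rng.normal(0.5, 1, size=500)    # production sample with a shift

statistic, p_value = ks_2samp(reference, current)
if p_value < 0.05:  # 0.95 confidence level
    print(f"Drift detected (p={p_value:.4f})")
else:
    print(f"No drift detected (p={p_value:.4f})")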
Evidently is an open-source machine learning model monitoring library in Python developed by Evidently.ai.
It follows a default logic for selecting the appropriate drift test for each column (or feature) in the data based on:
- column type: categorical, numerical or text data
- the number of observations in the reference dataset
- the number of unique values in the column (n_unique)
With this logic (as shown in the image above), the drift test to be used is selected, the corresponding test is applied to the reference data and the current data, and a drift score is obtained for each feature or column in the data, which determines whether there is data drift in that feature (as shown in the image below).
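To make the idea concrete, here is a rough sketch of this kind of selection logic in plain Python. It is a hypothetical simplification for illustration only; the exact rules and cut-offs are Evidently's own and are summarized in the image above:
def select_drift_test(column_type: str, n_observations: int, n_unique: int) -> str:
    # Hypothetical simplification of a column-wise test-selection rule,
    # not Evidently's actual code
    small_data = n_observations <= 1000
    if column_type == "numerical":
        return "kolmogorov-smirnov" if small_data else "wasserstein"
    if column_type == "categorical":
        if n_unique <= 2:
            return "z-test" if small_data else "jensen-shannon"
        return "chi-squared" if small_data else "jensen-shannon"
    return "text-specific test"  # text columns are handled separately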
All tests use a 0.95 confidence level by default and all metrics use a threshold of 0.1 by default.
How to install Evidently
!pip install evidently
For the purpose of demonstration, let's take the Iris flower dataset available in sklearn; however, custom data can be used for actual model monitoring in production.
1. Import the required libraries in Python
from sklearn import datasets
import pandas as pd
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab
from evidently.options import DataDriftOptions
2. Load the dataset (in our case, the iris flower dataset)
iris = datasets.load_iris()
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_frame['target'] = iris.target
3. Calculate the data drift using the reference data (the data used for training) and the current data (the data from production). In our case, let's consider the first 100 rows as the reference data and the remaining rows as the current data.
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
Yayyy!!! Here's our dashboard to detect data drift with just a few lines of code!
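If you are working outside a notebook, the same dashboard can also be exported as a standalone HTML file with the legacy Dashboard API's save method (the file name here is just an example):
data_drift_dashboard.save("data_drift_report.html")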
We need not follow the default drift detection logic used by Evidently; we often look for customizations based on our use case, and the good part is that Evidently allows us to modify its drift detection logic based on our needs.
We can choose any of the supported tests from the list below and can even pass a custom-written test function.
- How to customize the statistical test and the threshold for drift detection
from evidently.options import DataDriftOptions

options = DataDriftOptions(all_features_stattest="jensenshannon", threshold=0.6)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
With the above code snippet, we can customize the default drift detection logic by using the Jensen-Shannon distance to detect drift with a threshold of 0.6,
i.e., for all of the features, if the Jensen-Shannon distance >= 0.6, then data drift will be detected.
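To see what that rule means numerically, here is a minimal sketch of the same check done by hand with scipy on one feature of our two data slices (the binning choice is illustrative, and note that scipy's jensenshannon returns the JS distance):
import numpy as np
from scipy.spatial.distance import jensenshannon

feature = iris.feature_names[0]
bins = np.histogram_bin_edges(iris_frame[feature], bins=20)

# Bin the reference and current values into comparable histograms
ref_hist, _ = np.histogram(iris_frame[feature][:100], bins=bins)
cur_hist, _ = np.histogram(iris_frame[feature][100:], bins=bins)

# scipy normalizes the histograms into probability vectors internally
distance = jensenshannon(ref_hist, cur_hist)
print(f"{feature}: JS distance = {distance:.3f}, drift = {distance >= 0.6}")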
Note:
With the default logic, the Z test is used for the target and the K-S test is used for the remaining numerical features, and data drift is detected when the p-value falls below 0.05 (following the default 0.95 confidence level). As a result, 4 out of 5 columns were detected with data drift (refer to the previously obtained dashboard).
Whereas after customization (as shown in the image above), the Jensen-Shannon distance is used to detect drift for all features with a threshold of 0.6. As a result, 3 out of 5 columns were detected with data drift.
Similarly, based on our use case, we can also modify the statistical test for numerical features alone or for categorical features alone, as shown in the snippets below.
# Use PSI (Population Stability Index) for numerical features only
options = DataDriftOptions(num_features_stattest="psi", threshold=0.25)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
# Use PSI for categorical features only
options = DataDriftOptions(cat_features_stattest="psi", threshold=0.25)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
Customizations can also be done on individual features, as shown in the sketch below.
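For instance, assuming the installed version supports DataDriftOptions' per_feature_stattest argument (a dict mapping column names to test names; an assumption worth verifying against your Evidently version), a per-feature setup could look like this sketch:
# Hypothetical per-feature customization: a different test per column
options = DataDriftOptions(
    per_feature_stattest={
        "sepal length (cm)": "wasserstein",
        "sepal width (cm)": "jensenshannon",
        "target": "z",
    }
)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')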
So why wait?
Because it's gonna be a headache if you miss the data drift!
Happpyyyy model monitoring!!!