Data drift occurs when the statistical properties of the input data change over time, resulting in a shift in the data distribution.
The need for data drift monitoring in industry arises from the fact that machine learning models depend on accurate and up-to-date data in order to make accurate predictions or decisions. However, in many industries, the data can change over time, resulting in data drift.
Zillow, a real estate company based out of Seattle, suffered a loss of around $300 million because of faulty price predictions from its ML model (as a result of data drift) after the drastic changes in the real estate market caused by the global pandemic.
As the saying goes:
There are several statistical tests available to detect data drift. Some of the widely used tests include the following (a quick scipy sketch of the Kolmogorov-Smirnov test follows the list):
- Kolmogorov-Smirnov test
- Population Stability Index
- Kullback-Leibler or KL Divergence
- Jensen-Shannon or JS Divergence
- Wasserstein Metric or Earth Mover Distance
- Chi-squared test
- Z test
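As a quick illustration (a minimal sketch using plain scipy on made-up data, not Evidently's implementation), the two-sample Kolmogorov-Smirnov test compares a reference sample against a current sample and returns a p-value; a small p-value suggests the two distributions differ:
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, size=500)    # sample from training time
current = rng.normal(0.5, 1, size=500)    # production sample with a shift

statistic, p_value = ks_2samp(reference, current)
if p_value < 0.05:  # 0.95 confidence level
    print(f"Drift detected (p={p_value:.4f})")
else:
    print(f"No drift detected (p={p_value:.4f})")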
Evidently is an open-source machine learning model monitoring library in Python developed by Evidently.ai.
It follows a default logic for selecting the appropriate drift test for each column (or feature) in the data based on:
- column type: categorical, numerical or text data
- the number of observations in the reference dataset
- the number of unique values in the column (n_unique)
With this logic (as shown in the image above), the drift test to be used is selected, the corresponding test is applied to the reference data and the current data, and a drift score is obtained for each feature or column in the data, which determines whether there is data drift in that feature (as shown in the image below).
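To make the idea concrete, here is a rough sketch of this kind of selection logic in plain Python. It is a hypothetical simplification for illustration only; the exact rules and cut-offs are Evidently's own and are summarized in the image above:
def select_drift_test(column_type: str, n_observations: int, n_unique: int) -> str:
    # Hypothetical simplification of a column-wise test-selection rule,
    # not Evidently's actual code
    small_data = n_observations <= 1000
    if column_type == "numerical":
        return "kolmogorov-smirnov" if small_data else "wasserstein"
    if column_type == "categorical":
        if n_unique <= 2:
            return "z-test" if small_data else "jensen-shannon"
        return "chi-squared" if small_data else "jensen-shannon"
    return "text-specific test"  # text columns are handled separately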
All tests use a 0.95 confidence level by default and all metrics use a threshold of 0.1 by default.
How to install Evidently
!pip install evidently
For the purpose of demonstration, let's take the Iris flower dataset available in sklearn; however, custom data can be used for actual model monitoring in production.
1. Import the required libraries in Python
from sklearn import datasets
import pandas as pd
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab
from evidently.options import DataDriftOptions
2. Load the dataset (in our case, the iris flower dataset)
iris = datasets.load_iris()
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_frame['target'] = iris.target
3. Calculate the data drift using the reference data (the data used for training) and the current data (the data from production). In our case, let's consider the first 100 rows as the reference data and the remaining rows as the current data.
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
Yayyy!!! Here's our dashboard to detect data drift with just a few lines of code!
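If you are working outside a notebook, the same dashboard can also be exported as a standalone HTML file with the legacy Dashboard API's save method (the file name here is just an example):
data_drift_dashboard.save("data_drift_report.html")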
We need not follow the default drift detection logic used by Evidently; we often look for customizations based on our use case, and the good part is that Evidently allows us to modify its drift detection logic based on our needs.
We can choose any of the supported tests from the list below and can even pass a custom-written test function.
- How to customize the statistical test and the threshold for drift detection
from evidently.options import DataDriftOptions

options = DataDriftOptions(all_features_stattest="jensenshannon", threshold=0.6)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
With the above code snippet, we can customize the default drift detection logic by using the Jensen-Shannon distance to detect drift with a threshold of 0.6,
i.e., for all of the features, if the Jensen-Shannon distance >= 0.6, then data drift will be detected.
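To see what that rule means numerically, here is a minimal sketch of the same check done by hand with scipy on one feature of our two data slices (the binning choice is illustrative, and note that scipy's jensenshannon returns the JS distance):
import numpy as np
from scipy.spatial.distance import jensenshannon

feature = iris.feature_names[0]
bins = np.histogram_bin_edges(iris_frame[feature], bins=20)

# Bin the reference and current values into comparable histograms
ref_hist, _ = np.histogram(iris_frame[feature][:100], bins=bins)
cur_hist, _ = np.histogram(iris_frame[feature][100:], bins=bins)

# scipy normalizes the histograms into probability vectors internally
distance = jensenshannon(ref_hist, cur_hist)
print(f"{feature}: JS distance = {distance:.3f}, drift = {distance >= 0.6}")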
Note:
With the default logic, the Z test is used for the target and the K-S test is used for the remaining numerical features, and data drift is detected when the p-value falls below 0.05 (following the default 0.95 confidence level). As a result, 4 out of 5 columns were detected with data drift (refer to the previously obtained dashboard).
Whereas after customization (as shown in the image above), the Jensen-Shannon distance is used to detect drift for all features with a threshold of 0.6. As a result, 3 out of 5 columns were detected with data drift.
Similarly, based on our use case, we can also modify the statistical test for numerical features alone or for categorical features alone, as shown in the snippets below.
# Use PSI (Population Stability Index) for numerical features only
options = DataDriftOptions(num_features_stattest="psi", threshold=0.25)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
# Use PSI for categorical features only
options = DataDriftOptions(cat_features_stattest="psi", threshold=0.25)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')
Customizations can also be done on individual features, as shown in the sketch below.
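For instance, assuming the installed version supports DataDriftOptions' per_feature_stattest argument (a dict mapping column names to test names; an assumption worth verifying against your Evidently version), a per-feature setup could look like this sketch:
# Hypothetical per-feature customization: a different test per column
options = DataDriftOptions(
    per_feature_stattest={
        "sepal length (cm)": "wasserstein",
        "sepal width (cm)": "jensenshannon",
        "target": "z",
    }
)
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()], options=[options])
data_drift_dashboard.calculate(iris_frame[:100], iris_frame[100:])
data_drift_dashboard.show(mode='inline')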
So why wait?
Because it's gonna be a headache if you miss the data drift!
Happpyyyy model monitoring!!!