This article continues my series on outlier detection, following articles on Counts Outlier Detector and Frequent Patterns Outlier Factor, and provides another excerpt from my book Outlier Detection in Python.
In this article, we look at the problem of testing and evaluating outlier detectors, a notoriously difficult task, and present one solution, sometimes known as doping. With doping, real data rows are modified (usually) randomly, but in such a way as to ensure they are likely anomalous in some regard and, as such, should be detected by an outlier detector. We can then evaluate detectors by assessing how well they are able to detect the doped records.
In this article, we look specifically at tabular data, but the same idea may be applied to other modalities as well, including text, image, audio, network data, and so on.
If you're familiar with outlier detection, you're likely also familiar, at least to some extent, with predictive models for regression and classification problems. With these types of problems, we have labelled data, and so it's relatively easy to evaluate each option when tuning a model (selecting the best pre-processing, features, hyper-parameters, and so on); and it's also relatively easy to estimate a model's accuracy (how it will perform on unseen data): we simply use a train-validation-test split, or better, use cross validation. As the data is labelled, we can see directly how the model performs on the labelled test data.
But with outlier detection, there is no labelled data and the problem is significantly more difficult; we have no objective way to determine whether the records scored highest by the outlier detector are, in fact, the most statistically unusual within the dataset.
With clustering, as another example, we also have no labels for the data, but it is at least possible to measure the quality of the clustering: we can determine how internally consistent the clusters are and how different the clusters are from each other. Using some distance metric (such as Manhattan or Euclidean distance), we can measure how close records within a cluster are to each other and how far apart the clusters are from each other.
So, given a set of possible clusterings, it's possible to define a sensible metric (such as the Silhouette score) and determine which is the preferred clustering, at least with respect to that metric. That is, much as with prediction problems, we can calculate a score for each clustering and select the clustering that appears to work best.
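For example, a minimal sketch, using scikit-learn and synthetic data, of comparing two candidate clusterings with the Silhouette score:

# A minimal sketch: compare two candidate clusterings of the same data using
# the Silhouette score and keep whichever scores higher.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

labels_3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("3 clusters:", silhouette_score(X, labels_3))
print("4 clusters:", silhouette_score(X, labels_4))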
With outlier detection, though, we have nothing analogous to this that we can use. Any system that seeks to quantify how anomalous a record is, or that seeks to determine, given two records, which is the more anomalous of the two, is effectively an outlier detection algorithm in itself.
For example, we could use entropy as our outlier detection method, and could then examine the entropy of the full dataset as well as the entropy of the dataset after removing any records identified as strong outliers. This is, in a sense, valid; entropy is a useful measure of the presence of outliers. But we cannot assume entropy is the definitive definition of outliers in this dataset; one of the fundamental qualities of outlier detection is that there is no definitive definition of outliers.
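As a rough sketch of this idea (just one possible formulation, using synthetic data for illustration: each numeric column is binned and the per-column entropies are summed):

# A rough sketch: measure the entropy of a dataset (here, summing the entropy
# of each binned numeric column) before and after removing the rows an
# Isolation Forest scores as the strongest outliers.
import numpy as np
import pandas as pd
from scipy.stats import entropy
from pyod.models.iforest import IForest

def dataset_entropy(df, bins=10):
    total = 0.0
    for col in df.columns:
        counts = pd.cut(df[col], bins=bins).value_counts()
        total += entropy(counts / counts.sum())
    return total

# Synthetic numeric data purely for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['A', 'B', 'C'])

# Remove the 10 rows scored most anomalous by an Isolation Forest
clf = IForest()
clf.fit(df)
top_outliers = np.argsort(clf.decision_scores_)[::-1][:10]
reduced_df = df.drop(df.index[top_outliers])

print(dataset_entropy(df), dataset_entropy(reduced_df))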
In general, if we have any way to attempt to evaluate the outliers detected by an outlier detection system (or, as in the previous example, the dataset with and without the identified outliers), this is effectively an outlier detection system in itself, and it becomes circular to use it to evaluate the outliers found.
Consequently, it's quite difficult to evaluate outlier detection systems, and there is effectively no good way to do so, at least using the real data that's available.
We can, though, create synthetic test data (in such a way that we can assume the synthetically-created data are predominantly outliers). Given this, we can determine the extent to which outlier detectors tend to score the synthetic records more highly than the real records.
There are a number of ways to create synthetic data, several of which are covered in the book, but for this article we focus on one method, doping.
Doping data records refers to taking existing data records and modifying them slightly, typically changing the values in just one, or a small number, of cells per record.
If the data being examined is, for example, a table related to the financial performance of a company made up of franchise locations, we may have a row for each franchise, and our goal may be to identify the most anomalous of these. Let's say we have features including:
- Age of the franchise
- Number of years with the current owner
- Number of sales last year
- Total dollar value of sales last year
As well as some number of other features.
A typical record may have values for these four features such as: 20 years old, 5 years with the current owner, 10,000 unique sales in the last year, for a total of $500,000 in sales in the last year.
We could create a doped version of this record by adjusting a value to a rare value, for example, setting the age of the franchise to 100 years. This can be done, and will provide a quick smoke test of the detectors being evaluated: likely any detector will be able to identify this as anomalous (assuming a value of 100 is rare), though we may be able to eliminate some detectors that are not able to detect this sort of modified record reliably.
We would not necessarily remove from consideration the type of outlier detector (e.g. kNN, Entropy, or Isolation Forest) itself, but rather the combination of detector type, pre-processing, hyperparameters, and other properties of the detector. We may find, for example, that kNN detectors with certain hyperparameters work well, while those with other hyperparameters do not (at least for the types of doped records we test with).
Usually, though, most testing will be done by creating more subtle outliers. In this example, we could change the dollar value of total sales from 500,000 to 100,000, which may still be a typical value, but the combination of 10,000 unique sales with $100,000 in total sales is likely unusual for this dataset. That is, much of the time with doping, we are creating records that have unusual combinations of values, though unusual single values are sometimes created as well.
When changing a value in a record, it's not known specifically how the row will become an outlier (assuming it does), but we can assume most tables have associations between the features. Changing the dollar value to 100,000 in this example may (as well as creating an unusual combination of number of sales and dollar value of sales) quite likely create an unusual combination given the age of the franchise or the number of years with the current owner.
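As a small sketch of what these two kinds of doping might look like in code (the franchise table and column names here are hypothetical):

import pandas as pd

# Hypothetical franchise table with the four features described above
df = pd.DataFrame({
    'Age': [20, 12, 33],
    'Years Current Owner': [5, 12, 3],
    'Num Sales': [10_000, 7_500, 12_000],
    'Total Sales': [500_000, 360_000, 610_000],
})

# Obvious doping (a smoke test): an extreme single value, a 100-year-old franchise
obvious_dope = df.copy()
obvious_dope.loc[0, 'Age'] = 100

# Subtler doping: $100,000 is a typical total on its own, but combined with
# 10,000 sales in the same row it is likely an unusual combination
subtle_dope = df.copy()
subtle_dope.loc[0, 'Total Sales'] = 100_000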
With some tables, however, there are no associations between the features, or there are only a few weak associations. This is rare, but can occur. With this type of data, there is no concept of unusual combinations of values, only unusual single values. Although rare, this is actually a simpler case to work with: it's easier to detect outliers (we simply check for single unusual values), and it's easier to evaluate the detectors (we simply check how well we are able to detect unusual single values). For the remainder of this article, though, we will assume there are some associations between the features and that most anomalies will be unusual combinations of values.
Most outlier detectors (with a small number of exceptions) have separate training and prediction steps. In this way, most are similar to predictive models. During the training step, the training data is assessed and the normal patterns within the data (for example, the normal distances between records, the frequent item sets, the clusters, the linear relationships between features, etc.) are identified. Then, during the prediction step, a test set of data (which may be the same data used for training, or may be separate data) is compared against the patterns found during training, and each row is assigned an outlier score (or, in some cases, a binary label).
Given this, there are two main ways we can work with doped data:
1. Including doped records in the training data
We may include a small number of doped records in the training data and then use this data for testing as well. This tests our ability to detect outliers in the currently-available data. This is a common task in outlier detection: given a set of data, we often wish to find the outliers in this dataset (though we may wish to find outliers in subsequent data as well, that is, records that are anomalous relative to the norms for this training data).
Doing this, we can test with only a small number of doped records, as we do not want to significantly affect the overall distributions of the data. We then check whether we are able to identify these as outliers. One key test is to include both the original and the doped version of the doped records in the training data, in order to determine whether the detectors score the doped versions significantly higher than the original versions of the same records.
We also, though, wish to check that the doped records are generally scored among the highest (with the understanding that some original, unmodified records may legitimately be more anomalous than the doped records, and that some doped records may not be anomalous).
Given that we can test with only a small number of doped records, this process may be repeated many times.
The doped data is used, however, only for evaluating the detectors in this way. When creating the final model(s) for production, we will train on only the original (real) data.
If we are able to reliably detect the doped records in the data, we can be reasonably confident that we are able to identify other outliers within the same data, at least outliers along the lines of the doped records (but not necessarily outliers that are substantially more subtle; hence we wish to include tests with reasonably subtle doped records).
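A minimal sketch of this first approach, using synthetic data and an Isolation Forest (the doping here simply swaps one value per record for a value drawn from elsewhere in the same column, which is just one of many ways to dope):

# A minimal sketch of the first approach: add a small number of doped rows to
# the training data, fit a detector on the combined data, and check whether
# each doped row scores higher than the original row it was created from.
import numpy as np
import pandas as pd
from pyod.models.iforest import IForest

rng = np.random.default_rng(0)

# Synthetic data with an association between features A and B
a = rng.normal(size=1000)
df = pd.DataFrame({'A': a, 'B': a * 2 + rng.normal(scale=0.1, size=1000)})

# Dope copies of 10 rows by replacing one value with a value drawn from
# elsewhere in the same column (breaking the A/B association)
doped_idx = rng.choice(len(df), size=10, replace=False)
doped_df = df.iloc[doped_idx].copy()
for i in doped_df.index:
    col = rng.choice(df.columns)
    doped_df.loc[i, col] = df[col].sample(n=1, random_state=int(i)).values[0]

# Train on the real data plus the small set of doped records
train_df = pd.concat([df, doped_df], ignore_index=True)
clf = IForest()
clf.fit(train_df)
scores = clf.decision_scores_

orig_scores = scores[doped_idx]    # scores of the unmodified originals
doped_scores = scores[len(df):]    # scores of their doped versions
print((doped_scores > orig_scores).mean())

Here, a fraction near 1.0 indicates the detector almost always scores the doped version higher than the original it was derived from.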
2. Including doped records only in the testing data
It is also possible to train using only the real data (which we can assume is largely non-outliers) and then test with both the real and the doped data. This allows us to train on relatively clean data (some records in the real data will be outliers, but the majority will be typical, and there is no contamination from doped records).
It also allows us to test with the actual outlier detector(s) that may potentially be put in production (depending on how well they perform with the doped data, both compared to the other detectors we test, and compared to our sense of how well a detector should perform at minimum).
This tests our ability to detect outliers in future data. This is another common scenario with outlier detection: where we have one dataset that can be assumed to be reasonably clean (either free of outliers, or containing only a small, typical set of outliers, and without any extreme outliers) and we wish to compare future data to this.
Training with real data only and testing with both real and doped data, we may test with any volume of doped data we wish, as the doped data is used only for testing and not for training. This allows us to create a large, and consequently more reliable, test dataset.
There are a number of ways to create doped data, including several covered in Outlier Detection in Python, each with its own strengths and weaknesses. For simplicity, in this article we cover only one option, where the data is modified in a fairly random manner: the cell(s) modified are selected randomly, and the new values that replace the original values are created randomly.
Doing this, it is possible for some doped records to not be truly anomalous, but in most cases, assigning random values will upset one or more associations between the features. We can assume the doped records are largely anomalous, though, depending on how they are created, possibly only slightly so.
Here we go through an example, taking a real dataset, modifying it, and testing to see how well the modifications are detected.
In this example, we use a dataset available on OpenML called abalone (https://www.openml.org/search?type=data&sort=runs&id=42726&status=active, available under public license).
Although other preprocessing may be done, for this example we one-hot encode the categorical features and use RobustScaler to scale the numeric features.
We test with three outlier detectors, Isolation Forest, LOF, and ECOD, all available in the popular PyOD library (which must be pip installed to execute the code).
We also use an Isolation Forest to clean the data (remove any strong outliers) before any training or testing. This step is not necessary, but is often useful with outlier detection.
This is an example of the second of the two approaches described above, where we train on the original data and test with both the original and the doped data.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import RobustScaler
import matplotlib.pyplot as plt
import seaborn as sns
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.ecod import ECOD

# Collect the data
data = fetch_openml('abalone', version=1)
df = pd.DataFrame(data.data, columns=data.feature_names)
df = pd.get_dummies(df)
df = pd.DataFrame(RobustScaler().fit_transform(df), columns=df.columns)
# Use an Isolation Forest to clean the data
clf = IForest()
clf.fit(df)
if_scores = clf.decision_scores_
top_if_scores = np.argsort(if_scores)[::-1][:10]
clean_df = df.loc[[x for x in df.index if x not in top_if_scores]].copy()
# Create a set of doped records
doped_df = df.copy()
for i in doped_df.index:
    col_name = np.random.choice(df.columns)
    med_val = clean_df[col_name].median()
    if doped_df.loc[i, col_name] > med_val:
        doped_df.loc[i, col_name] = \
            clean_df[col_name].quantile(np.random.random()/2)
    else:
        doped_df.loc[i, col_name] = \
            clean_df[col_name].quantile(0.5 + np.random.random()/2)
# Define a method to test a specified detector
def test_detector(clf, title, df, clean_df, doped_df, ax):
    clf.fit(clean_df)
    df = df.copy()
    doped_df = doped_df.copy()
    df['Scores'] = clf.decision_function(df)
    df['Source'] = 'Real'
    doped_df['Scores'] = clf.decision_function(doped_df)
    doped_df['Source'] = 'Doped'
    test_df = pd.concat([df, doped_df])
    sns.boxplot(data=test_df, orient='h', x='Scores', y='Source', ax=ax)
    ax.set_title(title)
# Plot each detector in terms of how well it scores doped records
# higher than the original records
fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(10, 3))
test_detector(IForest(), "IForest", df, clean_df, doped_df, ax[0])
test_detector(LOF(), "LOF", df, clean_df, doped_df, ax[1])
test_detector(ECOD(), "ECOD", df, clean_df, doped_df, ax[2])
plt.tight_layout()
plt.show()
Here, to create the doped records, we copy the full set of original records, so we will have an equal number of doped and original records. For each doped record, we select one feature randomly to modify. If the original value is above the median, we create a random value below the median; if the original value is below the median, we create a random value above it.
In this example, we see that the Isolation Forest does score the doped records higher, but not significantly so. LOF does a good job distinguishing the doped records, at least for this form of doping. ECOD is a detector that detects only unusually small or unusually large single values and does not test for unusual combinations of values. As the doping used in this example does not create extreme values, only unusual combinations, ECOD is unable to distinguish the doped records from the original records.
This example uses boxplots to compare the detectors, but normally we would use an objective score, very often the AUROC (Area Under the Receiver Operating Characteristic curve), to evaluate each detector. We would also typically test many combinations of model type, pre-processing, and parameters.
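For example, continuing from the listing above (and assuming its df, clean_df, and doped_df variables), we could compute an AUROC for any one detector by treating the doped records as the positive class:

# A sketch of scoring a detector with AUROC: real records are labelled 0,
# doped records 1, and we measure how well the detector's scores separate them.
# Assumes df, clean_df, and doped_df from the listing above.
from sklearn.metrics import roc_auc_score
from pyod.models.lof import LOF

clf = LOF()
clf.fit(clean_df)

scores = list(clf.decision_function(df)) + list(clf.decision_function(doped_df))
labels = [0] * len(df) + [1] * len(doped_df)
print(roc_auc_score(labels, scores))

A higher AUROC indicates the detector more consistently ranks the doped records above the real ones.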
The above method will tend to create doped records that violate the normal associations between features, but other doping techniques may be used to make this more likely. For example, considering first categorical columns, we may select a new value such that both:
- The new value is different from the original value
- The new value is different from the value that would be predicted from the other values in the row. To achieve this, we can create a predictive model that predicts the current value of this column, for example a Random Forest Classifier. A rough sketch of this approach is shown below.
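# A rough sketch for a single categorical column (the data and column names
# are hypothetical, and this is only one way to implement the idea): pick a
# doped value that differs both from the original value and from the value a
# Random Forest predicts from the other columns.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def dope_categorical(df, target_col):
    # Train a model to predict target_col from the other columns
    X = pd.get_dummies(df.drop(columns=[target_col]))
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X, df[target_col])
    predicted = clf.predict(X)

    doped_df = df.copy()
    for i, idx in enumerate(df.index):
        # Candidate values: anything other than the original and predicted values
        candidates = [v for v in df[target_col].unique()
                      if v not in (df.loc[idx, target_col], predicted[i])]
        if candidates:
            doped_df.loc[idx, target_col] = rng.choice(candidates)
    return doped_df

# Hypothetical example data
df = pd.DataFrame({'Colour': ['red', 'blue', 'green', 'red', 'blue'] * 20,
                   'Size':   ['S',   'L',    'M',     'S',   'L'] * 20})
doped_df = dope_categorical(df, 'Colour')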
With numeric data, we can achieve the equivalent by dividing each numeric feature into four quartiles (or some number of quantiles, but at least three). For each new value in a numeric feature, we then select a value such that both:
- The new value is in a different quartile than the original value
- The new value is in a different quartile than what would be predicted given the other values in the row
For example, if the original value is in Q1 and the predicted value is in Q2, then we can select a value randomly from either Q3 or Q4. The new value will then, most likely, go against the normal relationships among the features. A rough sketch of this is shown below.
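# A rough sketch for a single numeric column (again with hypothetical data,
# using a Random Forest regressor to estimate the predicted value): choose a
# doped value from a quartile different from both the original value's
# quartile and the quartile of the predicted value.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def dope_numeric(df, target_col):
    # Quartile boundaries of the original column
    quartile_edges = df[target_col].quantile([0.0, 0.25, 0.5, 0.75, 1.0]).values

    # Predict the column from the other columns
    reg = RandomForestRegressor(random_state=0)
    reg.fit(df.drop(columns=[target_col]), df[target_col])
    predicted = reg.predict(df.drop(columns=[target_col]))

    def quartile_of(value):
        # Map a value to quartile 0..3 using the inner quartile boundaries
        return int(np.clip(np.searchsorted(quartile_edges[1:-1], value), 0, 3))

    doped_df = df.copy()
    for i, idx in enumerate(df.index):
        used = {quartile_of(df.loc[idx, target_col]), quartile_of(predicted[i])}
        q = rng.choice([qq for qq in range(4) if qq not in used])
        # Draw a random value uniformly from within the chosen quartile
        doped_df.loc[idx, target_col] = rng.uniform(quartile_edges[q],
                                                    quartile_edges[q + 1])
    return doped_df

# Hypothetical correlated data
a = rng.normal(size=200)
df = pd.DataFrame({'A': a, 'B': a + rng.normal(scale=0.2, size=200),
                   'C': rng.normal(size=200)})
doped_df = dope_numeric(df, 'B')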
There is no definitive way to say how anomalous a record is once doped. However, we can assume that, on average, the more features modified, and the more they are modified, the more anomalous the doped records will be. We can take advantage of this to create not a single test suite, but multiple test suites, which allows us to evaluate the outlier detectors much more accurately.
For example, we can create a set of doped records that are very obvious (multiple features are modified in each record, each to a value significantly different from the original value), a set of doped records that are very subtle (only a single feature is modified, and not significantly from the original value), and many levels of difficulty in between. This can help differentiate the detectors well.
So, we can create a set of test sets, where each test set has a (roughly estimated) level of difficulty based on the number of features modified and the degree to which they are modified. We may also have different sets that modify different features, given that outliers in some features may be more relevant, or may be easier or harder to detect.
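A sketch of this idea, using synthetic data, where num_cols_modified gives a rough control over the difficulty of each test suite (the doping here again simply swaps in values drawn from elsewhere in the same column):

# A sketch of creating multiple test suites of different difficulty: the more
# columns modified per record, the more obviously anomalous the doped records
# should be on average.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def create_doped_set(df, n_records, num_cols_modified):
    doped_df = df.sample(n=n_records, random_state=0).copy()
    for idx in doped_df.index:
        cols = rng.choice(df.columns, size=num_cols_modified, replace=False)
        for col in cols:
            # Replace with a random value drawn from elsewhere in the column
            doped_df.loc[idx, col] = df[col].sample(n=1).values[0]
    return doped_df

# Hypothetical data with associated features
a = rng.normal(size=500)
df = pd.DataFrame({'A': a, 'B': a * 2, 'C': a + rng.normal(scale=0.1, size=500)})

# One subtle, one medium, and one obvious test suite
test_suites = {'subtle': create_doped_set(df, 100, 1),
               'medium': create_doped_set(df, 100, 2),
               'obvious': create_doped_set(df, 100, 3)}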
It is, though, important that any doping performed represents the type of outliers that would be of interest if they did appear in the real data. Ideally, the set of doped records also covers well the range of what we would be interested in detecting.
If these conditions are met, and multiple test sets are created, this is very powerful for selecting the best-performing detectors and estimating their performance on future data. We cannot predict how many outliers will be detected or what levels of false positives and false negatives we will see; these depend greatly on the data we will encounter, which in an outlier detection context is very difficult to predict. But we can have a good sense of the types of outliers we are likely to detect and those we are not.
Possibly more importantly, we are also well positioned to create an effective ensemble of outlier detectors. In outlier detection, ensembles are typically necessary for most projects. Given that some detectors will catch some types of outliers and miss others, while other detectors will catch and miss other types, we can often only reliably catch the full range of outliers we are interested in by using multiple detectors.
Creating ensembles is a large and involved area in itself, and is different from ensembling with predictive models. But, for this article, we can note that having an understanding of which types of outliers each detector is able to detect gives us a sense of which detectors are redundant and which can detect outliers most others are not able to.
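For example, one simple way (of many) to combine detectors is to min-max scale each detector's scores and average them. Below is a minimal sketch, assuming the df, clean_df, and doped_df variables from the earlier listing:

# A minimal sketch of a simple ensemble: min-max scale each detector's scores
# and average them, then check how well the ensemble separates the doped
# records from the real ones. Assumes df, clean_df, and doped_df from the
# earlier listing.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.ecod import ECOD

detectors = [IForest(), LOF(), ECOD()]
test_df = pd.concat([df, doped_df])
labels = [0] * len(df) + [1] * len(doped_df)

all_scores = []
for det in detectors:
    det.fit(clean_df)
    scores = det.decision_function(test_df)
    all_scores.append(MinMaxScaler().fit_transform(scores.reshape(-1, 1)).ravel())

ensemble_scores = np.mean(all_scores, axis=0)
print(roc_auc_score(labels, ensemble_scores))

In practice, we would compare this AUROC to those of the individual detectors to see whether the ensemble adds value.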
It is difficult to assess how well any given outlier detector detects outliers in the current data, and even harder to assess how well it may do on future (unseen) data. It is also very difficult, given two or more outlier detectors, to assess which would do better, again on both the current data and future data.
There are, though, a number of ways we can estimate these using synthetic data. In this article we went over, at least quickly (skipping many of the nuances, but covering the main ideas), one approach based on doping real records and evaluating how well we are able to score these more highly than the original data. Although not perfect, these methods can be invaluable, and there is very often no other practical alternative with outlier detection.
All images are by the author.