Perform outlier detection more effectively using subsets of features

Discover relevant subspaces: subsets of features that let you most effectively perform outlier detection on tabular data

This article is part of a series on the challenges of identifying outliers in data, and the techniques that can be used to do so, including articles on using PCA, Distance Metric Learning, Shared Nearest Neighbors, Frequent Patterns Outlier Factor, Counts Outlier Detector (a multi-dimensional histogram-based method), and doping. This article also contains an excerpt from my book, Outlier Detection in Python.

We look here at techniques to create, instead of a single outlier detector examining all features within a dataset, a series of smaller outlier detectors, each working with a subset of the features (referred to as subspaces).

When performing outlier detection on tabular data, we're looking for the records in the data that are the most unusual, either relative to the other records in the same dataset, or relative to previous data.

There are a number of challenges related to finding the most meaningful outliers, in particular that there is no definition of statistically unusual that definitively specifies which anomalies in the data should be considered the strongest. As well, the outliers that are most relevant for your purposes (and not necessarily the most statistically unusual) will be specific to your project, and may evolve over time.

There are also a number of technical challenges that appear in outlier detection. Among these are the difficulties that occur where data has many features. As covered in previous articles related to Counts Outlier Detector and Shared Nearest Neighbors, where we have many features, we often face a problem known as the curse of dimensionality.

This has a number of implications for outlier detection, including that it makes distance metrics unreliable. Many outlier detection algorithms rely on calculating the distances between records in order to identify as outliers the records that are similar to unusually few other records, and that are unusually different from most other records: that is, records that are close to few other records and far from most other records.

For example, if we have a table with 40 features, each record in the data may be viewed as a point in 40-dimensional space, and its outlierness can be evaluated by the distances from it to the other points in this space. This, then, requires a way to measure the distance between records. A variety of measures are used, with Euclidean distances being quite common (assuming the data is numeric, or is converted to numeric values). So, the outlierness of each record is often measured based on the Euclidean distance between it and the other records in the dataset.

These distance calculations can, though, break down where we're working with many features. In fact, issues with distance metrics may appear even with only ten or twenty features, and very often with about thirty or forty or more.

We should note, though, that issues dealing with large numbers of features do not appear with all outlier detectors. For example, they do not tend to be significant when working with univariate tests (tests such as z-score or interquartile range tests, which consider each feature separately, independently of the other features, described in more detail in A Simple Example Using PCA for Outlier Detection) or when using categorical outlier detectors such as FPOF.

However, the majority of outlier detectors commonly used are numeric multi-variate outlier detectors: detectors that assume all features are numeric, and that generally work on all features at once. For example, LOF (Local Outlier Factor) and KNN (k-Nearest Neighbors) are two of the most widely-used detectors, and these both evaluate the outlierness of each record based on its distances (in the high-dimensional spaces the data points live in) to the other records.

Consider the plots below. These present a dataset with six features, shown in three 2D scatter plots. The dataset includes two points that can reasonably be considered outliers, P1 and P2.

Looking, for now, at P1, it is far from the other points, at least in feature A. That is, considering just feature A, P1 can easily be flagged as an outlier. However, most detectors will consider the distance of each point to the other points using all six dimensions, which, unfortunately, means P1 may not necessarily stand out as an outlier, due to the nature of distance calculations in high-dimensional spaces. P1 is fairly typical in the other five features, and so its distance to the other points, in 6D space, may be fairly normal.

Nevertheless, we can see that this general approach to outlier detection, where we examine the distances from each record to the other records, is quite reasonable: P1 and P2 are outliers because they are far (at least in some dimensions) from the other points.

As KNN and LOF are very commonly used detectors, we'll look at them a little closer here, and then look specifically at using subspaces with these algorithms.

With the KNN outlier detector, we pick a value for k, which determines how many neighbors each record is compared to. Let's say we pick 10 (in practice, this would be a fairly typical value).

For each record, we then measure the distance to its 10 nearest neighbors, which gives a good sense of how isolated and remote each point is. We then need to create a single outlier score (i.e., a single number) for each record based on these 10 distances. For this, we generally take either the mean or the maximum of these distances.

Let's assume we take the maximum (using the mean, median, or another function works similarly, though each has its nuances). If a record has an unusually large distance to its 10th nearest neighbor, this means there are at most 9 records that are reasonably close to it (and possibly fewer), and that it is otherwise unusually far from most other points, so it can be considered an outlier.
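To make this concrete, below is a minimal sketch of a KNN-style outlier score on a small random dataset. It uses scikit-learn's NearestNeighbors (an illustration only; PyOD, introduced later in this article, provides a full KNN detector), and scores each record by the distance to its kth nearest neighbor.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=10):
    # Use k+1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1)
    nn.fit(X)
    dists, _ = nn.kneighbors(X)
    # The last column holds the distance to the kth true neighbor
    # (the maximum of the k distances); this is the outlier score
    return dists[:, -1]

np.random.seed(0)
X = np.random.randn(100, 6)
scores = knn_outlier_scores(X, k=10)
print(np.argsort(scores)[-5:])  # indexes of the 5 highest-scored records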

With the LOF outlier detector, we use a similar approach, though it works a bit differently. We also look at the distance of each point to its k nearest neighbors, but then compare this to the distances of those k neighbors to their own k nearest neighbors. So LOF measures the outlierness of each point relative to the other points in its neighborhood.

That is, while KNN uses a global standard to determine what are unusually large distances to neighbors, LOF uses a local standard to determine what are unusually large distances.

The details of the LOF algorithm are actually a bit more involved, and the implications of the specific differences in these two algorithms (and the many variations of these algorithms) are covered in more detail in Outlier Detection in Python.
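For comparison, the following is a minimal example of running LOF using scikit-learn's LocalOutlierFactor (PyOD also provides an LOF detector with a similar interface), again on a small random dataset for illustration.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(0)
X = np.random.randn(100, 6)

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit(X)

# negative_outlier_factor_ is near -1 for inliers and much more negative for
# outliers, so we flip the sign to get a score where larger means more unusual
scores = -lof.negative_outlier_factor_
print(np.argsort(scores)[-5:])  # indexes of the 5 highest-scored records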

These are interesting considerations in themselves, but the main point for here is that KNN and LOF both evaluate records based on their distances to their closest neighbors, and that these distance metrics can work sub-optimally (or even completely break down) when using many features at once, a problem that is reduced greatly by working with small numbers of features (subspaces) at a time.

The idea of using subspaces is useful even where the detector used does not rely on distance metrics, but where detectors based on distance calculations are used, some of the benefits of using subspaces can be a bit more clear. And, using distances in ways similar to KNN and LOF is quite common among detectors. As well as KNN and LOF, for example, Radius, ODIN, INFLO, and LoOP detectors, as well as detectors based on sampling, and detectors based on clustering, all use distances.

However, issues with the curse of dimensionality can occur with other detectors as well. For example, ABOD (Angle-based Outlier Detector) uses the angles between records to evaluate the outlierness of each record, as opposed to the distances. But the idea is similar, and using subspaces can also be helpful when working with ABOD.

As well, other advantages of subspaces I'll go through below apply equally to many detectors, whether using distance calculations or not. Still, the curse of dimensionality is a serious concern in outlier detection: where detectors use distance calculations (or similar measures, such as angle calculations), and there are many features, these distance calculations can break down. In the plots above, P1 and P2 may be detected well considering only six dimensions, and quite possibly if using 10 or 20 features, but if there were, say, 100 dimensions, the distances between all points would actually end up about the same, and P1 and P2 would not stand out at all as unusual.

Outside of the issues related to working with very large numbers of features, our attempts to identify the most unusual records in a dataset can be undermined even when working with fairly small numbers of features.

While very large numbers of features can make the distances calculated between records meaningless, even moderate numbers of features can make records that are unusual in only one or two features more difficult to identify.

Consider again the scatter plot shown earlier, repeated here. Point P1 is an outlier in feature A (though not in the other five features). Point P2 is unusual in features C and D (but not in the other four features). However, when considering the Euclidean distances of these points to the other points in 6-dimensional space, they may not reliably stand out as outliers. The same would be true using Manhattan, and most other distance metrics as well.

The left pane shows point P1 in a 2D dataspace. The point is unusual considering feature A, but less so if using Euclidean distances in the full 6D dataspace, or even the 2D dataspace shown in this plot. This is an example where using additional features can be counterproductive. In the middle pane, we see another point, P2, which is an outlier in the C-D subspace but not in the A-B or E-F subspaces. We need only features C and D to identify this outlier, and again including other features will simply make P2 more difficult to identify.

P1, for example, even in the 2D space shown in the left-most plot, is not unusually far from most other points. It is unusual that there are no other points near it (which KNN and LOF will detect), but the distance from P1 to the other points in this 2D space is not unusual: it is similar to the distances between most other pairs of points.

Using a KNN algorithm, we would likely be able to detect this, at least if k is set fairly low, for example, to 5 or 10: most records have their 5th (and their 10th) nearest neighbors much closer than P1 does. Though, when including all six features in the calculations, this is much less clear than when viewing just feature A, or just the left-most plot, with just features A and B.

Point P2 stands out well as an outlier when considering just features C and D. Using a KNN detector with a k value of, say, 5, we can identify its 5 nearest neighbors, and the distances to these would be larger than is typical for points in this dataset.

Using an LOF detector, again with a k value of, say, 5, we can compare the distances to P1's or P2's 5 nearest neighbors to the distances from those neighbors to their own 5 nearest neighbors, and here as well, the distance from P1 or P2 to their 5 nearest neighbors would be found to be unusually large.

At least, this is straightforward when considering only features A and B, or features C and D, but again, when considering the full 6D space, they become more difficult to identify as outliers.

While many outlier detectors should be able to identify P1 and P2 even with six, or a few more, dimensions, it is clearly easier and more reliable to use fewer features. To detect P1, we really only need to consider feature A; and to identify P2, we really only need to consider features C and D. Including other features in the process simply makes this more difficult.

This is actually a common theme with outlier detection. We often have many features in the datasets we work with, and each can be useful. For example, if we have a table with 50 features, it may be that all 50 features are relevant: either a rare value in any of these features would be interesting, or a rare combination of values in two or more features, for each of these 50 features, would be interesting. It would be, then, worth keeping all 50 features for analysis.

But, to identify any one anomaly, we generally need only a small number of features. In fact, it is very rare for a record to be unusual in all features. And it is very rare for a record to have an anomaly based on a rare combination of many features (see Counts Outlier Detector for more explanation of this).

Any given outlier will likely have a rare value in one or two features, or a rare combination of values in a pair, or a set of perhaps three or four features. Only these features are necessary to identify the anomalies in that row, even though the other features may be necessary to detect the anomalies in other rows.

To address these issues, an important technique in outlier detection is using subspaces. The term subspaces simply refers to subsets of the features. In the example above, if we use the subspaces A-B, C-D, E-F, A-E, B-C, B-D-F, and A-B-E, then we have seven subspaces (five 2D subspaces and two 3D subspaces). Creating these, we would run one (or more) detectors on each subspace, so would run at least seven detectors on each record.

Realistically, subspaces become more useful where we have many more features than six, and generally even the subspaces themselves will have more than six features, and not just two or three, but viewing this simple case, for now, with a small number of small subspaces is fairly easy to understand.

Using these subspaces, we can more reliably find P1 and P2 as outliers. P1 would likely be scored high by the detector running on features A-B, the detector running on features A-E, and the detector running on features A-B-E. P2 would likely be detected by the detector running on features C-D, and possibly the detector running on B-C.

However, we have to be careful: using only these seven subspaces, as opposed to a single 6D space covering all features, would miss any rare combinations of, for example, A and D, or C and E. These may or may not be detected using a detector covering all six features, but definitely could not be detected using a suite of detectors that simply never examine these combinations of features.

Using subspaces does have some large advantages, but does have some risk of missing relevant outliers. We'll cover some techniques to generate subspaces below that mitigate this issue, but it can be useful to still run one or more outlier detectors on the full dataspace as well. In general, with outlier detection, we're rarely able to find the full set of outliers we're interested in unless we apply many techniques. As important as using subspaces can be, it is still often useful to use a variety of techniques, which may include running some detectors on the full data.

Similarly, with each subspace, we may execute multiple detectors. For example, we may use both a KNN and LOF detector, as well as Radius, ABOD, and possibly a number of other detectors: again, using multiple techniques allows us to better cover the range of outliers we wish to detect. A simple sketch of this idea is shown below.
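The following is a minimal sketch of this general approach, using the subspaces listed above and a made-up dataframe df with columns A through F. Scikit-learn's LocalOutlierFactor stands in for the detector within each subspace (the PyOD detectors discussed later can be used the same way), and each record's final score is its maximum over the subspaces.

import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 6), columns=list('ABCDEF'))

subspaces = [['A', 'B'], ['C', 'D'], ['E', 'F'],
             ['A', 'E'], ['B', 'C'], ['B', 'D', 'F'], ['A', 'B', 'E']]

scores = []
for subspace in subspaces:
    lof = LocalOutlierFactor(n_neighbors=10)
    lof.fit(df[subspace])
    scores.append(-lof.negative_outlier_factor_)

# Combine the per-subspace scores; in practice the scores may need scaling
# first if the subspaces differ substantially in size
df['Outlier Score'] = np.max(np.vstack(scores), axis=0)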

We've seen, then, a couple of motivations for working with subspaces: we can mitigate the curse of dimensionality, and we can reduce the cases where anomalies are not identified reliably because they are based on small numbers of features that are lost among many features.

As well as handling situations like this, there are a number of other advantages to using subspaces with outlier detection. These include:

  • Accuracy due to the effects of using ensembles: Using multiple subspaces allows us to create ensembles (collections of outlier detectors), which allows us to combine the results of many detectors. In general, using ensembles of detectors provides greater accuracy than using a single detector. This is similar (though with some real differences too) to the way ensembles of predictors tend to be stronger for classification and regression problems than a single predictor. Here, using subspaces, each record is examined multiple times, which provides a more stable evaluation of each record than any single detector would.
  • Interpretability: The results can be more interpretable, and interpretability is often a key concern in outlier detection. Very often in outlier detection, we're flagging unusual records with the idea that they may be a concern, or a point of interest, in some way, and often they will be manually examined. Knowing why they are unusual is necessary to be able to do this efficiently and effectively. Manually assessing outliers that are flagged by detectors that examined many features can be especially difficult; on the other hand, outliers flagged by detectors using only a small number of features can be much more manageable to assess.
  • Faster systems: Using fewer features allows us to create faster (and less memory-intensive) detectors. This can speed up both fitting and inference, particularly when working with detectors whose execution time is non-linear in the number of features (many detectors are, for example, quadratic in execution time based on the number of features). Depending on the detectors, using, say, 20 detectors, each covering 8 features, may actually execute faster than a single detector covering 100 features.
  • Execution in parallel: Given that we use many small detectors instead of one large detector, it is possible to execute both the fitting and the predicting steps in parallel, allowing for faster execution where there are the hardware resources to support this.
  • Ease of tuning over time: Using many simple detectors creates a system that is easier to tune over time. Very often with outlier detection, we're simply evaluating a single dataset and wish simply to identify the outliers in this. But it's also very common to execute outlier detection systems on a long-running basis, for example, monitoring industrial processes, website activity, financial transactions, the data being input to machine learning systems or other software applications, the output of these systems, and so on. In these cases, we generally wish to improve the outlier detection system over time, allowing us to focus better on the more relevant outliers. Having a collection of simple detectors, each based on a small number of features, makes this much more manageable. It allows us to, over time, increase the weight of the more useful detectors and decrease the weight of the less useful detectors.

As indicated, we'll need, for each dataset evaluated, to determine the appropriate subspaces. It can, though, be difficult to find the relevant set of subspaces, or at least to find the optimal set of subspaces. That is, assuming we are interested in finding any unusual combinations of values, it can be difficult to know which sets of features will contain the most relevant of the unusual combinations.

For example, if a dataset has 100 features, we may train 10 models, each covering 10 features. We may use, say, the first 10 features for the first detector, the second set of 10 features for the second, and so on. If the first two features have some rows with anomalous combinations of values, we will detect this. But if there are anomalous combinations related to the first feature and any of the 90 features not covered by the same model, we will miss these.

We can improve the odds of putting relevant features together by using many more subspaces, but it can be difficult to ensure all sets of features that should be together are actually together at least once, particularly where there are relevant outliers in the data that are based on three, four, or more features, which must appear together in at least one subspace to be detected. For example, in a table of staff expenses, you may wish to identify expenses with rare combinations of Department, Expense Type, and Amount. In that case, these three features must appear together in at least one subspace.

So, we have the questions of how many features should be in each subspace, which features should go together, and how many subspaces to create.

There is a very large number of combinations to consider. If there are 20 features, there are 2²⁰ possible subspaces, which is just over one million. If there are 30 features, there are over a billion. If we decide ahead of time how many features will be in each subspace, the number of combinations decreases, but is still very large. If there are 20 features and we wish to use subspaces with 8 features each, there are 20 choose 8, or 125,970 combinations. If there are 30 features and we wish for subspaces with 7 features each, there are 30 choose 7, or 2,035,800 combinations.

One approach we might take is to keep the subspaces small, which allows for greater interpretability. The most interpretable option, using two features per subspace, also allows for simple visualization. However, if we have d features, we will need d*(d-1)/2 models to cover all combinations, which can be intractable. With 100 features, we would require 4,950 detectors. We usually need to use at least several features per detector, though not necessarily a large number.
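These counts are easy to verify with Python's math module:

from math import comb

print(2 ** 20)           # 1,048,576 possible subspaces with 20 features
print(2 ** 30)           # 1,073,741,824 (over a billion) with 30 features
print(comb(20, 8))       # 125,970 subspaces of 8 features from 20
print(comb(30, 7))       # 2,035,800 subspaces of 7 features from 30

d = 100
print(d * (d - 1) // 2)  # 4,950 pairs of features with 100 features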

We wish to use enough detectors, and enough features per detector, that each pair of features ideally appears together at least once, and few enough features per detector that the detectors have largely different features from each other. For example, if each detector used 90 out of the 100 features, we would cover all combinations of features well, but the subspaces would still be quite large (undoing much of the benefit of using subspaces), and all the subspaces would be quite similar to each other (undoing much of the benefit of creating ensembles).

While the number of features per subspace requires balancing these concerns, the number of subspaces created is a bit more straightforward: in terms of accuracy, using more subspaces is strictly better, but is computationally more expensive.

There are a few broad approaches to finding useful subspaces. I list these here quickly, then look at some in more detail below.

  • Based on domain knowledge: Here we consider which sets of features could potentially have combinations of values we would consider noteworthy.
  • Based on associations: Unusual combinations of values are only possible if a set of features are associated in some way. In prediction problems, we often wish to minimize the correlations between features, but with outlier detection, these are the features that are most useful to consider together. The features with the strongest associations will have the most meaningful outliers if there are exceptions to the normal patterns.
  • Based on finding very sparse regions: Records are typically considered outliers if they are unlike most other records in the data, which implies they are located in sparse regions of the data. Therefore, useful subspaces can be found as those that contain large, nearly-empty regions.
  • Randomly: This is the method used by a technique shown later called FeatureBagging and, while it can be suboptimal, it avoids the expensive searches for associations and sparse regions, and can work reasonably well where many subspaces are used.
  • Exhaustive searches: This is the method employed by Counts Outlier Detector. It is limited to subspaces with small numbers of features, but the results are highly interpretable. It also avoids any computation, or biases, associated with selecting only a subset of the possible subspaces.
  • Using the features related to any known outliers: If we have a set of known outliers, can identify why they are outliers (the relevant features), and are in a situation where we do not wish to identify unknown outliers (only these specific outliers), then we can take advantage of this, identify the sets of features relevant for each known outlier, and construct models for the various sets of features required.

We'll look at a few of these next in a little more detail.

Domain knowledge

Let's take the example of a dataset, specifically an expenses table, shown below. If examining this table, we may be able to determine the types of outliers we would and would not be interested in. Unusual combinations of Account and Amount, as well as unusual combinations of Department and Account, may be of interest; whereas Date of Expense and Time would likely not be a useful combination. We can continue in this way, creating a small number of subspaces, each with likely two, three, or four features, which can allow for very efficient and interpretable outlier detection, flagging the most relevant outliers.

Expenses table

This may miss cases where we have an association in the data, though the association is not obvious. So, as well as taking advantage of domain knowledge, it may be worth searching the data for associations. We can identify relationships among the features, for example, by testing where features can be predicted accurately from the other features using simple predictive models (a small sketch of this appears below). Where we find such associations, these can be worth investigating.

Discovering these associations, though, may be useful for some purposes, but may or may not be useful for the outlier detection process. If there is, for example, a relationship between accounts and the time of day, this may simply be due to the process people happen to typically use to submit their expenses, and it may be that deviations from this are of interest, but more likely they are not.
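Below is a small sketch of this kind of association check, using a made-up dataframe for illustration: each feature is predicted from the others with a simple model, and features that can be predicted well (high R²) are candidates to group together into subspaces.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

np.random.seed(0)
df = pd.DataFrame(np.random.randn(500, 4), columns=['A', 'B', 'C', 'D'])
df['D'] = df['C'] * 2 + np.random.randn(500) * 0.1  # D is strongly associated with C

for col in df.columns:
    X = df.drop(columns=[col])
    y = df[col]
    r2 = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y,
                         cv=3, scoring='r2').mean()
    print(col, round(r2, 3))  # C and D score high; A and B score near zero or below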

Random feature subspaces

Creating subspaces randomly can be effective if there is no domain knowledge to draw on. This is fast and can create a set of subspaces that will tend to catch the strongest outliers, though it may miss some important outliers too.

The code below provides an example of one method to create a set of random subspaces. This example uses a set of eight features, named A through H, and creates a set of subspaces of these.

Each subspace starts by selecting the feature that is so far the least-used (if there is a tie, one is selected randomly). It uses a variable called ft_used_counts to track this. It then adds features to this subspace one at a time, at each step selecting the feature that has appeared in other subspaces the least often with the features so far in the subspace. It uses a variable called ft_pair_mtx to track how many subspaces each pair of features has appeared in together so far. Doing this, we create a set of subspaces that matches each pair of features roughly equally often.

import pandas as pd
import numpy as np

def get_random_subspaces(features_arr, num_base_detectors,
                         num_feats_per_detector):
    num_feats = len(features_arr)
    feat_sets_arr = []
    ft_used_counts = np.zeros(num_feats)
    ft_pair_mtx = np.zeros((num_feats, num_feats))

    # Each loop generates one subspace, which is one set of features
    for _ in range(num_base_detectors):
        # Get the set of features with the minimum count
        min_count = ft_used_counts.min()
        idxs = np.where(ft_used_counts == min_count)[0]

        # Pick one of these randomly and add it to the current set
        feat_set = [np.random.choice(idxs)]

        # Find the remaining set of features
        while len(feat_set) < num_feats_per_detector:
            mtx_with_set = ft_pair_mtx[:, feat_set]
            sums = mtx_with_set.sum(axis=1)
            min_sum = sums.min()
            min_idxs = np.where(sums == min_sum)[0]
            new_feat = np.random.choice(min_idxs)
            feat_set.append(new_feat)
            feat_set = list(set(feat_set))

            # Update ft_pair_mtx
            for c in feat_set:
                ft_pair_mtx[c][new_feat] += 1
                ft_pair_mtx[new_feat][c] += 1

        # Update ft_used_counts
        for c in feat_set:
            ft_used_counts[c] += 1

        feat_sets_arr.append(feat_set)

    return feat_sets_arr

np.random.seed(0)
features_arr = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
num_base_detectors = 4
num_feats_per_detector = 5

feat_sets_arr = get_random_subspaces(features_arr,
                                     num_base_detectors,
                                     num_feats_per_detector)
for feat_set in feat_sets_arr:
    print([features_arr[x] for x in feat_set])

Normally we would create many more base detectors (each subspace often corresponds to one base detector, though we can also run multiple base detectors on each subspace) than we do in this example, but this uses just four to keep things simple. This will output the following subspaces:

['A', 'E', 'F', 'G', 'H']
['B', 'C', 'D', 'F', 'H']
['A', 'B', 'C', 'D', 'E']
['B', 'D', 'E', 'F', 'G']

The code here will create the subspaces such that all have the same number of features. There is also an advantage in having the subspaces cover different numbers of features, as this can introduce some more diversity (which is important when creating ensembles), but there is strong diversity in any case from using different features (so long as each uses a relatively small number of features, such that the subspaces are largely different).

Having the same number of features has a couple of advantages. It simplifies tuning the models, as many parameters used by outlier detectors depend on the number of features. If all subspaces have the same number of features, they can also use the same parameters.

It also simplifies combining the scores, as the detectors will be more comparable to each other. If using different numbers of features, this can produce scores that are on different scales, and not easily comparable. For example, with k-Nearest Neighbors (KNN), we expect greater distances between neighbors if there are more features.
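As a quick illustration of this point (not from the original article), the typical distance to the 10th nearest neighbor grows with the number of features, even for otherwise identical random data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

np.random.seed(0)
for num_feats in [2, 5, 20, 50]:
    X = np.random.randn(1000, num_feats)
    nn = NearestNeighbors(n_neighbors=11).fit(X)
    dists, _ = nn.kneighbors(X)
    # Mean distance to the 10th nearest neighbor; larger with more features
    print(num_feats, dists[:, -1].mean())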

Feature subspaces based on correlations

Everything else equal, in creating the subspaces, it's useful to keep associated features together as much as possible. In the code below, we provide an example of code to select subspaces based on correlations.

There are several ways to test for associations. We can create predictive models to attempt to predict each feature from each other single feature (this will capture even relatively complex relationships between features). With numeric features, the simplest method is likely to check for Spearman correlations, which will miss nonmonotonic relationships, but will detect most strong relationships. This is what is used in the code example below.

To execute the code, we first specify the number of subspaces desired and the number of features in each.

This executes by first finding all pairwise correlations between the features and storing these in a matrix. We then create the first subspace, starting by finding the largest correlation in the correlation matrix (this adds two features to this subspace) and then looping over the number of other features to be added to this subspace. For each, we take the largest correlation in the correlation matrix for any pair of features, such that one feature is currently in the subspace and one is not. Once this subspace has a sufficient number of features, we create the next subspace, taking the largest correlation remaining in the correlation matrix, and so on.

For this example, we use a real dataset, the baseball dataset from OpenML (available with a public license). The dataset turns out to contain some large correlations. The correlation, for example, between At bats and Runs is 0.94, indicating that any values that deviate significantly from this pattern would likely be outliers.

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

# Function to find the pair of features remaining in the matrix with the
# highest correlation
def get_highest_corr():
    return np.unravel_index(
        np.argmax(corr_matrix.values, axis=None),
        corr_matrix.shape)

def get_correlated_subspaces(corr_matrix, num_base_detectors,
                             num_feats_per_detector):
    sets = []

    # Loop through each subspace to be created
    for _ in range(num_base_detectors):
        m1, m2 = get_highest_corr()

        # Start each subspace as the two remaining features with
        # the highest correlation
        curr_set = [m1, m2]
        for _ in range(2, num_feats_per_detector):
            # Get the other remaining correlations
            m = np.unravel_index(np.argsort(corr_matrix.values, axis=None),
                                 corr_matrix.shape)
            m0 = m[0][::-1]
            m1 = m[1][::-1]
            for i in range(len(m0)):
                d0 = m0[i]
                d1 = m1[i]
                # Add the pair if either feature is already in the subset
                if (d0 in curr_set) or (d1 in curr_set):
                    curr_set.append(d0)
                    curr_set = list(set(curr_set))
                    if len(curr_set) < num_feats_per_detector:
                        curr_set.append(d1)
                        # Remove duplicates
                        curr_set = list(set(curr_set))
                if len(curr_set) >= num_feats_per_detector:
                    break

            # Update the correlation matrix, removing the features now used
            # in the current subspace
            for i in curr_set:
                i_idx = corr_matrix.index[i]
                for j in curr_set:
                    j_idx = corr_matrix.columns[j]
                    corr_matrix.loc[i_idx, j_idx] = 0
            if len(curr_set) >= num_feats_per_detector:
                break

        sets.append(curr_set)
    return sets

data = fetch_openml('baseball', version=1)
df = pd.DataFrame(data.data, columns=data.feature_names)

corr_matrix = abs(df.corr(method='spearman'))
corr_matrix = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
corr_matrix = corr_matrix.fillna(0)

feat_sets_arr = get_correlated_subspaces(corr_matrix, num_base_detectors=5,
                                         num_feats_per_detector=4)
for feat_set in feat_sets_arr:
    print([df.columns[x] for x in feat_set])

This produces:

['Games_played', 'At_bats', 'Runs', 'Hits']
['RBIs', 'At_bats', 'Hits', 'Doubles']
['RBIs', 'Games_played', 'Runs', 'Doubles']
['Walks', 'Runs', 'Games_played', 'Triples']
['RBIs', 'Strikeouts', 'Slugging_pct', 'Home_runs']

PyOD is likely the most comprehensive and widely-used tool for outlier detection on numeric tabular data available in Python today. It includes a large number of detectors, ranging from very simple to very complex, including several deep learning-based methods.

Now that we have an idea of how subspaces work with outlier detection, we'll look at two tools provided by PyOD that work with subspaces, called SOD and FeatureBagging. Both of these tools identify a set of subspaces, execute a detector on each subspace, and combine the results into a single score for each record.

Whether using subspaces or not, it's necessary to determine what base detectors to use. If not using subspaces, we would select one or more detectors and run these on the full dataset. And, if we are using subspaces, we again select one or more detectors, here running these on each subspace. As indicated above, LOF and KNN can be reasonable choices, but PyOD provides a number of others as well that can work well if executed on each subspace, including, for example, Angle-based Outlier Detector (ABOD), models based on Gaussian Mixture Models (GMMs), Kernel Density Estimation (KDE), and several others. Other detectors, provided outside PyOD, can work very effectively as well.
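As a brief illustration (assuming PyOD is installed; module paths follow PyOD's documentation, but check your installed version), the following instantiates a few of these detectors, any of which could serve as the base detector run on each subspace. Here X is simply stand-in random data.

import numpy as np
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.abod import ABOD

np.random.seed(0)
X = np.random.randn(500, 6)  # stand-in data; in practice, one subspace of the dataset

# Fit each candidate base detector and inspect its top-scored records
for clf in [KNN(n_neighbors=10), LOF(n_neighbors=10), ABOD()]:
    clf.fit(X)
    print(type(clf).__name__, np.argsort(clf.decision_scores_)[-3:])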

SOD was designed specifically to handle situations such as those shown in the scatter plots above. SOD works, similar to KNN and LOF, by identifying a neighborhood of k neighbors for each point, known as the reference set. The reference set is found in a different way, though, using a method called shared nearest neighbors (SNN).

Shared nearest neighbors are described thoroughly in this article, but the general idea is that if two points are generated by the same mechanism, they will tend to not only be close, but also to have many of the same neighbors. And so, the similarity of any two records can be measured by the number of shared neighbors they have. Given this, neighborhoods can be identified by using not only the sets of points with the smallest Euclidean distances between them (as KNN and LOF do), but the points with the most shared neighbors. This tends to be robust even in high dimensions and even where there are many irrelevant features: the rank order of neighbors tends to remain meaningful even in these cases, and so the set of nearest neighbors can be reliably found even where specific distances cannot.
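As a rough sketch of this idea (not SOD's internal implementation), shared-neighbor counts can be computed from the k-nearest-neighbor lists, here found with scikit-learn's NearestNeighbors:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idxs = nn.kneighbors(X)
    neighbor_sets = [set(row[1:]) for row in idxs]  # exclude the point itself
    n = len(X)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Similarity is the number of k-nearest neighbors the two points share
            sim[i, j] = sim[j, i] = len(neighbor_sets[i] & neighbor_sets[j])
    return sim

np.random.seed(0)
X = np.random.randn(200, 10)
sim = snn_similarity(X, k=10)
print(sim[0, 1:6])  # shared-neighbor counts between record 0 and records 1 through 5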

Once we have the reference set, we use this to determine the subspace, which here is the set of features that explain the greatest amount of variance for the reference set. Once we identify these subspaces, SOD examines the distances of each point to the data center.

I provide a quick example using SOD below. This assumes pyod has been installed, which requires running:

pip install pyod

We'll use, as an example, a synthetic dataset, which allows us to experiment with the data and model hyperparameters to get a better sense of the strengths and limitations of each detector. The code here provides an example of working with 35 features, where two features (features 8 and 9) are correlated and the other features are irrelevant. A single outlier is created as an unusual combination of the two correlated features.

SOD is able to identify the one known outlier as the top outlier. I set the contamination rate to 0.01 to specify returning (given there are 100 records) only a single outlier. Testing this beyond 35 features, though, SOD scores this point much lower. This example specifies the size of the reference set to be 3; different results may be seen with different values.

import pandas as pd
import numpy as np
from pyod.models.sod import SOD

np.random.seed(0)
d = np.random.randn(100, 35)
d = pd.DataFrame(d)

# Ensure features 8 and 9 are correlated, while all others are irrelevant
d[9] = d[9] + d[8]

# Insert a single outlier
d.loc[99, 8] = 3.5
d.loc[99, 9] = -3.8

# Execute SOD, flagging just one outlier
clf = SOD(ref_set=3, contamination=0.01)
clf.fit(d)
d['SOD Scores'] = clf.labels_

We display four scatterplots below, showing four pairs of the 35 features. The known outlier is shown as a star in each of these. We can see features 8 and 9 (the two relevant features) in the second pane, and we can see the point is a clear outlier there, though it is typical in all other dimensions.

Testing SOD with 35-dimensional data. One outlier was inserted into the data and can be seen clearly in the second pane for features 8 and 9. Although the point is typical otherwise, it is flagged as the top outlier by SOD. The third pane also includes feature 9, and we can see the point is somewhat unusual here, though no more so than many other points are in other dimensions. The relationship between features 8 and 9 is the most relevant, and SOD appears to detect this.

FeatureBagging was designed to solve the same problem as SOD, though it takes a different approach to determining the subspaces. It creates the subspaces completely randomly (so slightly differently than the example above, which keeps a record of how often each pair of features are placed in a subspace together and attempts to balance this). It also subsamples the rows for each base detector, which provides a little more diversity between the detectors.

A specified number of base detectors are used (10 by default, though it is preferable to use more), each of which selects a random set of rows and features. For each, the maximum number of features that may be selected is specified as a parameter, defaulting to all. So, for each base detector, FeatureBagging:

  • Determines the number of features to use, up to the specified maximum.
  • Chooses this many features randomly.
  • Chooses a set of rows randomly. This is a bootstrap sample of the same size as the number of rows.
  • Creates an LOF detector (by default; other base detectors may be used) to evaluate the subspace.

Once this is complete, each row will have been scored by each base detector, and the scores must then be combined into a single, final score for each row. PyOD's FeatureBagging provides two options for this: using the maximum score and using the mean score.

As we saw in the scatter plots above, points can be strong outliers in some subspaces and not in others, and averaging in their scores from the subspaces where they are typical can water down their scores and defeat the benefit of using subspaces. In other forms of ensembling with outlier detection, using the mean can work well, but when working with multiple subspaces, using the maximum will typically be the better of the two options. Doing that, we give each record a score based on the subspace where it was most unusual. This isn't perfect either, and there can be better options, but using the maximum is simple and is almost always preferable to the mean.

Any detector can be used within the subspaces. PyOD uses LOF by default, as did the original paper describing FeatureBagging. LOF is a strong detector and a sensible choice, though you may find better results with other base detectors.

In the original paper, subspaces are created randomly, each using between d/2 and d - 1 features, where d is the total number of features. Some researchers have pointed out that the number of features used in the original paper is likely much larger than is appropriate.

If the full number of features is large, using over half the features at once will allow the curse of dimensionality to take effect. And using many features in each detector will result in the detectors being correlated with each other (for example, if all base detectors use 90% of the features, they will use roughly the same features and tend to score each record roughly the same), which can also remove much of the benefit of creating ensembles.

PyOD allows setting the number of features used in each subspace, and it should typically be set fairly low, with a large number of base estimators created. A brief usage sketch is shown below.
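Below is a brief usage sketch of PyOD's FeatureBagging (parameter names follow PyOD's documented API at the time of writing; check your installed version). Here we use more base estimators than the default of 10 and combine scores with the maximum, as discussed above; the max_features parameter can also be set to control how many features each base detector may use.

import numpy as np
from pyod.models.feature_bagging import FeatureBagging

np.random.seed(0)
X = np.random.randn(1000, 20)  # stand-in data

clf = FeatureBagging(n_estimators=50,    # more base detectors than the default
                     combination='max',  # score each record by its most unusual subspace
                     random_state=0)
clf.fit(X)
scores = clf.decision_scores_            # one score per record; higher is more unusual
print(np.argsort(scores)[-5:])           # indexes of the 5 highest-scored records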

In this article we've looked at subspaces as a way to improve outlier detection in a number of ways, including reducing the curse of dimensionality, increasing interpretability, allowing parallel execution, allowing easier tuning over time, and so on. Each of these is an important consideration, and using subspaces is often very helpful.

There are, though, often other approaches as well that can be used for these purposes, sometimes as alternatives, and sometimes in combination with using subspaces. For example, to improve interpretability, it's important to, as much as possible, select model types that are inherently interpretable (for example, univariate tests such as z-score tests, Counts Outlier Detector, or a detector provided by PyOD called ECOD).

Where the main interest is in reducing the curse of dimensionality, here again, it can be useful to look at model types that scale well to many features, for instance Isolation Forest or Counts Outlier Detector. It can also be useful to look at executing univariate tests, or applying PCA.

One thing to be aware of when constructing subspaces, if they are formed based on correlations or on sparse regions, is that the relevant subspaces may change over time as the data changes. New associations may emerge between features, and new sparse regions may form that will be useful for identifying outliers, though these will be missed if the subspaces are not recalculated from time to time. Finding the relevant subspaces in these ways can be quite effective, but they may need to be updated on some schedule, or where the data is known to have changed.

It's common with outlier detection projects on tabular data for it to be worth looking at using subspaces, particularly where we have many features. Using subspaces is a relatively straightforward technique with a number of noteworthy advantages.

Where you face issues related to large data volumes, execution times, or memory limits, using PCA may also be a useful technique, and may work better in some cases than creating subspaces, though working with subspaces (and so, working with the original features, and not the components created by PCA) can be substantially more interpretable, and interpretability is often quite important with outlier detection.

Subspaces can be used in combination with other techniques to improve outlier detection. For example, using subspaces can be combined with other ways to create ensembles: it's possible to create larger ensembles using both subspaces (where different detectors in the ensemble use different features) as well as different model types, different training rows, different pre-processing, and so on. This can provide some further advantages, though with some increase in computation as well.

All images by author
