
Outlier Detection Using Distribution Fitting in Univariate Datasets

Photo by Randy Fath on Unsplash

Anomaly or novelty detection is applicable in a wide range of situations where a clear, early warning of an abnormal condition is required, such as sensor data, security operations, and fraud detection, among others. Due to the nature of the problem, outliers do not present themselves frequently, and due to the lack of labels, it can be difficult to create supervised models. Outliers are also called anomalies or novelties, but there are some fundamental differences in the underlying assumptions and the modeling process. Here I will discuss the fundamental differences between anomalies and novelties, and the concepts of outlier detection. With a hands-on example, I will demonstrate how to create an unsupervised model for the detection of anomalies and novelties using probability density fitting for univariate data sets. The distfit library is used across all examples.

Anomalies and novelties are both observations that deviate from what is standard, normal, or expected. The collective name for such observations is the outlier. Typically, outliers present themselves in the (relative) tail of a distribution, far away from the rest of the density. In addition, if you observe large spikes in density for a given value or a small range of values, it may point toward possible outliers. Although the aim of anomaly and novelty detection is the same, there are some conceptual modeling differences [1], briefly summarized as follows:

Anomalies are outliers that are known to be present in the training data and deviate from what is normal or expected. In such cases, we should aim to fit a model on the observations that have the expected/normal behavior (also named inliers) and ignore the deviant observations. The observations that fall outside the expected/normal behavior are the outliers.

Novelties are outliers that are not known to be present in the training data. The data does not contain observations that deviate from what is normal/expected. Novelty detection can be more challenging as there is no reference of an outlier. Domain knowledge is more important in such cases to prevent model overfitting on the inliers.

I just pointed out that the difference between anomalies and novelties lies in the modeling process. But there is more to it. Before we can start modeling, we need to set some expectations about what an outlier should look like. There are roughly three types of outliers (Figure 1), summarized as follows:

  • Global outliers (also named point outliers) are single, independent observations that deviate from all other observations [1, 2]. When someone speaks about “outliers”, it is usually about the global outlier.
  • Contextual outliers occur when a particular observation does not fit in a specific context. A context can present itself in a bimodal or multimodal distribution, and an outlier deviates within its context. For instance, temperatures below 0 are normal in winter but are unusual in the summer and are then called outliers. Besides time series and seasonal data, other known applications are in sensor data [3] and security operations [4].
  • Collective outliers (or group outliers) are a group of similar/related instances with unusual behavior compared to the rest of the data set [5]. The group of outliers can form a bimodal or multimodal distribution because they often indicate a different type of problem than individual outliers, such as a batch processing error or a systemic problem in the data generation process. Note that the detection of collective outliers typically requires a different approach than detecting individual outliers.
Figure 1. From left to right, an example of global, contextual, and collective outliers. Image by the author.
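As a minimal illustration, the three outlier types can be mimicked with synthetic numpy data (my own sketch, not taken from the original figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Global outlier: a single observation far away from all others
global_data = np.append(rng.normal(0, 1, 500), 15.0)

# Contextual outlier: 25 degrees is normal in summer but not in winter
winter_temps = rng.normal(2, 3, 250)
contextual_data = np.append(winter_temps, 25.0)

# Collective outliers: a small group of related points forming a second mode
collective_data = np.append(rng.normal(0, 1, 500), rng.normal(8, 0.3, 20))

print(global_data.max(), contextual_data.max(), collective_data.size)
```

Note that the contextual outlier (25 degrees) is only an outlier relative to the winter context; pooled with summer temperatures, it would look perfectly normal.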

One more part that needs to be discussed before we can start modeling outliers is the data set itself. From a data set perspective, outliers can be detected based on a single feature (univariate) or based on multiple features per observation (multivariate). Keep on reading, because the next section is about univariate and multivariate analysis.

A modeling approach for the detection of any type of outlier has two main flavors: univariate and multivariate analysis (Figure 2). I will focus on the detection of outliers for univariate random variables, but not before briefly describing the differences:

  • The univariate approach is when the sample/observation is marked as an outlier using one variable at a time, e.g., a person's age, weight, or a single variable in time series data. Analyzing the data distribution in such cases is well suited for outlier detection.
  • The multivariate approach is when the sample/observations contain multiple features that can be jointly analyzed, such as age, weight, and height together. It is well suited to detect outliers with features that have (non-)linear relationships, or where the distribution of values in each variable is (highly) skewed. In these cases, the univariate approach may not be as effective, as it does not take the relationships between variables into account.
Figure 2. Overview of univariate vs. multivariate analysis for the detection of outliers. Image by the author.

There are various (non-)parametric approaches for the detection of outliers in univariate data sets, such as Z-scores, Tukey's fences, and density-based approaches, among others. The common theme across these methods is that the underlying distribution is modeled. The distfit library [6] is therefore well suited for outlier detection, as it can determine the Probability Density Function (PDF) for univariate random variables, but it can also model univariate data sets in a non-parametric manner using percentiles or quantiles. Furthermore, it can be used to model anomalies or novelties in any of the three categories: global, contextual, or collective outliers. See this blog for more details about distribution fitting using the distfit library [6]. The modeling approach can be summarized as follows:

  1. Compute the fit of your random variable across various PDFs, then rank the PDFs using the goodness of fit test, and evaluate with a bootstrap approach. Note that non-parametric approaches with quantiles or percentiles can also be used.
  2. Visually inspect the histogram, PDFs, CDFs, and Quantile-Quantile (QQ) plot.
  3. Select the best model based on steps 1 and 2, but also make sure the properties of the (non-)parametric model (e.g., the PDF) match the use case. Selecting the best model is not only a statistical question; it is also a modeling decision.
  4. Make predictions on new unseen samples using the (non-)parametric model, such as the PDF.
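For comparison with this PDF-based workflow, the two classic univariate detectors mentioned earlier, Z-scores and Tukey's fences, can be sketched in a few lines of numpy (synthetic data; the thresholds 3 and 1.5 are the conventional choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(163, 10, 1000), [120, 215])  # two injected outliers

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]

# Tukey's fences: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
tukey_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(z_outliers, tukey_outliers)
```

Both rules flag the two injected values; the difference is that Tukey's fences rely on quantiles and are therefore robust against the outliers themselves inflating the threshold.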

Let’s start with a simple and intuitive example to demonstrate how novelty detection works for univariate variables, using distribution fitting and hypothesis testing. In this example, our aim is to pursue a novelty approach for the detection of global outliers, i.e., the data does not contain observations that deviate from what is normal/expected. This means that, at some point, we should carefully include domain knowledge to set the boundaries of what an outlier looks like.

Suppose we have measurements of 10,000 human heights. Let's generate random normal data with mean=163 and std=10 that represents our human height measurements. We expect a bell-shaped curve with two tails: those with smaller and those with larger heights than average. Note that due to the stochastic component, results can differ slightly when repeating the experiment.

# Import library
import numpy as np

# Generate 10000 samples from a normal distribution
X = np.random.normal(163, 10, 10000)

Step 1: Determine the PDFs that best fit human height.

Before we can detect any outliers, we need to fit a distribution (PDF) on what is normal/expected behavior for human height. The distfit library can fit up to 89 theoretical distributions. I will limit the search to only common/popular probability density functions, as we readily expect a bell-shaped curve (see the following code section).

# Install distfit library
pip install distfit

# Import libraries
import matplotlib.pyplot as plt
from distfit import distfit

# Initialize for common/popular distributions with bootstrapping.
dfit = distfit(distr='popular', n_boots=100)

# Estimate the best fit
results = dfit.fit_transform(X)

# Plot the RSS and bootstrap results for the top scoring PDFs
dfit.plot_summary()

# Show the plot
plt.show()

Figure 3. The RSS scores for the fit of human height with the most common distributions.

The loggamma PDF is detected as the best fit for human height according to the goodness of fit test statistic (RSS) and the bootstrap approach. Note that the bootstrap approach evaluates whether there was overfitting for the PDFs. The bootstrap score ranges between [0, 1] and depicts the fit-success ratio across the number of bootstraps (n_boots=100) for the PDF. It can also be seen from Figure 3 that, besides the loggamma PDF, multiple other PDFs are detected with a low Residual Sum of Squares, i.e., Beta, Gamma, Normal, T-distribution, Loggamma, Generalized Extreme Value, and the Weibull distribution. However, only five PDFs passed the bootstrap approach.
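The two ranking ingredients, the RSS score and the bootstrap score, can be approximated with plain scipy. This is my reconstruction of the idea, not distfit's exact internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X = rng.normal(163, 10, 2000)

# Goodness of fit (RSS): squared distance between the empirical histogram
# and each fitted candidate PDF, evaluated at the bin centers
hist, edges = np.histogram(X, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
scores = {}
for name, dist in {'norm': stats.norm, 'gamma': stats.gamma}.items():
    params = dist.fit(X)
    scores[name] = np.sum((hist - dist.pdf(centers, *params)) ** 2)

# Bootstrap score: refit on resamples of the data and count how often the
# refitted PDF still passes a Kolmogorov-Smirnov test (fit-success ratio)
n_boots, passed = 50, 0
for _ in range(n_boots):
    Xb = rng.choice(X, size=len(X), replace=True)
    loc, scale = stats.norm.fit(Xb)
    passed += stats.kstest(Xb, 'norm', args=(loc, scale)).pvalue > 0.05

print(min(scores, key=scores.get), passed / n_boots)
```

A candidate with a low RSS but a low fit-success ratio is exactly the overfitting case that the bootstrap column in Figure 3 is meant to expose.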

Step 2: Visual inspection of the best-fitting PDFs.

A best practice is to visually inspect the distribution fit. The distfit library contains built-in functionalities for plotting, such as the histogram combined with the PDF/CDF, but also QQ-plots. The plot can be created as follows:

# Make figure
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

# PDF for only the best fit
dfit.plot(chart='PDF', n_top=1, ax=ax[0])

# CDF for the top 10 fits
dfit.plot(chart='CDF', n_top=10, ax=ax[1])

# Show the plot
plt.show()

Figure 4. Pareto plot with the histogram for the empirical data and the estimated PDF. Left panel: PDF with the best fit (Beta). Right panel: CDF with the top 10 best fits. The confidence intervals are based on alpha=0.05.

A visual inspection confirms the goodness of fit scores for the top-ranked PDFs. However, there is one exception: the Weibull distribution (yellow line in Figure 4) appears to have two peaks. In other words, although the RSS is low, a visual inspection does not show a good fit for our random variable. Note that the bootstrap approach readily excluded the Weibull distribution, and now we know why.

Step 3: Decide using the PDF properties.

The last step may be the most challenging, because five candidate distributions still scored very well in the goodness of fit test, the bootstrap approach, and the visual inspection. We should now decide which PDF matches best on its fundamental properties to model human height. I will stepwise elaborate on the properties of the top candidate distributions with respect to our use case of modeling human height.

The Normal distribution is a common choice, but it is important to note that the assumption of normality for human height may not hold in all populations. It has no heavy tails and may therefore not capture outliers very well.

The Student's t-distribution is often used as an alternative to the normal distribution when the sample size is small or the population variance is unknown. It has heavier tails than the normal distribution, which may better capture the presence of outliers or skewness in the data. In the case of low sample sizes, this distribution might have been an option, but as the sample size increases, the t-distribution approaches the normal distribution.
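To make the tail argument concrete, compare the upper-tail probability beyond three standard deviations under a normal versus a Student's t PDF with the same location and scale (df=5 is an arbitrary illustrative choice, not a fitted value):

```python
from scipy import stats

# P(height > 193) under both models; 193 cm is mean + 3*std
p_norm = stats.norm.sf(193, loc=163, scale=10)
p_t = stats.t.sf(193, df=5, loc=163, scale=10)

print(f'normal: {p_norm:.5f}, t(df=5): {p_t:.5f}')
```

The t-distribution assigns roughly an order of magnitude more probability to this extreme height, which is what "heavier tails" means in practice.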

The Gamma distribution is a continuous distribution that is often used to model data that is positively skewed, meaning that there is a long tail of high values. Human height may be positively skewed due to the presence of outliers, such as very tall individuals. However, the bootstrap approach showed a poor fit.

The Log-gamma distribution has a skewed shape, similar to the gamma distribution, but with heavier tails. It models the log of the values, which makes it more appropriate to use when the data contains a large number of high values.

The Beta distribution is often used to model proportions or rates [9], rather than continuous variables such as height in our use case. It would have been a suitable choice if height were divided by a reference value, such as the median height. So despite scoring best on the goodness of fit test and being confirmed by visual inspection, it would not be my first choice.

The Generalized Extreme Value (GEV) distribution can be used to model the distribution of extreme values in a population, such as the maximum or minimum values. It also allows heavy tails, which may capture the presence of outliers or skewness in the data. However, it is typically used to model the distribution of extreme values [10], rather than the overall distribution of a continuous variable such as human height.

The Dweibull (double Weibull) distribution may not be the best match for this research question, as it is typically used to model data with a monotonic increasing or decreasing trend, such as time-to-failure or time-to-event data [11]. Human height data does not have a clear monotonic trend. The visual inspection of the PDF/CDF/QQ-plot also showed no good match.

To summarize, the loggamma distribution may be the best choice in this particular use case after considering the goodness of fit test, the bootstrap approach, the visual inspection, and now also the PDF properties related to the research question. Note that we can easily specify the loggamma distribution and re-fit on the input data if required (see code section).

# Initialize with the loggamma distribution and narrower confidence intervals.
dfit = distfit(distr='loggamma', alpha=0.01, bound='both')

# Estimate the best fit
results = dfit.fit_transform(X)

# Print model parameters
print(dfit.model)

# {'name': 'loggamma',
#  'score': 6.676334203908028e-05,
#  'loc': -1895.1115726427015,
#  'scale': 301.2529482991781,
#  'arg': (927.596119872062,),
#  'params': (927.596119872062, -1895.1115726427015, 301.2529482991781),
#  'color': '#e41a1c',
#  'CII_min_alpha': 139.80923469906566,
#  'CII_max_alpha': 185.8446340627711}

# Save model
dfit.save('loggamma_model.pkl')
Step 4: Predictions for new unseen samples.

With the fitted model, we can assess the significance of new (unseen) samples and detect whether they deviate from what is normal/expected (the inliers). Predictions are made on the theoretical probability density function, making them lightweight, fast, and explainable. The confidence intervals for the PDF are set using the alpha parameter. This is the part where domain knowledge is required, because there are no known outliers present in our data set. In this case, I set the confidence interval (CII) to alpha=0.01, which results in a minimum boundary of 139.8 cm and a maximum boundary of 185.8 cm. By default both tails are analyzed, but this can be changed with the bound parameter (see code section above).
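Conceptually, these boundaries are simply the alpha and 1-alpha quantiles of the fitted PDF. A scipy reconstruction of that idea (my own sketch, not distfit's internal code; the exact numbers vary with the random sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
X = rng.normal(163, 10, 10000)

# Fit the loggamma PDF and take the alpha and 1-alpha quantiles as the
# lower/upper confidence boundaries (cf. CII_min_alpha / CII_max_alpha)
alpha = 0.01
c, loc, scale = stats.loggamma.fit(X)
cii_min = stats.loggamma.ppf(alpha, c, loc=loc, scale=scale)
cii_max = stats.loggamma.ppf(1 - alpha, c, loc=loc, scale=scale)

print(round(cii_min, 1), round(cii_max, 1))
```

The resulting boundaries land close to the 139.8 cm and 185.8 cm reported above, because 1% of the probability mass is cut off in each tail.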

We can use the predict function to make predictions on new unseen samples and create a plot with the prediction results (Figure 5). Keep in mind that significance is corrected for multiple testing: multtest='fdr_bh'. Outliers can thus be located outside the confidence interval but not marked as significant.

# New human heights
y = [130, 160, 200]

# Make predictions
results = dfit.predict(y, alpha=0.01, multtest='fdr_bh', todf=True)

# Print the prediction results
print(results)

# y y_proba y_pred P
# 0 130.0 0.000642 down 0.000428
# 1 160.0 0.391737 none 0.391737
# 2 200.0 0.000321 up 0.000107

fig, ax = plt.subplots(1, 2, figsize=(20, 8))
# PDF for only the best fit
dfit.plot(chart='PDF', ax=ax[0])
# CDF for the top 10 fits
dfit.plot(chart='CDF', ax=ax[1])
# Show plot
plt.show()

Figure 5. Left panel: histogram of the empirical data and the log-gamma PDF. The black line is the empirical data distribution. The red line is the fitted theoretical distribution. The red vertical lines are the confidence intervals, set at alpha=0.01. The green dashed lines are detected as outliers and the red crosses are not significant. (image by the author)

The results of the predictions are stored in results and contain multiple columns: y, y_proba, y_pred, and P. The P column holds the raw p-values, and y_proba holds the probabilities after multiple-test correction (default: fdr_bh). Note that a data frame is returned when using the todf=True parameter. Two observations have a probability alpha<0.01 and are marked as significant (up or down).
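The 'fdr_bh' correction is the Benjamini-Hochberg procedure. A small numpy implementation (my own sketch, not distfit's code) reproduces the adjusted y_proba values from the raw p-values in the table above:

```python
import numpy as np

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values (the 'fdr_bh' correction)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)
    return out

# Raw p-values from the prediction table -> adjusted probabilities
print(fdr_bh([0.000428, 0.391737, 0.000107]))  # [0.000642, 0.391737, 0.000321]
```

This is why y_proba differs from P: each raw p-value is inflated according to its rank among all tested samples, which controls the false discovery rate.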

So far we have seen how to fit a model and detect global outliers for novelty detection. Here we will use real-world data for the detection of anomalies. Real-world data is often much more challenging to work with. To demonstrate this, I will download the data set of natural gas spot prices from Thomson Reuters [7], which is an open-source and freely available dataset [8]. After downloading, importing, and removing nan values, there are 6555 data points across 27 years.

# Initialize distfit
dfit = distfit()

# Import dataset
df = dfit.import_example(data='gas_spot_price')

# price
# date
# 2023-02-07 2.35
# 2023-02-06 2.17
# 2023-02-03 2.40
# 2023-02-02 2.67
# 2023-02-01 2.65
# ...
# 1997-01-13 4.00
# 1997-01-10 3.92
# 1997-01-09 3.61
# 1997-01-08 3.80
# 1997-01-07 3.82

# [6555 rows x 1 columns]

Visual inspection of the data set.

To visually inspect the data, we can create a line plot of the natural gas spot price to see whether there are any obvious trends or other relevant matters (Figure 6). It can be seen that 2003 and 2021 contain two major peaks (which hint toward global outliers). Furthermore, the price movements seem to follow a natural motion, with local highs and lows. Based on this line plot, we can build an intuition of the expected distribution. The price moves mainly in the range [2, 5], but with some exceptional years from 2003 to 2009, where the range was more between [6, 9].

# Create the line plot
dfit.lineplot(df, xlabel='Years', ylabel='Natural gas spot price', grid=True)

# Show the plot
plt.show()

Figure 6. Open data set of natural gas spot prices from Thomson Reuters [7, 8].

Let’s use distfit to investigate the data distribution more deeply and determine the accompanying PDF. The search space is set to all available PDFs, and the bootstrap approach is set to 100 to evaluate the PDFs for overfitting.

# Import library
from distfit import distfit

# Initialize with all available distributions and bootstrapping
dfit = distfit(distr='full', n_boots=100)

# Search for the best theoretical fit.
results = dfit.fit_transform(df['price'].values)

# Plot PDF/CDF
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
dfit.plot(chart='PDF', n_top=10, ax=ax[0])
dfit.plot(chart='CDF', n_top=10, ax=ax[1])

# Show plot
plt.show()

Figure 7. Left: PDF, and right: CDF. All fitted theoretical distributions are shown in different colors. (image by the author)

The best-fitting PDF is Johnson SB (Figure 7), but when we plot the empirical data distribution, the PDF (red line) does not precisely follow the empirical data. In general, we can confirm that the majority of data points are located in the range [2, 5] (this is where the peak of the distribution is) and that there is a second, smaller peak in the distribution with price movements around value 6. This is also the point where the PDF does not smoothly fit the empirical data and causes some undershoots and overshoots. With the summary plot and QQ plot, we can investigate the fit even better. Let's create these two plots with the following lines of code:

# Plot summary and QQ-plot
fig, ax = plt.subplots(1, 2, figsize=(25, 10))

# Summary plot
dfit.plot_summary(ax=ax[0])

# QQ-plot
dfit.qqplot(df['price'].values, n_top=10, ax=ax[1])

# Show the plot
plt.show()

It is interesting to see in the summary plot that the goodness of fit test showed good results (low score) among all the top distributions. However, when we look at the results of the bootstrap approach, it shows that all but one distribution are overfitted (Figure 8A, orange line). This is not entirely unexpected, because we already noticed the overshooting and undershooting. The QQ plot confirms that the fitted distributions deviate strongly from the empirical data (Figure 8B). Only the Johnson SB distribution showed a (borderline) good fit.

Figure 8. A. Left panel: PDFs sorted on the bootstrap score and the goodness of fit test. B. Right panel: QQ-plot comparing the empirical distribution against all theoretical distributions. (image by the author)

Detection of Global and Contextual Outliers.

We will proceed with the Johnson SB distribution and the predict functionality for the detection of outliers. We already know that our data set contains outliers, as we followed the anomaly approach, i.e., the distribution is fitted on the inliers, and observations that fall outside the confidence intervals can be marked as potential outliers. With the predict function and the lineplot, we can detect and plot the outliers. It can be seen from Figure 9 that the global outliers are detected, but also some contextual outliers, even though we did not model for them explicitly. Red bars are the underrepresented outliers and green bars are the overrepresented outliers. The alpha parameter can be set to tune the confidence intervals.

# Make prediction
dfit.predict(df['price'].values, alpha=0.05, multtest=None)

# Line plot with data points outside the confidence interval.
dfit.lineplot(df['price'], labels=df.index)

Figure 9. Plotting outliers after fitting the distribution and making predictions. Green bars are outliers outside the upper bound of the 95% CII. Red bars are outliers outside the lower bound of the 95% CII.
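For reference, the core of this predict step can be approximated with scipy alone. This is a sketch on synthetic data (the real price series is not reproduced here), with the Johnson SB support bounds held fixed for optimizer stability; these parameter values are illustrative assumptions, not the fitted values from the article:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a skewed, positive price series
price = stats.johnsonsb.rvs(1.5, 1.2, loc=1, scale=12, size=3000, random_state=42)

# Fit the two shape parameters with the support interval (0, 20) held fixed
a, b, loc, scale = stats.johnsonsb.fit(price, floc=0, fscale=20)

# Flag observations outside the two-sided interval (alpha=0.05 per tail)
lo = stats.johnsonsb.ppf(0.05, a, b, loc=loc, scale=scale)
hi = stats.johnsonsb.ppf(0.95, a, b, loc=loc, scale=scale)
up = price > hi      # cf. the green bars in Figure 9
down = price < lo    # cf. the red bars

print(up.sum(), down.sum())
```

With alpha=0.05 per tail, roughly 5% of the observations end up in each flagged group; domain knowledge then decides whether a tighter alpha is warranted.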
