## Learn how to detect outliers using Probability Density Functions for fast, lightweight models and explainable results.

Anomaly or novelty detection is applicable in a broad range of situations where a transparent, early warning of an abnormal condition is required, such as for sensor data, security operations, and fraud detection, among others. Due to the nature of the problem, outliers do not present themselves frequently, and due to the lack of labels, it can be difficult to create supervised models. Outliers are also called anomalies or novelties, but there are some fundamental differences in the underlying assumptions and the modeling process. *Here I will discuss the fundamental differences between anomalies and novelties and the concepts of outlier detection. With a hands-on example, I will demonstrate how to create an unsupervised model for the detection of anomalies and novelties using probability density fitting for univariate data sets. The distfit library is used across all examples.*

Anomalies and novelties are both observations that deviate from what is standard, normal, or expected. The collective name for such observations is the **outlier**. In general, outliers present themselves on the (relative) tail of a distribution and are far away from the rest of the density. In addition, if you observe large spikes in density for a given value or a small range of values, it may point toward possible outliers.

*Although the aim of anomaly and novelty detection is the same, there are some conceptual modeling differences* [1], briefly summarized as follows:

**Anomalies** are outliers that are known to be present in the training data and deviate from what is normal or expected. In such cases, we should aim to fit a model on the observations that have the expected/normal behavior (also named inliers) and ignore the deviant observations. The observations that fall outside the expected/normal behavior are the outliers.

**Novelties** are outliers that are not known to be present in the training data. The data does not contain observations that deviate from what is normal/expected. Novelty detection can be more challenging as there is no reference of an outlier. Domain knowledge is more important in such cases to prevent model overfitting on the inliers.

I just pointed out that the difference between anomalies and novelties is in the modeling process. But there is more to it. Before we can start modeling, we need to set some expectations about “*what an outlier should look like*”. There are roughly three types of outliers (Figure 1), summarized as follows:


- **Global outliers** (also named point outliers) are single, independent observations that deviate from all other observations [1, 2]. When someone speaks about “*outliers*”, it is usually about the global outlier.
- **Contextual outliers** occur when a particular observation does not fit in a specific context. A context can present itself in a bimodal or multimodal distribution, and an outlier deviates within the context. For instance, temperatures below 0 are normal in winter but are unusual in the summer and are then called outliers. Besides time series and seasonal data, other known applications are in sensor data [3] and security operations [4].
- **Collective outliers** (or group outliers) are a group of observations that jointly deviate from the rest of the data, even though the individual observations may not be outliers on their own.

One more aspect that needs to be discussed before we can start modeling outliers is the **data set** itself. From a data set perspective, outliers can be detected based on a single feature (univariate) or based on multiple features per observation (multivariate). Keep on reading, because the next section is about univariate and multivariate analysis.

A modeling approach for the detection of any type of outlier has two main flavors: *univariate and multivariate analysis (Figure 2)*. I will focus on the detection of outliers for univariate random variables, but not before I briefly describe the differences:

**The univariate** approach is when the sample/observation is marked as an outlier using one variable at a time, i.e., a person’s age, weight, or a single variable in time series data. Analyzing the data distribution in such cases is well-suited for outlier detection.

**The multivariate** approach is when the sample/observations contain multiple features that can be jointly analyzed, such as age, weight, and height together. It is well suited to detect outliers with features that have (non-)linear relationships or where the distribution of values in each variable is (highly) skewed. In these cases, the univariate approach may not be as effective, as it does not take the relationships between variables into account.
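The masking effect that motivates the multivariate approach can be illustrated with a small sketch. This uses synthetic weight/height data and the Mahalanobis distance purely for illustration; it is not a distfit feature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated synthetic data: taller people tend to be heavier
height = rng.normal(170, 10, 1000)
weight = 0.9 * (height - 170) + rng.normal(70, 5, 1000)
X = np.column_stack([height, weight])

# A short but heavy person: unremarkable per variable, deviant jointly
point = np.array([150.0, 85.0])

# Univariate view: z-score per variable
z_height = abs(point[0] - height.mean()) / height.std()
z_weight = abs(point[1] - weight.mean()) / weight.std()

# Multivariate view: the Mahalanobis distance accounts for the correlation
diff = point - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
mahal = np.sqrt(diff @ cov_inv @ diff)

print(z_height, z_weight)  # both well below a typical cutoff of 3
print(mahal)               # far above it
```

Neither z-score flags the point, while the joint distance clearly does, because a short and heavy person violates the correlation between the two variables.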

There are various (non-)parametric manners for the detection of outliers in univariate data sets, such as Z-scores, Tukey’s fences, and density-based approaches, among others. The common theme across the methods is that the underlying distribution is modeled. The *distfit* library [6] is therefore well suited for outlier detection, as it can determine the Probability Density Function (PDF) for univariate random variables but can also model univariate data sets in a non-parametric manner using percentiles or quantiles. Moreover, it can be used to model anomalies or novelties in any of the three categories: global, contextual, or collective outliers. See this blog for more detailed information about distribution fitting using the *distfit* library [6]. The modeling approach can be summarized as follows:
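Before turning to density fitting, the two classic detectors just mentioned can be sketched in a few lines. This is a minimal illustration on synthetic data with two injected outliers, not a distfit feature:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(163, 10, 1000), [120.0, 230.0]])  # two injected outliers

# Z-score: flag observations more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]

# Tukey's fences: flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
t_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(z_outliers)
print(t_outliers)
```

Both methods recover the injected points (Tukey’s fences may flag a few genuine tail observations as well), but neither models the distribution itself; that is what the density-based approach adds.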

1. Compute the fit of your random variable across various PDFs, then rank the PDFs using the goodness of fit test, and evaluate with a bootstrap approach. *Note that non-parametric approaches with quantiles or percentiles can also be used.*
2. Visually inspect the histogram, PDFs, CDFs, and Quantile-Quantile (QQ) plot.
3. Select the best model based on steps 1 and 2, but also make sure the properties of the (non-)parametric model (e.g., the PDF) match the use case. *Selecting the best model is not only a statistical question; it is also a modeling decision.*
4. Make predictions on new unseen samples using the (non-)parametric model, such as the PDF.

Let’s start with a simple and intuitive example to demonstrate the working of novelty detection for univariate variables using distribution fitting and hypothesis testing. **In this example, our aim is to pursue a novelty approach for the detection of global outliers**, i.e., *the data does not contain observations that deviate from what is normal/expected.* This means that, at some point, we should carefully include domain knowledge to set the boundaries of what an outlier looks like.

Suppose we have measurements of 10,000 human heights. Let’s generate random normal data with `mean=163` and `std=10` that represents our *human height* measurements. We expect a bell-shaped curve that contains two tails: those with smaller and larger heights than average. *Note that due to the stochastic component, results can differ slightly when repeating the experiment.*

```python
# Import library
import numpy as np

# Generate 10000 samples from a normal distribution
X = np.random.normal(163, 10, 10000)
```

## Step 1: Determine the PDFs that best fit Human Height.

Before we can detect any outliers, we need to fit a distribution (PDF) on what is normal/expected behavior for human height. The *distfit* library can fit up to 89 theoretical distributions. I will limit the search to only common/popular probability density functions, as we readily expect a bell-shaped curve (*see the following code section*).

```python
# Install the distfit library first:
# pip install distfit

# Import libraries
import matplotlib.pyplot as plt
from distfit import distfit

# Initialize for common/popular distributions with bootstrapping
dfit = distfit(distr='popular', n_boots=100)

# Estimate the best fit
results = dfit.fit_transform(X)

# Plot the RSS and bootstrap results for the top scoring PDFs
dfit.plot_summary(n_top=10)

# Show the plot
plt.show()
```

The **loggamma** PDF is detected as the best fit for *human height* according to the goodness of fit test statistic (RSS) and the bootstrapping approach. Note that the bootstrap approach evaluates whether there was overfitting for the PDFs. The bootstrap score ranges between [0, 1] and depicts the fit-success ratio across the number of bootstraps (`n_boots=100`) for the PDF. It can also be seen from Figure 3 that, besides the *loggamma* PDF, multiple other PDFs are detected with a low Residual Sum of Squares, i.e., the *Beta, Gamma, Normal, T-distribution, Loggamma, Generalized Extreme Value, and Weibull distributions* (Figure 3). However, only five PDFs passed the bootstrap approach.
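The fit-success ratio can be sketched as follows: resample the data with replacement, refit the candidate PDF, and test whether the refit still describes the resample; the score is the fraction of bootstraps that pass. This is a minimal illustration of the idea using a Kolmogorov–Smirnov test, not distfit's exact implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(163, 10, 2000)

def bootstrap_score(X, dist, n_boots=50, alpha=0.05):
    """Fraction of bootstrap resamples for which the refitted PDF is not rejected."""
    passed = 0
    for _ in range(n_boots):
        Xb = rng.choice(X, size=len(X), replace=True)  # resample with replacement
        params = dist.fit(Xb)                          # refit the candidate PDF
        # KS test of the resample against the refitted distribution
        pval = stats.kstest(Xb, dist.cdf, args=params).pvalue
        passed += pval > alpha
    return passed / n_boots

score_norm = bootstrap_score(X, stats.norm)    # high: the normal PDF keeps fitting
score_expon = bootstrap_score(X, stats.expon)  # low: the exponential PDF does not
print(score_norm, score_expon)
```

A PDF that only fits by chance on the original sample will fail on many resamples, which is exactly the overfitting signal the bootstrap score captures.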

## Step 2: Visual inspection of the best-fitting PDFs.

A best practice is to visually inspect the distribution fit. The *distfit* library contains built-in functionalities for plotting, such as the histogram combined with the PDF/CDF, but also QQ-plots. The plot can be created as follows:

```python
import matplotlib.pyplot as plt

# Make figure
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

# PDF for only the best fit
dfit.plot(chart='PDF', n_top=1, ax=ax[0])

# CDF for the top 10 fits
dfit.plot(chart='CDF', n_top=10, ax=ax[1])

# Show the plot
plt.show()
```

A visual inspection confirms the goodness of fit scores for the top-ranked PDFs. However, there is one exception: the Weibull distribution (yellow line in Figure 4) appears to have two peaks. In other words, although the RSS is low, a visual inspection does not show a good fit for our random variable. *Note that the bootstrap approach readily excluded the Weibull distribution, and now we know why.*

## Step 3: Decide by also using the PDF properties.

The last step may be the most challenging one, because there are still five candidate distributions that scored very well in the goodness of fit test, the bootstrap approach, and the visual inspection. We should now decide which PDF fits best on its fundamental properties to model human height. I will stepwise elaborate on the properties of the top candidate distributions with respect to our use case of modeling human height.

**The Normal distribution** is a common choice, but it is important to note that the assumption of normality for human height may not hold in all populations. It has no heavy tails and therefore it may not capture outliers very well.

**The Student’s t-distribution** is often used as an alternative to the normal distribution when the sample size is small or the population variance is unknown. It has heavier tails than the normal distribution, which can better capture the presence of outliers or skewness in the data. In case of low sample sizes, this distribution could have been an option, but as the sample size increases, the t-distribution approaches the normal distribution.

**The Gamma distribution** is a continuous distribution that is often used to model data that are positively skewed, meaning that there is a long tail of high values. Human height may be positively skewed due to the presence of outliers, such as very tall individuals. However, the bootstrap approach showed a poor fit.

**The Log-gamma distribution** has a skewed shape, similar to the gamma distribution, but with heavier tails. It models the log of the values, which makes it more appropriate to use when the data contains a large number of high values.

**The Beta distribution** is often used to model proportions or rates [9], rather than continuous variables such as in our use case for height. It would have been a suitable choice if height were divided by a reference value, such as the median height. So despite it scoring best on the goodness of fit test, and despite confirming the fit with a visual inspection, it would not be my first choice.

**The Generalized Extreme Value (GEV) distribution** can be used to model the distribution of extreme values in a population, such as the maximum or minimum values. It also allows heavy tails, which can capture the presence of outliers or skewness in the data. However, it is typically used to model the distribution of extreme values [10], rather than the overall distribution of a continuous variable such as human height.

**The Dweibull distribution** may not be the best match for this research question, as it is typically used to model data that has a monotonic increasing or decreasing trend, such as time-to-failure or time-to-event data [11]. Human height data may not have a clear monotonic trend. The visual inspection of the PDF/CDF/QQ-plot also showed no good match.
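The tail arguments above can be made concrete with scipy (the library distfit builds on). A small illustration with hand-picked parameters; note that `loc=163`, `scale=10`, and `df=3` are chosen for demonstration, not fitted values:

```python
from scipy import stats

# Probability of observing a height above 200 cm under two shapes with the
# same location and scale: the heavy-tailed Student's t assigns far more
# mass to the extreme value than the normal does
p_norm = stats.norm.sf(200, loc=163, scale=10)
p_t = stats.t.sf(200, df=3, loc=163, scale=10)

print(p_norm)  # on the order of 1e-4
print(p_t)     # orders of magnitude larger
```

This is why the choice of PDF matters for outlier detection: the same observation can be highly significant under one candidate and unremarkable under another.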

To summarize, the **loggamma** distribution may be the best choice in this particular use case after considering the *goodness of fit test, the bootstrap approach, the visual inspection, and now also the PDF properties related to the research question*. Note that we can easily specify the *loggamma* distribution and re-fit on the input data if required (see the code section below).

```python
# Initialize with the loggamma distribution
dfit = distfit(distr='loggamma', alpha=0.01, bound='both')

# Estimate the best fit
results = dfit.fit_transform(X)

# Print model parameters
print(dfit.model)

# {'name': 'loggamma',
#  'score': 6.676334203908028e-05,
#  'loc': -1895.1115726427015,
#  'scale': 301.2529482991781,
#  'arg': (927.596119872062,),
#  'params': (927.596119872062, -1895.1115726427015, 301.2529482991781),
#  'color': '#e41a1c',
#  'CII_min_alpha': 139.80923469906566,
#  'CII_max_alpha': 185.8446340627711}

# Save model
dfit.save('./human_height_model.pkl')
```

## Step 4: Predictions for new unseen samples.

With the fitted model, we can assess the significance of new (unseen) samples and detect whether they deviate from what is normal/expected (the inliers). Predictions are made on the theoretical probability density function, making them lightweight, fast, and explainable. The confidence intervals for the PDF are set using the `alpha` parameter. **This is the part where domain knowledge is required, because there are no known outliers present in our data set.** In this case, I set the confidence interval (CII) with `alpha=0.01`, which results in a minimum boundary of 139.8 cm and a maximum boundary of 185.8 cm. By default, both tails are analyzed, but this can be changed using the `bound` parameter *(see the code section above)*.

We can use the `predict` function to make predictions on new unseen samples, and create the plot with the prediction results (Figure 5). Keep in mind that significance is corrected for multiple testing: `multtest='fdr_bh'`. *Outliers can thus be located outside the confidence interval but not marked as significant.*
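For intuition, the fdr_bh correction can be sketched as a minimal Benjamini–Hochberg implementation (distfit delegates this to a multiple-testing routine; this sketch is for illustration only):

```python
import numpy as np

def fdr_bh(pvals):
    """Benjamini-Hochberg: adjust raw p-values to control the false discovery rate."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                          # rank p-values ascending
    adj = p[order] * n / np.arange(1, n + 1)       # scale each p-value by n / rank
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

# The raw p-values from the prediction example in the next code section
# reproduce its y_proba column: 0.000642, 0.391737, 0.000321
print(fdr_bh([0.000428, 0.391737, 0.000107]))
```

The adjusted values are the `y_proba` column: each raw p-value is scaled by the number of tests divided by its rank, which is less conservative than a Bonferroni correction.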

```python
import matplotlib.pyplot as plt

# New human heights
y = [130, 160, 200]

# Make predictions
results = dfit.predict(y, alpha=0.01, multtest='fdr_bh', todf=True)

# The prediction results
results['df']

#        y   y_proba  y_pred         P
# 0  130.0  0.000642    down  0.000428
# 1  160.0  0.391737    none  0.391737
# 2  200.0  0.000321      up  0.000107

# Make figure
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

# PDF for only the best fit
dfit.plot(chart='PDF', ax=ax[0])

# CDF
dfit.plot(chart='CDF', ax=ax[1])

# Show the plot
plt.show()
```

The results of the predictions are stored in `results` and contain multiple columns: `y`, `y_proba`, `y_pred`, and `P`. The `P` stands for the raw p-values, and `y_proba` are the probabilities after multiple test correction (default: `fdr_bh`). Note that a data frame is returned when using the `todf=True` parameter. Two observations have a probability below `alpha=0.01` and are marked as significant, `up` or `down`.
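The raw p-values in the `P` column follow directly from the tails of the fitted CDF. A sketch of the computation, using a normal fit via scipy for simplicity (the loggamma fit from the article behaves almost identically on this data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
X = rng.normal(163, 10, 10_000)

# Fit the PDF on the inliers (normal here for simplicity)
loc, scale = stats.norm.fit(X)

def tail_pvalue(y):
    """Raw p-value: the probability mass in the nearer tail of the fitted PDF."""
    return min(stats.norm.cdf(y, loc, scale), stats.norm.sf(y, loc, scale))

for y in [130, 160, 200]:
    print(y, tail_pvalue(y))
```

The values land close to the `P` column above: far-tail heights get tiny p-values, while 160 cm sits comfortably inside the distribution.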

So far we have seen how to fit a model and detect global outliers for novelty detection. **Here we will use real-world data for the detection of anomalies.** Real-world data is usually much more challenging to work with. To demonstrate this, I will download the data set of *natural gas spot prices* from Thomson Reuters [7], which is an open-source and freely available dataset [8]. After downloading, importing, and removing nan values, there are 6555 data points across 27 years.

```python
# Initialize distfit
dfit = distfit()

# Import dataset
df = dfit.import_example(data='gas_spot_price')
print(df)

#             price
# date
# 2023-02-07   2.35
# 2023-02-06   2.17
# 2023-02-03   2.40
# 2023-02-02   2.67
# 2023-02-01   2.65
# ...           ...
# 1997-01-13   4.00
# 1997-01-10   3.92
# 1997-01-09   3.61
# 1997-01-08   3.80
# 1997-01-07   3.82
#
# [6555 rows x 1 columns]
```

## Visual inspection of the data set.

To visually inspect the data, we can create a line plot of the *natural gas spot price* to see whether there are any obvious trends or other relevant matters (Figure 6). It can be seen that 2003 and 2021 contain two major peaks (which hint toward global outliers). Moreover, the price movements seem to have a natural rhythm with local highs and lows. Based on this line plot, we can build an intuition of the expected distribution. The price moves mainly in the range [2, 5], but with some exceptional years from 2003 to 2009, where the range was more between [6, 9].

```python
# Create the line plot
dfit.lineplot(df, xlabel='Years', ylabel='Natural gas spot price', grid=True)

# Show the plot
plt.show()
```

Let’s use *distfit* to investigate the data distribution in more depth and determine the accompanying PDF. The search space is set to all available PDFs, and the bootstrap approach is set to 100 to evaluate the PDFs for overfitting.

```python
import matplotlib.pyplot as plt
from distfit import distfit

# Initialize and fit across the full search space of distributions
dfit = distfit(distr='full', n_boots=100)

# Search for the best theoretical fit
results = dfit.fit_transform(df['price'].values)

# Plot PDF/CDF
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
dfit.plot(chart='PDF', n_top=10, ax=ax[0])
dfit.plot(chart='CDF', n_top=10, ax=ax[1])

# Show the plot
plt.show()
```

The best-fitting PDF is *Johnsonsb* (Figure 7), but when we plot the empirical data distribution, the PDF (red line) does not precisely follow the empirical data. In general, we can confirm that the majority of data points are located in the range [2, 5] (*this is where the peak of the distribution is*) and that there is a second, smaller peak in the distribution with price movements around the value 6. This is also the point where the PDF does not smoothly fit the empirical data and causes some undershoots and overshoots. With the summary plot and the QQ plot, we can investigate the fit even better. Let’s create these two plots with the following lines of code:

```python
# Make figure
fig, ax = plt.subplots(1, 2, figsize=(25, 10))

# Summary plot
dfit.plot_summary(ax=ax[0])

# QQ-plot
dfit.qqplot(df['price'].values, n_top=10, ax=ax[1])

# Show the plot
plt.show()
```

It is interesting to see in the summary plot that the goodness of fit test showed good results (low score) among all the top distributions. However, when we look at the results of the bootstrap approach, it shows that all but one distribution are overfitted (Figure 8A, orange line). This is not entirely unexpected, because we already noticed overshooting and undershooting. The QQ plot confirms that the fitted distributions deviate strongly from the empirical data (Figure 8B). Only the *Johnsonsb* distribution showed a (borderline) good fit.

## Detection of Global and Contextual Outliers.

We will proceed with the *Johnsonsb* distribution and the `predict` functionality for the detection of outliers. We already know that our data set contains outliers, as we followed the anomaly approach, i.e., *the distribution is fitted on the inliers, and observations that fall outside the confidence intervals can be marked as potential outliers.* With the `predict` function and the `lineplot` we can detect and plot the outliers. It can be seen from Figure 9 that the global outliers are detected, but also some contextual outliers, even though we did not model for them explicitly. **Red bars** are the underrepresented outliers and **green bars** are the overrepresented outliers. The `alpha` parameter can be set to tune the confidence intervals.

```python
# Make prediction
dfit.predict(df['price'].values, alpha=0.05, multtest=None)

# Line plot with data points outside the confidence interval
dfit.lineplot(df['price'], labels=df.index)
```
