Home Artificial Intelligence Comparing Outlier Detection Methods

Comparing Outlier Detection Methods

0
Comparing Outlier Detection Methods

Using batting stats from Major League Baseball’s 2023 season

Shohei Ohtani at bat, wearing Angels’ away gray uniform.
Shohei Ohtani, photo by Erik Drost on Flikr, CC BY 2.0

Outlier detection is an unsupervised machine learning task to discover anomalies (unusual observations) inside a given data set. This task is useful in lots of real-world cases where our available dataset is already “contaminated” by anomalies. Scikit-learn implements several outlier detection algorithms, and in cases where we’ve got an uncontaminated baseline, we may use these algorithms for novelty detection, a semi-supervised task that predicts whether latest observations are outliers.

The 4 outlier detection algorithms we’ll compare are:

  • Elliptic Envelope is suitable for normally-distributed data with low dimensionality. As its name implies, it uses the multivariate normal distribution to create a distance measure to separate outliers from inliers.
  • Local Outlier Factor is a comparison of the local density of an commentary with that of its neighbors. Observations with much lower density than their neighbors are considered outliers.
  • One-Class Support Vector Machine (SVM) with Stochastic Gradient Descent (SGD) is an O(n) approximate solution of the One-Class SVM. Note that the O(n²) One-Class SVM works well on our small example dataset but could also be impractical on your actual use case.
  • Isolation Forest is a tree-based approach where outliers are more quickly isolated by random splits than inliers.

Since our task is unsupervised, we don’t have ground truth to match accuracies of those algorithms. As an alternative, we wish to see how their results (player rankings specifically) differ from each other and gain some intuition into their behavior and limitations, in order that we would know when to prefer one over one other.

Let’s compare a couple of of those techniques using two metrics of batter performance from 2023’s Major Leage Baseball (MLB) season:

  • On-base percentage (OBP), the speed at which a batter reaches base (by hitting, walking, or getting hit by pitch) per plate appearance
  • Slugging (SLG), the common variety of total bases per at bat

There are many more sophisticated metrics of batter performance, including OBP plus SLG (OPS), weighted on-base average (wOBA), and adjusted weighted runs created (WRC+). Nonetheless, we’ll see that along with being commonly used and simple to grasp, OBP and SLG are moderately correlated and roughly normally distributed, making them well suited to this comparison.

We use the pybaseball package to acquire hitting data. This Python package is under MIT license and returns data from Fangraphs.com, Baseball-Reference.com, and other sources which have in turn obtained offical records from Major League Baseball.

We use pybaseball’s 2023 batting statistics, which might be obtained either by batting_stats (FanGraphs) or batting_stats_bref (Baseball Reference). It seems that the player names are more appropriately formatted from Fangraphs, but player teams and leagues from Baseball Reference are higher formatted within the case of traded players. For a dataset with improved readability, we really need to merge three tables: FanGraphs, Baseball Reference, and a key lookup.

from pybaseball import (cache, batting_stats_bref, batting_stats, 
playerid_reverse_lookup)
import pandas as pd

cache.enable() # avoid unnecessary requests when re-running

MIN_PLATE_APPEARANCES = 200

# For readability and reasonable default sort order
df_bref = batting_stats_bref(2023).query(f"PA >= {MIN_PLATE_APPEARANCES}"
).rename(columns={"Lev":"League",
"Tm":"Team"}
)
df_bref["League"] =
df_bref["League"].str.replace("Maj-","").replace("AL,NL","NL/AL"
).astype('category')

df_fg = batting_stats(2023, qual=MIN_PLATE_APPEARANCES)

key_mapping =
playerid_reverse_lookup(df_bref["mlbID"].to_list(), key_type='mlbam'
)[["key_mlbam","key_fangraphs"]
].rename(columns={"key_mlbam":"mlbID",
"key_fangraphs":"IDfg"}
)

df = df_fg.drop(columns="Team"
).merge(key_mapping, how="inner", on="IDfg"
).merge(df_bref[["mlbID","League","Team"]],
how="inner", on="mlbID"
).sort_values(["League","Team","Name"])

First, we note that these metrics differ in mean and variance and are moderately correlated. We also note that every metric is fairly symmetric, with median value near mean.

print(df[["OBP","SLG"]].describe().round(3))

print(f"nCorrelation: {df[['OBP','SLG']].corr()['SLG']['OBP']:.3f}")

           OBP      SLG
count 362.000 362.000
mean 0.320 0.415
std 0.034 0.068
min 0.234 0.227
25% 0.300 0.367
50% 0.318 0.414
75% 0.340 0.460
max 0.416 0.654

Correlation: 0.630

Let’s visualize this joint distribution, using:

  • Scatterplot of the players, coloured by National League (NL) vs American League (AL)
  • Bivariate kernel density estimator (KDE) plot of the players, which smoothes the scatterplot with a Gaussian kernel to estimate density
  • Marginal KDE plots of every metric
import matplotlib.pyplot as plt
import seaborn as sns

g = sns.JointGrid(data=df, x="OBP", y="SLG", height=5)
g = g.plot_joint(func=sns.scatterplot, data=df, hue="League",
palette={"AL":"blue","NL":"maroon","NL/AL":"green"},
alpha=0.6
)
g.fig.suptitle("On-base percentage vs. Sluggingn2023 season, min "
f"{MIN_PLATE_APPEARANCES} plate appearances"
)
g.figure.subplots_adjust(top=0.9)
sns.kdeplot(x=df["OBP"], color="orange", ax=g.ax_marg_x, alpha=0.5)
sns.kdeplot(y=df["SLG"], color="orange", ax=g.ax_marg_y, alpha=0.5)
sns.kdeplot(data=df, x="OBP", y="SLG",
ax=g.ax_joint, color="orange", alpha=0.5
)
df_extremes = df[ df["OBP"].isin([df["OBP"].min(),df["OBP"].max()])
| df["OPS"].isin([df["OPS"].min(),df["OPS"].max()])
]

for _,row in df_extremes.iterrows():
g.ax_joint.annotate(row["Name"], (row["OBP"], row["SLG"]),size=6,
xycoords='data', xytext=(-3, 0),
textcoords='offset points', ha="right",
alpha=0.7)
plt.show()

The highest-right corner of the scatterplot shows a cluster of excellence in hitting corresponding to the heavy upper tails of the SLG and OBP distributions. This small group excels at getting on base and hitting for extra bases. How much we consider them to be outliers (due to their distance from the vast majority of the player population) versus inliers (due to their proximity to 1 one other) will depend on the definition utilized by our chosen algorithm.

Scikit-learn’s outlier detection algorithms typically have fit() and predict() methods, but there are exceptions and in addition differences between algorithms of their arguments. We’ll consider each algorithm individually, but we’ll fit each to a matrix of attributes (n=2) per player (m=453). We’ll then rating not only each player but a grid of values spanning the range of every attribute, to assist us visualize the prediction function.

To visualise decision boundaries, we want to take the next steps:

  1. Create a 2D meshgrid of input feature values.
  2. Apply the decision_function to every point on the meshgrid, which requires unstacking the grid.
  3. Re-shape the predictions back right into a grid.
  4. Plot the predictions.

We’ll use a 200×200 grid to cover the present observations plus some padding, but you would adjust the grid to your required speed and backbone.

import numpy as np

X = df[["OBP","SLG"]].to_numpy()

GRID_RESOLUTION = 200

disp_x_range, disp_y_range = ( (.6*X[:,i].min(), 1.2*X[:,i].max())
for i in [0,1]
)
xx, yy = np.meshgrid(np.linspace(*disp_x_range, GRID_RESOLUTION),
np.linspace(*disp_y_range, GRID_RESOLUTION)
)
grid_shape = xx.shape
grid_unstacked = np.c_[xx.ravel(), yy.ravel()]

Elliptic Envelope

The form of the elliptic envelope is decided by the information’s covariance matrix, which supplies the variance of feature i on the foremost diagonal [i,i] and the covariance of features i and j within the [i,j] positions. Since the covariance matrix is sensitive to outliers, this algorithm uses the Minimum Covariance Determinant (MCD) Estimator, which is advisable for unimodal and symmetric distributions, with shuffling determined by the random_state input for reproducibility. This robust covariance matrix will come in useful again later.

Because we wish to match the outlier scores of their rating relatively than a binary outlier/inlier classification, we use the decision_function to attain players.

from sklearn.covariance import EllipticEnvelope

ell = EllipticEnvelope(random_state=17).fit(X)
df["outlier_score_ell"] = ell.decision_function(X)
Z_ell = ell.decision_function(grid_unstacked).reshape(grid_shape)

Local Outlier Factor

This approach to measuring isolation is predicated on k-nearest neighbors (KNN). We calculate the whole distance from each commentary to its nearest neighbors to define local density, after which we compare each commentary’s local density with that of its neighbors. Observations with local density much lower than their neighbors are considered outliers.

Selecting the variety of neighbors to incorporate: In KNN, a rule of thumb is to let K = sqrt(N), where N is your commentary count. From this rule, we obtain a K close to twenty (which happens to be the default K for LOF). You possibly can increase or decrease K to cut back overfitting or underfitting, respectively.

K = int(np.sqrt(X.shape[0]))

print(f"Using K={K} nearest neighbors.")

Using K=19 nearest neighbors.

Selecting a distance measure: Note that our features are correlated and have different variances, so Euclidean distance shouldn’t be very meaningful. We are going to use Mahalanobis distance, which accounts for feature scale and correlation.

In calculating the Mahalanobis distance, we’ll use the robust covariance matrix. If we had not already calculated it via Ellliptic Envelope, we could calculate it directly.

from scipy.spatial.distance import pdist, squareform

# If we did not have the elliptical envelope already,
# we could calculate robust covariance:
# from sklearn.covariance import MinCovDet
# robust_cov = MinCovDet().fit(X).covariance_
# But we will just re-use it from elliptical envelope:
robust_cov = ell.covariance_

print(f"Robust covariance matrix:n{np.round(robust_cov,5)}n")

inv_robust_cov = np.linalg.inv(robust_cov)

D_mahal = squareform(pdist(X, 'mahalanobis', VI=inv_robust_cov))

print(f"Mahalanobis distance matrix of size {D_mahal.shape}, "
f"e.g.:n{np.round(D_mahal[:5,:5],3)}...n...n")

Robust covariance matrix:
[[0.00077 0.00095]
[0.00095 0.00366]]

Mahalanobis distance matrix of size (362, 362), e.g.:
[[0. 2.86 1.278 0.964 0.331]
[2.86 0. 2.63 2.245 2.813]
[1.278 2.63 0. 0.561 0.956]
[0.964 2.245 0.561 0. 0.723]
[0.331 2.813 0.956 0.723 0. ]]...
...

Fitting the Local Outlier Factor: Note that using a custom distance matrix requires us to pass metric="precomputed" to the constructor after which the gap matrix itself to the fit method. (See documentation for more details.)

Also note that unlike other algorithms, with LOF we’re instructed not to make use of the score_samples method for scoring existing observations; this method should only be used for novelty detection.

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=K, metric="precomputed", novelty=True
).fit(D_mahal)

df["outlier_score_lof"] = lof.negative_outlier_factor_

Create the choice boundary: Because we used a custom distance metric, we must also compute that custom distance between each point within the grid to the unique observations. Before we used the spatial measure pdist for pairwise distances between each member of a single set, but now we use cdist to return the distances from each member of the primary set of inputs to every member of the second set.

from scipy.spatial.distance import cdist

D_mahal_grid = cdist(XA=grid_unstacked, XB=X,
metric='mahalanobis', VI=inv_robust_cov
)
Z_lof = lof.decision_function(D_mahal_grid).reshape(grid_shape)

Support Vector Machine (SGD-One-Class SVM)

SVMs use the kernel trick to rework features into the next dimensionality where a separating hyperplane might be identified. The radial basis function (RBF) kernel requires the inputs to be standardized, but because the documentation for StandardScaler notes, that scaler is sensitive to outliers, so we’ll use RobustScaler. We’ll pipe the scaled inputs into Nyström kernel approximation, as suggested by the documentation for SGDOneClassSVM.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM

suv = make_pipeline(
RobustScaler(),
Nystroem(random_state=17),
SGDOneClassSVM(random_state=17)
).fit(X)

df["outlier_score_svm"] = suv.decision_function(X)

Z_svm = suv.decision_function(grid_unstacked).reshape(grid_shape)

Isolation Forest

This tree-based approach to measuring isolation performs random recursive partitioning. If the common variety of splits required to isolate a given commentary is low, that commentary is taken into account a stronger candidate outlier. Like Random Forests and other tree-based models, Isolation Forest doesn’t assume that the features are normally distributed or require them to be scaled. By default, it builds 100 trees. Our example only uses two features, so we don’t enable feature sampling.

from sklearn.ensemble import IsolationForest

iso = IsolationForest(random_state=17).fit(X)

df["outlier_score_iso"] = iso.score_samples(X)

Z_iso = iso.decision_function(grid_unstacked).reshape(grid_shape)

Note that the predictions from these models have different distributions. We apply QuantileTransformer to make them more visually comparable on a given grid. From the documentation, please note:

Note that this transform is non-linear. It might distort linear correlations between variables measured at the identical scale but renders variables measured at different scales more directly comparable.

from adjustText import adjust_text
from sklearn.preprocessing import QuantileTransformer

N_QUANTILES = 8 # This many color breaks per chart
N_CALLOUTS=15 # Label this many top outliers per chart

fig, axs = plt.subplots(2, 2, figsize=(12, 12), sharex=True, sharey=True)

fig.suptitle("Comparison of Outlier Identification Algorithms",size=20)
fig.supxlabel("On-Base Percentage (OBP)")
fig.supylabel("Slugging (SLG)")

ax_ell = axs[0,0]
ax_lof = axs[0,1]
ax_svm = axs[1,0]
ax_iso = axs[1,1]

model_abbrs = ["ell","iso","lof","svm"]

qt = QuantileTransformer(n_quantiles=N_QUANTILES)

for ax, nm, abbr, zz in zip( [ax_ell,ax_iso,ax_lof,ax_svm],
["Elliptic Envelope","Isolation Forest",
"Local Outlier Factor","One-class SVM"],
model_abbrs,
[Z_ell,Z_iso,Z_lof,Z_svm]
):
ax.title.set_text(nm)
outlier_score_var_nm = f"outlier_score_{abbr}"

qt.fit(np.sort(zz.reshape(-1,1)))
zz_qtl = qt.transform(zz.reshape(-1,1)).reshape(zz.shape)

cs = ax.contourf(xx, yy, zz_qtl, cmap=plt.cm.OrRd.reversed(),
levels=np.linspace(0,1,N_QUANTILES)
)
ax.scatter(X[:, 0], X[:, 1], s=20, c="b", edgecolor="k", alpha=0.5)

df_callouts = df.sort_values(outlier_score_var_nm).head(N_CALLOUTS)
texts = [ ax.text(row["OBP"], row["SLG"], row["Name"], c="b",
size=9, alpha=1.0)
for _,row in df_callouts.iterrows()
]
adjust_text(texts,
df_callouts["OBP"].values, df_callouts["SLG"].values,
arrowprops=dict(arrowstyle='->', color="b", alpha=0.6),
ax=ax
)

plt.tight_layout(pad=2)
plt.show()

for var in ["OBP","SLG"]:
df[f"Pctl_{var}"] = 100*(df[var].rank()/df[var].size).round(3)

model_score_vars = [f"outlier_score_{nm}" for nm in model_abbrs]
model_rank_vars = [f"Rank_{nm.upper()}" for nm in model_abbrs]

df[model_rank_vars] = df[model_score_vars].rank(axis=0).astype(int)

# Averaging the ranks is unfair; we just need a countdown order
df["Rank_avg"] = df[model_rank_vars].mean(axis=1)

print("Counting right down to the best outlier...n")
print(
df.sort_values("Rank_avg",ascending=False
).tail(N_CALLOUTS)[["Name","AB","PA","H","2B","3B",
"HR","BB","HBP","SO","OBP",
"Pctl_OBP","SLG","Pctl_SLG"
] +
[f"Rank_{nm.upper()}" for nm in model_abbrs]
].to_string(index=False)
)

Counting right down to the best outlier...

Name AB PA H 2B 3B HR BB HBP SO OBP Pctl_OBP SLG Pctl_SLG Rank_ELL Rank_ISO Rank_LOF Rank_SVM
Austin Barnes 178 200 32 5 0 2 17 2 43 0.256 2.6 0.242 0.6 19 7 25 12
J.D. Martinez 432 479 117 27 2 33 34 2 149 0.321 52.8 0.572 98.1 15 18 5 15
Yandy Diaz 525 600 173 35 0 22 65 8 94 0.410 99.2 0.522 95.4 13 15 13 10
Jose Siri 338 364 75 13 2 25 20 2 130 0.267 5.5 0.494 88.4 8 14 15 13
Juan Soto 568 708 156 32 1 35 132 2 129 0.410 99.2 0.519 95.0 12 13 11 11
Mookie Betts 584 693 179 40 1 39 96 8 107 0.408 98.6 0.579 98.3 7 10 20 7
Rob Refsnyder 202 243 50 9 1 1 33 5 47 0.365 90.5 0.317 6.6 5 19 2 14
Yordan Alvarez 410 496 120 24 1 31 69 13 92 0.407 98.3 0.583 98.6 6 9 18 6
Freddie Freeman 637 730 211 59 2 29 72 16 121 0.410 99.2 0.567 97.8 9 11 9 8
Matt Olson 608 720 172 27 3 54 104 4 167 0.389 96.5 0.604 99.2 11 6 7 9
Austin Hedges 185 212 34 5 0 1 11 2 47 0.234 0.3 0.227 0.3 10 1 4 3
Aaron Judge 367 458 98 16 0 37 88 0 130 0.406 98.1 0.613 99.4 3 5 6 4
Ronald Acuna Jr. 643 735 217 35 4 41 80 9 84 0.416 100.0 0.596 98.9 2 3 10 2
Corey Seager 477 536 156 42 0 33 49 4 88 0.390 97.0 0.623 99.7 4 4 3 5
Shohei Ohtani 497 599 151 26 8 44 91 3 143 0.412 99.7 0.654 100.0 1 2 1 1

It looks just like the 4 implementations mostly agree on the right way to define outliers, but with some noticeable differences in scores and in addition in ease of use.

Elliptic Envelope has narrower contours across the ellipse’s minor axis, so it tends to focus on those interesting players who run contrary to the general correlation between features. For instance, Rays outfielder José Siri ranks as more of an outlier under this algorithm on account of his high SLG (88th percentile) versus low OBP (fifth percentile), which is consistent with an aggressive hitter who swings hard at borderline pitches and either crushes them or gets weak-to-no contact.

Elliptic Envelope can also be easy to make use of without configuration, and it provides the robust covariance matrix. If you might have low-dimensional data and an inexpensive expectation for it to be normally distributed (which is usually not the case), it is advisable to try this easy approach first.

One-class SVM has more uniformly spaced contours, so it tends to emphasise observations along the general direction of correlation greater than the Elliptic Envelope. All-Star first basemen Freddie Freeman (Dodgers) and Yandy Diaz (Rays) rank more strongly under this algorithm than under others, since their SLG and OBP are each excellent (99th and 97th percentile for Freeman, 99th and ninety fifth for Diaz).

The RBF kernel required an additional step for standardization, but it surely also looked as if it would work well on this easy example without fine-tuning.

Local Outlier Factor picked up on the “cluster of excellence” mentioned earlier with a small bimodal contour (barely visible within the chart). For the reason that Dodgers’ outfielder/second-baseman Mookie Betts is surrounded by other excellent hitters including Freeman, Yordan Alvarez, and Ronald Acuña Jr., he ranks as only the Twentieth-strongest outlier under LOF, versus tenth or stronger under the opposite algorithms. Conversely, Braves outfielder Marcell Ozuna had barely lower SLG and considerably lower OBP than Betts, but he’s more of an outlier under LOF because his neighborhood is less dense.

LOF was probably the most time-consuming to implement since we created robust distance matrices for fitting and scoring. We could have spent a while tuning K as well.

Isolation Forest tends to emphasise observations on the corners of the feature space, because splits are distributed across features. Backup catcher Austin Hedges, who played for the Pirates and Rangers in 2023 and signed with Guardians for 2024, is powerful defensively however the worst batter (with not less than 200 plate appearances) in each SLG and OBP. Hedges might be isolated in a single split on either OBP or OPS, making him the strongest outlier. Isolation Forest is the only algorithm that didn’t rank Shohei Ohtani because the strongest outlier: since Ohtani was edged out in OBP by Ronald Acuña Jr., each Ohtani and Acuña might be isolated in a single split on only one feature.

As with common supervised tree-based learners, Isolation Forest doesn’t extrapolate, making it higher suited to fitting to a contaminated dataset for outlier detection than for fitting to an anomaly-free dataset for novelty detection (where it wouldn’t rating latest outliers more strongly than the present observations).

Although Isolation Forest worked well out of the box, its failure to rank Shohei Ohtani because the biggest outlier in baseball (and doubtless all skilled sports) illustrates the first limitation of any outlier detector: the information you employ to suit it.

Not only did we omit defensive stats (sorry, Austin Hedges), we didn’t hassle to incorporate pitching stats. Because pitchers don’t even attempt to hit anymore… apart from Ohtani, whose season included the second-best batting average against (BAA) and Eleventh-best earned run average (ERA) in baseball (minimum 100 innings), a complete-game shutout, and a game by which he struck out ten batters and hit two home runs.

It has been suggested that Shohei Ohtani is a sophisticated extraterrestrial impersonating a human, but it surely seems more likely that there are two advanced extraterrestrials impersonating the identical human. Unfortunately, certainly one of them just had elbow surgery and won’t pitch in 2024… but the opposite just signed a record 10-year, $700 million contract. And because of outlier detection, now we will see why!

LEAVE A REPLY

Please enter your comment!
Please enter your name here