algorithms assume you’re working with completely unlabeled data.
But if you’ve actually worked on these problems, you know the reality is often different. In practice, anomaly detection tasks usually come with at least a few labeled examples, perhaps from past investigations, or because a subject matter expert flagged a handful of anomalies to help define the problem more clearly.
In these situations, ignoring those valuable labeled examples and sticking to purely unsupervised methods means leaving money on the table.
So the question is: how can we actually make use of those few labeled anomalies?
If you search the academic literature, you’ll find it is full of clever solutions, especially with all the new deep learning methods coming out. But let’s be real: most of those solutions require adopting entirely new frameworks with steep learning curves. They typically involve a painful amount of unintuitive hyperparameter tuning, and may still not perform well on your specific dataset.
In this post, I want to share three practical strategies you can start using immediately to boost your anomaly detection performance. No fancy frameworks required. I’ll also walk through a concrete example on fraud detection data so you can see how one of these approaches plays out in practice.
By the end, you’ll have several actionable methods for making better use of your limited labeled data, plus a real-world implementation you can adapt to your own use cases.
1. Threshold Tuning
Let’s start with the lowest-hanging fruit.
Most unsupervised models output a continuous anomaly score. It’s entirely up to you to decide where to draw the line that separates the “normal” and “abnormal” classes.
This is a crucial step for a practical anomaly detection solution, as choosing the wrong threshold can result in either missing critical anomalies or overwhelming operators with false alarms. Luckily, those few labeled abnormal examples can provide guidance for setting this threshold properly.
The key insight is that you can use those labeled anomalies as a validation set to quantify detection performance under different threshold choices.
Here’s how this works in practice:
Step (1): Proceed with your usual model training and thresholding on the dataset, excluding those labeled anomalies. If you have curated a purely normal dataset, it is reasonable to set the threshold as the maximum anomaly score observed in the normal data. If you are working with unlabeled data, you can set the threshold by selecting a percentile (e.g., the 95th or 99th percentile) that corresponds to your tolerated false positive rate.
Step (2): With your labeled anomalies set aside, you can calculate concrete detection metrics under your chosen threshold. These include recall (what percentage of known anomalies would be caught), precision, and recall@k (useful when you can only investigate the top k alerts). These metrics give you a quantitative measure of whether your current threshold yields acceptable detection performance.
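To make this concrete, here is a minimal sketch of Step (2). The score arrays (`scores_unlabeled`, `scores_labeled_anom`) are simulated placeholders for the scores your own model would produce, and the precision computed this way is only an approximation, since some unlabeled points above the threshold may be true (but unlabeled) anomalies.

import numpy as np

# Simulated stand-ins for your model's anomaly scores (illustrative only)
rng = np.random.default_rng(0)
scores_unlabeled = rng.normal(loc=0.0, scale=1.0, size=10_000)   # bulk of the (mostly normal) data
scores_labeled_anom = rng.normal(loc=3.0, scale=1.0, size=20)    # the few known anomalies, held out

# Step (1): set the threshold from the unlabeled data, e.g. the 99th percentile
threshold = np.percentile(scores_unlabeled, 99)

# Step (2): detection metrics on the held-out labeled anomalies
recall = (scores_labeled_anom > threshold).mean()

# Approximate precision: treat unlabeled points above the threshold as false alarms
# (pessimistic, since some of them may be undetected anomalies)
n_alerts = (scores_unlabeled > threshold).sum() + (scores_labeled_anom > threshold).sum()
precision_approx = (scores_labeled_anom > threshold).sum() / max(n_alerts, 1)

# recall@k: how many known anomalies land in the top-k highest scores overall
k = 100
all_scores = np.concatenate([scores_unlabeled, scores_labeled_anom])
top_k_cutoff = np.sort(all_scores)[-k]
recall_at_k = (scores_labeled_anom >= top_k_cutoff).mean()

print(f"recall: {recall:.2f}, approx. precision: {precision_approx:.3f}, recall@{k}: {recall_at_k:.2f}")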
💡Pro Tip: If the number of labeled anomalies is small, the estimated metrics (e.g., recall) will have high variance. A more robust approach is to report the uncertainty via bootstrapping. Essentially, you create many “pseudo-datasets” by randomly sampling the known anomalies with replacement, re-compute the metric for each replicate, and derive a confidence interval from the resulting distribution (e.g., take the 2.5th and 97.5th percentiles, which gives you a 95% confidence interval). These uncertainty estimates give you a hint of how trustworthy the computed metrics are.
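A minimal bootstrap sketch of this idea, reusing the same kind of simulated scores as above (the array and the threshold value are illustrative assumptions):

import numpy as np

# Simulated scores of the known anomalies and an assumed threshold (placeholders)
rng = np.random.default_rng(0)
scores_labeled_anom = rng.normal(loc=3.0, scale=1.0, size=20)
threshold = 2.3

n_boot = 2000
boot_recalls = []
for _ in range(n_boot):
    # Resample the known anomalies with replacement ("pseudo-dataset")
    resampled = rng.choice(scores_labeled_anom, size=len(scores_labeled_anom), replace=True)
    boot_recalls.append((resampled > threshold).mean())

lo, hi = np.percentile(boot_recalls, [2.5, 97.5])
print(f"recall = {(scores_labeled_anom > threshold).mean():.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")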
Step (3): If you are not satisfied with the current detection performance, you can now actively tune the threshold based on these metrics. If your recall is too low (meaning you’re missing too many known anomalies), lower the threshold. If you’re catching most anomalies but the false positive rate is higher than acceptable, raise the threshold and measure the trade-off. The bottom line is that you can now find the right balance between false positives and false negatives for your specific use case, based on real performance data.
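Here is a small, self-contained sweep over candidate percentile thresholds, again on simulated scores, that prints the recall on known anomalies next to the alert rate (a proxy for the false-alarm burden):

import numpy as np

# Simulated placeholders for your model's scores
rng = np.random.default_rng(0)
scores_unlabeled = rng.normal(0.0, 1.0, size=10_000)
scores_labeled_anom = rng.normal(3.0, 1.0, size=20)

for pct in [90, 95, 97.5, 99, 99.5]:
    threshold = np.percentile(scores_unlabeled, pct)
    recall = (scores_labeled_anom > threshold).mean()
    alert_rate = (scores_unlabeled > threshold).mean()  # fraction of data that would be flagged
    print(f"threshold @ {pct:>5}th pct -> recall: {recall:.2f}, alert rate: {alert_rate:.3%}")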
✨ Takeaway
The strength of this approach lies in its simplicity. You’re not changing your anomaly detection algorithm at all – you’re just using your labeled examples to intelligently tune a threshold you would have needed to set anyway. With a handful of labeled anomalies, you can turn threshold selection from guesswork into an optimization problem with measurable outcomes.
2. Model Selection
Besides tuning the threshold, the labeled anomalies can also guide the selection of better models and configurations.
Model selection is a common pain point every practitioner faces: with so many anomaly detection algorithms out there, each with its own hyperparameters, how do you know which combination will actually work well on your specific problem?
To answer this question effectively, we need a concrete way to measure how well different models and configurations perform on the dataset we’re investigating.
This is exactly where those labeled anomalies become invaluable. Here’s the workflow:
Step (1): Train your candidate model (with a specific set of configurations) on the dataset, excluding those labeled anomalies, just like we did for threshold tuning.
Step (2): Score the entire dataset and calculate the average anomaly-score percentile of your known anomalies. Specifically, for each labeled anomaly, you calculate what percentile of the score distribution it falls into (e.g., if the score of a known anomaly is higher than 95% of all data points, it is at the 95th percentile). Then you average these percentiles across all of your labeled anomalies. This way, you obtain a single metric that captures how well the model pushes known anomalies toward the top of the ranking. The higher this metric, the better the model performs.
Step (3): You can apply this approach to identify the most promising hyperparameter configurations for a specific model type you have in mind (e.g., Local Outlier Factor, Gaussian Mixture Models, Autoencoder, etc.), or to select the model type that best aligns with your anomaly patterns, as shown in the sketch below.
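Here is a minimal sketch of this model-selection loop. The data is synthetic, and the candidates (IForest, HBOS, PCA from PyOD) are just examples; in practice, plug in your own dataset and whichever models and configurations you want to compare.

import numpy as np
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA

def mean_anomaly_percentile(all_scores, anomaly_scores):
    """Average score percentile of the known anomalies (higher is better)."""
    all_scores = np.asarray(all_scores)
    return float(np.mean([(all_scores < s).mean() * 100 for s in anomaly_scores]))

# Synthetic stand-ins: mostly-normal unlabeled points plus a few labeled anomalies
rng = np.random.default_rng(42)
X_unlabeled = rng.normal(size=(5_000, 10))
X_known_anom = rng.normal(loc=4.0, size=(15, 10))
X_all = np.vstack([X_unlabeled, X_known_anom])

# Candidate models / configurations to compare
candidates = {
    "IForest_100": IForest(n_estimators=100, random_state=42),
    "IForest_300": IForest(n_estimators=300, random_state=42),
    "HBOS": HBOS(),
    "PCA": PCA(),
}

for name, model in candidates.items():
    model.fit(X_unlabeled)                        # Step (1): train without the labeled anomalies
    scores_all = model.decision_function(X_all)   # Step (2): score the entire dataset
    scores_anom = model.decision_function(X_known_anom)
    metric = mean_anomaly_percentile(scores_all, scores_anom)
    print(f"{name:12s} -> avg. percentile of known anomalies: {metric:.1f}")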
💡Pro Tip: Ensemble learning is increasingly common in production anomaly detection systems. Instead of relying on a single detection model, multiple detectors, possibly with different model types and configurations, run concurrently to catch different kinds of anomalies. In this case, those labeled abnormal samples can help you gauge which candidate model instances actually deserve a spot in your final ensemble.
✨ Takeaway
Compared to the previous threshold-tuning strategy, this model selection strategy moves from “tuning what you have” to “choosing what to use.”
Concretely, by using the average percentile ranking of your known anomalies as a performance metric, you can objectively compare different algorithms and configurations in terms of how well they identify the types of anomalies you actually encounter. As a result, model selection is no longer a trial-and-error exercise, but a data-driven decision.
3. Supervised Ensembling
So far, we’ve discussed strategies where the labeled anomalies are primarily used as a validation tool, either for tuning the threshold or for selecting promising models. We can, of course, put them to work more directly in the detection process itself.
This is where the idea of supervised ensembling comes in.
To better understand this approach, let’s first discuss the intuition behind it.
We know that different anomaly detection methods often disagree about what looks suspicious. One algorithm might flag a data point as anomalous while another might say it’s perfectly normal. But here’s the thing: these disagreements are quite informative, as they tell us a lot about that data point’s anomaly signature.
Consider the following scenario: suppose we have two data points, A and B. Data point A triggers an alarm in a density-based method (e.g., Gaussian Mixture Models) but passes through an isolation-based one (e.g., Isolation Forest). For data point B, however, both detectors set off the alarm. We would generally believe those two points carry quite different signatures, right?
Now the question is how to capture these signatures in a systematic way.
Luckily, we can resort to supervised learning. Here is how:
Step (1): Start by training multiple base anomaly detectors on your unlabeled data (excluding your precious labeled examples, of course).
Step (2): For each data point, collect the anomaly scores from all these detectors. This becomes your feature vector, which essentially encodes the “anomaly signature” we aim to mine. To give a concrete example, say you used three base detectors (e.g., Isolation Forest, GMM, and PCA); then the feature vector for a single data point i would look like this:

X_i = [iForest_score, GMM_score, PCA_score]

The label for each data point is simple: 1 for the known anomalies and 0 for the rest of the samples.
Step (3): Train a standard supervised classifier using these newly composed feature vectors as inputs and the labels as the targets. Although any off-the-shelf classification algorithm could in principle work, a common suggestion is to use gradient-boosted tree models, such as XGBoost, as they are adept at learning complex, non-linear patterns in the features and are robust to “noisy” labels (keep in mind that not all of the unlabeled samples are truly normal).
Once trained, this supervised “meta-model” is your final anomaly detector. At inference time, you run new data through all the base detectors and feed their outputs to the trained meta-model for the final decision, i.e., normal or abnormal.
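Below is a minimal hand-rolled sketch of this workflow (PyOD’s XGBOD class, used in the case study, packages the same idea). The data here is synthetic; `X_train` / `y_train` stand in for your own dataset, where the label is 1 only for the few known anomalies and 0 for everything else.

import numpy as np
from pyod.models.iforest import IForest
from pyod.models.pca import PCA
from pyod.models.hbos import HBOS
from xgboost import XGBClassifier

# Synthetic stand-ins: unlabeled (mostly normal) data plus a few known anomalies
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(size=(5_000, 10)),
                     rng.normal(loc=4.0, size=(20, 10))])
y_train = np.concatenate([np.zeros(5_000, dtype=int), np.ones(20, dtype=int)])

# Step (1): train the base detectors on the unlabeled part only
base_detectors = [IForest(random_state=0), PCA(), HBOS()]
for det in base_detectors:
    det.fit(X_train[y_train == 0])

# Step (2): the anomaly scores from all detectors become the meta-features
def score_features(X):
    return np.column_stack([det.decision_function(X) for det in base_detectors])

# Step (3): train the supervised meta-model on (score features, labels)
meta_model = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric="aucpr")
meta_model.fit(score_features(X_train), y_train)

# Inference: run new data through the base detectors, then the meta-model
X_new = rng.normal(size=(5, 10))
anomaly_proba = meta_model.predict_proba(score_features(X_new))[:, 1]
print(anomaly_proba)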
✨ Takeaway
With the supervised ensembling strategy, we shift the paradigm from using the labeled anomalies as a passive validation tool to making them active participants in the detection process. The meta-classifier we build learns how different detectors respond to anomalies. This not only improves detection accuracy but, more importantly, gives us a principled way to combine the strengths of multiple algorithms, making the anomaly detection system more robust and reliable.
If you’re thinking of implementing this strategy, the good news is that the PyOD library already provides this functionality. Let’s take a look at it next.
4. Case Study: Fraud Detection
In this section, let’s go through a concrete case study to see the supervised ensembling strategy in action. Here, we consider a method called XGBOD (Extreme Gradient Boosting Outlier Detection), which is implemented in the PyOD library.
For the case study, we use a credit card fraud detection dataset (Database Contents License) from Kaggle. This dataset contains transactions made by European cardholders with credit cards in September 2013. In total, there are 284,807 transactions, 492 of which are frauds. Note that due to confidentiality concerns, the features in the dataset are not the original ones but the result of a PCA transformation. The ‘Class’ feature is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
In this case study, we compare three learning paradigms for performing anomaly detection: unsupervised learning, XGBOD, and fully supervised learning. We will vary the “supervision ratio” (the percentage of anomalies whose labels are available during training) for both XGBOD and the supervised learning approach to see how leveraging labeled anomalies affects detection performance.
4.1 Import Libraries
For unsupervised anomaly detection, we consider four algorithms: Principal Component Analysis (PCA), Isolation Forest, Cluster-Based Local Outlier Factor (CBLOF), and Histogram-Based Outlier Score (HBOS), an efficient detection method that assumes feature independence and measures outlyingness by constructing histograms. All four algorithms are implemented in the PyOD library.
For the supervised learning approach, we use an XGBoost classifier.
import pandas as pd
import numpy as np
# PyOD imports
# !pip install pyod
from pyod.models.xgbod import XGBOD
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pyod.models.cblof import CBLOF
from pyod.models.hbos import HBOS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (precision_recall_curve, average_precision_score,
roc_auc_score)
# !pip install xgboost
from xgboost import XGBClassifier
4.2 Data Preparation
Remember to download the dataset from Kaggle and store it locally under the name “creditcard.csv”.
# Load data
df = pd.read_csv('creditcard.csv')
X, y = df.drop(columns='Class').values, df['Class'].values
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Dataset shape: {X.shape}")
print(f"Fraud rate (%): {y.mean()*100:.4f}")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
Here, we create a helper function to generate labeled data for XGBOD/XGBoost learning.
def create_supervised_labels(y_train, supervision_ratio=0.01):
    """
    Create supervised labels based on the supervision ratio.
    """
    fraud_indices = np.where(y_train == 1)[0]
    n_labeled_fraud = int(len(fraud_indices) * supervision_ratio)

    # Randomly select labeled samples
    labeled_fraud_idx = np.random.choice(fraud_indices,
                                         n_labeled_fraud,
                                         replace=False)

    # Create labels
    y_labels = np.zeros_like(y_train)
    y_labels[labeled_fraud_idx] = 1

    # Calculate how many true frauds are in the "unlabeled" set
    unlabeled_fraud_count = len(fraud_indices) - n_labeled_fraud

    return y_labels, labeled_fraud_idx, unlabeled_fraud_count
Note that this function mimics the realistic scenario where we have a few known anomalies (labeled as 1), while all other unlabeled samples are treated as normal (labeled as 0). This means our labels are effectively noisy, since some true fraud cases are hidden among the unlabeled data but still receive a label of 0.
Before we start our analysis, let’s define a helper function for evaluating model performance:
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluate a single model and return metrics.
    """
    # Get anomaly scores
    scores = model.decision_function(X_test)

    # Calculate metrics
    auc_pr = average_precision_score(y_test, scores)

    return {
        'model': model_name,
        'auc_pr': auc_pr,
        'scores': scores
    }
In the PyOD framework, every trained model instance exposes a decision_function() method. Calling it on the inference samples gives us the corresponding anomaly scores.
To compare performance, we use AUCPR, i.e., the area under the precision-recall curve. Since we’re dealing with a highly imbalanced dataset, AUCPR is generally preferred over AUC-ROC. Moreover, using AUCPR eliminates the need for an explicit threshold when measuring model performance, as the metric already summarizes performance across threshold settings.
4.3 Unsupervised Anomaly Detection
models = {
    'IsolationForest': IForest(random_state=42),
    'CBLOF': CBLOF(),
    'HBOS': HBOS(),
    'PCA': PCA(),
}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train)
    result = evaluate_model(model, X_test, y_test, name)
    print(f"{name:20} - AUC-PR: {result['auc_pr']:.4f}")
The results we obtained are as follows:
IsolationForest - AUC-PR: 0.1497
CBLOF - AUC-PR: 0.1527
HBOS - AUC-PR: 0.2488
PCA - AUC-PR: 0.1411
With zero hyperparameter tuning, none of the algorithms delivered very promising results, as their AUCPR values (~0.15–0.25) fall short of the very high precision/recall often required in fraud-detection settings.
However, we should note that, unlike AUC-ROC, which has a baseline value of 0.5, the baseline AUCPR depends on the prevalence of the positive class. For our dataset, since only 0.17% of the samples are fraud, a naive classifier that guesses randomly would achieve an AUCPR of roughly 0.0017. In that sense, all of the detectors already outperform random guessing by a wide margin.
4.4 XGBOD Approach
Now we move to the XGBOD approach, where we leverage a few labeled anomalies to inform our anomaly detection.
supervision_ratios = [0.01, 0.02, 0.05, 0.1, 0.15, 0.2]

for ratio in supervision_ratios:
    # Create supervised labels
    y_labels, labeled_fraud_idx, unlabeled_fraud_count = create_supervised_labels(y_train, ratio)

    total_fraud = sum(y_train)
    labeled_fraud = sum(y_labels)

    print(f"Known frauds (labeled as 1): {labeled_fraud}")
    print(f"Hidden frauds in 'normal' data: {unlabeled_fraud_count}")
    print(f"Total samples treated as normal: {len(y_train) - labeled_fraud}")
    print(f"Fraud contamination in 'normal' set: {unlabeled_fraud_count/(len(y_train) - labeled_fraud)*100:.3f}%")

    # Train the XGBOD model
    xgbod = XGBOD(estimator_list=[PCA(), CBLOF(), IForest(), HBOS()],
                  random_state=42,
                  n_estimators=200, learning_rate=0.1,
                  eval_metric='aucpr')
    xgbod.fit(X_train, y_labels)

    result = evaluate_model(xgbod, X_test, y_test, f"XGBOD_ratio_{ratio:.3f}")
    print(f"xgbod - AUC-PR: {result['auc_pr']:.4f}")
The results are shown in the figure below, along with the performance of the best unsupervised detector (HBOS) as a reference.
We can see that with only 1% of labeled anomalies, the XGBOD method already beats the best unsupervised detector, achieving an AUCPR score of 0.4. As more labeled anomalies become available for training, XGBOD’s performance continues to improve.
4.5 Supervised Learning
Finally, we consider the scenario where we directly train a binary classifier on the dataset with the labeled anomalies.
for ratio in supervision_ratios:
    # Create supervised labels
    y_label, labeled_fraud_idx, unlabeled_fraud_count = create_supervised_labels(y_train, ratio)

    clf = XGBClassifier(n_estimators=200, random_state=42,
                        learning_rate=0.1, eval_metric='aucpr')
    clf.fit(X_train, y_label)

    y_pred_proba = clf.predict_proba(X_test)[:, 1]
    auc_pr = average_precision_score(y_test, y_pred_proba)
    print(f"XGBoost - AUC-PR: {auc_pr:.4f}")
The results are shown in the figure below, along with XGBOD’s performance from the previous section:

Overall, we see that with only limited labeled data, the standard supervised classifier (XGBoost in this case) struggles to distinguish between normal and anomalous samples effectively. This is especially evident when the supervision ratio is extremely low (e.g., 1%). While XGBoost’s performance improves as more labeled examples become available, it remains consistently inferior to the XGBOD approach across the examined range of supervision ratios.
5. Conclusion
In this post, we discussed three practical strategies for leveraging a few labeled anomalies to boost the performance of your anomaly detector:
- Threshold tuning: Use labeled anomalies to turn threshold setting from guesswork into a data-driven optimization problem.
- Model selection: Objectively compare different algorithms and hyperparameter settings to find what truly works well for your specific problem.
- Supervised ensembling: Train a meta-model to systematically extract the anomaly signatures revealed by multiple unsupervised detectors.
Moreover, we went through a concrete case study on fraud detection and showed how the supervised ensembling method (XGBOD) dramatically outperformed both purely unsupervised models and a standard supervised classifier, especially when labeled data was scarce.
The key takeaway: a few labels go a long way in anomaly detection. Time to put those labels to work.