Whenever you present your anomaly detection results to your stakeholders, the immediate next question is almost always “Why?”.
In practice, simply flagging an anomaly isn’t enough. Understanding why it was flagged is crucial to determining the best next action.
Yet, most machine learning-based anomaly detection methods stop at producing an anomaly score. They are black-box in nature, which makes it painful to make sense of their outputs: why does this sample have a higher anomaly score than its neighbors?
To tackle this explainability challenge, you may have already turned to popular eXplainable AI (XAI) techniques. Perhaps you are calculating feature importance to discover which variables are driving the abnormality, or running counterfactual analysis to see how close a case was to being normal.
These are useful, but what if you could do more? What if you could derive a set of interpretable IF-THEN rules that characterize the identified anomalies?
This is exactly what the RuleFit algorithm [1] promises.
In this post, we’ll explore how the RuleFit algorithm works intuitively, how it can be applied to explain detected anomalies, and walk through a concrete case study.
1. How Does It Work?
Before diving into the technical details, let’s first clarify what we aim to have after applying the algorithm: a set of IF-THEN rules that quantitatively characterize the abnormal samples, together with the importance of those rules.
To get there, we need to answer two questions:
(1) How do we generate meaningful IF-THEN conditions from the data?
(2) How do we calculate a rule importance score to determine which rules actually matter?
The RuleFit algorithm addresses these questions by splitting the work into two complementary parts, the “Rule” and the “Fit”.
1.1 The “Rule” in RuleFit
In RuleFit, a rule looks like this:
IF x1 < 10 AND x2 > 5 THEN 1 ELSE 0
Would this structure look more familiar if we visualized it like this?
Yes, it’s a decision tree! The rule is simply one specific path through the tree, from the root node to a leaf node.
In RuleFit, rule generation relies heavily on building decision trees that predict the target outcome given the input features. Once a tree is built, any path from the root to a node can be converted into a decision rule, as we have just seen in the example above.
To ensure the rules are diverse, RuleFit doesn’t just fit one decision tree. Instead, it leverages tree ensemble algorithms (e.g., random forest, gradient boosting trees, etc.) to generate many different decision trees.
Also, the depths of these trees generally differ. This yields rules of variable length, further enhancing diversity.
Here, we should note that although the ensemble trees are built to predict the target outcome, the RuleFit algorithm does not really care about the final predictions. It merely uses the tree-building exercise as a vehicle to extract meaningful, quantitative rules.
Effectively, this means we discard the predicted value in each node and keep only the conditions that lead to that node. Those conditions produce the rules we care about.
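To make this concrete, here is a minimal sketch of the idea (not RuleFit’s internal code) that walks a fitted scikit-learn tree ensemble and reads out every root-to-node path as an IF-THEN condition. The toy dataset and the GradientBoostingClassifier are assumptions for illustration only:
# A minimal sketch (not RuleFit's actual implementation) of extracting rules
# from a fitted scikit-learn tree ensemble; dataset and model are toy choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
ensemble = GradientBoostingClassifier(n_estimators=5, max_depth=3, random_state=0)
ensemble.fit(X_demo, y_demo)

def extract_rules(tree, feature_names):
    """Return every root-to-node path of a fitted sklearn tree as a rule string."""
    t = tree.tree_
    rules = []
    def recurse(node, conditions):
        if conditions:                      # skip the empty rule at the root
            rules.append(" AND ".join(conditions))
        if t.children_left[node] == -1:     # reached a leaf, stop
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        recurse(t.children_left[node], conditions + [f"{name} <= {thr:.2f}"])
        recurse(t.children_right[node], conditions + [f"{name} > {thr:.2f}"])
    recurse(0, [])
    return rules

demo_feature_names = [f"x{i}" for i in range(X_demo.shape[1])]
candidate_rules = [r for est in ensemble.estimators_
                   for r in extract_rules(est[0], demo_feature_names)]
print(len(candidate_rules), "candidate rules; example:", candidate_rules[0])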
Okay, we can now wrap up the first processing step of the RuleFit algorithm: rule building. The outcome of this step is a pool of candidate rules that could potentially explain the observed data behavior.
But out of all those rules, which ones actually deserve our attention?
Well, that is where the second step of RuleFit comes in: we “fit” to rank.
1.2 The “Fit” in RuleFit
Essentially, RuleFit uncovers the most important rules via feature selection.
First, RuleFit treats each rule as a new binary feature: if the rule is satisfied for a given sample, that binary feature takes the value 1; otherwise, it is 0.
Then, RuleFit performs sparse linear regression with Lasso, using all of the “raw” features from the original dataset as well as the newly engineered binary features derived from the rules, to predict the target outcome. This way, each feature (raw features + binary rule features) gets a coefficient.
One key characteristic of Lasso is that its loss function forces the coefficients of unimportant features to be exactly zero, which effectively removes those features from the model.
As a result, by simply examining which binary rule features survived the Lasso fit, we immediately know which rules matter for accurately predicting the target outcome. In addition, by looking at the coefficient magnitudes associated with the rule features, we can rank the importance of the rules.
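As a rough, self-contained sketch of this “fit” step (using scikit-learn’s Lasso on a toy dataset and two hypothetical rule strings, not imodels’ actual implementation):
# A rough sketch of the "Fit" step; the DataFrame, target, and the two rule
# strings below are toy examples, not output of an actual tree ensemble.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"x1": rng.normal(10, 3, 300), "x2": rng.normal(5, 2, 300)})
y_toy = ((X_toy["x1"] < 10) & (X_toy["x2"] > 5)).astype(int)  # toy target

toy_rules = ["x1 < 10 and x2 > 5", "x1 > 12"]
rule_features = pd.DataFrame({r: X_toy.eval(r).astype(int) for r in toy_rules})

# Raw features + binary rule features, then sparse linear regression with Lasso
design = pd.concat([X_toy, rule_features], axis=1)
lasso = Lasso(alpha=0.01).fit(design, y_toy)

# Rules whose coefficients survive (non-zero) are the important ones
for name, coef in zip(design.columns, lasso.coef_):
    print(f"{name:25s} coef = {coef:+.3f}")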
1.3 Recap
We have just covered the essential theory behind the RuleFit algorithm. To summarize, we can view this approach as a two-step solution for providing explainability:
(1) It first extracts rules by training an ensemble of decision trees. That’s the “Rule” part.
(2) It then cleverly converts those rules into binary features and performs standard feature selection using sparse linear regression (Lasso). That’s the “Fit” part.
Finally, the surviving rules with non-zero coefficients are the important ones worth our attention.
At this point, you may have noticed that “predicting the target outcome” pops up in both the “Rule” and “Fit” steps. If we are dealing with a regression or classification problem, it is easy to see that the “target outcome” is the numerical value or label we want to predict, and the rules can be interpreted as patterns that drive the prediction.
But what about anomaly detection, which is basically an unsupervised task? How can we apply RuleFit there?
2. Anomaly Explanation with RuleFit
2.1 Application Pattern
First of all, we need to transform the unsupervised explainability problem into a supervised one. Here’s how.
Once we have our anomaly detection results (it doesn’t matter which algorithm produced them), we can create binary labels, i.e., 1 for an identified anomaly and 0 for a normal data point, as our “target outcome”. This way, we have exactly what RuleFit needs: the raw features and a target outcome to predict.
Then, RuleFit can work its magic: it generates a pool of candidate rules and fits a sparse linear regression model to retain only the important ones. The coefficients of the resulting model indicate how much each rule contributes to the log-odds of an instance being classified as an anomaly. To put it another way, they tell us which rule combinations most strongly push a sample toward being labeled as anomalous.
Note that you could, in theory, also use the anomaly score (produced by the primary anomaly detection model) as the “target outcome”. This changes the application of RuleFit from a classification setting to a regression setting.
Both approaches are valid, but they answer slightly different questions: with the binary-label classification setting, RuleFit uncovers “What makes something an anomaly?”; with the anomaly-score regression setting, RuleFit uncovers “What drives the severity of an anomaly?”.
In practice, the rules generated by both approaches will be very similar. However, using a binary anomaly label as the target is the more common choice for explaining detected anomalies: it is straightforward to interpret and directly applicable to creating business rules for flagging future anomalies.
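If you do want the severity-oriented view, the change is small. Here is a hedged sketch of the regression setting, assuming `detector` is an already-fitted pyOD model (which exposes decision_function) and `X` is the feature DataFrame:
# Sketch of the regression variant: use the continuous anomaly score as the target.
# `detector` and `X` are assumed to exist; this is not part of the case study below.
from imodels import RuleFitRegressor

anomaly_scores = detector.decision_function(X)  # continuous severity scores
rf_reg = RuleFitRegressor(max_rules=30, random_state=42)
rf_reg.fit(X, anomaly_scores, feature_names=list(X.columns))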
2.2 Case Study
Let’s walk through a concrete example to see how RuleFit works in action. Here, we’ll create an anomaly detection scenario using the Iris dataset [2] (licensed CC BY 4.0), where each sample consists of four features (sepal_length, sepal_width, petal_length, petal_width) and is labeled as one of three categories: Setosa, Versicolor, and Virginica.
Step 1: Data Setup
First, we’ll use all 50 Setosa samples and all 50 Versicolor samples as the “normal” samples. For the “abnormal” samples, we’ll use a subset of 10 Virginica samples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
np.random.seed(42)
# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y_true = iris.target
# Get normal samples (Setosa + Versicolor)
normal_mask = (y_true == 0) | (y_true == 1)
X_normal_all = X[normal_mask].copy()
# Get Virginica samples
virginica_mask = (y_true == 2)
X_virginica = X[virginica_mask].copy()
# Randomly select 10 Virginica samples to serve as anomalies
anomaly_indices = np.random.choice(len(X_virginica), size=10, replace=False)
X_anomalies = X_virginica.iloc[anomaly_indices].copy()
To make the scenario more realistic, we create separate training and test sets. The training set contains purely “normal” samples, while the test set consists of the 20 remaining “normal” samples plus the 10 “abnormal” samples.
train_indices = np.random.choice(len(X_normal_all), size=80, replace=False)
test_indices = np.setdiff1d(np.arange(len(X_normal_all)), train_indices)
X_train = X_normal_all.iloc[train_indices].copy()
X_normal_test = X_normal_all.iloc[test_indices].copy()
# Create test set (20 normal + 10 anomalous)
X_test = pd.concat([X_normal_test, X_anomalies], ignore_index=True)
y_test_true = np.concatenate([
    np.zeros(len(X_normal_test)),
    np.ones(len(X_anomalies))
])
Step 2: Anomaly Detection
Next, we perform anomaly detection. Here, we pretend we don’t know the actual labels. In this case study, we apply the Local Outlier Factor (LOF) as the anomaly detection algorithm, which locates anomalies by measuring how isolated a data point is compared to the density of its local neighbors. Of course, you can also try other anomaly detection algorithms, such as Gaussian Mixture Models (GMM), k-Nearest Neighbors (KNN), and others. However, keep in mind that the intention here is only to get detection results; our main focus is the anomaly explanation in Step 3.
Specifically, we’ll use the pyOD library to train the model and make inferences:
# Install the pyOD library
#!pip install pyod
from pyod.models.lof import LOF
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Local Outlier Factor
lof = LOF(n_neighbors=3)
lof.fit(X_train_scaled)
train_scores = lof.decision_function(X_train_scaled)
test_scores = lof.decision_function(X_test_scaled)
threshold = np.percentile(train_scores, 99)
y_pred = (test_scores > threshold).astype(int)
Notice that we have used the 99th percentile of the anomaly scores obtained on the training set as the threshold. For each test sample, if its anomaly score is higher than the threshold, the sample is labeled as an “anomaly”. Otherwise, the sample is considered “normal”.
At this stage, we can quickly check the detection performance with:
classification_report(y_test_true, y_pred, target_names=['Normal', 'Anomaly'])

Not great results: out of 10 true anomalies, only 5 are caught. However, the good news is that LOF didn’t produce any false positives. You could further improve performance by tuning the LOF hyperparameters, adjusting the threshold, or even considering ensemble learning strategies. But keep in mind: our goal here is not to achieve the best detection accuracy. Instead, we aim to see whether RuleFit can properly generate rules to explain the anomalies detected by the LOF model.
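As a side note, the confusion_matrix imported during the data setup gives the same breakdown at a glance:
# Rows are the true classes (normal, anomaly); columns are the predicted classes
print(confusion_matrix(y_test_true, y_pred))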
Step 3: Anomaly Explanation
Now we’re getting to the core topic. To apply RuleFit, let’s first install the imodels library, a scikit-learn-compatible, interpretable ML package for concise, transparent, and accurate predictive modeling:
pip install imodels
In this case, we’ll consider a binary-label classification setting, where the abnormal samples (in the test set) flagged by the LOF model are labeled as 1, and the remaining un-flagged normal samples (also in the test set) are labeled as 0. Note that we label based on LOF’s detection results, not the actual ground truth, which we pretend not to know.
To initialize the RuleFit model:
from imodels import RuleFitClassifier
rf = RuleFitClassifier(
    max_rules=30,
    lin_standardise=True,
    include_linear=True,
    random_state=42
)
We can then proceed to fit the RuleFit model:
rf.fit(
    X_test,
    y_pred,
    feature_names=X_test.columns
)
In practice, it is good practice to do a quick sanity check to evaluate how well the RuleFit model’s predictions align with the anomaly labels determined by the LOF algorithm:
from sklearn.metrics import accuracy_score, roc_auc_score
y_label = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_pred, y_label))
print("roc-auc:", roc_auc_score (y_pred, y_prob))
For our case, both printouts are 1. This confirms that the RuleFit model has successfully learned the patterns that LOF used to identify anomalies. For your own problems, if you observe values much lower than 1, you may want to fine-tune the RuleFit hyperparameters.
Now let’s examine the rules:
# Extract all rules found by RuleFit and keep only those with non-zero coefficients
rules = rf._get_rules()
rules = rules[rules.coef != 0]
# Drop the linear (raw feature) terms, keeping only the IF-THEN rules
rules = rules[~rules.type.str.contains('linear')]
rules['abs_coef'] = rules['coef'].abs()
rules = rules.sort_values('importance', ascending=False)
The RuleFit algorithm returns a total of 24 rules. A snapshot is shown below:

Let’s first clarify the meaning of the result columns:
- The “rule” column and the “abs_coef” column are self-explanatory.
- The “type” column has two unique values: “linear” and “rule”. “linear” denotes the original input features, while “rule” denotes the IF-THEN conditions generated from the decision trees.
- The “coef” column holds the coefficients produced by the Lasso regression. A positive value indicates that if the rule applies, the log-odds of being classified as abnormal increases. A larger magnitude indicates a stronger influence of that rule on the prediction.
- The “support” column records the fraction of data samples to which the rule applies.
- The “importance” column is calculated as the absolute value of the coefficient multiplied by the standard deviation of the binary (0 or 1) values that the rule takes on. Why this calculation? As we have just discussed, a larger absolute coefficient means a stronger direct impact on the log-odds. That’s clear. The standard deviation term, in turn, measures the “discriminative power” of the rule. For instance, if a rule is almost always TRUE (very small standard deviation), it doesn’t split your data effectively. The same holds if the rule is almost always FALSE. In other words, the rule cannot explain much of the variation in the target variable. Therefore, the importance score combines both the strength of the rule’s impact (coefficient magnitude) and how well it discriminates between different samples (standard deviation). A short sanity check is sketched right after this list.
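Concretely, since a 0/1 rule with support s has standard deviation sqrt(s(1 - s)), the importance of a rule reduces to |coef| * sqrt(support * (1 - support)), which matches the definition in the RuleFit paper [1]. A quick sanity check on the rules DataFrame:
# Recompute rule importance from coef and support; for a binary rule with
# support s, the standard deviation of its 0/1 values is sqrt(s * (1 - s)).
rules['importance_check'] = rules['coef'].abs() * np.sqrt(
    rules['support'] * (1 - rules['support'])
)
print(rules[['rule', 'importance', 'importance_check']].head())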
For our specific case, we see only one high-impact rule (Rule #24):
(Note that exp(4.448999) ≈ 85, i.e., when Rule #24 applies, the odds of being labeled as an anomaly increase by a factor of roughly 85.)
Rules #26 and #27 are essentially contained within Rule #24. This is common in practice, as RuleFit often produces “families” of similar rules that come from neighbouring tree splits. Therefore, the only rule that really matters for characterizing the LOF-identified anomalies is Rule #24.
Also, we see that the support of Rule #24 is 0.1667 (5/30). This effectively means that all 5 LOF-identified anomalies can be explained by this rule. We can see that more clearly in the figure below:
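If you want to double-check which test samples a rule covers, you can evaluate its condition directly on the data. The condition below is only a hypothetical placeholder; substitute the actual Rule #24 condition from the rules DataFrame:
# The threshold below is a hypothetical placeholder, not the actual Rule #24;
# read the real condition from the 'rule' column of the rules DataFrame.
rule_24_mask = X_test['petal length (cm)'] > 5.0
print("samples covered by the rule:", int(rule_24_mask.sum()))
print("of which flagged as anomalies by LOF:", int(y_pred[rule_24_mask.values].sum()))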

There you have it: the rule that explains the identified anomalies!
3. Conclusion
In this blog post, we explored the RuleFit algorithm as a powerful solution for explainable anomaly detection. We discussed:
- How it works: A two-step approach where decision trees are first fitted to derive meaningful rules, followed by sparse linear regression to rank the importance of those rules.
- How to apply it to anomaly explanation: Use the detection results as pseudo labels and feed them to the RuleFit model as the “target outcome”.
With RuleFit in your modeling toolkit, the next time stakeholders ask “Why is this an anomaly?”, you’ll have concrete IF-THEN rules that they can understand and act upon.
Reference
[1] Jerome H. Friedman, Bogdan E. Popescu, Predictive learning via rule ensembles, arXiv, 2008.
[2] Fisher, R. A., Iris [Data set]. UCI Machine Learning Repository, 1936.