Explainable AI in Production: A Neuro-Symbolic Model for Real-Time Fraud Detection


  • SHAP KernelExplainer takes ~30 ms per prediction (even with a small background)
  • A neuro-symbolic model generates explanations inside the forward pass in 0.9 ms
  • That’s a 33× speedup with deterministic outputs
  • Fraud recall is identical (0.8469), with only a small AUC drop
  • No separate explainer, no randomness, no additional latency cost
  • All code runs on the Kaggle Credit Card Fraud Detection dataset [1]

Full code: https://github.com/Emmimal/neuro-symbolic-xai-fraud/

The Moment the Problem Became Real

I was debugging a fraud detection system late one evening and wanted to know why the model had flagged a particular transaction. I called KernelExplainer, passed in my background dataset, and waited. Three seconds later I had a bar chart of feature attributions. I ran it again to double-check a value and got slightly different numbers.

That was when I realised there was a structural limitation in how explanations were being generated. The model was deterministic. The explanation was not. I was explaining a consistent decision with an inconsistent method, and neither the latency nor the randomness was acceptable if this ever needed to run in real time.

This article is about what I built instead, what it cost in performance, and what it got right, including one result that surprised me.

Key Insight: Explainability shouldn't be a post-processing step. It needs to be part of the model architecture.

Limitations of SHAP in Real-Time Settings

To be precise about what SHAP actually does: Lundberg and Lee's SHAP framework [2] computes Shapley values (an idea from cooperative game theory [3]) that attribute a model's output to its input features. KernelExplainer, the model-agnostic variant, approximates these values using a weighted linear regression over sampled coalitions of features. The background dataset acts as a baseline, and nsamples controls how many coalitions are evaluated per prediction.

This approximation is genuinely useful for model debugging, feature selection, and post-hoc analysis.
The limitation examined here is narrower but critical: when explanations have to be generated at inference time, attached to individual predictions, under real-time latency constraints.

Once you attach SHAP to a real-time fraud pipeline, you are running an approximation algorithm that:

  • Depends on a background dataset you have to maintain and pass at inference time
  • Produces results that shift depending on nsamples and the random state
  • Takes 30 ms per sample even at a reduced configuration

The chart below shows what that post-hoc output looks like: a global feature ranking computed after the prediction was already made.

SHAP mean absolute feature importance across 100 test samples, computed using KernelExplainer. V14 ranks highest, consistent with published EDA on this dataset. This is useful for global model understanding, but it is computed after the prediction, cannot be attached to a single real-time decision, and can produce slightly different values on the next run due to Monte Carlo sampling. Image by author.

In the benchmark I ran on the Kaggle creditcard dataset [1], SHAP itself printed a warning:

Using 200 background data samples could cause slower run times.
Consider using shap.sample(data, K) or shap.kmeans(data, K)
to summarize the background as K samples.

This highlights the trade-off between background size and computational cost in SHAP. 30 ms at 200 background samples is the lower bound. Larger backgrounds, which improve attribution stability, push the cost higher.
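As a back-of-envelope illustration (my own simplification, not SHAP's internals), the per-prediction work scales with the product of background size and nsamples, which is why summarising the background with shap.kmeans helps:

```python
def kernel_shap_evals(n_background: int, nsamples: int) -> int:
    """Rough cost model: each of the nsamples sampled coalitions is
    evaluated against every background sample, so the number of model
    rows scored per explained prediction is roughly their product."""
    return n_background * nsamples

# 200 raw background samples vs a 10-centroid kmeans summary, nsamples=100
full = kernel_shap_evals(200, 100)       # 20,000 rows scored per prediction
summarised = kernel_shap_evals(10, 100)  # 1,000 rows scored per prediction
```

Under this model, summarising 200 background samples down to 10 centroids cuts the per-prediction work by 20x, at the price of a coarser baseline.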

The neuro-symbolic model I built takes 0.898 ms for the prediction and explanation together. There is no floor to worry about because there is no separate explainer.

The Dataset

All experiments use the Kaggle Credit Card Fraud Detection dataset [1], covering 284,807 real bank card transactions from European cardholders in September 2013, of which 492 are confirmed fraud.

Shape         : (284807, 31)
Fraud rate    : 0.1727%
Fraud samples : 492
Legit samples : 284,315

The features V1 through V28 are PCA-transformed principal components. The original features are anonymised and not disclosed in the dataset. Amount is the transaction value. Time was dropped.

Amount was scaled with StandardScaler. I applied SMOTE [4] exclusively to the training set to handle the class imbalance. The test set was kept at the real-world 0.17% fraud distribution throughout.

Train size after SMOTE : 454,902
Fraud rate after SMOTE : 50.00%
Test set                : 56,962 samples  |  98 confirmed fraud

The test set structure is important: 98 fraud cases out of 56,962 samples is the actual operating condition of this problem. Any model that scores well here is doing so on a genuinely hard task.
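The SMOTE step above has a simple core mechanism that can be sketched in a few lines (a toy illustration of the idea, not the imbalanced-learn implementation): each synthetic minority sample is an interpolation between an existing minority point and one of its nearest minority neighbours.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((pi - ai) ** 2 for pi, ai in zip(p, a)),
        )[:k]
        b = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

fraud_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_sketch(fraud_points, n_new=5)
```

Every synthetic point lies on a segment between two real fraud samples, which is why SMOTE must never touch the test set: it would leak interpolations of test fraud into training.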

Two Models, One Comparison

The Baseline: Standard Neural Network

The baseline is a four-layer MLP with batch normalisation [5] and dropout [6], a standard architecture for tabular fraud detection.

import torch.nn as nn

class FraudNN(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.BatchNorm1d(128),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

It makes a prediction and nothing else. Explaining that prediction requires a separate SHAP call.

The Neuro-Symbolic Model: Explanation as Architecture

The neuro-symbolic model has three components working together: a neural backbone, a symbolic rule layer, and a fusion layer that combines both signals.

The neural backbone learns latent representations from all 29 features. The symbolic rule layer runs six differentiable rules in parallel, each computing a soft activation between zero and one using a sigmoid function. The fusion layer takes both outputs and produces the final probability.

class NeuroSymbolicFraudDetector(nn.Module):
    """
    Input
      |--- Neural Backbone  (latent fraud representations)
      |--- Symbolic Rule Layer  (6 differentiable rules)
                  |
              Fusion Layer  -->  P(fraud)  +  rule_activations
    """
    def __init__(self, input_dim, feature_names):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
        )
        self.symbolic = SymbolicRuleLayer(feature_names)
        self.fusion   = nn.Sequential(
            nn.Linear(32 + 1, 16), nn.ReLU(),  # 32 from backbone + 1 from symbolic layer (weighted rule activation summary)
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        latent = self.backbone(x)
        rule_summary, rule_activations = self.symbolic(x)  # scalar summary + per-rule activations
        p_fraud = self.fusion(torch.cat([latent, rule_summary], dim=1))
        return p_fraud, rule_activations
Architecture diagram showing input of 29 features splitting into two parallel paths. Left path is the Neural Backbone reducing from 64 to 32 dimensions. Right path is the Symbolic Rule Layer evaluating 6 rules with learnable thresholds. Both paths merge at the Fusion Layer which takes 32 plus 1 inputs and outputs through 16 neurons to a single probability. The output splits into P fraud and rule activations, with an annotation marking rule activations as the explanation produced here not after.
The neuro-symbolic model runs two paths in parallel on every forward pass. The neural backbone produces latent fraud representations. The symbolic rule layer evaluates six differentiable rules against learnable thresholds. The fusion layer combines both signals into a single fraud probability. The rule activations, which are the explanation, are a natural output of this computation, not a separate step. Image by author.

The six symbolic rules are anchored to the creditcard features with the strongest published fraud signal [7, 8]: V14, V17, V12, V10, V4, and Amount.

RULE_NAMES = [
    "HIGH_AMOUNT",    # Amount exceeds threshold
    "LOW_V17",        # V17 below threshold
    "LOW_V14",        # V14 below threshold (strongest signal)
    "LOW_V12",        # V12 below threshold
    "HIGH_V10_NEG",   # V10 heavily negative
    "LOW_V4",         # V4 below threshold
]

Each threshold is a learnable parameter initialised from a domain prior and updated during training via gradient descent. This means the model doesn't just use rules. It learns where to draw the lines.

The explanation is a by-product of the forward pass. When the symbolic layer evaluates the six rules, it already has everything it needs to produce a human-readable breakdown. Calling predict_with_explanation() returns the prediction, confidence, which rules fired, the observed values, and the learned thresholds, all in a single forward pass at no extra cost.
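The full SymbolicRuleLayer lives in the repo; a minimal sketch of the mechanism (my reconstruction from the description above, with hypothetical rule definitions and a fixed temperature) looks like this:

```python
import torch
import torch.nn as nn

class SymbolicRuleLayerSketch(nn.Module):
    """Each rule is a soft threshold test: sigmoid(direction * (x_f - t) / temp).
    direction = -1 encodes 'feature below threshold', +1 'feature above threshold'.
    Thresholds are nn.Parameters, so gradient descent moves the lines."""
    def __init__(self, feature_names, rules, temperature=0.5):
        super().__init__()
        # rules: list of (feature_name, direction, initial_threshold)
        self.register_buffer("idx", torch.tensor([feature_names.index(f) for f, _, _ in rules]))
        self.register_buffer("direction", torch.tensor([d for _, d, _ in rules], dtype=torch.float32))
        self.thresholds = nn.Parameter(torch.tensor([t for _, _, t in rules], dtype=torch.float32))
        self.rule_logits = nn.Parameter(torch.zeros(len(rules)))  # softmax -> rule weights
        self.temperature = temperature

    def forward(self, x):
        values = x[:, self.idx]                       # (batch, n_rules)
        activations = torch.sigmoid(self.direction * (values - self.thresholds) / self.temperature)
        weights = torch.softmax(self.rule_logits, dim=0)
        summary = (activations * weights).sum(dim=1, keepdim=True)  # the "+1" fed to fusion
        return summary, activations

layer = SymbolicRuleLayerSketch(["V14", "Amount"],
                                [("V14", -1.0, -0.44), ("Amount", 1.0, 2.0)])
summary, acts = layer(torch.tensor([[-2.0, 3.5]]))  # V14 well below, Amount well above
```

Both rules fire (activations above 0.5), and the weighted summary is the single scalar the fusion layer consumes alongside the 32-dim backbone output; the activations tensor is the explanation, already computed.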

Training

Both models were trained for 40 epochs using Adam [9] with weight decay and a step learning rate scheduler.

[Baseline NN]      Epoch 40/40  train=0.0067  val=0.0263
[Neuro-Symbolic]   Epoch 40/40  train=0.0030  val=0.0099

The neuro-symbolic model converges to a lower validation loss. Both curves are clean, with no sign of instability from the symbolic components.

Line chart showing training loss curves over 40 epochs for two models. Red line labeled Baseline NN starts around 0.07 and drops sharply before flattening near 0.007. Green line labeled Neuro-Symbolic starts at the same point, drops faster, and flattens lower near 0.003. Both curves are smooth with no instability. Dark background.
Training loss over 40 epochs for both models on the SMOTE-balanced training set. The neuro-symbolic model converges to a lower final training loss (0.003 vs 0.007), suggesting the symbolic rule layer provides a useful inductive bias. Both curves are clean with no signs of instability from the differentiable rule components. Image by author.

Performance on the Real-World Test Set

[Baseline NN]
                precision  recall  f1-score  support
Legit             0.9997   0.9989    0.9993    56864
Fraud             0.5685   0.8469    0.6803       98
ROC-AUC : 0.9737

[Neuro-Symbolic]
                precision  recall  f1-score  support
Legit             0.9997   0.9988    0.9993    56864
Fraud             0.5425   0.8469    0.6614       98
ROC-AUC : 0.9688

Recall on fraud is identical: 0.8469 for both models. The neuro-symbolic model catches exactly the same proportion of fraud cases as the unconstrained black-box baseline.

The precision difference (0.5425 vs 0.5685) means the neuro-symbolic model generates a few more false positives. Whether that is acceptable depends on the cost ratio between false positives and missed fraud in your specific deployment. The ROC-AUC gap (0.9688 vs 0.9737) is small.

The point is not that the neuro-symbolic model is more accurate. It's that it's comparably accurate while producing explanations that the baseline cannot produce at all.

What the Model Actually Learned

After 40 epochs, the symbolic rule thresholds are no longer the initialised priors. The model learned them.

Rule               Learned Threshold                  Weight
--------------------------------------------------------------
HIGH_AMOUNT        Amount > -0.011 (scaled)            0.121
LOW_V17            V17 < -0.135                        0.081
LOW_V14            V14 < -0.440                        0.071
LOW_V12            V12 < -0.300                        0.078
HIGH_V10_NEG       V10 < -0.320                        0.078
LOW_V4             V4 < -0.251                         0.571

The thresholds for V14, V17, V12, and V10 are consistent with what published EDA on this dataset has identified as the strongest fraud signals [7, 8]. The model found them through gradient descent, not manual specification.

But there is something unusual in the weight column: LOW_V4 carries 0.571 of the total symbolic weight, while the other five rules share the remaining 0.429. One rule dominates the symbolic layer by a wide margin.

This is the result I didn't expect, and it's worth being direct about what it means. The rule_weights are passed through a softmax during training, which in principle prevents any single weight from collapsing to one. But softmax doesn't enforce uniformity. It just normalises. With sufficient gradient signal, one rule can still accumulate most of the weight if the feature it covers is strongly predictive across the training distribution.

V4 is a known fraud signal on this dataset [7], but this level of dominance suggests the symbolic layer is behaving more like a single-feature gate than a multi-rule reasoning system during inference. For the model's predictions this is not a problem, because the neural backbone is still doing the heavy lifting on latent representations. But for the explanations, it means that on many transactions, the symbolic layer's contribution is largely determined by a single rule.

I'll come back to what needs to be done about this.

The Benchmark

The central question: how long does it take to produce an explanation, and does the output have the properties you need in production?

I ran both explanation methods on 100 test samples.

All latency measurements were taken on CPU (Intel i7-class machine, PyTorch, no GPU acceleration).

SHAP (KernelExplainer, 200 background samples, nsamples=100)
    Total : 3.00s   Per sample : 30.0 ms

Neuro-Symbolic (predict_with_explanation, single forward pass)
    Total : 0.0898s   Per sample : 0.898 ms

Speedup : 33x
Bar chart comparing explanation latency per sample. SHAP Post-Hoc bar reaches 29.98 milliseconds in red. Neuro-Symbolic Real-Time bar is barely visible at 0.90 milliseconds in green. Dark background with white axis labels.
Explanation latency measured on 100 test samples from the Kaggle creditcard dataset. SHAP KernelExplainer with 200 background samples costs 29.98 ms per prediction. The neuro-symbolic model produces its explanation in 0.90 ms as part of the same forward pass, with no background dataset and no separate call. The visual gap is not a styling choice. That is the actual ratio. Image by author.
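A minimal harness along these lines (my sketch, not the repo's benchmarking script) reproduces this kind of per-sample measurement for any explainer callable:

```python
import time

def per_sample_latency_ms(explain_fn, samples):
    """Average wall-clock latency per sample, in milliseconds."""
    start = time.perf_counter()
    for sample in samples:
        explain_fn(sample)
    total = time.perf_counter() - start
    return 1000.0 * total / len(samples)

# e.g. per_sample_latency_ms(model.predict_with_explanation, test_rows)
latency = per_sample_latency_ms(lambda s: sum(s), [[1.0] * 29] * 100)
```

Wall-clock averaging over a batch of real test rows is what the 30.0 ms and 0.898 ms figures above represent.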

The latency difference is the headline, but the consistency difference matters as much in practice.

SHAP's KernelExplainer uses Monte Carlo sampling to approximate Shapley values [2]. Run it twice on the same input and you get different numbers. The explanation shifts with the random state. In a regulated environment where decisions must be auditable, a stochastic explanation is a liability.
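The effect is easy to reproduce with a bare-bones permutation-sampling Shapley estimator (a toy stand-in for KernelExplainer, my own illustration): the estimate is exact only in expectation, and it is reproducible only if you pin the seed.

```python
import random

def shapley_mc(f, x, baseline, n_perm=20, seed=0):
    """Permutation-sampling Shapley estimate for a callable f over features x."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_perm):
        order = rng.sample(range(n), n)
        current = list(baseline)
        prev = f(current)
        for i in order:            # switch features on in a random order
            current[i] = x[i]
            now = f(current)
            phi[i] += now - prev   # marginal contribution of feature i
            prev = now
    return [p / n_perm for p in phi]

f = lambda v: v[0] * v[1] + v[2]   # interaction term -> sampling variance
phi_a = shapley_mc(f, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0], seed=0)
phi_b = shapley_mc(f, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0], seed=0)
```

With the same seed the two runs match exactly; change the seed and the attributions for the interacting features shift, even though they always sum to f(x) minus f(baseline). The neuro-symbolic explanation has no analogue of the seed.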

The neuro-symbolic model produces the same explanation every time for the same input. The rule activations are a deterministic function of the input features and the learned weights. There is nothing to vary.

Bar chart titled Explanation Variance Top 10 Features, with subtitle Neuro-Symbolic explanations are always deterministic. Y-axis shows variance across explanation runs scaled at 1e-5. Ten features on x-axis from V25 to V14. Salmon pink bars show varying SHAP variance values, with V11 highest around 1.02e-5 and V14 around 0.58e-5. A flat green dashed line sits at zero labeled Neuro-Symbolic variance equals zero.
SHAP explanation variance across runs for the top 10 most important features, measured by rerunning KernelExplainer with different random states. V11 shows the highest variance at roughly 1.02e-5, V14 at 0.58e-5. The green dashed line at zero represents the neuro-symbolic model, which produces identical explanations on every run for the same input. For compliance logging and auditability, this difference matters as much as the latency gap. Image by author.

Reading a Real Explanation

Here is the output from predict_with_explanation() on test set transaction 840, a confirmed fraud case.

Prediction  : FRAUD
Confidence  : 100.0%

Rules fired (4) -- produced INSIDE the forward pass:

Rule             Value    Op   Threshold   Weight
-------------------------------------------------
LOW_V17         -0.553     <      -0.135    0.081
LOW_V14         -0.582     <      -0.440    0.071
LOW_V12         -0.350     <      -0.300    0.078
HIGH_V10_NEG    -0.446     <      -0.320    0.078

Four rules fired simultaneously. Each line tells you which feature was involved, the observed value, the learned threshold it crossed, and the weight that rule carries in the symbolic layer. This output was not reconstructed from the prediction after the fact. It was produced at the same moment as the prediction, as part of the same computation.

Notice that LOW_V4 (the rule with 57% of the symbolic weight) didn't fire on this transaction. The four rules that did fire (V17, V14, V12, V10) all carry relatively modest weights individually. The model still predicted FRAUD at 100% confidence, which suggests the neural backbone carried this decision. The symbolic layer's role here was to identify the specific pattern of four anomalous V-feature values firing together, and surface it as a readable explanation.

This is actually a useful demonstration of how the two components interact. The neural backbone produces the prediction. The symbolic layer produces the justification. They are not always in perfect alignment, and that tension is informative.

Bar chart titled Example Fraud Transaction Rule Activations, confidence 100 percent, rules fired 4. Six bars shown: HIGH_AMOUNT in grey around 0.30, LOW_V17 in green around 0.70, LOW_V14 in green around 0.57, LOW_V12 in green around 0.53, HIGH_V10_NEG in green around 0.56, LOW_V4 in grey around 0.37. Green bars indicate rules that fired.
Rule activations for test set transaction 840, a confirmed fraud case. Four of the six rules fired: LOW_V17 with the strongest activation at roughly 0.70, followed by LOW_V14, HIGH_V10_NEG, and LOW_V12. HIGH_AMOUNT and LOW_V4 did not cross their thresholds for this transaction, despite LOW_V4 carrying 57% of the symbolic weight globally. This output was produced during the forward pass, not reconstructed from it. Image by author.

The same benchmark run records how frequently each rule fired across fraud-predicted transactions, produced during inference with no separate computation. Because the 100-sample window reflects the real-world 0.17% fraud rate, it contains very few fraud predictions, so the bars are thin. The pattern becomes clearer across the full test set, but even here it confirms the mechanism is working.

Horizontal bar chart titled Neuro-Symbolic Rule Fire-Rate on Fraud Transactions. Six rules listed on the y-axis: LOW_V4, HIGH_V10_NEG, LOW_V12, LOW_V14, LOW_V17, HIGH_AMOUNT. All bars appear at or near zero. Dark background with white labels and green color scheme.
Rule fire-rate across fraud-predicted transactions in the 100-sample benchmark set. Because the benchmark draws from the first 100 test samples at the real-world 0.17% fraud rate, very few fraud predictions fall within this window, which is why the bars appear empty. The fire-rate statistics are meaningful when computed across the full test set. The chart demonstrates that the mechanism works; the sample selection for benchmarking was optimised for latency measurement, not coverage. Image by author.

The Full Comparison

Dark-themed comparison table with three columns: Property, SHAP Post-Hoc in red text, and Neuro-Symbolic Real-Time in green text. Seven rows: Explanation timing shows After prediction post-hoc in red vs During prediction inline in green. Latency per sample shows 30 ms in red vs 0.90 ms in green. Speedup shows 1x baseline in red vs 33x faster in green. Consistency shows Stochastic nsamples-dependent in red vs Deterministic always identical in green. Production ready shows Too slow for real-time pipelines in red vs Sub-ms fits any latency SLA in green. Output format shows Feature attributions only in red vs Named rule firings with values in green. Extra inference cost shows Separate explainer call required in red vs Zero part of forward pass in green.
Seven-dimension comparison of SHAP and the neuro-symbolic approach, measured on the Kaggle creditcard dataset. Latency and speedup values are from the 100-sample benchmark. Consistency reflects the deterministic vs stochastic nature of each explanation method. Performance metrics (precision, recall, AUC) are intentionally absent from this table: the two models are deliberately close on those dimensions, and the comparison here is about what happens after the prediction, not the prediction itself. Image by author.

What Should Be Done Differently

The V4 weight collapse. The softmax over rule_weights failed to prevent one rule from accumulating 57% of the symbolic weight. The right fix is a regularisation term during training that penalises weight concentration, for example an entropy penalty on the softmax output that actively rewards more uniform distributions across rules. Without this, the symbolic layer can degrade toward a single-feature gate, which weakens the interpretability argument.
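A sketch of that regulariser (untested against this training run; the coefficient is a guess to tune):

```python
import torch

def weight_concentration_penalty(rule_logits: torch.Tensor, coef: float = 0.01) -> torch.Tensor:
    """Negative-entropy penalty on the softmax rule weights.
    Uniform weights have maximal entropy, so they incur the smallest penalty;
    add the returned term to the training loss."""
    w = torch.softmax(rule_logits, dim=0)
    entropy = -(w * torch.log(w + 1e-12)).sum()
    return -coef * entropy

uniform = weight_concentration_penalty(torch.zeros(6))
collapsed = weight_concentration_penalty(torch.tensor([10.0, 0, 0, 0, 0, 0]))
```

Uniform logits yield the most negative penalty, so the optimiser is nudged away from the single-rule collapse observed with LOW_V4.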

The HIGH_AMOUNT threshold. The learned threshold for Amount converged to -0.011 (scaled), which is effectively zero, so the rule fires for nearly any non-trivially small transaction and contributes very little discrimination. The issue is likely a combination of the feature being genuinely less predictive on this dataset than domain intuition suggests (V features dominate in the published literature [7, 8]) and the initialisation pulling the threshold to a low-information region. A bounded threshold initialisation or a learned gate that can suppress low-utility rules would handle this more cleanly.

Decision threshold tuning. Both models were evaluated at a 0.5 threshold. In practice, the right threshold depends on the cost ratio between false positives and missed fraud in the deployment context. This is especially important for the neuro-symbolic model, where precision is slightly lower. A threshold shift toward 0.6 or 0.65 would recover precision at the cost of some recall. This trade-off should be made deliberately, not left at the default.
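The tuning itself is a few lines (a sketch over held-out probabilities; the 20:1 cost ratio is an assumption, not a measured business number):

```python
def pick_threshold(probs, labels, cost_fp=1.0, cost_fn=20.0):
    """Sweep thresholds and return the one minimising expected cost
    on held-out data, given per-error costs for FP and FN."""
    best_t, best_cost = 0.5, float("inf")
    for step in range(1, 100):
        t = step / 100
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# toy held-out scores: three frauds scored high-ish, two legits scored low
t, c = pick_threshold([0.9, 0.8, 0.45, 0.2, 0.1], [1, 1, 1, 0, 0])
```

With a high cost on missed fraud, the sweep settles on a threshold low enough to catch every fraud in the toy data; real deployments would run this on a validation split with audited cost figures.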

Where This Fits

This is the fifth article in a series on neuro-symbolic approaches to fraud detection. The earlier work covers the foundations:

This article adds a fifth dimension: the explainability architecture itself. Not just whether the model can be explained, but whether the explanation can be produced at the speed and consistency that production systems actually require.

SHAP remains the right tool for model debugging, feature selection, and exploratory analysis. What this experiment shows is that when the explanation must be part of the decision (logged in real time, auditable per transaction, available to downstream systems), the architecture has to change. Post-hoc methods are too slow and too inconsistent for that role.

The neuro-symbolic approach trades a small amount of precision for an explanation that is deterministic, immediate, and structurally inseparable from the prediction itself. Whether that trade-off is worth it depends on your system. The numbers are here to help you decide.

Code: https://github.com/Emmimal/neuro-symbolic-xai-fraud/

Disclosure

This article is based on independent experiments using publicly available data (the Kaggle Credit Card Fraud Detection dataset) and open-source tools. No proprietary datasets, company resources, or confidential information were used. The results and code are fully reproducible as described, and the GitHub repository contains the complete implementation. The views and conclusions expressed here are my own and do not represent any employer or organization.

References

[1] ULB Machine Learning Group. Credit Card Fraud Detection. Kaggle, 2018. Available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (Dataset released under the Open Database License. Original research: Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G., 2015.)

[2] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30. Available at: https://arxiv.org/abs/1705.07874

[3] Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the Theory of Games (Vol. 2, pp. 307–317). Princeton University Press. https://doi.org/10.1515/9781400881970-018

[4] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. Available at: https://arxiv.org/abs/1106.1813

[5] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning. Available at: https://arxiv.org/abs/1502.03167

[6] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. Available at: https://jmlr.org/papers/v15/srivastava14a.html

[7] Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026

[8] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y., & Bontempi, G. (2018). SCARFF: A scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182–194. https://doi.org/10.1016/j.inffus.2017.09.005

[9] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations. Available at: https://arxiv.org/abs/1412.6980
