- SHAP KernelExplainer takes ~30 ms per prediction (even with a small background)
- A neuro-symbolic model generates explanations contained in the forward pass in 0.9 ms
- That’s a 33× speedup with deterministic outputs
- Fraud recall is equivalent (0.8469), with only a small AUC drop
- No separate explainer, no randomness, no additional latency cost
- All code runs on the Kaggle Credit Card Fraud Detection dataset [1]
Full code: https://github.com/Emmimal/neuro-symbolic-xai-fraud/
The Moment the Problem Became Real
I was debugging a fraud detection system late one evening and wanted to know why the model had flagged a particular transaction. I called KernelExplainer, passed in my background dataset, and waited. Three seconds later I had a bar chart of feature attributions. I ran it again to double-check a value and got slightly different numbers.
That's when I realised there was a structural limitation in how explanations were being generated. The model was deterministic. The explanation was not. I was explaining a consistent decision with an inconsistent method, and neither the latency nor the randomness was acceptable if this ever needed to run in real time.
This article is about what I built instead, what it cost in performance, and what it got right, including one result that surprised me.
Key Insight: Explainability shouldn’t be a post-processing step. It needs to be a part of the model architecture.
Limitations of SHAP in Real-Time Settings
To be precise about what SHAP actually does: Lundberg and Lee's SHAP framework [2] computes Shapley values (an idea from cooperative game theory [3]) that attribute a model's output to its input features. KernelExplainer, the model-agnostic variant, approximates these values using a weighted linear regression over sampled coalitions of features. The background dataset acts as a baseline, and nsamples controls how many coalitions are evaluated per prediction.
This approximation is genuinely useful for model debugging, feature selection, and post-hoc analysis.
The limitation examined here is narrower but critical: when explanations have to be generated at inference time, attached to individual predictions, under real-time latency constraints.
Once you attach SHAP to a real-time fraud pipeline, you are running an approximation algorithm that:
- Depends on a background dataset you have to maintain and pass at inference time
- Produces results that shift depending on nsamples and the random state
- Takes 30 ms per sample even at a reduced configuration
The chart below shows what that post-hoc output looks like: a global feature ranking computed after the prediction was already made.
In the benchmark I ran on the Kaggle creditcard dataset [1], SHAP itself printed a warning:
Using 200 background data samples could cause slower run times.
Consider using shap.sample(data, K) or shap.kmeans(data, K)
to summarize the background as K samples.
This highlights the trade-off between background size and computational cost in SHAP. 30 ms at 200 background samples is the lower bound. Larger backgrounds, which improve attribution stability, push the cost higher.
The neuro-symbolic model I built takes 0.898 ms for the prediction and explanation together. There is no floor to worry about because there is no separate explainer.
The Dataset
All experiments use the Kaggle Credit Card Fraud Detection dataset [1], covering 284,807 real bank card transactions from European cardholders in September 2013, of which 492 are confirmed fraud.
Shape : (284807, 31)
Fraud rate : 0.1727%
Fraud samples : 492
Legit samples : 284,315
The features V1 through V28 are PCA-transformed principal components. The original features are anonymised and not disclosed in the dataset. Amount is the transaction value. Time was dropped.
Amount was scaled with StandardScaler. I applied SMOTE [4] exclusively to the training set to handle the class imbalance. The test set was kept at the real-world 0.17% fraud distribution throughout.
Train size after SMOTE : 454,902
Fraud rate after SMOTE : 50.00%
Test set : 56,962 samples | 98 confirmed fraud
The test set structure is essential: 98 fraud cases out of 56,962 samples is the actual operating condition of this problem. Any model that scores well here is doing so on a genuinely hard task.
Two Models, One Comparison
The Baseline: Standard Neural Network
The baseline is a four-layer MLP with batch normalisation [5] and dropout [6], a standard architecture for tabular fraud detection.
class FraudNN(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.BatchNorm1d(128),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
It makes a prediction and nothing else. Explaining that prediction requires a separate SHAP call.
The Neuro-Symbolic Model: Explanation as Architecture
The neuro-symbolic model has three components working together: a neural backbone, a symbolic rule layer, and a fusion layer that combines both signals.
The neural backbone learns latent representations from all 29 features. The symbolic rule layer runs six differentiable rules in parallel, each computing a soft activation between zero and one using a sigmoid function. The fusion layer takes both outputs and produces the final probability.
class NeuroSymbolicFraudDetector(nn.Module):
    """
    Input
      |--- Neural Backbone (latent fraud representations)
      |--- Symbolic Rule Layer (6 differentiable rules)
      |
    Fusion Layer --> P(fraud) + rule_activations
    """
    def __init__(self, input_dim, feature_names):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
        )
        self.symbolic = SymbolicRuleLayer(feature_names)
        self.fusion = nn.Sequential(
            # 32 from backbone + 1 from symbolic layer
            # (weighted rule activation summary)
            nn.Linear(32 + 1, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

The six symbolic rules are anchored to the creditcard features with the strongest published fraud signal [7, 8]: V14, V17, V12, V10, V4, and Amount.
RULE_NAMES = [
"HIGH_AMOUNT", # Amount exceeds threshold
"LOW_V17", # V17 below threshold
"LOW_V14", # V14 below threshold (strongest signal)
"LOW_V12", # V12 below threshold
"HIGH_V10_NEG", # V10 heavily negative
"LOW_V4", # V4 below threshold
]
Each threshold is a learnable parameter initialised with a domain prior and updated during training via gradient descent. This means the model doesn't just use rules. It learns where to draw the lines.
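The rule mechanics can be sketched as a small PyTorch module. This is one plausible shape under the description above (learnable thresholds, sigmoid soft activations, softmax-normalised rule weights), not the repository's actual SymbolicRuleLayer; names, priors, and the steepness constant are illustrative.

```python
# Sketch of a differentiable rule layer: each rule compares one feature
# against a learnable threshold and emits a soft activation in (0, 1).
import torch
import torch.nn as nn

class SoftRuleLayer(nn.Module):
    def __init__(self, feature_idx, init_thresholds, directions, steepness=10.0):
        super().__init__()
        self.feature_idx = feature_idx            # which column each rule reads
        # Thresholds start at domain priors, then move by gradient descent.
        self.thresholds = nn.Parameter(torch.tensor(init_thresholds))
        # +1.0 fires when feature > threshold, -1.0 when feature < threshold.
        self.register_buffer("directions", torch.tensor(directions))
        # Per-rule logits, normalised with softmax to give rule weights.
        self.rule_logits = nn.Parameter(torch.zeros(len(feature_idx)))
        self.steepness = steepness

    def forward(self, x):
        feats = x[:, self.feature_idx]                         # (batch, n_rules)
        margins = self.directions * (feats - self.thresholds)  # signed distance
        activations = torch.sigmoid(self.steepness * margins)  # soft firing
        weights = torch.softmax(self.rule_logits, dim=0)
        # Scalar summary fed to the fusion layer: weighted rule activation.
        summary = (activations * weights).sum(dim=1, keepdim=True)
        return summary, activations

# Example: two rules, "V14 < -0.5" and "Amount > 1.0", on a 3-feature input.
layer = SoftRuleLayer([1, 2], [-0.5, 1.0], [-1.0, 1.0])
x = torch.tensor([[0.0, -2.0, 3.0]])
summary, acts = layer(x)
```

Because the sigmoid is differentiable everywhere, gradients flow through the rule activations back into the thresholds, which is what lets training move the decision lines.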
The explanation is a by-product of the forward pass. When the symbolic layer evaluates the six rules, it already has everything it needs to produce a human-readable breakdown. Calling predict_with_explanation() returns the prediction, confidence, which rules fired, the observed values, and the learned thresholds, all in a single forward pass at no extra cost.
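Assembling that breakdown is plain bookkeeping over tensors the forward pass already produced. A hypothetical helper of this shape, with toy tensors in place of real model outputs (the function name and the 0.5 firing cutoff are illustrative, not the repository's API):

```python
# Sketch of packaging an explanation from quantities a forward pass
# already computed. Toy values stand in for one flagged transaction.
import torch

RULE_NAMES = ["HIGH_AMOUNT", "LOW_V17", "LOW_V14"]

def explain_from_forward(prob, activations, values, thresholds, weights,
                         fire_at=0.5):
    """Build a readable breakdown; no extra model evaluation needed."""
    fired = []
    for i, name in enumerate(RULE_NAMES):
        if activations[i] > fire_at:      # soft activation treated as "fired"
            fired.append({
                "rule": name,
                "value": round(values[i].item(), 3),
                "threshold": round(thresholds[i].item(), 3),
                "weight": round(weights[i].item(), 3),
            })
    return {
        "prediction": "FRAUD" if prob > 0.5 else "LEGIT",
        "confidence": round(prob * 100, 1),
        "rules_fired": fired,
    }

report = explain_from_forward(
    prob=0.97,
    activations=torch.tensor([0.1, 0.9, 0.8]),
    values=torch.tensor([0.2, -0.553, -0.582]),
    thresholds=torch.tensor([1.0, -0.135, -0.440]),
    weights=torch.tensor([0.121, 0.081, 0.071]),
)
```

The point of the structure is that nothing here re-runs the model: the explanation is a formatting step over state the prediction already created.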
Training
Both models were trained for 40 epochs using Adam [9] with weight decay and a step learning-rate scheduler.
[Baseline NN] Epoch 40/40 train=0.0067 val=0.0263
[Neuro-Symbolic] Epoch 40/40 train=0.0030 val=0.0099
The neuro-symbolic model converges to a lower validation loss. Both curves are clean, with no sign of instability from the symbolic components.

Performance on the Real-World Test Set
[Baseline NN]
precision recall f1-score support
Legit 0.9997 0.9989 0.9993 56864
Fraud 0.5685 0.8469 0.6803 98
ROC-AUC : 0.9737
[Neuro-Symbolic]
precision recall f1-score support
Legit 0.9997 0.9988 0.9993 56864
Fraud 0.5425 0.8469 0.6614 98
ROC-AUC : 0.9688
Recall on fraud is identical: 0.8469 for both models. The neuro-symbolic model catches exactly the same proportion of fraud cases as the unconstrained black-box baseline.
The precision difference (0.5425 vs 0.5685) means the neuro-symbolic model generates a few more false positives. Whether that is acceptable depends on the cost ratio between false positives and missed fraud in your specific deployment. The ROC-AUC gap (0.9688 vs 0.9737) is small.
The point is not that the neuro-symbolic model is more accurate. It's that it's comparably accurate while producing explanations that the baseline cannot produce at all.
What the Model Actually Learned
After 40 epochs, the symbolic rule thresholds are no longer the initialised priors. The model learned them.
Rule Learned Threshold Weight
--------------------------------------------------------------
HIGH_AMOUNT Amount > -0.011 (scaled) 0.121
LOW_V17 V17 < -0.135 0.081
LOW_V14 V14 < -0.440 0.071
LOW_V12 V12 < -0.300 0.078
HIGH_V10_NEG V10 < -0.320 0.078
LOW_V4 V4 < -0.251 0.571
The thresholds for V14, V17, V12, and V10 are consistent with what published EDA on this dataset has identified as the strongest fraud signals [7, 8]. The model found them through gradient descent, not manual specification.
But there is something unusual in the weight column: LOW_V4 carries 0.571 of the total symbolic weight, while the other five rules share the remaining 0.429. One rule dominates the symbolic layer by a wide margin.
This is the result I didn't expect, and it's worth being direct about what it means. The rule_weights are passed through a softmax during training, which in principle prevents any single weight from collapsing to one. But softmax doesn't enforce uniformity. It just normalises. With sufficient gradient signal, one rule can still accumulate most of the weight if the feature it covers is strongly predictive across the training distribution.
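A two-line check makes the point: softmax guarantees the weights sum to one, not that they stay spread out. One sufficiently large logit still captures most of the mass.

```python
# Softmax normalises logits into a distribution, but nothing keeps that
# distribution uniform: one dominant logit takes most of the weight.
import torch

logits = torch.tensor([2.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # one dominant rule
weights = torch.softmax(logits, dim=0)

print(weights[0].item())       # the dominant rule's share of the weight
print(weights.sum().item())    # still sums to 1.0
```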
V4 is a known fraud signal in this dataset [7], but this level of dominance suggests the symbolic layer is behaving more like a single-feature gate than a multi-rule reasoning system during inference. For the model's predictions this is not a problem, because the neural backbone is still doing the heavy lifting on latent representations. But for the explanations, it means that on many transactions, the symbolic layer's contribution is largely determined by a single rule.
I'll come back to what needs to be done about this.
The Benchmark
The central question: how long does it take to produce an explanation, and does the output have the properties you need in production?
I ran both explanation methods on 100 test samples.
All latency measurements were taken on CPU (Intel i7-class machine, PyTorch, no GPU acceleration).
SHAP (KernelExplainer, 200 background samples, nsamples=100)
Total : 3.00s Per sample : 30.0 ms
Neuro-Symbolic (predict_with_explanation, single forward pass)
Total : 0.0898s Per sample : 0.898 ms
Speedup : 33x
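For reference, the per-sample number can be measured with a plain wall-clock loop of this shape. A toy model stands in for either detector; absolute times will differ by machine, and the warm-up call keeps one-time setup costs out of the measurement.

```python
# Per-sample CPU latency measurement: average wall-clock time of a
# single-sample forward pass. Model and sizes are toy stand-ins.
import time
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(29, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
model.eval()

x = torch.randn(100, 29)

with torch.no_grad():
    model(x)                          # warm-up run, excluded from timing
    start = time.perf_counter()
    for sample in x:
        model(sample.unsqueeze(0))    # one transaction at a time
    total = time.perf_counter() - start

per_sample_ms = total / len(x) * 1000
print(f"Per sample : {per_sample_ms:.3f} ms")
```

Timing single-sample calls rather than one batched call matters here, because the production scenario is one transaction arriving at a time.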

The latency difference is the headline, but the consistency difference matters just as much in practice.
SHAP's KernelExplainer uses Monte Carlo sampling to approximate Shapley values [2]. Run it twice on the same input and you get different numbers. The explanation shifts with the random state. In a regulated environment where decisions must be auditable, a stochastic explanation is a liability.
The neuro-symbolic model produces the same explanation every time for the same input. The rule activations are a deterministic function of the input features and the learned weights. There is nothing to vary.
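That property is cheap to verify: in eval mode, two forward passes on the same input are bitwise identical. A toy model stands in for the neuro-symbolic detector here.

```python
# Determinism check: with dropout disabled (eval mode) and no sampling
# anywhere in the graph, repeated forward passes agree exactly.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(29, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)
model.eval()                  # disables dropout; inference is a pure function

x = torch.randn(1, 29)
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)

print(torch.equal(out1, out2))
```

The same check fails for KernelExplainer unless the random seed is pinned, and pinning the seed only hides the sampling variance rather than removing it.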

Reading a Real Explanation
Here is the output from predict_with_explanation() on test set transaction 840, a confirmed fraud case.
Prediction : FRAUD
Confidence : 100.0%
Rules fired (4) -- produced INSIDE the forward pass:
Rule Value Op Threshold Weight
-------------------------------------------------
LOW_V17 -0.553 < -0.135 0.081
LOW_V14 -0.582 < -0.440 0.071
LOW_V12 -0.350 < -0.300 0.078
HIGH_V10_NEG -0.446 < -0.320 0.078
Four rules fired concurrently. Each line tells you which feature was involved, the observed value, the learned threshold it crossed, and the weight that rule carries in the symbolic layer. This output was not reconstructed from the prediction after the fact. It was produced at the same moment as the prediction, as part of the same computation.
Notice that LOW_V4 (the rule with 57% of the symbolic weight) didn't fire on this transaction. The four rules that did fire (V17, V14, V12, V10) all carry relatively modest weights individually. The model still predicted FRAUD at 100% confidence, which suggests the neural backbone carried this decision. The symbolic layer's role here was to identify the specific pattern of four anomalous V-feature values firing together, and surface it as a readable explanation.
This is actually a useful demonstration of how the two components interact. The neural backbone produces the prediction. The symbolic layer produces the justification. They are not always in perfect alignment, and that tension is informative.

The same benchmark run records how frequently each rule fired across fraud-predicted transactions, produced during inference with no separate computation. Because the 100-sample window reflects the real-world 0.17% fraud rate, it contains very few fraud predictions, so the bars are thin. The pattern becomes clearer across the full test set, but even here it confirms the mechanism is working.

The Full Comparison

What Should Be Done Differently
The V4 weight collapse. The softmax over rule_weights failed to prevent one rule from accumulating 57% of the symbolic weight. The right fix is a regularisation term during training that penalises weight concentration: for example, an entropy penalty on the softmax output that actively rewards more uniform distributions across rules. Without this, the symbolic layer can degrade toward a single-feature gate, which weakens the interpretability argument.
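A minimal sketch of such a penalty, assuming the rule weights come from a logits vector; the coefficient and the name rule_logits are illustrative, not the repository's API:

```python
# Entropy penalty on softmax rule weights: low entropy (concentrated
# weights) is taxed, pushing the distribution toward uniformity.
import math
import torch

def concentration_penalty(rule_logits, coeff=0.01):
    """Returns a loss term that grows as the rule weights concentrate."""
    w = torch.softmax(rule_logits, dim=0)
    entropy = -(w * torch.log(w + 1e-12)).sum()
    max_entropy = math.log(len(rule_logits))   # entropy of the uniform case
    return coeff * (max_entropy - entropy)

uniform = torch.zeros(6)                        # equal logits, uniform weights
skewed = torch.tensor([4.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # one dominant rule

print(concentration_penalty(uniform).item())   # near zero
print(concentration_penalty(skewed).item())    # positive
```

In training, this term would simply be added to the classification loss, so gradient descent trades a little fit for a more balanced rule layer.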
The HIGH_AMOUNT threshold. The learned threshold for Amount converged to -0.011 (scaled), which is effectively zero, so the rule fires for nearly any non-trivially small transaction and contributes very little discrimination. The issue is likely a combination of the feature being genuinely less predictive in this dataset than domain intuition suggests (the V features dominate in the published literature [7, 8]) and the initialisation pulling the threshold into a low-information region. A bounded threshold initialisation or a learned gate that can suppress low-utility rules would handle this more cleanly.
Decision threshold tuning. Both models were evaluated at a 0.5 threshold. In practice, the right threshold depends on the cost ratio between false positives and missed fraud in the deployment context. This is especially important for the neuro-symbolic model, where precision is slightly lower. A threshold shift toward 0.6 or 0.65 would recover precision at the cost of some recall. This trade-off should be made deliberately, not left at the default.
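A sweep of this kind is a few lines with scikit-learn. Synthetic scores stand in for the models' probabilities; the thresholds are the ones discussed above.

```python
# Threshold sweep sketch: evaluate precision and recall at several
# decision thresholds instead of defaulting to 0.5. Synthetic scores
# correlated with the label stand in for real model probabilities.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.02).astype(int)       # ~2% positives
y_score = np.clip(0.7 * y_true + rng.normal(0.2, 0.15, 5000), 0, 1)

for t in (0.5, 0.6, 0.65):
    y_pred = (y_score >= t).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold always shrinks the set of predicted positives, so recall can only fall while precision typically rises; the sweep makes the exchange rate explicit for a given score distribution.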
Where This Fits
This is the fifth article in a series on neuro-symbolic approaches to fraud detection. The earlier work covers the foundations.
This article adds a fifth dimension: the explainability architecture itself. Not just whether the model can be explained, but whether the explanation can be produced at the speed and consistency that production systems actually require.
SHAP remains the right tool for model debugging, feature selection, and exploratory analysis. What this experiment shows is that when the explanation must be part of the decision (logged in real time, auditable per transaction, available to downstream systems), the architecture has to change. Post-hoc methods are too slow and too inconsistent for that role.
The neuro-symbolic approach trades a small amount of precision for an explanation that is deterministic, immediate, and structurally inseparable from the prediction itself. Whether that trade-off is worth it depends on your system. The numbers are here to help you decide.
Disclosure
This article is based on independent experiments using publicly available data (Kaggle Credit Card Fraud dataset) and open-source tools. No proprietary datasets, company resources, or confidential information were used. The results and code are fully reproducible as described, and the GitHub repository contains the complete implementation. The views and conclusions expressed here are my own and do not represent any employer or organization.
References
[1] ULB Machine Learning Group. Credit Card Fraud Detection. Kaggle, 2018. Available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (Dataset released under the Open Database License. Original research: Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G., 2015.)
[2] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30. Available at: https://arxiv.org/abs/1705.07874
[3] Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the Theory of Games (Vol. 2, pp. 307–317). Princeton University Press. https://doi.org/10.1515/9781400881970-018
[4] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. Available at: https://arxiv.org/abs/1106.1813
[5] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning. Available at: https://arxiv.org/abs/1502.03167
[6] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. Available at: https://jmlr.org/papers/v15/srivastava14a.html
[7] Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
[8] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y., & Bontempi, G. (2018). SCARFF: A scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182–194. https://doi.org/10.1016/j.inffus.2017.09.005
[9] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations. Available at: https://arxiv.org/abs/1412.6980
