- SHAP KernelExplainer takes ~30 ms per prediction (even with a small background)
- A neuro-symbolic model generates explanations inside the forward pass in 0.9 ms
- That’s a 33× speedup with deterministic outputs
- Fraud recall is identical (0.8469), with only a small AUC drop
- No separate explainer, no randomness, no additional latency cost
- All code runs on the Kaggle Credit Card Fraud Detection dataset [1]
Full code: https://github.com/Emmimal/neuro-symbolic-xai-fraud/
The Moment the Problem Became Real
I was debugging a fraud detection system late one evening and wanted to understand why the model had flagged a specific transaction. I called KernelExplainer, passed in my background dataset, and waited. Three seconds later I had a bar chart of feature attributions. I ran it again to double-check a value and got slightly different numbers.
That is when I realised there was a structural limitation in how explanations were being generated. The model was deterministic. The explanation was not. I was explaining a consistent decision with an inconsistent method, and neither the latency nor the randomness was acceptable if this ever had to run in real time.
This article is about what I built instead, what it cost in performance, and what it got right, including one result that surprised me.
If explanations can’t be produced instantly and consistently, they cannot be used in real-time fraud systems.
Key Insight: Explainability should not be a post-processing step. It should be part of the model architecture.
Limitations of SHAP in Real-Time Settings
To be precise about what SHAP actually does: Lundberg and Lee’s SHAP framework [2] computes Shapley values (a concept from cooperative game theory [3]) that attribute a model’s output to its input features. KernelExplainer, the model-agnostic variant, approximates these values using a weighted linear regression over a sampled coalition of features. The background dataset acts as a baseline, and nsamples controls how many coalitions are evaluated per prediction.
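To anchor what KernelExplainer is approximating, here is a minimal, self-contained sketch that computes exact Shapley values by enumerating every coalition, the quantity SHAP estimates by sampling. The two-feature linear model is a toy of my own choosing, not anything from the article's repo:

```python
from itertools import combinations
from math import factorial

def exact_shapley(model, x, baseline):
    """Exact Shapley values by enumerating every feature coalition.
    Features outside a coalition are replaced by their baseline value,
    mirroring the role of SHAP's background dataset."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for coalition in combinations(others, size):
                # Shapley kernel weight for a coalition of this size
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in coalition or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

# For a linear model the exact value reduces to w_i * (x_i - baseline_i).
linear = lambda v: 2.0 * v[0] + 3.0 * v[1]
print(exact_shapley(linear, x=[1.0, 1.0], baseline=[0.0, 0.0]))  # -> [2.0, 3.0]
```

Exact enumeration costs O(2^n) model calls, which is exactly why KernelExplainer samples coalitions instead, and why its output varies with the random state.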
This approximation is extremely useful for model debugging, feature selection, and post-hoc analysis.
The limitation examined here is narrower but critical: when explanations must be generated at inference time, attached to individual predictions, under real-time latency constraints.
When you attach SHAP to a real-time fraud pipeline, you are running an approximation algorithm that:
- Depends on a background dataset you have to maintain and pass at inference time
- Produces results that shift depending on nsamples and the random state
- Takes 30 ms per sample at a reduced configuration
The chart below shows what that post-hoc output looks like — a global feature ranking computed after the prediction was already made.
SHAP mean absolute feature importance across 100 test samples, computed using KernelExplainer. V14 ranks highest, consistent with published EDA on this dataset. This is useful for global model understanding — but it is computed after the prediction, cannot be attached to a single real-time decision, and will produce slightly different values on the next run due to Monte Carlo sampling. Image by author.
In the benchmark I ran on the Kaggle creditcard dataset [1], SHAP itself printed a warning:
Using 200 background data samples could cause slower run times.
Consider using shap.sample(data, K) or shap.kmeans(data, K)
to summarize the background as K samples.
This is the trade-off between background size and computational cost in SHAP: 30 ms at 200 background samples is effectively the floor, and larger backgrounds, which improve attribution stability, push the cost higher.
The neuro-symbolic model I built takes 0.898 ms for the prediction and explanation together. There is no floor to worry about because there is no separate explainer.
The Dataset
All experiments use the Kaggle Credit Card Fraud Detection dataset [1], covering 284,807 real credit card transactions from European cardholders in September 2013, of which 492 are confirmed fraud.
Shape : (284807, 31)
Fraud rate : 0.1727%
Fraud samples : 492
Legit samples : 284,315
The features V1 through V28 are PCA-transformed principal components. The original features are anonymised and not disclosed in the dataset. Amount is the transaction value. Time was dropped.
Amount was scaled with StandardScaler. I applied SMOTE [4] exclusively to the training set to address the class imbalance. The test set was held at the real-world 0.17% fraud distribution throughout.
Train size after SMOTE : 454,902
Fraud rate after SMOTE : 50.00%
Test set : 56,962 samples | 98 confirmed fraud
The test set structure is important: 98 fraud cases out of 56,962 samples is the actual operating condition of this problem. Any model that scores well here is doing so on a genuinely hard task.
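The train-only resampling step matters enough to sketch. Below is a minimal SMOTE-style oversampler in NumPy that shows the core interpolation mechanism; a proper SMOTE implementation (e.g. imbalanced-learn) would be used in practice, and smote_like_oversample is a name I made up for illustration:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic point interpolates
    between a minority sample and one of its k nearest minority neighbours.
    Illustration only; use a real SMOTE implementation in practice."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    out = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()  # random point on the segment between the two samples
        out[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return out

# Oversample ONLY the training minority class; the test set keeps the
# real-world class distribution.
rng = np.random.default_rng(0)
X_fraud_train = rng.normal(size=(50, 5))
synthetic = smote_like_oversample(X_fraud_train, n_new=450, k=5, rng=1)
print(synthetic.shape)  # (450, 5)
```

The key design point is visible in the last lines: synthetic samples are generated from training minority points only, so nothing leaks into the 0.17% test distribution.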
Two Models, One Comparison
The Baseline: Standard Neural Network
The baseline is a four-layer MLP with batch normalisation [5] and dropout [6], a standard architecture for tabular fraud detection.
class FraudNN(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.BatchNorm1d(128),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
It makes a prediction and nothing else. Explaining that prediction requires a separate SHAP call.
The Neuro-Symbolic Model: Explanation as Architecture
The neuro-symbolic model has three components working together: a neural backbone, a symbolic rule layer, and a fusion layer that combines both signals.
The neural backbone learns latent representations from all 29 features. The symbolic rule layer runs six differentiable rules in parallel, each one computing a soft activation between zero and one using a sigmoid function. The fusion layer takes both outputs and produces the final probability.
class NeuroSymbolicFraudDetector(nn.Module):
    """
    Input
      |-- Neural Backbone (latent fraud representations)
      |-- Symbolic Rule Layer (6 differentiable rules)
      |
    Fusion Layer --> P(fraud) + rule_activations
    """
    def __init__(self, input_dim, feature_names):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
        )
        self.symbolic = SymbolicRuleLayer(feature_names)
        self.fusion = nn.Sequential(
            nn.Linear(32 + 1, 16), nn.ReLU(),  # 32 from backbone + 1 weighted rule-activation summary
            nn.Linear(16, 1), nn.Sigmoid(),
        )
The neuro-symbolic model runs two paths in parallel on every forward pass. The neural backbone produces latent fraud representations. The symbolic rule layer evaluates six differentiable rules against learnable thresholds. The fusion layer combines both signals into a single fraud probability. The rule activations — the explanation — are a natural output of this computation, not a separate step. Image by author.
The six symbolic rules are anchored to the creditcard features with the strongest published fraud signal [7, 8]: V14, V17, V12, V10, V4, and Amount.
RULE_NAMES = [
    "HIGH_AMOUNT",   # Amount exceeds threshold
    "LOW_V17",       # V17 below threshold
    "LOW_V14",       # V14 below threshold (strongest signal)
    "LOW_V12",       # V12 below threshold
    "HIGH_V10_NEG",  # V10 heavily negative
    "LOW_V4",        # V4 below threshold
]
Each threshold is a learnable parameter initialised with a domain prior and updated during training via gradient descent. This means the model does not just use rules. It learns where to draw the lines.
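The repo's SymbolicRuleLayer is referenced above but not shown. A minimal sketch of how such a layer can work, assuming a fixed steepness constant k and softmax-normalised rule weights; the class name, signature, and defaults are my assumptions, and only the "value below threshold" rule form is shown:

```python
import torch
import torch.nn as nn

class SymbolicRuleLayerSketch(nn.Module):
    """Soft rules of the form sigmoid(k * (threshold - x)), i.e. a smooth,
    differentiable version of 'value < threshold'. Thresholds are learnable
    parameters; rule weights are softmax-normalised. A sketch, not the
    repo's exact implementation."""
    def __init__(self, feature_idx, init_thresholds, k=5.0):
        super().__init__()
        self.feature_idx = feature_idx  # which input column each rule reads
        self.thresholds = nn.Parameter(torch.tensor(init_thresholds))  # domain priors, updated by SGD
        self.rule_weights = nn.Parameter(torch.zeros(len(feature_idx)))
        self.k = k

    def forward(self, x):
        vals = x[:, self.feature_idx]                                   # (batch, n_rules)
        activations = torch.sigmoid(self.k * (self.thresholds - vals))  # soft "value < threshold"
        weights = torch.softmax(self.rule_weights, dim=0)
        summary = (activations * weights).sum(dim=1, keepdim=True)      # scalar fed to the fusion layer
        return summary, activations

layer = SymbolicRuleLayerSketch(feature_idx=[0, 1], init_thresholds=[-0.4, -0.1])
summary, acts = layer(torch.tensor([[-1.0, 0.5], [0.0, -0.5]]))
```

Because sigmoid is differentiable everywhere, gradients flow through the activations into the thresholds, which is what lets the model learn where to draw the lines.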
The explanation is a by-product of the forward pass. When the symbolic layer evaluates the six rules, it already has everything it needs to produce a human-readable breakdown. Calling predict_with_explanation() returns the prediction, confidence, which rules fired, the observed values, and the learned thresholds, all in a single forward pass at no extra cost.
Training
Both models were trained for 40 epochs using Adam [9] with weight decay and a step learning rate scheduler.
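That setup can be sketched as follows; the lr, weight_decay, step_size, and gamma values here are placeholders I chose, not necessarily the repo's hyperparameters:

```python
import torch
import torch.nn as nn

model = nn.Linear(29, 1)  # stand-in for either architecture
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)

for epoch in range(40):
    # ... per-batch: forward pass, BCE loss, loss.backward(), optimizer.step() ...
    scheduler.step()  # halve the learning rate every step_size epochs
```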
[Baseline NN] Epoch 40/40 train=0.0067 val=0.0263
[Neuro-Symbolic] Epoch 40/40 train=0.0030 val=0.0099
The neuro-symbolic model converges to a lower validation loss. Both curves are clean with no sign of instability from the symbolic components.
Training loss over 40 epochs for both models on the SMOTE-balanced training set. The neuro-symbolic model converges to a lower final training loss (0.003 vs 0.007), suggesting the symbolic rule layer provides a useful inductive bias. Both curves are clean with no signs of instability from the differentiable rule components. Image by author.
Performance on the Real-World Test Set
[Baseline NN]
precision recall f1-score support
Legit 0.9997 0.9989 0.9993 56864
Fraud 0.5685 0.8469 0.6803 98
ROC-AUC : 0.9737
[Neuro-Symbolic]
precision recall f1-score support
Legit 0.9997 0.9988 0.9993 56864
Fraud 0.5425 0.8469 0.6614 98
ROC-AUC : 0.9688
Recall on fraud is identical: 0.8469 for both models. The neuro-symbolic model catches exactly the same proportion of fraud cases as the unconstrained black-box baseline.
The precision difference (0.5425 vs 0.5685) means the neuro-symbolic model generates a few more false positives. Whether that is acceptable depends on the cost ratio between false positives and missed fraud in your specific deployment. The ROC-AUC gap (0.9688 vs 0.9737) is small.
The point is not that the neuro-symbolic model is more accurate. It is that it is comparably accurate while producing explanations that the baseline cannot produce at all.
What the Model Actually Learned
After 40 epochs, the symbolic rule thresholds are no longer initialised priors. The model learned them.
Rule           Learned Threshold            Weight
--------------------------------------------------
HIGH_AMOUNT    Amount > -0.011 (scaled)     0.121
LOW_V17        V17 < -0.135                 0.081
LOW_V14        V14 < -0.440                 0.071
LOW_V12        V12 < -0.300                 0.078
HIGH_V10_NEG   V10 < -0.320                 0.078
LOW_V4         V4 < -0.251                  0.571
The thresholds for V14, V17, V12, and V10 are consistent with what published EDA on this dataset has identified as the strongest fraud signals [7, 8]. The model found them through gradient descent, not manual specification.
But there is something unusual in the weight column: LOW_V4 carries 0.571 of the total symbolic weight, while the other five rules share the remaining 0.429. One rule dominates the symbolic layer by a wide margin.
This is the result I did not expect, and it is worth being direct about what it means. The rule_weights are passed through a softmax during training, which in principle prevents any single weight from collapsing to one. But softmax does not enforce uniformity. It just normalises. With sufficient gradient signal, one rule can still accumulate most of the weight if the feature it covers is strongly predictive across the training distribution.
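The softmax point is easy to verify numerically. The logits below are illustrative values I chose so the result lands near the observed 0.571; they are not the trained model's parameters:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

# Six rule logits; one pulling ahead of the rest is enough to take most of the mass.
logits = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.9])
weights = softmax(logits)
print(weights.round(3))  # last weight ~0.57, the other five ~0.09 each
```

Softmax guarantees the weights sum to one; it says nothing about how evenly they are spread.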
V4 is a known fraud signal in this dataset [7], but this level of dominance suggests the symbolic layer is behaving more like a single-feature gate than a multi-rule reasoning system during inference. For the model’s predictions this is not a problem, as the neural backbone is still doing the heavy lifting on latent representations. But for the explanations, it means that on many transactions, the symbolic layer’s contribution is largely determined by a single rule.
I will come back to what should be done about this.
The Benchmark
The central question: how long does it take to produce an explanation, and does the output have the properties you need in production?
I ran both explanation methods on 100 test samples.
All latency measurements were taken on CPU (Intel i7-class machine, PyTorch, no GPU acceleration).
SHAP (KernelExplainer, 200 background samples, nsamples=100)
Total : 3.00s Per sample : 30.0 ms
Neuro-Symbolic (predict_with_explanation, single forward pass)
Total : 0.0898s Per sample : 0.898 ms
Speedup : 33x
Explanation latency measured on 100 test samples from the Kaggle creditcard dataset. SHAP KernelExplainer with 200 background samples costs 29.98 ms per prediction. The neuro-symbolic model produces its explanation in 0.90 ms as part of the same forward pass — no background dataset, no separate call. The visual gap is not a styling choice. That is the actual ratio. Image by author.
The latency difference is the headline, but the consistency difference matters as much in practice.
SHAP’s KernelExplainer uses Monte Carlo sampling to approximate Shapley values [2]. Run it twice on the same input and you get different numbers. The explanation shifts with the random state. In a regulated environment where decisions need to be auditable, a stochastic explanation is a liability.
The neuro-symbolic model produces the same explanation every time for the same input. The rule activations are a deterministic function of the input features and the learned weights. There is nothing to vary.
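The contrast is easy to demonstrate in miniature with a toy estimator (my own stand-in, not SHAP itself): a Monte Carlo estimate changes with the seed, while a deterministic function of the input does not.

```python
import random

def mc_estimate(f, x, n=200, seed=None):
    """Monte Carlo estimate of f's mean output under input noise --
    a toy stand-in for KernelExplainer's sampled coalitions."""
    rng = random.Random(seed)
    return sum(f(x + rng.gauss(0, 1)) for _ in range(n)) / n

f = lambda v: v * v
run_a = mc_estimate(f, 2.0, seed=1)
run_b = mc_estimate(f, 2.0, seed=2)
print(run_a == run_b)    # False: the estimate shifts with the random state
print(f(2.0) == f(2.0))  # True: a deterministic function is reproducible
```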
SHAP explanation variance across runs for the top 10 most important features, measured by rerunning KernelExplainer with different random states. V11 shows the highest variance at approximately 1.02e-5, V14 at 0.58e-5. The green dashed line at zero represents the neuro-symbolic model, which produces identical explanations on every run for the same input. For compliance logging or auditability, this difference matters as much as the latency gap. Image by author.
Reading a Real Explanation
Here is the output from predict_with_explanation() on test set transaction 840, a confirmed fraud case.
Prediction : FRAUD
Confidence : 100.0%
Rules fired (4) — produced INSIDE the forward pass:
Rule           Value     Op   Threshold   Weight
------------------------------------------------
LOW_V17        -0.553    <    -0.135      0.081
LOW_V14        -0.582    <    -0.440      0.071
LOW_V12        -0.350    <    -0.300      0.078
HIGH_V10_NEG   -0.446    <    -0.320      0.078
Four rules fired simultaneously. Each line tells you which feature was involved, the observed value, the learned threshold it crossed, and the weight that rule carries in the symbolic layer. This output was not reconstructed from the prediction after the fact. It was produced at the same moment as the prediction, as part of the same computation.
Notice that LOW_V4 (the rule with 57% of the symbolic weight) did not fire on this transaction. The four rules that did fire (V17, V14, V12, V10) all carry relatively modest weights individually. The model still predicted FRAUD at 100% confidence, which means the neural backbone carried this decision. The symbolic layer’s role here was to identify the specific pattern of four anomalous V-feature values firing together, and surface it as a readable explanation.
This is actually a useful demonstration of how the two components interact. The neural backbone produces the prediction. The symbolic layer produces the justification. They are not always in perfect alignment, and that tension is informative.
Rule activations for test set transaction 840, a confirmed fraud case. Four of the six rules fired: LOW_V17 with the strongest activation at approximately 0.70, followed by LOW_V14, HIGH_V10_NEG, and LOW_V12. HIGH_AMOUNT and LOW_V4 did not cross their respective thresholds for this transaction despite LOW_V4 carrying 57% of the symbolic weight globally. This output was produced during the forward pass — not reconstructed from it. Image by author.
The same benchmark run records how frequently each rule fired across fraud-predicted transactions — produced during inference with no separate computation. Because the 100-sample window reflects the real-world 0.17% fraud rate, it contains very few fraud predictions, so the bars are thin. The pattern becomes clearer across the full test set, but even here it confirms the mechanism is working.
Rule fire-rate across fraud-predicted transactions in the 100-sample benchmark set. Because the benchmark draws from the first 100 test samples at the real-world 0.17% fraud rate, very few fraud predictions fall within this window — which is why the bars appear empty. The fire-rate statistics are meaningful when computed across the full test set. The chart demonstrates the mechanism works; the sample selection for benchmarking was optimised for latency measurement, not coverage. Image by author.
The Full Comparison
Seven-dimension comparison of SHAP and the neuro-symbolic approach measured on the Kaggle creditcard dataset. Latency and speedup values are from the 100-sample benchmark. Consistency reflects the deterministic vs stochastic nature of each explanation method. Performance metrics (precision, recall, AUC) are intentionally absent from this table — the two models are deliberately close on those dimensions, and the comparison here is about what happens after the prediction, not the prediction itself. Image by author.
What Should Be Done Differently
The V4 weight collapse. The softmax over rule_weights failed to prevent one rule from accumulating 57% of the symbolic weight. The correct fix is a regularisation term during training that penalises weight concentration. For example, an entropy penalty on the softmax output that actively rewards more uniform distributions across rules. Without this, the symbolic layer can degrade toward a single-feature gate, which weakens the interpretability argument.
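A sketch of that entropy penalty; the function name and the lam coefficient are my placeholders:

```python
import torch

def entropy_penalty(rule_logits, lam=0.01):
    """Concentration penalty on the softmax rule weights: zero at the
    uniform (maximum-entropy) distribution, largest when one rule holds
    all the mass. Added to the task loss during training.
    lam is a placeholder coefficient, not a tuned value."""
    w = torch.softmax(rule_logits, dim=0)
    entropy = -(w * torch.log(w + 1e-12)).sum()
    max_entropy = torch.log(torch.tensor(float(len(rule_logits))))
    return lam * (max_entropy - entropy)

uniform = entropy_penalty(torch.zeros(6))                           # ~0: nothing to penalise
skewed = entropy_penalty(torch.tensor([0., 0., 0., 0., 0., 5.]))    # > 0: one rule dominates
```

Because the penalty grows as the weight distribution concentrates, gradient descent is pushed back toward spreading signal across rules whenever the task loss permits it.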
The HIGH_AMOUNT threshold. The learned threshold for Amount converged to -0.011 (scaled), which is effectively zero, so the rule fires on almost every transaction and contributes very little discrimination. The likely cause is a combination of two things: Amount is genuinely less predictive on this dataset than domain intuition suggests (the V features dominate in the published literature [7, 8]), and the initialisation pulled the threshold into a low-information region. A bounded threshold initialisation, or a learned gate that can suppress low-utility rules, would handle this more cleanly.
Decision threshold tuning. Both models were evaluated at a 0.5 threshold. In practice, the right threshold depends on the cost ratio between false positives and missed fraud in the deployment context. This is especially important for the neuro-symbolic model where precision is slightly lower. A threshold shift toward 0.6 or 0.65 would recover precision at the cost of some recall. This trade-off should be made deliberately, not left at the default.
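The sweep itself is straightforward to run against held-out scores. The sketch below uses synthetic scores and labels (not the article's model outputs) purely to show the mechanics of the trade-off:

```python
import numpy as np

def precision_recall_at(scores, labels, threshold):
    """Precision and recall for a given decision threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

rng = np.random.default_rng(0)
labels = (rng.random(5000) < 0.02).astype(int)   # rare positive class
scores = 0.7 * rng.random(5000) + 0.3 * labels   # overlapping score bands
for t in (0.50, 0.60, 0.65):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold trades recall for precision; the right operating point depends on the cost of each error type.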
Where This Fits
This is the fifth article in a series on neuro-symbolic approaches to fraud detection; the earlier articles cover the foundations.
This article adds a fifth dimension: the explainability architecture itself. Not just whether the model can be explained, but whether the explanation can be produced at the speed and consistency that production systems actually require.
SHAP remains the right tool for model debugging, feature selection, and exploratory analysis. What this experiment shows is that when explanation needs to be part of the decision (logged in real time, auditable per transaction, available to downstream systems), the architecture has to change. Post-hoc methods are too slow and too inconsistent for that role.
The neuro-symbolic approach trades a small amount of precision for an explanation that is deterministic, immediate, and structurally inseparable from the prediction itself. Whether that trade-off is worthwhile depends on your system. The numbers are here to help you decide.
Code: https://github.com/Emmimal/neuro-symbolic-xai-fraud/
Disclosure
This article is based on independent experiments using publicly available data (Kaggle Credit Card Fraud dataset) and open-source tools. No proprietary datasets, company resources, or confidential information were used. The results and code are fully reproducible as described, and the GitHub repository contains the complete implementation. The views and conclusions expressed here are my own and do not represent any employer or organization.
References
[1] ULB Machine Learning Group. Credit Card Fraud Detection. Kaggle, 2018. Available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (Dataset released under the Open Database License. Original research: Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G., 2015.)
[2] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30. Available at: https://arxiv.org/abs/1705.07874
[3] Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the Theory of Games (Vol. 2, pp. 307–317). Princeton University Press. https://doi.org/10.1515/9781400881970-018
[4] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. Available at: https://arxiv.org/abs/1106.1813
[5] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML). Available at: https://arxiv.org/abs/1502.03167
[6] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. Available at: https://jmlr.org/papers/v15/srivastava14a.html
[7] Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
[8] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y., & Bontempi, G. (2018). SCARFF: A scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182–194. https://doi.org/10.1016/j.inffus.2017.09.005
[9] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). Available at: https://arxiv.org/abs/1412.6980

