Data Intelligence and Analytics
Real-Time Fraud Detection for a Payments Platform
Mar 10, 2025

Fraud losses are a cost of doing business in payments — until they aren’t. A Southeast Asian fintech operating a payment gateway for SME merchants was losing MYR 1.8 million per quarter to fraud that its rule-based detection system consistently missed. The operations team was adding rules faster than fraudsters were finding workarounds. Nematix was engaged to replace the static rule engine with a machine-learning detection layer capable of adapting to novel fraud patterns in near real time.

The outcome: a 61% reduction in fraud loss in the first three months post-deployment, with false-positive rates held flat — meaning legitimate transactions weren’t disrupted in the process.

The Situation

The platform processed approximately 2.4 million transactions per month across four merchant categories: food and beverage, retail, professional services, and e-commerce. The fraud detection stack in place was a manually maintained rule engine: a set of heuristics (transaction value thresholds, velocity checks, geographic anomalies) that the risk team updated when new fraud patterns were identified.

The fundamental problem with rule-based detection is latency — not technical latency, but organisational latency. A new fraud pattern had to cause enough loss to be noticed, get escalated to the risk team, be understood well enough to write a rule, be tested to avoid false positives, and then be deployed. This cycle took between two and six weeks. Sophisticated fraudsters exploited that window systematically.

Three fraud vectors were causing the majority of the losses:

  1. Card-testing attacks — automated probes using stolen card numbers to test validity across many merchants before committing to high-value fraud
  2. Account takeover with behavioural mimicry — attackers who had acquired login credentials and then replicated the victim’s normal transaction patterns closely enough to avoid velocity-based flags
  3. Merchant collusion — a small number of merchants running refund loops and fictitious transactions that looked legitimate in isolation but formed a pattern across the network

The rule engine caught some card-testing attempts but missed the behavioural mimicry and merchant collusion patterns entirely.

The Challenge

Three factors made this harder than a standard fraud model deployment.

Severe class imbalance. Fraud represented 0.04% of transactions. Training a model on imbalanced data without careful handling produces a classifier that predicts “not fraud” for everything and achieves 99.96% accuracy while being entirely useless. Choosing the right sampling strategy and optimising for the right metrics (precision-recall, not accuracy) were non-negotiable.
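
The accuracy trap is easy to demonstrate. The sketch below (illustrative numbers only, at the 0.04% positive rate from the engagement) shows a degenerate classifier that never predicts fraud scoring near-perfect accuracy with zero recall:

```python
# Illustrative only: why accuracy misleads at a 0.04% positive rate.
def evaluate(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    accuracy = correct / len(labels)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# 10,000 transactions at a 0.04% fraud rate → 4 fraud cases
labels = [1] * 4 + [0] * 9996
always_legit = [0] * 10000  # "classifier" that predicts not-fraud for everything

acc, rec = evaluate(labels, always_legit)
print(f"accuracy={acc:.4f}, recall={rec:.2f}")  # accuracy=0.9996, recall=0.00
```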

Latency constraint. The fraud decision had to be returned within 800 milliseconds — the platform’s transaction authorisation timeout. Model inference, feature computation, and the decisioning logic had to fit within that window without degrading user experience.

False-positive sensitivity. The client’s merchant base was SME-heavy. A restaurant owner whose legitimate end-of-day reconciliation batch gets flagged as fraud loses trust immediately. The model had to achieve fraud reduction without increasing the rate at which legitimate transactions were declined.

Our Approach

Weeks 1–4: Data audit and feature engineering

We conducted a retrospective analysis of 14 months of transaction data — 33 million records — with fraud labels applied by the operations team’s manual review process. The audit confirmed the class imbalance (0.04% positive rate) and identified labelling inconsistencies in older records that required cleaning before model training.

Feature engineering produced 87 candidate features across four categories:

  • Transaction features: amount, merchant category, channel (web/mobile/POS), time of day, day of week
  • Velocity features: transaction count and volume per card, per merchant, per IP address in rolling 1-hour, 24-hour, and 7-day windows
  • Behavioural features: deviation from the cardholder’s historical transaction profile (amount percentile, unusual merchant category, new geographic zone)
  • Network features: card and merchant risk scores derived from co-occurrence in prior fraud cases
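
The velocity features above can be sketched as a rolling-window counter. This is a minimal stand-in (names and the lazy-eviction approach are illustrative, not the production implementation), showing how per-card, per-merchant, and per-IP counts and volumes over 1-hour, 24-hour, and 7-day windows could be maintained:

```python
# Minimal sketch of a rolling-window velocity counter. Old events are
# evicted lazily on each read; production systems typically keep these
# counters in a shared cache rather than in-process.
from collections import deque

class VelocityCounter:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, amount) pairs within the window

    def add(self, ts, amount):
        self.events.append((ts, amount))

    def _evict(self, now):
        # Drop events that have aged out of the rolling window
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def volume(self, now):
        self._evict(now)
        return sum(a for _, a in self.events)

# e.g. transaction count and volume per card over a rolling 1-hour window
card_1h = VelocityCounter(window_seconds=3600)
card_1h.add(ts=0, amount=50.0)
card_1h.add(ts=1800, amount=120.0)
card_1h.add(ts=4000, amount=75.0)
print(card_1h.count(now=4100), card_1h.volume(now=4100))  # 2 195.0
```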

Weeks 5–9: Model development

We trained an XGBoost gradient-boosted classifier, chosen for its performance on structured tabular data, its interpretability via SHAP values, and its inference speed. Class imbalance was handled with a combination of SMOTE oversampling on the training set and adjusted class weights in the loss function.
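
The two imbalance levers can be sketched as follows. The real pipeline used imbalanced-learn's SMOTE; this simplified version interpolates between any two minority rows (true SMOTE interpolates toward a k-nearest neighbour), and the weight function computes the negative-to-positive ratio, the usual starting point for XGBoost's scale_pos_weight parameter:

```python
# Hedged sketch of the two imbalance-handling levers: SMOTE-style synthetic
# oversampling and class-weight adjustment. Simplified for illustration.
import random

def smote_like(minority_rows, n_synthetic, seed=0):
    """Generate synthetic minority samples by interpolating between two
    real minority rows: x_new = x_a + u * (x_b - x_a), u in [0, 1].
    (Real SMOTE restricts x_b to the k nearest neighbours of x_a.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        a, b = rng.sample(minority_rows, 2)
        u = rng.random()
        synthetic.append([xa + u * (xb - xa) for xa, xb in zip(a, b)])
    return synthetic

def scale_pos_weight(labels):
    """Negative-to-positive ratio used to upweight the fraud class."""
    pos = sum(labels)
    return (len(labels) - pos) / pos

fraud_rows = [[100.0, 3.0], [400.0, 9.0], [250.0, 5.0]]  # toy feature rows
print(len(smote_like(fraud_rows, n_synthetic=5)))   # 5
print(scale_pos_weight([1] * 4 + [0] * 9996))       # 2499.0
```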

SHAP (SHapley Additive exPlanations) values were central to the approach — not just for model interpretability, but because the risk team needed to understand why a transaction was flagged in order to act on it. A black-box score with no explanation creates operational friction; a score with SHAP feature contributions gives the analyst a starting point.
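
How an analyst-facing explanation might be assembled from a SHAP payload: rank the per-feature contributions by absolute magnitude and surface the top drivers. The feature names and contribution values below are invented for illustration, not drawn from the production model:

```python
# Hypothetical sketch: turning a SHAP contribution payload into the
# "starting point" an analyst sees. Values are made up for illustration.
def top_drivers(shap_contributions, k=3):
    ranked = sorted(shap_contributions.items(),
                    key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

payload = {
    "txn_count_1h": 0.41,    # velocity spike pushed the score up most
    "amount_vs_p95": 0.18,
    "new_geo_zone": 0.09,
    "hour_of_day": -0.02,    # slightly reduced the score
}
print(top_drivers(payload))
```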

The model was calibrated against a held-out validation set stratified by time rather than split randomly, to prevent temporal features from leaking future information into training.
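
The time-stratified split amounts to training on everything before a cutoff and validating on everything after it, so rolling-window and behavioural features never see the future. A minimal sketch:

```python
# Minimal sketch of a time-based train/validation split (not a random
# split), as described above. Record layout is illustrative.
def time_split(records, cutoff_ts):
    """records: (timestamp, features, label) tuples."""
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff_ts]
    valid = [r for r in ordered if r[0] >= cutoff_ts]
    return train, valid

records = [(3, "x3", 0), (1, "x1", 0), (4, "x4", 1), (2, "x2", 0)]
train, valid = time_split(records, cutoff_ts=3)
print([r[0] for r in train], [r[0] for r in valid])  # [1, 2] [3, 4]
```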

Weeks 10–14: Inference infrastructure and integration

The model was deployed as a REST microservice with pre-computed features served from Redis. Velocity counters (rolling windows) were updated in real time as transactions were processed. At inference time, the microservice pulled pre-computed behavioural features from cache, computed the remaining real-time features, and returned a fraud probability score and SHAP explanation payload in under 120ms at the 95th percentile.
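
The inference path can be sketched as: cached behavioural features merged with features computed on the spot, then scored, with the explanation payload returned alongside the probability. A plain dict stands in for Redis here, and all names, keys, and the toy scoring function are illustrative assumptions:

```python
# Hedged sketch of the inference path. A dict stands in for Redis;
# toy_model stands in for the XGBoost + SHAP call.
feature_cache = {  # stand-in for Redis: card_id → pre-computed features
    "card_123": {"amount_p95": 310.0, "usual_category": "retail"},
}

def score_transaction(txn, model_score_fn):
    cached = feature_cache.get(txn["card_id"], {})            # cache lookup
    realtime = {"amount": txn["amount"], "hour": txn["hour"]}  # computed now
    features = {**cached, **realtime}
    score, contributions = model_score_fn(features)
    return {"fraud_probability": score, "explanation": contributions}

def toy_model(features):
    # Illustrative stand-in: score rises with amount, capped at 1.0
    score = min(1.0, features["amount"] / 1000.0)
    return score, {"amount": score}  # one contribution entry per feature

resp = score_transaction(
    {"card_id": "card_123", "amount": 900.0, "hour": 2}, toy_model)
print(resp["fraud_probability"])  # 0.9
```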

Decisioning logic used a dual-threshold approach: transactions above a high-confidence threshold were auto-declined; transactions in a grey zone were routed to a manual review queue; transactions below the low threshold were passed. Thresholds were calibrated to keep the false-positive rate at or below the baseline rule engine rate.
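
The dual-threshold logic maps directly to code. The threshold values below are illustrative; in production they were calibrated against the baseline rule engine's false-positive rate:

```python
# Sketch of the dual-threshold decisioning logic. Threshold values are
# illustrative assumptions, not the calibrated production values.
HIGH_THRESHOLD = 0.92   # at or above: auto-decline
LOW_THRESHOLD = 0.40    # below: auto-approve

def decide(fraud_probability):
    if fraud_probability >= HIGH_THRESHOLD:
        return "decline"
    if fraud_probability >= LOW_THRESHOLD:
        return "manual_review"   # grey zone → analyst queue
    return "approve"

print(decide(0.97), decide(0.55), decide(0.05))
# decline manual_review approve
```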

Weeks 15–16: Risk team handover and monitoring

We built a monitoring dashboard tracking fraud rate by category, false-positive rate, model score distribution, and feature drift indicators. Model retraining triggers were defined: a significant drift in score distribution or a spike in false negatives flagged in the review queue would initiate a retraining cycle. The risk team was trained on the SHAP explanation interface and the retraining process.
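
One common way to operationalise the score-distribution drift trigger is the Population Stability Index (PSI) over binned model scores. The sketch below uses the rule-of-thumb cutoff of PSI > 0.25 for significant drift; both the cutoff and the bin fractions are assumptions for illustration, not the client's calibrated values:

```python
# Hedged sketch of a PSI-based retraining trigger over score bins.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.70, 0.20, 0.07, 0.03]   # score-bin fractions at deployment
current  = [0.40, 0.30, 0.20, 0.10]   # score-bin fractions this week

drift = psi(baseline, current)
if drift > 0.25:   # rule-of-thumb threshold for significant drift
    print(f"PSI={drift:.2f}: trigger retraining cycle")
```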

Outcome

| Metric | Baseline (rule engine) | 90 days post-deployment |
| --- | --- | --- |
| Quarterly fraud loss | MYR 1.8M | MYR 0.70M (−61%) |
| False-positive rate | 0.31% | 0.29% |
| Fraud detection recall | ~40% (estimated) | 78% |
| Card-testing attack block rate | ~55% | 91% |
| Merchant collusion flagged | 0 cases/quarter | 14 cases/quarter |
| Median fraud decision latency | 42ms | 118ms (within 800ms SLA) |

The merchant collusion detection — previously invisible to the rule engine — was the most operationally significant result. Fourteen cases flagged in the first quarter represented a fraud vector that had been running undetected.

Key Takeaways

Optimise for the right metric. Accuracy is the wrong success metric for fraud detection on imbalanced data. The model was evaluated and tuned on precision-recall tradeoff and F-beta score, not overall accuracy. This distinction matters at model selection, training, and business sign-off.

SHAP values are an operational requirement, not a nice-to-have. Fraud analysts who can’t understand why a transaction was flagged can’t act on it efficiently. Explainability reduced manual review time and increased analyst confidence in the model’s outputs.

Rule engines and ML aren’t mutually exclusive. The rule engine was retained for a small subset of high-confidence fraud patterns that required zero-latency decisions. The ML model handled the probabilistic cases. A hybrid architecture extracted the best from both approaches.


This engagement draws on our Data Intelligence & Analytics services. If your payments platform is outpacing your fraud detection capability, let’s talk.