[Capstone] A machine learning system that combines transaction data with clickstream browsing behaviour to score fraud risk and segment transactions into actionable tiers — enabling fraud teams to prioritise investigations while capturing 85%+ of fraud in the High Risk tier.
Fraud teams at financial institutions review thousands of transactions daily. Manual review of every transaction is infeasible — teams need an automated scoring system that prioritises the riskiest cases for investigation. With fraud representing 7.7% of transactions (916 out of 11,903), the challenge is building a model that catches fraud without drowning investigators in false positives.
The single strongest signal is payment context: international transactions alone carry a 45.3% fraud rate, and gift-card payments 36.8%. When the two are combined, nearly 9 in 10 such transactions are fraudulent. This alone could justify an immediate business rule.
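A rule like that needs no model at all. Here is a minimal sketch of such a flag in pandas — the column names (`is_international`, `payment_method`) are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Hypothetical columns — adapt to the real transaction schema.
df = pd.DataFrame({
    "is_international": [True, True, False, False],
    "payment_method":   ["gift_card", "credit_card", "gift_card", "credit_card"],
})

# Flag the highest-risk combination for immediate manual review.
df["rule_flag"] = df["is_international"] & (df["payment_method"] == "gift_card")
print(df["rule_flag"].tolist())  # → [True, False, False, False]
```

Because the rule is deterministic, it can run upstream of any model and route matches straight to the High Risk tier.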
Fraudulent sessions visit 2.5x as many sensitive pages (payment methods, address changes). They spend 77 seconds on the forgot-password page versus 32 for legitimate users (struggling with stolen credentials), yet move through payment pages faster (74s vs 92s — pre-planned, not browsing).
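Features like these fall out of a simple per-session aggregation over the raw page-view log. A sketch, assuming a hypothetical event table with `session_id`, `page`, and `dwell_s` columns (the sensitive-page set and navigation entropy mirror the features named later in this write-up):

```python
import numpy as np
import pandas as pd

# Hypothetical event log: one row per page view, dwell time in seconds.
events = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 2, 2],
    "page":       ["payment_methods", "forgot_password", "payment_methods",
                   "home", "product", "product", "checkout"],
    "dwell_s":    [40, 77, 34, 20, 55, 60, 92],
})

SENSITIVE = {"payment_methods", "address_change", "forgot_password"}

def navigation_entropy(pages: pd.Series) -> float:
    """Shannon entropy of a session's page-visit distribution.
    Low entropy = a narrow, targeted path; high = broad browsing."""
    p = pages.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

features = events.groupby("session_id").agg(
    sensitive_pages=("page", lambda s: s.isin(SENSITIVE).sum()),
    mean_dwell_s=("dwell_s", "mean"),
    nav_entropy=("page", navigation_entropy),
)
print(features)
```

Session 1 (straight to sensitive pages, low entropy) scores very differently from session 2 (ordinary browse-then-checkout flow), which is exactly the separation the behavioural features exploit.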
Fraud rates during early morning hours reach double the daily average, likely because fewer staff are monitoring and fraudsters exploit reduced oversight.
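Measuring that lift is one groupby: fraud rate per hour divided by the overall daily rate. A toy sketch with hypothetical `ts` and `is_fraud` fields:

```python
import pandas as pd

# Hypothetical fields: transaction timestamp plus a binary fraud label.
tx = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 02:10", "2024-01-01 02:45",
                          "2024-01-01 14:05", "2024-01-01 14:30"]),
    "is_fraud": [1, 1, 0, 1],
})

# Hourly fraud rate relative to the daily average; lift > 2.0 marks
# hours running at double the baseline.
hourly = tx.groupby(tx["ts"].dt.hour)["is_fraud"].mean()
lift = hourly / tx["is_fraud"].mean()
print(lift)
```

The resulting hour-of-day lift can feed the model directly as a feature, or drive staffing decisions for the overnight monitoring gap.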
Paired t-tests on 10-fold cross-validation confirm that performance differences between models are statistically significant (p &lt; 0.05), not random noise. Calibration analysis shows that predicted probabilities closely track observed fraud rates, so the scores can be read as genuine risk estimates rather than arbitrary rankings.
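The paired test works because both models are scored on the same ten folds, so differences are computed fold-by-fold. A minimal sketch with `scipy.stats.ttest_rel` — the AUC values below are made up for illustration, not the project's actual results:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold AUC scores from the SAME 10 CV folds (paired samples).
auc_model_a = np.array([0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91, 0.93, 0.92, 0.94])
auc_model_b = np.array([0.88, 0.91, 0.89, 0.86, 0.92, 0.88, 0.89, 0.90, 0.90, 0.91])

# Paired t-test: controls for the shared data split in each fold,
# which an unpaired test would ignore.
t_stat, p_value = stats.ttest_rel(auc_model_a, auc_model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

Pairing matters: fold-to-fold variance is shared by both models, and differencing it out gives the test far more power than comparing two independent means.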
This project reinforced that the most impactful part of data science often isn't the model — it's the feature engineering. The clickstream behavioural features (navigation entropy, dwell times, sensitive page counts) weren't in the raw data. They came from thinking about what fraud actually looks like from the user's perspective: someone fumbling with stolen credentials, skipping the shopping experience, and rushing through payment pages.
I also learned the importance of temporal validation. A random train/test split gave misleadingly optimistic results because it leaked future patterns into training. Switching to a time-based split produced honest, production-realistic metrics — and revealed that a stacked ensemble can actually underperform individual models when the meta-learner can't adapt to temporal drift.
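The fix is mechanically simple: sort by time and cut once, so the model trains only on the past. A sketch with a hypothetical timestamped frame and an 80/20 cut:

```python
import pandas as pd

# Hypothetical frame: timestamped transactions, sorted chronologically.
tx = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=10, freq="D"),
    "amount": range(10),
})

# Time-based split: train on the earliest 80%, test on the final 20%.
# Unlike a shuffled random split, no future pattern leaks into training.
tx = tx.sort_values("ts").reset_index(drop=True)
cut = int(len(tx) * 0.8)
train, test = tx.iloc[:cut], tx.iloc[cut:]
print(len(train), len(test))  # → 8 2
```

For hyperparameter tuning under the same constraint, scikit-learn's `TimeSeriesSplit` applies this idea across multiple expanding-window folds.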
The complete notebook includes 64 cells covering EDA, modelling, SHAP explainability, calibration analysis, and risk segmentation.