[Capstone] A machine learning system that combines transaction data with clickstream browsing behaviour to score fraud risk and segment transactions into actionable tiers — enabling fraud teams to prioritise investigations while capturing 85%+ of fraud in the High Risk tier.
Fraud teams at financial institutions review thousands of transactions daily. Manual review of every transaction is infeasible — teams need an automated scoring system that prioritises the riskiest cases for investigation. With fraud representing 7.7% of transactions (916 out of 11,903), the challenge is building a model that catches fraud without drowning investigators in false positives.
The single strongest signal is payment context: international transactions alone carry a 45.3% fraud rate, and gift-card payments 36.8%. When the two are combined, nearly 9 in 10 such transactions are fraudulent. This alone could justify an immediate business rule.
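A rule like that needs no model at all. Here is a minimal sketch of such a flag in pandas — the column names (`is_international`, `payment_method`) are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Hypothetical columns — adapt to the real transaction schema.
df = pd.DataFrame({
    "is_international": [True, True, False, False],
    "payment_method":   ["gift_card", "credit_card", "gift_card", "credit_card"],
})

# Flag the highest-risk combination for immediate manual review.
df["rule_flag"] = df["is_international"] & (df["payment_method"] == "gift_card")
print(df["rule_flag"].tolist())  # → [True, False, False, False]
```

Because the rule is deterministic, it can run upstream of any model and route matches straight to the High Risk tier.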
Fraudulent sessions visit 2.5x as many sensitive pages (payment methods, address changes). They spend 77 seconds on the forgot-password page versus 32 for legitimate users (struggling with stolen credentials), yet move through payment pages faster (74s vs 92s — pre-planned, not browsing).
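Features like these fall out of a simple per-session aggregation over the raw page-view log. A sketch, assuming a hypothetical event table with `session_id`, `page`, and `dwell_s` columns (the sensitive-page set and navigation entropy mirror the features named later in this write-up):

```python
import numpy as np
import pandas as pd

# Hypothetical event log: one row per page view, dwell time in seconds.
events = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 2, 2],
    "page":       ["payment_methods", "forgot_password", "payment_methods",
                   "home", "product", "product", "checkout"],
    "dwell_s":    [40, 77, 34, 20, 55, 60, 92],
})

SENSITIVE = {"payment_methods", "address_change", "forgot_password"}

def navigation_entropy(pages: pd.Series) -> float:
    """Shannon entropy of a session's page-visit distribution.
    Low entropy = a narrow, targeted path; high = broad browsing."""
    p = pages.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

features = events.groupby("session_id").agg(
    sensitive_pages=("page", lambda s: s.isin(SENSITIVE).sum()),
    mean_dwell_s=("dwell_s", "mean"),
    nav_entropy=("page", navigation_entropy),
)
print(features)
```

Session 1 (straight to sensitive pages, low entropy) scores very differently from session 2 (ordinary browse-then-checkout flow), which is exactly the separation the behavioural features exploit.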
Fraud rates during early morning hours reach double the daily average, likely because fewer staff are monitoring and fraudsters exploit reduced oversight.
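Measuring that lift is one groupby: fraud rate per hour divided by the overall daily rate. A toy sketch with hypothetical `ts` and `is_fraud` fields:

```python
import pandas as pd

# Hypothetical fields: transaction timestamp plus a binary fraud label.
tx = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 02:10", "2024-01-01 02:45",
                          "2024-01-01 14:05", "2024-01-01 14:30"]),
    "is_fraud": [1, 1, 0, 1],
})

# Hourly fraud rate relative to the daily average; lift > 2.0 marks
# hours running at double the baseline.
hourly = tx.groupby(tx["ts"].dt.hour)["is_fraud"].mean()
lift = hourly / tx["is_fraud"].mean()
print(lift)
```

The resulting hour-of-day lift can feed the model directly as a feature, or drive staffing decisions for the overnight monitoring gap.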
Paired t-tests on 10-fold cross-validation confirm that performance differences between models are statistically significant (p &lt; 0.05), not random noise. Calibration analysis shows that predicted probabilities closely track observed fraud rates, so the scores can be read as genuine risk estimates rather than arbitrary rankings.
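The paired test works because both models are scored on the same ten folds, so differences are computed fold-by-fold. A minimal sketch with `scipy.stats.ttest_rel` — the AUC values below are made up for illustration, not the project's actual results:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold AUC scores from the SAME 10 CV folds (paired samples).
auc_model_a = np.array([0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91, 0.93, 0.92, 0.94])
auc_model_b = np.array([0.88, 0.91, 0.89, 0.86, 0.92, 0.88, 0.89, 0.90, 0.90, 0.91])

# Paired t-test: controls for the shared data split in each fold,
# which an unpaired test would ignore.
t_stat, p_value = stats.ttest_rel(auc_model_a, auc_model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

Pairing matters: fold-to-fold variance is shared by both models, and differencing it out gives the test far more power than comparing two independent means.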
This project reinforced that the most impactful part of data science often isn't the model — it's the feature engineering. The clickstream behavioural features (navigation entropy, dwell times, sensitive page counts) weren't in the raw data. They came from thinking about what fraud actually looks like from the user's perspective: someone fumbling with stolen credentials, skipping the shopping experience, and rushing through payment pages.
I also learned the importance of temporal validation. A random train/test split gave misleadingly optimistic results because it leaked future patterns into training. Switching to a time-based split produced honest, production-realistic metrics — and revealed that a stacked ensemble can actually underperform individual models when the meta-learner can't adapt to temporal drift.
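The fix is mechanically simple: sort by time and cut once, so the model trains only on the past. A sketch with a hypothetical timestamped frame and an 80/20 cut:

```python
import pandas as pd

# Hypothetical frame: timestamped transactions, sorted chronologically.
tx = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=10, freq="D"),
    "amount": range(10),
})

# Time-based split: train on the earliest 80%, test on the final 20%.
# Unlike a shuffled random split, no future pattern leaks into training.
tx = tx.sort_values("ts").reset_index(drop=True)
cut = int(len(tx) * 0.8)
train, test = tx.iloc[:cut], tx.iloc[cut:]
print(len(train), len(test))  # → 8 2
```

For hyperparameter tuning under the same constraint, scikit-learn's `TimeSeriesSplit` applies this idea across multiple expanding-window folds.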
The complete notebook includes 64 cells covering EDA, modelling, SHAP explainability, calibration analysis, and risk segmentation.