HDB Resale Price Prediction

A machine learning system that predicts Singapore HDB resale flat prices from structural, locational, and temporal features. It lets WOW! Real Estate Agency give buyers and sellers data-driven pricing recommendations through an interactive calculator.

Type: GA Data Sprint
Dataset: 150,634 transactions (2012–2021)
Best R²: 0.88
Tools: Python, LightGBM, SHAP, Streamlit

The Problem

WOW! Real Estate Agency operates across Singapore's HDB resale market, the largest public housing market in Southeast Asia, where about 80% of the resident population lives in HDB flats. Agents need accurate price estimates to advise buyers on fair market value and sellers on optimal listing prices, but manual valuations are inconsistent and slow. The challenge: build a model that predicts resale prices with an average error of about 8% across 26 towns, 7 flat types, and a decade of market dynamics.

Results at a Glance

R² Score (Validation): 0.88
MAPE (Avg Error %): ~8%
Engineered Features: 40+
Models Compared: 5
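
For readers unfamiliar with the two headline metrics, here is a minimal sketch of how R² and MAPE are computed. The prices are made up for illustration and are not project data; the function names are my own.

```python
from typing import Sequence


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction (0.08 means 8%)."""
    errors = (abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return sum(errors) / len(actual)


# Illustrative resale prices in SGD (made up, not from the dataset)
actual = [400_000, 500_000, 600_000, 700_000]
predicted = [380_000, 520_000, 630_000, 680_000]

r2 = r2_score(actual, predicted)
avg_pct_error = mape(actual, predicted)
print(f"R2 = {r2:.3f}, MAPE = {avg_pct_error:.1%}")
```

R² measures how much of the price variance the model explains, while MAPE reads as "on average, the prediction is off by this share of the true price", which is the more natural figure to quote to a client.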

Key Findings

Floor area is the dominant price driver

SHAP analysis confirms floor area as the strongest predictor, with each additional square metre adding approximately SGD 3,800 to the resale price. This structural relationship holds consistently across all towns and flat types, making it the single most important feature for valuation.

Location premiums vary by up to SGD 120,000

Central Area commands the highest premium at ~SGD 120K above average, while estates like Sembawang and Woodlands sit at the lower end. The mature vs non-mature estate distinction alone accounts for SGD 40-60K of this gap, driven by established amenities, school proximity, and transport connectivity.

Remaining lease drives a measurable depreciation curve

Flats with leases starting in the 2000s command 20-30% premiums over 1980s-era flats. With Singapore's 99-year leasehold model, every additional year of remaining lease adds ~SGD 1,200 in value — critical intelligence for buyers weighing older flats in prime locations vs newer flats in developing towns.
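
As a rough illustration, remaining lease can be derived from the lease commencement year and the transaction year. This is a sketch under the assumption of whole-year granularity; the function name and the example years are mine, and the per-year figure is the approximate value quoted above applied linearly, not the model's actual output.

```python
LEASE_YEARS = 99            # HDB flats are sold on 99-year leases
SGD_PER_LEASE_YEAR = 1_200  # approximate per-year effect quoted above


def remaining_lease(lease_commence_year: int, transaction_year: int) -> int:
    """Years of lease left at the time of sale (whole years, floored at 0)."""
    elapsed = transaction_year - lease_commence_year
    if elapsed < 0:
        raise ValueError("transaction cannot precede lease commencement")
    return max(LEASE_YEARS - elapsed, 0)


# Two hypothetical flats sold in 2021
older = remaining_lease(1985, 2021)  # 63 years left
newer = remaining_lease(2005, 2021)  # 83 years left

# Back-of-envelope linear reading of the per-year figure
implied_gap_sgd = (newer - older) * SGD_PER_LEASE_YEAR
print(older, newer, implied_gap_sgd)
```

This linear estimate only isolates the lease effect; observed price gaps between older and newer flats also reflect differences in size, location, and condition.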

MRT proximity adds a transit premium

Each 100 metres closer to an MRT station adds approximately SGD 1,500 to resale value. Flats within 500m of an MRT interchange station show even stronger premiums, reflecting Singapore's transit-oriented development pattern.
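
A distance-to-MRT feature can be sketched with the haversine formula. The coordinates below are placeholders rather than real station locations, and the helper names are hypothetical; the log transform mirrors the log-transformed distances mentioned in the methodology.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_station(flat: tuple, stations: dict) -> tuple:
    """Return (name, distance_m) for the station closest to flat=(lat, lon)."""
    return min(
        ((name, haversine_m(*flat, *coords)) for name, coords in stations.items()),
        key=lambda pair: pair[1],
    )


# Placeholder coordinates, NOT real station locations
stations = {"station_a": (1.3100, 103.8000), "station_b": (1.3500, 103.8500)}
flat = (1.3000, 103.8000)

name, distance_m = nearest_station(flat, stations)
log_distance = math.log1p(distance_m)  # compresses the long tail of distances
print(name, round(distance_m), round(log_distance, 2))
```

Taking the log means a move from 100m to 200m counts for more than a move from 1,900m to 2,000m, which matches the intuition that proximity matters most when a station is within walking distance.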

Methodology

1. Data Quality Audit
Analysed 78 features across 150K transactions. Identified amenity proximity blanks as "no amenity nearby" (not missing data) and imputed accordingly.
2. Exploratory Analysis
11 visualisations covering price distributions, location premiums, structural drivers, temporal trends, and amenity accessibility effects.
3. Feature Engineering
8 domain-informed features, including remaining lease, mature estate flag, MRT accessibility score, amenity density, floor-storey interaction, and log-transformed distances.
4. Model Comparison
5 models benchmarked: Mean Baseline, Ridge Regression, Random Forest, LightGBM, and XGBoost. All models share an identical preprocessing pipeline for a fair comparison.
5. Tuning & Validation
RandomizedSearchCV (20 iterations, 5-fold CV) on LightGBM. Cross-validation confirms stable performance with low variance across folds.
6. Interpretation & Deployment
SHAP explainability for feature attribution. Interactive Streamlit calculator for real-time price estimation with feature contribution breakdowns.

Technical Stack

Python 3.10+, pandas, NumPy, scikit-learn, LightGBM, XGBoost, SHAP, Matplotlib, Seaborn, Streamlit, SciPy

What I Learned

Domain Feature Engineering

Singapore's HDB market has unique dynamics (99-year leases, mature vs non-mature estates, MRT-driven development) that require domain knowledge to encode effectively. Generic feature engineering misses these signals.

Missing Data Isn't Always Missing

The amenity proximity columns taught me that blank values can carry real meaning — "no mall within 500m" is information, not a gap. Treating it as missing data would have biased the model.
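
A small sketch of that distinction in pandas. The column names and values are hypothetical, chosen only to show blanks in amenity counts being filled with zero while a genuinely missing value in another column is left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical columns and values for illustration only
df = pd.DataFrame({
    "mall_within_500m": [2.0, np.nan, 1.0, np.nan],
    "hawker_within_500m": [np.nan, 1.0, np.nan, 3.0],
    "floor_area_sqm": [90.0, 110.0, np.nan, 67.0],
})

# A blank amenity count means "none within the radius", so it becomes 0
count_cols = ["mall_within_500m", "hawker_within_500m"]
df[count_cols] = df[count_cols].fillna(0).astype(int)

# floor_area_sqm is left untouched: a blank there really is missing data
print(df)
```

Filling these blanks with a column mean or median would have told the model that flats with no mall nearby have a typical number of malls nearby, which is exactly the bias to avoid.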

Model Interpretability Matters

For a real estate agency, knowing that "LightGBM predicts SGD 450K" isn't enough. SHAP values explain why — enabling agents to justify valuations to clients with data-driven reasoning.
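
The key property that makes SHAP useful for client conversations is additivity: a base value plus per-feature contributions equals the prediction. The sketch below shows this in closed form for a linear model with features treated as independent, where each contribution is the weight times the feature's deviation from its mean. It is an illustration only: the data is synthetic, the weights are borrowed from the approximate figures in Key Findings, and tree models such as LightGBM need an explainer from the shap library rather than this formula.

```python
import numpy as np

# Synthetic background data: floor area (sqm), remaining lease (years)
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(40, 150, 500), rng.uniform(40, 95, 500)])

weights = np.array([3800.0, 1200.0])  # illustrative SGD per unit of each feature
intercept = 50_000.0                  # arbitrary constant for the toy model


def predict(features: np.ndarray) -> np.ndarray:
    """Toy linear price model."""
    return intercept + features @ weights


base_value = predict(X).mean()       # average prediction over the background data
flat = np.array([110.0, 80.0])       # one flat: 110 sqm, 80 years of lease left

# Closed-form attributions for a linear model with independent features
contributions = weights * (flat - X.mean(axis=0))

print(f"base value:      {base_value:,.0f}")
for label, value in zip(["floor_area_sqm", "remaining_lease"], contributions):
    print(f"{label:>16}: {value:+,.0f}")
print(f"prediction:      {predict(flat):,.0f}")
```

An agent can read the output as "the average flat is predicted at the base value, this flat's size moves the estimate by this much, and its lease by that much", which is the style of breakdown the calculator presents.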

End-to-End Thinking

Building the Streamlit calculator forced me to think beyond the notebook — how does the model get deployed? What inputs do end users need? This full-stack perspective strengthened the entire project.

Explore the Full Analysis

View the complete notebook, try the interactive calculator, or browse the source code.