
🎯 The Objective
This project implements an end-to-end machine learning pipeline to analyze and predict house prices using over 80 structural, geospatial, and quality-based variables. The objective goes beyond simply achieving accurate price predictions; it explores alternative problem framings (classification for budget tiering) and uncovers hidden market structures using advanced unsupervised learning techniques.
🏗️ The Architecture & Methodology
To reflect a real-world machine learning workflow, the architecture was modularized into four distinct phases:
- Data Preprocessing & Engineering: Handled a wide dataset of heterogeneous features. Dropped high-nullity columns, filtered near-zero-variance features, and removed extreme outliers using robust Z-score thresholds. Engineered new, high-impact domain features such as `HouseAge` and `TotalBath`.
- Supervised Regression (Price Prediction): Exhaustively evaluated baseline models against advanced tree-based ensembles to predict the continuous `price` target variable.
- Classification (Market Tiering): Discretized house prices into Low, Medium, and High categories. This reframing improves interpretability for non-technical stakeholders who need to group homes into distinct market segments.
- Unsupervised Learning (Market Segmentation): Utilized density-based clustering to identify natural, hidden groupings in the housing market without relying on price labels.
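The engineering step above can be sketched in a few lines of pandas. The column names (`YrSold`, `YearBuilt`, `FullBath`, `HalfBath`, `SalePrice`) and the 3.5 cut-off are illustrative assumptions, not the project's actual schema or thresholds:

```python
import pandas as pd

# Toy frame standing in for the real housing dataset (columns are assumed).
df = pd.DataFrame({
    "YrSold":    [2008, 2007, 2009, 2010],
    "YearBuilt": [2003, 1976, 2001, 1930],
    "FullBath":  [2, 2, 2, 1],
    "HalfBath":  [1, 0, 1, 0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Engineered features named in the write-up.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]

# Robust Z-score (median/MAD-based) outlier filter on the target;
# 0.6745 rescales the MAD to be comparable to a standard deviation.
med = df["SalePrice"].median()
mad = (df["SalePrice"] - med).abs().median()
robust_z = 0.6745 * (df["SalePrice"] - med) / mad
df = df[robust_z.abs() < 3.5]
```

Using the median and MAD instead of mean and standard deviation keeps the threshold itself from being dragged around by the very outliers it is meant to remove.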
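The market-tiering reframing amounts to discretizing the continuous target. A minimal sketch using equal-frequency bins (the project's actual cut points are not stated, so `pd.qcut` is one reasonable choice):

```python
import pandas as pd

prices = pd.Series([95000, 140000, 185000, 230000, 310000, 455000])

# Equal-frequency tertiles mapped to the three market tiers.
tiers = pd.qcut(prices, q=3, labels=["Low", "Medium", "High"])
```

Fixed dollar thresholds via `pd.cut` would work equally well if the tiers need to stay stable across retraining runs.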
📊 Key Performance Metrics
After rigorous hyperparameter tuning via GridSearchCV, tree-based ensemble models significantly outperformed baseline linear models in capturing non-linear relationships.
- Top Regression Model: CatBoost
- Test R-Squared: 0.9252
- Test RMSLE: 0.1271
- Test RMSE: $19,465
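The tuning loop described above follows the standard `GridSearchCV` pattern. A minimal sketch on synthetic data, using scikit-learn's `GradientBoostingRegressor` as a stand-in for CatBoost (which is a separate third-party package); the parameter grid is illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the housing features.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validated search over a small, illustrative grid.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)

# Held-out evaluation with the refit best estimator.
r2 = r2_score(y_te, search.predict(X_te))
```

Swapping in `CatBoostRegressor` (or `XGBRegressor`) requires only changing the estimator and grid keys, since both expose the scikit-learn estimator interface.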
💡 Core Insights & Business Impact
- Predictive Power: The tuned CatBoost and XGBoost models yield highly accurate and consistent numerical price estimates, explaining over 92% of the variance in the test data.
- Robust Classification: For budget tiering, the XGBoost classifier generalized more robustly than a standard Random Forest baseline.
- Latent Segments Identified: While baseline K-Means was tested, combining UMAP with HDBSCAN successfully isolated dense, organically shaped market segments while accurately flagging highly irregular properties as market anomalies.
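The reduce-then-cluster pipeline can be sketched as follows. Since `umap-learn` and `hdbscan` are third-party packages, PCA and DBSCAN are used here as stand-ins for UMAP and HDBSCAN; the key behavior is the same: dense regions get cluster labels, and isolated points are flagged as noise (`-1`), which is how anomalous properties surface. The `eps` value is data-dependent and purely illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic "market": three dense segments plus one anomalous property.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=42)
X = np.vstack([X, [[20.0, 20.0]]])  # synthetic anomaly

# Scale, reduce (PCA standing in for UMAP), then density-cluster
# (DBSCAN standing in for HDBSCAN).
emb = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(emb)

# Density-based methods label outliers as -1 rather than forcing them
# into a cluster, unlike K-Means.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

This noise label is the practical advantage over K-Means noted above: K-Means assigns every point to a centroid, so irregular properties get absorbed into segments instead of being flagged.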
⚙️ Technical Stack
- Languages & Libraries: Python, pandas, NumPy, scikit-learn.
- Algorithms: XGBoost, LightGBM, CatBoost, Random Forest, Linear/Ridge Regression, K-Means, UMAP, HDBSCAN.
- Visualization: Matplotlib, Plotly, Seaborn.