
🎯 The Objective
This project implements an end-to-end machine learning pipeline to analyze and predict house prices using over 80 structural, geospatial, and quality-based variables. The objective goes beyond simply achieving accurate price predictions; it explores alternative problem framings (classification for budget tiering) and uncovers hidden market structures using advanced unsupervised learning techniques.
🏗️ The Architecture & Methodology
To reflect a real-world machine learning workflow, the architecture was modularized into four distinct phases:
- Data Preprocessing & Engineering: Handled a wide dataset of heterogeneous features. Dropped high-nullity columns, filtered near-zero-variance features, and removed extreme outliers using robust Z-score thresholds. Engineered new, high-impact domain features such as `HouseAge` and `TotalBath`.
- Supervised Regression (Price Prediction): Exhaustively evaluated baseline models against advanced tree-based ensembles to predict the continuous `price` target variable.
- Classification (Market Tiering): Discretized house prices into Low, Medium, and High categories. This reframing improves interpretability for non-technical stakeholders who need to group homes into distinct market segments.
- Unsupervised Learning (Market Segmentation): Utilized density-based clustering to identify natural, hidden groupings in the housing market without relying on price labels.
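The engineering step above can be sketched in a few lines of pandas. The column names (`YrSold`, `YearBuilt`, `FullBath`, `HalfBath`, `SalePrice`) and the 3.5 cut-off are illustrative assumptions, not the project's actual schema or thresholds:

```python
import pandas as pd

# Toy frame standing in for the real housing dataset (columns are assumed).
df = pd.DataFrame({
    "YrSold":    [2008, 2007, 2009, 2010],
    "YearBuilt": [2003, 1976, 2001, 1930],
    "FullBath":  [2, 2, 2, 1],
    "HalfBath":  [1, 0, 1, 0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Engineered features named in the write-up.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]

# Robust Z-score (median/MAD-based) outlier filter on the target;
# 0.6745 rescales the MAD to be comparable to a standard deviation.
med = df["SalePrice"].median()
mad = (df["SalePrice"] - med).abs().median()
robust_z = 0.6745 * (df["SalePrice"] - med) / mad
df = df[robust_z.abs() < 3.5]
```

Using the median and MAD instead of mean and standard deviation keeps the threshold itself from being dragged around by the very outliers it is meant to remove.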
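The market-tiering reframing amounts to discretizing the continuous target. A minimal sketch using equal-frequency bins (the project's actual cut points are not stated, so `pd.qcut` is one reasonable choice):

```python
import pandas as pd

prices = pd.Series([95000, 140000, 185000, 230000, 310000, 455000])

# Equal-frequency tertiles mapped to the three market tiers.
tiers = pd.qcut(prices, q=3, labels=["Low", "Medium", "High"])
```

Fixed dollar thresholds via `pd.cut` would work equally well if the tiers need to stay stable across retraining runs.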
📊 Key Performance Metrics
After rigorous hyperparameter tuning via GridSearchCV, tree-based ensemble models significantly outperformed baseline linear models in capturing non-linear relationships.
- Top Regression Model: CatBoost
- Test R-Squared: 0.9252
- Test RMSLE: 0.1271
- Test RMSE: $19,465
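The tuning loop described above follows the standard `GridSearchCV` pattern. A minimal sketch on synthetic data, using scikit-learn's `GradientBoostingRegressor` as a stand-in for CatBoost (which is a separate third-party package); the parameter grid is illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the housing features.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validated search over a small, illustrative grid.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)

# Held-out evaluation with the refit best estimator.
r2 = r2_score(y_te, search.predict(X_te))
```

Swapping in `CatBoostRegressor` (or `XGBRegressor`) requires only changing the estimator and grid keys, since both expose the scikit-learn estimator interface.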
💡 Core Insights & Business Impact
- Predictive Power: The tuned CatBoost and XGBoost models yield highly accurate and consistent numerical price estimates, explaining over 92% of the variance in the test data.
- Robust Classification: For budget tiering, the XGBoost classifier generalized more robustly than a standard Random Forest baseline.
- Latent Segments Identified: While baseline K-Means was tested, combining UMAP with HDBSCAN successfully isolated dense, organically shaped market segments while accurately flagging highly irregular properties as market anomalies.
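The reduce-then-cluster pipeline can be sketched as follows. Since `umap-learn` and `hdbscan` are third-party packages, PCA and DBSCAN are used here as stand-ins for UMAP and HDBSCAN; the key behavior is the same: dense regions get cluster labels, and isolated points are flagged as noise (`-1`), which is how anomalous properties surface. The `eps` value is data-dependent and purely illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic "market": three dense segments plus one anomalous property.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=42)
X = np.vstack([X, [[20.0, 20.0]]])  # synthetic anomaly

# Scale, reduce (PCA standing in for UMAP), then density-cluster
# (DBSCAN standing in for HDBSCAN).
emb = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(emb)

# Density-based methods label outliers as -1 rather than forcing them
# into a cluster, unlike K-Means.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

This noise label is the practical advantage over K-Means noted above: K-Means assigns every point to a centroid, so irregular properties get absorbed into segments instead of being flagged.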
⚙️ Technical Stack
- Languages & Libraries: Python, pandas, NumPy, scikit-learn.
- Algorithms: XGBoost, LightGBM, CatBoost, Random Forest, Linear/Ridge Regression, K-Means, UMAP, HDBSCAN.
- Visualization: Matplotlib, Plotly, Seaborn.