Housing Price Prediction

See Project with Full Code on GitHub

🎯 The Objective

This project implements an end-to-end machine learning pipeline to analyze and predict house prices using over 80 structural, geospatial, and quality-based variables. The objective goes beyond simply achieving accurate price predictions; it explores alternative problem framings (classification for budget tiering) and uncovers hidden market structures using advanced unsupervised learning techniques.

🏗️ The Architecture & Methodology

To reflect a real-world machine learning workflow, the architecture was modularized into four distinct phases:

  • Data Preprocessing & Engineering: Cleaned a dataset of 80+ heterogeneous features. Dropped high-nullity columns, filtered near-zero variance features, and removed extreme outliers using robust Z-score thresholds. Engineered new, high-impact domain features such as HouseAge and TotalBath.
  • Supervised Regression (Price Prediction): Exhaustively evaluated baseline models against advanced tree-based ensembles to predict the continuous price target variable.
  • Classification (Market Tiering): Discretized house prices into Low, Medium, and High categories. This reframing improves interpretability for non-technical stakeholders who need to categorize homes into distinct market segments.
  • Unsupervised Learning (Market Segmentation): Utilized density-based clustering to identify natural, hidden groupings in the housing market without relying on price labels.
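The preprocessing and feature-engineering steps above can be sketched as follows. This is a minimal illustration on a hypothetical mini-frame; the column names beyond HouseAge and TotalBath, the 50% nullity threshold, and the 3.5 robust Z-score cutoff are assumptions, not values taken from the project.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full 80+ column dataset.
df = pd.DataFrame({
    "YearBuilt": [1961, 1975, 2003, 1999, 1920],
    "YrSold":    [2008, 2009, 2008, 2010, 2009],
    "FullBath":  [1, 2, 2, 2, 1],
    "HalfBath":  [1, 0, 1, 0, 0],
    "LotArea":   [8450, 9600, 11250, 9550, 100000],  # last row: extreme outlier
    "PoolQC":    [None, None, None, None, "Ex"],     # high-nullity column
})

# 1) Drop high-nullity columns (the 0.5 threshold is an assumption).
null_frac = df.isna().mean()
df = df.drop(columns=null_frac[null_frac > 0.5].index)

# 2) Engineer the domain features named in the write-up.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]

# 3) Remove extreme outliers via a robust (median/MAD) Z-score on LotArea.
med = df["LotArea"].median()
mad = (df["LotArea"] - med).abs().median()
robust_z = 0.6745 * (df["LotArea"] - med) / mad
df = df[robust_z.abs() < 3.5]
```

The median/MAD form of the Z-score is used because the ordinary mean/std version is itself distorted by the very outliers it is meant to flag.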
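The price-tiering reframing can be sketched with an equal-frequency discretization. The sample prices and the choice of tercile cut points are illustrative assumptions; the project does not state how its Low/Medium/High boundaries were chosen.

```python
import pandas as pd

# Hypothetical sale prices standing in for the real target column.
prices = pd.Series([105000, 140000, 162000, 189000, 214000,
                    260000, 310000, 125000, 175000, 450000])

# Discretize into three equal-frequency tiers, turning the regression
# target into a classification label for budget tiering.
tiers = pd.qcut(prices, q=3, labels=["Low", "Medium", "High"])
```

Equal-frequency binning (`pd.qcut`) keeps the three classes balanced, which avoids the severe class imbalance that fixed price thresholds would produce on a skewed price distribution.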

📊 Key Performance Metrics

After rigorous hyperparameter tuning via GridSearchCV, tree-based ensemble models significantly outperformed baseline linear models in capturing non-linear relationships.

  • Top Regression Model: CatBoost
  • Test R-Squared: 0.9252
  • Test RMSLE: 0.1271
  • Test RMSE: $19,465
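The tuning loop behind these numbers can be sketched as below. This is a minimal stand-in: scikit-learn's GradientBoostingRegressor substitutes for CatBoost, the data are synthetic, and the parameter grid is illustrative rather than the project's actual search space.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the 80+ housing features.
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exhaustive search over a small, illustrative grid with 3-fold CV.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    scoring="r2",
    cv=3,
)
grid.fit(X_train, y_train)

# Evaluate the best refit model on the held-out test split.
r2 = r2_score(y_test, grid.predict(X_test))
```

The same pattern applies unchanged to CatBoost or XGBoost estimators, since both expose the scikit-learn fit/predict interface that GridSearchCV expects.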

💡 Core Insights & Business Impact

  • Predictive Power: The tuned CatBoost and XGBoost models yield highly accurate and consistent numerical price estimates, explaining over 92% of the variance in the test data.
  • Robust Classification: For budget tiering, the XGBoost Classifier generalized more robustly, outperforming standard Random Forest models.
  • Latent Segments Identified: While baseline K-Means was tested, combining UMAP with HDBSCAN successfully isolated dense, organically shaped market segments while accurately flagging highly irregular properties as market anomalies.
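The density-based segmentation idea can be sketched as follows. As a lightweight stand-in for the UMAP + HDBSCAN pipeline, this uses scikit-learn's DBSCAN on standardized two-moons data; the dataset, `eps`, and `min_samples` values are illustrative assumptions, not the project's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two crescent-shaped groups stand in for organically shaped market
# segments that centroid-based K-Means would split poorly.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# Density-based clustering: points in sparse regions get label -1,
# the same mechanism HDBSCAN uses to flag market anomalies.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

HDBSCAN improves on this by choosing density thresholds per cluster, and UMAP's low-dimensional embedding makes the density structure of an 80+ feature space tractable in the first place.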

⚙️ Technical Stack

  • Languages & Libraries: Python, pandas, NumPy, scikit-learn.
  • Algorithms: XGBoost, LightGBM, CatBoost, Random Forest, Linear/Ridge Regression, K-Means, UMAP, HDBSCAN.
  • Visualization: Matplotlib, Plotly, Seaborn.