DEV Community

John Wakaba

🏠 Building a Machine Learning Property Price Predictor (From Web Scraping to Deployment)

In this project, I built a complete end-to-end machine learning system
that:

  • Scrapes property listings
  • Cleans and engineers features
  • Trains multiple ML models
  • Deploys a pricing app
  • Builds a business-ready dashboard

This article walks through the entire pipeline from raw web data to a deployed ML product.


Step 1 --- Web Scraping

I built a Selenium scraper to extract:

  • Location
  • Property Type
  • Bedrooms
  • Bathrooms
  • Size (sqm)
  • Amenities
  • Price (KES)
  • Listing Date
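
Scraped values for fields like these usually arrive as display strings rather than numbers. Here is a minimal sketch of normalizing them before modeling. The raw formats (`KES 12,500,000`, `120 sqm`, a comma-separated amenities string) and the helper names `parse_price_kes`, `parse_size_sqm` and `amenity_score` are assumptions for illustration, not taken from the actual scraper.

```python
import re
from typing import Optional


def parse_price_kes(text: str) -> Optional[int]:
    """Extract a whole-shilling price from text such as 'KES 12,500,000'.

    Assumes prices are written as whole numbers with optional separators.
    """
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def parse_size_sqm(text: str) -> Optional[float]:
    """Extract the first number from text such as '120 sqm' or '1,200 sqm'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def amenity_score(text: str) -> int:
    """Count distinct, non-empty amenities in a comma-separated string.

    This is one possible definition of an amenity score, assumed here.
    """
    items = {item.strip().lower() for item in text.split(",")}
    items.discard("")
    return len(items)


print(parse_price_kes("KES 12,500,000"))
print(parse_size_sqm("1,200 sqm"))
print(amenity_score("Pool, Gym, pool, "))
```

Returning `None` for unparseable values (for example "Price on request") keeps bad rows visible so they can be dropped or reviewed explicitly instead of silently becoming zeros.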

Sample Scraping Logic

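from selenium.webdriver.common.by import By

# `driver` is an already-initialized Selenium WebDriver with the listings page loaded.
# Match any container whose class name suggests a listing card.
listings = driver.find_elements(
    By.XPATH,
    "//div[contains(@class,'listing') or contains(@class,'property') or contains(@class,'card')]"
)

property_urls = []
for listing in listings:
    # find_elements returns an empty list (instead of raising) when a card has no listing link
    links = listing.find_elements(By.XPATH, ".//a[contains(@href,'/listings/')]")
    if links:
        property_urls.append(links[0].get_attribute("href"))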

Step 3 --- Exploratory Analysis

Most Expensive Locations

location_prices = df.groupby("Location")["Price (KES)"].median().sort_values(ascending=False)
print(location_prices)
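
A median computed from one or two listings can be misleading, so it may help to show the listing count next to each median and drop thin locations. The sketch below uses toy data and a hypothetical `min_listings` threshold; neither comes from the scraped dataset.

```python
import pandas as pd

# Toy data for illustration only; these are not scraped values.
df = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "C"],
    "Price (KES)": [10_000_000, 12_000_000, 50_000_000,
                    8_000_000, 9_000_000, 30_000_000],
})

min_listings = 2  # hypothetical threshold; tune to the size of the dataset

summary = (
    df.groupby("Location")["Price (KES)"]
      .agg(median_price="median", listings="count")
)
# Keep only locations with enough listings to trust the median
summary = summary[summary["listings"] >= min_listings]
summary = summary.sort_values("median_price", ascending=False)
print(summary)
```

In this toy data, location A's median stays at 12,000,000 despite the 50,000,000 outlier, which is why the median is a reasonable choice over the mean for skewed property prices.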

Step 4 --- Modeling

Train/Test Split

from sklearn.model_selection import train_test_split

X = df[["Bedrooms", "Bathrooms", "Size (sqm)", "amenity_score"]]
y = df["Price (KES)"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
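
The feature matrix above is numeric only, so `Location` never reaches the model. One possible way to include it is to one-hot encode it inside a scikit-learn pipeline. This is a sketch on toy rows, not the article's actual training code; the data values and the variable names `preprocess`, `model` and `toy` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Bedrooms", "Bathrooms", "Size (sqm)", "amenity_score"]
categorical_cols = ["Location"]

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes unseen locations as all zeros at predict time
        ("location", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # numeric columns pass through unchanged
)

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Toy rows for illustration only
toy = pd.DataFrame({
    "Location": ["A", "B", "A", "B", "A", "B"],
    "Bedrooms": [2, 3, 3, 4, 1, 2],
    "Bathrooms": [1, 2, 2, 3, 1, 1],
    "Size (sqm)": [80.0, 120.0, 110.0, 200.0, 45.0, 70.0],
    "amenity_score": [3, 5, 4, 8, 1, 2],
    "Price (KES)": [9e6, 10e6, 13e6, 18e6, 5e6, 6e6],
})

features = toy[categorical_cols + numeric_cols]
model.fit(features, toy["Price (KES)"])
pred = model.predict(features.head(1))
print(pred)
```

Keeping the encoder inside the pipeline means the same transformation is applied at training and prediction time, which matters once the model is served from an app.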

Linear Regression (Baseline)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

lr = LinearRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
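
For readers new to these metrics, here is a sketch of what the three calls above compute, written in plain Python on made-up numbers. The helper name `regression_metrics` is hypothetical; in practice the scikit-learn functions are the ones to use.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from first principles."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, in the same units as the target
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the average squared error; penalizes large misses more
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: share of the target's variance explained by the predictions
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up prices in millions of KES
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 18.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
# prints: MAE=1.00  RMSE=1.22  R2=0.70
```

RMSE is never smaller than MAE, and a large gap between the two suggests a few listings with very large errors, which is common with luxury properties.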

Random Forest

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
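
A fitted random forest also exposes `feature_importances_`, which is one way to check which inputs drive the predictions. The sketch below runs on synthetic data in which price depends mostly on size by construction, so it only demonstrates the mechanics and says nothing about the real listings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 200

# Synthetic data for illustration only: price is driven mostly by size by construction
X = pd.DataFrame({
    "Bedrooms": rng.integers(1, 6, n),
    "Bathrooms": rng.integers(1, 4, n),
    "Size (sqm)": rng.uniform(40, 300, n),
    "amenity_score": rng.integers(0, 10, n),
})
y = (
    100_000 * X["Size (sqm)"]
    + 200_000 * X["amenity_score"]
    + rng.normal(0, 500_000, n)
)

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X, y)

# Impurity-based importances; they sum to 1 across features
importances = (
    pd.Series(rf.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can overstate features with many distinct values, so `sklearn.inspection.permutation_importance` on the test set is a useful cross-check.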

XGBoost

from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42
)

xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
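
With three models trained on a single 80/20 split, the ranking can shift with the random seed. A possible way to compare them more robustly is k-fold cross-validation. The sketch below uses synthetic data and only the two scikit-learn models to stay self-contained; since `XGBRegressor` follows the scikit-learn estimator interface, it could be added to the `models` dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and prices (illustration only)
X = rng.uniform(0, 1, size=(150, 4))
y = 5 * X[:, 2] + X[:, 3] + rng.normal(0, 0.1, 150)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_mae = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MAE is reported negated
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()
    print(f"{name}: MAE = {cv_mae[name]:.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean makes it easier to tell a real improvement from noise.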

Step 5 --- Deployment (Streamlit App)

The pricing app allows users to input:

  • Location
  • Bedrooms
  • Bathrooms
  • Size
  • Amenities

It returns:

  • Predicted price
  • Estimated range (± MAE)
  • Explanation of price drivers
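
The "± MAE" range can be sketched as a small pure function that the app calls after prediction. The helper names `price_range` and `format_kes` and the numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def price_range(predicted_price: float, mae: float) -> tuple[float, float]:
    """Return (low, high) as the prediction minus/plus MAE, with low floored at 0."""
    if mae < 0:
        raise ValueError("mae must be non-negative")
    low = max(predicted_price - mae, 0.0)
    high = predicted_price + mae
    return low, high


def format_kes(amount: float) -> str:
    """Format an amount as whole Kenyan shillings with thousands separators."""
    return f"KES {amount:,.0f}"


# Made-up numbers for illustration
low, high = price_range(predicted_price=12_500_000, mae=1_800_000)
print(f"{format_kes(low)} to {format_kes(high)}")
# prints: KES 10,700,000 to KES 14,300,000
```

A band of one MAE on each side is a rough indication of typical error, not a statistical confidence interval, so the app should label it accordingly.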

Run locally:

streamlit run Streamlit_app.py

Step 6 --- Executive Dashboard

The dashboard is built with Streamlit and uses interactive filters.

Includes:

  • Median price by location
  • Monthly price trends
  • Price per sqft comparison
  • Amenity impact analysis
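
The first three views reduce to a few pandas aggregations. The sketch below uses toy rows and assumes the column names `Location`, `Price (KES)`, `Size (sqm)` and `Listing Date`; because sizes are recorded in square metres, price per sqft is derived with a unit conversion.

```python
import pandas as pd

SQFT_PER_SQM = 10.7639  # 1 square metre is about 10.7639 square feet

# Toy rows for illustration only
df = pd.DataFrame({
    "Location": ["A", "A", "B", "B"],
    "Price (KES)": [10_000_000, 14_000_000, 6_000_000, 8_000_000],
    "Size (sqm)": [100.0, 140.0, 80.0, 100.0],
    "Listing Date": ["2024-01-15", "2024-02-10", "2024-01-20", "2024-02-25"],
})
df["Listing Date"] = pd.to_datetime(df["Listing Date"])
df["price_per_sqft"] = df["Price (KES)"] / (df["Size (sqm)"] * SQFT_PER_SQM)

# Median price by location
median_by_location = df.groupby("Location")["Price (KES)"].median()

# Monthly price trend: median price per calendar month of listing
monthly_trend = (
    df.groupby(df["Listing Date"].dt.to_period("M"))["Price (KES)"].median()
)

# Price per sqft comparison across locations
sqft_by_location = df.groupby("Location")["price_per_sqft"].median()

print(median_by_location)
print(monthly_trend)
print(sqft_by_location)
```

Each resulting Series can be passed to a Streamlit chart, with the interactive filters applied to `df` before the aggregations run.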

Run:

streamlit run Dashboard.py

Key Insights

  • Size is the strongest determinant of price.
  • Premium neighborhoods significantly increase valuation.
  • Amenities increase value but are secondary drivers.
