In this project, I built a complete end-to-end machine learning system that:
- Scrapes property listings
- Cleans and engineers features
- Trains multiple ML models
- Deploys a pricing app
- Builds a business-ready dashboard
This article walks through the entire pipeline from raw web data to a deployed ML product.
Step 1 --- Web Scraping
I built a Selenium scraper to extract:
- Location
- Property Type
- Bedrooms
- Bathrooms
- Size (sqm)
- Amenities
- Price (KES)
- Listing Date
Sample Scraping Logic
# Grab every listing card on the results page
listings = driver.find_elements(
    By.XPATH,
    "//div[contains(@class,'listing') or contains(@class,'property') or contains(@class,'card')]"
)
for listing in listings:
    # Each card links to a detail page; collect the URL for a second scraping pass
    link = listing.find_element(By.XPATH, ".//a[contains(@href,'/listings/')]")
    property_url = link.get_attribute("href")
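Step 2 --- Data Cleaning & Feature Engineering

The modeling step later relies on an amenity_score feature derived from the scraped amenities. A minimal sketch, assuming Amenities arrives as a comma-separated string and amenity_score is a simple count (both assumptions, since the original cleaning code isn't shown):

```python
import pandas as pd

# Toy rows standing in for the scraped data (illustrative only)
df = pd.DataFrame({
    "Amenities": ["pool, gym, parking", "parking", None],
    "Price (KES)": [25_000_000, 8_000_000, 5_000_000],
})

# amenity_score: count of non-empty amenity entries (an assumed definition)
df["amenity_score"] = (
    df["Amenities"]
    .fillna("")
    .apply(lambda s: sum(1 for a in s.split(",") if a.strip()))
)
```

A richer version might weight high-value amenities (pool, parking) more heavily, but a plain count is a reasonable baseline.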
Step 3 --- Exploratory Analysis
Most Expensive Locations
location_prices = df.groupby("Location")["Price (KES)"].median().sort_values(ascending=False)
print(location_prices)
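The dashboard later compares price per square metre, which follows the same groupby pattern. A minimal sketch with toy data (column names follow the scraped schema):

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["Kilimani", "Westlands", "Kilimani"],
    "Price (KES)": [20_000_000, 30_000_000, 18_000_000],
    "Size (sqm)": [100, 120, 90],
})

# Normalise price by size so locations with different stock sizes compare fairly
df["price_per_sqm"] = df["Price (KES)"] / df["Size (sqm)"]
print(df.groupby("Location")["price_per_sqm"].median().sort_values(ascending=False))
```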
Step 4 --- Modeling
Train/Test Split
from sklearn.model_selection import train_test_split
X = df[["Bedrooms", "Bathrooms", "Size (sqm)", "amenity_score"]]
y = df["Price (KES)"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Linear Regression (Baseline)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
Random Forest
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
XGBoost
from xgboost import XGBRegressor
xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42
)
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
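To pick a winner, all models can be scored on the same held-out metrics. A self-contained sketch with synthetic data (XGBoost omitted so the snippet needs only scikit-learn, but it plugs into the same loop):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price roughly linear in size, plus noise
rng = np.random.default_rng(42)
size = rng.uniform(30, 300, 200)
X = size.reshape(-1, 1)
y = 150_000 * size + rng.normal(0, 1_000_000, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for name, model in [
    ("Linear", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=42)),
]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (
        mean_absolute_error(y_test, pred),
        np.sqrt(mean_squared_error(y_test, pred)),  # RMSE
        r2_score(y_test, pred),
    )

for name, (mae, rmse, r2) in results.items():
    print(f"{name}: MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")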
Step 5 --- Deployment (Streamlit App)
The pricing app allows users to input:
- Location
- Bedrooms
- Bathrooms
- Size
- Amenities
And returns:
- Predicted price
- Estimated range (± MAE)
- Explanation of price drivers
Run locally:
streamlit run Streamlit_app.py
Step 6 --- Executive Dashboard
Built using Streamlit with interactive filters.
Includes:
- Median price by location
- Monthly price trends
- Price per sqm comparison
- Amenity impact analysis
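The aggregations behind these views can be sketched in pandas (column names follow the scraped schema; the monthly resample logic is an assumption):

```python
import pandas as pd

# Toy rows standing in for the cleaned dataset (illustrative only)
df = pd.DataFrame({
    "Location": ["Kilimani", "Westlands", "Kilimani", "Westlands"],
    "Price (KES)": [20_000_000, 30_000_000, 22_000_000, 28_000_000],
    "Listing Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-15"]
    ),
})

# Median price by location: feeds the first dashboard chart
by_location = df.groupby("Location")["Price (KES)"].median()

# Monthly price trend: feeds the trend chart ("MS" = month start)
monthly = df.set_index("Listing Date")["Price (KES)"].resample("MS").median()

print(by_location)
print(monthly)
```

Streamlit filters (e.g. st.multiselect on Location) would simply subset df before these aggregations run.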
Run:
streamlit run Dashboard.py
Key Insights
- Size is the strongest determinant of price.
- Premium neighborhoods significantly increase valuation.
- Amenities increase value but are secondary drivers.
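Insights like "size is the strongest driver" typically come from inspecting tree-model feature importances. A self-contained sketch with synthetic data constructed so size dominates (the real ranking would come from the fitted rf above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Bedrooms": rng.integers(1, 6, 500),
    "Bathrooms": rng.integers(1, 4, 500),
    "Size (sqm)": rng.uniform(30, 300, 500),
    "amenity_score": rng.integers(0, 10, 500),
})
# Toy target: price driven mostly by size, with a smaller amenity effect
y = 150_000 * X["Size (sqm)"] + 500_000 * X["amenity_score"] \
    + rng.normal(0, 1_000_000, 500)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```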