How I built a free real estate AVM with 375k notary transactions, PostGIS, and a 4€/month VPS

#showdev #python #machinelearning #postgres

A few months ago I wanted to estimate my own apartment's value in Warsaw. Every Polish AVM I found was either paywalled, required signup, or used fuzzy "similar listings" instead of actual transaction prices.

I'm a dev. I had a weekend. So I built one: wyceniarka.online.

It's free, no signup, no ads. Type in an address, get a price in 30 seconds. Trained on 375 286 notary-recorded real estate transactions from a Polish public registry that — as far as I can tell — practically nobody scrapes.

This post is about how I got that data, how the model works, and the surprisingly small infrastructure it runs on.

The dataset nobody uses

Poland has a public registry called Rejestr Cen Nieruchomości (RCN) — "Real Estate Price Registry". Every notarized property transaction lands there with a price, area, transaction date, and coordinates. By law it's public, but in practice almost nobody scrapes it because the access is... let's call it characterful:

The API is a WMS service returning GML, not JSON
Coordinates come in EPSG:2180 (Polish national grid, not WGS84)
Column ordering in the response differs per district: the same field can be the 4th column in Mokotów and the 7th in Wesoła
No documentation. No rate limits documented. No API key.
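The column-ordering problem goes away if you key each value by its element name instead of its position. Here is a minimal sketch with the standard library, assuming a heavily simplified payload: the root element, the `rcn` namespace URI, and the field names `price`, `area_m2`, and `date` are all invented for illustration and are not the real RCN schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified GML payload. The real RCN schema, namespaces,
# and field names are NOT shown in this post; these are placeholders.
SAMPLE_GML = """\
<FeatureCollection xmlns:gml="http://www.opengis.net/gml"
                   xmlns:rcn="http://example.invalid/rcn">
  <gml:featureMember>
    <rcn:transaction>
      <rcn:price>850000</rcn:price>
      <rcn:area_m2>52.4</rcn:area_m2>
      <rcn:date>2023-05-17</rcn:date>
    </rcn:transaction>
  </gml:featureMember>
  <gml:featureMember>
    <rcn:transaction>
      <rcn:date>2022-11-02</rcn:date>
      <rcn:area_m2>38.0</rcn:area_m2>
      <rcn:price>610000</rcn:price>
    </rcn:transaction>
  </gml:featureMember>
</FeatureCollection>
"""

GML_NS = "{http://www.opengis.net/gml}"


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def parse_features(gml_text: str) -> list[dict[str, str]]:
    """Return one dict per feature, keyed by field name, not by position."""
    root = ET.fromstring(gml_text)
    rows = []
    for member in root.iter(f"{GML_NS}featureMember"):
        feature = member[0]  # the single feature element inside featureMember
        rows.append(
            {local_name(child.tag): (child.text or "").strip() for child in feature}
        )
    return rows


rows = parse_features(SAMPLE_GML)
for row in rows:
    print(row["price"], row["area_m2"], row["date"])
```

Both sample features list their fields in a different order, yet `row["price"]` resolves correctly for each one.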

I spent a weekend debugging coordinate weirdness alone. Once you have the parser working, you get 375 286 transactions for Warsaw covering 2018–2026. Plus 190 981 from a separate GUGiK national registry.

For real-time pricing context, I also crawl three listing portals nightly:

20 listings (asking price)

Everything is deduplicated by external_id via UPSERT, and listings that disappear from a portal are flagged with removed_at.
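The UPSERT-plus-flag pattern can be sketched like this. To keep it runnable anywhere, it uses an in-memory SQLite database as a stand-in for PostgreSQL (both accept this `ON CONFLICT ... DO UPDATE` form). The table name `listings` and the columns `price_zl` and `last_seen` are assumptions for the example; only `external_id`, `removed_at`, and `first_seen` appear in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real PostgreSQL database
conn.execute("""
    CREATE TABLE listings (
        external_id TEXT PRIMARY KEY,
        price_zl    INTEGER NOT NULL,
        first_seen  TEXT NOT NULL,
        last_seen   TEXT NOT NULL,
        removed_at  TEXT
    )
""")

# Insert new listings; for known ones refresh price and last_seen,
# and clear removed_at in case a listing has reappeared.
UPSERT = """
    INSERT INTO listings (external_id, price_zl, first_seen, last_seen)
    VALUES (:external_id, :price_zl, :today, :today)
    ON CONFLICT (external_id) DO UPDATE SET
        price_zl   = excluded.price_zl,
        last_seen  = excluded.last_seen,
        removed_at = NULL
"""

# Anything not seen in tonight's crawl is flagged, never deleted.
FLAG_REMOVED = """
    UPDATE listings SET removed_at = :today
    WHERE last_seen < :today AND removed_at IS NULL
"""


def nightly_crawl(conn, scraped, today):
    """Apply one night's scrape: upsert what was seen, flag what was not."""
    conn.executemany(UPSERT, [{**row, "today": today} for row in scraped])
    conn.execute(FLAG_REMOVED, {"today": today})
    conn.commit()


nightly_crawl(conn, [
    {"external_id": "a1", "price_zl": 800_000},
    {"external_id": "b2", "price_zl": 650_000},
], today="2025-01-01")

# Second night: a1 dropped its price, b2 vanished from the portal.
nightly_crawl(conn, [{"external_id": "a1", "price_zl": 790_000}], today="2025-01-02")

result = conn.execute(
    "SELECT external_id, price_zl, first_seen, removed_at "
    "FROM listings ORDER BY external_id"
).fetchall()
print(result)
```

Keeping `first_seen` untouched on conflict is what makes it usable later as a "when did this first appear" timestamp.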

The model

XGBoost regressor with 25 features, using native categorical support (no one-hot encoding):

# Numeric features (no imputation — XGBoost handles NaN natively)
NUMERIC_FEATURES = [
    "m2", "lng", "lat",
    "months_since_2018", "kwartal_trans",
    "rooms_int", "floor_int",
    "district_factor",
    "build_year",
    "dist_metro_m", "dist_park_m", "park_ha", "dist_school_m",
    "dist_road_m", "dist_rail_m", "dist_tram_m",
    "dist_forest_m", "forest_ha",
    "nbp_oferta_zl_m2", "nbp_trans_zl_m2",  # macro anchor
]

CATEGORICAL_FEATURES = [
    "rynek",                # primary / secondary market
    "district",             # 18 Warsaw districts
    "source",               # 'otodom' vs 'rcn_gold' — see below
    "building_type",        # kamienica / blok / apartamentowiec / dom / loft
    "construction_status",  # ready / to_renovation / to_completion
    "building_material",
    "road_class",
]

Spatial features are computed at training time via PostGIS nearest-neighbor lateral joins on:

OpenStreetMap: metro stations, parks, schools
BDOT10k (Polish national geodata): major roads, railways, tramways, forests, building footprints
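For readers who have not used the pattern, this is roughly what a nearest-neighbor lateral join looks like. It is a sketch under assumptions, not my exact query: the table names (`transactions`, `osm_metro_stations`), the `id` and `geom` columns, and the helper itself are hypothetical. It also assumes geometries are stored in EPSG:2180, so that `ST_Distance` returns metres.

```python
def nearest_distance_sql(points_table: str, feature_table: str, out_col: str) -> str:
    """Build a query giving, for each point, the distance to its nearest feature.

    Identifiers are interpolated directly into the SQL, so they must come from
    trusted constants in code, never from user input.
    """
    return f"""
        SELECT p.id, nearest.dist AS {out_col}
        FROM {points_table} AS p
        LEFT JOIN LATERAL (
            SELECT ST_Distance(p.geom, f.geom) AS dist
            FROM {feature_table} AS f
            ORDER BY p.geom <-> f.geom   -- KNN operator, can use a GiST index
            LIMIT 1
        ) AS nearest ON true
    """


# Hypothetical table names, for illustration only.
sql = nearest_distance_sql("transactions", "osm_metro_stations", "dist_metro_m")
print(sql)
```

`LEFT JOIN LATERAL ... ON true` keeps a row even when the feature table is empty, leaving the distance NULL, which fits a model that accepts missing values.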

Performance on a 41 471-row held-out test set:

Metric	Value
MAPE	12.6%
Median APE	6.7%
Within ±10%	64%
Within ±20%	86%
R²	0.745

Median 6.7% is the headline. The mean is dragged up by long-tail outliers (corner cases: ground-floor units with garden, luxury penthouses, etc.).
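To make the mean-versus-median gap concrete, here is how these metrics are commonly computed, run on five invented price-per-m² pairs (not my test set). A single badly missed outlier is enough to pull the mean far above the median.

```python
from statistics import mean, median


def absolute_percentage_errors(y_true, y_pred):
    """Per-row |prediction - actual| / actual."""
    return [abs(p - t) / t for t, p in zip(y_true, y_pred)]


def summarize(y_true, y_pred):
    errors = absolute_percentage_errors(y_true, y_pred)
    n = len(errors)
    return {
        "mape": mean(errors),
        "median_ape": median(errors),
        "within_10": sum(e <= 0.10 for e in errors) / n,
        "within_20": sum(e <= 0.20 for e in errors) / n,
    }


# Invented numbers: four ordinary flats and one penthouse the model misses badly.
y_true = [10_000, 12_000, 15_000, 9_000, 20_000]
y_pred = [10_500, 11_400, 15_600, 9_450, 30_000]

stats = summarize(y_true, y_pred)
for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```

Four of the five errors are 4-5%, so the median sits at 5%, while the one 50% miss drags the mean to 13.8%.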

The offer-to-transaction trick

Here's a calibration detail I find interesting. Training data mixes two price types:

Otodom listings — asking prices (~13k current)
RCN — actual transaction prices (~375k historical)

Apartments actually sell for ~8% less than their Otodom asking price, and the gap is systematic. Instead of dropping one source or doing complicated weighting, I treat source as a categorical feature and let XGBoost learn the offset:

# At inference time, always predict "as if from Otodom listing"
# (richer features, more recent data) — then calibrate down
zl_m2_predicted_as_otodom = float(model.predict(X)[0])
OFFER_TO_TRANSACTION_RATIO = 0.92  # configurable in admin panel
zl_m2_predicted_transaction = zl_m2_predicted_as_otodom * OFFER_TO_TRANSACTION_RATIO

This means the model uses all 388k records for training but produces transaction-space predictions. The 0.92 multiplier is now editable from an admin tab so I can tune without retraining.

Stack

The whole thing runs on a single 4€/month VPS (4GB RAM, 2 vCPU). Stack:

PostgreSQL 15 + PostGIS — primary data store + spatial features
FastAPI — API + server-rendered HTML for SEO landing pages
nginx — TLS termination, static files, rate limiting
Cloudflare in front for caching + DDoS
Docker Compose for everything

Frontend is vanilla HTML + Leaflet for the map. No React/Vue/Svelte. The wycena (valuation) form page weighs 48KB gzipped, with full first paint under 500ms on 4G.

Why no SPA? Because:

SEO matters (server-side rendered HTML with proper <meta> and JSON-LD ranks)
The interaction is "fill form → see result"; a SPA framework would be 10x the code for zero benefit
Future-me will thank present-me when I touch this in 2 years

Model retrains weekly via cron (~5 min), pickled to disk. Inference is <100ms p99 — most of that is the PostGIS spatial lookup, not XGBoost.

Things I learned the hard way

Google ignores <lastmod> in sitemaps if every URL has the same date. I initially hardcoded <lastmod>{{ today() }}</lastmod> for all 22 URLs. Google's sitemap guidance says it only uses lastmod when the value is consistently and verifiably accurate, so a blanket "today" is effectively a fake signal. Real per-URL lastmod (computed as MAX(otodom_listings.first_seen) per district) made a measurable difference in crawl frequency.
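A minimal sketch of emitting per-URL lastmod values with the standard library. The URLs and dates are placeholders; in practice each date would come from a query such as MAX(first_seen) per district, and the `build_sitemap` helper is hypothetical rather than my production code.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (url, lastmod) pairs, lastmod as 'YYYY-MM-DD'."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs and dates; each page gets the date its own data last changed.
pages = [
    ("https://example.invalid/dzielnice/mokotow", "2025-03-14"),
    ("https://example.invalid/dzielnice/wesola", "2025-02-27"),
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The point is only that the date is a per-row input rather than a template-wide `today()`.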

Building type matters more than I expected. A kamienica (pre-war tenement) and a blok (1970s slab) on the same street can differ by around 15% in price per m², in either direction. Mapping these to a unified taxonomy was worth it.

IndexNow is two hours of work and zero downside. Bing + Yandex protocol for instant indexing pings. Google ignores it, but Bing is ~5% of Polish traffic — free upside.

Single-file Docker bind mounts are a footgun. I had nginx config mounted as a single file (- ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro). After git pull updated the file (new inode), nginx -s reload did nothing because the container was still pointed at the old inode. Container restart fixed it. Lesson: bind-mount the directory, not individual files.
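The inode behaviour is easy to reproduce without Docker. In this rough analogy, an open file handle plays the role of the single-file bind mount: both stay attached to the original inode. It assumes a POSIX filesystem and assumes the update replaces the file (write new, rename over old) instead of writing in place.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
conf_path = os.path.join(workdir, "default.conf")

with open(conf_path, "w") as f:
    f.write("old config\n")

inode_before = os.stat(conf_path).st_ino
held = open(conf_path)  # stands in for the container's single-file mount

# Replace the file the way many tools do: write a new file,
# then rename it over the old path. The path now points at a new inode.
tmp_path = conf_path + ".tmp"
with open(tmp_path, "w") as f:
    f.write("new config\n")
os.replace(tmp_path, conf_path)

inode_after = os.stat(conf_path).st_ino
seen_by_holder = held.read()   # still attached to the old inode
held.close()
with open(conf_path) as f:
    seen_by_path = f.read()    # a fresh lookup by path gets the new file

print("inode changed:", inode_before != inode_after)
print("holder sees:", seen_by_holder.strip())
print("path sees:  ", seen_by_path.strip())
```

A directory mount avoids the problem because the file is looked up by name inside the directory on every access, so the replaced file is found.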

What's next

Rental price prediction — separate model, harder because rental listings churn 3x faster than sales
Confidence intervals via quantile regression — current spread is a heuristic from comparables
Open-sourcing the RCN scraper — since I genuinely cannot find a working one anywhere

Try it

Estimate price: wyceniarka.online/wycen
Map view (~13k recent transactions): wyceniarka.online/mapa
Per-district analytics (18 districts): wyceniarka.online/dzielnice

Site is Polish-only (it's literally for Warsaw apartments), but happy to answer technical questions in English in the comments. Particularly interested in: