A few months ago I wanted to estimate my own apartment's value in Warsaw. Every Polish AVM I found was either paywalled, required signup, or used fuzzy "similar listings" instead of actual transaction prices.
I'm a dev. I had a weekend. So I built one: wyceniarka.online.
It's free, no signup, no ads. Type in an address, get a price in 30 seconds. Trained on 375 286 notary-recorded real estate transactions from a Polish public registry that — as far as I can tell — practically nobody scrapes.
This post is about how I got that data, how the model works, and the surprisingly small infrastructure it runs on.
The dataset nobody uses
Poland has a public registry called Rejestr Cen Nieruchomości (RCN) — "Real Estate Price Registry". Every notarized property transaction lands there with a price, area, transaction date, and coordinates. By law it's public, but in practice almost nobody scrapes it because the access is... let's call it characterful:
- The API is a WMS service returning GML, not JSON
- Coordinates come in EPSG:2180 (Polish national grid, not WGS84)
- Column ordering in the response differs per district — same field can be the 4th column in Mokotów and 7th in Wesoła
- No documentation. No rate limits documented. No API key.
I spent a weekend debugging coordinate weirdness alone. Once you have the parser working, you get 375 286 transactions for Warsaw covering 2018–2026. Plus 190 981 from a separate GUGiK national registry.
For real-time pricing context, I also crawl three listing portals nightly:
- 20 listings (asking price)
All deduplicated by external_id via UPSERT, removed listings flagged with removed_at.
The model
XGBoost regressor with 25 features, using native categorical support (no one-hot encoding):
# Numeric features (no imputation — XGBoost handles NaN natively)
NUMERIC_FEATURES = [
"m2", "lng", "lat",
"months_since_2018", "kwartal_trans",
"rooms_int", "floor_int",
"district_factor",
"build_year",
"dist_metro_m", "dist_park_m", "park_ha", "dist_school_m",
"dist_road_m", "dist_rail_m", "dist_tram_m",
"dist_forest_m", "forest_ha",
"nbp_oferta_zl_m2", "nbp_trans_zl_m2", # macro anchor
]
CATEGORICAL_FEATURES = [
"rynek", # primary / secondary market
"district", # 18 Warsaw districts
"source", # 'otodom' vs 'rcn_gold' — see below
"building_type", # kamienica / blok / apartamentowiec / dom / loft
"construction_status", # ready / to_renovation / to_completion
"building_material",
"road_class",
]
Spatial features are computed at training time via PostGIS nearest-neighbor lateral joins on:
- OpenStreetMap: metro stations, parks, schools
- BDOT10k (Polish national geodata): major roads, railways, tramways, forests, building footprints
Performance on a 41 471-row held-out test set:
| Metric | Value |
|---|---|
| MAPE | 12.6% |
| Median APE | 6.7% |
| Within ±10% | 64% |
| Within ±20% | 86% |
| R² | 0.745 |
Median 6.7% is the headline. The mean is dragged up by long-tail outliers (corner cases: ground-floor units with garden, luxury penthouses, etc.).
The offer-to-transaction trick
Here's a calibration detail I find interesting. Training data mixes two price types:
- Otodom listings — asking prices (~13k current)
- RCN — actual transaction prices (~375k historical)
Otodom asks are systematically ~8% higher than what apartments actually sell for. Instead of dropping one source or doing complicated weighting, I treat source as a categorical feature and let XGBoost learn the offset:
# At inference time, always predict "as if from Otodom listing"
# (richer features, more recent data) — then calibrate down
zl_m2_predicted_as_otodom = float(model.predict(X)[0])
OFFER_TO_TRANSACTION_RATIO = 0.92 # configurable in admin panel
zl_m2_predicted_transaction = zl_m2_predicted_as_otodom * OFFER_TO_TRANSACTION_RATIO
This means the model uses all 388k records for training but produces transaction-space predictions. The 0.92 multiplier is now editable from an admin tab so I can tune without retraining.
Stack
The whole thing runs on a single 4€/month VPS (4GB RAM, 2 vCPU). Stack:
- PostgreSQL 15 + PostGIS — primary data store + spatial features
- FastAPI — API + server-rendered HTML for SEO landing pages
- nginx — TLS termination, static files, rate limiting
- Cloudflare in front for caching + DDoS
- Docker Compose for everything
Frontend is vanilla HTML + Leaflet for the map. No React/Vue/Svelte. The wycena form page weighs 48KB gzipped, full first paint <500ms on 4G.
Why no SPA? Because:
- SEO matters (server-side rendered HTML with proper
<meta>and JSON-LD ranks) - The interaction is "fill form → see result", a SPA framework would be 10x the code for zero benefit
- Future-me will thank present-me when I touch this in 2 years
Model retrains weekly via cron (~5 min), pickled to disk. Inference is <100ms p99 — most of that is the PostGIS spatial lookup, not XGBoost.
Things I learned the hard way
Google ignores <lastmod> in sitemaps if every URL has the same date. I initially hardcoded <lastmod>{{ today() }}</lastmod> for all 22 URLs. Google explicitly says it treats this as a fake signal. Real per-URL lastmod (computed as MAX(otodom_listings.first_seen) per district) made a measurable difference in crawl frequency.
Building type matters more than I expected. A kamienica (pre-war tenement) and a blok (1970s slab) on the same street can differ by ±15% in price per m². Mapping these to a unified taxonomy was worth it.
IndexNow is two hours of work and zero downside. Bing + Yandex protocol for instant indexing pings. Google ignores it, but Bing is ~5% of Polish traffic — free upside.
Single-file Docker bind mounts are a footgun. I had nginx config mounted as a single file (- ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro). After git pull updated the file (new inode), nginx -s reload did nothing because the container was still pointed at the old inode. Container restart fixed it. Lesson: bind-mount the directory, not individual files.
What's next
- Rental price prediction — separate model, harder because rental listings churn 3x faster than sales
- Confidence intervals via quantile regression — current spread is a heuristic from comparables
- Open-sourcing the RCN scraper — since I genuinely cannot find a working one anywhere
Try it
- Estimate price: wyceniarka.online/wycen
- Map view (~13k recent transactions): wyceniarka.online/mapa
- Per-district analytics (18 districts): wyceniarka.online/dzielnice
Site is Polish-only (it's literally for Warsaw apartments), but happy to answer technical questions in English in the comments. Particularly interested in:
- Anyone scraping similar public-registry data in other countries?
- Ideas for improving the offer→transaction calibration?
- Whether you've found rental-price prediction tractable with public data?
Top comments (0)