<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pszemo Koziniak</title>
    <description>The latest articles on DEV Community by Pszemo Koziniak (@pszemo_koziniak_ee9c886ae).</description>
    <link>https://dev.to/pszemo_koziniak_ee9c886ae</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944630%2F5fc60256-146f-4195-8eb8-e2dda090a0b8.png</url>
      <title>DEV Community: Pszemo Koziniak</title>
      <link>https://dev.to/pszemo_koziniak_ee9c886ae</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pszemo_koziniak_ee9c886ae"/>
    <language>en</language>
    <item>
      <title>How I built a free real estate AVM with 375k notary transactions, PostGIS, and a 4€/month VPS</title>
      <dc:creator>Pszemo Koziniak</dc:creator>
      <pubDate>Thu, 21 May 2026 17:58:34 +0000</pubDate>
      <link>https://dev.to/pszemo_koziniak_ee9c886ae/how-i-built-a-free-real-estate-avm-with-375k-notary-transactions-postgis-and-a-4eumonth-vps-4375</link>
      <guid>https://dev.to/pszemo_koziniak_ee9c886ae/how-i-built-a-free-real-estate-avm-with-375k-notary-transactions-postgis-and-a-4eumonth-vps-4375</guid>
      <description>&lt;p&gt;A few months ago I wanted to estimate my own apartment's value in Warsaw. Every Polish AVM I found was either paywalled, required signup, or used fuzzy "similar listings" instead of actual transaction prices.&lt;/p&gt;

&lt;p&gt;I'm a dev. I had a weekend. So I built one: &lt;a href="https://wyceniarka.online/" rel="noopener noreferrer"&gt;&lt;strong&gt;wyceniarka.online&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's free, no signup, no ads. Type in an address, get a price in 30 seconds. Trained on &lt;strong&gt;375 286 notary-recorded real estate transactions&lt;/strong&gt; from a Polish public registry that — as far as I can tell — practically nobody scrapes.&lt;/p&gt;

&lt;p&gt;This post is about how I got that data, how the model works, and the surprisingly small infrastructure it runs on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dataset nobody uses
&lt;/h2&gt;

&lt;p&gt;Poland has a public registry called &lt;strong&gt;Rejestr Cen Nieruchomości (RCN)&lt;/strong&gt; — "Real Estate Price Registry". Every notarized property transaction lands there with a price, area, transaction date, and coordinates. By law it's public, but in practice almost nobody scrapes it because the access is... let's call it &lt;em&gt;characterful&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The API is a &lt;a href="https://www.ogc.org/standards/wms" rel="noopener noreferrer"&gt;WMS service&lt;/a&gt; returning GML, not JSON&lt;/li&gt;
&lt;li&gt;Coordinates come in &lt;strong&gt;EPSG:2180&lt;/strong&gt; (Polish national grid, not WGS84)&lt;/li&gt;
&lt;li&gt;Column ordering in the response &lt;strong&gt;differs per district&lt;/strong&gt; — same field can be the 4th column in Mokotów and 7th in Wesoła&lt;/li&gt;
&lt;li&gt;No documentation. No rate limits documented. No API key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I spent a weekend debugging coordinate weirdness alone. Once you have the parser working, you get 375 286 transactions for Warsaw covering 2018–2026. Plus 190 981 from a separate GUGiK national registry.&lt;/p&gt;

&lt;p&gt;For real-time pricing context, I also crawl three listing portals nightly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20 listings (asking price)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All deduplicated by &lt;code&gt;external_id&lt;/code&gt; via &lt;code&gt;UPSERT&lt;/code&gt;, removed listings flagged with &lt;code&gt;removed_at&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model
&lt;/h2&gt;

&lt;p&gt;XGBoost regressor with &lt;strong&gt;25 features&lt;/strong&gt;, using native categorical support (no one-hot encoding):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Numeric features (no imputation — XGBoost handles NaN natively)
&lt;/span&gt;&lt;span class="n"&gt;NUMERIC_FEATURES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;months_since_2018&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kwartal_trans&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rooms_int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;floor_int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;district_factor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_metro_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_park_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;park_ha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_school_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_road_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_rail_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_tram_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dist_forest_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forest_ha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nbp_oferta_zl_m2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nbp_trans_zl_m2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# macro anchor
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;CATEGORICAL_FEATURES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rynek&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# primary / secondary market
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;district&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# 18 Warsaw districts
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# 'otodom' vs 'rcn_gold' — see below
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;building_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# kamienica / blok / apartamentowiec / dom / loft
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;construction_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ready / to_renovation / to_completion
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;building_material&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;road_class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spatial features are computed at training time via PostGIS nearest-neighbor lateral joins on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenStreetMap: metro stations, parks, schools&lt;/li&gt;
&lt;li&gt;BDOT10k (Polish national geodata): major roads, railways, tramways, forests, building footprints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Performance on a 41 471-row held-out test set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MAPE&lt;/td&gt;
&lt;td&gt;12.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median APE&lt;/td&gt;
&lt;td&gt;6.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within ±10%&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within ±20%&lt;/td&gt;
&lt;td&gt;86%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;R²&lt;/td&gt;
&lt;td&gt;0.745&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Median 6.7% is the headline. The mean is dragged up by long-tail outliers (corner cases: ground-floor units with garden, luxury penthouses, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  The offer-to-transaction trick
&lt;/h2&gt;

&lt;p&gt;Here's a calibration detail I find interesting. Training data mixes two price types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Otodom listings&lt;/strong&gt; — asking prices (~13k current)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RCN&lt;/strong&gt; — actual transaction prices (~375k historical)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otodom asks are systematically ~8% higher than what apartments actually sell for. Instead of dropping one source or doing complicated weighting, I treat &lt;code&gt;source&lt;/code&gt; as a categorical feature and let XGBoost learn the offset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# At inference time, always predict "as if from Otodom listing"
# (richer features, more recent data) — then calibrate down
&lt;/span&gt;&lt;span class="n"&gt;zl_m2_predicted_as_otodom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;OFFER_TO_TRANSACTION_RATIO&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;  &lt;span class="c1"&gt;# configurable in admin panel
&lt;/span&gt;&lt;span class="n"&gt;zl_m2_predicted_transaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zl_m2_predicted_as_otodom&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;OFFER_TO_TRANSACTION_RATIO&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the model uses &lt;em&gt;all 388k records&lt;/em&gt; for training but produces transaction-space predictions. The 0.92 multiplier is now editable from an admin tab so I can tune without retraining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;The whole thing runs on &lt;strong&gt;a single 4€/month VPS&lt;/strong&gt; (4GB RAM, 2 vCPU). Stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL 15 + PostGIS&lt;/strong&gt; — primary data store + spatial features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; — API + server-rendered HTML for SEO landing pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;nginx&lt;/strong&gt; — TLS termination, static files, rate limiting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare&lt;/strong&gt; in front for caching + DDoS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frontend is &lt;strong&gt;vanilla HTML + Leaflet&lt;/strong&gt; for the map. No React/Vue/Svelte. The wycena form page weighs &lt;strong&gt;48KB gzipped&lt;/strong&gt;, full first paint &amp;lt;500ms on 4G.&lt;/p&gt;

&lt;p&gt;Why no SPA? Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SEO matters (server-side rendered HTML with proper &lt;code&gt;&amp;lt;meta&amp;gt;&lt;/code&gt; and JSON-LD ranks)&lt;/li&gt;
&lt;li&gt;The interaction is "fill form → see result", a SPA framework would be 10x the code for zero benefit&lt;/li&gt;
&lt;li&gt;Future-me will thank present-me when I touch this in 2 years&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Model retrains weekly via cron (~5 min), pickled to disk. Inference is &amp;lt;100ms p99 — most of that is the PostGIS spatial lookup, not XGBoost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things I learned the hard way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Google ignores &lt;code&gt;&amp;lt;lastmod&amp;gt;&lt;/code&gt; in sitemaps if every URL has the same date.&lt;/strong&gt; I initially hardcoded &lt;code&gt;&amp;lt;lastmod&amp;gt;{{ today() }}&amp;lt;/lastmod&amp;gt;&lt;/code&gt; for all 22 URLs. Google &lt;a href="https://developers.google.com/search/blog/2023/06/sitemaps-lastmod-ping" rel="noopener noreferrer"&gt;explicitly says&lt;/a&gt; it treats this as a fake signal. Real per-URL &lt;code&gt;lastmod&lt;/code&gt; (computed as &lt;code&gt;MAX(otodom_listings.first_seen)&lt;/code&gt; per district) made a measurable difference in crawl frequency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building type matters more than I expected.&lt;/strong&gt; A &lt;em&gt;kamienica&lt;/em&gt; (pre-war tenement) and a &lt;em&gt;blok&lt;/em&gt; (1970s slab) on the same street can differ by ±15% in price per m². Mapping these to a unified taxonomy was worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IndexNow is two hours of work and zero downside.&lt;/strong&gt; Bing + Yandex protocol for instant indexing pings. Google ignores it, but Bing is ~5% of Polish traffic — free upside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-file Docker bind mounts are a footgun.&lt;/strong&gt; I had nginx config mounted as a single file (&lt;code&gt;- ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro&lt;/code&gt;). After &lt;code&gt;git pull&lt;/code&gt; updated the file (new inode), &lt;code&gt;nginx -s reload&lt;/code&gt; did nothing because the container was still pointed at the old inode. Container restart fixed it. Lesson: bind-mount the directory, not individual files.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rental price prediction&lt;/strong&gt; — separate model, harder because rental listings churn 3x faster than sales&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence intervals via quantile regression&lt;/strong&gt; — current spread is a heuristic from comparables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-sourcing the RCN scraper&lt;/strong&gt; — since I genuinely cannot find a working one anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Estimate price:&lt;/strong&gt; &lt;a href="https://wyceniarka.online/wycen" rel="noopener noreferrer"&gt;wyceniarka.online/wycen&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map view (~13k recent transactions):&lt;/strong&gt; &lt;a href="https://wyceniarka.online/mapa" rel="noopener noreferrer"&gt;wyceniarka.online/mapa&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-district analytics (18 districts):&lt;/strong&gt; &lt;a href="https://wyceniarka.online/dzielnice" rel="noopener noreferrer"&gt;wyceniarka.online/dzielnice&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Site is Polish-only (it's literally for Warsaw apartments), but happy to answer technical questions in English in the comments. Particularly interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anyone scraping similar public-registry data in other countries?&lt;/li&gt;
&lt;li&gt;Ideas for improving the offer→transaction calibration?&lt;/li&gt;
&lt;li&gt;Whether you've found rental-price prediction tractable with public data?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>showdev</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
