
Caspar Bannink

Originally published at homescout.io

Architecture of a Rental Aggregator: Scraping and Normalizing 90+ Sources

Building a rental aggregator for Dublin means pulling data from a fragmented market: one dominant portal, a handful of mid-tier sites, dozens of letting agency websites, property management company portals, and a long tail of small sources. Here's how the system is structured and where the interesting problems are.

Why existing portals aren't enough

The standard answer for finding Dublin rentals is Daft.ie. It has good coverage and solid search. The problem is that coverage isn't complete. Letting agencies list exclusively on their own websites. Some landlords use smaller portals. A non-trivial share of listings never hits Daft at all.

If you're only searching Daft, you're seeing maybe 60-70% of what's available. For a renter in a tight market, that gap matters.

An aggregator's value proposition is simple: search everything, show results in one place. The technical challenge is that "everything" means 90+ sources with no shared API, no standard format, and varying levels of scraping difficulty.

Source taxonomy

I classify sources into tiers based on scraping approach:

Tier 1: Structured APIs or feeds. A small number of sources expose RSS feeds or have semi-public JSON endpoints. These are easy. Pull, parse, done.
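As a sketch of how little work Tier 1 takes, assuming a source that exposes a standard RSS feed (the function name and the fields pulled here are hypothetical), feedparser does most of it:

```python
import feedparser


def pull_feed(feed_url: str) -> list[dict]:
    # Tier 1: fetch and parse a standard RSS/Atom feed of new listings.
    feed = feedparser.parse(feed_url)
    return [
        {"title": e.title, "url": e.link, "published": e.get("published")}
        for e in feed.entries
    ]
```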

Tier 2: Consistent HTML structure. Most mid-tier portals have HTML stable enough that a straightforward scraper works reliably. BeautifulSoup or Playwright, depending on whether the content is server-rendered or JS-generated.

Tier 3: Letting agency sites. These are the hardest. Each one is different. Some run on estate agent CMSes (Property Hive, Agentbox, PropertyBase), which gives you a consistent structure within a platform family. Others are custom-built or use generic CMS platforms with property listings bolted on. I maintain source-specific extractors for each one.

Tier 4: Social and informal. Facebook Marketplace, some WhatsApp community channels that get scraped via publicly accessible links. Lower data quality, higher volume of noise.

The scraping layer

Each source has a scraper that handles:

  • Discovery: Finding listing URLs. Pagination, sitemap traversal, or feed parsing depending on source type.
  • Extraction: Pulling raw data from a listing page. Structured fields where available, falling back to HTML parsing.
  • Change detection: Tracking whether a listing we've already seen has changed (price update, status change, taken down).
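With 90+ sources, the easiest way to keep this consistent is one narrow interface per scraper. A minimal sketch of that contract; the class and method names are illustrative, not the actual codebase:

```python
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator


class SourceScraper(ABC):
    """One subclass per source; names here are illustrative."""

    @abstractmethod
    def discover(self) -> AsyncIterator[str]:
        """Yield listing URLs: pagination, sitemap traversal, or feed parsing."""

    @abstractmethod
    async def extract(self, url: str) -> dict:
        """Pull a raw record from one listing page."""

    @abstractmethod
    def fingerprint(self, raw: dict) -> str:
        """Stable hash over the fields that matter for change detection."""
```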

For JS-heavy sources I use Playwright with a headless Chromium instance. The overhead is significant compared to simple HTTP requests, so I only use it where necessary. Most letting agency sites are server-rendered and much cheaper to scrape.
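A minimal version of the Playwright path, roughly as described above (in practice you'd reuse the browser across pages rather than launching one per URL):

```python
from playwright.async_api import async_playwright


async def fetch_rendered(url: str) -> str:
    # Launch headless Chromium, wait for the page's JS to settle, return the DOM.
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html
```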

Rate limiting is handled per source. I track request timestamps per domain and enforce minimum intervals. The last thing you want is to get blocked from a source because you hammered it.

```python
import asyncio
import time


class RateLimiter:
    """Enforces a minimum interval between requests to the same domain."""

    def __init__(self, requests_per_minute: int):
        self.min_interval = 60.0 / requests_per_minute
        self.last_request: dict[str, float] = {}

    async def wait(self, domain: str):
        # Sleep just long enough to respect the per-domain minimum interval.
        now = time.time()
        last = self.last_request.get(domain, 0.0)
        elapsed = now - last
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.time()
```
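One shared limiter instance, keyed by domain before every request, keeps this honest. A usage sketch (aiohttp is an assumption here; the post doesn't say which HTTP client the simple-request path uses):

```python
from urllib.parse import urlparse

import aiohttp

limiter = RateLimiter(requests_per_minute=12)  # one request every 5 seconds per domain


async def polite_get(session: aiohttp.ClientSession, url: str) -> str:
    # Wait out the per-domain interval before hitting the source.
    await limiter.wait(urlparse(url).netloc)
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()
```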

The normalization layer

Every extractor produces a raw record. The normalization pipeline converts raw records to canonical form. I covered this in detail in the data normalization post, but the key fields are:

  • Price: always monthly EUR, with a qualifier (exact/from/on_application)
  • Bedrooms: integer, separated from "box room" counts
  • Location: normalized to neighborhood + postal district + lat/lng
  • Pet policy: boolean or null (not inferred when absent)
  • Available date: ISO 8601 or null
  • Source provenance: which extractor, when fetched, source URL
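A sketch of the canonical record as a dataclass; the field and enum names below are illustrative renderings of the list above, not the actual schema:

```python
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum


class PriceQualifier(Enum):
    EXACT = "exact"
    FROM = "from"
    ON_APPLICATION = "on_application"


@dataclass
class Listing:
    price_eur_monthly: int | None
    price_qualifier: PriceQualifier
    bedrooms: int | None          # box rooms counted separately
    neighborhood: str | None
    postal_district: str | None
    lat: float | None
    lng: float | None
    pets_allowed: bool | None     # null when the source is silent, never inferred
    available_from: date | None   # parsed to ISO 8601
    source: str                   # which extractor produced this
    source_url: str
    fetched_at: datetime
```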

The normalization step is where most of the bugs live. Edge cases in source formatting surface constantly. A source starts using a new price format. A letting agency relaunches their website with different HTML structure. I run normalization in a separate pass from extraction so I can reprocess raw records when normalization logic changes without re-scraping.

Deduplication

Cross-source deduplication is essential. The same property often appears on multiple sources. Without deduplication, a user would see the same listing three times with slightly different data.

The deduplication approach is blocking plus similarity scoring:

  1. Block by price band (±10%) and geographic area (same district or adjacent)
  2. Within blocks, score similarity on address string, price, beds, baths, available date
  3. Pairs above the threshold get merged. The canonical record keeps the richest data from all matching sources.

This runs as a batch job after normalization completes for a crawl cycle.
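A sketch of step 2, the within-block similarity score; the weights and threshold are illustrative, and stdlib difflib stands in for whatever string matcher the real pipeline uses:

```python
from difflib import SequenceMatcher


def similarity(a: dict, b: dict) -> float:
    # Weighted score across the fields used for matching within a block.
    address = SequenceMatcher(None, a["address"], b["address"]).ratio()
    price = 1.0 - min(abs(a["price"] - b["price"]) / max(a["price"], b["price"]), 1.0)
    beds = 1.0 if a["beds"] == b["beds"] else 0.0
    baths = 1.0 if a.get("baths") == b.get("baths") else 0.0
    avail = 1.0 if a.get("available_from") == b.get("available_from") else 0.0
    return 0.4 * address + 0.25 * price + 0.15 * beds + 0.1 * baths + 0.1 * avail


MERGE_THRESHOLD = 0.8  # illustrative; pairs scoring above this get merged
```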

Storage and freshness

Listings go into Postgres. I keep full history: every time a listing changes, the previous state is preserved with a timestamp. This lets me track price changes over time, which surfaces useful signals (listings that drop in price are often still available and motivated to rent).
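One way to structure this is an append-only state table, where the current listing is simply the newest row. A sketch assuming psycopg and a hypothetical listing_states table:

```python
import psycopg


def record_state(conn: psycopg.Connection, listing_id: str, price: int, status: str) -> None:
    # Append-only: every observed change becomes a new timestamped row,
    # so price history over time falls out of a simple query.
    conn.execute(
        """
        INSERT INTO listing_states (listing_id, price_eur, status, observed_at)
        VALUES (%s, %s, %s, now())
        """,
        (listing_id, price, status),
    )
```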

Each listing has a freshness_score that decays over time. The score starts high when a listing is first seen or updated, and drops on a schedule tuned to how frequently that source typically updates. Stale listings get surfaced to users with a staleness label rather than hidden completely, because they might still be available.
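The decay can be as simple as an exponential with a per-source half-life; a sketch (the production curve is tuned per source, and this formulation is illustrative):

```python
import math
from datetime import datetime, timezone


def freshness_score(last_seen: datetime, source_halflife_hours: float) -> float:
    # Score halves every `source_halflife_hours`, tuned to the source's update cadence.
    age_hours = (datetime.now(timezone.utc) - last_seen).total_seconds() / 3600
    return 0.5 ** (age_hours / source_halflife_hours)
```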

The crawl cycle runs on a schedule per source. High-value sources (the main portals) run every 15-30 minutes. Long-tail sources run every few hours. The full catalog is refreshed within a 24-hour window at minimum.
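The schedule itself reduces to a per-source interval table; a sketch using the cadences above (the source keys are illustrative):

```python
CRAWL_INTERVAL_MINUTES = {
    "daft": 15,             # main portals: every 15-30 minutes
    "mid_tier_portal": 60,
    "small_agency": 240,    # long tail: every few hours
}

MAX_STALENESS_HOURS = 24    # full catalog refreshed at least daily
```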

Where this breaks down

A few honest failure modes:

Listings that go fast. A listing can be posted and let within hours. If a source is on a 4-hour crawl cycle, the user might never see it. For the highest-priority sources I push the crawl frequency as high as the site tolerates. But there's no fix for a source that updates infrequently.

Data accuracy. Scraped data is only as accurate as the source. Listings sometimes have wrong prices, wrong bed counts, or outdated availability dates. There's no reliable way to independently verify these without viewings. I surface source data as-is with the provenance visible.

CAPTCHAs and bot detection. Some sources actively block scraping. I don't fight these. If a source blocks me, I either find a publicly accessible feed or exclude that source.

I wrote a more user-facing view of how the aggregation works at https://homescout.io/guide/tools-find-apartment-dublin. This post is the technical layer underneath that.


Caspar Bannink. Founder of HomeScout.io. Building AI-powered rental search for Dublin.
