Data Normalization Across Dublin Rental Portals: How to Make Listings Comparable

#data #dataengineering #sideprojects #webscraping

Dublin rental listings are fragmented even across the main portals. Daft.ie and Rent.ie use different structures, labels, price conventions, and quirks, which makes direct comparison harder than it should be.

When I built the comparison layer for HomeScout, the aggregation part turned out to be straightforward compared to normalization. Here's what the normalization problem actually looks like, and how I approached it.

The raw data problem

Consider something as simple as price. Across sources you'll see:

"€1,750 per month"
"1750 pcm"
"£1750/month" (some UK-registered portals still do this for Dublin listings)
"From €1,700" (minimum of a range)
"Price on application"
"1,750" with currency implied by context
"€1,750 per month + utilities" (you have to decide whether to strip the utility note or flag it)

That's one field. Beds has similar variation: "2 bed", "2 bedroom", "Two bedrooms", "2BR", "2+1" (2 beds plus a box room). Some sources omit beds entirely and describe the property type instead.

Area/location is the worst. Sources use different geographic taxonomies. One uses Dublin postal codes (D4, D6, D6W). Another uses neighborhood names (Rathmines, Ranelagh, Rathgar). Another uses the street address and nothing else. Some use both but inconsistently. The same property can appear as "Ranelagh, Dublin 6" on one source and "Dublin 6" on another, and you have to know those are the same area.

The normalization pipeline

Each source gets a custom extractor that produces a raw record. The raw record has whatever fields the source provides, with light cleaning (strip HTML, trim whitespace, decode entities). No interpretation yet.

The normalization step runs after extraction. It takes the raw record and produces a canonical record with typed, standardized fields.

Price normalization:

import re


def normalize_price(raw_price: str) -> tuple[int | None, str]:
    """
    Returns (monthly_eur, price_qualifier)
    qualifier: 'exact' | 'from' | 'on_application' | 'unknown'
    """
    if not raw_price:
        return None, 'unknown'

    raw = raw_price.lower().strip()

    # Match "poa" as a whole word so it cannot fire inside another word
    if 'application' in raw or re.search(r'\bpoa\b', raw):
        return None, 'on_application'

    # Sterling amounts are not converted here; returning them as euro would be wrong
    if '£' in raw:
        return None, 'unknown'

    # Take the first number only, so a range like "€1,700 - €1,900" yields 1700
    match = re.search(r'\d[\d,]*(?:\.\d+)?', raw)
    if not match:
        return None, 'unknown'
    value = float(match.group().replace(',', ''))

    # Weekly to monthly conversion ("pw" matched as a whole word)
    if re.search(r'per week|/\s*week|\bpw\b', raw):
        value = value * 52 / 12

    qualifier = 'from' if raw.startswith('from') else 'exact'
    return round(value), qualifier

The qualifier field matters for display. A "from" price should be labeled differently than an exact price in comparison views.

Bedroom normalization:

Word-to-number mapping handles written numbers. The "+1" box room convention gets flagged separately so you can filter on actual bedrooms vs. "bedrooms including box room."

Geographic normalization:

This is the hard part. My approach:

Extract any Dublin postcode from the raw location string (regex for D1-D24, D6W)
If no postcode, attempt a fuzzy match against a neighborhood lookup table
If that fails, geocode the street address and assign to the containing postal district
If all else fails, store the raw string and flag for manual review

The neighborhood lookup table is a maintained JSON file with aliases. "Ranelagh" maps to D6. "Rathmines" maps to D6. "Rathgar" maps to D6. "Harold's Cross" maps to D6W. And so on. It's not glamorous but it works.

{
  "ranelagh": {"district": "D6", "canonical_name": "Ranelagh"},
  "rathmines": {"district": "D6", "canonical_name": "Rathmines"},
  "rathgar": {"district": "D6", "canonical_name": "Rathgar"},
  "harolds cross": {"district": "D6W", "canonical_name": "Harold's Cross"},
  ...
}

Deduplication

The same property often appears on multiple sources. Deduplication is a separate pass after normalization.

I use a blocking strategy: only compare listings within the same price band (+/- 10%) and same area (same district or neighboring districts). Within a block, I compute a similarity score based on:

Address string similarity (Levenshtein on the normalized address)
Price match (exact or within 5%)
Bed/bath match
Available date proximity

When the score is above the threshold, I mark the pair as a duplicate, keep the richest record (the one with the most fields populated), and store source provenance for both.
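Here is a rough sketch of the blocking check and the similarity score. The weights, the 14-day date window, the choice to measure percentages against the higher of the two prices, and the neighbor map are all placeholder assumptions, not HomeScout's tuned values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Listing:
    address: str        # already normalized (lowercased, abbreviations expanded)
    monthly_eur: int
    district: str
    beds: int
    baths: int
    available: date


# Illustrative adjacency only; a real map would cover every district.
NEIGHBORS = {"D6": {"D6W"}, "D6W": {"D6"}}


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def within(a: int, b: int, tolerance: float) -> bool:
    """True if the two prices differ by at most `tolerance` of the higher one."""
    return abs(a - b) <= tolerance * max(a, b)


def same_block(a: Listing, b: Listing) -> bool:
    """Blocking: only pairs in the same price band and area get scored."""
    area_ok = a.district == b.district or b.district in NEIGHBORS.get(a.district, set())
    return area_ok and within(a.monthly_eur, b.monthly_eur, 0.10)


def similarity(a: Listing, b: Listing) -> float:
    """Weighted score in [0, 1] over the four signals listed above."""
    longest = max(len(a.address), len(b.address))
    address = 1.0 if longest == 0 else 1 - levenshtein(a.address, b.address) / longest
    price = within(a.monthly_eur, b.monthly_eur, 0.05)
    rooms = a.beds == b.beds and a.baths == b.baths
    dates = abs((a.available - b.available).days) <= 14
    return 0.5 * address + 0.2 * price + 0.2 * rooms + 0.1 * dates
```

Blocking keeps the number of pairwise comparisons manageable, since the edit-distance step is the expensive part and only runs on pairs that could plausibly be the same property.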

The threshold needs tuning. Too low and you collapse distinct listings. Too high and you miss obvious duplicates. I landed on a threshold that errs toward keeping separate records when uncertain, because a false merge (showing one listing when there are two) is worse than a false non-merge (showing the same property twice).
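A minimal sketch of the merge decision, assuming listings are plain dicts with a `source` key. The threshold value of 0.85 and the function names are hypothetical; the point is only that the cutoff sits high and that the surviving record remembers every source it came from.

```python
# Assumed value: set high so that uncertain pairs stay separate.
DUPLICATE_THRESHOLD = 0.85


def is_duplicate(score: float) -> bool:
    return score >= DUPLICATE_THRESHOLD


def populated_fields(record: dict) -> int:
    """Count fields that carry a value, ignoring the source label itself."""
    return sum(
        1 for key, value in record.items()
        if key != "source" and value not in (None, "", [])
    )


def merge_duplicates(a: dict, b: dict) -> dict:
    """Keep the richer record and store provenance for both sources."""
    richest = max(a, b, key=populated_fields)
    merged = dict(richest)
    merged["sources"] = [a["source"], b["source"]]
    return merged
```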

Making listings comparable in the UI

After normalization, every listing has the same fields in the same format. The comparison view is then straightforward: pick listings to compare, render their canonical fields side by side.

The useful columns for Dublin rentals turned out to be: price, beds, baths, area (with DART/Luas proximity calculated from lat/lng), included utilities, pet policy, available date, and lease term. I surface which fields came from structured source data vs. which were inferred from description text, because inferred data has lower reliability.
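One simple way to carry that distinction through to the UI is to store each field's origin next to its value. The sketch below is an assumption about shape, not the actual HomeScout model: `FieldValue`, `comparison_row`, and the "(inferred)" marker are illustrative names.

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class FieldValue:
    value: Any
    origin: Literal["structured", "inferred"]  # where the value came from


def comparison_row(name: str, listings: list[dict[str, FieldValue]]) -> list[str]:
    """Render one canonical field across several listings, side by side."""
    cells = []
    for listing in listings:
        field_value = listing.get(name)
        if field_value is None:
            cells.append("n/a")
        elif field_value.origin == "inferred":
            # Mark lower-reliability values so the reader can discount them
            cells.append(f"{field_value.value} (inferred)")
        else:
            cells.append(str(field_value.value))
    return cells
```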

I wrote a more user-facing version of this at https://homescout.io/guide/how-to-compare-dublin-apartments-without-spreadsheet if you want to see what the normalized comparison looks like in practice.

The normalization work is not exciting. It's the kind of thing that takes three times longer than you expect and surfaces a new edge case every week. But it's the foundation everything else sits on.

Caspar Bannink. Founder of HomeScout.io. Building AI-powered rental search for Dublin.