Caspar Bannink

Posted on Jun 2

What I Learned Normalizing Dublin Rental Listings from Messy Public Sources

#dublin #renting #ireland #proptech

I started HomeScout because Dublin renting is painful in a very specific way: the market is fragmented, fast-moving, and full of near-duplicates. At first I thought the hard part would be the AI layer. It was not.

The hard part was turning messy rental listings into a dataset that was stable enough for software to reason over.

An "AI rental search" product sounds like a natural-language interface problem. In practice, the useful version is mostly data engineering: normalize listing fields, detect duplicates, infer locations, spot stale records, preserve provenance, and only then let an AI system rank or explain anything.

This is what I learned building the listing pipeline behind HomeScout.

A rental listing is not a clean object

The naive schema is simple:

title
address
price
beds
baths
description
source_url
photos

That works for a demo. It breaks as soon as you combine sources.

One source may expose the postal district as a clean field. Another may bury it in the title. One listing says "Dublin 8"; another says "Kilmainham"; another says "near Heuston"; all three may refer to the same practical search area. Some listings include BER ratings, some do not. Some include availability dates. Some have "contact agent" instead of a direct email. Some silently change price without changing the URL.

The first lesson: a listing is not one object. It is a current observation of a property-like thing from a source.

That distinction matters. I ended up treating each source listing as an observation, then linking observations to a normalized listing candidate.

source_observation
  source
  source_listing_id
  source_url
  raw_title
  raw_address
  raw_price
  raw_description
  first_seen_at
  last_seen_at
  raw_payload_hash

normalized_listing
  canonical_title
  normalized_price_eur
  beds
  baths
  inferred_area
  inferred_postal_district
  geo_confidence
  dedupe_group_id
  availability_state

That extra layer sounds boring, but it prevents a lot of downstream mistakes.
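To make the two layers concrete, here is a minimal Python sketch, not HomeScout's actual code. The field names follow the outline above (a subset of them); `payload_hash`, `observe`, and the `observations` list on the normalized record are hypothetical additions for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def payload_hash(payload: dict) -> str:
    """Stable hash of a raw payload (hypothetical helper); key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class SourceObservation:
    source: str
    source_listing_id: str
    source_url: str
    raw_title: str
    raw_price: str
    raw_payload_hash: str
    first_seen_at: datetime
    last_seen_at: datetime


@dataclass
class NormalizedListing:
    canonical_title: str
    normalized_price_eur: Optional[int]
    beds: Optional[int]
    dedupe_group_id: str
    # Observations stay attached so any normalized value can be traced back.
    observations: List[SourceObservation] = field(default_factory=list)


def observe(source: str, listing_id: str, payload: dict, now: datetime) -> SourceObservation:
    """Wrap one raw payload as an observation without interpreting it."""
    return SourceObservation(
        source=source,
        source_listing_id=listing_id,
        source_url=payload.get("url", ""),
        raw_title=payload.get("title", ""),
        raw_price=payload.get("price", ""),
        raw_payload_hash=payload_hash(payload),
        first_seen_at=now,
        last_seen_at=now,
    )
```

Keeping the raw strings on the observation means a parsing mistake in the normalized record can be corrected later without re-crawling the source.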

Deduplication is fuzzy, not exact

Duplicate rental listings rarely match perfectly.

The same apartment can appear with:

slightly different titles
reordered address fragments
different photo counts
one source saying "2 bedroom apartment" and another saying "2 bed flat"
a price changed by 50 euro
a letting agent reposting after a stale listing expires

Exact URL matching is not enough. Exact address matching is not enough either, because addresses are often incomplete or phrased inconsistently.

The dedupe approach that worked best was a scored match across multiple weak signals:

same_or_similar_address       +35
same_postal_district          +10
same_bed_count                +15
price_within_small_delta      +15
title_similarity_high         +10
shared_photo_fingerprint      +30
same_agent_or_agency          +10
seen_within_recent_window     +10

The important part is that no single signal is trusted absolutely. Address is strong, but not always present. Photos are strong, but not always stable. Price is useful, but listings change price. Agent identity helps, but large agencies list many similar apartments.

I also keep the dedupe decision explainable. If two observations are grouped, the system stores the reason and score. That makes it possible to undo bad merges later.
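A scored match of this kind can be sketched in a few lines of Python. The weights are the ones listed above; the match threshold, the price tolerance, the recency window, the string-similarity cut-offs, and the dictionary keys are all assumptions chosen for illustration, not values from HomeScout.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher

WEIGHTS = {
    "same_or_similar_address": 35,
    "same_postal_district": 10,
    "same_bed_count": 15,
    "price_within_small_delta": 15,
    "title_similarity_high": 10,
    "shared_photo_fingerprint": 30,
    "same_agent_or_agency": 10,
    "seen_within_recent_window": 10,
}
MATCH_THRESHOLD = 70       # assumed cut-off for grouping two observations
PRICE_DELTA_EUR = 100      # assumed meaning of "small delta"
RECENT_WINDOW_DAYS = 30    # assumed meaning of "recent window"


def similar(a, b, cutoff):
    """Fuzzy string comparison; missing or empty strings never match."""
    if not a or not b:
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def equal_and_known(a, b):
    """A missing value is not evidence of a match."""
    return a is not None and a == b


@dataclass
class MatchDecision:
    score: int
    reasons: list
    is_match: bool


def score_pair(a: dict, b: dict) -> MatchDecision:
    price_a, price_b = a.get("price_eur"), b.get("price_eur")
    seen_a, seen_b = a.get("last_seen"), b.get("last_seen")
    signals = {
        "same_or_similar_address": similar(a.get("address"), b.get("address"), 0.85),
        "same_postal_district": equal_and_known(a.get("district"), b.get("district")),
        "same_bed_count": equal_and_known(a.get("beds"), b.get("beds")),
        "price_within_small_delta": (
            price_a is not None and price_b is not None
            and abs(price_a - price_b) <= PRICE_DELTA_EUR
        ),
        "title_similarity_high": similar(a.get("title"), b.get("title"), 0.8),
        "shared_photo_fingerprint": bool(
            set(a.get("photo_hashes", ())) & set(b.get("photo_hashes", ()))
        ),
        "same_agent_or_agency": equal_and_known(a.get("agent"), b.get("agent")),
        "seen_within_recent_window": (
            seen_a is not None and seen_b is not None
            and abs((seen_a - seen_b).days) <= RECENT_WINDOW_DAYS
        ),
    }
    # The list of signals that fired is stored with the score so a merge can be audited.
    reasons = [name for name, hit in signals.items() if hit]
    score = sum(WEIGHTS[name] for name in reasons)
    return MatchDecision(score, reasons, score >= MATCH_THRESHOLD)
```

Storing `reasons` next to `score` is what makes a bad merge reversible: a group formed only on agent and district can be found and split later.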

Bad dedupe is worse than no dedupe. If you merge two different apartments, every ranking, alert, and user note attached to that listing becomes suspect.

Stale listings need a state machine

Rental listings disappear quickly. Some are removed. Some are reposted. Some become stale without ever being explicitly marked unavailable.

The first version of my pipeline treated "not found in latest scrape" as unavailable. That was too aggressive. Sources can fail, pages can be incomplete, and rate limits can produce partial results.

The better model is a small state machine:

active -> missing_once -> missing_repeatedly -> stale -> archived
active -> price_changed
active -> content_changed
stale -> active_reposted

That gives the system tolerance for noisy crawls. A listing does not vanish because one run missed it; it becomes stale only after repeated misses.
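The availability part of that state machine might look like the following Python sketch. The state names come from the transitions above; the miss thresholds, the `crawl_complete` flag, and the choice to treat a reappearing archived listing as a repost are assumptions for illustration.

```python
from dataclasses import dataclass

# Consecutive misses needed to move on; both numbers are assumptions.
STALE_AFTER_MISSES = 3
ARCHIVE_AFTER_MISSES = 10


@dataclass
class ListingStatus:
    state: str = "active"
    consecutive_misses: int = 0


def apply_crawl(status: ListingStatus, seen: bool, crawl_complete: bool = True) -> ListingStatus:
    """Return the status of one listing after one crawl run."""
    if not seen and not crawl_complete:
        # A partial or failed crawl is not evidence that the listing is gone.
        return status
    if seen:
        # Assumption: an archived listing that reappears is handled like a stale one.
        reposted = status.state in ("stale", "archived")
        return ListingStatus("active_reposted" if reposted else "active", 0)
    misses = status.consecutive_misses + 1
    if misses >= ARCHIVE_AFTER_MISSES:
        state = "archived"
    elif misses >= STALE_AFTER_MISSES:
        state = "stale"
    elif misses >= 2:
        state = "missing_repeatedly"
    else:
        state = "missing_once"
    return ListingStatus(state, misses)
```

The `crawl_complete` flag is the piece that absorbs rate limits and source failures: an incomplete run leaves every unseen listing exactly where it was.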

This also matters for alerts. A reposted stale listing is not the same as a new listing, but for a renter it might still be relevant. A price drop is not a new listing either, but it can be more important than a new listing.

So the event stream needs more nuance than "new property found."
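One way to get that nuance is to derive events by comparing the previous and current snapshot of a listing. The sketch below is illustrative only: the event names and the dictionary keys (`state`, `price_eur`, `description_hash`) are hypothetical.

```python
from typing import List, Optional


def listing_events(previous: Optional[dict], current: dict) -> List[str]:
    """Compare two snapshots of the same listing and name what changed."""
    if previous is None:
        return ["new_listing"]
    events = []
    # A stale listing that comes back is a repost, not a new listing.
    if previous.get("state") == "stale" and current.get("state") == "active":
        events.append("reposted")
    old, new = previous.get("price_eur"), current.get("price_eur")
    if old is not None and new is not None and old != new:
        events.append("price_drop" if new < old else "price_increase")
    if previous.get("description_hash") != current.get("description_hash"):
        events.append("content_changed")
    return events
```

An alerting layer can then decide per user which event types are worth a notification, instead of treating every change as a new property.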

Address ambiguity is the hardest product problem

Dublin addresses are messy from a search perspective.

Users think in areas, commutes, postal districts, landmarks, and transport lines. Listings use a mix of all of those.

Examples:

"Dublin 6"
"Rathmines"
"near Ranelagh Luas"
"city centre"
"Docklands"
"Grand Canal"
"Dublin 2"

These overlap but are not interchangeable.

I split location handling into three layers:

Raw location text from the source.
Inferred structured labels: area, postal district, locality.
Geographic confidence: exact, approximate, area-level, unknown.

The confidence field is important. If a listing only says "Dublin city centre", the system should not pretend it has precise coordinates. It can still be useful, but the UI and ranking need to know the location is approximate.

This also affects natural-language search. If a user says "near the DART", an LLM should not answer by inventing areas. The phrase should resolve through a deterministic lookup table of stations, corridors, and distance bands.

LLMs are useful for translating messy user intent into structured constraints. They are not a good source of geographic truth.
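A deterministic version of "near the DART" can be a plain distance calculation against a station table, gated by the location confidence. In the Python sketch below, the station coordinates are rough placeholders for illustration (a real table would be loaded from an authoritative transport dataset), and the band cut-offs and labels are assumptions.

```python
import math
from typing import Optional

# Rough placeholder coordinates, for illustration only.
DART_STATIONS = {
    "Connolly": (53.3531, -6.2461),
    "Pearse": (53.3433, -6.2487),
}

# Distance bands in metres; the cut-offs and labels are assumptions.
BANDS = [(500, "walkable"), (1500, "nearby")]


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def dart_band(lat: float, lon: float, geo_confidence: str) -> Optional[str]:
    """Return a distance band, or None when the location is too vague to measure."""
    if geo_confidence not in ("exact", "approximate"):
        return None
    nearest = min(
        haversine_m(lat, lon, s_lat, s_lon) for s_lat, s_lon in DART_STATIONS.values()
    )
    for limit, label in BANDS:
        if nearest <= limit:
            return label
    return "far"
```

Returning `None` for area-level or unknown locations keeps the ranking layer from treating a guessed coordinate as a measured one.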

Price normalization is not just parsing euros

Most listings are monthly rent, but the raw text still needs care.

Common problems:

commas and periods in different places
"per month" omitted
bills included or excluded
sharing listings mixed with whole-property listings
parking or utility fees mentioned in description
price changes on the same URL

For HomeScout, I normalize rent to monthly EUR and store the raw price string separately. If a price changes, that is an event, not just a field update.

I also avoid over-normalizing things I cannot prove. If a description says "bills included", that becomes a flag with source evidence. If it only implies bills might be included, it stays unknown.
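The following Python sketch shows one way to handle these cases. It is an illustration under stated assumptions: the separator rule (a trailing separator followed by exactly two digits is a decimal mark, anything else is a thousands mark), the weekly-to-monthly conversion of 52/12, and all function names are mine, and the parser expects the raw price field rather than free text.

```python
import re
from dataclasses import dataclass
from typing import Optional

MONTHLY = re.compile(r"\b(?:per month|monthly|pcm)\b", re.I)
WEEKLY = re.compile(r"\b(?:per week|weekly|pw)\b", re.I)
BILLS_NO = re.compile(r"\bbills\s+(?:are\s+)?(?:not\s+included|excluded)\b", re.I)
BILLS_YES = re.compile(r"\bbills\s+(?:are\s+)?included\b", re.I)


@dataclass
class Rent:
    raw: str                      # the original string is always kept
    monthly_eur: Optional[float]
    period_stated: bool           # False means "monthly" was assumed, not read


def parse_amount(raw: str) -> Optional[float]:
    """Parse the first number in a price string, tolerating ',' and '.' in either role."""
    m = re.search(r"\d[\d.,]*", raw)
    if not m:
        return None
    token = m.group(0).rstrip(".,")
    dec = re.search(r"[.,](\d{2})$", token)
    whole, cents = (token[: dec.start()], dec.group(1)) if dec else (token, "0")
    whole = re.sub(r"[.,]", "", whole)
    return float(f"{whole}.{cents}") if whole else None


def normalize_rent(raw: str) -> Rent:
    amount = parse_amount(raw)
    if amount is None:
        return Rent(raw, None, False)
    if WEEKLY.search(raw):
        return Rent(raw, round(amount * 52 / 12, 2), True)
    return Rent(raw, amount, bool(MONTHLY.search(raw)))


def bills_included(description: str) -> Optional[bool]:
    """True or False only on explicit wording; None when the text does not say."""
    if BILLS_NO.search(description):
        return False
    if BILLS_YES.search(description):
        return True
    return None
```

The `period_stated` flag and the `None` return from `bills_included` carry the uncertainty forward instead of hiding it behind a clean number or a default `False`.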

This is where a lot of AI products quietly go wrong: they convert uncertainty into false certainty because clean fields are easier to rank.

Provenance matters more than it sounds

Every normalized field should know where it came from.

For example:

beds = 2
source = parsed_title
confidence = high

area = Rathmines
source = address_text
confidence = medium

pet_friendly = unknown
source = no_positive_evidence
confidence = low

That makes the AI explanation layer much safer.

Instead of saying:

This listing is pet friendly.

it can say:

I did not find a pet policy in the listing text.

That difference matters because users act on these explanations. If the system cannot distinguish "false" from "unknown", it will mislead people.
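In code, the distinction comes down to a three-valued field that carries its own source. This is a minimal Python sketch; the `FieldValue` type, the `explain_pets` function, and the wording of the positive and negative messages are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldValue:
    value: Optional[bool]   # None means unknown, which is not the same as False
    source: str
    confidence: str


def explain_pets(field: FieldValue) -> str:
    """Turn a provenance-tagged field into a statement that never overclaims."""
    if field.value is None:
        return "I did not find a pet policy in the listing text."
    verdict = "allowed" if field.value else "not allowed"
    return (
        f"The listing says pets are {verdict} "
        f"(source: {field.source}, confidence: {field.confidence})."
    )


unknown = FieldValue(value=None, source="no_positive_evidence", confidence="low")
print(explain_pets(unknown))
```

Because the explanation is built from the field rather than generated freely, the AI layer has no route to claim a pet policy the pipeline never found.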

The AI layer should be downstream

The architecture that has worked best is:

collect observations
normalize fields
dedupe candidates
infer location with confidence
track listing state
build user-specific hard filters
rank candidates
use AI to explain, draft, and summarize

The LLM is not the database. It is not the source of truth. It sits after the deterministic pipeline.

For example, if a renter says:

"2 bed near the Luas under 2200, ideally not too far from Grand Canal"

the system should parse that into:

{
  "beds_min": 2,
  "max_price_eur": 2200,
  "transport": ["luas"],
  "soft_area_preference": ["Grand Canal"]
}

Then normal database queries and geographic lookups do the heavy lifting. The AI can help explain why a listing matched, draft an inquiry email, or summarize tradeoffs, but it should not hallucinate inventory.
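As a rough Python illustration of that split, the parsed constraints can drive a hard filter followed by a soft ordering. The listing keys (`price_eur`, `near_transport`, `area`), the function names, and the rule that unknown values fail a hard constraint are assumptions made for this sketch.

```python
import json

constraints = json.loads("""
{
  "beds_min": 2,
  "max_price_eur": 2200,
  "transport": ["luas"],
  "soft_area_preference": ["Grand Canal"]
}
""")


def passes_hard_filters(listing: dict, c: dict) -> bool:
    beds, price = listing.get("beds"), listing.get("price_eur")
    if beds is None or price is None:
        # Assumption: an unknown value cannot satisfy a hard constraint.
        return False
    if beds < c.get("beds_min", 0):
        return False
    if price > c.get("max_price_eur", float("inf")):
        return False
    # Every requested transport mode must be among the listing's precomputed labels.
    return set(c.get("transport", [])) <= set(listing.get("near_transport", []))


def rank(listings: list, c: dict) -> list:
    preferred = set(c.get("soft_area_preference", []))
    matches = [item for item in listings if passes_hard_filters(item, c)]
    # Soft preferences reorder the results; they never remove any.
    return sorted(matches, key=lambda item: (item.get("area") not in preferred, item["price_eur"]))
```

Everything returned by `rank` exists in the dataset, so whatever the AI layer writes about those listings is an explanation of real inventory.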

What I would do earlier next time

If I were starting again, I would invest earlier in three things.

First, raw observation storage. Keep the raw payloads or at least stable hashes and extracted raw fields. You will need them when a normalized decision looks wrong.

Second, confidence scores. Not ML confidence in the fancy sense, just explicit quality labels for inferred fields. Exact address is not the same as inferred area. Unknown is not false.

Third, event history. Renters care about changes: new listing, price drop, repost, stale, reactivated. A snapshot table alone loses that.

The main lesson is that AI is only useful if the underlying data model is honest about uncertainty.

For rental search, the hard technical problem is not making a chatbot that talks about apartments. It is building a data pipeline that knows what it knows, knows what it guessed, and does not blur the two.

That is the part I underestimated.

I am building HomeScout for Dublin renters: https://homescout.io
