I started HomeScout because Dublin renting is painful in a very specific way: the market is fragmented, fast-moving, and full of near-duplicates. At first I thought the hard part would be the AI layer. It was not.
The hard part was turning messy rental listings into a dataset that was stable enough for software to reason over.
An "AI rental search" product sounds like a natural-language interface problem. In practice, the useful version is mostly data engineering: normalize listing fields, detect duplicates, infer locations, spot stale records, preserve provenance, and only then let an AI system rank or explain anything.
This is what I learned building the listing pipeline behind HomeScout.
A rental listing is not a clean object
The naive schema is simple:
title
address
price
beds
baths
description
source_url
photos
That works for a demo. It breaks as soon as you combine sources.
One source may expose the postal district as a clean field. Another may bury it in the title. One listing says "Dublin 8"; another says "Kilmainham"; another says "near Heuston"; all three may refer to the same practical search area. Some listings include BER ratings, some do not. Some include available dates. Some have "contact agent" instead of a direct email. Some silently change price without changing URL.
The first lesson: a listing is not one object. It is a current observation of a property-like thing from a source.
That distinction matters. I ended up treating each source listing as an observation, then linking observations to a normalized listing candidate.
source_observation
  source
  source_listing_id
  source_url
  raw_title
  raw_address
  raw_price
  raw_description
  first_seen_at
  last_seen_at
  raw_payload_hash

normalized_listing
  canonical_title
  normalized_price_eur
  beds
  baths
  inferred_area
  inferred_postal_district
  geo_confidence
  dedupe_group_id
  availability_state
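That extra layer sounds boring, but it prevents a lot of downstream mistakes: the raw observation stays intact, so when a normalized value looks wrong you can trace it back to what the source actually said.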
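As a rough illustration, the two layers could be modelled like this in Python. The field names come from the schema above; the types, the `payload_hash` helper, and the choice of SHA-256 over sorted JSON are my assumptions for the sketch, not a description of the production code.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


def payload_hash(raw_payload: dict) -> str:
    """Hypothetical helper: a stable hash of the raw payload.

    Keys are sorted so the same payload always hashes the same way.
    """
    encoded = json.dumps(raw_payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


@dataclass
class SourceObservation:
    """One sighting of a listing on one source, kept close to raw."""
    source: str
    source_listing_id: str
    source_url: str
    raw_title: str
    raw_address: str
    raw_price: str
    raw_description: str
    first_seen_at: datetime
    last_seen_at: datetime
    raw_payload_hash: str


@dataclass
class NormalizedListing:
    """The cleaned candidate that one or more observations link to."""
    canonical_title: str
    normalized_price_eur: Optional[int] = None      # None means unknown
    beds: Optional[int] = None
    baths: Optional[int] = None
    inferred_area: Optional[str] = None
    inferred_postal_district: Optional[str] = None
    geo_confidence: str = "unknown"
    dedupe_group_id: Optional[str] = None
    availability_state: str = "active"
```

Comparing `raw_payload_hash` between crawls is a cheap way to notice that a listing changed without changing its URL.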
Deduplication is fuzzy, not exact
Duplicate rental listings rarely match perfectly.
The same apartment can appear with:
- slightly different titles
- reordered address fragments
- different photo counts
- one source saying "2 bedroom apartment" and another saying "2 bed flat"
- a price changed by 50 euro
- a letting agent reposting after a stale listing expires
Exact URL matching is not enough. Exact address matching is not enough either, because addresses are often incomplete or phrased inconsistently.
The dedupe approach that worked best was a scored match across multiple weak signals:
same_or_similar_address     +35
same_postal_district        +10
same_bed_count              +15
price_within_small_delta    +15
title_similarity_high       +10
shared_photo_fingerprint    +30
same_agent_or_agency        +10
seen_within_recent_window   +10
The important part is that no single signal is trusted absolutely. Address is strong, but not always present. Photos are strong, but not always stable. Price is useful, but listings change price. Agent identity helps, but large agencies list many similar apartments.
I also keep the dedupe decision explainable. If two observations are grouped, the system stores the reason and score. That makes it possible to undo bad merges later.
Bad dedupe is worse than no dedupe. If you merge two different apartments, every ranking, alert, and user note attached to that listing becomes suspect.
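Here is a minimal sketch of that scoring idea. The weights are the ones listed above; everything else (the `Candidate` shape, the 0.8 similarity cut-offs, the 50 euro price delta, and the merge threshold of 70) is an illustrative assumption rather than the tuned production values.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import FrozenSet, List, Optional, Tuple

WEIGHTS = {
    "same_or_similar_address": 35,
    "same_postal_district": 10,
    "same_bed_count": 15,
    "price_within_small_delta": 15,
    "title_similarity_high": 10,
    "shared_photo_fingerprint": 30,
    "same_agent_or_agency": 10,
    "seen_within_recent_window": 10,
}
PRICE_DELTA_EUR = 50   # assumed meaning of "small delta"
MERGE_THRESHOLD = 70   # assumed cut-off for grouping two observations


@dataclass(frozen=True)
class Candidate:
    title: str
    address: Optional[str] = None
    postal_district: Optional[str] = None
    beds: Optional[int] = None
    price_eur: Optional[int] = None
    agency: Optional[str] = None
    photo_fingerprints: FrozenSet[str] = frozenset()


def _address_similarity(a: str, b: str) -> float:
    # Jaccard overlap on tokens, so reordered address fragments still match.
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def score_pair(a: Candidate, b: Candidate,
               seen_within_window: bool = False) -> Tuple[int, List[str]]:
    """Return (score, reasons). Missing fields never contribute a signal."""
    reasons = []
    if a.address and b.address and _address_similarity(a.address, b.address) >= 0.8:
        reasons.append("same_or_similar_address")
    if a.postal_district and a.postal_district == b.postal_district:
        reasons.append("same_postal_district")
    if a.beds is not None and a.beds == b.beds:
        reasons.append("same_bed_count")
    if (a.price_eur is not None and b.price_eur is not None
            and abs(a.price_eur - b.price_eur) <= PRICE_DELTA_EUR):
        reasons.append("price_within_small_delta")
    if SequenceMatcher(None, a.title.lower(), b.title.lower()).ratio() >= 0.8:
        reasons.append("title_similarity_high")
    if a.photo_fingerprints & b.photo_fingerprints:
        reasons.append("shared_photo_fingerprint")
    if a.agency and a.agency == b.agency:
        reasons.append("same_agent_or_agency")
    if seen_within_window:
        reasons.append("seen_within_recent_window")
    return sum(WEIGHTS[r] for r in reasons), reasons
```

Storing the returned `reasons` list next to the score is what keeps a merge explainable and reversible: a bad grouping can be traced to the specific signals that fired.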
Stale listings need a state machine
Rental listings disappear quickly. Some are removed. Some are reposted. Some become stale without ever being explicitly marked unavailable.
The first version of my pipeline treated "not found in latest scrape" as unavailable. That was too aggressive. Sources can fail, pages can be incomplete, and rate limits can produce partial results.
The better model is a small state machine:
active -> missing_once -> missing_repeatedly -> stale -> archived
active -> price_changed
active -> content_changed
stale -> active_reposted
That gives the system tolerance for noisy crawls. A listing does not vanish because one run missed it. It only becomes stale after repeated evidence.
This also matters for alerts. A reposted stale listing is not the same as a new listing, but for a renter it might still be relevant. A price drop is not a new listing either, but it can be more important than a new listing.
So the event stream needs more nuance than "new property found."
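The missing-to-stale path can be sketched as a small transition function. The state names follow the diagram above; the miss thresholds, the `crawl_ok` flag, and the decision to treat a reappearing archived listing like a reposted stale one are assumptions made for the example. Price and content changes are left out here.

```python
from enum import Enum
from typing import Optional, Tuple


class State(str, Enum):
    ACTIVE = "active"
    MISSING_ONCE = "missing_once"
    MISSING_REPEATEDLY = "missing_repeatedly"
    STALE = "stale"
    ARCHIVED = "archived"


STALE_AFTER_MISSES = 3     # assumed threshold
ARCHIVE_AFTER_MISSES = 6   # assumed threshold


def advance(state: State, misses: int, seen: bool,
            crawl_ok: bool = True) -> Tuple[State, int, Optional[str]]:
    """Return (new_state, new_miss_count, event_or_None) after one crawl."""
    if not crawl_ok:
        # A failed or partial crawl is not evidence that the listing is gone.
        return state, misses, None
    if seen:
        reposted = state in (State.STALE, State.ARCHIVED)
        return State.ACTIVE, 0, "active_reposted" if reposted else None
    misses += 1
    if misses >= ARCHIVE_AFTER_MISSES:
        new_state = State.ARCHIVED
    elif misses >= STALE_AFTER_MISSES:
        new_state = State.STALE
    elif misses >= 2:
        new_state = State.MISSING_REPEATEDLY
    else:
        new_state = State.MISSING_ONCE
    # Only entering stale or archived is worth surfacing as an event.
    changed = new_state is not state
    event = new_state.value if changed and new_state in (State.STALE, State.ARCHIVED) else None
    return new_state, misses, event
```

The returned event string is what would feed alerts, which is how a repost ends up distinguishable from a genuinely new listing.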
Address ambiguity is the hardest product problem
Dublin addresses are messy from a search perspective.
Users think in areas, commutes, postal districts, landmarks, and transport lines. Listings use a mix of all of those.
Examples:
- "Dublin 6"
- "Rathmines"
- "near Ranelagh Luas"
- "city centre"
- "Docklands"
- "Grand Canal"
- "Dublin 2"
These overlap but are not interchangeable.
I split location handling into three layers:
- Raw location text from the source.
- Inferred structured labels: area, postal district, locality.
- Geographic confidence: exact, approximate, area-level, unknown.
The confidence field is important. If a listing only says "Dublin city centre", the system should not pretend it has precise coordinates. It can still be useful, but the UI and ranking need to know the location is approximate.
This also affects natural-language search. If a user says "near the DART", that should not be solved by an LLM inventing areas. It should resolve through a deterministic lookup table of stations, corridors, and distance bands.
LLMs are useful for translating messy user intent into structured constraints. They are not a good source of geographic truth.
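To make the three layers concrete, here is a simplified sketch of deterministic location inference. The `KNOWN_AREAS` set is a tiny hypothetical stand-in for a curated lookup table, and the rules for choosing a confidence level are my own simplification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny lookup; a real table would be curated.
KNOWN_AREAS = {"rathmines", "ranelagh", "kilmainham", "docklands",
               "grand canal", "city centre"}

_DISTRICT = re.compile(r"\bdublin\s+(\d{1,2}w?)\b", re.IGNORECASE)
_NEAR = re.compile(r"\b(near|beside|close to)\b", re.IGNORECASE)


@dataclass(frozen=True)
class Location:
    raw_text: str                     # layer 1: untouched source text
    area: Optional[str]               # layer 2: inferred labels
    postal_district: Optional[str]
    confidence: str                   # layer 3: exact/approximate/area_level/unknown


def infer_location(raw_text: str, has_coordinates: bool = False) -> Location:
    lowered = raw_text.lower()
    match = _DISTRICT.search(raw_text)
    district = f"Dublin {match.group(1).upper()}" if match else None
    area = None
    for name in sorted(KNOWN_AREAS):  # sorted for deterministic results
        if re.search(rf"\b{re.escape(name)}\b", lowered):
            area = name.title()
            break
    if has_coordinates:
        confidence = "exact"
    elif area is None and district is None:
        confidence = "unknown"        # nothing resolved: do not guess
    elif _NEAR.search(raw_text):
        confidence = "approximate"    # relative to a landmark or stop
    else:
        confidence = "area_level"
    return Location(raw_text, area, district, confidence)
```

Note that "near Heuston" comes back as `unknown` with this table, because the landmark does not resolve. That is the intended behaviour: an unresolved phrase stays unresolved instead of being guessed at.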
Price normalization is not just parsing euros
Most listings quote a monthly rent, but the raw text still needs care.
Common problems:
- commas and periods in different places
- "per month" omitted
- bills included or excluded
- sharing listings mixed with whole-property listings
- parking or utility fees mentioned in description
- price changes on the same URL
For HomeScout, I normalize rent to monthly EUR and store the raw price string separately. If a price changes, that is an event, not just a field update.
I also avoid over-normalizing things I cannot prove. If a description says "bills included", that becomes a flag with source evidence. If it only implies bills might be included, it stays unknown.
This is where a lot of AI products quietly go wrong: they convert uncertainty into false certainty because clean fields are easier to rank.
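A small sketch of what "normalize but keep the raw string" can look like. The regular expressions and the `ParsedPrice` shape are assumptions for illustration; in particular, this version refuses to convert a price that is explicitly weekly and leaves it unknown, which is a conservative choice of mine and not necessarily what HomeScout does.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Digits with optional thousands separators, then optional cents.
_AMOUNT = re.compile(r"(\d{1,3}(?:[.,]\d{3})+|\d+)(?:[.,]\d{1,2})?(?!\d)")
_WEEKLY = re.compile(r"\b(per\s+week|weekly|pw)\b")
_MONTHLY = re.compile(r"\b(per\s+month|monthly|pcm|pm)\b")
_BILLS_NO = re.compile(
    r"\bbills\s+(?:are\s+)?(?:not\s+included|excluded|extra)\b|\bexcluding\s+bills\b")
_BILLS_YES = re.compile(r"\bbills\s+(?:are\s+)?included\b|\bincluding\s+bills\b")


@dataclass(frozen=True)
class ParsedPrice:
    raw: str                      # always kept, never overwritten
    monthly_eur: Optional[int]    # None when the amount cannot be proven
    period: str                   # "month", "assumed_month", "week", "unknown"


def parse_price(raw: str) -> ParsedPrice:
    lowered = raw.lower()
    match = _AMOUNT.search(lowered)
    if match is None:
        return ParsedPrice(raw, None, "unknown")   # e.g. "contact agent"
    if _WEEKLY.search(lowered):
        return ParsedPrice(raw, None, "week")      # not silently treated as monthly
    amount = int(re.sub(r"[.,]", "", match.group(1)))  # whole euros only
    period = "month" if _MONTHLY.search(lowered) else "assumed_month"
    return ParsedPrice(raw, amount, period)


def bills_flag(description: str) -> Tuple[Optional[bool], Optional[str]]:
    """Return (flag, evidence). None means unknown, not False."""
    lowered = description.lower()
    match = _BILLS_NO.search(lowered)
    if match:
        return False, match.group(0)
    match = _BILLS_YES.search(lowered)
    if match:
        return True, match.group(0)
    return None, None
```

The `assumed_month` label matters: a price with "per month" omitted is still usable, but the record says the period was inferred.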
Provenance matters more than it sounds
Every normalized field should know where it came from.
For example:
beds = 2
  source = parsed_title
  confidence = high

area = Rathmines
  source = address_text
  confidence = medium

pet_friendly = unknown
  source = no_positive_evidence
  confidence = low
That makes the AI explanation layer much safer.
Instead of saying:
This listing is pet friendly.
it can say:
I did not find a pet policy in the listing text.
That difference matters because users act on these explanations. If the system cannot distinguish "false" from "unknown", it will mislead people.
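One way to carry provenance through the code is a small wrapper type in which `None` always means unknown. The `Field` class and `explain_pet_policy` function below are hypothetical names for a sketch; the source and confidence labels are taken from the example above.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Field(Generic[T]):
    value: Optional[T]   # None means unknown, never "false"
    source: str          # e.g. "parsed_title", "address_text"
    confidence: str      # "high", "medium", "low"


def explain_pet_policy(pet_friendly: Field) -> str:
    """Turn a tri-state field into a sentence that does not overclaim."""
    if pet_friendly.value is None:
        return "I did not find a pet policy in the listing text."
    if pet_friendly.value:
        return f"The listing says pets are allowed (source: {pet_friendly.source})."
    return f"The listing says pets are not allowed (source: {pet_friendly.source})."


beds = Field(value=2, source="parsed_title", confidence="high")
pets = Field(value=None, source="no_positive_evidence", confidence="low")
print(explain_pet_policy(pets))
```

Because the explanation is generated from the field and its source, the AI layer has nothing to invent: it can only restate what the pipeline recorded.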
The AI layer should be downstream
The architecture that has worked best is:
collect observations
normalize fields
dedupe candidates
infer location with confidence
track listing state
build user-specific hard filters
rank candidates
use AI to explain, draft, and summarize
The LLM is not the database. It is not the source of truth. It sits after the deterministic pipeline.
For example, if a renter says:
"2 bed near the Luas under 2200, ideally not too far from Grand Canal"
the system should parse that into:
{
  "beds_min": 2,
  "max_price_eur": 2200,
  "transport": ["luas"],
  "soft_area_preference": ["Grand Canal"]
}
Then normal database queries and geographic lookups do the heavy lifting. The AI can help explain why a listing matched, draft an inquiry email, or summarize tradeoffs, but it should not hallucinate inventory.
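For illustration, here is how those parsed constraints might be applied with plain code and no model in the loop. The listing dictionaries and their keys (`beds`, `price_eur`, `transport`, `area`) are hypothetical, and two behaviours are assumptions: unknown values fail a hard filter, and a listing passes the transport filter if it is near at least one requested mode.

```python
from typing import Any, Dict, List

Listing = Dict[str, Any]


def passes_hard_filters(listing: Listing, constraints: Dict[str, Any]) -> bool:
    beds_min = constraints.get("beds_min")
    if beds_min is not None:
        beds = listing.get("beds")
        if beds is None or beds < beds_min:       # unknown is not a pass
            return False
    max_price = constraints.get("max_price_eur")
    if max_price is not None:
        price = listing.get("price_eur")
        if price is None or price > max_price:
            return False
    wanted = set(constraints.get("transport", []))
    if wanted and not wanted & set(listing.get("transport", [])):
        return False
    return True


def search(listings: List[Listing], constraints: Dict[str, Any]) -> List[Listing]:
    """Hard filters decide inclusion; soft preferences only reorder."""
    preferred = {a.lower() for a in constraints.get("soft_area_preference", [])}
    matches = [l for l in listings if passes_hard_filters(l, constraints)]
    # False sorts before True, so preferred areas come first; the sort is stable.
    return sorted(matches, key=lambda l: (l.get("area") or "").lower() not in preferred)


constraints = {
    "beds_min": 2,
    "max_price_eur": 2200,
    "transport": ["luas"],
    "soft_area_preference": ["Grand Canal"],
}
```

The soft preference never removes a listing. It only changes the order, which keeps "ideally" from quietly turning into a hard requirement.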
What I would do earlier next time
If I were starting again, I would invest earlier in three things.
First, raw observation storage. Keep the raw payloads or at least stable hashes and extracted raw fields. You will need them when a normalized decision looks wrong.
Second, confidence scores. Not ML confidence in the fancy sense, just explicit quality labels for inferred fields. Exact address is not the same as inferred area. Unknown is not false.
Third, event history. Renters care about changes: new listing, price drop, repost, stale, reactivated. A snapshot table alone loses that.
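As a sketch of the third point, price events can be derived by diffing two crawls instead of overwriting a snapshot. The `ListingEvent` shape and the event names `price_drop` and `price_increase` are assumptions for the example; only new listings and price moves are covered here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ListingEvent:
    listing_id: str
    kind: str                         # "new_listing", "price_drop", "price_increase"
    at: datetime
    old_price_eur: Optional[int] = None
    new_price_eur: Optional[int] = None


def diff_prices(previous: Dict[str, int], current: Dict[str, int],
                at: datetime) -> List[ListingEvent]:
    """Compare two {listing_id: monthly_eur} snapshots and emit events."""
    events = []
    for listing_id, price in current.items():
        old = previous.get(listing_id)
        if old is None:
            events.append(ListingEvent(listing_id, "new_listing", at, None, price))
        elif price < old:
            events.append(ListingEvent(listing_id, "price_drop", at, old, price))
        elif price > old:
            events.append(ListingEvent(listing_id, "price_increase", at, old, price))
    return events
```

Appending these events to a log, and never updating them in place, is what preserves the history that a single snapshot table loses.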
The main lesson is that AI is only useful if the underlying data model is honest about uncertainty.
For rental search, the hard technical problem is not making a chatbot that talks about apartments. It is building a data pipeline that knows what it knows, knows what it guessed, and does not blur the two.
That is the part I underestimated.
I am building HomeScout for Dublin renters: https://homescout.io