When I started building Ghost Workforce — a Dubai real estate intelligence dashboard — I assumed scraping the Dubai Land Department (DLD) was going to be the hard part.
It wasn't. The DLD publishes most of what you need: transaction-level data going back 15 years, M-codes (a building registry), Ejari rent registrations, RERA escrow status. The data is public. APIs exist. CSVs are downloadable.
The hard part was bridging it. Specifically — bridging a transaction's listed building name to the actual physical building.
This post is about that problem.
The shape of the data
Each DLD transaction record looks roughly like this:
{
  "transaction_id": "...",
  "transaction_date": "2018-04-12",
  "area_name": "Marina",
  "building_name_en": "Marina Pearl",
  "developer_name": "Select Group",
  "price_aed": 2150000,
  "size_sqft": 1240,
  "rooms": 2
}
After 15 years of records, you have:
- 533K+ transaction rows
- 4,200+ distinct building_name_en strings
- 1,387 actual physical buildings
The 4,200 is the problem. Most buildings have 2-7 spelling variants in the records.
Examples of the naming inconsistency
A single Marina tower can appear in DLD records as any of:
- "Marina Pearl"
- "Pearl Marina"
- "Marina Pearl Tower"
- "MARINA PEARL TWR"
- "Al Lulu Marina" (Arabic transliteration drift)
- "Marina Pearl - 1" (phase numbering)
- "Marina Pearl 1"
- "Tower Marina Pearl"
Some of these look like obvious dedupes. Some aren't. "Al Lulu" means "the pearl" in Arabic — same building, different transliteration choices made by different DLD operators over the years.
Worse: developers rename buildings post-handover, especially after sale-leasebacks or rebrandings. A building registered in 2014 as "Damac Marina Heights" can show up in 2022 transactions as "Marina One" because the developer assigned that name to a different tranche after a refinancing.
If you don't deduplicate properly:
- Yields per building become noise
- Transaction velocity is impossible to compute
- Comparable-sales analysis is broken
What didn't work
Naive string normalization
def normalize(name):
    # Lowercase, drop common suffixes, then collapse leftover whitespace
    name = name.lower().replace("tower", "").replace("twr", "")
    return " ".join(name.split())
Catches the easy 60%. Misses everything that involves numbering, transliteration, or rebranding.
Plain Levenshtein
Too aggressive — clusters distinct buildings together when the operator drops the building number. "Marina Pearl 1" and "Marina Pearl 2" are different physical towers. Levenshtein wants to merge them.
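To make the failure concrete, here is a minimal edit-distance implementation (plain dynamic programming, no library) showing why a distance threshold merges distinct towers:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two different physical towers, one edit apart:
print(levenshtein("Marina Pearl 1", "Marina Pearl 2"))  # → 1
```

Any threshold loose enough to catch real typos ("TWR", dropped hyphens) is far looser than distance 1, so these two towers collapse into one.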
Embeddings only
I tried sentence embeddings on the names. Helps for transliteration ("Al Lulu Marina" ↔ "Marina Pearl") but produces false positives between buildings in the same tower complex (Marina Heights 1 vs Marina Heights 2).
What worked: a hybrid pipeline
The pipeline I ended up with:
1. Geo-bucketing first. Every DLD record has a transaction lat/lng (from the M-code geocode lookup). I bucket transactions into 50m-radius geo-cells. Two records can only refer to the same building if they fall in the same cell.
2. Within-cell name canonicalization. Per geo-cell, run a fuzzy match (rapidfuzz token_set_ratio) between all distinct names. Names with score > 90 cluster together.
3. Embedding tiebreaker. When the fuzzy score sits between 75 and 90, use sentence-transformers (all-MiniLM-L6-v2) similarity. Threshold > 0.85 → same building. Below → flagged for review.
4. M-code anchor when present. DLD M-codes are the gold standard. When a transaction has an M-code, that overrides everything — every transaction with that M-code points to the same building, regardless of name spelling.
5. Manual review queue. After the pipeline, ~140 ambiguous clusters remained. I reviewed those by hand against satellite imagery (lat/lng + Google Maps street view). About 4 hours of work.
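Steps 1 and 2 can be sketched in a few lines. This is a simplification, not the production code: difflib's SequenceMatcher stands in for rapidfuzz token_set_ratio, and the 0.00045° cell size is an assumed approximation of 50m at Dubai's latitude (a fixed grid, not a true radius, so buildings straddling a cell edge can split):

```python
import difflib
from collections import defaultdict

CELL_DEG = 0.00045  # roughly 50m of latitude; a deliberate simplification


def geo_cell(lat: float, lng: float) -> tuple:
    """Bucket a coordinate into a fixed grid cell."""
    return (int(lat / CELL_DEG), int(lng / CELL_DEG))


def similarity(a: str, b: str) -> float:
    """0-100 string similarity; stand-in for rapidfuzz token_set_ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def cluster_names(records, threshold=90.0):
    """records: iterable of (name, lat, lng). Returns name clusters,
    where two names can only cluster if they share a geo-cell."""
    cells = defaultdict(set)
    for name, lat, lng in records:
        cells[geo_cell(lat, lng)].add(name)

    clusters = []
    for names in cells.values():
        groups = []  # greedy single-link clustering within the cell
        for name in sorted(names):
            for g in groups:
                if any(similarity(name, m) >= threshold for m in g):
                    g.append(name)
                    break
            else:
                groups.append([name])
        clusters.extend(groups)
    return clusters
```

Geo-bucketing first is what keeps the fuzzy matcher cheap: instead of comparing 4,200 names pairwise city-wide, each cell holds only a handful of candidates.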
The Postgres schema
CREATE TABLE buildings (
    id SERIAL PRIMARY KEY,
    canonical_name TEXT NOT NULL,
    area_name TEXT,
    developer_name TEXT,
    m_code TEXT UNIQUE,
    lat NUMERIC(10, 7),
    lng NUMERIC(10, 7),
    cluster_tier TEXT -- icon_ultra, prime, mid_upper, mid_market, budget
);
CREATE TABLE building_aliases (
    id SERIAL PRIMARY KEY,
    building_id INT REFERENCES buildings(id),
    alias TEXT NOT NULL,
    source TEXT, -- 'dld_transaction', 'rera_registry', 'ejari_lease'
    created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE transactions (
    id BIGSERIAL PRIMARY KEY,
    building_id INT REFERENCES buildings(id),
    transaction_date DATE,
    transaction_type TEXT, -- e.g. 'sale', 'mortgage'; filtered on below
    price_aed NUMERIC(14,2),
    size_sqft NUMERIC(10,2),
    rooms INT,
    raw_name TEXT -- preserved for audit
);
Critical: every transaction keeps its raw_name, so aliases can be audited retroactively and clusters rebuilt when a future developer rename causes drift.
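One way that audit works in practice (a sketch against the schema above): surface raw names that have no matching alias row, which are the candidates for re-clustering.

```sql
-- Raw transaction names not yet registered as aliases of their building
SELECT t.raw_name, COUNT(*) AS txn_count
FROM transactions t
LEFT JOIN building_aliases a
    ON a.building_id = t.building_id
   AND a.alias = t.raw_name
WHERE a.id IS NULL
GROUP BY t.raw_name
ORDER BY txn_count DESC;
```

A spike in this list after a data refresh is the signal that a rename or new transliteration variant has entered the feed.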
Real yield calculation, after dedup
Once buildings are clean, computing real yield per building is straightforward:
WITH sales AS (
    SELECT building_id,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price_aed / size_sqft) AS px_per_sqft_med
    FROM transactions
    WHERE transaction_date > NOW() - INTERVAL '24 months'
      AND transaction_type = 'sale'
    GROUP BY building_id
    HAVING COUNT(*) >= 5
),
rents AS (
    SELECT building_id,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY annual_rent / size_sqft) AS rent_per_sqft_med
    FROM ejari_registrations
    WHERE registration_date > NOW() - INTERVAL '24 months'
    GROUP BY building_id
    HAVING COUNT(*) >= 10
)
SELECT
    b.canonical_name,
    s.px_per_sqft_med,
    r.rent_per_sqft_med,
    ROUND((r.rent_per_sqft_med / s.px_per_sqft_med * 100)::numeric, 2) AS gross_yield_pct
FROM buildings b
JOIN sales s ON s.building_id = b.id
JOIN rents r ON r.building_id = b.id;
The HAVING COUNT(*) >= 5 (sales) and >= 10 (rents) thresholds are what produce the MED/HIGH confidence labels in the product. Fewer than that and the yield is too noisy to publish.
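The label derivation itself can be sketched as a CASE over the per-building sample count. The exact tier boundaries here are illustrative, not the product's real cutoffs, beyond the published minimum of 5:

```sql
-- Hypothetical confidence labeling from sale-sample counts
SELECT building_id,
       COUNT(*) AS sale_count,
       CASE
           WHEN COUNT(*) >= 10 THEN 'HIGH'
           WHEN COUNT(*) >= 5  THEN 'MED'
           ELSE 'LOW'  -- too noisy; not published
       END AS yield_confidence
FROM transactions
WHERE transaction_type = 'sale'
GROUP BY building_id;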
The lesson
Public data feels like a shortcut. It is — but only after you bridge it. The bridge is the moat.
For Ghost Workforce, the dedup pipeline + M-code anchor + manual review queue is what separates "I scraped DLD" from "I have an analyst-grade Dubai building dataset." Anyone can pull the CSV. The 4 hours of satellite-imagery sanity-checks are what make the yield numbers usable.
If you're working on emerging-market real estate, government data sources usually have this exact shape: messy public, clean private. The technical edge is inverting that.
Ghost Workforce is live at app.ghostworkforce.com — DLD-bridged data on 1,387 Dubai buildings, free tier.