When I started building Ghost Workforce — a Dubai real estate intelligence dashboard — I assumed scraping the Dubai Land Department (DLD) was going to be the hard part.
It wasn't. The DLD publishes most of what you need: transaction-level data going back 15 years, M-codes (a building registry), Ejari rent registrations, RERA escrow status. The data is public. APIs exist. CSVs are downloadable.
The hard part was bridging it. Specifically — bridging a transaction's listed building name to the actual physical building.
This post is about that problem.
The shape of the data
Each DLD transaction record looks roughly like this:
{
  "transaction_id": "...",
  "transaction_date": "2018-04-12",
  "area_name": "Marina",
  "building_name_en": "Marina Pearl",
  "developer_name": "Select Group",
  "price_aed": 2150000,
  "size_sqft": 1240,
  "rooms": 2
}
After 15 years of records, you have:
- 533K+ transaction rows
- 4,200+ distinct building_name_en strings
- 1,387 actual physical buildings
The 4,200 is the problem. Most buildings have 2-7 spelling variants in the records.
Examples of the naming inconsistency
A single Marina tower can appear in DLD records as any of:
- "Marina Pearl"
- "Pearl Marina"
- "Marina Pearl Tower"
- "MARINA PEARL TWR"
- "Al Lulu Marina" (Arabic transliteration drift)
- "Marina Pearl - 1" (phase numbering)
- "Marina Pearl 1"
- "Tower Marina Pearl"
Some of these look like obvious dedupes. Some aren't. "Al Lulu" means "the pearl" in Arabic — same building, different transliteration choices made by different DLD operators over the years.
Worse: developers rename buildings post-handover, especially after sale-leasebacks or rebrandings. A building registered in 2014 as "Damac Marina Heights" can show up in 2022 transactions as "Marina One" because the developer assigned that name to a different tranche after a refinancing.
If you don't deduplicate properly:
- Yields per building become noise
- Transaction velocity is impossible to compute
- Comparable-sales analysis is broken
What didn't work
Naive string normalization
def normalize(name):
    # Lowercase, drop common suffixes, then collapse leftover whitespace
    name = name.lower().replace("tower", "").replace("twr", "")
    return " ".join(name.split())
Catches the easy 60%. Misses everything that involves numbering, transliteration, or rebranding.
Plain Levenshtein
Too aggressive — clusters distinct buildings together when the operator drops the building number. "Marina Pearl 1" and "Marina Pearl 2" are different physical towers. Levenshtein wants to merge them.
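To make the failure concrete, here is a minimal edit-distance implementation (plain dynamic programming, no library) showing why a distance threshold merges distinct towers:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two different physical towers, one edit apart:
print(levenshtein("Marina Pearl 1", "Marina Pearl 2"))  # → 1
```

Any threshold loose enough to catch real typos ("TWR", dropped hyphens) is far looser than distance 1, so these two towers collapse into one.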
Embeddings only
I tried sentence embeddings on the names. Helps for transliteration ("Al Lulu Marina" ↔ "Marina Pearl") but produces false positives between buildings in the same tower complex (Marina Heights 1 vs Marina Heights 2).
What worked: a hybrid pipeline
The pipeline I ended up with:
1. Geo-bucketing first. Every DLD record has a transaction lat/lng (from the M-code geocode lookup). I bucket transactions into 50m-radius geo-cells. Two records can only refer to the same building if they fall in the same cell.
2. Within-cell name canonicalization. Per geo-cell, run a fuzzy match (rapidfuzz token_set_ratio) between all distinct names. Names with score > 90 cluster together.
3. Embedding tiebreaker. When the fuzzy score sits between 75 and 90, use sentence-transformers (all-MiniLM-L6-v2) similarity. Threshold > 0.85 → same building. Below → flagged for review.
4. M-code anchor when present. DLD M-codes are the gold standard. When a transaction has an M-code, that overrides everything — every transaction with that M-code points to the same building, regardless of name spelling.
5. Manual review queue. After the pipeline, ~140 ambiguous clusters remained. I reviewed those by hand against satellite imagery (lat/lng + Google Maps street view). About 4 hours of work.
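Steps 1 and 2 can be sketched in a few lines. This is a simplification, not the production code: difflib's SequenceMatcher stands in for rapidfuzz token_set_ratio, and the 0.00045° cell size is an assumed approximation of 50m at Dubai's latitude (a fixed grid, not a true radius, so buildings straddling a cell edge can split):

```python
import difflib
from collections import defaultdict

CELL_DEG = 0.00045  # roughly 50m of latitude; a deliberate simplification


def geo_cell(lat: float, lng: float) -> tuple:
    """Bucket a coordinate into a fixed grid cell."""
    return (int(lat / CELL_DEG), int(lng / CELL_DEG))


def similarity(a: str, b: str) -> float:
    """0-100 string similarity; stand-in for rapidfuzz token_set_ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def cluster_names(records, threshold=90.0):
    """records: iterable of (name, lat, lng). Returns name clusters,
    where two names can only cluster if they share a geo-cell."""
    cells = defaultdict(set)
    for name, lat, lng in records:
        cells[geo_cell(lat, lng)].add(name)

    clusters = []
    for names in cells.values():
        groups = []  # greedy single-link clustering within the cell
        for name in sorted(names):
            for g in groups:
                if any(similarity(name, m) >= threshold for m in g):
                    g.append(name)
                    break
            else:
                groups.append([name])
        clusters.extend(groups)
    return clusters
```

Geo-bucketing first is what keeps the fuzzy matcher cheap: instead of comparing 4,200 names pairwise city-wide, each cell holds only a handful of candidates.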
The Postgres schema
CREATE TABLE buildings (
    id SERIAL PRIMARY KEY,
    canonical_name TEXT NOT NULL,
    area_name TEXT,
    developer_name TEXT,
    m_code TEXT UNIQUE,
    lat NUMERIC(10, 7),
    lng NUMERIC(10, 7),
    cluster_tier TEXT -- icon_ultra, prime, mid_upper, mid_market, budget
);
CREATE TABLE building_aliases (
    id SERIAL PRIMARY KEY,
    building_id INT REFERENCES buildings(id),
    alias TEXT NOT NULL,
    source TEXT, -- 'dld_transaction', 'rera_registry', 'ejari_lease'
    created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE transactions (
    id BIGSERIAL PRIMARY KEY,
    building_id INT REFERENCES buildings(id),
    transaction_date DATE,
    transaction_type TEXT, -- e.g. 'sale', 'mortgage'; filtered on below
    price_aed NUMERIC(14,2),
    size_sqft NUMERIC(10,2),
    rooms INT,
    raw_name TEXT -- preserved for audit
);
Critical: every transaction keeps its raw_name, so aliases can be audited retroactively and clusters rebuilt when a future developer rename causes drift.
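One way that audit works in practice (a sketch against the schema above): surface raw names that have no matching alias row, which are the candidates for re-clustering.

```sql
-- Raw transaction names not yet registered as aliases of their building
SELECT t.raw_name, COUNT(*) AS txn_count
FROM transactions t
LEFT JOIN building_aliases a
    ON a.building_id = t.building_id
   AND a.alias = t.raw_name
WHERE a.id IS NULL
GROUP BY t.raw_name
ORDER BY txn_count DESC;
```

A spike in this list after a data refresh is the signal that a rename or new transliteration variant has entered the feed.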
Real yield calculation, after dedup
Once buildings are clean, computing real yield per building is straightforward:
WITH sales AS (
    SELECT building_id,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price_aed / size_sqft) AS px_per_sqft_med
    FROM transactions
    WHERE transaction_date > NOW() - INTERVAL '24 months'
      AND transaction_type = 'sale'
    GROUP BY building_id
    HAVING COUNT(*) >= 5
),
rents AS (
    SELECT building_id,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY annual_rent / size_sqft) AS rent_per_sqft_med
    FROM ejari_registrations
    WHERE registration_date > NOW() - INTERVAL '24 months'
    GROUP BY building_id
    HAVING COUNT(*) >= 10
)
SELECT
    b.canonical_name,
    s.px_per_sqft_med,
    r.rent_per_sqft_med,
    ROUND((r.rent_per_sqft_med / s.px_per_sqft_med * 100)::numeric, 2) AS gross_yield_pct
FROM buildings b
JOIN sales s ON s.building_id = b.id
JOIN rents r ON r.building_id = b.id;
The HAVING COUNT(*) >= 5 (sales) and >= 10 (rents) thresholds are what produce the MED/HIGH confidence labels in the product. Fewer than that and the yield is too noisy to publish.
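The label derivation itself can be sketched as a CASE over the per-building sample count. The exact tier boundaries here are illustrative, not the product's real cutoffs, beyond the published minimum of 5:

```sql
-- Hypothetical confidence labeling from sale-sample counts
SELECT building_id,
       COUNT(*) AS sale_count,
       CASE
           WHEN COUNT(*) >= 10 THEN 'HIGH'
           WHEN COUNT(*) >= 5  THEN 'MED'
           ELSE 'LOW'  -- too noisy; not published
       END AS yield_confidence
FROM transactions
WHERE transaction_type = 'sale'
GROUP BY building_id;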
The lesson
Public data feels like a shortcut. It is — but only after you bridge it. The bridge is the moat.
For Ghost Workforce, the dedup pipeline + M-code anchor + manual review queue is what separates "I scraped DLD" from "I have an analyst-grade Dubai building dataset." Anyone can pull the CSV. The 4 hours of satellite-imagery sanity-checks are what make the yield numbers usable.
If you're working on emerging-market real estate, government data sources usually have this exact shape: messy public, clean private. The technical edge is inverting that.
Ghost Workforce is live at app.ghostworkforce.com — DLD-bridged data on 1,387 Dubai buildings, free tier.