Forrest Miller

Posted on Jun 1 • Originally published at validatrip.com

Six layers to canonicalize 'FiDi', 'Wall Street area', and 'Lower Manhattan' as one neighborhood

#ai #showdev #webdev #javascript

The problem

A user pastes their travel list into our travel-paste validator. They got recs from a friend, a blog, an AI itinerary, and a Reddit thread. The list mentions Lower Manhattan four ways:

"FiDi"
"Wall Street area"
"Financial District NYC"
"Lower Manhattan financial district"

Our Neighborhoods tab groups by area so users can plan a day at a time. Without canonicalization, those four labels become four area cards with one place each. The user can't tell that they're looking at one walkable district. The product looks broken.

This is a small example. The full surface across 145 cities is thousands of variants per city — formal vs informal, English vs local, Wikipedia title vs administrative subdivision, polygon-level granularity vs guidebook vocabulary. Solving it in one pass with a single API or a single LLM call is the obvious wrong answer.

We solved it in six layers. Each layer covers a class of variants the others can't see.

Layer 1: spatial reverse-geocode

When a validated item has (lat, lng), the polygon-based name overrides whatever text label the upstream Google Places match returned.

// lib/providers/places/spatial-reverse.ts
const r = await fetchAddressDescriptors(lat, lng, GOOGLE_GEOCODING_API_KEY);
const areas = r?.addressDescriptors?.areas ?? [];

// areas are returned smallest-to-largest with WITHIN / NEAR / OUTSKIRTS
// take the smallest WITHIN — the most specific containing area
return areas.find((a) => a.containment === 'WITHIN')?.displayName?.text ?? null;

If Google's call fails, we fall through to Mapbox v6 reverse geocoding with types=neighborhood,locality,place. The result is cached per rounded coordinate (4 decimal places, ~10 m precision), so adjacent items in the same neighborhood share a single billable call.
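
The rounded-coordinate cache could look something like the sketch below. It is an illustration, not the production code: `coordCacheKey`, `cachedReverse`, and the in-memory `Map` are hypothetical names, and the `lookup` argument stands in for the Google-then-Mapbox call chain.

```typescript
// Hypothetical sketch: cache reverse-geocode results per rounded coordinate.
type ReverseLookup = (lat: number, lng: number) => Promise<string | null>;

// 4 decimal places is roughly 10 m, so nearby items share a key.
function coordCacheKey(lat: number, lng: number): string {
  return `${lat.toFixed(4)},${lng.toFixed(4)}`;
}

const reverseCache = new Map<string, Promise<string | null>>();

// Storing the promise (not the resolved value) also dedupes concurrent
// lookups for the same key, so only the first caller triggers a billable call.
function cachedReverse(
  lat: number,
  lng: number,
  lookup: ReverseLookup,
): Promise<string | null> {
  const key = coordCacheKey(lat, lng);
  let hit = reverseCache.get(key);
  if (!hit) {
    hit = lookup(lat, lng);
    reverseCache.set(key, hit);
  }
  return hit;
}
```

One design caveat with this shape: a rejected promise stays cached, so a real version would probably evict failures before the next attempt.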

This collapses the trivial split: two items that physically sit inside Tribeca will both come back labeled "Tribeca", even if one was pasted as "near the Holland Tunnel exit" and the other as "by the Mysterious Bookshop."

Layer 2: bulk alias mining via Gemini

The spatial pass handles items with coordinates. About 2.5% of real-world pastes have no coordinates because the title alone failed to match any place ("our friend's apartment in BoCoCa", "the Marais restaurant Maya recommended"). For those, we lean on a per-city alias dictionary.

We mined the dictionary once with a single Gemini 2.5 Flash call per city:

// scripts/places/mine-aliases-gemini.mjs
const prompt = `For each canonical neighborhood in ${city}, list the
informal abbreviations, prefix-stripped forms, article variations,
multilingual transliterations, and common misspellings real travelers use.

Return strict JSON: { "<canonical>": ["alias1", "alias2", ...] }

Canonical list:
${canonicalEntries.join("\n")}`;

We tried this against the Wikidata Action API first. The anonymous wbsearchentities + wbgetentities flow rate-limits at roughly 1 request per second, with 11-second Retry-After headers under load. At ~75K canonical entries with ~4 fetches each, that is about 300,000 requests, or roughly 3.5 days of wall-clock time at 1 request per second, before any Retry-After backoff.

The Gemini call returns aliases for an entire city in one shot. We then apply two hallucination guards to the output: a returned alias is dropped if it matches another canonical entry in the same city (those are distinct neighborhoods), and an alias containing "X and Y" is dropped when the canonical name doesn't (this catches "Camden Town and Regent's Park" pseudo-combinations).
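
Those two guards can be sketched as a post-filter over the parsed JSON. The function name `filterAliases`, the `AliasMap` type, and the case-insensitive comparison are assumptions made for illustration; the article does not show the real implementation.

```typescript
// Hypothetical sketch of the two guards applied to mined aliases.
type AliasMap = Record<string, string[]>;

// Assumption: comparisons are case-insensitive and ignore outer whitespace.
const normalizeName = (s: string): string => s.trim().toLowerCase();

// True when the name contains a standalone "and" (the "X and Y" shape).
const hasAnd = (s: string): boolean => /\band\b/i.test(s);

function filterAliases(mined: AliasMap, canonicalEntries: string[]): AliasMap {
  const canonicalSet = new Set(canonicalEntries.map(normalizeName));
  const out: AliasMap = {};

  for (const [canonical, aliases] of Object.entries(mined)) {
    out[canonical] = aliases.filter((alias) => {
      // Guard 1: an alias equal to a canonical entry names a distinct neighborhood.
      if (canonicalSet.has(normalizeName(alias))) return false;
      // Guard 2: "X and Y" aliases are dropped unless the canonical has "and" too.
      if (hasAnd(alias) && !hasAnd(canonical)) return false;
      return true;
    });
  }
  return out;
}
```

With a canonical list of "Camden Town" and "Regent's Park", a mined list of "Camden", "Camden Town and Regent's Park", and "Regent's Park" under "Camden Town" would keep only "Camden".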

Result: FiDi → Financial District. DUMBO → Down Under the Manhattan Bridge Overpass. Le Marais → Marais. La Roma → Roma. 渋谷 → Shibuya. The 4-spelling NYC FiDi example collapses at this layer.

Layer 3: Overture Maps polygon point-in-polygon

Spatial reverse-geocode handles coords. The alias dictionary handles labels. Neither handles the case where a Google Places result returns the right polygon but at the wrong level of administrative granularity.

NYC is the canonical example. Overture's NYC neighborhood polygons are at Community Board granularity. CB 5 covers Midtown East, Midtown West, AND Murray Hill — three guidebook neighborhoods folded into one admin zone. If we wrote the polygon's raw name to the item, the user would see a "Manhattan Community Board 5" area card instead of the three distinct neighborhoods they care about.

We pre-bake the polygons once per release with DuckDB against Overture's S3:

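duckdb -c "
  INSTALL spatial; LOAD spatial;  -- ST_* functions and the GDAL writer
  INSTALL httpfs;  LOAD httpfs;   -- s3:// reads
  SET s3_region = 'us-west-2';
  COPY (
    SELECT id, names.primary AS name, subtype,
           -- bbox flattened to plain columns: the GDAL writer may reject STRUCT values
           bbox.xmin AS xmin, bbox.ymin AS ymin, bbox.xmax AS xmax, bbox.ymax AS ymax,
           geometry
    FROM read_parquet('s3://overturemaps-us-west-2/release/2026-05-20.0/theme=divisions/type=division_area/*')
    WHERE subtype IN ('neighborhood','microhood','macrohood','borough','localadmin')
      AND ST_Intersects(geometry, ST_GeomFromText('${cityBbox}'))
  ) TO 'polygons-${slug}.geo.json' WITH (FORMAT GDAL, DRIVER 'GeoJSON');
"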

At validate time, we lazy-load the city's .geo.json, build a flatbush R-tree from pre-computed bboxes, run @turf/boolean-point-in-polygon to find every enclosing polygon, and pick the most specific one by subtype rank:

const SUBTYPE_RANK: Record<string, number> = {
  microhood: 0,
  neighborhood: 1,
  macrohood: 2,
  borough: 3,
  localadmin: 4,
};

Per-city "too-coarse" skip patterns prevent the granularity downgrade. NYC skips Community Board names so the LLM resolver (Layer 5) keeps its more specific guess. London skips borough names like "City of Westminster" because that single borough covers Mayfair / Soho / Marylebone / Covent Garden / Westminster / St James's, which are six guidebook neighborhoods. Mexico City skips the 16 alcaldía names.

The skip list is per-name, not per-subtype, because Overture occasionally labels admin areas at the neighborhood subtype.
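
The selection step can be sketched on its own, leaving out the R-tree and turf calls. In this sketch `pickMostSpecific`, the `EnclosingArea` shape, and the example skip patterns are hypothetical; it assumes the input already contains only polygons that enclose the point.

```typescript
// Hypothetical sketch: choose among polygons that already passed the
// point-in-polygon test (the flatbush and turf steps are omitted here).
interface EnclosingArea {
  name: string;
  subtype: string;
}

const SUBTYPE_RANK: Record<string, number> = {
  microhood: 0,
  neighborhood: 1,
  macrohood: 2,
  borough: 3,
  localadmin: 4,
};

// Unknown subtypes sort last and are filtered out below.
function rankOf(subtype: string): number {
  const rank: unknown = SUBTYPE_RANK[subtype];
  return typeof rank === "number" ? rank : Infinity;
}

function pickMostSpecific(enclosing: EnclosingArea[], tooCoarse: RegExp[]): string | null {
  const candidates = enclosing
    // Skip by name, not subtype: admin areas can carry the neighborhood subtype.
    .filter((area) => !tooCoarse.some((pattern) => pattern.test(area.name)))
    .filter((area) => Number.isFinite(rankOf(area.subtype)))
    .sort((a, b) => rankOf(a.subtype) - rankOf(b.subtype));
  return candidates[0]?.name ?? null;
}
```

Returning `null` when every enclosing polygon is skipped leaves the item's existing label alone, which is what lets a later layer keep its more specific guess.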

Layer 4: POI gazetteer (Wikidata-derived)

Some titles are famous enough that the title itself tells you the neighborhood. "Whitney Museum of American Art" implies Meatpacking District. "Tate Modern" implies South Bank. We mined a per-city (title → neighborhood) gazetteer from Wikidata SPARQL:

SELECT ?poi ?poiLabel ?neighborhood ?neighborhoodLabel WHERE {
  VALUES ?landmarkType { wd:Q33506 wd:Q1244442 wd:Q12876 wd:Q35112 wd:Q860861 }
  ?poi wdt:P31/wdt:P279* ?landmarkType .
  ?poi wdt:P131 ?neighborhood .
  { ?neighborhood wdt:P131 wd:${cityQid} }
    UNION
  { ?neighborhood wdt:P131 ?parent . ?parent wdt:P131 wd:${cityQid} }
  FILTER NOT EXISTS { ?poi wdt:P576 ?dissolved }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

That gives us roughly 8,600 entries across 15 cities. Lookup is foldKey-exact only (we tried substring matching once — 14 of 23 hits on a live Paris audit were false positives, so exact-only is the conservative choice). The gazetteer file is module-cached after first load.
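
An exact-match lookup of this kind might look like the sketch below. The article does not show `foldKey`, so the version here (Unicode compatibility decomposition, accent stripping, lowercasing, whitespace collapsing) is an assumption, and `buildGazetteer` and `lookupPoi` are hypothetical names.

```typescript
// Hypothetical foldKey: the real implementation is not shown in the article.
function foldKey(s: string): string {
  return s
    .normalize("NFKD") // split accents off base letters; also maps "ª" to "a"
    .replace(/[\u0300-\u036f]/g, "") // drop the combining accent marks
    .toLowerCase()
    .replace(/\s+/g, " ") // collapse whitespace runs
    .trim();
}

// Build a folded-title -> neighborhood map once, then reuse it.
function buildGazetteer(entries: Array<[string, string]>): Map<string, string> {
  const gazetteer = new Map<string, string>();
  for (const [title, neighborhood] of entries) {
    gazetteer.set(foldKey(title), neighborhood);
  }
  return gazetteer;
}

// Exact match on the folded key only; no substring or prefix matching.
function lookupPoi(gazetteer: Map<string, string>, title: string): string | null {
  return gazetteer.get(foldKey(title)) ?? null;
}

const demoGazetteer = buildGazetteer([
  ["Whitney Museum of American Art", "Meatpacking District"],
  ["Tate Modern", "South Bank"],
]);
```

With this shape, `lookupPoi(demoGazetteer, "  tate MODERN ")` resolves to South Bank, while a made-up title such as "Tate Modern gift shop queue" returns `null` instead of producing the kind of substring hit that caused the false positives.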

This covers the long tail that Layer 3 polygons can't reach: a paste that says "Whitney" with no coordinates and no neighborhood label resolves to Meatpacking District, provided the gazetteer holds that short form as its own entry, since lookup is exact-match only.

Layer 5: per-trip LLM canonical resolver

After the four deterministic layers, some labels still don't match. "Historic center of Mexico City", "Colonia Centro", "Centro Histórico", and "Centro" all describe the same neighborhood but the alias dictionary only had three of them.

The fifth layer runs ONE gpt-4o-mini call per trip that maps every distinct raw label to the canonical guidebook name:

// lib/validation/resolve-canonical-neighborhoods.ts
const prompt = `Map each raw label to the closest canonical neighborhood in ${city}.

If a label doesn't clearly match any canonical entry, return it unchanged.
Don't guess across cities. Don't invent new canonical names.

Canonical: ${canonicalNames.join(", ")}
Raw labels in this trip: ${rawNeighborhoods.join(", ")}

Return JSON: { "<raw>": "<canonical or raw>", ... }`;

Two defensive filters at the boundary: the response is filtered against Set(rawNeighborhoods) AND Set(canonicalNames). Any hallucinated key that isn't in the input set is dropped silently. Any value that isn't in the canonical set is dropped silently. The LLM can never hand the database a name we didn't ask about.
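
The two boundary filters can be sketched as a single pass over the parsed response. `sanitizeMapping` is a hypothetical name, and treating non-string values as invalid is an added assumption.

```typescript
// Hypothetical sketch of the boundary filters on the LLM response.
function sanitizeMapping(
  response: Record<string, unknown>,
  rawNeighborhoods: string[],
  canonicalNames: string[],
): Record<string, string> {
  const rawSet = new Set(rawNeighborhoods);
  const canonicalSet = new Set(canonicalNames);
  const safe: Record<string, string> = {};

  for (const [raw, value] of Object.entries(response)) {
    // Filter 1: drop keys we never asked about.
    if (!rawSet.has(raw)) continue;
    // Filter 2: drop values that are not canonical names (assumption: non-strings too).
    if (typeof value !== "string" || !canonicalSet.has(value)) continue;
    safe[raw] = value;
  }
  return safe;
}
```

A label the model returned unchanged is not in the canonical set, so it simply produces no mapping and the item keeps whatever name the earlier layers gave it.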

One call per trip, not per item. A 30-item trip costs the same as a 3-item trip. Latency is amortized into the existing validate step, which already takes 8 seconds.

If OPENAI_API_KEY is missing or the call fails, the layer is a no-op. The four deterministic layers above are sufficient for the common cases; the LLM is a long-tail backstop.

Layer 6: per-city consolidation maps

After all five layers, polygon-real names sometimes still aren't the guidebook canonical. Overture subdivides Polanco into "Polanco 3ª Sección", "Polanco 4ª Sección", "Polanco 5ª Sección". Paris arrondissements arrive in 50+ surface forms ("14th arrondissement of Paris", "Paris 14e Arrondissement", "14th Arrondissement").

A small per-city map collapses each polygon-real subdivision into the guidebook canonical via foldKey lookup:

export const CANONICAL_CONSOLIDATIONS: Record<string, Record<string, string>> = {
  'mexico city|MX': {
    'polanco 3a seccion': 'Polanco',
    'polanco 4a seccion': 'Polanco',
    'polanco 5a seccion': 'Polanco',
    morelos: 'Centro Histórico',
    guerrero: 'Centro Histórico',
    buenavista: 'Centro Histórico',
    // 30+ more
  },
  'paris|FR': {
    '14th arrondissement of paris': '14th Arrondissement',
    'paris 14e arrondissement': '14th Arrondissement',
    // 48+ more arrondissement variants
  },
};

When a new polygon raw name appears in verify-pass-d that should fold to an existing canonical, the fix is to add it here, not to teach the LLM to handle it stochastically.

Why six layers, and not one LLM call

A single LLM call would work for any individual case. It would also cost more per validate, take longer, hallucinate occasionally, and degrade silently when the model drifts between releases.

Each deterministic layer handles a class of variants the cheaper and more stable way:

| Layer | Strength | Cost |
| --- | --- | --- |
| 1. Spatial reverse-geocode | Exact answer when coords exist | ~1 Geocoding API call per unique coord |
| 2. Gemini alias mining | Handles label-only no-coord cases | Mined ONCE per city, $0 per validate |
| 3. Overture polygon PIP | Handles weird Google Places labels | Lazy-load + R-tree, no API call |
| 4. POI gazetteer | Famous landmarks by title alone | foldKey-exact lookup, no API call |
| 5. LLM canonical resolver | Long-tail rephrasings | One `gpt-4o-mini` call per trip |
| 6. Consolidation map | Polygon-real → guidebook canonical | Static map lookup |

The LLM is layer five, not layer one, because by the time we reach it, 95% of the labels have been resolved deterministically. The remaining 5% are exactly the cases where its strength — pattern-matching across spellings — matters most.

That same logic applies to most "the data is messy" problems in travel software. Real pastes are heterogeneous because real users are heterogeneous. A London trip has Tube-stop names, Wikipedia titles, blog rephrasings, and Reddit short-hand all in the same paste. The cheapest path to a clean Neighborhoods tab is to peel the layers in order: geometry first, dictionaries second, structured catalogs third, LLM last.
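
The ordering idea can be reduced to a first-match cascade, sketched below. This is a simplification under stated assumptions: the real pipeline has override and skip rules between layers, and every name here (`resolveInOrder`, `PastedItem`, the stub resolvers) is hypothetical.

```typescript
// Hypothetical sketch of the layer ordering; resolver names are placeholders.
interface PastedItem {
  title: string;
  label?: string;
  lat?: number;
  lng?: number;
}

type Resolver = (item: PastedItem) => string | null;

interface Resolution {
  name: string;
  layer: string;
}

// Try each resolver in order and stop at the first non-null answer.
function resolveInOrder(
  item: PastedItem,
  layers: Array<[string, Resolver]>,
): Resolution | null {
  for (const [layer, resolve] of layers) {
    const name = resolve(item);
    if (name !== null) return { name, layer };
  }
  return null; // nothing deterministic matched; left for the LLM backstop
}

// Stub resolvers standing in for the real layers.
const demoLayers: Array<[string, Resolver]> = [
  ["geometry", (item) => (item.lat !== undefined && item.lng !== undefined ? "Tribeca" : null)],
  ["dictionary", (item) => (item.label?.toLowerCase() === "fidi" ? "Financial District" : null)],
  ["catalog", (item) => (item.title === "Tate Modern" ? "South Bank" : null)],
];
```

Recording which layer answered is a cheap way to measure how many labels ever reach the LLM.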

If you want to see the result, paste anything messy into a new trip and open the Neighborhoods tab. The canonicalization runs on every validate. The full canonical guidebook lives at our destinations page if you want to see what we're matching against, or check a ChatGPT itinerary if you want to see the same pipeline applied to LLM-generated travel plans.
