I noticed something that felt impossible at first: the more enrichment I added, the more often my records got worse.
Not catastrophically. Not in a way that threw errors. Just… quietly. A city would flip to something nearby. A state would “correct” itself. A phone number would appear, but it was clearly the corporate HQ line, not the local office. And the worst part was that every one of those changes looked superficially reasonable—exactly the kind of “helpfulness” that sneaks past review.
This is Part 3 of “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”. In Part 2, I showed why I accept variance upstream and then validate hard downstream. This post is the other half of that philosophy: when you enrich data from multiple sources, you need a system that can accumulate truth without regressing it.
The core decision: I built a six-tier enrichment cascade with per-field provenance tracking. Each tier is allowed to fill blanks, but it is not allowed to overwrite a better source.
The key insight (and why the naive approach fails)
Single-source enrichment is seductive because it’s clean: one API call, one response, one “truth.” In production, it’s also how you end up with 30–40% data gaps—because no single provider returns everything you need, consistently, across the messy variety of real emails and real companies.
So the naive “fix” is obvious: add more sources.
That’s where things break.
If you simply merge dictionaries in the order you call APIs, you create a system that can’t tell the difference between:
- a scraped street address from a web source, and
- a location inferred from a phone area code approximation.
Both are “values.” One is a fact. The other is a guess. If you don’t encode that distinction, your pipeline will happily overwrite facts with guesses and call it an improvement.
My solution is deliberately unclever: every field gets provenance, and every source gets a numeric priority. Lower-priority sources can fill missing fields, but they can’t overwrite higher-priority data.
That’s the engineer’s job here: not “add AI,” but build the rules that keep the system honest.
How the cascade works under the hood
The research node in my LangGraph pipeline runs an enrichment priority chain:
1. Firecrawl enrichment (web scrape)
2. Bing Search
3. Domain variation retry (with a failed-domain cache)
4. Serper API
5. Azure Maps POI
6. Phone area code approximation
Two design choices make this reliable:
- A source priority table (numeric, explicit)
- Per-field source tracking (city can come from one tier while state comes from another)
Source priority: one table, no ambiguity
I keep the priority model explicit in code. The research node doesn’t “kind of” prefer one source—it has a single map that defines what wins.
# app/langgraph_manager.py:136-143
LOCATION_SOURCE_PRIORITY = {
    "firecrawl": 5,
    "bing": 4,
    "domain_variation": 3,
    "serper": 2,
    "azure_maps_poi": 1,
    "phone_area_code": 0,
}
The non-obvious win here is that I can reason about conflicts without reading the whole pipeline. When I’m debugging a bad record, I’m not asking “which function ran last?”—I’m asking “what priority was allowed to write this field?”
Per-field provenance: the merge rule that prevents regression
The second piece is per-field source provenance tracking. I don’t track “this record came from Bing.” I track city_source, state_source (and the same idea applies to other fields in the enrichment payload).
That means one tier can contribute a city while another tier contributes a phone, and neither one is allowed to stomp the other unless its priority is higher.
# app/langgraph_manager.py:146-210
# Per-field provenance tracking is handled in the research node.
# Each field has an associated "*_source" that records which tier populated it.
# Lower-priority sources fill empty fields but never overwrite higher-priority fields.
# Example fields tracked independently:
# - city_source
# - state_source
What surprised me when I first shipped this is how often “partial truth” is the norm. A web scrape might nail the company name but miss the city. A maps lookup might nail the city but not the phone. Provenance lets me accept that reality without letting the system devolve into last-write-wins chaos.
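The merge rule itself is small. Here's a minimal sketch of the fill-or-outrank logic, repeating the priority table so it runs on its own; the function name merge_field and its exact shape are mine, not the real implementation in the research node:

```python
LOCATION_SOURCE_PRIORITY = {
    "firecrawl": 5,
    "bing": 4,
    "domain_variation": 3,
    "serper": 2,
    "azure_maps_poi": 1,
    "phone_area_code": 0,
}

def merge_field(record: dict, field: str, value, source: str) -> None:
    """Write `value` into `record[field]` only if the field is empty
    or `source` strictly outranks whoever populated it last.
    The winner is recorded in the matching '*_source' field."""
    if not value:
        return
    current = record.get(f"{field}_source")
    if current is None or (
        LOCATION_SOURCE_PRIORITY[source] > LOCATION_SOURCE_PRIORITY[current]
    ):
        record[field] = value
        record[f"{field}_source"] = source
```

Because the rule runs per field, one tier can win city while another wins phone, and a later low-priority tier can still fill whatever is left blank.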
Tracing one record through all six tiers
When I explain this internally, I don’t describe it as “six APIs.” I describe it as a relay race where each runner is only allowed to carry the baton forward—not swap it for a different baton.
Here’s the flow at the level that matters: who gets to write which fields, and under what constraints.
flowchart TD
extracted[Extracted company hints] --> firecrawl[Firecrawl enrichment]
firecrawl --> bing[Bing Search]
bing --> domainVariation[Domain variation retry]
domainVariation --> serper[Serper API]
serper --> azureMapsPoi[Azure Maps POI]
azureMapsPoi --> phoneAreaCode[Phone area code approx.]
firecrawl -->|priority 5| merge[Provenance merge]
bing -->|priority 4| merge
domainVariation -->|priority 3| merge
serper -->|priority 2| merge
azureMapsPoi -->|priority 1| merge
phoneAreaCode -->|priority 0| merge
merge --> enriched[Enriched record]
The merge step is the heart of it. Every tier produces candidate values; the merge logic decides whether each value is allowed to populate a field based on:
- whether the field is empty
- what source last populated it
- the numeric priority of the incoming source
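To make the relay concrete, here's an illustrative trace of one record through all six tiers. The candidate payloads are invented for the example, and I've inlined the fill-or-outrank rule so the sketch is self-contained:

```python
PRIORITY = {"firecrawl": 5, "bing": 4, "domain_variation": 3,
            "serper": 2, "azure_maps_poi": 1, "phone_area_code": 0}

# Invented candidate payloads: each tier knows only part of the truth.
tier_outputs = [
    ("firecrawl",        {"company_name": "Acme Robotics"}),       # no city
    ("bing",             {"company_name": "Acme Robotics Inc."}),  # outranked
    ("domain_variation", {}),                                      # nothing found
    ("serper",           {"phone": "+1 512-555-0100"}),
    ("azure_maps_poi",   {"city": "Austin", "state": "TX"}),
    ("phone_area_code",  {"city": "Dallas"}),                      # a guess
]

record: dict = {}
for source, candidate in tier_outputs:
    for field, value in candidate.items():
        current = record.get(f"{field}_source")
        if value and (current is None or PRIORITY[source] > PRIORITY[current]):
            record[field] = value
            record[f"{field}_source"] = source
```

The final record carries company_name from Firecrawl, phone from Serper, and city/state from Azure Maps. The phone-based Dallas guess never lands, because Austin was already claimed by a higher-priority source.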
Tier 1: Firecrawl (web scrape)
Firecrawl is the highest priority in the chain (firecrawl: 5). When it returns data, it should win against everything else.
But I also made a business decision explicit in code: Firecrawl enrichment is disabled by default behind a feature flag because it’s a cost gate.
# app/config/feature_flags.py
ENABLE_FIRECRAWL_ENRICHMENT = False
That boolean looks small, but it’s a real line in the sand: “We can run the system without the most expensive enrichment tier, and the cascade will still behave predictably.”
The engineering implication is important: when the best tier is turned off, the rest of the pipeline shouldn’t start overwriting each other out of desperation. Provenance keeps the behavior stable.
Tier 2: Bing Search (fallback enrichment)
When Firecrawl is missing fields or has low confidence, the chain can enhance using Bing Search.
The decision rule is explicit in the research node: low confidence or missing key fields triggers the fallback.
# app/langgraph_manager.py
# Check if we need improvement: low confidence OR missing key fields
needs_improvement = (
    research_result.get('confidence', 0) < 0.7
    or not research_result.get('company_name')
    or (isinstance(research_result.get('company_name'), str)
        and research_result['company_name'].lower().startswith('unknown'))
    or not research_result.get('phone')
    or not research_result.get('city')
)
The detail that matters: I’m not using Bing to “compete” with Firecrawl. I’m using it to fill the holes Firecrawl leaves behind. And because Bing has lower priority (bing: 4), it can’t overwrite Firecrawl-populated fields.
Tier 3: Domain variation retry (with failed-domain cache)
Domains are messy. A scraped domain might be a vanity domain. It might redirect. It might be wrong.
So I added a domain variation retry mechanism—and I made it remember what failed.
# app/langgraph_manager.py:3246-3330
# Domain variation retry is handled in the research node.
# It retries enrichment using domain variants and maintains a failed-domain cache
# so the system doesn’t keep paying for the same dead ends.
This is one of those features that reads like a minor optimization until you run it at scale. Without a failed-domain cache, a pipeline like this will happily re-try the same bad domain across many runs, because “it might work this time.” With the cache, failure becomes a learned fact.
Tiers 4–5: Serper and Azure Maps POI (filling mid-priority gaps)
Serper (serper: 2) and Azure Maps POI (azure_maps_poi: 1) occupy the middle of the cascade. They exist for the same reason: earlier tiers often leave location and company metadata partially filled, and these two services are good at plugging those specific holes.
Neither tier does anything exotic. Serper adds another web-search vote when Bing and domain variation came up short. Azure Maps grounds location fields (city, state) against a geographic index when scraping didn't produce them. Both follow the same provenance rule as every other tier: fill blanks, never overwrite a higher-priority source.
The value of making them explicit tiers—rather than lumping them into a generic "try more APIs" step—is debuggability. When a record's city_source says azure_maps_poi, I know exactly which tier populated it and at what confidence level. That traceability is the whole point of the cascade.
Tier 6: Phone area code approximation
Phone area code approximation is the lowest priority (phone_area_code: 0) for a reason: it’s sometimes the only thing you have, but it’s also the easiest way to inject plausible nonsense.
And this is where per-field provenance earns its keep.
If a city came from Firecrawl, a phone-based guess can never overwrite it.
If city is empty, the phone tier can fill it—with the source recorded as phone_area_code.
That last part matters operationally: when a recruiter sees a location, I want the system to know whether it’s scraped, searched, maps-grounded, or inferred. The provenance fields are how I keep that honesty intact.
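What the lowest tier looks like in spirit, with a tiny invented slice of an area-code table (the real lookup would cover the full numbering plan):

```python
# Illustrative slice only; a production table is far larger.
AREA_CODE_LOCATIONS = {
    "212": ("New York", "NY"),
    "415": ("San Francisco", "CA"),
    "512": ("Austin", "TX"),
}

def approximate_location(phone: str) -> dict:
    """Return city/state guesses with provenance, or {} when the
    area code is unknown. Callers merge this at priority 0, so it
    can only ever fill blanks, never overwrite."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) < 10:
        return {}
    area = digits[-10:-7]  # area code of a NANP number
    if area not in AREA_CODE_LOCATIONS:
        return {}
    city, state = AREA_CODE_LOCATIONS[area]
    return {"city": city, "city_source": "phone_area_code",
            "state": state, "state_source": "phone_area_code"}
```

Note that the output carries its own provenance fields, so the merge step never has to guess where a value came from.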
What went wrong before I built provenance
Before I tracked sources per field, I had a classic failure mode: the pipeline would “improve” a record by adding a missing phone number, and in the same merge step it would overwrite the city with a weaker guess.
It looked like progress because the payload had more filled fields.
But it was actually a regression: a high-quality location got replaced by a low-quality one, and nobody noticed until downstream workflows started behaving strangely.
That’s the trap: enrichment failures rarely look like failures. They look like slightly-wrong facts.
The provenance model turns those silent failures into something you can reason about:
- If city_source is phone_area_code, I know it's an approximation.
- If city_source is firecrawl, I treat it as higher confidence.
- If a field flips sources unexpectedly, that's a debugging signal.
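That last signal can even be automated. Here's a hypothetical check (not in the codebase) that compares two versions of a record and flags any field whose source dropped in priority:

```python
PRIORITY = {"firecrawl": 5, "bing": 4, "domain_variation": 3,
            "serper": 2, "azure_maps_poi": 1, "phone_area_code": 0}

def provenance_regressions(before: dict, after: dict) -> list[str]:
    """Return the '*_source' fields whose priority dropped between
    two versions of a record: the silent-regression signal."""
    flags = []
    for key, new_source in after.items():
        if not key.endswith("_source"):
            continue
        old_source = before.get(key)
        if old_source and PRIORITY[new_source] < PRIORITY[old_source]:
            flags.append(key)
    return flags
```

A non-empty result is exactly the failure mode described above: a record that gained fields while quietly losing quality.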
Nuances that make this pattern hold up
1) Priority is numeric because ties are poison
I chose numeric priorities (5 down to 0) because it forces total ordering. If you allow “these two sources are both good,” you eventually end up with tie-breaking logic scattered across the codebase.
A single map makes conflict resolution mechanical—and mechanical is what you want in production.
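The order-independence that a total order buys is easy to demonstrate. With strict numeric priorities, the merged value doesn't depend on which source happened to run first (a toy sketch, not the production table):

```python
PRIORITY = {"scrape": 2, "guess": 1}  # strict: no ties allowed

def merge_city(order: list[str]) -> str:
    """Merge a single 'city' field from sources in the given order."""
    candidates = {"scrape": "Austin", "guess": "Dallas"}
    value, source = None, None
    for s in order:
        if source is None or PRIORITY[s] > PRIORITY[source]:
            value, source = candidates[s], s
    return value
```

If the two priorities were equal, the winner would depend on call order, and call order is exactly the kind of incidental detail that changes when you refactor a pipeline.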
2) Provenance is per-field because reality is per-field
If you track provenance only at the record level, you’ll end up lying to yourself.
Real enrichment is patchwork:
- one source knows the company name
- another knows the phone
- another knows the city
Per-field provenance is the only representation that matches that reality.
3) Feature flags are part of architecture, not “ops stuff”
ENABLE_FIRECRAWL_ENRICHMENT = False isn't a deployment toggle. It's an encoded business decision: the system must still function when the expensive tier is off.
That constraint shapes the cascade:
- If Firecrawl is off, Bing becomes the top tier.
- If Bing is thin, domain variation retry becomes more important.
- If everything fails, the pipeline still returns something—with honest provenance.
That last clause is the difference between “resilient” and “quietly wrong.”
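A sketch of how that constraint can show up in code. Skipping a disabled tier removes it from the run order without touching the absolute priorities of the rest (the names here are illustrative):

```python
ENABLE_FIRECRAWL_ENRICHMENT = False  # the cost gate

ALL_TIERS = ["firecrawl", "bing", "domain_variation",
             "serper", "azure_maps_poi", "phone_area_code"]

def active_tiers() -> list[str]:
    """With the flag off, Bing simply becomes the top tier.
    Nothing renumbers: priorities are absolute, not positional,
    so the merge behaves identically with fewer candidates."""
    skipped = set() if ENABLE_FIRECRAWL_ENRICHMENT else {"firecrawl"}
    return [t for t in ALL_TIERS if t not in skipped]
```

This is why disabling the most expensive tier is safe: the remaining tiers still resolve conflicts by the same numbers they always used.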
Closing
The cascade isn’t impressive because it calls six services; it’s impressive because it refuses to confuse guesses with facts. Once every field carries its own provenance and every source has an explicit priority, enrichment stops being a gamble and starts being a controlled accumulation of truth—exactly the kind of unglamorous engineering that keeps an enterprise AI system from slowly drifting into confident nonsense.
In Part 4, I’ll take the same principle one step further: user corrections always win, and the system learns from them without ever pretending the correction was the model’s idea in the first place.
🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant