SARAN TEJA MALLELA
Your Customer Table Has Duplicates You Can't See With SQL: How I Built a Cross-Platform Identity Resolution Layer for a Dark Kitchen Data Platform

I'm Saran Teja Mallela — a data engineer based in Houston, TX. I build batch and streaming pipelines on Azure during the day at URL Systems, and at night I've been working on GhostKitchen: a full data platform for cloud kitchen operations.

GhostKitchen simulates 50 dark kitchens across 10 Texas cities — Houston, Dallas, Austin, San Antonio, and six others — each running 3–5 virtual restaurant brands. Orders flow in from three delivery platforms (Uber Eats, DoorDash, and the kitchen's own app), alongside kitchen IoT sensor readings, delivery GPS pings, and menu change events. The platform processes 7,500+ events per minute across four Kafka topics and lands everything in Delta Lake using a Medallion Architecture.

But here's the thing — none of that engineering mattered until I could answer a deceptively simple question: who is this customer?


The Problem Nobody Warns You About

Let me show you exactly what I mean. Here's a real scenario from my Houston test data:

Uber Eats order, 7:42 PM:

{
  "customer_uid": "ue_cust_48291",
  "customer_email": "Saran.Mallela@Gmail.com",
  "customer_name": "Saran Mallela",
  "total_amount": 18.75
}

DoorDash order, 8:15 PM, same evening:

{
  "dasher_customer_id": "dd_u_73625",
  "customer_email": "saran.mallela@gmail.com",
  "customer_name": "Saran T Mallela",
  "order_value": 22.40
}

OwnApp order, next day:

{
  "user_id": "app_saran_482",
  "customer_email": "SARAN.MALLELA@GMAIL.COM",
  "customer_name": "Saran Teja Mallela",
  "amount_cents": 1650
}

Three orders. Three platforms. Three different customer IDs (customer_uid, dasher_customer_id, user_id). Three email capitalizations. Three name variations. Two currency formats (dollars vs cents). Zero shared keys.

The kitchen thinks it served three different people. It served me three times.

Run a JOIN ON customer_id — zero matches. Try JOIN ON email — the casing kills it. Even after you lowercase everything, 2% of my test events have null emails entirely (I injected those deliberately to simulate real-world incomplete data).

Without resolving this, customer lifetime value is understated by roughly 35%. Cohort analysis breaks. Personalization is impossible. The business literally doesn't know who its best customers are.


Why I Chose Lambda Over Kappa for This

Quick architecture context before the resolution logic.

GhostKitchen uses Lambda Architecture — dual batch + streaming paths. I know the data engineering internet has strong opinions about this, so here's my reasoning:

Orders are stateful. A single order transitions through up to 7 states: placed → confirmed → preparing → ready → picked_up → delivered (or cancelled). Getting exact daily revenue requires knowing each order's final state, which sometimes needs late corrections — a "delivered" status might arrive hours after the order was placed.
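As a sketch of that lifecycle, the transitions can be encoded as a transition table and validated per order. This is a hypothetical pure-Python model for illustration, not the actual pipeline code; the state names mirror the article, and `is_valid_path` is an invented helper:

```python
# Hypothetical sketch of the 7-state order lifecycle described above.
TRANSITIONS = {
    "placed":    {"confirmed", "cancelled"},
    "confirmed": {"preparing", "cancelled"},
    "preparing": {"ready", "cancelled"},
    "ready":     {"picked_up", "cancelled"},
    "picked_up": {"delivered"},
    "delivered": set(),   # terminal
    "cancelled": set(),   # terminal
}

def is_valid_path(states):
    """Check that each consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(states, states[1:]))
```

The point of writing it down this way: "exact daily revenue" depends on the final state of each path, and a late "delivered" event changes that final state after the fact.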

The streaming path (Kafka → Spark Structured Streaming → Delta Lake) gives approximate numbers in ~30 seconds. The batch path (Airflow → PySpark → Delta Lake MERGE) recomputes exact numbers overnight. A reconciliation DAG at 02:00 UTC makes batch authoritative.

For identity resolution specifically, this dual-path matters: a customer's first order might arrive via streaming with a null email. A later batch correction fills it in. The batch path catches matches that streaming missed.

I built PulseTrack (a healthcare IoT pipeline on Azure) with pure Kappa, because wearable sensor readings are append-only — a heart rate of 72 bpm at 2:34 PM never gets "corrected." Different data semantics → different architecture. That distinction matters more than any blanket rule about which pattern is "better."


The Data Model That Makes Resolution Possible

Before I could resolve identities, I needed a model that could handle three conflicting schemas without forcing premature alignment.

Silver layer: Data Vault 2.0. This is where the magic happens.

Data Vault separates data into three table types:

  • Hubs — core business entities with their business keys. One row per unique entity. hub_customer holds the resolved customer identity.
  • Links — relationships between hubs. link_order_customer connects which customer placed which order.
  • Satellites — descriptive attributes with full history (SCD Type 2). sat_customer_profile tracks name, email, and platform IDs over time.

Why not Star Schema in Silver? Because Uber defines a customer as customer_uid + total_amount (float). DoorDash uses dasher_customer_id + order_value (float). OwnApp uses user_id + amount_cents (integer). Star Schema would force me to pick one representation and lose the others. Data Vault keeps every source's raw attributes in separate satellites while unifying identity in hubs.

Gold layer: Star Schema. Once identities are resolved, analysts need simple queries: SELECT sum(order_total) FROM fact_order JOIN dim_kitchen GROUP BY city. Star makes that a 2-table join. Dimensions are flat (kitchen, brand, delivery zone) — no deep hierarchies that would justify a Snowflake Schema. PII stays locked in Silver; only hashed surrogate keys propagate to Gold.
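To make the three table types concrete, here's a hypothetical in-memory sketch. Field names and values are illustrative; the real tables live in Delta Lake:

```python
# Hub: one row per resolved customer identity (business key = hash key).
hub_customer = [
    {"customer_hk": "a3f8c2", "load_ts": "2025-01-10T02:00:00Z", "record_source": "uber_eats"},
]

# Link: which customer placed which order.
link_order_customer = [
    {"order_hk": "9c01aa", "customer_hk": "a3f8c2", "load_ts": "2025-01-10T02:00:00Z"},
]

# Satellite: per-source descriptive attributes with full history (SCD Type 2).
# Two platforms can disagree about the same customer_hk without conflict.
sat_customer_profile = [
    {"customer_hk": "a3f8c2", "platform": "uber_eats",
     "customer_name": "Saran Mallela", "load_ts": "2025-01-10T02:00:00Z"},
    {"customer_hk": "a3f8c2", "platform": "own_app",
     "customer_name": "Saran Teja Mallela", "load_ts": "2025-01-11T02:00:00Z"},
]
```

The hub stays one-row-per-entity even while satellites carry conflicting per-platform attributes for the same `customer_hk` — that separation is what lets the three schemas coexist without premature alignment.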


The Resolution Pipeline: Four Stages

Stage 1: Email Normalization

Strip whitespace, lowercase everything, remove dots before the @ in Gmail addresses (Gmail ignores dots — saran.mallela@gmail.com and saranmallela@gmail.com deliver to the same inbox).

After normalization, Saran.Mallela@Gmail.com, saran.mallela@gmail.com, and SARAN.MALLELA@GMAIL.COM all become saranmallela@gmail.com.

This alone resolves about 82% of cross-platform matches in my test data.
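As a minimal pure-Python sketch of the rule (illustrative only — the pipeline itself does this in PySpark, shown in Stage 2; `normalize_email` is an invented name):

```python
def normalize_email(email):
    """Trim, lowercase, and strip dots from the local part of Gmail addresses."""
    if email is None:
        return None
    email = email.strip().lower()
    local, _, domain = email.partition("@")
    if domain == "gmail.com":
        local = local.replace(".", "")  # Gmail ignores dots in the local part
    return f"{local}@{domain}"
```

Note the ordering: lowercase first, then apply the Gmail rule, so a mixed-case domain like `@Gmail.com` still triggers the dot-stripping.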

Stage 2: Deterministic Grouping via MD5 Hash

Hash the normalized email with MD5. All records sharing the same hash receive the same customer_hk (hash key) in the Data Vault hub.

Why hash instead of a database-generated auto-increment ID? Because three platforms run as independent Spark jobs. There's no central coordinator to hand out sequential IDs. A hash is deterministic — same email always produces the same key, on any node, any platform, independently. No coordination needed.

from pyspark.sql import functions as F

# Lowercase and trim FIRST, then strip dots in the local part of
# Gmail addresses. (Running the regex before lowercasing silently
# skips mixed-case domains like "@Gmail.com", because the
# case-sensitive pattern never matches them.)
df = df.withColumn(
    "email_normalized",
    F.regexp_replace(
        F.lower(F.trim(F.col("customer_email"))),
        r"\.(?=[^@]*@gmail\.com$)", ""
    )
)

# Deterministic hash key: the same normalized email always yields
# the same customer_hk, on any node, with no central ID allocator.
df = df.withColumn(
    "customer_hk",
    F.md5(F.col("email_normalized"))
)

After this stage, my three Houston test orders — the pad thai on Uber Eats, the burrito bowl on DoorDash, and the fried chicken on OwnApp — all share one customer_hk. They're one person in the system now.

Stage 3: Fuzzy Matching for the Remaining 18%

What about the 2% with null emails? And the cases where the same person uses different email addresses across platforms?

This is where I added probabilistic matching using Jaro-Winkler string similarity on name + delivery address combinations:

  • "Saran Mallela" vs "Saran T Mallela" → similarity: 0.94 → match
  • "Saran Mallela" vs "Sarah Miller" → similarity: 0.68 → no match

I also incorporated Soundex phonetic matching as a fallback — it catches typos and transliterations that string distance alone misses. "Mallela" and "Malela" produce the same Soundex code.

The confidence threshold is 0.88. Above that → automatic merge. Between 0.70 and 0.88 → flagged for manual review. Below 0.70 → treated as distinct customers.
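Here's a self-contained sketch of the scoring and thresholding logic. The pipeline uses library implementations; the Jaro-Winkler and Soundex below are textbook versions written out for illustration, so exact scores may differ slightly from the 0.94 / 0.68 figures above, and `classify` is an invented helper name:

```python
def jaro(s1, s2):
    """Textbook Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def soundex(name):
    """Classic 4-character Soundex phonetic code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first, prev = name[0].upper(), codes.get(name[0], "")
    digits = []
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h/w are transparent; vowels reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]

def classify(score, auto=0.88, review=0.70):
    """Thresholds from the article: auto-merge / manual review / distinct."""
    if score >= auto:
        return "merge"
    if score >= review:
        return "review"
    return "distinct"
```

With these definitions, "Saran Mallela" vs "Saran T Mallela" lands well above the auto-merge line, and "Mallela" / "Malela" collapse to the same Soundex code, as described above.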

Stage 4: Confidence Scoring and Audit Trail

Every single match gets a confidence score and a matched_via field:

customer_hk  platform   platform_id     email                    matched_via      confidence
a3f8c2...    uber_eats  ue_cust_48291   saran.mallela@gmail.com  email_hash       1.00
a3f8c2...    doordash   dd_u_73625      saran.mallela@gmail.com  email_hash       1.00
a3f8c2...    own_app    app_saran_482   saran.mallela@gmail.com  email_hash       1.00
b7d1e9...    uber_eats  ue_cust_91034   NULL                     fuzzy_name_addr  0.91

The audit trail is critical. When an analyst questions a number, you can trace it back to exactly why two records were merged and how confident the system was about it. No black boxes.


Stress-Testing With Deliberately Dirty Data

Here's what most identity resolution tutorials skip: if you test on clean data, you're testing your assumptions, not your algorithm.

I injected five types of noise into my data generators:

  1. 5% duplicate events — simulating Kafka consumer retries and at-least-once delivery
  2. 2% null emails — simulating incomplete user profiles on mobile sign-ups
  3. 3% late-arriving events — orders showing up 6–24 hours after they happened
  4. Name variations — middle names, initials, typos (Mallela vs Malela vs mallela)
  5. Address inconsistencies — "Houston, TX" vs "Houston, Texas" vs "HTX"

Against this deliberately hostile test data, the resolution pipeline achieves 94% match accuracy across 200 customer identities.

The remaining 6%? Documented as known unresolvable ambiguity — cases where two records have no email, different names, and addresses that are close but not close enough. That's not a failure. That's honest engineering. Because 100% accuracy in identity resolution doesn't exist. You're always trading precision for recall, and pretending otherwise is how you end up merging two different people into one profile.


Handling Late-Arriving Data

The 3% late-arriving events deserve their own section because they interact with identity resolution in a way that isn't obvious.

The streaming path uses a 24-hour watermark. Any event older than 24 hours gets routed to a Dead Letter Queue (DLQ). The nightly reconciliation DAG picks up DLQ events and reprocesses them through the batch path.
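As a toy illustration of the routing rule (the real pipeline expresses this through Spark Structured Streaming's watermark; `route` and its signature are hypothetical):

```python
from datetime import datetime, timedelta, timezone

WATERMARK = timedelta(hours=24)

def route(event_time, now=None):
    """Events older than the watermark go to the DLQ for the nightly batch."""
    now = now or datetime.now(timezone.utc)
    return "dlq" if now - event_time > WATERMARK else "stream"
```

The streaming path never blocks waiting for stragglers; anything past the watermark simply becomes the batch path's problem.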

Why does this matter for identity? Imagine this sequence:

  1. 8:00 PM — An order arrives via streaming with a null email. Gets assigned a new customer_hk as an unmatched record.
  2. 2:00 AM — A batch correction arrives with the email filled in.
  3. Reconciliation — Batch path re-runs identity resolution, discovers this "new" customer actually matches an existing one. Delta Lake MERGE updates the Gold layer. The temporary unmatched record gets absorbed into the correct customer profile.

Without the batch path, that customer remains a phantom forever — inflating your customer count and deflating per-customer metrics. This is why Lambda exists for this use case. It's not about speed. It's about eventual correctness.
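To make the phantom-absorption step concrete, here's a hypothetical in-memory simulation. The real implementation is a Delta Lake MERGE against the Data Vault tables; `upsert` and `reconcile` are illustrative names:

```python
import hashlib
import uuid

def email_hk(email):
    # Same deterministic key as Stage 2: MD5 of the normalized email.
    return hashlib.md5(email.encode()).hexdigest()

profiles = {}  # customer_hk -> set of (platform, platform_id)

def upsert(platform, platform_id, email):
    """Streaming path: records with a null email get a temporary phantom key."""
    hk = email_hk(email) if email else f"phantom_{uuid.uuid4().hex[:8]}"
    profiles.setdefault(hk, set()).add((platform, platform_id))
    return hk

def reconcile(phantom_hk, corrected_email):
    """Batch path: re-resolve a phantom once the email arrives, absorbing it."""
    real_hk = email_hk(corrected_email)
    profiles.setdefault(real_hk, set()).update(profiles.pop(phantom_hk))
    return real_hk
```

Before reconciliation the phantom inflates the customer count by one; after it, the platform ID is attached to the correct profile and the phantom key is gone.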


What I'd Do Differently in Production

Three things I'd add for a production system at scale:

1. Graph-based resolution. My current approach is pairwise — compare record A to record B. A graph database would let me do transitive matching: if A matches B, and B matches C, then A matches C — even if A and C share zero attributes. I actually built this pattern in PulseTrack, where patient identity resolution chains 7 identifier types in a multi-hop graph: device_account → email → MRN → pharmacy_id → insurance_id → phone_hash → ssn_hash.

2. ML-based scoring. Replace hand-tuned Jaro-Winkler with a trained classifier that learns match weights from confirmed true matches. The 0.88 threshold is hand-tuned — a model could optimize it per-field and adapt as data distributions change.

3. Real-time resolution in the streaming path. Currently, fuzzy matching only runs in the batch path because it's computationally expensive. With a pre-computed blocking index in Redis, the streaming path could do approximate matching at ingestion time instead of waiting for the nightly batch.
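The transitive matching from point 1 is essentially connected components over match edges, which a union-find structure captures in a few lines. This is a hypothetical sketch standing in for a real graph database:

```python
parent = {}

def find(x):
    """Walk to the root of x's component, halving paths as we go."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    """Record that two platform records matched; components merge transitively."""
    parent[find(a)] = find(b)

# A matches B on email hash, B matches C on name + address:
union("ue_cust_48291", "dd_u_73625")
union("dd_u_73625", "app_saran_482")
```

After those two unions, `ue_cust_48291` and `app_saran_482` resolve to the same component even though they share zero attributes directly — exactly the A→B→C chaining described above.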


Try It Yourself

The entire platform is open source and runs locally with one command:

docker-compose up -d

The live demo lets you run the full pipeline, explore the schema (including the identity resolution layer), browse the Data Vault Silver and Star Schema Gold tables, and see 43 automated data quality checks per pipeline run.

The Gold layer has fact_order (2,274 rows), fact_sensor_hourly, dim_customer (576 resolved profiles from ~800 raw platform records), dim_kitchen, dim_brand, dim_date, dim_time, and more — all with full data lineage from source to serving.


If you work on multi-source identity problems — in food delivery, retail, fintech, healthcare, or any multi-marketplace domain — I'd genuinely like to hear how you're solving it. What matching strategies work for your data? Where do you draw the precision/recall line?

I'm Saran Teja Mallela — data engineer, University of Houston MS grad (Data Science, 4.0 GPA), currently building data platforms at URL Systems in Houston. You can find me on LinkedIn or GitHub.

I write about data architecture decisions, not just code. Because the hardest part of engineering is never the syntax — it's deciding what "good enough" means.
