Higgie

Posted on May 6

How We Accidentally Built a Customer Data Platform

#dataengineering #architecture #webdev #saas

A case study in identity resolution, multi-source data stitching, and why vertical infrastructure beats horizontal platforms for specialized industries.

The Problem Nobody Warned Us About

We were hired to build a call tracking dashboard. A legal services organization managing 230+ law firms needed a single pane of glass for call intelligence — ingest data from their phone system, run it through AI for classification, and surface it on a dashboard their operations team could review.

Simple enough. Until the data started arriving from multiple systems, in different formats, with no shared identifier.

Their phone platform (CallRail) sent call data with one type of company ID. Their form intake vendor (Intaker) sent lead notifications with a completely different identifier: a URL-style slug embedded in the email body. Website contact forms came in from firm-specific domains. And the same real-world law firm could show up as three different entities across these systems, with no obvious way to connect them.

We didn't set out to build a Customer Data Platform. We set out to build a dashboard. But the data reconciliation problem forced an architecture that, by the time we were done, looked remarkably like the identity resolution layer at the core of enterprise CDPs that charge six figures a year.

The Canonical Identity: One ID to Rule Them All

Every CDP has some notion of a canonical identifier. Adobe has the ECID (Experience Cloud ID). Segment has its anonymousId and userId. Ours is a COM-prefixed UUID, a format inherited from the phone platform's API, which returns company identifiers in that shape.

The critical decision was adopting this format as the universal namespace. Every record in the system — whether it originated from a phone call, a web form, or a manual entry — carries this canonical ID. If a record doesn't have one, it hasn't been resolved yet, and it can't enter the primary data tables.

For firms that exist in the phone platform, the canonical ID comes from their API. For firms that exist only in the form pipeline (no phone tracking), we mint our own UUID in the same format. The downstream system can't tell the difference, and it doesn't need to.

This is the first principle of any identity graph: one namespace, multiple minters, zero branching logic downstream.
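
As a rough illustration of "one namespace, multiple minters", here is a minimal sketch. It assumes the canonical format is the literal prefix COM followed by 32 lowercase hex digits; that exact shape, and the function names, are assumptions for illustration rather than the production code.

```python
import re
import uuid

# Assumed shape: "COM" followed by 32 lowercase hex digits.
CANONICAL_ID = re.compile(r"COM[0-9a-f]{32}")


def mint_canonical_id() -> str:
    """Mint an ID for a firm that has no phone-platform record."""
    return "COM" + uuid.uuid4().hex


def is_canonical(value: str) -> bool:
    """True if the value fits the shared namespace, whoever minted it."""
    return CANONICAL_ID.fullmatch(value) is not None
```

Downstream code only ever calls something like is_canonical; it never needs to ask which system minted the ID.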

The Resolution Graph

With the canonical ID established, the next problem is mapping external identifiers to it. Each data source speaks its own language:

The phone platform issues two identifiers for the same company: a UUID through its API, and a numeric ID through its webhook payloads. Same firm, two formats, delivered through two different channels. We discovered this when 212 calls landed in our database but were invisible on the dashboard. The data was there. The key was wrong.

The form intake vendor embeds a slug in notification emails, a URL-friendly string like lawofficesofsmithjones. This slug is the vendor's internal identifier for the firm. It bears no resemblance to the phone platform's ID.

The firms themselves have website domains. When a prospect fills out a contact form on smithjoneslaw.com, the form notification email arrives from that domain. The domain is an implicit identifier.

Individual senders build up history over time. After an operations team member approves a few submissions from forms@smithjoneslaw.com, the system learns that this sender is associated with a specific firm.

Each of these — numeric ID, slug, domain, sender email — is a spoke in the identity graph, pointing at the canonical UUID hub. The resolution logic tries them in priority order:

Phone platform numeric ID → translation table → canonical UUID
Notification email domain → domain registry → canonical UUID
Vendor slug → domain registry (different source type) → canonical UUID
Sender email → sender history → canonical UUID
No match → quarantine / manual review queue

A single firm might have five spokes. A simpler firm might have two. The graph is thinner for some nodes, but the resolution logic is identical regardless of how many identifiers exist.
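
The priority order above can be sketched as a single function that tries each spoke in turn. This is an illustrative sketch, not the production resolver: the Signals and Registries types, their field names, and the use of in-memory dicts in place of database tables are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Signals:
    """Identifiers extracted from one inbound record; any may be missing."""
    account_id: Optional[str] = None
    numeric_company_id: Optional[int] = None
    email_domain: Optional[str] = None
    vendor_slug: Optional[str] = None
    sender_email: Optional[str] = None


@dataclass
class Registries:
    """Hypothetical in-memory stand-ins for the lookup tables."""
    # (account_id, numeric_company_id) -> canonical ID
    translation: Dict[Tuple[str, int], str] = field(default_factory=dict)
    # (source_type, value) -> canonical ID; source_type is "domain" or "slug"
    domain_registry: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # sender email -> canonical ID
    sender_history: Dict[str, str] = field(default_factory=dict)


def resolve(sig: Signals, reg: Registries) -> Optional[str]:
    """Return the first match in priority order, or None for quarantine."""
    candidates = (
        reg.translation.get((sig.account_id, sig.numeric_company_id)),
        reg.domain_registry.get(("domain", sig.email_domain)),
        reg.domain_registry.get(("slug", sig.vendor_slug)),
        reg.sender_history.get(sig.sender_email),
    )
    for canonical_id in candidates:
        if canonical_id is not None:
            return canonical_id
    return None
```

A firm with five spokes and a firm with two go through the same function; missing signals simply fall through to the next lookup.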

The Translation Table: When Your Sources Disagree

The most instructive bug we encountered was the numeric-to-UUID translation gap. The phone platform's REST API returns company identifiers as UUIDs. The same platform's webhooks return the same companies as numeric IDs. No documentation warned us about this. We discovered it when a client reported that a specific firm showed no recent activity, despite the firm receiving calls daily.

Our database had the calls. Our dashboard couldn't find them. The calls were stored under numeric IDs; the dashboard queried by UUID. Two hundred twelve calls, invisible.

The fix was a dedicated translation table with a composite primary key: (account_id, numeric_company_id) → canonical_uuid. It holds one hundred eighty-nine mappings across thirteen accounts. When a webhook arrives, the handler resolves the numeric ID through this table before the record touches any other part of the system. If resolution fails after all fallback attempts, the handler writes the record to a quarantine table with the full payload preserved and returns an HTTP 202. The record never enters the primary data store with an unresolved identity.

This pattern — resolve before write, quarantine on failure — eliminated an entire class of invisible-data bugs. Records can't exist in the system under the wrong key because unresolved records can't enter the system at all.
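
A minimal sketch of resolve-before-write with quarantine on failure, using an in-memory SQLite database. The table names, the payload fields account_id and company_id, and the 200 status on success are assumptions for illustration (the only status the text above confirms is the 202 on quarantine), and the fallback attempts are omitted for brevity.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create a throwaway in-memory store with the three tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE company_translation (
            account_id         TEXT    NOT NULL,
            numeric_company_id INTEGER NOT NULL,
            canonical_uuid     TEXT    NOT NULL,
            PRIMARY KEY (account_id, numeric_company_id)
        );
        CREATE TABLE calls (canonical_uuid TEXT NOT NULL, payload TEXT NOT NULL);
        CREATE TABLE quarantine (reason TEXT NOT NULL, payload TEXT NOT NULL);
    """)
    return db


def handle_webhook(db: sqlite3.Connection, payload: dict) -> int:
    """Resolve the numeric ID first; write to calls only if it resolves."""
    row = db.execute(
        "SELECT canonical_uuid FROM company_translation "
        "WHERE account_id = ? AND numeric_company_id = ?",
        (payload.get("account_id"), payload.get("company_id")),
    ).fetchone()
    raw = json.dumps(payload)  # keep the full payload either way
    with db:  # commits on success, rolls back on exception
        if row is None:
            db.execute(
                "INSERT INTO quarantine VALUES (?, ?)",
                ("unresolved_company", raw),
            )
            return 202  # accepted, but held for manual review
        db.execute("INSERT INTO calls VALUES (?, ?)", (row[0], raw))
    return 200
```

Because the lookup happens before any insert into calls, a record under the wrong key has no code path into the primary table.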

Tombstones: When Identities Die

Firms merge. A manually created entry turns out to be a duplicate of a phone-platform-synced entry. Two records that looked independent are actually the same real-world firm.

You can't just delete the old record. Data keeps arriving that references it — a webhook fires, a form notification lands, a background sync runs. If the old identifier resolves to nothing, the record errors out or creates a ghost entry. If it resolves to the deleted firm, you get data attached to a nonexistent entity.

Tombstones solve this. When firm A merges into firm B, we write a tombstone entry: "A is dead; redirect to B." Every ingest path — API sync, webhook handler, email poller, manual backfill — checks the tombstone registry before writing. If a new record arrives referencing the dead UUID, it gets silently redirected to the survivor before it touches the database.

Some tombstones are pure blocks, not redirects. We discovered that one phone platform account contained a medical practice mixed in with law firms — same account, unrelated business. That entity got a tombstone with no redirect target. The sync sees it, checks tombstones, finds the block, skips it. Permanently filtered with zero ongoing maintenance.

The subtlety that caught us (and that our external code reviewer flagged) was chain resolution. If firm A merges into firm B, then firm B later merges into firm C, a single-hop lookup resolves A to B — which is also dead. The resolution function now walks the chain with cycle detection: A→B→C resolves all the way to the living node. Without this, data silently strands at intermediate tombstones, and nobody notices until a client asks why their numbers are wrong.
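
Chain resolution with cycle detection can be sketched in a few lines. This is an illustrative version under assumptions: the tombstone registry is modeled as a dict from dead ID to redirect target, a target of None stands for a pure block, and the names resolve_live and TombstoneCycle are hypothetical.

```python
from typing import Dict, Optional


class TombstoneCycle(Exception):
    """Raised when redirects loop back on themselves."""


def resolve_live(
    canonical_id: str, tombstones: Dict[str, Optional[str]]
) -> Optional[str]:
    """Follow redirects to the surviving ID.

    Returns None when the chain ends in a pure block (a tombstone with
    no redirect target), meaning the record should be skipped.
    """
    seen = set()
    current = canonical_id
    while current in tombstones:
        if current in seen:
            raise TombstoneCycle(f"redirect loop at {current}")
        seen.add(current)
        target = tombstones[current]
        if target is None:
            return None  # pure block: filter the record out
        current = target
    return current  # not tombstoned, so this is the living node
```

Each ingest path would call something like this before writing, so a record that references A lands on C even when B died in between.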

The Multi-Tenant Constraint: Nodes That Can't Form Edges

The most expensive bug in the project's history came from a constraint we didn't know existed until it bit us.

Some vendors in the ecosystem are multi-tenant — they send notifications on behalf of multiple firms from a single email address. The form intake vendor, for example, sends every notification from the same noreply@ address regardless of which firm the lead belongs to. The phone platform sends call notifications the same way.

Our trust system was designed for single-tenant senders. When the operations team approves a submission from forms@smithjoneslaw.com, the system learns: "this sender is associated with Smith & Jones." Future submissions from that sender auto-route. This works beautifully for firm-specific senders.

But when someone approved one notification from the multi-tenant vendor, the system learned: "this vendor's address is associated with Smith & Jones." Every subsequent notification from that vendor — leads for completely unrelated firms in different practice areas — auto-routed to Smith & Jones. Sixteen leads misrouted before we caught it.

The fix was a denylist: a set of domains that are structurally forbidden from forming single-firm edges in the identity graph. These nodes can't be trusted to any one firm because they represent multiple firms by definition. The denylist is dual-gated — it blocks both the promotion path (preventing the trust record from being written) and the consumption path (preventing existing trust records from being read). Our first fix only blocked promotion; an external reviewer caught the consumption-side gap within hours.

In graph terms, these are nodes with a constraint annotation: "this node may connect to the graph through content-based resolution (parsing the notification body for a firm-specific slug), but never through sender-based resolution (trusting the From address)." The constraint is structural, not heuristic. The same class of bug cannot recur because the edges it requires are forbidden at the schema level.
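
The dual gate can be sketched as one check applied on both the write path and the read path. This is a simplified, application-level illustration rather than the schema-level enforcement described above; the class and function names, the in-memory dict, and the example domain vendor.example are all assumptions.

```python
from typing import Dict, Optional

# Hypothetical denylist entry; real vendor domains would go here.
MULTI_TENANT_DOMAINS = frozenset({"vendor.example"})


def is_multi_tenant(sender: str) -> bool:
    """True if the sender's domain speaks for many firms."""
    return sender.rsplit("@", 1)[-1].lower() in MULTI_TENANT_DOMAINS


class SenderTrust:
    """Sender-to-firm trust records, gated on both write and read."""

    def __init__(self) -> None:
        self._history: Dict[str, str] = {}

    def promote(self, sender: str, canonical_id: str) -> bool:
        """Promotion gate: never write a trust record for a denied sender."""
        if is_multi_tenant(sender):
            return False
        self._history[sender.lower()] = canonical_id
        return True

    def lookup(self, sender: str) -> Optional[str]:
        """Consumption gate: ignore any record that already exists."""
        if is_multi_tenant(sender):
            return None
        return self._history.get(sender.lower())
```

The second gate matters because a bad trust record written before the denylist existed would otherwise keep routing leads to the wrong firm.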

What Makes This a CDP (and What Doesn't)

Strip away the marketing language and a Customer Data Platform is three things:

Ingest data from multiple sources — phone calls, web forms, email notifications, manual entries
Resolve it to a canonical identity — the COM-prefixed UUID namespace with multi-source resolution
Make it queryable from a single pane — one dashboard showing calls and forms unified under one firm

Our system does all three. The identity graph stitches phone platform numeric IDs, vendor slugs, website domains, and sender email addresses to a single canonical identifier. Tombstones handle merges. The multi-tenant denylist enforces structural constraints. The quarantine system ensures no record enters the primary store without a resolved identity.

What it doesn't do — and what separates it from a horizontal CDP like Segment or Adobe Real-Time CDP — is generalize. Our resolution logic is purpose-built for legal intake. The identifier types (CallRail IDs, Intaker slugs, law firm domains) are domain-specific. The multi-tenant constraint (shared legal marketing vendors) reflects an industry-specific data topology.

But that specificity is a feature, not a limitation. A horizontal CDP would require months of custom configuration to replicate the same resolution logic — mapping vendor slugs, handling multi-tenant notification services, building tombstone chains for firm merges. Our system does it out of the box because it was built for exactly this domain.

The Architecture That Emerged

None of this was designed top-down. The translation table was an incident response to invisible webhook data. The tombstone system was built when a firm merge created duplicate dashboard entries. The multi-tenant denylist was a P0 fix for sixteen misrouted leads. The chain resolution was flagged by an external code reviewer during a post-incident audit.

Each layer was added because production broke without it. The result is an identity resolution system that's been hardened by real failures at real scale — not theorized on a whiteboard and deployed hopefully.

The canonical UUID namespace came from the phone platform. The translation table came from a format mismatch between API and webhook payloads. The domain registry came from the need to route form submissions. The sender trust system came from the desire to reduce manual review. The denylist came from a trust system that worked too well on the wrong entity.

Every production incident in the first month contributed a principle to our engineering constitution — a document now containing 40+ invariants, each one citing the incident that earned it. That's not a fragile system. That's a system that has been broken in every way it can break, and sealed each time.

What This Means for Vertical SaaS

The lesson isn't "you should build a CDP." The lesson is that identity resolution is an inevitable problem in any multi-source data platform, and the architectural decisions you make when solving it determine the ceiling of everything you build on top.

If you're ingesting data from more than one external system and presenting it through a unified interface, you will eventually need:

A canonical identifier namespace
A resolution layer that maps external IDs to canonical IDs
A quarantine system for unresolvable records
A merge/tombstone system for identity lifecycle
Constraints on which nodes can form which edges

You can buy this from a horizontal CDP vendor and spend months configuring it for your domain. Or you can build exactly what your domain requires, battle-test it against real production data, and own the infrastructure that makes your product defensible.

We chose the second path. Not because we planned to — but because the data left us no choice.

Built by Areopagus AI Engineering. We build and operate recurring-retainer SaaS products for clients who need AI-powered data infrastructure without the enterprise price tag.

Interested in what a vertical CDP could look like for your industry? Get in touch.

DEV Community
