What Phone Sanitization Revealed About Our Data

#database #data #normalization #phone

I Thought This Would Take a Day

The task: normalize phone numbers across a sharded production system. Multiple country databases, years of inconsistent formats from different integrations, the usual archaeology. I figured I'd write a script Monday, handle edge cases Tuesday, open a PR Wednesday.

It took the whole week. Not because phone normalization is hard, but because the problem kept exposing things I hadn't thought about.

The framework I didn't need

When I read the schema I saw patterns everywhere. Phone columns scattered across dozens of tables with different names, different formats, different roles in the data model. My brain immediately went to: this should be generic.

Within a few hours I had a design for dynamic schema discovery, automatic phone column detection, a table classification layer, configurable skip policies, dynamic country resolution per shard. The kind of system that could handle any similar migration in any future codebase.

I started building it. The abstractions felt justified. Then I asked myself: why does this actually need to be general?

It didn't. This was a maintenance command that would run once, maybe twice, against one specific system I understood. The reusability I was designing for was completely hypothetical. YAGNI applies to migrations too, maybe more so, because the cost of wrong abstractions in a migration isn't just complexity, it's also the surface area for subtle bugs in code that touches production data.

I deleted most of it.

The final command was explicit about everything: scan the accounts table in each country shard, normalize the phone fields used for identity, detect duplicates using a temporary SQL table, store the conflicting account IDs, normalize phone columns in downstream tables, skip rows belonging to duplicate accounts since those need human review, produce a structured JSON report. Every table name in the config. Every column name spelled out. No dynamic resolution, no introspection.

It felt almost embarrassingly simple. Then it started feeling right.

There's a principle I keep running into that I'd call the maintenance command corollary: tools that run infrequently and touch sensitive data should optimize for auditability over extensibility. The engineer running this six months from now, under pressure, needs to be able to read it and trust it quickly. They don't need it to be clever.

Normalization changes identity

This was the part I hadn't fully thought through, and it reframed the entire project.

Two phone numbers that look different can represent the same subscriber. +1 (800) 555-0123 and 18005550123 are the same number once you apply E.164 normalization. When you collapse those representations across millions of account records, accounts that have been treated as distinct for years suddenly resolve to the same person.

We had duplicates nobody knew about. They had existed quietly in the system across multiple shards because no one had ever compared canonical forms at scale. The normalization didn't create the duplicates, it just made them visible for the first time.

This is the thing about data migrations that makes them genuinely different from feature work. A feature has an off switch. A migration that has run against production data usually doesn't. If you normalize a phone number and discard the original, you've made a permanent decision. If your normalization logic has a bug, you may not be able to reconstruct what the original value was. The blast radius of a wrong decision is bounded in different ways than application bugs.

This is also why duplicate detection ended up being the majority of the actual work. The normalization logic was straightforward. The harder problem was: once we find accounts that now share a canonical phone number, what do we do? Automatically merge? Pick one? The answer was neither: surface them in the report and skip any downstream normalization that touches those accounts until a human reviews them. Automation should stop at the boundary of decisions it doesn't have enough context to make.

Pushing aggregation into the database

My first instinct for duplicate detection was in-process: stream records, build a map of canonical phone to account IDs, flag any key with more than one entry. Reasonable approach, and for small datasets it would have been fine.

The problem was scale combined with the shard architecture. Some country databases have tens of millions of accounts. Holding all canonical phone values in memory simultaneously, across a migration that needed to be resumable, was going to be awkward and wasteful.

What I did instead: buffer inserts of (canonical_phone, account_id) tuples into a temporary table, flush in batches, then run a single GROUP BY canonical_phone HAVING COUNT(*) > 1 at the end. The database does the aggregation, uses its own indexes, and produces exactly the set I need. Total memory footprint on the application side stays flat regardless of table size.

It's not a novel pattern. But I notice engineers, myself included, sometimes reach for in-process solutions out of habit even when the data is already in a database that is extremely good at this specific class of problem. The embarrassing version of this mistake is reimplementing GROUP BY in a hash map.

Same column type, different semantics

I went in assuming I could treat phone numbers uniformly. They're all phone numbers. Normalize them all the same way. That assumption broke down quickly.

Some phone fields are identity fields. They are how the system finds and matches accounts. Normalizing them is what drives the deduplication logic. Others are contact preferences or communication records that happen to store a phone number but play no role in identity resolution. Others live in event logs that represent historical facts and probably shouldn't be rewritten at all.

The data type is the same. The semantics are completely different, and the right migration behavior depends on the semantics, not the type.

This also came up in a subtler way when I was writing the config schema. I had a field called owner_uid_column because I was thinking about the mechanism: this table has a foreign key to the account UID. But what I actually meant was "the column that tells me which account owns this row." I renamed it to owner_column. Small change, but the original name encoded an assumption about key type that wasn't relevant to what the config was trying to express. Config that describes intent ages better than config that describes implementation.

The broader principle here is that column relationships don't imply identical business rules. Two tables can both have a foreign key to the same parent entity and still require completely different handling in a migration. The relationship tells you where data comes from, not what it means.

Behavioral equivalence as a refactoring strategy

I refactored this several times over the week, removing abstractions in each pass. The thing that made that safe was a specific validation approach: I compared structured outputs between versions rather than reviewing code.

After each refactor, I ran both versions against the same dataset and diffed the results: duplicate groups found, account IDs flagged, changed rows, unchanged rows, skipped rows, invalid rows, the final report structure. They had to be byte-for-byte identical. Not functionally similar, identical.

This is a stronger guarantee than code review. Code review catches things that look wrong. Output comparison catches things that are wrong. The implementation can change arbitrarily as long as the observable behavior doesn't change. That's the actual definition of a correct refactor, and having a machine verify it rather than a human read it is more reliable.

It also changes the psychology of simplification. Deleting an abstraction feels risky when you're relying on your own judgment that the replacement is equivalent. It feels much less risky when you have a diff that shows zero behavioral difference. I deleted a few hundred lines this way that I wouldn't have been confident deleting based on reading alone.

Where this landed

The version I shipped was smaller and more boring than what I had on Tuesday. Every table explicit. Every column explicit. Config that reads like a description of the business problem rather than a description of the implementation. A report that tells you exactly what happened, row by row, so the person running it next time can audit the results without reading the code.

I've been writing software long enough to know that "boring" and "correct" are not in tension. For a migration tool that runs occasionally and touches data you can't easily recover, boring is exactly what you want.