"Beating 250,000 Mental Comparisons: A Cross-Domain Engineer's Entity Resolution Case Study"

TL;DR

  • Operations/Systems engineer recently moved to the software side via AI collaboration.
  • Built a domain-specific entity resolution tool in a handful of evening sessions with Claude Code.
  • Caught about 99.2% of human-detected reconciliation errors when replayed against 8 weeks of historical data.
  • Turned a "skilled-veterans-only" weekly task into something anyone on the team can run.
  • Design retrofitted unexpectedly well to dual process theory, Gestalt psychology, and anchoring-bias defense.
  • Source business records never reached an LLM. Deterministic pipeline + human review only.

1. The Hidden Problem: When 500 × 500 Becomes a Cognitive Wall

Many companies maintain the same business entities across multiple systems.

  • A retailer tracks SKUs in an internal master AND on Amazon / Rakuten / Shopify exports.
  • A clinic carries patient records in both an EMR and an insurance billing system.
  • A manufacturer holds internal inventory but also receives partner inventory feeds.
  • An accounting team reconciles general ledger entries against bank statements.

These pairs need periodic reconciliation. In the technical literature this is Entity Resolution or Data Reconciliation — a universal problem that nearly every mid-to-large business hits eventually.

The case study here uses the retail SKU vs marketplace listing framing. (The actual industry I work in is intentionally abstracted, but the structure transfers cleanly.) Two systems, ~500 rows each, weekly reconciliation. Skilled humans needed about 3 hours per week. Newcomers, half a day to a full day. Hidden detail: the small row count masks the real difficulty.

Why is 500 × 500 hard?

The 250,000 problem

Manually reconciling 500 × 500 pairs forces a person to evaluate up to 250,000 combinations in their head. Not 1,000 — 250,000. Plus typo tolerance, format variation (full-width vs half-width, mixed scripts, abbreviations, punctuation), and partial matches. Each pairwise judgment is not O(1).
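
To see why each judgment is heavier than a string compare, here's a minimal illustration with hypothetical strings: naive equality fails on full-width/half-width drift, while Unicode NFKC normalization (one standard way to fold that drift) plus difflib's SequenceMatcher, the same similarity primitive the tool uses later, recovers the match.

```python
import unicodedata
from difflib import SequenceMatcher

a = "ｉＰｈｏｎｅ１５"   # full-width export from one system
b = "iPhone 15"          # half-width entry in the other

print(a == b)                                  # False
print(SequenceMatcher(None, a, b).ratio())     # 0.0: no characters in common

# NFKC folds full-width characters to their half-width equivalents
na = unicodedata.normalize("NFKC", a)          # "iPhone15"
nb = unicodedata.normalize("NFKC", b)          # "iPhone 15"
print(SequenceMatcher(None, na, nb).ratio())   # ~0.94: clearly the same item
```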

Brute-forcing this is computationally similar to running a 1,000-node full-mesh ping check vs a flat 1,000-node liveness check. Order-of-magnitude different load.

Working memory overflow

Miller's "magical number" puts our short-term memory at 7 ± 2 chunks (Miller, 1956). Hunting matches across 1,000 candidates with format drift continuously overflows working memory and pegs System 2 (slow thinking) for the entire session. The 3-hour exhaustion experienced by veterans isn't a complaint — it's a neurological inevitability.

"Short to do" doesn't equal "easy to do" for cognitive labor.

Reproducibility decay

A one-off reconciliation can be brute-forced. But when the task repeats weekly across 10+ weeks, judgment drift becomes unavoidable:

  • "Last week I matched 'A Co.' and 'A. Company' as the same entity. This week I treated them as different."
  • "Last week I tolerated typo X. This week I rejected it."

This drift is what really breaks data quality long-term. It's the same structural failure mode as "config review standards differ by reviewer" in infrastructure operations.

The actual target

So the real problem the tool solved was not "shorten 3 hours per week" but:

250,000 judgments × 10 weeks of consistent reproducibility — a quality bar humans can't physically sustain — backed by a deterministic machine.

Plus removing the skill dependency. "Only one veteran can do this in 3 hours" is a single point of failure. After the tool: anyone could run it with consistent quality.

2. Background: Who I Am and What I Was Solving

I'm an Operations/Systems engineer. Configuration, validation, runbook authoring, monitoring, troubleshooting — that side of the house. Software development was not my primary craft, though scripting was always part of the job.

I'd recently moved into a new business domain (about 2 months in) and the tooling target system was something I'd only been touching for ~1 month. From the user side I'd seen the workflow longer, but not as a developer.

Translation: design / validation / runbook discipline solid. Python and application development essentially unfamiliar.

This article is not a "look what I shipped" piece. It's a record of how operations-side disciplines transferred unchanged into AI-assisted software work in an unfamiliar domain.

Who this article is for

| Reader | Useful sections |
| --- | --- |
| Operations / SRE engineers exploring AI assistance | Everything |
| Mid-career engineers moving across technical domains | Background, Architecture, Cognitive Design |
| Engineers new to AI-assisted development | Architecture, Cognitive Design, PII |
| Managers thinking about AI for their teams | Results and the cognitive-load argument |

3. PII / Compliance Considerations

A question that always comes up in comments on entity-resolution articles: where does the data go? Worth answering up front.

In this implementation:

  • Source business records never reach any LLM. Both input files (internal master + external system export) are read locally by a Python script.
  • Matching is fully deterministic. Pandas, openpyxl, and difflib.SequenceMatcher for similarity. No embedding API. No remote inference at runtime.
  • The LLM's role is code-side, not data-side. Claude Code helped write the matching logic, the validation scripts, the design review, and the documentation. None of the actual records were ever sent.
  • For testing only, masked synthetic data was used in prompts. Real names, amounts, and addresses were replaced with synthetic equivalents before any prompt left the local environment (a sketch follows this list).
  • Edge cases stay with humans. When the deterministic pipeline can't decide, it surfaces a flagged row for human review — not for LLM second opinion.
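
The masking step itself isn't shown above; here's a minimal sketch of the idea, assuming pandas DataFrames (column names like stakeholder, amount, and address are hypothetical):

```python
import pandas as pd

def mask_for_prompt(df: pd.DataFrame) -> pd.DataFrame:
    """Replace real values with synthetic equivalents before any text
    leaves the local environment. Column names are hypothetical."""
    masked = df.copy()
    # Stable placeholders: the same real name always maps to the same
    # token, so matching behavior is preserved without exposing the value
    name_map = {name: f"VENDOR_{i:03d}"
                for i, name in enumerate(masked["stakeholder"].unique())}
    masked["stakeholder"] = masked["stakeholder"].map(name_map)
    # Coarsen amounts: magnitudes survive, exact figures do not
    masked["amount"] = (masked["amount"] // 100_000) * 100_000
    # Free-text locations are dropped outright rather than pseudonymized
    masked["address"] = "ADDR_REDACTED"
    return masked
```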

This separation is intentional. The matching task is well-suited to deterministic logic. LLMs would only add cost, latency, and compliance exposure for no quality gain.

If your team has even a soft "no business data into external AI" policy, this pattern is fully compatible.

4. Architecture: Two-Stage Matching + Cognitive Gates

Stack

  • Python 3.11
  • pandas + openpyxl (Excel I/O, color-coded output)
  • difflib.SequenceMatcher for fuzzy similarity
  • Rule-based throughout. No machine learning.
  • ~1,100 lines, single script.

Phases

```
Phase 1:   Match by exact stakeholder name (or alias group)
Phase 2:   Cross-match by name similarity ≥ 0.6 (rescue typos)
Phase 3:   Last-name-only + structural match (single-typo tolerance)
Phase 4:   Duplicate-registration detection (same stakeholder + similarity ≥ 0.8)
Phase 5:   Rescue rows with no stakeholder name (attribute match)
Phase 5.5: Attribute-mismatch pair rescue (identifier similarity ≥ 0.7, stage 2)
Phase 6:   Row generation + color decision
```

The score function (key gates)

```python
from difflib import SequenceMatcher

def compute_score(row_a, row_b):
    # Hard gate: region must match — kills cross-region false positives
    if row_a["region"] != row_b["region"]:
        return 0.0
    # Hard gate: numeric attribute must be close enough
    if abs(row_a["value"] - row_b["value"]) > THRESHOLD:
        return 0.0
    # Identifier gate: row_b's identifier must be embeddable in row_a's identifier
    if not is_identifier_match(row_a["addr"], row_b["identifier"]):
        return 0.0
    # Sub-identifier gate: anchoring-bias defense
    if row_b["sub_id"] not in row_a["addr"]:
        return 0.0
    # Soft scoring (only after every hard gate passed); the other two
    # candidate scores come from helper scorers elided in this excerpt
    similarity = SequenceMatcher(None, row_a["name"], row_b["name"]).ratio()
    score = max(identifier_match_score(row_a, row_b), similarity,
                value_fallback(row_a, row_b))
    return score if score >= 0.6 else 0.0
```

Why this shape?

The retail SKU framing helps here. The same product on a marketplace might appear as iPhone15 in your master and iPhone 15 Pro Max on the marketplace. Same item family, different surface form. Two key insights:

  1. Hard gates first. "Different region" or "value difference > N" are absolute disqualifiers. Run them before any expensive similarity computation (see the sketch after this list).
  2. Soft scoring last. Once hard gates pass, compute similarity — but cap below 0.6 as "uncertain, surface to human."
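
As a concrete sketch of insight 1 (field names and the threshold are illustrative), blocking on the cheapest gate means the expensive similarity pass never sees the full 250,000-pair cross join:

```python
from collections import defaultdict

THRESHOLD = 2_000_000  # illustrative; the real value tolerance is domain-specific

def candidate_pairs(rows_a, rows_b):
    """Yield only the pairs that survive the cheap hard gates."""
    # Region as a blocking key: cross-region pairs are never even
    # generated, which alone collapses the 500 x 500 cross join
    by_region = defaultdict(list)
    for b in rows_b:
        by_region[b["region"]].append(b)
    for a in rows_a:
        for b in by_region[a["region"]]:
            # Cheap numeric gate before any string work
            if abs(a["value"] - b["value"]) <= THRESHOLD:
                yield a, b

rows_a = [{"region": "east", "value": 34_900_000, "name": "A Co."}]
rows_b = [{"region": "east", "value": 36_900_000, "name": "A. Company"},
          {"region": "west", "value": 36_900_000, "name": "A. Company"}]
print(list(candidate_pairs(rows_a, rows_b)))   # only the same-region pair survives
```

With ~500 rows per side spread across even a handful of regions, the candidate set shrinks by roughly the number of regions before any string work happens.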

Why not ML / Vector DB / embeddings?

Deterministic rule-based was chosen on purpose. Auditability was the requirement. When a flagged row is wrong, the operations team has to be able to trace exactly which gate fired and why. A black-box similarity score of 0.81 with no explanation cannot be reviewed, cannot be unit-tested, and cannot be defended in a compliance audit.

ML is a fine choice when you have labeled training data, training infrastructure, and a continuous evaluation pipeline. None of these applied here. The operating constraint was: "anyone on the team should be able to read the code and know why it decided what it decided." That constraint forces deterministic logic.

Abstracted structure

| Domain-specific term | Abstract concept |
| --- | --- |
| Item / SKU | Entity |
| Stakeholder (vendor / agent) | Stakeholder attribute |
| Price / Amount | Primary numeric attribute |
| Address / Location | Identifier (multi-attribute) |
| Building / SKU name | Auxiliary identifier |
| Detail number / barcode | Sub-identifier |
| Format variation (kana/latin/case) | Data quality issue |
| Domain judgment | Tacit knowledge |

This is a universal "match entities across two systems with format drift" problem. The pattern reappears in EC, healthcare, HR, accounting, manufacturing, publishing — anywhere two systems represent the same business object differently.

5. Cognitive-Science Design Principles (the Twist)

I didn't design this thinking about cognitive science. I built it, it worked, and only afterwards in a structured Gemini conversation did the underlying principles surface. The retrofit fits unsettlingly well.

5.1 Dual process theory (Daniel Kahneman)

The two phases map onto two thinking modes:

  • System 1 (fast) = Phases 1–5. Fuzzy "is this roughly the same thing?" — similarity scores, identifier matching, attribute closeness.
  • System 2 (slow) = determine_color(). Strict checks for value mismatch, format inconsistency, identifier mixing.

Color-coded human review gets the System 1 fuzzy pass plus the System 2 strictness annotation, which is exactly the input shape humans need to make a final call.
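
determine_color() itself isn't listed in this article, so the following is a hypothetical reconstruction of the shape described: strict, independent checks that annotate a confirmed match rather than veto it (field names, color names, and the normalization step are assumptions):

```python
import unicodedata

def _norm(s):
    # Fold full-width/half-width forms and strip separators before comparing
    return unicodedata.normalize("NFKC", s).replace(" ", "").replace("-", "")

def determine_color(row_a, row_b):
    """System 2 pass over an already-matched pair: strict checks that
    annotate, never un-match. Hypothetical reconstruction."""
    tags = []
    diff = abs(row_a["value"] - row_b["value"])
    if diff:
        pct = diff / max(row_a["value"], row_b["value"]) * 100
        tags.append(f"[Value mismatch] diff ¥{diff:,} ({pct:.1f}%)")
    if (row_a["identifier"] != row_b["identifier"]
            and _norm(row_a["identifier"]) == _norm(row_b["identifier"])):
        tags.append("identifier format inconsistent")
    # Color encodes severity for the review sheet (names are assumptions)
    color = "red" if len(tags) >= 2 else ("yellow" if tags else "green")
    return color, tags
```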

5.2 Gestalt psychology

Humans recognize "wholes," not character sequences. iPhone15 and iPhone 15 Pro Max feel like the same product family even though strict string equality fails. So:

```python
import re

def is_identifier_match(addr_a, identifier_b):
    """Recognize chunked identity even with mixed scripts and separators."""
    # Split on separators (whitespace, hyphens, underscores) rather than on
    # the alphanumeric content itself, so every meaningful chunk survives
    chunks = re.split(r'[\s\-_]+', identifier_b)
    # Every chunk of two or more characters must appear somewhere in addr_a
    return all(chunk in addr_a for chunk in chunks if len(chunk) >= 2)
```

Matching by chunks survives whitespace, separator, and script variation.
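
Two quick calls (hypothetical strings) show the chunking behavior:

```python
# All chunks of "iPhone 15 Pro Max" appear inside the master identifier
print(is_identifier_match("iPhone15ProMax-256GB", "iPhone 15 Pro Max"))  # True
# "Pro" and "Max" are missing from the shorter master entry
print(is_identifier_match("iPhone15", "iPhone 15 Pro Max"))              # False
```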

5.3 Anchoring & confirmation bias defenses

Hard gates exist to deny human-style intuitive shortcuts:

  • "Same price, must be the same item" — rejected by sub-identifier gate.
  • "Same name, must be the same person" — rejected by region gate.

The machine's job is to be coldly skeptical exactly where humans get over-confident.

5.4 Reducing human cognitive load (Human-in-the-Loop)

When a human is asked to confirm a flagged row, they don't get an opaque "match score 0.62". They get a one-line annotation:

```
Same entity matched | [Value mismatch] diff ¥2,000,000 (5.4%)
(A: ¥34,900,000 / B: ¥36,900,000) · identifier format inconsistent
```

The human doesn't waste cycles re-deriving why the row was flagged. Cognitive load drops sharply.
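
The output side isn't shown in the article; with openpyxl from the stack section, a color-coded review row could look roughly like this (column layout and fill colors are assumptions):

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill

FILLS = {
    "yellow": PatternFill("solid", start_color="FFF2CC"),
    "red":    PatternFill("solid", start_color="F4CCCC"),
}

wb = Workbook()
ws = wb.active
ws.append(["Entity A", "Entity B", "Verdict", "Annotation"])

# One review row: the fill color carries severity, the annotation the reasons
tags = ["[Value mismatch] diff ¥2,000,000 (5.4%)", "identifier format inconsistent"]
ws.append(["iPhone15ProMax-256GB", "iPhone 15 Pro Max",
           "Same entity matched", " · ".join(tags)])
for cell in ws[ws.max_row]:
    cell.fill = FILLS["yellow"]

wb.save("review.xlsx")
```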

5.5 Don't automate the ghost

This part borrows from Ghost in the Shell. Some judgments depend on tacit business knowledge that can't be reduced to rules. Don't build heuristics that pretend to encode them. Surface the row as a caution signal and let a human apply the tacit layer.

Tightening the logic isn't a path to recreating the ghost.
It's a path to revealing where the ghost is needed.

Mapping summary

| Cognitive concept | Implementation |
| --- | --- |
| System 1 (fast) | Phases 1–5 (fuzzy matching) |
| System 2 (slow) | determine_color() strict checks |
| Two-stage / dual-pass | Stage 1 + Stage 2 (Phase 5.5) |
| Gestalt grouping | similarity / is_identifier_match |
| Anchoring defense | Sub-identifier gate, identifier gate |
| Cognitive load reduction | Aggregated [reason] diff X annotations |
| Human-in-the-Loop | Caution signals for tacit-knowledge zones |

6. Results

Recall on 8 weeks of historical data

| Metric | Value |
| --- | --- |
| Errors flagged by humans (excluding outlier weeks) | ~130 |
| Errors caught by the tool | ~129 |
| Recall | ~99.2% |

The single missed case was annotated by the human reviewer as "even a human couldn't decide here." Effectively the tool catches every case where a human commits a confident verdict.

(Caveat: this is recall against 8 weeks of one team's data, not a benchmark claim. Different domains will need their own measurement.)

Time and skill load

| Item | Before | After |
| --- | --- | --- |
| Skilled veteran throughput | ~3 hrs/week | ~30 min/week (review only) |
| Newcomer throughput | half a day to a full day | ~30 min/week |
| Skill dependency | Yes (single point of failure) | No (anyone can run it) |

The time number understates the value. The real shift is breaking the skill SPOF. Veteran out sick, leaves, or buried in another priority — work continues at the same quality.

A note on false positives

Recall is 99.2%, but the tool is intentionally tuned for higher recall over higher precision. False positives — pairs flagged for human review that turn out to be fine — are accepted as the trade-off. The ~30 min/week of human review handles them without strain.

In a no-human-in-the-loop deployment this trade-off would be very different. Here, false positives are cheap (a glance from a human reviewer) and false negatives (missed reconciliation errors) are expensive (data drift propagates into business reports).

7. The Flowchart

Drawing the judgment flow as diagrams surfaced things the code review didn't. Below are the four phases as separate figures, in execution order.

7.1 Phase 1: Hard Gates (sequential disqualifiers)

Region → numeric value → auxiliary identifier → sub-identifier. Each gate is an absolute disqualifier: any "No" drops the pair. The order matters — cheapest disqualifiers run first.

7.2 Phase 2: Soft Match

Once a pair clears all hard gates, compute_score evaluates a soft similarity. Below 0.6 → drop. At or above → lock the pair as the same entity.

7.3 Phase 3: Parallel Flag Checks

For confirmed matches, six independent checks fire in parallel. Each surfaces a "this matched, but here's a discrepancy" signal. Tags are aggregated; there is no early-return contamination between checks.
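
The article doesn't name the six checks, but the no-early-return property falls out naturally if each check is a pure function returning an optional tag. A sketch with two hypothetical checks standing in for the six:

```python
def check_value_gap(pair):
    a, b = pair
    diff = abs(a["value"] - b["value"])
    return f"[Value mismatch] diff ¥{diff:,}" if diff else None

def check_identifier_format(pair):
    a, b = pair
    return ("identifier format inconsistent"
            if a["identifier"] != b["identifier"] else None)

# Because each check is a pure pair -> optional-tag function driven from a
# list, no check can short-circuit or contaminate another
CHECKS = [check_value_gap, check_identifier_format]

def flag_pair(pair):
    return [tag for check in CHECKS if (tag := check(pair)) is not None]
```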

7.4 Phase 4: Final Verdict and Drop Aggregation

Aggregate the tags into a color verdict. Drops from Phase 1 and Phase 2 converge into the "Unmatched" lane, surfaced standalone in the human-review output.

Things visible only after rendering as a diagram

These were invisible while reading code, only obvious once drawn:

  1. Phase 1 hard gates are ordered by computational cost. Region → numeric → auxiliary → sub-identifier. I placed them by intuition; the diagram showed they were already optimal — cheapest disqualifiers first.
  2. Phase 3 parallel flag checks are genuinely independent. Six checks fire in parallel with no early-return contamination. The diagram confirmed there was no silent dependency between them.
  3. All Drop 1 through Drop 5 paths converge to the same Unmatched node. I was throwing away the drop reason. Re-running "why was this pair rejected?" was impossible. Fix: log the drop reason in the row annotation (a sketch follows this list).
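
The fix in point 3 can be as small as returning a reason alongside the score. A sketch against the compute_score shape from section 4 (soft_score stands in for the elided soft-scoring step):

```python
def compute_score_with_reason(row_a, row_b):
    """Same gates as compute_score, but every drop now says why."""
    if row_a["region"] != row_b["region"]:
        return 0.0, "drop: region mismatch"
    if abs(row_a["value"] - row_b["value"]) > THRESHOLD:
        return 0.0, "drop: value gap over threshold"
    if not is_identifier_match(row_a["addr"], row_b["identifier"]):
        return 0.0, "drop: identifier not embeddable"
    if row_b["sub_id"] not in row_a["addr"]:
        return 0.0, "drop: sub-identifier absent"
    score = soft_score(row_a, row_b)  # the soft-scoring step from section 4
    if score < 0.6:
        return 0.0, "drop: soft score below 0.6"
    return score, "matched"
```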

Drawing the flowchart is roughly the same act as drawing an infrastructure topology before going live. The diagram is the rubber duck.

8. Wrap-up

Three transferable lessons from this build:

  • Cognitive load is the hidden cost of "short" repetitive judgment tasks. Headcount-hour math undersells the burnout reality and skill-SPOF risk.
  • Cognitive science principles fall out of good design retroactively. I didn't design with them in mind; the principles became visible only through structured review (with a second AI). If your design retrofits to known principles, that's confirmation. If it doesn't, that's a smell.
  • LLMs do NOT have to touch your data. Most entity resolution work doesn't need them at all. Use them for code, design review, and documentation. Keep the business records local and deterministic.

The implementation itself is internal-use only and won't be open-sourced. The patterns generalize cleanly to any two-system entity reconciliation: EC, healthcare, HR, accounting, manufacturing, publishing.

9. What's Next

Coming in Part 2: how this whole thing got built in the first place — the AI collaboration patterns, the anti-patterns I hit, and the cross-domain disciplines that transferred from operations to software development. (Link to A2 once published.)

Comments on entity resolution, cognitive load in repetitive tasks, or cross-domain engineering experiences are welcome.
