DEV Community: J.S_Falcon

SaaS Hacking with an AI Director: Build a Sample, Hand It Off to Specialists

J.S_Falcon — Thu, 14 May 2026 21:24:47 +0000

TL;DR

A renewal notice arrived for our attendance SaaS: 300 people, ¥200/person/month, ¥720,000 annually.
The number was acceptable. The problem was fit, not cost — two requirements (actual travel-cost reconciliation; network-based in-office determination) fell outside the standard SaaS spec.
I rebuilt the system in Google Apps Script (GAS) in a few hours, with AI as the coding resource. I call this human role an "AI Director."
Beyond cost: agility (same-day requirement-to-system), data ownership (lock-in avoidance), and a new development flow — build a working sample, then hand it off to a SIer for production lift.
Bonus thoughts on where programmer knowledge stays valuable (specialist domains, post-AI quality review). A gold-rush analogy.

Chapter 1: Requirements — Why the SaaS Stopped Fitting

A generic SaaS is designed around business processes shared by many companies. That's a sound design philosophy. But for an organization carrying company-specific logic, the company's business doesn't fit what the SaaS provides.

Two requirements fell outside the standard spec in our case:

Travel-expense reconciliation against actuals: Specific rules on commuter-pass routes and per-day route determination. The SaaS's default aggregation couldn't handle it.
Attendance determination by network: We wanted to decide "remote vs. in-office" by whether the punch came through the corporate network. Location-based determination wasn't precise enough.

Two ways to do this with a generic SaaS: pay for a custom development option, or use a CSV export + post-processing workaround. The former inflates cost. The latter produces a kind of inverse vendor lock-in — your operations get locked into manual workflows that only you understand.

The decision was simple. Hire AI as an external coding resource and rebuild the system in GAS to fit our requirements.

From implementation start to operational verification, it took a few hours. The AI wrote the code. Humans handled requirements and specification adjustment. In this article, I'll call this human role an "AI Director." I use the term in the sense of an individual practitioner directing AI as a coding resource — not the executive "Director of AI" role established in larger organizations. This role overlaps significantly with what's framed as a "Citizen Developer using AI" in adjacent discussions; I emphasize the directing relationship with the AI rather than the citizen vs. professional dichotomy.

Chapter 2: Breaking Through Physical Constraints (Ver.1 to Ver.4)

During development, the location-acquisition approach switched four times. Each switch was a discovered constraint and a design decision to work around it.

Ver.1: HTML5 Geolocation

GPS on smartphones gives reasonable precision. But punches mostly come from PCs, where Geolocation falls back to Wi-Fi triangulation or IP-based estimation. The errors landed in the kilometer range.

Not enough precision for in-office determination. Rejected.

Ver.2: IP-Based Estimation

I mimicked the fallback approach used by many SaaS products. Call an external API to estimate geographic location from client IP, then convert to an address string.

The errors here were also large. Carrier routing causes "home" to surface as "near Tokyo Station" in many cases. Not viable.

Ver.3: Google Maps Reverse Geocoding (Built into GAS)

GAS includes a Maps service. Maps.newGeocoder().reverseGeocode(lat, lng) performs reverse geocoding at no cost. This removed the need for an external API key for lat/lng-to-address conversion.

Precision issues remained, but on cost, this had a clear edge over the SaaS.

Ver.4: Master Matching as Proof of Attendance

The final solution wasn't to improve location precision. It was to change the determination axis itself.

When a punch comes through the corporate network, the global IP is the company router's fixed IP. Match this IP against a master of corporate-site addresses. If it matches, "in-office" is confirmed. Location precision becomes irrelevant to the question.

This reduces to deterministic IP-master lookup, so error doesn't structurally arise. For the purpose of attendance determination alone, IP matching is more accurate to the requirement than Geolocation.

Chapter 3: What the System Does

The final form has:

One-tap punch: Browser, one button, clock-in/out recorded.
Dynamic location recording: IP master confirms "in-office"; otherwise Reverse Geocoding records the address string.
Automated monthly aggregation: Per-person attendance-day counts, computed from corporate-site master matching.
Data ownership: Punch data accumulates in our own Google Spreadsheet. No export operation. Even after cancelling the SaaS, the data remains.

The last point matters for vendor lock-in avoidance. With a SaaS, cancellation typically means losing access to historical data, or being constrained to CSV dumps. Internal development places the data structure itself under our control.

Chapter 4: AI's Asymmetric Leverage

Most AI-coding-support articles frame productivity as "the programmer becomes N× faster."

That's a vertical-axis story. Within the same role, throughput rises. Faster typing, faster code review, faster refactoring.

What happened here was a different direction of leverage.

When a person with domain knowledge uses AI, a business system that previously went through "business side → engineer outsourcing → implementation → review" can be built directly, skipping the intermediate steps.

User	How AI is used	Effect
Programmer	Throughput improvement in the same role	N× productivity along the vertical axis
Domain expert	Skipping intermediate steps, reaching implementation directly	"Couldn't implement" becomes "can implement"; non-linear arrival

The second form of leverage is asymmetric in market terms. Programmer productivity gains accelerate competition among programmers. The domain expert's AI use crosses the boundary of professional categories entirely.

Acceleration along the vertical axis, versus crossing the boundary. These are leverages of different types. The AI Director embodies the latter.

Chapter 5: Using the Sample as a Deliverable — The Specialist Handoff Option

Pack all the attendance requirements into a sample system that runs for at least one person. From there, you could spend time building login management, change-request approval flow, database management, and other production-grade elements yourself.

But at this point, handing the work off to a SIer (system integrator) is a viable option.

You have everything the business wants captured, a working sample, and the GAS code. This amounts to having completed requirements definition, part of the basic design, functional design, and detailed design.

If you simply ask the SIer to turn this into a system that all employees can use, the cost should be significantly lower than traditional outsourcing.

Further, if a proper SIer delivers a production-quality system along with specifications and system-design documents as deliverables, the only remaining changes are scale adjustments (employee-count changes) and rare-case incorporation. By feeding the specifications and system design back into AI as the current environment, you may be able to handle those changes yourself.

Build the logic yourself, let specialists raise it to production quality, and own the deliverables.

The current AI isn't a magical do-anything tool — it's an extension of your own thinking. Recognize that your own limit is AI's limit. When a problem doesn't yield in 30 minutes, list it as an open issue. Then have others solve it. Or have them review and suggest a path forward.

The usual system development, with AI in the middle reducing the issue count. This is what I see future system development trending toward.

Chapter 6: Conclusion

Saving ¥720,000 annually isn't the primary outcome here.

The primary outcome is the ability to rewrite the system the moment business requirements shift. Instead of waiting for the SaaS vendor to add features, the system follows on the day you write the requirements. This agility isn't fully captured by cost calculations.

In the AI coding era, the market value of "being able to write code" is in relative decline. What rises is the ability to define what should be built, instruct the AI correctly, and verify the output.

It's less a new profession than the arrival of an era in which people who already hold business knowledge can — by having AI as an implementation layer — restructure their own business domain as a system.

The ability to operate as an AI Director — defining requirements, having AI build, and integrating into operations — fits as one of the basic survival strategies in today's business environment.

Bonus: Where Programmer Knowledge Stays Valuable

This article leans toward the AI Director perspective. Let me also organize patterns where programmer knowledge stays valuable.

Specialization in Areas Where AI Falls Short

Real-time control, embedded systems, security, formal verification, mission-critical domains. Running AI-written code without verification isn't socially acceptable in these areas. Fields requiring depth of specialist knowledge and robustness remain programmer territory.

This isn't a domain an AI Director can cross into. It's a direction in which programmers become purely stronger.

Monetizing Specialist Knowledge

Separately, I had AI draft a contract and asked a lawyer for the final review. With 2-3 comments, it was done for ¥5,000.

This is one illustration of how specialists sit in the AI era. Even when AI does the drafting, specialist review retains its value — legal responsibility, hallucination mitigation, expert judgment.

The same structure applies to programmers. The sample that an AI Director builds needs SIer/programmer review at the stage of lifting it to production quality. The role shifts from "time spent writing code from scratch" to "time spent evaluating code and raising it to production quality."

In an era where production has become easy, you don't have to stay on the production side. You can move to the side that raises quality, or builds the systems. A gold-rush analogy fits: rather than swinging a pickaxe to dig for gold, you can be the one selling pickaxes, or refining the gold.

AI now covers the areas that have historically consumed programmers' time. The opened-up time can be redirected toward deepening knowledge of methodology and algorithms. Let AI handle production. Move to the side that critiques as a specialist. This may be the model pattern for programmers and, by extension, all specialists.

Why 'Trivial' Tech Travels Across Industries: The 5-Layer Diagonal-Axis Engineer Framework

J.S_Falcon — Sun, 10 May 2026 06:55:29 +0000

TL;DR

I'm a non-coding engineer (operations background) who builds tools through AI collaboration.
This post argues that non-coding engineers — not programmers — are positioned as AI's biggest beneficiaries in the current era.
The diagonal-axis engineer thrives via 5 layers: dimensional crossover, value asymmetry, handover ability, ownership avoidance (AI-to-AI handover spec), and diagram-as-source.
"Trivial" tech (pandas + Excel batch files) wins in business contexts because operators can pick it up and adapt it.
The market value of the AI era boils down to the ability to travel across industries.

1. An Ode to "Trivial" Tech, Continued from This Morning

This morning I wrote about solving the "can't send business data to LLMs" problem with one line of pandas + Faker. (See the earlier post: How I Built a Masking Tool Without Showing AI Any Real Data.)

In the programmer world, that's:

"pandas? Yeah, sure."
"Faker? Old hat."
"Column-wise shuffling? Standard de-identification."

— completely trivial tech.

But in the non-engineer world:

"This eliminates 8 hours of work per month."
"I can discuss it with AI without sending real data."
"I have a scaffold for the masking tool now."

is the response. The same tech becomes either "trivial" or "revolutionary" depending on the evaluation axis.

This value asymmetry is no accident — it's a structural feature of the AI collaboration era. The thesis of this post:

Non-coding engineers, not programmers, are positioned to become the biggest beneficiaries of the AI era. This is realized by arming themselves with the 5-layer structure of the diagonal axis.

2. Vertical, Horizontal, Diagonal: Three Patterns of AI Use

First, let's organize 4 positions on 2 axes (coding ability × domain knowledge):

quadrantChart
    title Position of the Diagonal-Axis Engineer
    x-axis "Coding Low" --> "Coding High"
    y-axis "Domain Low" --> "Domain High"
    quadrant-1 "Diagonal-Axis Engineer"
    quadrant-2 "Domain Owner"
    quadrant-3 "Non-Coding Engineer"
    quadrant-4 "Programmer"

The movement from the bottom-left "Non-Coding Engineer" to the top-right "Diagonal-Axis Engineer" is the theme of this post. Building on this matrix, let's look at three patterns of AI use:

Axis	User	AI's Role	Effect
Vertical (same-dimension acceleration)	Programmer	Speed-up of same skill	2x faster typing / refactoring
Horizontal (same-dimension expansion)	Programmer	Skill width expansion	Frontend specialist learning DB design
Diagonal (dimension crossover)	Non-coding engineer	Compensating missing capabilities	"Can design but can't code" → reaches implementation independently

Vertical + Horizontal = linear expansion that doubles the slope of the graph.
Diagonal = a non-linear shift where the line itself diverges (paradigm shift).

The leverage effect is Diagonal >>> Vertical + Horizontal. 2x speedup vs 0 → 1 crossover — the latter is overwhelmingly larger.

The diagram of how to reach the diagonal-axis engineer position:

graph LR
    ZERO["Non-Coding Engineer<br>Coding:× Domain:×"]
    DOM["Domain Owner<br>Coding:× Domain:○"]
    PROG["Programmer<br>Coding:○ Domain:×"]
    DIAG["Diagonal-Axis Engineer<br>Implementation via AI<br>Domain:○"]

    ZERO -->|Field Experience| DOM
    ZERO -->|Vertical: Coding Study| PROG
    DOM ==>|Diagonal: AI Collaboration - Short Path| DIAG
    PROG -.->|Diagonal: Domain Acquisition - Slow Path| DIAG

    style DIAG fill:#9f9,stroke:#393,stroke-width:3px
    style DOM fill:#ff9

The thick line shows the short path (Domain Owner via AI collaboration). The dashed line shows the slow path (Programmer acquiring domain knowledge). The defining feature of the AI collaboration era: the latter requires time and motivation, while the former has its technical gap filled by AI.

3. The 5-Layer Structure of the Diagonal-Axis Engineer

"Can't code but can reach implementation via AI" alone isn't enough armor. Winning in the market requires all 5 layers:

#	Layer	Effect
1	Dimension Crossover	Reach implementation without coding
2	Value Asymmetry	Trivial tech matters in business contexts
3	Handover Ability	Simple implementations spread across industries
4	Ownership Avoidance	AI-generated specs let maintenance also depend on AI
5	Diagram-as-Source	Common language between AI and humans

The 5-layer dependency:

graph TD
    L1["Layer 1 Dimension Crossover<br>Reach implementation without coding"]
    L2["Layer 2 Value Asymmetry<br>Trivial tech matters in business"]
    L3["Layer 3 Handover Ability<br>Spreads across industries"]
    L4["Layer 4 Ownership Avoidance<br>AI specs let maintenance depend on AI"]
    L5["Layer 5 Diagram-as-Source<br>Common language between AI and humans"]

    L1 --> L2 --> L3 --> L4 --> L5
    L5 -.-> NEW["New Paradigm<br>Diagram is truth, code is AI's derivative"]

    style L1 fill:#fef
    style L5 fill:#fc9
    style NEW fill:#9f9,stroke:#393,stroke-width:3px

Layer 1 "implementation reach" is the starting point; the higher you stack, the stronger your market position. Past Layer 5, you exit into a new development paradigm.

Let me walk through each layer.

Layer 1: Dimension Crossover

For someone who can design but can't code, AI is the implementation means itself. Areas they previously had to delegate to others, they can now reach independently. 0 → 1 crossover happens.

Layer 2: Value Asymmetry

The opening point. Implementations the programmer world rates as "trivial" or "low-grade" land as business impact in the non-engineer world. The same tech reverses value depending on the evaluation axis.

In other words, the more "low-grade" tech is judged to be, the wider its market lands in business contexts. Programmers tend to lock themselves into a world that competes on technical sophistication, ending up clinging to a narrow market.

Layer 3: Handover Ability

Implementation Complexity	Required Personnel	Handover Ability	Industry Reach
High / Complex	High-skill programmer	Low (locked-in)	Trapped in 1 industry
Medium	General engineer	Medium	Mid-scale deployment
Simple (batch + Excel)	Operator	High	Industry-wide

The simpler the implementation, the more it reaches the field and becomes the gateway to the industry. The artifacts a vibe coder produces structurally land at "simple implementations" (since the coder doesn't read the code, they can't make it complex). Asking AI to "build it as a double-clickable batch file" → it converges naturally to moderate complexity → field handover ability emerges as a side effect.

Layer 4: Ownership Avoidance

Even with simple implementations and field handover ability, maintenance is a separate problem. AI collaboration structurally avoids the ownership trap that programmers have struggled with for a hundred years.

What is an AI-generated spec: A handover document for another account's AI (a different Claude / Gemini etc.) to read and become capable of modifying / maintaining the code. Even if the original vibe coder leaves, a new maintainer can bring in a different AI and ask "please maintain this code" — and the system lives on. This isn't the traditional "human → human" handover; it's an "AI → AI" handover by design.

This builds a 2-layer maintenance structure:

Layer 1: Simple implementation → operator can touch it directly
Layer 4: AI-generated spec → new maintainer / AI can take over

The revolutionary point of AI specs is being interactive, not static. The new maintainer's AI can be asked "how does this system work?". The system can be understood via AI without reading the code.

Layer 5: Diagram-as-Source

Layer 4 ensures "AI → AI" subject-axis handover, but for in-progress collaboration where humans and AI must share understanding, one more layer is needed.

So we make diagrammatic specs (Mermaid flowcharts / logic trees) the source of truth:

Natural language spec = soft (volatile, interpretation drifts)
Diagrammatic spec = hard (the structure itself is the source, humans and AI look at the same diagram)

Humans and AI can converse in the same diagram language. Instructions like "change the Yes branch of this gate" land instantly. Visual patterns make dead-code paths and unused nodes obvious.

4. Prerequisites — Without a Verification Eye, the Diagonal Axis Becomes an AI-Dependent Caricature

The 5-layer structure says "you can win without coding," but this is easy to misread.

Element	Necessity
Ability to write code	Not required
Ability to structure the system as a whole	Required
Verification eye for output validity	Required
Ability to cross-reference industry terminology (survey-driven)	Required

If these are missing, you can't write good prompts for AI, can't verify results, and end up mass-producing buggy deliverables.

In other words, "Domain Owner using AI ≠ Diagonal-Axis Engineer." Between them sits a filter of these 3 elements; those who can't cross it remain at "AI-dependent caricature" level.

Vibe coder = a non-programmer with a verification eye — that's the proper redefinition.

5. Devil's Advocate

The 5-layer structure isn't bulletproof:

AI-generated spec reliability: AI may write things that are factually wrong, with no verification possible
Commercial LLM dependency: service shutdown or model change can collapse reproducibility
Diagram literacy prerequisite: diagrammatic specs require "the ability to read them"
Short-term vs long-term: as programmers extend into the diagonal axis, the advantage erodes

Each requires individual countermeasures (human review / local LLM redundancy / coexisting static documents), but none of these overturn the post's core thesis.

Responses to Anticipated Pushback

"Can't anyone become a diagonal-axis engineer just by using AI as a domain owner?"

→ No. Without the 3 elements shown in Section 4 (structuring ability / verification eye / survey ability), you become a "vibe coder without a verification eye" who blindly trusts AI output and mass-produces buggy deliverables. Domain Owner + AI ≠ Diagonal-Axis Engineer. There's a prerequisite filter between them.

"Won't programmers eventually become diagonal-axis engineers by acquiring domain knowledge?"

→ Yes, but it takes time. For programmers to acquire specific business domain knowledge, on-the-job training + experience accumulation + dialogue with the business side are required, and this is dramatically slower than AI collaboration's immediate effect (the technical gap filled by AI). The dashed line in the path diagram represents this time gap. The defining feature of the AI collaboration era is this asymmetry: domain knowledge acquisition still takes time, while the technical gap is instantly filled by AI. That's why domain-owner-originated diagonal-axis engineers hold first-mover advantage.

6. My Case — Operations × Cross-Domain Double Cross

In my case, I sit at the top-right of the matrix by acquiring business knowledge through on-the-job experience first, then having AI write the code. Engineers with system construction or operations experience can usually map out routine business flows themselves, and reaching the diagonal-axis position only requires consulting AI for the parts that can be made more efficient.

I (the author) come from an operations background, and I move with the stance of bringing operational discipline (flow establishment / efficiency / monitoring / handover) into business improvements in other domains.

That is:

Layer	Structure
1 (Diagonal Axis)	Can't code, but reach implementation via AI
2 (Domain Cross)	Operations expertise × cross-domain business flow

Diagonal Axis × Domain Cross = double leverage. AI collaboration compensates the diagonal-axis weakness (can't code), and operations expertise brings a fresh perspective into other domains.

This is the core market value of the cross-domain engineer. The reason I feel "I can travel across industries" lies in this double cross.

7. Conclusion — A New Development Paradigm

Programmer era: Code is truth, documentation is auxiliary
Vibe coder era: Diagram is truth, code is AI's derivative, natural language is auxiliary

And the market value of the AI collaboration era boils down to the ability to travel across industries. Tech that draws "is that all?" responses is what becomes the gateway to industry adoption.

An ode to "trivial" tech, you might call it.

How I Built a Masking Tool Without Showing AI Any Real Data: Column-wise Shuffling as the Scaffold

J.S_Falcon — Sun, 10 May 2026 00:27:53 +0000

TL;DR

I never write code or send real data to LLMs — but I built a complete data-masking tool through AI collaboration.
The technique: column-wise independent shuffling (Japan PPC's official anonymization method) plus Faker replacement.
Four phases: send column names → run shuffling batch → manually craft sample CSV → send sample for Faker batch + structural review.
Key discipline: survey naive ideas in industry terminology before having AI implement — that alone compresses code 10x.
The output is a tool I trigger by double-click. I never read the Python.

1. The "Can't Send to LLM" Wall

Across my field notes, I've kept saying the same things:
"Don't send business data to LLMs."
"Only sanitized samples go to AI."

But how exactly do I sanitize the data?
That methodology has never been spelled out. So here it is —
a self-asked, self-answered post.

I wanted to build a new masking tool. I wanted to discuss it
with Claude or Gemini, showing real data and asking
"how would you mask this column?"

But the rule is firm: no business data goes to LLMs.

Just describing the logic verbally doesn't land —
LLMs need to see the data shape.
Hand-crafting fake data is torture (you have to reproduce
empty-cell patterns, spelling variants, full-width/half-width
character mixes, and so on).

What I needed: data that looks real but can't identify anyone.

2. The Naive Idea: Column-by-Column Shuffle

My first idea was simple:

"What if I shuffle each column independently?"

If you shuffle each column on its own:

Each value remains real (format perfectly preserved)
Row-level combinations are destroyed (records can't be reconstructed)
Per-column statistical properties are preserved (distributions intact)

For 100 customer rows, shuffle the name column, address column,
and amount column separately. The combination
"John Smith / 123 Main St / $12,345" disappears,
but each value still exists somewhere.

That should make individual identification impossible.

But before implementing, I surveyed first.

3. The Survey Reveals: Industry Standard

"Naive idea → immediate implementation" is forbidden discipline
(see my earlier field guide on
ops discipline in AI-assisted coding).
Translate the naive idea into industry terminology, then search.

Searching "column-wise shuffle + anonymization + technical term":

Column-wise Independent Shuffling
A de-identification technique offered by Oracle Data Safe,
Talend, Tonic.ai, and others.

And surprisingly, Japan codifies it too:

Japan's Personal Information Protection Commission (PPC) lists
"shuffling" explicitly in its anonymization guidelines:
"Probabilistically swapping records constituting the personal
information database among themselves."

So my naive idea was literally PPC's official method.
Survey complete. Time to implement.

4. AI Collaboration in Four Phases

A premise I should make explicit — I don't write a single line of code.
As a vibe coder, I have AI write it for me.

But the rule "no business data to LLMs" applies, so I can't just send
the real data and say "shuffle this please."
So I do it in four phases.

Phase 1: Send Only Column Names → Get a Tool Built

I can't send the real data, but I can send the column names
(structure, not PII).

Prompt to LLM:

Schema: customerID / name / address / building name / company / amount
Requirement: Shuffle each column independently, destroy row combinations
Build it as a batch file (.bat) that runs on double-click

The LLM produced a batch file + internal script + input/output folders
as a complete bundle. What lands on my desk: a tool that runs on double-click.
I don't read the Python inside.

Phase 2: Verify Operation → One Bug Surfaces

I drop real data into the input folder, double-click the batch file,
open the output CSV in Excel.

Something's off. The shuffle is supposedly happening, but row-level
combinations look intact — each row resembles its original ordering.

I report to the LLM:

The double-click ran fine, but the output CSV doesn't look shuffled.
Each row resembles the original order.

The LLM's instant reply:

The internal seed is shared across all columns. We need a different
seed per column. Fixing.

I receive the fixed batch file, double-click → combinations are now
destroyed. OK.

What looked correct on paper failed in practice.
The AI confidently said "looks right on paper" too,
so practical verification is the human's role.

Phase 3: Build the Sample CSV

From the shuffled output, I pull just 10 rows and manually replace the
surnames and building names with arbitrary characters in Excel.
This erases the last traces of real data.

The sample CSV now has only the column structure and shape of data —
no real-data trace remains. Only at this point does it become material
I can send to the LLM.

Phase 4: Send the Sample CSV → Get the Faker Batch Built

I send the sample CSV to the LLM with a follow-up request:

Based on this sample, add a Faker-based replacement step for
name / address / building / company. Same batch file should handle it.

The LLM integrated Faker (ja_JP locale, but the same applies in any
locale) and, for fields Faker doesn't support (e.g., apartment building
names like "Alpha Omega Place"), wrote a custom generator using
katakana + suffixes (producing names like "Nikikenawatower").

While reading the sample, the LLM also notices:

Your "product" keyword rule for Faker-replacement is over-matching:
"productID", "productStock", "productCategory" are getting hit too.
Switch to a two-stage detection (include + exclude keywords).

This wasn't a perspective I would have spotted alone.
I use AI twice — once for the shuffling batch (built from column
names alone), and once for the Faker batch + structural review (built
from the sample CSV).

The LLM rewrote the matching logic from "keyword-match → apply" into
"keyword-match → exclusion-check → apply" before producing the Faker
batch. Double-click the new batch file → Faker processing completes
without any over-match. Done.

The Four-Phase Role Split

Phase	What I do	What AI does
Phase 1: Build shuffling batch	Send column names as prompt	Build the complete batch tool
Phase 2: Verify operation → Fix bug	Click / verify in Excel / report	Identify bug cause and fix
Phase 3: Build sample CSV	Pull 10 rows / manually edit surnames and building names	(not involved)
Phase 4: Build Faker batch	Send sample CSV to LLM / click / verify	Build Faker batch + structural review (resolve over-match)

I never read the Python. I never send real data to LLMs.
Double-click → open in Excel → report to AI. Four phases through
this loop and the tool is finished.

This is what AI collaboration looks like.

5. Legal Positioning (Internal Use OK, Outsourcing Gets Tricky)

A brief touch on the legal positioning.

Internal use (LLM discussion / internal analysis) is generally fine.
The scrambled output is unidentifiable enough to substantially reduce
privacy risk in most jurisdictions.

Handling client data as a contractor is where it gets tricky.
The framing differs by jurisdiction:

Japan classifies this as "entrusted processing" under Article 27(5)(i) of the Personal Information Protection Act, an exception to third-party transfer rules.
EU/UK treats it as a Data Processor / Data Controller relationship under GDPR, with a Data Processing Agreement (DPA) under Article 28 specifying the processing scope.
US uses HIPAA's Business Associate Agreement (BAA) for healthcare data, or contractual data-handling clauses for general PII.

The common pattern: contract language determines compliance.
Whichever jurisdiction you operate in, have your legal team review
the "sanitization purpose and scope of use" clauses explicitly.

In short: contracts get complicated, so legal review is recommended
for contract work. I won't go deeper than that here.

6. Closing

What the AI collaboration era needs is a scaffold tool that converts
"data you can't share" into "samples you can share."
That tool fits in one line of pandas plus Faker.

And the post's thesis:

Don't have AI implement your naive idea immediately —
survey it in industry terminology first, then implement.
That discipline compresses code volume by 10x.

If I hadn't known that "column-wise shuffle" is PPC's official
"shuffling" method, I would have asked the LLM to "generate random
names with Faker, build a consistency dictionary, maintain referential
integrity, ..." — full from-scratch implementation.
In reality, one line of pandas was enough.

Survey-driven discipline. In the AI collaboration era, what matters
on the human side is the ability to cross-reference industry knowledge.

Why I Run Two AIs Against Each Other: An Ops Engineer's View on AI Governance

J.S_Falcon — Sat, 02 May 2026 01:04:53 +0000

TL;DR

I run two different AIs (Claude and Gemini) against each other, with myself as a human router carrying messages between them. No auto-orchestration framework.
The setup is a complementary view to Dev-centric AI automation tools, not a replacement. Right tool for the right job.
Operations background suggested treating this as a two-layer design: internal diversity (prompts within one model) and external diversity (cross-vendor models). Both layers contribute, neither alone is enough.
Five practices and five caveats below — drawn from one engineer's one-month operation, framed as a hypothesis, not a recipe.
Each part of this series stands alone — Part 1 is the entity resolution case study, Part 2 is the AI collaboration patterns, this is the architecture-level view.

1. The Problem: When Multi-AI Becomes an Echo Chamber

Multi-agent AI setups have been everywhere for the last year. AutoGen, CrewAI, LangGraph, and others let you spin up several agents with role prompts (planner, reviewer, executor) and let them talk to each other. Useful, fast, automated.

There's a quiet failure mode, though. When all the agents share the same underlying model, "multi-agent" can drift into "multi-prompt-on-one-model." The agents end up reasoning from the same training distribution, hitting the same blind spots, agreeing too quickly. You get the appearance of diverse perspectives without actual diversity.

I noticed this on my own setup. I was using Claude Code for development and asking the same Claude to play "devil's advocate" against its own proposals. Most of the time the devil's advocate was thoughtful, but on hard questions it tended to go quiet — the model couldn't really argue against itself when the alternatives lived in the same prompt context.

That's where the cross-vendor experiment started.

2. An Operations View: Internal vs External Diversity

I come from an operations / systems engineering background. When I look at an AI workflow, I tend to ask the questions an SRE would ask: where are the single points of failure? What happens on Day 2? What does the audit trail look like?

Through that lens, multi-AI setups have two layers of diversity, and they're not interchangeable.

Internal diversity is what you get from prompts inside a single model. "You are now arguing the opposite case." "Critique this design as a security reviewer." "List three reasons to reject this proposal." The model switches voices but keeps the same underlying reasoning substrate. Useful, cheap, fast.

External diversity is what you get from a different vendor's model. Claude and Gemini don't share weights. Their training data overlaps but isn't identical. Their bias profiles differ. When Claude proposes a design and Gemini critiques it, the critique comes from a different statistical posture, not a different mood from the same speaker.

These are not equivalent. They're substantially comparable for the surface task (producing diverse views), but they differ on operational properties:

Property	Internal diversity	External diversity
Audit independence	Single context log	Independent logs per model
Vendor lock-in	High (one vendor's model)	Low (cross-vendor)
Failure isolation	Context corruption affects all roles	Independent failure domains
Real parallelism	Sequential within one context	True parallel calls possible

So the design I ended up with is two-layer: internal diversity inside each model, plus external diversity across vendors. The two layers compound. Neither alone reproduces what both together produce.

The transport between models, in my setup, is email. Claude proposes; I copy the proposal into a script that emails it to a Gemini-readable inbox; Gemini reads, critiques, and responds; I paste the response back to Claude. Slow on purpose. We'll come back to that.

Part 1: Five Practices

Practice 1 — Force adversarial framing into prompts explicitly

When you ask a model "what do you think of this?" the default answer is usually a polite agreement. That's not what cross-review is for.

In every cross-vendor exchange, I include explicit phrasing that requires the responder to argue against. The wording I actually use, more or less verbatim:

Identify the weakest link in this design.
Give three reasons to reject this proposal.
Where would this hypothesis fail?
If you find yourself agreeing, name the assumption you are least sure about.

Without that prompt-level push, the second AI tends to mirror the first, which is the failure mode the whole setup is supposed to prevent.

This sounds obvious until you watch a model respond to a soft prompt and realize how much friction the polite default introduces.

Practice 2 — Adopt an audit-first protocol (Seq / Re style) with hard-stop on gap

Every message between Claude and Gemini in my system carries a header: [MAAR-Session: <topic> | Seq: N | Re: M]. Sequence number from the sender, reply-to number for the message being answered. Borrowed straight from TCP and email threading.

The point is not the format. The point is that any message is self-documenting about which conversation thread it belongs to, which message it answers, and what came before. When something goes wrong — a misrouted reply, a missing message, an out-of-order paste — the gap is visible.

When the gap shows up, the protocol response is hard-stop: pause the conversation, request retransmission, do not paper over the missing piece. That's the operations-side instinct showing up — a missing log line stops the change window, period. "Best-effort continue" is the failure mode that produces silent data loss in production; it's no better in AI workflows.

In a fully automated multi-agent system, this kind of bookkeeping is implicit in the framework. In a human-routed system, you need it explicitly. The cost is small (a few characters in each header). The benefit is the audit trail you actually get, plus the explicit stop signal when reality drifts from the protocol.

Practice 3 — Use vendor-neutral persona prompts

The prompts I send to Claude and Gemini for the same role (e.g., "critique this design") are kept as close to identical as possible. Same wording, same structure, same evaluation criteria. Differences in the responses come from differences in the models, not from differences in how I asked.

This matters because it lets me actually compare the two outputs. If Claude pushes back hard and Gemini agrees, I know that's a real signal about the proposal — not an artifact of having asked Gemini in a softer way.

The temptation is to tune each prompt to the strengths of the model. Resist it. Vendor-neutral prompts give you a comparable signal across vendors, which is the whole point.

Practice 4 — Switch between polite mode and adversarial mode

Most of my cross-reviews run in what I call "polite mode": measured language, restrained framing, "please consider" phrasing. That's the right default for normal review.

But sometimes the second AI agrees with everything I say. That's when I switch to "adversarial mode" deliberately: explicit framing that this is a thought experiment, instructions to drop politeness, demand for a forced disagreement, even (carefully) raising questions about whether the model has biases that explain its agreement.

The mode switch is intentional, time-boxed, and announced inside the prompt. It's not the default — overdoing it produces performative dissent (more on that in the caveats). But used sparingly, it's the mechanism that breaks the over-agreement spiral when it shows up.

Practice 5 — Hold to hypothesis discipline (sample-size humility)

This whole setup is N=1. One engineer, one month, one workflow. That doesn't make it wrong — but it doesn't make it a recipe either.

In every external description of this setup, including the article you're reading, I try to keep the framing as "individual observation, presented as a hypothesis." Avoid words like paradigm, grand theory, the right way. The discipline is partly about honesty (these claims aren't tested at scale yet) and partly about staying open to evidence that breaks the model.

If the hypothesis is right, it'll show its strength against contrary cases over time. If it's wrong, the framing makes it easier to walk it back without losing face.

Part 2: Five Caveats

Caveat 1 — Sample size of one

This is one engineer's one-month experiment. Patterns described as "practices" here might fail to generalize, might depend on my specific workflow or domain, might not survive contact with a different setup. Read the practices as hypotheses, not recommendations.

Caveat 2 — Right tool for the right job

Auto-orchestration frameworks (AutoGen, CrewAI, LangGraph) exist for good reasons. They're faster, cheaper, and better-suited to many use cases. The human-routed setup described here is a complement, not a replacement. If your task fits an autonomous loop, use one. The two-layer design is most useful where audit independence and cross-vendor signal matter more than throughput.

Caveat 3 — Internal and external diversity are not fully equivalent

Internal diversity (prompt-based persona switching) covers a substantial portion of what external diversity provides — but not all of it. Audit independence, vendor lock-in resistance, failure isolation, and real parallelism are properties that internal diversity simply cannot match. Claiming "we got the same effect with one model" is a category error.

Caveat 4 — Performative dissent risk

If you push the second model into adversarial mode too often, it learns to manufacture disagreement. You get pushback that's syntactically critical but substantively empty. The mode switch only works because it's the exception, not the default. Used as a routine technique, it produces noise instead of signal.

Caveat 5 — Maintenance overhead is real

Audit trails, mode switching, header conventions, hypothesis framing — this is more discipline than a casual workflow. The overhead is justified for high-stakes decisions and design reviews, less so for everyday tasks. If you adopt the practices, calibrate which ones are worth the cost in your context.

Wrap-Up

Three things I've taken from running this for a month:

Internal and external diversity compose. Either alone is a partial defense against single-model bias. Together they cover more of the failure surface than the sum of the parts would suggest.
The transport doesn't have to be fancy. Email and copy-paste are slow on purpose. If that reads as primitive — that's the design. The friction isn't from inability to automate the transport; it's a deliberate cognitive checkpoint between the two models. Slowness is the feature: it gives me time to actually read the response before forwarding it, and it forces the audit trail to be a thing I look at, not a thing the framework hides from me.
Discipline beats automation for governance work. Auto-orchestration is faster. Auto-orchestration is also harder to audit, harder to debug, and harder to explain to a compliance reviewer. For governance-shaped tasks, the slow path wins on the metrics that matter.

The whole construction is a hypothesis. I'd trade it for a tool that does the same job better tomorrow.

What's Next

Part 1 of this series — the entity resolution case study — is the concrete build that prompted this whole reflection.

Part 2 — the AI collaboration patterns from an operations lens — sits alongside this one and covers the session-time discipline.

A future part will cover the protocol design itself: the Seq/Re headers, the TTL-based session control, the gap detection and hard-stop rules. That's where the human-routed design earns its keep.

Comments welcome — particularly:

Cross-vendor multi-AI patterns you've tried, and what surprised you.
Cases where internal diversity was enough (counter-evidence to the two-layer claim).
Audit and governance experiences with auto-orchestration frameworks (where they shone, where they didn't).

What Operations Discipline Brings to AI-Assisted Coding: A Cross-Domain Field Guide

J.S_Falcon — Wed, 29 Apr 2026 13:13:13 +0000

TL;DR

I moved from operations / systems engineering into the software side via AI collaboration. Part 1 of this series (the entity resolution case study) is the build; this is the methodology.
Five practices and five anti-patterns, filtered through an ops lens — but the lessons generalize.
Not "AI tips you've heard." Patterns that fall out naturally if you treat AI sessions like config reviews, runbooks, and validation procedures.
Each piece is paired with a real misstep I made building Part 1.
Each part of this series stands alone. Read in any order.

Why an "Operations Discipline" Lens

Operations engineers spend their careers internalizing four habits:

Plan before you build — designs, runbooks, change requests.
Verify before you declare done — validation procedures, post-change checks.
Document state — configs, design docs, postmortems.
Suspect numbers — every monitoring datapoint hides an artifact.

These habits transfer directly to working with AI coding assistants. The disciplines you learned debugging routers, filing change requests, and reviewing configs are the same ones that prevent AI sessions from sliding off the rails.

I'm framing this through ops because that's the lens I learned from. Most of these patterns generalize beyond ops — software engineers, data engineers, and SREs will recognize them. The ops version just happens to package them tightly.

Part 1: Five Practices

Practice 1 — Treat your CLAUDE.md (or system prompt) as a design-spec preamble

In ops, every change procedure has a preamble: prerequisites, scope, rollback steps, validation checks. Same energy in AI work.

CLAUDE.md is Claude Code's persistent instruction file. (Other assistants have equivalents — system prompts, custom instructions, etc.) Use it the way you'd use a runbook preamble:

## Operating principles
- Always plan before implementing.
- Confirm ambiguous instructions before coding.
- Always provide a counter-argument when proposing a design.
- Never report a metric without showing how it was measured.
- Distinguish "should work" from "actually verified to work."

Once written, every future session inherits these rules. You stop re-explaining yourself. This is the same template-then-reuse pattern that saves you from rewriting a runbook for every change window.

Practice 2 — Demand a Devil's Advocate, every time

Design reviews exist because group-think kills production systems. Force the AI to argue against itself in every proposal.

Three asks I bake into every meaningful design conversation:

What's the worst-case failure mode of this design?
What use case did you not consider?
Give me three reasons to reject this design.

Bake this requirement into your CLAUDE.md and you stop seeing pure agreement. An AI that only agrees with you is a single point of failure.

Practice 3 — Force ambiguous instructions to be confirmed before implementation

In ops requirements gathering, "implement loose specs" is a known disaster pattern. The same is true for AI sessions, where ambiguity gets resolved silently — and usually wrong.

Real example from Part 1: I said "treat the ID and the display name as a pair, match if either is present." The AI interpreted that as two independent search keys. Half the matcher had to be rebuilt.

Lesson, written into CLAUDE.md: if an instruction has two valid readings, ask which one I mean before writing code.

This is the same habit as a senior network engineer asking "do you mean inbound or outbound?" before touching the firewall.

Practice 4 — Separate "theoretical evaluation" from "real-world evaluation"

Ops engineers know the gap between "the spec says it works" and "I've watched the LED light up." The same gap exists in AI work, and it's wider than you'd think.

Real example from Part 1: the AI claimed about 99.2% recall based on past-data pattern analysis. I asked for an actual run on the real dataset. The actual recall came back at 55%.

The lesson is not "the AI lied." The lesson is that pattern-analysis predictions are not the same as a real execution result. Every claim that sounds like a measurement deserves the question: was this measured, or estimated? If estimated, label it that way and move on; if measured, show the run.

Practice 5 — Have the AI write its own verification scripts

If the AI says "this code achieves 99% recall," ask it to write the script that measures that recall. Then run it.

This converts:

A claim → a script.
A script → an audit trail.
An audit trail → reproducibility.

It is the same pattern as runbooks: a change procedure and a validation procedure, always paired. The validation script becomes a permanent artifact you can hand to the next person — or to your future self when something regresses.

Part 2: Five Anti-Patterns

Anti-Pattern 1 — "Just build me a tool"

The AI equivalent of "fix the network." Without scope, the AI invents one. Worse, it pursues the invented scope confidently, so the wrong direction is pursued aggressively.

Treat session start like requirements gathering: rough goal, key constraints, what's explicitly out of scope. Five minutes of scoping saves five hours of rework.

Anti-Pattern 2 — Trusting headline numbers without verifying composition

"99% recall" sounds great until you discover it was measured on cherry-picked rows, with the test set leaking into training data, on a metric that doesn't reflect the actual user experience.

Before reporting any number, ask:

How was this measured?
On what data?
Under what conditions?
With what biases?

This is the same suspicion you apply to a monitoring dashboard reporting zero alerts: is the agent actually reporting, or is it dead?

Anti-Pattern 3 — Throwing raw error text at the AI without context

"It doesn't work" → "Why?"

In ops you'd never debug a router by saying "it's down." You'd attach: configuration, status output, syslog excerpts, behavior of connected devices.

Same here. The AI cannot infer your environment. Show the command, the actual output, the expected behavior, and the deviation. Treat each interaction like a bug report you'd file with a vendor.

Anti-Pattern 4 — Sending business data to an AI without compliance review

Default assumption: any data you put into a prompt may be retained, indexed, or used in training, regardless of what the vendor's marketing copy says.

The operational habit is straightforward — redact, mask, or synthesize. The same instinct that keeps you from posting customer IPs to Stack Overflow should stop you from pasting customer rows into a prompt.

(Part 1 covers this pattern in depth as it applied to the entity resolution build. The short version: deterministic logic touches the data; the AI touches only code, design notes, and synthetic samples.)

Anti-Pattern 5 — Stopping at "it works"

"The code runs" is not the same as "I understand why it runs."

The ops version of this is: a configuration that worked once but I can't explain is a future incident.

Make the AI explain why the working solution actually works. If neither of you can defend the design after one cycle of follow-up questions, treat it as a yellow flag — not a green light. Ship explainable code; the unexplained kind owns you on the day it breaks.

Wrap-Up

The pattern across all ten:

Apply ops discipline to AI sessions.
Treat AI claims like vendor claims — verify them in your environment.
Treat AI conversations like change windows — preamble, scope, verification, postmortem.
Treat AI outputs like config diffs — explain them or reject them.

What I'm explicitly not claiming:

These are not unique to operations engineers. They generalize. They just happen to package tightly through the ops lens because the discipline is already there.
These are not the only practices. Five is a lossy compression. The ten you'd build for your environment may differ in detail.
These cover the build phase of AI-assisted work — the session-time discipline. Day 2 operations (monitoring AI-generated code in production, detecting silent drift, incident response when AI-assisted changes break) is its own discipline and deserves its own article. The patterns here are necessary but not sufficient for production AI usage.

The big idea: AI doesn't replace engineering judgment — it amplifies it. Amplifying lazy judgment produces more bad code, faster. Amplifying disciplined judgment produces clear, audited, defensible work.

What's Next

A future part of this series will cover how the design review for these articles actually happened — a Multi-AI Adversarial Review (MAAR) loop where Claude and a second AI argued against each other under human routing. That's the meta-process behind both Part 1 and this one.

If you came in via this article, Part 1 is the concrete build that produced these lessons.

Comments welcome — particularly:

The five practices or anti-patterns you'd add.
Cross-domain engineering experiences (any technical background → another).
Cases where ops discipline did not transfer cleanly to AI work.
Rollback strategies when an AI-assisted change corrupts your codebase or repo state.
Day 2 operations practices for AI-generated code in production (monitoring, drift detection, incident response).

Beating 250,000 Mental Comparisons: A Cross-Domain Engineer's Entity Resolution Case Study

J.S_Falcon — Wed, 29 Apr 2026 11:13:41 +0000

TL;DR

Operations/Systems engineer recently moved to the software side via AI collaboration.
Built a domain-specific entity resolution tool in a handful of evening sessions with Claude Code.
Caught about 99.2% of human-detected reconciliation errors when replayed against 8 weeks of historical data.
Turned a "skilled-veterans-only" weekly task into something anyone on the team can run.
Design retrofitted unexpectedly well to dual process theory, Gestalt psychology, and anchoring-bias defense.
Source business records never reached an LLM. Deterministic pipeline + human review only.

1. The Hidden Problem: When 500 × 500 Becomes a Cognitive Wall

Many companies maintain the same business entities across multiple systems.

A retailer tracks SKUs in an internal master AND on Amazon / Rakuten / Shopify exports.
A clinic carries patient records in both an EMR and an insurance billing system.
A manufacturer holds internal inventory but also receives partner inventory feeds.
An accounting team reconciles general ledger entries against bank statements.

These pairs need periodic reconciliation. In the technical literature this is Entity Resolution or Data Reconciliation — a universal problem that nearly every mid-to-large business hits eventually.

The case study here uses the retail SKU vs marketplace listing framing. (The actual industry I work in is intentionally abstracted, but the structure transfers cleanly.) Two systems, ~500 rows each, weekly reconciliation. Skilled humans needed about 3 hours per week. Newcomers, half a day to a full day. Hidden detail: the small row count masks the real difficulty.

Why is 500 × 500 hard?

The 250,000 problem

Manually reconciling 500 × 500 pairs forces a person to evaluate up to 250,000 combinations in their head. Not 1,000 — 250,000. Plus typo tolerance, format variation (full-width vs half-width, mixed scripts, abbreviations, punctuation), and partial matches. Each pairwise judgment is not O(1).

Brute-forcing this is computationally similar to running a 1,000-node full-mesh ping check vs a flat 1,000-node liveness check. Order-of-magnitude different load.

Working memory overflow

Miller's "magical number" puts our short-term memory at 7 ± 2 chunks (Miller, 1956). Hunting matches across 1,000 candidates with format drift continuously overflows working memory and pegs System 2 (slow thinking) for the entire session. The 3-hour exhaustion experienced by veterans isn't a complaint — it's a neurological inevitability.

"Short to do" doesn't equal "easy to do" for cognitive labor.

Reproducibility decay

A one-off reconciliation can be brute-forced. But when the task repeats weekly across 10+ weeks, judgment drift becomes unavoidable:

"Last week I matched 'A Co.' and 'A. Company' as the same entity. This week I treated them as different."
"Last week I tolerated typo X. This week I rejected it."

This drift is what really breaks data quality long-term. It's the same structural failure mode as "config review standards differ by reviewer" in infrastructure operations.

The actual target

So the real problem the tool solved was not "shorten 3 hours per week" but:

250,000 judgments × 10 weeks of consistent reproducibility — a quality bar humans can't physically sustain — backed by a deterministic machine.

Plus removing the skill dependency. "Only one veteran can do this in 3 hours" is a single point of failure. After the tool: anyone could run it with consistent quality.

2. Background: Who I Am and What I Was Solving

I'm an Operations/Systems engineer. Configuration, validation, runbook authoring, monitoring, troubleshooting — that side of the house. Software development was not my primary craft, though scripting was always part of the job.

I'd recently moved into a new business domain (about 2 months in) and the tooling target system was something I'd only been touching for ~1 month. From the user side I'd seen the workflow longer, but not as a developer.

Translation: design / validation / runbook discipline solid. Python and application development essentially unfamiliar.

This article is not a "look what I shipped" piece. It's a record of how operations-side disciplines transferred unchanged into AI-assisted software work in an unfamiliar domain.

Who this article is for

Reader	Useful sections
Operations / SRE engineers exploring AI assistance	Everything
Mid-career engineers moving across technical domains	Background, Architecture, Cognitive Design
Engineers new to AI-assisted development	Architecture, Cognitive Design, PII
Managers thinking about AI for their teams	Results and the cognitive-load argument

3. PII / Compliance Considerations

A question that always comes up in comments on entity-resolution articles: where does the data go? Worth answering up front.

In this implementation:

Source business records never reach any LLM. Both input files (internal master + external system export) are read locally by a Python script.
Matching is fully deterministic. Pandas, openpyxl, and difflib.SequenceMatcher for similarity. No embedding API. No remote inference at runtime.
The LLM's role is code-side, not data-side. Claude Code helped write the matching logic, the validation scripts, the design review, and the documentation. None of the actual records were ever sent.
For testing only, masked synthetic data was used in prompts. Real names, amounts, and addresses were replaced with synthetic equivalents before any prompt left the local environment.
Edge cases stay with humans. When the deterministic pipeline can't decide, it surfaces a flagged row for human review — not for LLM second opinion.

This separation is intentional. The matching task is well-suited to deterministic logic. LLMs would only add cost, latency, and compliance exposure for no quality gain.

If your team has even a soft "no business data into external AI" policy, this pattern is fully compatible.

4. Architecture: Two-Stage Matching + Cognitive Gates

Stack

Python 3.11
pandas + openpyxl (Excel I/O, color-coded output)
difflib.SequenceMatcher for fuzzy similarity
Rule-based throughout. No machine learning.
~1,100 lines, single script.

Phases

Phase 1: Match by exact stakeholder name (or alias group)
Phase 2: Cross-match by name similarity ≥ 0.6 (rescue typos)
Phase 3: Last-name-only + structural match (single-typo tolerance)
Phase 4: Duplicate-registration detection (same stakeholder + similarity ≥ 0.8)
Phase 5: Rescue rows with no stakeholder name (attribute match)
Phase 5.5: Attribute-mismatch pair rescue (identifier similarity ≥ 0.7, stage 2)
Phase 6: Row generation + color decision

The score function (key gates)

def compute_score(row_a, row_b):
    # Hard gate: region must match — kills cross-region false positives
    if region_a != region_b:
        return 0.0
    # Hard gate: numeric attribute must be close enough
    if abs(value_a - value_b) > THRESHOLD:
        return 0.0
    # Identifier gate: row_b's identifier must be embeddable in row_a's identifier
    if not is_identifier_match(addr_a, identifier_b):
        return 0.0
    # Sub-identifier gate: anchoring-bias defense
    if sub_id not in addr_a:
        return 0.0
    # Soft scoring (only after every hard gate passed)
    score = max(identifier_match_score, similarity, value_fallback)
    return score if score >= 0.6 else 0.0

Why this shape?

The retail SKU framing helps here. The same product on a marketplace might appear as iPhone15 in your master and iPhone 15 Pro Max on the marketplace. Same item family, different surface form. Two key insights:

Hard gates first. "Different region" or "value difference > N" are absolute disqualifiers. Run them before any expensive similarity computation.
Soft scoring last. Once hard gates pass, compute similarity — but cap below 0.6 as "uncertain, surface to human."

Why not ML / Vector DB / embeddings?

Deterministic rule-based was chosen on purpose. Auditability was the requirement. When a flagged row is wrong, the operations team has to be able to trace exactly which gate fired and why. A black-box similarity score of 0.81 with no explanation cannot be reviewed, cannot be unit-tested, and cannot be defended in a compliance audit.

ML is a fine choice when you have labeled training data, training infrastructure, and a continuous evaluation pipeline. None of these applied here. The operating constraint was: "anyone on the team should be able to read the code and know why it decided what it decided." That constraint forces deterministic logic.

Abstracted structure

Domain-specific term	Abstract concept
Item / SKU	Entity
Stakeholder (vendor / agent)	Stakeholder attribute
Price / Amount	Primary numeric attribute
Address / Location	Identifier (multi-attribute)
Building / SKU name	Auxiliary identifier
Detail number / barcode	Sub-identifier
Format variation (kana/latin/case)	Data quality issue
Domain judgment	Tacit knowledge

This is a universal "match entities across two systems with format drift" problem. The pattern reappears in EC, healthcare, HR, accounting, manufacturing, publishing — anywhere two systems represent the same business object differently.

5. Cognitive-Science Design Principles (the Twist)

I didn't design this thinking about cognitive science. I built it, it worked, and only afterwards in a structured Gemini conversation did the underlying principles surface. The retrofit fits unsettlingly well.

5.1 Dual process theory (Daniel Kahneman)

The two phases map onto two thinking modes:

System 1 (fast) = Phases 1–5. Fuzzy "is this roughly the same thing?" — similarity scores, identifier matching, attribute closeness.
System 2 (slow) = determine_color(). Strict checks for value mismatch, format inconsistency, identifier mixing.

Color-coded human review gets the System 1 fuzzy pass plus the System 2 strictness annotation, which is exactly the input shape humans need to make a final call.

5.2 Gestalt psychology

Humans recognize "wholes," not character sequences. iPhone15 and iPhone 15 Pro Max feel like the same product family even though strict string equality fails. So:

def is_identifier_match(addr_a, identifier_b):
    """Recognize chunked identity even with mixed scripts and separators."""
    chunks = re.split(r'[A-Za-z0-9\s\-_]+', identifier_b)
    return all(chunk in addr_a for chunk in chunks if len(chunk) >= 2)

Matching by chunks survives whitespace, separator, and script variation.

5.3 Anchoring & confirmation bias defenses

Hard gates exist to deny human-style intuitive shortcuts:

"Same price, must be the same item" — rejected by sub-identifier gate.
"Same name, must be the same person" — rejected by region gate.

The machine's job is to be coldly skeptical exactly where humans get over-confident.

5.4 Reducing human cognitive load (Human-in-the-Loop)

When a human is asked to confirm a flagged row, they don't get an opaque "match score 0.62". They get a one-line annotation:

Same entity matched | [Value mismatch] diff ¥2,000,000 (5.4%)
(A: ¥34,900,000 / B: ¥36,900,000) · identifier format inconsistent

The human doesn't waste cycles re-deriving why the row was flagged. Cognitive load drops sharply.

5.5 Don't automate the ghost

This part borrows from Ghost in the Shell. Some judgments depend on tacit business knowledge that can't be reduced to rules. Don't build heuristics that pretend to encode them. Surface the row as a caution signal and let a human apply the tacit layer.

Tightening the logic isn't a path to recreating the ghost.
It's a path to revealing where the ghost is needed.

Mapping summary

Cognitive concept	Implementation
System 1 (fast)	Phases 1–5 (fuzzy matching)
System 2 (slow)	`determine_color()` strict checks
Two-stage / dual-pass	Stage 1 + Stage 2 (Phase 5.5)
Gestalt grouping	`similarity` / `is_identifier_match`
Anchoring defense	Sub-identifier gate, identifier gate
Cognitive load reduction	Aggregated `[reason] diff X` annotations
Human-in-the-Loop	Caution signals for tacit-knowledge zones

6. Results

Recall on 8 weeks of historical data

Metric	Value
Errors flagged by humans (excluding outlier weeks)	~130
Errors caught by the tool	~129
Recall	~99.2%

The single missed case was annotated by the human reviewer as "even a human couldn't decide here." Effectively the tool catches every case where a human commits a confident verdict.

(Caveat: this is recall against 8 weeks of one team's data, not a benchmark claim. Different domains will need their own measurement.)

Time and skill load

Item	Before	After
Skilled veteran throughput	~3 hrs/week	~30 min/week (review only)
Newcomer throughput	half a day to full day	~30 min/week
Skill dependency	Yes (single point of failure)	No (anyone can run it)

The time number understates the value. The real shift is breaking the skill SPOF. Veteran out sick, leaves, or buried in another priority — work continues at the same quality.

A note on false positives

Recall is 99.2%, but the tool is intentionally tuned for higher recall over higher precision. False positives — pairs flagged for human review that turn out to be fine — are accepted as the trade-off. The ~30 min/week of human review handles them without strain.

In a no-human-in-the-loop deployment this trade-off would be very different. Here, false positives are cheap (a glance from a human reviewer) and false negatives (missed reconciliation errors) are expensive (data drift propagates into business reports).

7. The Flowchart

Drawing the judgment flow as diagrams surfaced things the code review didn't. Below are the four phases as separate figures, in execution order.

7.1 Phase 1: Hard Gates (sequential disqualifiers)

Region → numeric value → auxiliary identifier → sub-identifier. Each gate is an absolute disqualifier: any "No" drops the pair. The order matters — cheapest disqualifiers run first.

7.2 Phase 2: Soft Match

Once a pair clears all hard gates, compute_score evaluates a soft similarity. Below 0.6 → drop. At or above → lock the pair as the same entity.

7.3 Phase 3: Parallel Flag Checks

For confirmed matches, six independent checks fire in parallel. Each surfaces a "this matched, but here's a discrepancy" signal. Tags are aggregated; there is no early-return contamination between checks.

7.4 Phase 4: Final Verdict and Drop Aggregation

Aggregate the tags into a color verdict. Drops from Phase 1 and Phase 2 converge into the "Unmatched" lane, surfaced standalone in the human-review output.

Things visible only after rendering as a diagram

These were invisible while reading code, only obvious once drawn:

Phase 1 hard gates are ordered by computational cost. Region → numeric → auxiliary → sub-identifier. I placed them by intuition; the diagram showed they were already optimal — cheapest disqualifiers first.
Phase 3 parallel flag checks are genuinely independent. Six checks fire in parallel with no early-return contamination. The diagram confirmed there was no silent dependency between them.
All Drop1–Drop5 paths converge to the same Unmatched node. I was throwing away the drop reason. Re-running "why was this pair rejected?" was impossible. Fix: log the drop reason in the row annotation.

Drawing the flowchart is roughly the same act as drawing an infrastructure topology before going live. The diagram is the rubber duck.

8. Wrap-up

Three transferable lessons from this build:

Cognitive load is the hidden cost of "short" repetitive judgment tasks. Headcount-hour math undersells the burnout reality and skill-SPOF risk.
Cognitive science principles fall out of good design retroactively. I didn't design with them in mind; the principles became visible only through structured review (with a second AI). If your design retrofits to known principles, that's confirmation. If it doesn't, that's a smell.
LLMs do NOT have to touch your data. Most entity resolution work doesn't need them at all. Use them for code, design review, and documentation. Keep the business records local and deterministic.

The implementation itself is internal-use only and won't be open-sourced. The patterns generalize cleanly to any two-system entity reconciliation: EC, healthcare, HR, accounting, manufacturing, publishing.

9. What's Next

Coming in Part 2: how this whole thing got built in the first place — the AI collaboration patterns, the anti-patterns I hit, and the cross-domain disciplines that transferred from operations to software development. (Link to A2 once published.)

Comments on entity resolution, cognitive load in repetitive tasks, or cross-domain engineering experiences are welcome.

Field Notes from a Cross-Domain Engineer Working with AI

J.S_Falcon — Wed, 29 Apr 2026 01:36:50 +0000

I run two AI assistants from different vendors against each other on every non-trivial decision, with a human (me) sitting in the middle as the routing authority. What's surprised me most is not that they disagree — it's how they disagree. They drift toward agreeing with me too quickly, then with each other too quickly, and I've had to design specific friction into the workflow to keep their disagreement productive.

Writing this down because I think the friction patterns might be useful to other people running similar setups.

What this is

I've been working as an operations / systems engineer for about twenty years. Networks first, then servers, then config management, then incident response. The kind of role where the job is mostly making sure nothing breaks, and where every change request comes with a runbook, a verification step, and a rollback plan.

In the last year and a half, AI coding assistants pulled me into the software side of the house. Not because I wanted a career change — because the boundary between "writes code" and "runs systems" got thin enough that ignoring it stopped being an option.

This series is a running log of what I've been finding, written as I go. The disciplines I pulled in from operations turned out to map onto AI collaboration unexpectedly well. The places where they didn't, I've had to invent something for myself. Observation-heavy and prescription-light. I don't think I've discovered anything new; I've stumbled into the same territory a lot of practitioners and researchers are mapping right now, and I'm writing down my coordinates while they're fresh.

What this is not

A few things I want to flag up front.

This is not a survey. I haven't done an exhaustive review of the literature, the toolchain ecosystem, or the practitioner community. I'm certain there are people who've been running similar disciplines longer than me and writing them up better. If you find one, please send me the link — I'll learn from it, and I'd rather correct the record than defend a stale draft.

This is dated material. Anything you read here is a snapshot from early 2026, written from one specific vantage point: a Japanese operations engineer transitioning into AI-assisted software work, with paid Claude / ChatGPT / Gemini subscriptions and no team-scale deployment. If your context differs significantly, the disciplines may transfer poorly or not at all.

The five disciplines this series circles around

A set of operating principles I keep coming back to in my own daily work. Each earns its own article (or several) in the series. The short version, in a "simple to compose" → "complex to compose" order:

1. TTL Satisficing

I cap the number of back-and-forth rounds in any AI-assisted decision. Three is a good default. The point is not to find the perfect answer — it's to converge on a good-enough answer before the next round costs more than the marginal improvement is worth. If three rounds don't converge, I treat that as a signal that the problem is the problem, not the round count.

This is essentially timeout / retry-budget design from network operations, applied to a conversation with an AI.

2. The Karpathy Rule (Simplicity First, Surgical Changes)

Andrej Karpathy's framing, adapted to my workflow: I don't let the AI add what wasn't asked for, and I don't let it edit what wasn't in scope. Speculative additions are how technical debt grows at AI speed. Keeping the diff small and the proposal narrow is a discipline I have to actively enforce.

This maps cleanly onto YAGNI ("You Aren't Gonna Need It") and minimal-diff code review culture — pre-AI principles, applied to a faster-moving collaborator.

3. Fail-Fast (Tolerance for Contradiction)

When a problem doesn't converge across multiple attempts and multiple AIs, I want the AI to declare it unfeasible and hand me a graceful-degradation path. The AI's job is not to grind itself into the ground trying to solve everything. Its job is to return judgment to the human when judgment is what's actually needed. "I can't solve this, here's how to amputate it cleanly" is a better answer than "I'm trying again, please wait."

This is circuit-breaker behavior from distributed systems, applied to AI reasoning loops.

4. Adversarial Review (the Devil's Advocate, made standing)

For any non-trivial proposal, I force the AI to argue against itself. Standing instructions — in CLAUDE.md, in custom instructions, in the system prompt — so I don't have to remember to ask. An assistant that only agrees with me is a single point of failure, not a colleague.

This is red-team review and RFC-style structured criticism, made automatic instead of episodic.

5. The Human Router (Multi-AI Adversarial Review)

For decisions with real trade-offs, I run two AIs from different vendors against each other and stay in the loop as the routing authority. Not as supervisor, not as referee — as the operator who decides which output goes where, and which conflicts get escalated. The cross-vendor part matters in my experience: same-vendor "multi-agent" setups still share a worldview. Cross-vendor (e.g., Claude vs. Gemini) actually surfaces disagreement.

This is human-in-the-loop control plane thinking, where the data plane is "AIs argue" and the control plane is "human decides which arguments matter."

I want to be clear: none of these are inventions of mine. The literature has Multi-Agent Debate, Self-Refine, Reflexion, Constitutional AI, and a growing body of human-in-the-loop research. I'm describing how I personally compose these ideas into a daily workflow, and writing down the specifics in case the composition is useful to someone.

Why "Cross-Domain"

I keep using this word and I should explain it.

The AI engineering conversation I see is heavily centered on the programmer's lens — code generation, dev tools, IDE integrations, prompt-to-PR workflows. That lens is real and valuable, and I'm not arguing against it.

But I came in from operations, and operations has its own logic. Change management. Verification gates. Runbooks. Rollback procedures. Skepticism toward dashboards that are too green. The instinct to ask "how would I know if this is silently broken?" before celebrating a green build.

When you bring those instincts to AI collaboration, you get a different shape of practice than you'd get if you came in from the dev side. Not better. Different. And I think the differences are worth writing down, because operations engineers, SREs, data engineers, business systems people, and infrastructure folks are all about to find themselves doing AI-assisted work — and the dev-centric playbook isn't going to fit them perfectly.

That's what "cross-domain" is pointing at. Not crossing one specific bridge, but recognizing that there are many bridges, and that each engineering discipline brings its own toolkit to the AI collaboration table.

What's coming in this series

(Links will appear as articles are published. Each piece is designed to stand alone — you don't need to have read this anchor page to follow any of them. This index is just here for the curious.)

Entity Resolution case study — A practical build: deterministic two-system reconciliation with AI assistance, no business data leaving the local environment, ~99.2% recall on historical replay. Where I learned that operations discipline transfers cleanly to software work in unfamiliar domains.
AI Collaboration Patterns — Five practices and five anti-patterns I picked up running AI-assisted operations work day to day. The kind of thing you only learn by getting burned.
Alchemy Essay — A late-night thought experiment mapping AI coding onto the principles of Fullmetal Alchemist. The most polemical piece in the series; also the most personal.
Domain-Native Governance — The architectural argument: why dev-centric AI governance feels insufficient from where I sit, and what a domain-native alternative might look like. The most theoretical piece in the series.
Domain Logic First — A position piece: each engineering discipline has its own logic, and there might be value in adapting AI to it rather than the other way around.
Field-Note Tips (ongoing) — Short pieces on individual operating habits: the CLAUDE.md preamble template, three-tier tool routing, telling fast evaluation from real evaluation, etc.

A note on tone

The Japanese versions of these articles tend to land in a softer register than the English ones. That's a translation choice; the underlying observations are the same. If you read both, the English will feel more direct and the Japanese more reflective. Neither is the "real" version — both are. I'm a different writer in each language, and I've stopped trying to flatten that.

A note on the time-shift

If you're reading this in 2027 or 2028 and any of these disciplines feel obvious by now, good. That means the field moved. The reason these notes exist is that I was figuring this out in early 2026 and wanted a record. I'd rather have written them down too early and looked dated than written them down too late and lost the messy details.

If you're reading this in 2026 and any of these disciplines feel new: I'm right there with you. We're figuring this out together.

Comments welcome. Particularly from:

Engineers from non-dev domains who've started AI-assisted work
Researchers working on multi-agent debate / adversarial review / human-in-the-loop systems
Anyone running multi-AI cross-vendor protocols and willing to share what's worked or broken
Anyone who thinks one of these five disciplines is wrong and wants to argue
Anyone who has prior art (papers, blog posts, internal write-ups) for any of this — I'd genuinely like to read them

"AI Coding Is Alchemy: A Late-Night Reflection from Fullmetal Alchemist"

J.S_Falcon — Mon, 27 Apr 2026 17:47:25 +0000

Late at night, writing code, I had a sudden realization: the experience of building with AI (LLMs) maps almost perfectly onto the foundational principle of alchemy from Fullmetal Alchemist — Comprehension, Deconstruction, Reconstruction. The current paradigm shift in AI development can be told as the evolution of these three steps.

1. Traditional Programming = Drawing Transmutation Circles by Hand

Software engineering, until recently, was traditional alchemy.

We read requirements and business logic (Comprehension), broke them down into algorithms and function designs (Deconstruction), and then drew the transmutation circle — the actual code, with exact syntax — by hand to bring the system to life (Reconstruction).

If even one chalk line wavered (a syntax error, a typo), the transmutation failed and the error blew back at us. We had to draw the circle on the ground by hand, every single time.

2. AI Coding = Hand-Clap Transmutation Through the Gate of Truth

Then ChatGPT, Claude, o1, and the rest arrived. It feels — at least to me — like we have collectively opened the Gate of Truth for infrastructure and coding.

Once we do the Comprehension and Deconstruction in our heads (requirements gathering through prompts, architecture design), we can outsource the most tedious step — Reconstruction (writing the code, drawing the transmutation circle) — to the Gate (the AI).

We clap our hands — well, hit Enter — and thousands of lines of boilerplate or a complex regex assemble in an instant, without ever drawing a circle.

3. The Equivalent Exchange Trap (An Operations View)

Hand-clap transmutation looks indistinguishable from magic. But here is the trap an operations engineer notices.

Outsourcing Reconstruction to the AI is conditional on the human side getting Comprehension and Deconstruction completely right.

What happens when domain knowledge (Comprehension) is shallow, or the logical structure of the prompt (Deconstruction) is broken, and you still hand the transmutation off to the AI? You ship a Chimera into production — undebuggable spaghetti code, or a quietly exploitable security hole.

Equivalent Exchange — the foundational law of alchemy — does not get repealed by AI. Whatever amount of human thought you skip, the system charges you back later, in the form of an incident.

4. "All Is One, One Is All" — Living Next to a Collective Intelligence

There is one more truth from the same series that has to be named: "All is One, One is All."

An LLM is the All — every line of code engineers have ever written, mixed with the cumulative knowledge of humanity, distilled into one model. Each of us, the engineer at the keyboard, is only one — a single point inside that enormous stream.

But if we hand everything over to the All and merely ship whatever it emits, we collapse into the All. We become a part of it — a downstream API endpoint of the AI's output, indistinguishable from the noise.

It is exactly because the One — the individual engineer — Comprehends the whole system and Deconstructs it with intent, that the All — the AI's collective intelligence — Reconstructs it as something with actual value.

Quality, security, the integrity of the larger system: all of these come back, finally, to the thought and judgment of the One.

5. Don't Become a Homunculus — Keep a Heart of Steel

As AI advances, the value of an engineer who is only a biological pair of hands — Reconstruction-only labor — is collapsing.

But the roles of Comprehending the system, Deconstructing its boundaries, and taking responsibility for what the AI emits — these will not be automated.

The survival strategy is not to outrun the AI. It is to resist being swept away by its overwhelming speed, to deliberately introduce the friction of thought — the small pain of slowing down — and to stay in that resistance.

That, I think, is the Heart of Steel that keeps us from sliding into Homunculi: dolls who have surrendered the act of thinking.

We are alchemists who have seen Truth. There is no going back to the world where every circle was drawn by hand. But the one thing we cannot afford to let go of is the act of thinking for ourselves and owning what we ship.

A quiet promise made to myself in front of a monitor, late at night.

The Japanese title of Fullmetal Alchemist is 鋼の錬金術師 (Hagane no Renkinjutsushi — "Steel Alchemist"). The "Heart of Steel" line above is a small wordplay on that title. It survives translation only partially.

"Beating 250,000 Mental Comparisons: A Cross-Domain Engineer's Entity Resolution Case Study"

J.S_Falcon — Sun, 26 Apr 2026 08:41:34 +0000

TL;DR

Operations/Systems engineer recently moved to the software side via AI collaboration.
Built a domain-specific entity resolution tool in a handful of evening sessions with Claude Code.
Caught about 99.2% of human-detected reconciliation errors when replayed against 8 weeks of historical data.
Turned a "skilled-veterans-only" weekly task into something anyone on the team can run.
Design retrofitted unexpectedly well to dual process theory, Gestalt psychology, and anchoring-bias defense.
Source business records never reached an LLM. Deterministic pipeline + human review only.

1. The Hidden Problem: When 500 × 500 Becomes a Cognitive Wall

Many companies maintain the same business entities across multiple systems.

A retailer tracks SKUs in an internal master AND on Amazon / Rakuten / Shopify exports.
A clinic carries patient records in both an EMR and an insurance billing system.
A manufacturer holds internal inventory but also receives partner inventory feeds.
An accounting team reconciles general ledger entries against bank statements.

Why is 500 × 500 hard?

The 250,000 problem

Brute-forcing this is computationally similar to running a 1,000-node full-mesh ping check vs a flat 1,000-node liveness check. Order-of-magnitude different load.

Working memory overflow

"Short to do" doesn't equal "easy to do" for cognitive labor.

Reproducibility decay

A one-off reconciliation can be brute-forced. But when the task repeats weekly across 10+ weeks, judgment drift becomes unavoidable:

"Last week I matched 'A Co.' and 'A. Company' as the same entity. This week I treated them as different."
"Last week I tolerated typo X. This week I rejected it."

This drift is what really breaks data quality long-term. It's the same structural failure mode as "config review standards differ by reviewer" in infrastructure operations.

The actual target

So the real problem the tool solved was not "shorten 3 hours per week" but:

250,000 judgments × 10 weeks of consistent reproducibility — a quality bar humans can't physically sustain — backed by a deterministic machine.

Plus removing the skill dependency. "Only one veteran can do this in 3 hours" is a single point of failure. After the tool: anyone could run it with consistent quality.

2. Background: Who I Am and What I Was Solving

Translation: design / validation / runbook discipline solid. Python and application development essentially unfamiliar.

This article is not a "look what I shipped" piece. It's a record of how operations-side disciplines transferred unchanged into AI-assisted software work in an unfamiliar domain.

Who this article is for

Reader	Useful sections
Operations / SRE engineers exploring AI assistance	Everything
Mid-career engineers moving across technical domains	Background, Architecture, Cognitive Design
Engineers new to AI-assisted development	Architecture, Cognitive Design, PII
Managers thinking about AI for their teams	Results and the cognitive-load argument

3. PII / Compliance Considerations

A question that always comes up in comments on entity-resolution articles: where does the data go? Worth answering up front.

In this implementation:

Source business records never reach any LLM. Both input files (internal master + external system export) are read locally by a Python script.
Matching is fully deterministic. Pandas, openpyxl, and difflib.SequenceMatcher for similarity. No embedding API. No remote inference at runtime.
The LLM's role is code-side, not data-side. Claude Code helped write the matching logic, the validation scripts, the design review, and the documentation. None of the actual records were ever sent.
For testing only, masked synthetic data was used in prompts. Real names, amounts, and addresses were replaced with synthetic equivalents before any prompt left the local environment.
Edge cases stay with humans. When the deterministic pipeline can't decide, it surfaces a flagged row for human review — not for LLM second opinion.

This separation is intentional. The matching task is well-suited to deterministic logic. LLMs would only add cost, latency, and compliance exposure for no quality gain.

If your team has even a soft "no business data into external AI" policy, this pattern is fully compatible.

4. Architecture: Two-Stage Matching + Cognitive Gates

Stack

Python 3.11
pandas + openpyxl (Excel I/O, color-coded output)
difflib.SequenceMatcher for fuzzy similarity
Rule-based throughout. No machine learning.
~1,100 lines, single script.

Phases

Phase 1: Match by exact stakeholder name (or alias group)
Phase 2: Cross-match by name similarity ≥ 0.6 (rescue typos)
Phase 3: Last-name-only + structural match (single-typo tolerance)
Phase 4: Duplicate-registration detection (same stakeholder + similarity ≥ 0.8)
Phase 5: Rescue rows with no stakeholder name (attribute match)
Phase 5.5: Attribute-mismatch pair rescue (identifier similarity ≥ 0.7, stage 2)
Phase 6: Row generation + color decision

The score function (key gates)

def compute_score(row_a, row_b):
    # Hard gate: region must match — kills cross-region false positives
    if region_a != region_b:
        return 0.0
    # Hard gate: numeric attribute must be close enough
    if abs(value_a - value_b) > THRESHOLD:
        return 0.0
    # Identifier gate: row_b's identifier must be embeddable in row_a's identifier
    if not is_identifier_match(addr_a, identifier_b):
        return 0.0
    # Sub-identifier gate: anchoring-bias defense
    if sub_id not in addr_a:
        return 0.0
    # Soft scoring (only after every hard gate passed)
    score = max(identifier_match_score, similarity, value_fallback)
    return score if score >= 0.6 else 0.0

Why this shape?

Hard gates first. "Different region" or "value difference > N" are absolute disqualifiers. Run them before any expensive similarity computation.
Soft scoring last. Once hard gates pass, compute similarity — but cap below 0.6 as "uncertain, surface to human."

Why not ML / Vector DB / embeddings?

Abstracted structure

Domain-specific term	Abstract concept
Item / SKU	Entity
Stakeholder (vendor / agent)	Stakeholder attribute
Price / Amount	Primary numeric attribute
Address / Location	Identifier (multi-attribute)
Building / SKU name	Auxiliary identifier
Detail number / barcode	Sub-identifier
Format variation (kana/latin/case)	Data quality issue
Domain judgment	Tacit knowledge

5. Cognitive-Science Design Principles (the Twist)

5.1 Dual process theory (Daniel Kahneman)

The two phases map onto two thinking modes:

System 1 (fast) = Phases 1–5. Fuzzy "is this roughly the same thing?" — similarity scores, identifier matching, attribute closeness.
System 2 (slow) = determine_color(). Strict checks for value mismatch, format inconsistency, identifier mixing.

Color-coded human review gets the System 1 fuzzy pass plus the System 2 strictness annotation, which is exactly the input shape humans need to make a final call.

5.2 Gestalt psychology

Humans recognize "wholes," not character sequences. iPhone15 and iPhone 15 Pro Max feel like the same product family even though strict string equality fails. So:

def is_identifier_match(addr_a, identifier_b):
    """Recognize chunked identity even with mixed scripts and separators."""
    chunks = re.split(r'[A-Za-z0-9\s\-_]+', identifier_b)
    return all(chunk in addr_a for chunk in chunks if len(chunk) >= 2)

Matching by chunks survives whitespace, separator, and script variation.

5.3 Anchoring & confirmation bias defenses

Hard gates exist to deny human-style intuitive shortcuts:

"Same price, must be the same item" — rejected by sub-identifier gate.
"Same name, must be the same person" — rejected by region gate.

The machine's job is to be coldly skeptical exactly where humans get over-confident.

5.4 Reducing human cognitive load (Human-in-the-Loop)

When a human is asked to confirm a flagged row, they don't get an opaque "match score 0.62". They get a one-line annotation:

Same entity matched | [Value mismatch] diff ¥2,000,000 (5.4%)
(A: ¥34,900,000 / B: ¥36,900,000) · identifier format inconsistent

The human doesn't waste cycles re-deriving why the row was flagged. Cognitive load drops sharply.

5.5 Don't automate the ghost

Tightening the logic isn't a path to recreating the ghost.
It's a path to revealing where the ghost is needed.

Mapping summary

Cognitive concept	Implementation
System 1 (fast)	Phases 1–5 (fuzzy matching)
System 2 (slow)	`determine_color()` strict checks
Two-stage / dual-pass	Stage 1 + Stage 2 (Phase 5.5)
Gestalt grouping	`similarity` / `is_identifier_match`
Anchoring defense	Sub-identifier gate, identifier gate
Cognitive load reduction	Aggregated `[reason] diff X` annotations
Human-in-the-Loop	Caution signals for tacit-knowledge zones

6. Results

Recall on 8 weeks of historical data

Metric	Value
Errors flagged by humans (excluding outlier weeks)	~130
Errors caught by the tool	~129
Recall	~99.2%

The single missed case was annotated by the human reviewer as "even a human couldn't decide here." Effectively the tool catches every case where a human commits a confident verdict.

(Caveat: this is recall against 8 weeks of one team's data, not a benchmark claim. Different domains will need their own measurement.)

Time and skill load

Item	Before	After
Skilled veteran throughput	~3 hrs/week	~30 min/week (review only)
Newcomer throughput	half a day to full day	~30 min/week
Skill dependency	Yes (single point of failure)	No (anyone can run it)

The time number understates the value. The real shift is breaking the skill SPOF. Veteran out sick, leaves, or buried in another priority — work continues at the same quality.

A note on false positives

7. The Flowchart

Drawing the judgment flow as diagrams surfaced things the code review didn't. Below are the four phases as separate figures, in execution order.

7.1 Phase 1: Hard Gates (sequential disqualifiers)

Region → numeric value → auxiliary identifier → sub-identifier. Each gate is an absolute disqualifier: any "No" drops the pair. The order matters — cheapest disqualifiers run first.

7.2 Phase 2: Soft Match

Once a pair clears all hard gates, compute_score evaluates a soft similarity. Below 0.6 → drop. At or above → lock the pair as the same entity.

7.3 Phase 3: Parallel Flag Checks

7.4 Phase 4: Final Verdict and Drop Aggregation

Aggregate the tags into a color verdict. Drops from Phase 1 and Phase 2 converge into the "Unmatched" lane, surfaced standalone in the human-review output.

Things visible only after rendering as a diagram

These were invisible while reading code, only obvious once drawn:

Phase 1 hard gates are ordered by computational cost. Region → numeric → auxiliary → sub-identifier. I placed them by intuition; the diagram showed they were already optimal — cheapest disqualifiers first.
Phase 3 parallel flag checks are genuinely independent. Six checks fire in parallel with no early-return contamination. The diagram confirmed there was no silent dependency between them.
All Drop1–Drop5 paths converge to the same Unmatched node. I was throwing away the drop reason. Re-running "why was this pair rejected?" was impossible. Fix: log the drop reason in the row annotation.

Drawing the flowchart is roughly the same act as drawing an infrastructure topology before going live. The diagram is the rubber duck.

8. Wrap-up

Three transferable lessons from this build:

Cognitive load is the hidden cost of "short" repetitive judgment tasks. Headcount-hour math undersells the burnout reality and skill-SPOF risk.
Cognitive science principles fall out of good design retroactively. I didn't design with them in mind; the principles became visible only through structured review (with a second AI). If your design retrofits to known principles, that's confirmation. If it doesn't, that's a smell.
LLMs do NOT have to touch your data. Most entity resolution work doesn't need them at all. Use them for code, design review, and documentation. Keep the business records local and deterministic.

9. What's Next

Comments on entity resolution, cognitive load in repetitive tasks, or cross-domain engineering experiences are welcome.