DEV Community: Rafael Costa

Confront, Don't Assert

Rafael Costa — Tue, 30 Jun 2026 03:00:00 +0000

Why your AI code auditor should be able to tell you how it's wrong

A docstring told me the write was safe. So did the architecture doc. Both had been reviewed; both had been trusted. They described a liveness update — the kind that flips a record to "in progress" — and they were specific about the guard: the write, they said, was conditional on the record's status, so a finished record could never be re-opened by a late, racing writer. Compare-and-swap, in the filter. Textbook.

The code matched the record by its id. Only its id. The status predicate the docs promised wasn't there.

Nothing was on fire. In practice there was effectively one writer, so the race the missing guard would have allowed had never been provoked. But the guard the documentation described — the one a future maintainer would read, trust, and then decline to add because "it's already handled" — did not exist. The doc wasn't stale. It had never been true. And it was confident enough that no reviewer had thought to check it against the single line of code it claimed to describe.

I didn't find that by reading carefully. I found it because I'd built a tool whose entire job is to assume a confident claim is guilty until reconciled against reality — and to tell me, in the same breath, how it could be wrong about that.

That second clause is the whole point of this essay.

The thing we built backwards

Here is the shape of AI code tooling in 2026. The field already found the axis that matters — it just enshrined the wrong end of it as the default.

On one side, the precision-first reviewers, tuned to stay quiet: catch less, almost never cry wolf. On the other, the recall-first camp, willing to eat noise to catch more. The split is real and measurable — run them against the same bugs and the spread is enormous. On Greptile's own 50-PR benchmark — real production bugs, reconstructed from the commits that fixed them — Greptile catches around 82% to CodeRabbit's 44%: the recall-tuned tool against the precision-tuned one that trades catch rate for a quieter signal. Two philosophies, both shipping, both defensible.

The quiet one has a real case, and I'll concede it its strongest version. If you're interrupting a developer mid-flow at pull-request time, quiet is correct. Noise has a cost and a human pays it. OpenAI saw it building CriticGPT: trainers preferred its critiques to the baseline's 63% of the time, partly because it produced fewer unhelpful nitpicks. Fewer false positives is not a dumb goal. In the right room it's the only goal.

So I'm not here to call the low-noise instinct stupid. Two things are still wrong, and both are worse than a mistuned threshold.

The first is the variable. Confidence is the wrong thing to threshold on, whichever end you set it to. A model's confidence is not a measurement of correctness — it's a property of the text it trained on, overwhelmingly written by people who sound sure. A 2026 study called TRACE, out of UT Dallas, measured this on the exact task in my opening story — does the code honor the documentation? — across seven models, and found their confidence poorly calibrated for six of the seven. The number you'd filter on doesn't track whether the finding is true. So filtering by confidence doesn't filter for truth. It filters for fluency — the same failure mode that produced the doc that lied.

And look at what that filter lets through. The measured weakness of LLM judges isn't volume; it's recall. In summarization-consistency tests from the GPT-3.5 era, judges caught only 30 to 60 percent of the inconsistent summaries — they miss real defects far more readily than they invent fake ones. The dominant defect of an AI auditor is the false negative, not the false positive. "Surface only high-confidence findings" trims the loud false positives and leaves the silent false negatives exactly where they were. It optimizes the one number that was already fine.

It gets sharper. The same study put numbers on where these auditors go blind, and they land on my story with uncomfortable precision. When the documentation is plainly wrong, the models catch it well — 67 to 94 percent of the time; a false claim is a loud, local contradiction. The failure case is the inverse: documentation that stays plausible while the code quietly fails to honor it. Detection of that case falls by more than forty points in the worst case — and because the confidence is miscalibrated, the model can't even tell you when it's standing in the blind spot. That is the doc that lied. It didn't recommend a bad guard. It described a good guard as done — plausible on its face, false only against the code — and a claim that's right on the merits and wrong only about reality is the hardest thing there is for an auditor, or a human reviewer, to see. Which is why it sat untouched while everyone who read it nodded.

The second is the room. Every one of these tools, loud or quiet, fires at pull-request time — into a developer's flow, where noise genuinely costs something. Point the same posture at code that's been sitting still for a year and the economics invert. No flow to interrupt, no reviewer mid-thought, nothing to protect from noise. The reason to be quiet evaporates — and what's left is a tool still trained to hush about the one place, standing and trusted and unexamined code, where silence costs the most.

The doc that lied is what that quiet looks like from the inside. Not an exotic bug. The ordinary one: a claim that sounds true, that everyone trusts, that nothing was built to falsify.

Confront, don't assert

So here is the principle the tool is built on, and the one I'd defend against the whole low-noise consensus:

A claim is worth nothing until something could falsify it. Not the code's claims — the auditor's. If my tool tells you a write is unguarded, it has to tell you, in the same finding, the condition under which it's wrong: "This is incorrect if an upstream caller already restricts this path to in-flight records, making the filter predicate redundant." That sentence is not a hedge. It is the most useful line in the report, because it tells you exactly where to look to refute the finding — and if you can't refute it, you've learned something real. It's the physics habit, ported: you don't trust a result because the computation produced it — you trust it because you tried to break it and couldn't.

Three commitments fall out of that, and they're the opposite of how the field is trending:

Every finding states how it could be wrong. A falsifiability clause per finding. This sounds like a small UI nicety. It is actually a different epistemics: it forces the auditor to hold a model of the world specific enough to be attacked, instead of emitting a vibe with a confidence score stapled to it.

Silence is not coverage. If an auditor finds nothing, that is not the same as "there is nothing." It might mean it didn't look. So the report has a first-class section for what was checked and found clean — the guarded sibling of the unguarded write, named, with the reason it's fine ("the predicate is in the filter; this one is correct"). An empty findings list is a claim too, and it should carry its evidence. The audit standards humans have used for decades already work this way: scope, what was examined, what was found. AI review quietly dropped that and kept only the alarms.

A finding you can't refute is decoration. Before a finding ships, an adversary runs against it whose only job is to kill it: does the racing path actually co-execute? Is the missing guard present one layer up? Is the cited path even reachable? Only the findings that survive the attack are reported. This is the same move that makes a courtroom adversarial rather than a press release — and it's astonishing how much "AI found a problem" content would not survive it.

The mechanism here isn't unheard of, and I won't pretend it is. Greptile's newer versions run an agent that chases a flagged issue down its call chain to decide whether it's real before it ever reaches you — the same instinct, kill the finding before the human sees it. The difference I'm drawing isn't that an adversary exists. It's that the adversary's verdict belongs in the report. The surviving finding ships with the refutation it survived — the condition someone tried to meet and couldn't — so the reader inherits the attack, not just its verdict. A verification you run and discard protects the tool's precision. A verification you hand the reader protects the reader's judgment. Only one of those survives you walking away from the keyboard.

None of these are about making the auditor smarter. They're about making it honest — and honest in a way you can check.

The auditor has to be falsifiable too

There's a trap one level up, and it's the one that makes most of this rhetoric worthless in practice. An auditor that prizes falsifiability can still be vacuous: it can "pass" by finding nothing, or by emitting confident-sounding findings that don't actually depend on the code. How would you know? The auditor sounds great either way.

So the discipline has to apply to the tool itself. Mine ships with planted-bug fixtures: tiny pieces of code with one known defect each. The rule is red-green — fix the planted bug, and the finding must vanish. If it doesn't, the auditor wasn't detecting the defect; it was pattern-matching something incidental. And a second corpus strips the explanatory comments out, because if your "analysis" is really just paraphrasing a comment that hands it the answer, removing the comment should break it — and if it doesn't break, the finding was never analysis.

This isn't novel as an idea; it's how OpenAI validated CriticGPT, by paying people to insert subtle bugs and measuring whether the critic caught them. It's the same instinct as mutation testing: a test suite that can't be made to fail by a deliberately broken program isn't protecting anything. What's strange is how rarely it's applied to the AI auditors we're shipping. We test the code. We don't test the thing that judges the code. A test that can't go red is inert — and an auditor whose findings can't be made to vanish is decoration with a progress bar.

I'll say the uncomfortable version plainly: most of the "AI reviewed your code" you can buy today cannot show you a single planted defect it reliably catches and reliably stops catching when fixed. Ask for that demonstration. It's a fair question, and it's the whole game.

If it only works when you're in the room

There's a test for whether a discipline is real or just a preference: hand it to something that can't ask you anything, and see if it holds. I write prompts for remote agents — no memory of my standards, no shared context, no way to come back with a question — and the rule I put above all others is the same blade this essay is about, aimed at a machine: reading the bug-report text counts as zero evidence; a verified reproduction counts as one. Don't act on zero. The prompt has to teach the agent to verify its own work, because nobody else will — I've watched one process authoritative-looking data straight into a confident, wrong conclusion for want of exactly that check. A conviction you can compress into an instruction a stranger obeys — confront the claim before you act on it — is a conviction, not a taste. If it only works when you're in the room, it was never a discipline.

Why retrospective, and why per-pillar

One more design choice, because it's where the doc that lied actually lived.

The crowded, well-funded part of this market reviews diffs — it gates the change at pull-request time. That's valuable and I'm not competing with it. But the doc that lied wasn't in a diff. It had been sitting in the repository, unexamined, the whole time. There was no change to gate. It would have sailed through every pull-request reviewer on earth, because none of them were looking at code that wasn't moving.

The rest of the market splits two ways, and the doc that lied slips both. The static analyzers — Sonar and its lineage — read the whole codebase, but with rules: deterministic checks for complexity, duplication, known-bad shapes. "The docstring claims a guard the code doesn't implement" is not a rule you can write in advance. Neither is "this lock ordering can deadlock," or "this abstraction has one caller and is pre-paying for a future that never arrived," or "this retry has no backoff and will become a stampede under load." Those are judgments, not patterns.

The newer codebase-aware reviewers — Greptile, Qodo — can make judgments like that. Full-repository context, prose reasoning, not a rulebook. So that ground isn't empty, and I won't pretend it is. But their whole machinery is pointed at a change. The standing code only gets read insofar as a diff reaches into it; leave the file alone and the file goes unexamined.

So the empty seam isn't "whole-codebase, prose-reasoned" — that's occupied. It's the conjunction the silent failures actually require: retrospective, triggered by nothing, run against code that isn't moving — one quality dimension at a time — every finding carrying the condition that would refute it — and the auditor itself held to the red-green discipline above. Not any single one of those. The stack. That specific stack is where almost nobody is, and it's exactly where the expensive, silent failures hide.

What's free, and what isn't

The framework is meant to be free — and that's a stance, not a footnote. The laws, the phase structure, the eval harness, the falsifiability discipline: that's method, and method is meant to travel. What isn't free, and shouldn't be, is the calibration — the invariant catalogs and failure-mode libraries tuned against real systems, the part that turns a general method into something that knows what tends to go wrong in your kind of codebase. That line — method open, calibration earned — is the whole shape of the thing, and I'm clear about it up front, because what poisons "open" tools is moving that line later. It was never the other way around. The method is the gift; the calibration is the craft. Reading how I'd audit a system is not the same as being able to do it on yours — which is, not coincidentally, the entire reason a published methodology has ever been worth publishing.

The invitation

Point an auditor at your own code and watch what it does with uncertainty.

If it finds nothing, make it tell you what it checked. If it finds something, make it tell you how it could be wrong — and then try to make that finding vanish by fixing the thing it pointed at. If you can't make it go quiet by fixing the bug, it was never really watching the bug.

That's the whole idea, and it's smaller than it sounds. Confront, don't assert. Silence is not coverage. A finding you can't refute is decoration. The field has plenty of tools that sound sure. I'd rather ship one that can tell you, precisely and falsifiably, where it might be wrong — because the doc that lied was confident too, and confidence is exactly what fooled everyone who read it.

Sources — every figure above is checkable, which is the only honest way to end an essay called this.

Catch rates (≈82% / ≈44%) and the precision-vs-recall split: Greptile's own 50-PR benchmark. Read it as a vendor benchmark — it crowns itself — and note the bugs are real production bugs traced from their fix commits, not synthetic.
CriticGPT, 63% preference (partly for fewer nitpicks): OpenAI · paper.
LLM judges catching 30–60% of inconsistent summaries: scoped to GPT-3.5-era summarization-consistency evaluations — survey of the literature. The direction generalizes; that exact figure does not.
Documentation auditing vs. code-drift detection — the six-of-seven miscalibration and the detection collapse when code drifts under a plausible doc: TRACE, arXiv:2604.03447. The paper reports that collapse twice and disagrees with itself — 7–42 points in the abstract, 21–43 in the results — so the line above states only what holds under both. Which is, fittingly, the whole point.

Software Architecture as Educated Coarse-Graining

Rafael Costa — Wed, 22 Apr 2026 10:26:16 +0000

Every serious practitioner agrees that software architecture matters. Almost nobody agrees on what the word means. That's an operational problem disguised as philosophical nuisance. If you can't define what counts as architectural, you can't scope an architecture review, you can't decide what belongs in an ADR, and you can't tell a junior engineer why their "small refactor" just changed the system's failure modes.

The field has definitions. None of them are wrong, exactly. But each one captures a piece and mistakes it for the whole.

What the canonical definitions get right — and where they stop

Bass, Clements, and Kazman get closest: architecture is "the set of structures needed to reason about the system." That makes architecture relative to reasoning goals, which is the right move. But "set of structures" leaves the boundary underdetermined — the definition doesn't tell you which structures, or how many, or when to stop. Fowler's "the decisions that are hard to change" focuses on irreversibility, which is genuinely where architectural risk lives, but it's circular — you don't know what's hard to change until you try — and it has no mechanism. Why is something hard to change? Coupling? Coordination cost? Contractual obligation? "Hard" carries all that weight silently. Booch's emphasis on cost of change gets closer to an operationalizable metric, but cost is an output of architectural structure, not a definition of it.

ISO 42010 deserves more careful engagement than practitioners usually give it. Its framework — architecture descriptions organized around concerns, viewpoints, and architecture principles — is richer than the single-line "fundamental organization of a system" that most people quote. The viewpoint mechanism is genuine: it says architecture is always described relative to concerns, which is directionally the same insight the coarse-graining frame formalizes. Where 42010 stops is at the operation itself. It specifies that you should organize by concern and should use viewpoints, but doesn't explain why different concerns produce structurally different architectures, or what makes one concern-relative model more faithful than another. The framework is a governance template. It's not a theory of the operation it's governing.

Each gets something right. None answers the question I actually need: what exactly makes an element architectural, and how would I know?

The definition

Software architecture is a concern-relative coarse-grained model of a system — both reasoning instrument and governing structure — stabilized by its decisions, defaults, and inherited constraints.

An element is architectural to the extent that changing it predictably changes governed system-wide properties, the coordination topology around them, or the non-local cost, risk, or feasibility of future change over the relevant horizon.

That's the descriptive core. It says what architecture is without smuggling in what architecture should be. Bad architecture, accidental architecture, architecture-by-default — all fit. Systems accumulate architecture through legacy constraints and organizational accidents as much as through deliberate choice. A definition that excludes those isn't describing the phenomenon; it's prescribing an ideal. And "system" here includes the organizational context — teams, contracts, compliance regimes — because real systems are governed by organizational coupling as much as by technical coupling. Scoping the definition to code alone would leave out half of what actually governs the system.

The normative layer is separate: good architecture is educated coarse-graining — team-legible, property-preserving, and attentive to future change economics. The gap between the descriptive and normative is where architectural judgment lives. A system always has some architecture. Whether that is educated — whether it preserves what matters and remains legible to whoever needs to reason about it — is the thing you're actually evaluating when you review it.

Three things need unpacking: what "educated coarse-graining" means and why it's more than a metaphor, what the model-governance duality buys you, and why the element test works where previous demarcation criteria don't.

Architecture as educated coarse-graining

The idea that architecture involves multiple views organized by concern isn't new — it's been the field's organizing principle since Kruchten's 4+1 model, and it's embedded in every version of Bass et al. and in ISO 42010's viewpoint framework. What's new is the claim about mechanism: architecture is not just "multiple views for different concerns." It's a property-preserving reduction of information, and the properties you choose to preserve determine the reduction you get.

In condensed matter physics, this operation has a precise name. The renormalization group builds effective theories at a chosen scale by systematically integrating out degrees of freedom that are irrelevant at that scale — while preserving the ones that matter. You don't model a lattice by tracking every electron. You choose which physical property you're investigating — magnetic ordering, transport, superconductivity — and you build a model that's faithful to that property while discarding everything else. The model is the theory.

The distinction between naive and educated coarse-graining is the key. Naive coarse-graining just throws information away — average over a region, collapse detail, and hope nothing important was lost. Educated coarse-graining discards detail selectively, preserving the information that governs the properties you care about at the scale you're working at. The result is a model that's simpler than the full system but faithful — within the scope of the chosen concerns — to the behavior that matters.

I claim that software architecture is this operation applied to a software system.
You're building a coarse-grained model — one that's simpler than the codebase, the deployment topology, and the full graph of runtime interactions. The question is whether your model is naive or educated: did you abstract away whatever seemed unimportant and hope for the best, or did you identify the specific structures that govern your chosen system-wide properties?

A quick example makes the distinction concrete. Say your system uses Celery for async task processing and you need to choose a broker. The naive model says: "Celery talks to a broker. Redis or RabbitMQ — doesn't matter, pick one." Both brokers accept tasks and deliver them to workers, so the model is currently correct. The educated model asks which future system-wide properties the broker choice governs. RabbitMQ's AMQP protocol gives you durable queues, routing keys, dead-letter exchanges, and per-queue priority at the protocol level; Redis implements similar features as application-layer conventions through Celery/Kombu. The difference isn't "can versus can't" — it's where the guarantees live, which governs how much you'd need to re-architect if you outgrow the client-side abstractions. Same current dispatch. Different structural ceiling. The naive model isn't wrong — it's uninformed. It classifies a degree of freedom as irrelevant that turns out to govern a property the system will need.

This framing does real work that "multiple views" doesn't. It explains why different concerns produce different architectures (different properties are governed by different degrees of freedom, so the reduction differs) and why some models are better than others (fidelity to the coupling structure that actually governs the property you're reasoning about, not aesthetic preference). It also names what makes architecture hard: identifying the relevant degrees of freedom requires understanding the system deeply enough to know which details are safe to discard. That judgment is the core architectural skill, and no framework can automate it.

Fairbanks's notion of architectural hoisting — architecture directly owning global properties — is subsumed by this frame: a hoisted property is one that survives the reduction. But coarse-graining also explains why some properties resist hoisting, a point I'll develop below.

Concern-relative and team-legible follow as consequences, not independent axioms. If architecture is a property-preserving reduction, then the properties you choose to preserve determine the reduction — that's the concern-relativity. And the model must be legible to whoever needs to reason about those properties — that's team-legibility. The grain of the model is set by cognitive load: fine enough to preserve what matters, coarse enough that the team can hold it in working memory. Coarse-graining isn't a bonus feature. It's a consequence of the legibility requirement.

Architecture as model and governance

The em-dash clause — "both reasoning instrument and governing structure" — resolves a tension the field has, as far as I can tell, never explicitly named.

The architecture-as-decisions school (Jansen and Bosch, Tyree and Akerman, the entire ADR tradition) treats architecture as the set of decisions that constrain the system. The architecture-as-model school (Bass et al., Kruchten, the viewpoint tradition) treats architecture as a representation you reason with. These are but two aspects of the same thing. Architecture is the governing model stabilized by decisions, where the model makes the decisions legible and the decisions make the model enforceable.

"Stabilized by" rather than "constituted by" is a deliberate choice. Architecture isn't reducible to its decisions — that loses the epistemic role, the model that makes the system thinkable at the right grain. But architecture without enforcement mechanisms degrades: decisions that nobody checks and nothing enforces aren't architectural, they're aspirational. The two roles need each other. Without the model, you have scattered constraints with no way to reason about their interactions. Without the decisions, you have a whiteboard diagram that governs nothing.

Perry and Wolf's tripartite — elements, form, rationale — implicitly contains both sides but never names the duality. Making it explicit resolves a tension practitioners live with but rarely articulate: the same engineer who points at the architecture diagram when asked "what's our architecture?" will point at the ADR log when asked "how did we decide that?". Both gestures are right. Architecture is the governing model those diagrams represent, stabilized by the decisions those ADRs record.

The coarse-graining frame explains why the duality is necessary rather than accidental. A property-preserving reduction is simultaneously a cognitive operation (you need a model simple enough to reason with) and a system constraint (the invariants must actually be enforced, or the reduction is fiction). A model with no enforcement is a wish; enforcement with no model is governance by accident.

The element test

Most definitions go vague here. "Significant decisions," "hard to change," "fundamental organization" — intuitions, not tests. Here's the test:

An element is architectural to the extent that changing it predictably changes those properties, the coordination topology around them, or the non-local cost, risk, or feasibility of future change over the relevant horizon.

"To the extent that" is deliberate. Architectural relevance is a gradient, not just black or white.

The most relevant prior work I'm aware of is Zimmermann's Architectural Significance Test (published 2020, following discussions at ECSA 2020), which provides seven criteria for identifying architecturally significant requirements: cross-cutting impact, high business value, QA sensitivity, first-of-a-kind, and several situational factors. Zimmermann's test is a practitioner checklist — it tells you when to pay architectural attention. The element test I'm proposing operates at a different level: it's a demarcation principle that tries to explain why something is architectural, grounded in a causal framing ("predictably changes governed properties") rather than an advisory one. Zimmermann asks "should this requirement trigger architectural work?" The element test asks "is this element structurally coupled to system-wide behavior?". Those are complementary framings. But the causal grounding is what makes the element test falsifiable: you can name a coupling channel and check whether the predicted propagation actually occurs.

The gradient framing matters practically: it lets you ask "how architectural is this?" rather than forcing a yes/no that the coupling structure of real systems doesn't support.

The element test probes along three dimensions:

Predictable non-local impact on governed properties. You can trace the causal path from the element's variation to a system-wide property change. Not "might, in some scenario, conceivably affect" — trace. "Trace" means: you can name the coupling channel, name the governed property, and describe the mechanism by which variation in one produces variation in the other — before the change is attempted. If you rename a variable and accidentally break a regex in a deployment script that causes a production outage, that's a bug. The variable was not predictably connected to availability. The database's consistency model, the authentication middleware, the message format between services — these are predictably connected to governed properties, and their effects propagate beyond their immediate module boundary.

A fair pushback: couldn't a thorough enough engineer predict the variable → regex → deployment → outage chain and so the variable was predictably connected to availability, just not to you? This is where "predictably" has to be anchored to the architectural model your team is actually reasoning with, not to an omniscient observer. Usually, a coupling channel that only becomes visible by reading deployment scripts isn't architectural under the model in use; it's a channel the model legitimately integrated out. When it bites, you're seeing the IR/UV mixing I'll describe below — not a counterexample to the test, but an instance of the thing the test already says is real.

A related objection: isn't "predictably" doing the same work as Fowler's "hard"? Both depend on the observer's knowledge. The difference is the anchor. "Hard to change" is grounded in nothing — hard for whom, under what constraints, over what horizon? "Predictably changes" is grounded in a traceable causal path articulable ex ante: name the coupling channel, name the governed property, describe the mechanism. The prediction can be wrong — and when it is, that's information about where your coupling map is incomplete. Fowler's test generates no prediction to evaluate; you only discover "hard" retrospectively, which means it can't guide decisions, only narrate the outcome.

Coordination topology. A boundary becomes architectural not only through technical coupling but through the coordination it demands: multiple teams ship across it, a compliance regime inspects it, incident response depends on it, a vendor contract freezes it. The service boundary that two teams own independently is more architectural than the one a single team controls, even if the technical coupling is identical, because changing it requires cross-team coordination whose cost scales with organizational friction.

This isn't Conway's Law as a sociological observation — it's coordination topology as a first-class architectural dimension. The definition absorbs it without a separate theory because coordination topology is an interaction structure, not merely analogous to one. The interacting components include teams, processes, and contracts alongside services and databases, and the coupling channels — approval gates, shared release schedules, cross-team incident ownership — propagate the effects of change non-locally through the organization just as technical coupling channels propagate them through the system. When changing a service boundary requires three teams to coordinate a release, that coordination cost is as architecturally real as a shared database schema.

Future change economics over the relevant horizon. Architecture shapes what you can build next, how expensive it will be, and what becomes impossible. The auth-gateway decision below is the clearest case: it makes adding new services cheaper (auth comes free) while making migration away from the gateway expensive (every service must reimplement). Same current behavior. Different future. The horizon isn't arbitrary — it's the timeframe over which the system needs to remain viable and evolvable for its stakeholders.

A demarcation principle that can only say "yes" isn't a test — it's a rubber stamp. So, let's consider one positive case and one negative case, both broadly legible.

Positive: your team decides to enforce authentication at the API gateway rather than in each service. The element test says architectural on all three dimensions. Governed properties: the system's security posture is now coupled to the gateway's behavior — a misconfiguration there exposes every service behind it. Coordination topology: every team that ships a new service inherits the gateway's auth contract; changing that contract requires coordinating across all of them. Future change economics: adding services is cheaper (auth comes free), but migrating away from the gateway requires every service to reimplement auth — the decision opened one path and narrowed another. You can name the coupling channels, name the governed properties, and trace the propagation before anything goes wrong.

Negative: consider a frontend framework migration — Angular to React, entire UI layer rewritten, every component rebuilt, team retrained. By Fowler's test, this is unambiguously architectural: it's hard to change, expensive, and affects the whole frontend surface. Most practitioners would call it architectural without hesitating.

The element test disagrees — given specific conditions. If the backend API contracts don't change, the deployment topology doesn't change, SLO-relevant behavior (latency, availability, SEO) remains materially the same, the migration doesn't alter release cadence or observability contracts, and a single team owns the frontend, then: governed properties are unaffected, coordination topology is unchanged, and future change economics of the system are the same. What changed is the implementation within a bounded context. It's a large, painful, risky implementation change — but the pain is contained. Nothing propagated non-locally.

The interesting part is where the test draws a finer line than intuition does. If the React app's client-side caching changes the load pattern on the API in ways that affect availability under peak traffic — that specific coupling channel is architectural, and the element test will flag it. But the framework migration itself isn't. Does "predictably" carry too much weight here? Maybe. But "hard to change" can't make the distinction at all. "Predictably changes governed properties" at least generates a prediction you can check. The test separates the pain of a change from its architectural significance, which is exactly the separation practitioners need when deciding what belongs in an ADR versus a project plan.

What this buys you

It gives "technical debt" a scalpel. Under this definition, architectural debt is specifically: decisions that are degrading governed system-wide properties or closing off future change paths you need open. A tangled service class is messy. A Celery task that bypasses the message bus and writes directly to another service's database — that's architectural debt, because it violates an interaction structure and changes the blast radius of future schema migrations.

It tells you when to write an ADR. Can you trace a non-local effect on a governed property or on future change economics? If yes — architectural, document it. If no — implementation.

It resolves "is X architectural?" by making the answer context-dependent. "We use Redis" is architectural if Redis is enforcing a system invariant — pub/sub as the only inter-service communication path. It's not architectural if it's the implementation behind an interface that could be swapped for Memcached without non-local effects. The isolation level is architectural if your application assumes serializable and would break under read committed. Same technology, different architectural status, depending on which invariants it participates in. The element test just asks: does this element's variation predictably change a governed property or the coordination topology around it?

Why the boundary is fuzzy — and why that's a result, not a bug

The whole premise of coarse-graining is scale separation: the micro-level details you've integrated out don't affect the macro-level properties you've chosen to govern. In physics, effective field theories work because scales typically separate — the atomic physics of a crystal doesn't affect its thermodynamics at room temperature. But sometimes scale separation fails. High-energy modes contaminate low-energy observables through channels the effective theory wasn't built to see. This is called IR/UV mixing, and it's structural: no amount of careful coarse-graining can prevent it, because the mixing arises from the coupling structure of the theory itself.

Software systems exhibit a structurally analogous phenomenon. Your architectural model says "the database is behind a repository abstraction, so the database choice is non-architectural." The model is internally consistent. The coarse-graining looks correct. Then your application grows, and the ORM's default transaction isolation level — a detail your model integrated out — turns out to be governing your consistency semantics. The implementation detail coupled to the system-wide property through a channel the architectural model wasn't built to represent. That's invisible coupling: the channel was always there, but the model couldn't see it.

Or: a retry policy with no exponential backoff, buried in a single service's HTTP client configuration, causes a cascade that saturates the message bus under load. That's scale-dependent coupling: the channel only exists at load levels the model wasn't tested against.

Or: a logging format string that includes request headers leaks PII into an unencrypted log store, turning an implementation choice into a compliance violation. That's domain-crossing coupling: the channel connects a technical decision to a regulatory property through a path the technical model doesn't represent.

In each case, a detail the coarse-grained model correctly classified as non-architectural turned out to couple to a governed property through a channel the model wasn't built to represent. This isn't accidental — it's structural. Interface contracts specify what a component does. They can't fully specify what a component is: its latency profile under contention, its failure propagation paths, its resource consumption at scale. Every architectural boundary is a lossy projection from a system's full state space onto the subset the contract exposes, and the information it loses can always couple to system-wide behavior through channels the contract doesn't represent. Ostermann et al. argued at ECOOP 2011 that this is a logical limitation of information hiding, not a practical one — one of the sharpest articulations of the point in the software literature. Spolsky named the symptom ("all non-trivial abstractions, to some degree, are leaky"); Ostermann et al. diagnosed the mechanism. Neither frames the implication for architectural demarcation: if implementation details can always potentially couple to system-wide properties, then no definition can draw a clean — one-size-fits-all — boundary between "architectural" and "non-architectural". The boundary is fuzzy not because our definitions are weak, but because the coupling structure of real systems doesn't respect the separation that definitions assume.

This is why the element test says "to the extent that" rather than "if and only if" — and why the coarse-graining frame earns its keep. It explains why architectural significance resists crisp definition: the system's coupling structure can always promote a micro-level detail to macro-level relevance through a channel invisible at the time the model was drawn. Effective theories in physics have the same structural property. Whether the underlying mechanism is formally identical across physics and software is a stronger claim I'm not making. The structural analogy is explanatory, not decorative, and that's enough.

Where this leaves us

The coarse-graining frame clarifies some things and opens others.

It helps explain the definitions debate: Bass, Fowler, Booch, and ISO 42010 may each be describing one aspect of the same operation — model, propagation symptom, change economics, governance structure — without recognizing the shared mechanism. And it helps explain why the architecture/implementation boundary resists bright lines: the coupling structure of real systems doesn't fully respect scale separation, so the element test offers a dimension to measure along, not a threshold to enforce.

What the definition offers is a mechanism that makes architectural judgment teachable: choose your concerns, model at the grain your team can reason with, identify the elements whose variation changes governed properties or future change economics, and understand that the boundary you've drawn is a function of the coupling structure you can currently see — not a permanent feature of the system. When a "non-architectural" detail causes a system-wide failure, that's not necessarily a failure of judgment. The coarse-grained model didn't fail because it was wrong — it failed because it was incomplete. That's the normal condition of effective models, and knowing it changes how you respond to surprises. It also opens a question for another day: if architectural failure is mis-coarse-graining — classifying a relevant degree of freedom as irrelevant — that should be empirically trackable. But that's another investigation.

The Privacy Is the Architecture: Building an Instagram Bulk Unfollower Under MV3 Constraints

Rafael Costa — Wed, 15 Apr 2026 15:01:12 +0000

The Instagram follower-tool ecosystem has a malware problem. In January 2026, the GhostPoster campaign was found to have spread across 17 extensions with 840,000+ cumulative installs across Chrome, Firefox, and Edge — hiding JavaScript malware inside PNG icon files using steganography. In March, two Chrome extensions were reported to have turned malicious after ownership transfer; in ShotBird's case, researchers documented fake Chrome update prompts used to deliver credential-theft malware. The "privacy policy" is a legal checkbox. The permissions list is where the real policy actually lives.

I built Reciprocity, a Manifest V3 Chrome extension that computes the set difference between who you follow and who follows you back on Instagram, and automates the unfollows. Zero servers, no external dependencies and only two permissions: tabs and storage. Host permissions locked to www.instagram.com and instagram.com. That's it, nothing else.

Privacy isn't a policy or a promise. It's an architecture with no server-side collection path and no third-party exfiltration endpoint.

In Chrome MV3, building this safely forces you into a specific, often painful set of constraints.

The MV3 Two-World Problem

Chrome extensions run content scripts in an "isolated world." You share the DOM with the page, but not the JavaScript execution environment (window). Great for security, but fatal if you need to intercept the page's own network requests before they happen.

To parse a user's Instagram following list without forcing them to scroll a modal for an hour, you have to hook into fetch() and XMLHttpRequest. To do that before Instagram's minified React bundle mounts, your code must run at document_start in the MAIN world.

The catch: MAIN world scripts have zero access to chrome.runtime.* APIs. They can't talk to your background service worker. Ergo, they can't read your extension storage.

So you build a two-world bridge:

content-main.js (MAIN world): Hooks fetch, parses GraphQL responses, drives direct API pagination.
content.js (Isolated world): Orchestrates the state machine, talks to the background service worker, manages the execution queue.

They communicate via window.postMessage. But throwing messages across the window boundary on a public site is fundamentally insecure. Page JS can see them. Page JS can forge them.

Build the Bridge, Then Mistrust It

window.postMessage is a broadcast channel in hostile territory. The W3C WebExtensions working group has had an open proposal since 2021 for a secure replacement — acknowledging that the current approach is fundamentally broken — and no solution has shipped. Grammarly's extension exposed authentication tokens to every website a user visited through an unvalidated postMessage bridge. Twenty-two million users, and JavaScript on any visited page could abuse the bridge to access a session.

So the rule in Reciprocity is simple: the MAIN world is scan-only.

It can contribute observations, but not authority.

When the MAIN world intercepts a GraphQL batch of followers, it sends scan data across the bridge tagged with a __RECIPROCITY__ prefix, a per-scan scanId and a per-pagination-run requestId, plus a rotated 32-char hex bridge token negotiated at session start. The isolated world validates every incoming payload. In the real code, that means checking the bridge source marker, token, scan correlation, phase, and request correlation before accepting scan data. Simplified:

window.addEventListener('message', (event) => {
  if (event.source !== window) return;
  const msg = event.data;
  if (!msg || msg.source !== '__RECIPROCITY__') return;
  if (msg.token !== currentBridgeToken) return;
  if (msg.scanId !== activeScanId) return;
  // validated — process scan data only
});

If the token doesn't match, or if the scanId is stale, the message is silently dropped. The token is not a secret from page JavaScript once the bridge is established; its job is correlation, freshness, and stale-session rejection, not authentication against a page that's already listening. The real security boundary is narrower and stronger: even if a hostile page can forge scan traffic, it still cannot cross the isolated-world/runtime boundary into the unfollow path.

Is this overengineered for a tool that most people will run once a month? Maybe. But the threat model isn't "normal user on a clean page." It's "normal user on a page where any third-party script — Instagram's own ad SDK, a browser toolbar, an injected A/B test — can postMessage into your bridge." That very bridge has exactly one job: make sure that even in that environment, a hostile page can at worst corrupt or spam scan-only data, not cross into the unfollow path.

Destructive actions never originate from the MAIN world. The background service worker accumulates the lists, computes the set difference, and holds the state. When the user clicks "Execute", the background script talks only to the isolated world via chrome.runtime.sendMessage. This is the part of the architecture I'm proudest of — not because it's clever, but because the guarantee is structural. It doesn't depend on discipline, or code review, or "we'd never route unfollows through MAIN." It depends on the fact that the MAIN world physically cannot reach the unfollow path.

The page JavaScript cannot trigger, spoof, or even perceive an unfollow command.

Stop Puppeteering the DOM

The standard approach to scraping a single-page app is DOM puppeteering — scrolling the viewport to trigger lazy loading. I tried this first. It's brittle in ways that compound: (i)if the tab loses focus, Chrome throttles requestAnimationFrame and the scrolling stalls; (ii) if Instagram changes a modal's CSS class, the scroller breaks. You're simulating a human to trick a UI into loading data that already has an API.

That is backwards.

Reciprocity captures the endpoint shape Instagram is already using when available, then takes over with direct cursor-based pagination, falling back to the well-known REST endpoint when needed. content-main.js makes direct fetch() calls to Instagram's endpoints using cursor-based pagination for list extraction — no scroll simulation, no dependency on Instagram's UI rendering pipeline.

Because we aren't relying on UI rendering, the background script spawns a dedicated, unfocused Chrome window for the scan. The user clicks "Scan" and goes back to whatever they were doing. The extension pages through up to 50,000 users per list silently — 800–2000ms between API calls, with a 5-second pause every 50 requests.

Rate Limits Are Part of the Product

Instagram will shadowban you for velocity. Unfollow 300 people in two minutes and your account goes quiet for weeks. The architecture has to absorb that constraint, not defer it to user discipline.

Reciprocity enforces: 20 unfollows per rolling 60-minute window, 100/day hard cap, 3–8 seconds of randomized delay between each request.

Rolling window, not clock-hour buckets. If I execute 20 unfollows at 2:55 PM, I shouldn't get 20 more at 3:00 PM. Both limits derive from a single unfollowSuccessTimestamps array in chrome.storage.local — epoch-millisecond entries, continually pruned to retain only same-day and last-hour entries. One data structure, two constraints, zero drift.

The execution lock is persistent. If the user closes the browser mid-unfollow, the unfollowExecutionState snapshot in storage prevents concurrent bulk runs when they reopen. There's a 90-second stale-lock reconciliation to handle bad exits. The background state machine (idle → scanning_following → scanning_followers → processing → done, plus error and interrupted recovery paths) is the sole truthbearer.

Testing Without a Build System

337 tests are covered through five files and, again, no external dependencies.

The tests use Node.js built-in node:test and node:assert. No Vitest, no Jest, no test runner that needs its own config file. But this created a real constraint: the extension's source files are vanilla scripts, i.e., no module.exports or ESM exports. They're designed to run in a browser, not in Node.

The solution is ugly and deliberate. Each test file copies the pure functions it needs to test inline. validateBatchUser(), normalizeTimestamps(), the message normalization logic — they exist as duplicated source in the test files, extracted by hand from the extension code.

This is a maintenance cost I chose to pay. The alternative was introducing a build step — a bundler that could tree-shake exports for the browser while making them available to Node. For a four-file extension with no external dependencies, a bundler is not simplification. It's a new failure mode wearing a productivity costume.

The same zero-dependency principle that keeps the permission surface minimal keeps the toolchain auditable. There's no package-lock.json because there are no packages to lock in the first place. The npm/package-manager supply-chain surface here is zero.

What the Constraints Produced

Every MV3 constraint I fought against turned into a structural property I'd now defend:

The two-world split forced the MAIN world to be scan-only. Destructive actions are architecturally unreachable from page JS.
window.postMessage is insecure by design, so every payload goes through token rotation and validation. A compromised page can corrupt scan data, but it cannot cross into the unfollow path.
With only tabs and storage in the permission set, there's no cookie-store permission, no webRequest, and no broad host access.

None of this is invisible to the user. The manifest.json is under 60 lines. A skeptical developer can read it in a few minutes and verify that the trust model matches the claim. That verifiability is the entire product thesis.

Chrome disabled Manifest V2 for all users in Chrome 138 on July 24, 2025; Chrome 139 removed the enterprise-policy escape hatch. MV3 is the environment extensions actually live in now. You can spend your time trying to smuggle the old model forward, or you can let the constraints shape the architecture.

The constraints here were the architecture.

Reciprocity is on the Chrome Web Store. Install it, right-click, inspect the source. Don't believe me, verify.

Five Corrections: What an AI Agent Didn't Know About My Production Database

Rafael Costa — Wed, 08 Apr 2026 14:08:54 +0000

And why "just write a better prompt" is the wrong lesson

The AI agent had just pulled 30 days of CloudWatch metrics, parsed them correctly, built a table, and pivoted the entire recommendation based on what it found.
I typed four words: "that's shared cluster data."
The deadlock counts, the connection peaks, the daily patterns — all real numbers from a real database cluster. All useless for the decision we were making. Our application shares that cluster with other services. The agent had processed the data flawlessly and drawn a conclusion from someone else's workload.
The data was flawless. The conclusion was someone else's. That gap has a structure worth looking at.

The setup

Multi-tenant Django application on Aurora MySQL. Low user count, background workers for bulk operations, primary-replica topology. No transaction isolation level configured anywhere — MySQL's default REPEATABLE READ running unchallenged. Sporadic deadlocks in background tasks, stale reads on the replica.
I pointed an AI agent at the codebase and asked it to devise a plan for choosing the right isolation level. Not "switch to READ COMMITTED" — decide what it should be, with evidence.

What the agent did well

Within minutes it had explored the entire codebase, mapped six categories of concurrent access patterns, and evaluated all four MySQL isolation levels against each. Found that no isolation level was configured anywhere — not in Django settings, not in middleware, not in init_command. Identified the specific background tasks doing bulk write cycles, the views with missing transaction boundaries, the delete-then-recreate services structurally vulnerable to race conditions.
This would have taken me a full day of methodical grep-and-read. The agent produced a concurrency map that didn't exist before the session started.
Then it started getting things wrong.

Three corrections, same shape

Over the course of the investigation, I interrupted the agent five times. Three of those reveal the same structural pattern. The other two are different in kind — I'll get to those.
"That's shared cluster data." CloudWatch showed hundreds of peak connections and sporadic deadlocks across 30 days. The agent built a careful analysis from these numbers, but those numbers belonged to the Aurora cluster, not to our application — low user count, handful of background workers, a fraction of those connections. The deadlocks could be entirely from other tenants on the same cluster.
Could a better prompt have prevented this? Sure. "The Aurora cluster is shared; CloudWatch metrics are cluster-wide." I knew that before the session. It's promptable.
But I expected the agent to catch it. It had the user count, the worker count, the CLAUDE.md. There was enough signal to at least question whether those connection peaks belonged to us. It didn't question anything — processed the numbers as ours and kept going.
"Did you check production only, or were staging and dev mixed?" The agent queried SigNoz for database metrics — millions of reads, tens of thousands of writes, clean latency percentiles. SigNoz had separate service entries per environment, and the agent hadn't verified which ones it was aggregating. The CLAUDE.md, the project docs — the namespace separation was right there.
"What about the slow queue workers?" The agent pulled error data for the main API service and the primary background worker. Missed the slow-queue workers entirely — the ones handling the heaviest bulk operations, the exact workloads where deadlocks would actually manifest. I know which queues carry what because I designed the routing. The agent queried the obvious service names and stopped.
The pattern. Each time, the agent had directional context: the codebase, the project setup, the CLAUDE.md scoping the investigation. The shared cluster, the environment boundaries, the queue routing were all inferrable from material it could see.
Fair objection: "So the context was there and the agent missed it. That's a tooling problem." Not a bad objection — better agents will get better at this.
But notice what's actually happening. The agent treats observability data as self-describing, and observability data inherits the topology of the infrastructure that produces it — structure that doesn't announce itself in the query results. CloudWatch doesn't label which connections belong to which tenant. SigNoz doesn't flag cross-environment aggregation. The data just arrives, looking authoritative.
The total context space includes cluster layout, environment naming conventions, service routing, monitoring configuration, team history with past incidents. The hard part isn't supplying that context. It's knowing which piece applies at the exact moment of inference — and that's a judgment problem wearing a context mask.
Better agent protocols will close some of this gap: verification steps before ingesting observability data, environment boundary checks, worker enumeration. Adversarial loops and self-correction prompts — "before drawing conclusions from this data, verify its scope" — would probably have caught all three. But someone has to write that checklist, and it's the engineer who already knows where the agent will drift. You're pre-encoding judgment into process. The human steering doesn't disappear; it moves earlier in the pipeline.
That objection only holds for these three. The next two resist it entirely.

Two different species

Premature convergence. The agent's first plan evaluated REPEATABLE READ versus READ COMMITTED and recommended switching. MySQL has four isolation levels. The plan hadn't touched READ UNCOMMITTED or SERIALIZABLE, hadn't considered per-connection versus per-transaction strategies, hadn't looked at Aurora-specific features. It optimized before it explored. Not all of those gaps matter equally — per-transaction strategy is the one that actually bites — but the point is the agent never even mapped the decision space before narrowing it. Probably because the REPEATABLE READ vs. READ COMMITTED binary dominates MySQL docs and Stack Overflow, so the training data funnels toward it. Strong early evidence triggers convergence, and decision-space awareness is a judgment, not instruction-following, skill.
Adversarial peer review. After the agent recommended READ COMMITTED, I dropped in a document I'd prepared separately: a fact-check showing this is a defensible but contested position. Dimitri Kravtchuk, Oracle's MySQL performance architect, has shown that READ COMMITTED creates per-statement ReadView overhead causing trx_sys mutex contention at scale — up to 2x performance degradation for short transactions. No prompt can pre-load a rebuttal to a conclusion that hasn't been reached yet. That's the human acting as adversarial reviewer, not correcting the trajectory but stress-testing the destination.

Why this generalizes

The isolation level was the vehicle. The pattern applies anywhere the codebase is an incomplete map of the system — capacity planning, migration strategies, incident response — anywhere the agent can analyze what's visible but can't weigh what's implicit.
Providing context and having the agent apply it at the right inferential moment are different problems, and the second one requires holding the system in your head, which is exactly the kind of knowledge an agent is supposed to help you leverage, not replace.

What actually happened

The agent shipped a solid architecture decision record, a team-facing report, a concurrency architecture document, half a dozen issues for the gaps it found, and a PR with the implementation. READ COMMITTED via init_command, with an escape hatch for the two operations that genuinely need snapshot isolation. Days of senior engineering work compressed into hours.
But the final recommendation was correct because a human who understood the system — the infrastructure, the observability configuration, the workload routing, the external literature — kept redirecting the analysis every time it drifted.
Who in the room holds the topology?
That's the mental model problem applied to infrastructure. The agent is a force multiplier — but a multiplier needs something to multiply.
The faulty inferences — despite context that should have been more than enough — cast mental model shadows: business rules, operations structure, and infrastructure constraints are all projections of the same multidimensional creature that can either devour or protect you forever.

GitHub Told Me I Had Merge Conflicts. Git Told Me I Didn't. They Were Both Right.

Rafael Costa — Wed, 01 Apr 2026 14:37:36 +0000

Last month I tried to merge main into stg. Routine sync. GitHub said: "Can't automatically merge". So I ran the same merge locally... and got a clean merge. Zero conflicts.

Same branches. Same commits. Different answer. I've been writing software for years and I genuinely did not know this could happen.

What followed was the kind of debugging session I recognize from physics more than from software: tracing a failure back through layers of structure until you hit the actual constraint that's doing the damage. Except the system wasn't a quantum lattice. It was git's commit graph. And the constraint wasn't an obvious one.

If you've ever been surprised by a merge conflict, or wondered how git's merge-base works, or just want to understand how branch flow design can create weird graph topologies, welcome to the story of twelve merge bases, a diamond-shaped DAG, and the fix that shouldn't have worked.

A quick crash course (the parts that matter)

If you already think in terms of DAGs and merge bases, skip to "Twelve Ancestors." If not, we'll cover three core concepts, and the rest of this article follows from them.

Concept 1: commits are snapshots with parent pointers.

A ← B ← C ← D

Each commit stores a complete snapshot of your files and a pointer back to its parent. That's it. The whole history is a chain of these. Computer scientists call the resulting structure a directed acyclic graph - a DAG. Directed because pointers go one way. Acyclic because you can never follow them in a circle. Every problem I'm about to describe is a property of this graph.

A branch is just a sticky note pointing to a commit. main points to D. Branches are labels, not containers.

Concept 2: merges need a common ancestor.

Consider what happens when you branch off and both sides get new commits. Say we're creating a feature branch from main:

       ← E ← F     ← feature
      /
A ← B ← C ← D       ← main

B is the last commit reachable by walking back parent pointers from both branch tips. Git calls this the merge base — "common" always means "common to the two branches being merged." That definition is load-bearing for everything below.

To merge, git diffs base → main and base → feature, then combines both diffs; this is a 3-way merge of two branch tips and one common ancestor:

        ← E ← F ──╮
       /           M  ← main
A ← B ← C ← D ────╯

If both diffs touch the same line differently, that's a conflict. Everything else merges automatically.

You should notice that M, our merge commit, has two parents: D (main's old tip) and F (feature's tip). This is contrary to a regular commit which has a single parent. That means M is a descendant of both branches that were merged. This seems innocuous, but it's the single property that makes ~~me writing this piece possible~~ everything below work. It's really how ancestry flows between branches, why sequential merges stay clean, and why the diamond requires concurrency. One mechanism, three consequences.

Concept 3: git needs the newest common ancestor.

If git picks an old ancestor as the base, both diffs include changes the branches already agree on. False conflicts everywhere. The newest ancestor minimizes the diff — only what actually diverged shows up.

But what happens when there isn't a single newest?

Twelve Ancestors

Back to my problem.

git merge-base --all origin/main origin/stg | wc -l
12

Twelve merge bases, not one.

When git finds multiple incomparable bases - none descending from any other - it can't just pick one. Its fallback: recursively merge them together into a synthetic virtual ancestor, then use that as the single base for the real merge. Not a real commit in your history, though. It's a temporary in-memory artifact. With twelve bases, that's a cascade of merges-within-merges before the intended one even starts.

Local git merged it cleanly. GitHub's server-side mergeability check didn't.

That was the first surprise. "GitHub runs the same merge I run locally" is close enough for everyday work, but not literally true. GitHub computes PR mergeability in the background using a test merge commit, and historically its server-side behavior diverged from local git often enough that GitHub migrated merges and rebases to merge-ort in 2023. In our case, the important fact wasn't the exact internal path — it was that the server-side check surfaced a graph-topology problem that local git could still resolve.

But twelve merge bases is not normal. Where did they come from?

How twelve diamonds form

Our branch flow had recently evolved into this:

main → working-branch           (branches always fork from main)
       working-branch → stg     (QA)
       working-branch → releases → main   (deploy)

stg is a dead end, a parallel validation lane with nothing flowing out of it. Clean, one-directional pipeline. Working branches are always born from main, so their fork points sit on main's history.

Except two things used to happen that broke the one-directional rule.

First: we merged main back into stg to "stay in sync." Second - and this was the invisible one - developers (not me, pff, of course, cough cough) occasionally ran git merge origin/stg on their working branches to grab something from staging. That branch then shipped through releases into main, carrying stg-only ancestry with it.

The first puts main-only commits into stg's history. The second puts stg-only commits into main's ancestry. Bidirectional flow from intermediate branch states. When two branches each absorb the other's history like that, git calls it a criss-cross merge.

The minimal version

Strip away the branch flow. Two branches, two merges, one diamond:

Start:   main at M, stg at S (diverged from ancestor A)

Person 1: git checkout main && git merge origin/stg
          → M' (parents: M, S)

Person 2: git checkout stg && git merge origin/main
          → S' (parents: S, M)

(Both fetched before either pushed.)

Now trace the common ancestors of M' and S':

M is reachable from M' (direct parent). Reachable from S' too (S' has M as its second parent, a cross-merge). Common ancestor.
S is reachable from S' (direct parent). Reachable from M' as well (M' has S as its second parent, the other cross-merge). Common ancestor.
M does not descend from S. S does not descend from M.

       M
      ╱ ╲
   S'    M'
      ╲ ╱
       S

Two bases → diamond. That's it (;

The key: both merges see the other branch's pre-merge state. If Person 2 had fetched after Person 1 pushed, S' would descend from M' — single base, no diamond. Concurrency is the crucial ingredient.

Your first instinct might be: can't you create this with sequential direct merges? main→stg, then stg→main, then main→stg? No, because of the two-parent property from the crash course. Each merge commit descends from both tips. So when you merge stg→main, main's new tip descends from stg's current state. The next merge (main→stg) sees that result (a commit that already contains stg's history) and there's a single dominant ancestor. Always, no matter how many times you alternate.

How our branch flow produced this concurrency

In practice, nobody on our team was simultaneously merging in both directions. The working branch indirection turned sequential actions, spread across days or weeks, into graph-concurrent events. Here's the mechanism.

Start with stg at state S and main at state M. Both have diverged from their last common point, that is, neither descends from the other "in the near past". For all purposes now, they sit on parallel paths.

The contamination. A developer on branch-B runs git merge origin/stg to pull something from staging. That merge commit has two parents: branch-B's old tip and S. The second parent is the door: stg's entire history is now reachable from branch-B by walking that parent pointer. Branch-B now carries S in its ancestry.

    stg:   ─────── S ──────────────
                    │
              (git merge origin/stg)
                    │
    main:  ─── M ────── branch-B ──

The backflow. Before branch-B ships, someone merges main → stg to "stay current." Same mechanism, opposite direction: BF's two parents are stg's old tip and M. Once more, second parent paves the way to disaster: main's history is now reachable from stg.

    stg:   ─── S ──────── BF
                          ╱
                   (main → stg)
                        ╱
    main:  ─── M ────────── branch-B (still developing)

The carrier ships. Branch-B goes through releases into main.

    stg:   ─── S ──────── BF ─────── ...
                │          ╱
          (via branch-B)  (backflow)
                │        ╱
    main:  ─── M ────── MX ─────── ...

MX descends from both M and S (through branch-B's merge of stg). The two-parent property created the bidirectional flow, and it's about to create the diamond too.

Now trace the common ancestors of stg and main:

M is reachable from stg. How? BF is a merge commit, its two parents are S and M. Walk stg's ancestry back to BF, then follow BF's second parent to M. That's the backflow's two-parent link doing the work. M is also reachable from main directly. Common ancestor.
S is reachable from main. How? MX descends from branch-B, and branch-B merged stg, again two parents, one of which is S. Walk main's ancestry back to MX → branch-B → S. That's the contamination's two-parent link. S is also reachable from stg directly. Common ancestor.

But M does not descend from S. And S does not descend from M. They're on parallel paths: M on main's history, S on stg's history.

None of this matters until someone tries to merge stg and main. That's the triggering event — git needs a single merge base, runs merge-base, and hits both S and M. Remember the filtering rule from the crash course: git only keeps the youngest common ancestors, dropping any that have a younger common ancestor descending from them. But neither S nor M can filter the other out, because neither descends from the other. So both survive, leaving Git stuck with two incomparable bases. It now has to fall back and recursively synthesize a virtual ancestor from S and M.

        M (reached via backflow)
       ╱  ╲
    stg    main
       ╲  ╱
        S (reached via branch-B)

Two paths diverge and reconverge. Same diamond as the toy example, different wiring. The working branch captured stg's state weeks before the backflow captured main's state. Wall-clock sequential. But in the graph, each cross-merge referenced the other branch's pre-merge state. The branch indirection turned sequential actions into the same concurrent topology as two people merging "at the same time".

Each cycle, of course, adds more diamonds: a new backflow that references main before the latest carrier ships, or some new working branch that merged stg before the latest backflow... With six backflows over a few months, you can get twelve merge bases with none dominating the others.

In physics you'd call this a degeneracy — multiple states at the same energy level, no symmetry-breaking mechanism to select one. The DAG had the same problem: twelve ancestors at the same "depth," no easy descendancy relationship to break the tie.

A natural question if your team has a similar branch flow: does merging many feature branches into both stg and main create this problem? No. If branches fork from main and merge into both sides without pulling from stg first, the common ancestors are all fork points on main's linear history. Linear means each one descends from the last. Clear ordering, always one merge base.

    stg:   ── M₁ ───── M₂ ───── M₃ ───── M₄
             ╱         ╱         ╱         ╱
          feat-A    feat-B    feat-C    feat-D
           ╱         ╱         ╱         ╱
    main:  F₁ ── M₅ ── F₂ ── M₆ ── F₃ ── M₇ ── F₄ ── M₈

    Every feature forks from main and merges into both stg and main.
    Fork points are on main's history — each descends from the last.
    merge-base(stg, main) always has a single youngest. No diamonds.

The diamond requires two ingredients: stg-only ancestry entering main (via a working branch that merged stg) plus main-only ancestry entering stg (via a backflow). And critically, each has to happen before the other's result is visible, so they reference mutual past states instead of the merged result. No bidirectional flow, no diamonds.

Why "just pick the newest" doesn't work

My first instinct: why doesn't git use timestamps to pick the most recent?

Because "newest" requires a total ordering, and the diamond creates commits that are incomparable. M1 was created February 25. M2 was created February 28. Neither descends from the other. They sit on parallel paths connected at the top and bottom of the diamond, but not to each other.

Git can only order commits along parent-child chains. Across parallel paths, there's no ordering. Asking "which is newer?" is like asking which is taller, the color blue or a Tuesday.

What git actually does

It doesn't pick a winner. It synthesizes one.

When merge-base returns S and M as incomparable youngest ancestors, git's ort strategy merges S and M into a virtual commit V, and to do that, it needs their common ancestor. Remember, S and M do share ancestry - the old fork point where stg was born from main, the one that was filtered out of the top-level search because S and M are younger. That older ancestor comes back one level down as the base for merging S against M.

So the cascade is:

Top-level merge (stg into main): finds twelve youngest common ancestors, all incomparable.
Recursive resolution: merge them pairwise. Each pairwise merge needs a base, and that base is an older ancestor that was "too old" for the top-level merge but exactly right for resolving the conflict between two of the twelve.
At that lower level, the older ancestor is usually unambiguous: it predates the diamond-creating merges, so there's a single youngest. The recursive merge succeeds and produces a virtual result.
V becomes the synthetic base for the real merge.

With twelve bases, this is why the operation is expensive! It's not twelve straightforward comparisons, but a cascade of merges-within-merges, each needing its own base resolution. Local git's ort handled that depth. GitHub's server-side path didn't.

The fix that shouldn't work

Here's where it got counterintuitive. The fix for twelve merge bases is... yet another merge!

git checkout stg
git merge origin/main
git push origin stg

Wait: wasn't a main → stg merge the thing that caused this? Shouldn't this make it worse?

That was my reaction too. But look at the graph:

BEFORE: 12 merge bases, none dominating the others

    stg ──────────── ...          main ──────────── ...
         ╲        ╱                    ╲        ╱
         MB₁    MB₂                  MB₃    MB₄  ... MB₁₂
         (none is a descendant of any other — all 12 are "latest")


AFTER: git merge origin/main

               stg (pointer moves here)
                ↓
    ── ── ──── M (new merge commit)
              ╱ ╲
    (stg's      main-tip  ← this is now THE single merge base
   old head)       │
                (descendant of all 12 old MBs)

Remember: a branch is just a pointer to a commit. The merge created M and moved stg to point at it. M has two parents: the commit stg used to point at, and main's tip. That makes main's tip reachable from both branches (a new common ancestor). And main's tip descends from all twelve old bases, because main's history contains all of them.

Here's the mechanism. git merge-base only returns the youngest common ancestors of the two branches. The rule: if a common ancestor has a descendant that's also a common ancestor of both branches, the older one is redundant - the younger one already contains everything the older one had - so it gets dropped.

The twelve old bases survived before because none descended from any other. Same depth, parallel paths, no way to filter. But the merge created main's tip as a new common ancestor that descends from all twelve. Now every one of them has a younger common ancestor below it. All twelve filtered out. Only main's tip survives.

It doesn't untangle the diamonds. It buries them.

And nothing is lost, which is where the first concept pays off: "Commits are snapshots, not diffs". Diffs are computed on the fly from whatever base git picks. Main's tip already contains all the content from the twelve old bases - it descends from all of them, so their content is baked into its snapshot. The burial doesn't discard data — it moves the comparison point forward, shrinking the diff to only the real divergence.

After pushing that merge, GitHub's "Can't automatically merge" disappeared.

The fix works because the criss-cross requires a cycle, content flowing through both directions. A single final merge without subsequent backflows creates a new dominant ancestor and stops. No cycle, no new diamond. It's a one-time symmetry-breaking intervention: you introduce a commit that's unambiguously "more recent" than all twelve, and the degeneracy lifts.

Cherry-pick vs merge

One thing clicked during this that I'd never really appreciated enough.

A merge connects two histories, due to the commit-with-two-parents property. Every future merge-base calculation has to account for that connection. A cherry-pick copies a diff as a new, independent commit. No parent link. Git doesn't know they're related (because, fundamentally, they aren't). The graphs stay disconnected with no impact on ancestry and no diamonds.

MERGE: main → stg (to get hotfix H)

    main:  A ── B ── H
                     │
    stg:   C ── D ── M (merge commit, H is a parent)
                    ╱
              parent link created
              → graphs are now connected
              → affects all future merge-base calculations
              → potential diamond


CHERRY-PICK: cherry-pick H onto stg

    main:  A ── B ── H

    stg:   C ── D ── H' (new commit, same diff, NO parent link)

              H and H' have identical content
              but git doesn't know they're related
              → graphs stay independent
              → no impact on ancestry
              → no diamonds

A note on received wisdom: Raymond Chen's excellent "Stop cherry-picking, start merging" series documents how cherry-picks between branches that will eventually merge create time bombs — spurious conflicts, silent reversions, the works. But in a follow-up, he's explicit: "if the two branches never merge, then there's no need to get all fancy with your cherry-picking." Our stg is a dead end. Nothing flows out of it. Cherry-pick is the right tool precisely because the graphs should stay disconnected.

Rebase avoids merge commits, but it doesn't avoid ancestry changes. It replays your branch onto a chosen upstream. In this workflow, rebasing a working branch onto main is fine — your branch stays rooted in main's history. Rebasing onto stg would pull staging ancestry into the branch tip, which is exactly what we wanted to avoid. You also pay the price of rewriting history, but that's a separate tradeoff.

What changed

One merge to collapse existing damage. One flow change to prevent new damage.
A guiding principle to unite them all: stop backflowing with merges.

Working flow:     main → branch (fork)
                  branch → stg (QA, dead end)
                  branch → releases → main (deploy)

If stg needs a hotfix:  cherry-pick from main
Never:                  main → stg, stg → main, releases → stg

The deeper realization was simpler and more uncomfortable: the bidirectional merges that created this mess weren't accidents. They were our process. "Merge main into stg to stay current" was something we did on purpose, routinely, because it seemed like good hygiene. The diamonds accumulated silently for months and nobody noticed... until GitHub's less-tolerant merge path surfaced what local git had been quietly papering over.

GitHub wasn't wrong. It was simply less forgiving. And that turned out to be substantially useful, since it forced us to see a graph topology problem that ort had been abstracting away!

When GitHub says "can't merge" and local git says clean... the question isn't who's right. Both are. They're evaluating the same graph in different contexts - local git running a direct merge in your repo, GitHub running a server-side mergeability check. Knowing that is the difference between debugging for ten minutes and debugging for a day.

The cheat sheet I wish I'd had:

You want to...	Do this	Not this
Update your branch	`git rebase origin/main`	`git merge origin/stg` or `git rebase origin/stg` (imports stg ancestry)
Test a feature	branch → stg	—
Ship a feature	branch → releases → main	stg → releases
Get a hotfix into stg	cherry-pick from main	merge main → stg
GitHub says "can't merge"	Test locally first	Trust the UI
Check merge base health	`git merge-base --all A B \| wc -l`	Assume it's 1

The Mental Model Problem of AI-Generated Code

Rafael Costa — Wed, 25 Mar 2026 15:00:37 +0000

The Mental Model Problem: Why AI-Generated Code Is More Expensive Than It Looks

In physics, you never trust a result just because the math produced it.

You take the output and attack it, check limiting cases — does the equation reduce to something known when you push a parameter to zero or infinity? You plug in extreme values, look for dimensional inconsistencies, and compare it against independent derivations. The computation is merely a tool; the verification is the methodology.
Then, if and only if you can't break the result, you can start to believe it. And it's win-win because, even if you do break it, that means you learned something specific about where the original reasoning went wrong — which is, sometimes, equally or more valuable than the result itself.

I trained as a physicist — years of condensed-matter theory, all the way through a PhD. Now I build and ship software products. The career changed; the verification instinct didn't. And somewhere along the way, I noticed that the discipline that's second nature in physics is almost perfectly inverted in how most developers use AI coding tools.

Much of the industry is converging on one workflow: AI generates code, you review it (as a matter of fact, even the review process is often automated). And the response to the quality problems this creates is to bolt guardrails on top — better review tools, AI-on-AI review chains, automated quality gates. All of that addresses a real problem. But it's addressing it from the wrong end.

There's a different workflow that I've found consistently more effective for nontrivial work, and far fewer people center it as their default:

You write the code. AI tries to break it.

This isn't about who types the first draft, but a cognitive fact: for nontrivial work, AI is often more useful as a critic than as a first author. And once you internalize that, your entire workflow changes.

Why AI Often Works Better as Critic Than First Author

When an AI model generates code, it's predicting plausible token sequences given your prompt. It doesn't have intent. Hell, it frequently doesn't even know your system's history, nor does it understand why you picked one data structure over another six months ago, or what edge case took your team a week to discover. It produces something that looks like a solution. Sometimes it is one, but often it's a sophisticated guess that drifts from your constraints in ways that are expensive to find.

When the same model critiques code, the dynamic is fundamentally different. You've given it a concrete artifact to reason about. Now, it can trace logic paths, check boundary conditions, ask "what happens if this input is null, or negative, or enormous?" and even compare your implementation against known patterns and spot deviations. What's central is this: critique is a constrained task — the model is operating within the boundaries of something that already exists. Generation is more-or-less an unconstrained task — the model is making architectural decisions it can have no basis for.

This isn't just a practical observation. There's a cognitive mechanism underneath it that explains why the difference is so large.

The Mental Model Problem

In nontrivial systems, the most expensive bottleneck is usually the mental model someone holds of how the system works.

When you write code yourself — even rough, incomplete, first-draft code — you're building that mental model as you go. Every decision, even the ones you make quickly, leaves a trace in your understanding. You know why the function is structured this way. You know which constraints you're encoding and which you're deferring. You know where you cut corners and where you were careful.

When AI generates code and you review it, nobody holds the mental model. The AI never had one in the first place. And you're trying to reconstruct that by reading the output — reverse-engineering intent from an artifact that was produced without any. This is possible for trivial code. For nontrivial systems, it's where time goes to die. Even just-a-little seasoned developers know how expensive this is. It's why code reviews are so much more draining than pair programming — in the latter, the mental model is shared in real time; in review, particularly of code you're somehow "far away from," it has to be reconstructed from the artifact alone.

I think of it as the difference between navigating a city you've walked through and navigating a city from a map someone else drew. Both get you places. But when something unexpected happens — a road closure, a detour, a constraint that wasn't on the map — the person who walked the city knows six alternatives. The person with someone else's map is lost.

This is why AI-generated code that "works" can be more dangerous than AI-generated code that breaks. Broken code surfaces the gap immediately - at least it should. Working code that you don't fully understand creates what I call orphaned architecture — a system with no mental model owner. A couple of months later, when something downstream fails, you'll debug a design whose rationale exists only in a conversation history you've long since closed.

What "You Generate" Actually Means

I don't mean "you handwrite every line."

I mean you author the first intent-bearing artifact. This is the thing that gives the system its center of gravity before AI starts expanding it.

That might be the domain model, the core function, whatever invariants must hold, or a couple of key test cases that define correct behavior. It could be the state machine, the architectural skeleton, or a short ADR explaining which tradeoff you're accepting and why.

I'm not coming from a manual purity perspective. The crucial detail is that someone — you — has made the decisions that carry judgment, and those decisions exist in a form the model can now reason about. Once that exists, AI becomes dramatically more powerful, because it's critiquing your structure instead of silently inventing one.

False Velocity and the Missing Mental Model

The evidence is piling up, and it's converging on a pattern that should worry anyone paying attention.

A CMU study accepted at MSR '26, analyzing 807 Cursor-adopting repositories against matched controls, found that velocity gains were real but transient — they faded within months — while code complexity increases were persistent, creating a self-reinforcing debt cycle. An IEEE Spectrum piece from January documented something worse: newer models producing code that doesn't crash but silently fails to do what was intended — avoiding errors by removing safety checks or generating fake output that matches the expected format. And METR's own follow-up revealed that they had to redesign the study because developers increasingly refused to participate if it meant working without AI on half their tasks. The tool that makes you slower has become the tool you can't imagine working without.

The industry's reaction is reasonable: add more review layers. AI reviewers reviewing AI-generated code, quality gates, automated scanning.

This addresses but a symptom of the disease: nobody holds the mental model.

When AI generates code and a human reviews it, the human is doing the most cognitively expensive possible version of review: building a mental model from scratch by reading someone else's output. There's no reasoning to reconstruct. There's no intent to discover. There's just an artifact that looks plausible, and you have to determine whether plausible is correct.

When AI generates code and another AI reviews it, you may catch surface defects — style violations, common security patterns, obvious bugs. But you still haven't solved the real problem: nobody owns the reasoning that gave the system its shape. That's fine for boilerplate, but potentially endgame for code that encodes judgment.

"But My AI Has Full Repo Context Now"

The obvious counterargument: tools have gotten better. Cursor indexes your repo. Claude Code reads your file tree and does the beautiful /init thing. You can inject conventions via agents.md or .cursorrules. Copilot has repo-wide context. Some teams — mine included — have experimented with architectures where a large-context model ingests the entire codebase and compresses it for downstream agents. If AI can see your system, doesn't the mental model problem go away?

That narrows the issue, but doesn't quite close it.

Context-aware tools can see what your code looks like. They can match conventions, follow existing patterns, stay stylistically consistent. That's a real upgrade over a blank-slate chat prompt, and I'm not pretending otherwise. Generated code from a context-aware tool is substantially better than what's produced with a model that's never seen your repo.

But context is not intent. The tool can see that you use a specific pattern for error handling across your codebase. What it can't see is whether that pattern actually represents a deliberate architectural choice or legacy debt you haven't cleaned up yet. It can see your data model, but not which constraints are load-bearing and which are accidental — which fields exist because of a product decision or due to a migration you never finished. It can see what you decided. It can't see what you considered and rejected, which is often the more important half of understanding a system. Context windows capture artifacts, not decision trees.

And here's the part that actually strengthens the inverted workflow: context-aware AI is an even better critic than it is a generator. A model that can see your full codebase, your conventions, your patterns — and then reviews your new code against all of that — catches things a context-free critic never would. "This function doesn't follow the error handling pattern you use everywhere else." "This data flow is inconsistent with how the rest of the system handles state." "Your naming here deviates from the convention in these twelve other files."

Context makes AI-as-critic dramatically more powerful. It makes AI-as-generator incrementally better. That asymmetry is exactly the point.

When AI-First Generation Is the Right Call

I want to be precise about the boundary, because overselling the inverted workflow would be exactly the kind of false clarity I'm arguing against.

AI generation is the right default when the mental model doesn't need an owner, or the path is so well-trodden that the decisions are obvious. Config files, project scaffolding, CORS setup, CI pipeline boilerplate — nobody needs to deeply understand why the YAML looks the way it does. The code doesn't encode intent; it encodes convention. Let the model handle that.

AI generation is also great as a research tool: "show me three different approaches to X" is not asking the model to build your system. It's asking it to widen your field of view before you make a decision. Same with translation ("rewrite this Python function in Go") — intent is fully specified; the generation is mechanical.

The workflow flips when the code should incarnate your actual product decisions. Core logic, business rules, architectural boundaries. Anything where the reason for a design choice is as important as the choice itself. Anything where, if someone asked you "why is it structured this way?", the answer matters.

The Workflow

Here's the concrete version, if you want to try it on one real feature:

Write the skeleton. Not the whole feature — just the parts that carry intent. Module boundaries, data model, core function, invariants, key test cases. Don't optimize for completeness. Optimize for decisions. Every line should reflect a choice you made for a reason you could articulate if pressed. There's no problem with using AI to refine this, brainstorm alternatives, or even generate a thoroughly guided first draft — as long as you understand that the mental model is yours, and the AI is just a tool to help you build it.

Have AI attack it. Not "review this" — that's too passive. Ask for adversarial input: "What inputs would break this? What assumption am I making that might not hold? Write tests that target the riskiest parts of this design. Argue against my architectural choice — under what conditions is it the wrong call? Am I overlooking any established patterns that would solve this more robustly?" The goal is to find the holes in your design, not just surface-level defects.

Fix what the critique reveals. Because you designed the system, you'll know exactly where each fix goes. No reverse-engineering required. This is where the convergence advantage is most obvious.

Then let AI expand. Once the core is solid and yours, hand AI the periphery: documentation, error messages, logging, additional test cases, boilerplate around the edges. This code is easy to verify because you have a clear architectural spine to compare it against.

The first time you try this, it'll feel slower. You'll miss the rush of watching code appear.
Give it one full feature cycle, though.
Then compare not just time, but confidence in what you shipped and speed of the next change in that module.

The Discipline

Every conversation about AI coding eventually arrives at the same question: how much can AI do?

I think the more useful question is: where is the mental model, and who owns it?

For boilerplate, nobody needs to own the mental model. Let AI generate. For the core of your system — the logic that encodes why your product exists — the mental model is the most valuable artifact you produce. More valuable than the code itself, because code can be rewritten but understanding can't be downloaded that easily.

Physics taught me this before software did. You don't trust a result because the computation produced it. You trust it because you attacked it and it survived. The computation is cheap. The verification is where understanding lives.

The question is not how much code AI can write. The question is whether your workflow preserves a human owner of the system's mental model.

Write the structure, let AI break that, and even use it to explore alternatives or cut corners... but understand the core of what you deliver.
The mental model is the most expensive thing in your system. Don't let it become an orphan.