<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Connor Hickey</title>
    <description>The latest articles on DEV Community by Connor Hickey (@conalh).</description>
    <link>https://dev.to/conalh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946778%2Fa162b091-556f-4cd1-adb3-2ae70ede27d4.jpeg</url>
      <title>DEV Community: Connor Hickey</title>
      <link>https://dev.to/conalh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/conalh"/>
    <language>en</language>
    <item>
      <title>Before the First Token -- The AI Coding Interview as Preregistration, Not Prompt Theater</title>
      <dc:creator>Connor Hickey</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:26:58 +0000</pubDate>
      <link>https://dev.to/conalh/before-the-first-token-the-ai-coding-interview-as-preregistration-not-prompt-theater-495k</link>
      <guid>https://dev.to/conalh/before-the-first-token-the-ai-coding-interview-as-preregistration-not-prompt-theater-495k</guid>
      <description>&lt;p&gt;The coding interview has always been a compromise with a bad conscience.&lt;/p&gt;

&lt;p&gt;It tries to answer a serious question — can this person do the work? — by staging a performance that only partially resembles the work. A candidate is placed in front of an interviewer, handed a small problem, asked to think aloud, and expected to produce code under conditions stripped of nearly everything that makes software engineering real: the existing codebase, the tests, the team's conventions, the build system, the stale documentation, the product boundary, the hidden invariant, the reviewer who will ask why the diff is so large.&lt;/p&gt;

&lt;p&gt;The industry knows this. It complains about LeetCode, whiteboards, anxiety, memorized patterns, and the strange theater of narrating thought while being watched. Then it keeps the ritual anyway, because the ritual has one defensible property: it produces a signal. Not always a fair signal. Not always the right signal. Some signal. A person who can reason clearly, write code, and recover under pressure probably has technical ability. A person who cannot may still be good, but the interview has no clean way to see it.&lt;/p&gt;

&lt;p&gt;Artificial intelligence does not remove this problem. It sharpens it.&lt;/p&gt;

&lt;p&gt;If software development increasingly happens with AI assistance, an interview that bans AI begins to resemble a purity test for a version of the job that is already aging. At the same time, an interview that simply allows AI can become worse than the old ritual. It may confuse the model's competence with the candidate's competence. It may reward tool access, private prompt libraries, hidden assistance, or the luck of a good completion. It may replace LeetCode theater with prompt theater.&lt;/p&gt;

&lt;p&gt;The important question, then, is not whether candidates should use AI in coding interviews.&lt;/p&gt;

&lt;p&gt;The important question is what the interview should make visible once AI is allowed.&lt;/p&gt;

&lt;p&gt;The answer is older than AI.&lt;/p&gt;

&lt;p&gt;Great engineers have always had to enter unfamiliar systems and decide what matters. They have always had to find the relevant files, preserve the hidden contract, identify the meaningful test surface, avoid unnecessary blast radius, and explain why a local change is safer than a heroic rewrite. They have always had to know what not to touch. They have always had to turn ambiguity into bounded work.&lt;/p&gt;

&lt;p&gt;AI does not create that judgment.&lt;/p&gt;

&lt;p&gt;AI creates a chance to preregister it.&lt;/p&gt;

&lt;p&gt;That is the stronger version of the future coding interview. Give the candidate a repository, a realistic task, a fallible coding agent, and a bounded environment. Before the candidate asks the model for anything, require a short pre-agent record: what they believe the task is, which files likely matter, which invariants must hold, what the agent may touch, what it must avoid, what tests would count as evidence, and what "done" means.&lt;/p&gt;

&lt;p&gt;The value of that record is not that it perfectly captures the candidate's judgment. It will not. The value is that it is written before the evidence arrives.&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;p&gt;In science, preregistration does not matter because the first hypothesis is always correct. It matters because the researcher commits before seeing the result. They cannot quietly become the person who predicted whatever happened. The same logic applies here. Before the agent emits a patch, before the tests pass or fail, before the candidate sees what the machine makes easy or hard, the candidate has to write down their theory of the work.&lt;/p&gt;

&lt;p&gt;Then the interview has something the old format rarely had: a record of prior commitment.&lt;/p&gt;

&lt;p&gt;Did the candidate's map find the right subsystem? Did they protect the correct invariant? Did their definition of done survive contact with the patch? Did they revise their theory honestly when the code contradicted it? Did they hold the line against a model suggestion that violated their own stated constraints? Or did they retrofit a story after the agent produced something plausible?&lt;/p&gt;

&lt;p&gt;That is the signal. The artifact alone is not the signal. The defense alone is not the signal. The signal is the full instrument: a pre-commitment, an AI-assisted attempt, the task reality established by tests and review, and a defense of the gap between them.&lt;/p&gt;

&lt;p&gt;The old whiteboard measured the wrong fifteen minutes: keystrokes, syntax, and public performance. A better AI-era interview measures the earlier decision: how the candidate scopes the work before implementation begins, and how honestly they revise that scope when reality pushes back.&lt;/p&gt;

&lt;p&gt;The future coding interview should begin before the first token.&lt;/p&gt;

&lt;h2&gt;This Is a Forecast, Not a Victory Lap&lt;/h2&gt;

&lt;p&gt;AI-assisted development is already common enough that hiring cannot ignore it. Stack Overflow's 2025 Developer Survey reported that 51% of professional developers use AI tools daily. The same survey reported that 66% of developers are frustrated by AI solutions that are "almost right," 45% say debugging AI-generated code is more time-consuming, and 46% actively distrust AI-tool accuracy. That is the world hiring is entering: heavy usage, low trust, and a widening need for human verification.&lt;/p&gt;

&lt;p&gt;Companies are beginning to adapt. Canva announced in June 2025 that backend, machine-learning, and frontend engineering candidates are expected to use tools such as Copilot, Cursor, and Claude during technical interviews. Canva's rationale is direct: its engineers use AI in daily work, so interviews that prohibit those tools fail to assess how candidates would perform in the actual role. Canva is not alone. Google has been reported to be piloting a "code comprehension" round in which candidates read, debug, and optimize an existing codebase with an AI assistant, and Meta has been reported to evaluate AI-assisted interviews partly on verification — the same axis this essay treats as central. These are reported moves, not published doctrine, so they are evidence of direction, not proof of method.&lt;/p&gt;

&lt;p&gt;That direction does not prove AI-assisted interviews are valid.&lt;/p&gt;

&lt;p&gt;It does not even prove that AI tools currently improve senior engineering work in mature repositories. METR's July 2025 randomized controlled trial found that experienced open-source developers working on their own repositories took 19% longer with early-2025 AI tools, and METR framed the result as a snapshot of one relevant setting rather than a universal law.&lt;/p&gt;

&lt;p&gt;That finding is not fatal to this argument. It makes the argument more honest.&lt;/p&gt;

&lt;p&gt;The case for AI-aware interviews is not that AI is already an unambiguous productivity win in every repository. The case is that software work is moving toward human-machine collaboration, and the human skill that matters most in that collaboration is judgment over the tool.&lt;/p&gt;

&lt;p&gt;This essay is therefore making a narrower claim than "AI is the future, so interviews must test AI fluency."&lt;/p&gt;

&lt;p&gt;The claim is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As AI coding agents become normal development tools, interviews should use them to expose durable engineering judgment — especially scoping, context selection, verification, revision, and review — rather than treating generated code as proof of candidate ability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That claim is partly present-tense and partly forecast. The present-tense part is that many developers already use AI daily and some companies already design interviews around it. The forecast is that the most durable hiring signal will not be prompt fluency, turn-budget discipline, or token thrift. It will be the candidate's ability to govern automation.&lt;/p&gt;

&lt;p&gt;Current agent mechanics will rot. Prompt syntax will change. Context windows will grow. Agent interfaces will become more autonomous. Turn counts, file limits, approval modes, and "best prompting practices" may age quickly.&lt;/p&gt;

&lt;p&gt;The durable skill is older than AI: understanding a system well enough to change it safely.&lt;/p&gt;

&lt;p&gt;The interview should bet on that.&lt;/p&gt;

&lt;h2&gt;The Pre-Agent Record&lt;/h2&gt;

&lt;p&gt;A pre-agent work sample begins with a simple rule: the candidate may inspect the repository manually, but may not ask the AI assistant to act until they have produced a short setup record.&lt;/p&gt;

&lt;p&gt;That record might include a task summary, a context map, an &lt;code&gt;AGENTS.md&lt;/code&gt; or equivalent agent-instruction file, a verification plan, a risk note, and a definition of done.&lt;/p&gt;

&lt;p&gt;The format matters less than the timing. The candidate has to commit before the model produces code.&lt;/p&gt;

&lt;p&gt;A weak pre-agent record says: &lt;em&gt;Use clean code. Follow best practices. Add tests. Be careful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is not judgment. That is engineering perfume.&lt;/p&gt;

&lt;p&gt;A stronger record says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The likely surface is &lt;code&gt;validator.ts&lt;/code&gt;, &lt;code&gt;schema.ts&lt;/code&gt;, and &lt;code&gt;validator.test.ts&lt;/code&gt;. The router is out of scope unless the failing behavior cannot be reproduced at the validation layer. Preserve the public error-response shape. Reuse &lt;code&gt;formatValidationError&lt;/code&gt; rather than adding a new formatter. Add a regression test for missing nested array values. Do not add dependencies. The change is done when the new regression test and the existing validator suite pass.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not merely an instruction to the agent. It is a claim about the system. The candidate is making falsifiable commitments: I think this is where the problem lives. I think this is the boundary. I think this is the invariant. I think this test would prove the change. I think these subsystems should remain untouched.&lt;/p&gt;
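&lt;p&gt;One way to see why these commitments are falsifiable is to write the record down as data and compare it with the files a patch actually touched. The following is a minimal sketch, not a prescribed tool: the &lt;code&gt;PreAgentRecord&lt;/code&gt; type, the &lt;code&gt;scope_deltas&lt;/code&gt; helper, and the file name &lt;code&gt;router.ts&lt;/code&gt; are all hypothetical names introduced for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreAgentRecord:
    """The candidate's commitments, written before the agent acts."""
    likely_files: frozenset   # where the candidate thinks the problem lives
    out_of_scope: frozenset   # subsystems the agent must not touch
    invariants: tuple         # contracts that must survive the change
    done_when: tuple          # evidence that would count as "done"


def scope_deltas(record, touched_files):
    """Compare the files a patch touched against the sealed record."""
    touched = frozenset(touched_files)
    return {
        "as_predicted": sorted(touched.intersection(record.likely_files)),
        "unpredicted": sorted(touched.difference(record.likely_files)),
        "constraint_hits": sorted(touched.intersection(record.out_of_scope)),
        "predicted_but_untouched": sorted(record.likely_files.difference(touched)),
    }


# Values mirror the example record above; "router.ts" is an assumed file name.
record = PreAgentRecord(
    likely_files=frozenset({"validator.ts", "schema.ts", "validator.test.ts"}),
    out_of_scope=frozenset({"router.ts"}),
    invariants=("public error-response shape is preserved",),
    done_when=("new regression test passes", "existing validator suite passes"),
)

deltas = scope_deltas(record, ["validator.ts", "validator.test.ts", "router.ts"])
print(deltas["constraint_hits"])          # files touched despite the record
print(deltas["predicted_but_untouched"])  # predictions the patch did not need
```

&lt;p&gt;A non-empty &lt;code&gt;constraint_hits&lt;/code&gt; list is not a verdict. It is a prompt for the review conversation about whether the boundary or the patch was wrong.&lt;/p&gt;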

&lt;p&gt;The agent's output then becomes one source of evidence. It is not ground truth. That distinction is crucial. If the candidate says the router is out of scope and the agent edits the router, the agent's behavior does not automatically prove the candidate was wrong. The model may have overreached. Resisting that overreach may be exactly the governance skill the interview is trying to measure.&lt;/p&gt;

&lt;p&gt;The ground truth is not what the agent did. The ground truth is what the task actually required, established through tests, review, code ownership, and the stated product contract.&lt;/p&gt;

&lt;p&gt;So the useful comparison is three-way:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Object&lt;/th&gt;
&lt;th&gt;What it reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-agent record&lt;/td&gt;
&lt;td&gt;What the candidate believed before automation acted.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent behavior&lt;/td&gt;
&lt;td&gt;What the model attempted, including useful discoveries and overreach.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task reality&lt;/td&gt;
&lt;td&gt;What tests, review, code ownership, and requirements show was actually necessary.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That three-way gap is where the signal lives.&lt;/p&gt;

&lt;p&gt;If the candidate excluded the router and the task truly did not require it, rejecting the agent's router edit is a governance win. If the candidate excluded the router and the bug actually lived there, that is a scope miss. If the candidate accepted an agent edit that violated their own registered constraint, that is a discipline failure. If the candidate revised their scope after a failing test exposed a real dependency, that is not hypocrisy. It is a good update.&lt;/p&gt;
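&lt;p&gt;The four cases in that paragraph can be sketched as a small decision function over one subsystem that the candidate registered as out of scope. This is an illustration under stated assumptions: the &lt;code&gt;BoundaryOutcome&lt;/code&gt; fields, the &lt;code&gt;classify&lt;/code&gt; function, and the extra label "boundary held" are hypothetical, and the choice to report an evidence-backed revision as a "good update" rather than as a scope miss is one possible reading, not a settled rule.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryOutcome:
    """What happened to one subsystem the candidate registered as out of scope."""
    required_by_task: bool   # tests and review show the task needed a change here
    agent_edited: bool       # the agent's patch touched the subsystem
    edit_accepted: bool      # the candidate kept that edit in the final patch
    scope_revised: bool      # the candidate explicitly updated the record, citing evidence


def classify(outcome):
    """Map one excluded subsystem to the labels used in the paragraph above."""
    if outcome.required_by_task:
        # The original map was wrong; what matters next is how it was corrected.
        return "good update" if outcome.scope_revised else "scope miss"
    if outcome.agent_edited and outcome.edit_accepted:
        # The task did not need the edit, yet it stayed in the patch.
        return "discipline failure"
    if outcome.agent_edited:
        # The agent overreached and the candidate rejected the edit.
        return "governance win"
    return "boundary held"


router = BoundaryOutcome(
    required_by_task=False,
    agent_edited=True,
    edit_accepted=False,
    scope_revised=False,
)
print(classify(router))
```

&lt;p&gt;Note that the agent's behavior never decides the label on its own. The first branch depends on task reality, which is the point of the three-way comparison.&lt;/p&gt;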

&lt;p&gt;This is why sequencing matters.&lt;/p&gt;

&lt;p&gt;Real engineering is interleaved. Engineers hypothesize, inspect, revise, run tests, discover new facts, and revise again. A forced pre-agent record is artificial. The artificiality earns its place because it creates a sealed envelope. Without the envelope, the candidate can become the person who always knew whatever the agent eventually discovered.&lt;/p&gt;

&lt;p&gt;The pre-agent phase is not a simulation of the entire job. It is a measurement device. The candidate still gets to revise their theory. In fact, revision should be scored. But the interviewer now has a before-and-after trace: what the candidate believed before automation acted, what the code revealed, and how the candidate handled the gap.&lt;/p&gt;

&lt;p&gt;That is a cleaner signal than watching someone type.&lt;/p&gt;

&lt;h2&gt;The Silent Expert and the Articulate Fraud&lt;/h2&gt;

&lt;p&gt;The preregistration frame solves one problem and exposes another.&lt;/p&gt;

&lt;p&gt;It helps catch the candidate who rationalizes backward. Once the pre-agent record exists, the candidate cannot pretend they always knew the subsystem mattered, always intended to preserve that invariant, or always meant to run that test. The envelope was sealed before the agent acted.&lt;/p&gt;

&lt;p&gt;But preregistration does not solve the silent expert problem.&lt;/p&gt;

&lt;p&gt;Some strong engineers carry tacit judgment. They scope correctly by feel. They recognize danger before they can neatly explain it. Their knowledge is compiled from years of code review, migrations, outages, and the slow accumulation of scars. Ask them to write a polished context map under time pressure, and the artifact may look thinner than one produced by an articulate mid-level engineer who has read every essay about invariants and blast radius.&lt;/p&gt;

&lt;p&gt;That is a real weakness. A legibility device can become a literacy test if it rewards only the people who are good at producing legibility artifacts.&lt;/p&gt;

&lt;p&gt;The inverse failure mode is the articulate fraud.&lt;/p&gt;

&lt;p&gt;This candidate writes a beautiful &lt;code&gt;AGENTS.md&lt;/code&gt;. They name plausible invariants. They produce a neat context map. They keep the diff small. They add a regression test. They write a fluent PR summary. They sound like the kind of person who has read every essay about engineering judgment.&lt;/p&gt;

&lt;p&gt;And they are wrong. They preserved the wrong invariant. They tested the obvious case but missed the contract. They excluded the subsystem that actually owned the behavior. They used the right language for the wrong system. A surface-level rubric will pass them.&lt;/p&gt;

&lt;p&gt;So the interview cannot stop at the written artifact. It needs a defense, and the defense has to be specific enough to separate substance from performance without recreating the stress ritual of the old whiteboard. The interviewer should ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You said this response shape is public. Where is that established?&lt;/li&gt;
&lt;li&gt;You excluded the router. What evidence would make you bring it back into scope?&lt;/li&gt;
&lt;li&gt;You added this regression test. What bug would still pass?&lt;/li&gt;
&lt;li&gt;You reused this helper. What assumptions does it encode?&lt;/li&gt;
&lt;li&gt;You said the diff is minimal. Minimal relative to what?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the articulate fraud, these questions expose the gap between the form of judgment and the substance of judgment. For the silent expert, these same questions give compressed judgment a chance to unpack itself.&lt;/p&gt;

&lt;p&gt;This reintroduces a tension. Live questioning can distort performance. A North Carolina State University and Microsoft study found that performance in traditional technical interviews was reduced by more than half when candidates were watched by an interviewer, and that stress and cognitive load were higher in the public whiteboard condition.&lt;/p&gt;

&lt;p&gt;The pre-agent interview should learn from that rather than reproduce it. The defense should be a review conversation around artifacts: the preregistration, the patch, the tests, and the deltas between them. The interviewer should not hover over every prompt. The candidate should have private work time. The pressure should move closer to the work and farther from performance theater.&lt;/p&gt;

&lt;p&gt;The written record prevents post-hoc rationalization. The defense prevents polished artifacts from becoming cosplay. The two pieces need each other.&lt;/p&gt;

&lt;h2&gt;Compression, Not Context Dumping&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; is useful here because it gives the pre-agent record a concrete form. The official &lt;code&gt;AGENTS.md&lt;/code&gt; project describes it as "a README for agents": a predictable place to provide context and instructions that help AI coding agents work on a project, including setup commands, test commands, style guidance, and repository conventions. OpenAI's Codex documentation says Codex reads &lt;code&gt;AGENTS.md&lt;/code&gt; files before work and stops adding project-instruction files once their combined size reaches &lt;code&gt;project_doc_max_bytes&lt;/code&gt;, which is 32 KiB by default.&lt;/p&gt;

&lt;p&gt;That budget is not just a technical footnote. It reveals the shape of the problem. Instructions are not free. They consume space and attention. They can clarify the work, or they can poison it.&lt;/p&gt;
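&lt;p&gt;To make the cost concrete, a candidate or interviewer could measure instruction text against the budget. The sketch below only counts UTF-8 bytes against the 32 KiB default cited above. It does not reproduce how Codex discovers, orders, or truncates files, and the &lt;code&gt;budget_report&lt;/code&gt; helper and its field names are hypothetical.&lt;/p&gt;

```python
DEFAULT_BUDGET_BYTES = 32 * 1024  # the 32 KiB default cited above


def budget_report(instruction_files, budget=DEFAULT_BUDGET_BYTES):
    """Measure instruction text in UTF-8 bytes against a combined budget.

    instruction_files maps a file name to its text. A negative
    remaining_bytes value means the combined text exceeds the budget.
    """
    sizes = {
        name: len(text.encode("utf-8"))
        for name, text in instruction_files.items()
    }
    total = sum(sizes.values())
    return {
        "sizes": sizes,
        "total_bytes": total,
        "remaining_bytes": budget - total,
        "share_of_budget": total / budget,
    }


# One constraint line repeated 100 times, as a stand-in for a padded file.
report = budget_report({"AGENTS.md": "Do not add dependencies.\n" * 100})
print(report["total_bytes"], report["remaining_bytes"])
```

&lt;p&gt;The byte count is the least interesting number here. It simply makes visible that every added line spends from a finite allowance.&lt;/p&gt;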

&lt;p&gt;A strong &lt;code&gt;AGENTS.md&lt;/code&gt; is not a shrine to best practices. It is not a place to tell the model to be "world-class," "elegant," "production-ready," or "careful." Those words sound like engineering, but they rarely constrain anything.&lt;/p&gt;

&lt;p&gt;A strong &lt;code&gt;AGENTS.md&lt;/code&gt; is a compression test. It asks: what is the smallest set of instructions that materially reduces the agent's odds of making a bad change in this repository?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# AGENTS.md&lt;/span&gt;

&lt;span class="gu"&gt;## Objective&lt;/span&gt;
Make the smallest correct change for the assigned issue.

&lt;span class="gu"&gt;## Local commands&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Targeted tests: &lt;span class="sb"&gt;`pnpm test validator`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Full tests: &lt;span class="sb"&gt;`pnpm test`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Lint: &lt;span class="sb"&gt;`pnpm lint`&lt;/span&gt;

&lt;span class="gu"&gt;## Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Preserve the public error-response shape.
&lt;span class="p"&gt;-&lt;/span&gt; Do not add dependencies.
&lt;span class="p"&gt;-&lt;/span&gt; Reuse existing validation helpers before adding new utilities.
&lt;span class="p"&gt;-&lt;/span&gt; Do not reformat unrelated files.
&lt;span class="p"&gt;-&lt;/span&gt; Do not modify generated files.
&lt;span class="p"&gt;-&lt;/span&gt; Do not broaden the task into router, database, or UI changes unless the
  validation layer cannot reproduce the failure.

&lt;span class="gu"&gt;## Completion standard&lt;/span&gt;
A change is complete only when:
&lt;span class="p"&gt;1.&lt;/span&gt; the failing behavior is covered by a regression test;
&lt;span class="p"&gt;2.&lt;/span&gt; the targeted validator test passes;
&lt;span class="p"&gt;3.&lt;/span&gt; the candidate can explain the root cause, the patch, and the remaining risk.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file is not valuable because it is long. It is valuable because it is local.&lt;/p&gt;

&lt;p&gt;A 2026 paper evaluating repository-level context files makes the point sharper. Its abstract reports that context files tended to reduce task success rates compared with no repository context while increasing inference cost by over 20%. The body of the paper is more nuanced: developer-provided files marginally improved performance by 4% on average compared with omitting them, while LLM-generated context files had a small negative effect of 3% on average. The authors conclude that unnecessary requirements can make tasks harder and that human-written context files should describe only minimal requirements.&lt;/p&gt;

&lt;p&gt;That is not a reason to abandon &lt;code&gt;AGENTS.md&lt;/code&gt;. It is a reason to stop treating it as a talisman.&lt;/p&gt;

&lt;p&gt;The study's useful lesson is not "context files are bad." It is that context files tend to help only when they are minimal, human-written, and operational, and writing one that way is exactly the skill this interview should test. The candidate should not be rewarded for writing a context file. They should be rewarded for knowing what belongs in one, what does not, and why.&lt;/p&gt;

&lt;p&gt;A bloated context file is not maturity. It is another way to lose control of the system.&lt;/p&gt;

&lt;h2&gt;Context Limits Are Diagnostic, Not Sacred&lt;/h2&gt;

&lt;p&gt;The first version of this idea is easy to overstate: test whether candidates can build with low tokens, low context, and limited AI turns.&lt;/p&gt;

&lt;p&gt;That framing is too brittle.&lt;/p&gt;

&lt;p&gt;Token thrift is not engineering excellence. A candidate who uses more context and produces a safer, clearer, better-tested patch is better than a candidate who performs austerity and ships fragile code. Inference will get cheaper. Context windows will grow. Agent interfaces will change. An interview that treats raw token count as a primary score will age into nonsense.&lt;/p&gt;

&lt;p&gt;The better claim is narrower: context limits are useful because they force selection into the open. They are diagnostic, not sacred.&lt;/p&gt;

&lt;p&gt;Without constraint, a weak candidate can dump the entire repository into the model and accept the first patch that passes visible tests. Abundance can hide weak judgment. A constraint forces the candidate to choose. Which files matter? Which tests matter? Which interface is public? Which layer owns the behavior? What should the agent ignore? What is the minimum evidence that the patch is correct?&lt;/p&gt;

&lt;p&gt;But the constraint reveals only the candidate's mental model of the system under that constraint. It does not automatically prove how they would work in production. A five-file limit can become a test of artificial scarcity if treated as a literal simulation of the job.&lt;/p&gt;

&lt;p&gt;So the interview should not claim that real engineering is always token-bound. It should claim that real engineering requires relevance judgment, and bounded context is one way to make that judgment observable.&lt;/p&gt;

&lt;p&gt;Research on long-context behavior supports the modest claim, not the inflated one. "Lost in the Middle" found that language-model performance can degrade depending on where relevant information appears in long contexts, with performance often highest when relevant information appears near the beginning or end rather than buried in the middle.&lt;/p&gt;

&lt;p&gt;That does not prove a token-limited interview is valid. It only shows that context has shape, noise, and order. The hiring instrument still needs validation.&lt;/p&gt;

&lt;p&gt;The context limit is not the score. It is scaffolding. The score is whether the candidate selected relevant context, revised that selection when evidence changed, verified the patch, and defended the tradeoffs.&lt;/p&gt;

&lt;h2&gt;Coachability Is Not Always a Failure&lt;/h2&gt;

&lt;p&gt;Every hiring test eventually becomes a game.&lt;/p&gt;

&lt;p&gt;That was the fate of algorithm interviews. They began as a proxy for reasoning and became a curriculum of pattern recognition. Dynamic programming, graph traversal, sliding windows, heaps, tries, backtracking, binary search on answer — all trainable, all rehearsable, all capable of becoming ritual.&lt;/p&gt;

&lt;p&gt;A pre-agent work sample can become ritual too. Candidates will memorize &lt;code&gt;AGENTS.md&lt;/code&gt; templates. Interview coaches will teach context maps. Prep courses will drill candidates on naming invariants, writing risk notes, and saying "blast radius" with the right kind of restraint. Companies will turn a useful idea into another gate.&lt;/p&gt;

&lt;p&gt;This objection is real. It also has a better answer than "use fresh repos."&lt;/p&gt;

&lt;p&gt;Fresh repositories prevent candidates from memorizing a particular answer. They do not prevent candidates from training the orientation routine across fifty unfamiliar repos. The coachable object is not merely the answer. It is the meta-procedure: orient quickly, identify the test surface, name the invariant, constrain the diff, write the guardrail, defend the revision.&lt;/p&gt;

&lt;p&gt;That sounds like a threat until you look at it closely. If preparation teaches candidates to read unfamiliar code, identify meaningful tests, preserve invariants, constrain changes, and defend tradeoffs, then preparation is teaching the job. That is different from memorizing an answer key. It does not eliminate Goodhart's law, but it changes what optimization produces.&lt;/p&gt;

&lt;p&gt;Two honest caveats keep this from becoming wishful thinking. First, the claim rests on a transfer assumption that algorithm interviews failed to meet: that the coached version of "orient fast, name the invariant, defend the revision" stays attached to real competence rather than drifting into its own rehearsed performance, the way "recognize the dynamic-programming pattern" drifted from designing algorithms. That attachment is a bet, not a proof. Second, the defense is itself coachable: a prep course can drill convincing answers to "which API is public here." The format's protection is that grounded answers are harder to fake against a fresh repository than ungrounded ones. But "harder" is not "impossible."&lt;/p&gt;

&lt;p&gt;The danger is not that candidates learn the form. The danger is that they learn only the form. That is why the defense matters. A candidate can memorize "preserve public API shape." They still have to answer: which API is public here? They can memorize "add regression coverage." They still have to answer: what does this test prove, and what does it not prove? They can memorize "keep the diff local." They still have to answer: local relative to which ownership boundary?&lt;/p&gt;

&lt;p&gt;The format remains a scaffold only when it rewards grounded judgment. It becomes a cage when it rewards the appearance of judgment.&lt;/p&gt;

&lt;h2&gt;What Should Actually Be Scored&lt;/h2&gt;

&lt;p&gt;The final code matters, but it cannot carry the whole assessment. A passing patch can be lucky. A generated patch can be correct for reasons the candidate does not understand. A small diff can preserve the wrong thing. A fluent explanation can hide shallow comprehension.&lt;/p&gt;

&lt;p&gt;The score should focus on durable engineering evidence:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Score this&lt;/th&gt;
&lt;th&gt;Do not overvalue this&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Problem framing&lt;/td&gt;
&lt;td&gt;Did the candidate identify the real task, constraints, and non-goals?&lt;/td&gt;
&lt;td&gt;Confidence or speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preregistration quality&lt;/td&gt;
&lt;td&gt;A committed initial theory whose confidence is matched to what manual inspection could actually establish&lt;/td&gt;
&lt;td&gt;A clean, certain-sounding record that overstates what inspection could reveal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context relevance&lt;/td&gt;
&lt;td&gt;Did they choose the right files, tests, contracts, and invariants?&lt;/td&gt;
&lt;td&gt;Raw token count or file count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revision quality&lt;/td&gt;
&lt;td&gt;Did they update their theory honestly when task evidence changed?&lt;/td&gt;
&lt;td&gt;Sticking to the first plan for ego reasons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;Did they define meaningful evidence before trusting the patch?&lt;/td&gt;
&lt;td&gt;Running tests only at the end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architectural restraint&lt;/td&gt;
&lt;td&gt;Did they keep the change local and preserve boundaries?&lt;/td&gt;
&lt;td&gt;Large, impressive rewrites&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI governance&lt;/td&gt;
&lt;td&gt;Did they treat the model as fallible?&lt;/td&gt;
&lt;td&gt;Prompt elegance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Defense&lt;/td&gt;
&lt;td&gt;Could they explain why their record, patch, and tests fit this codebase?&lt;/td&gt;
&lt;td&gt;Fluent PR theater&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One row deserves a warning, because the preregistration frame creates a temptation it does not advertise. A sealed envelope rewards commitment, and commitment is easy to confuse with certainty. They are not the same thing. The candidate who writes "leading hypothesis is the validator, but the error shape could surface at the router, so I would confirm before ruling it out" is showing better judgment than the one who writes "the bug is in &lt;code&gt;validator.ts&lt;/code&gt;, router out of scope" with false confidence, even though the second record is cleaner and more falsifiable. Good engineering in an unfamiliar repository includes knowing what manual inspection cannot yet settle. A record that flags genuine uncertainty where the code cannot resolve it is calibrated, not weak, and the rubric must not penalize it for being less tidy. The thing being scored is whether the candidate's confidence was appropriate to the evidence available, not whether the record reads as sure.&lt;/p&gt;

&lt;p&gt;This fits adjacent evidence from selection research, but that evidence should not be overstated. The U.S. Office of Personnel Management describes work-sample tests as tasks or activities that mirror the tasks employees perform on the job, and it says that more highly structured interviews show higher validity, rater reliability, and rater agreement, along with less adverse impact.&lt;/p&gt;

&lt;p&gt;That motivates the format. It does not validate it. A pre-agent AI work sample would need its own validation: correlation with later job performance, interviewer agreement, candidate experience, adverse impact, false-positive rate, false-negative rate, and cost.&lt;/p&gt;

&lt;p&gt;The most novel signal in the format — revision quality — is also the softest to score. Two interviewers may disagree about whether a candidate made an honest update, rationalized backward, or simply recovered from a bad first map. Honest revision, scrambling, and face-saving can look similar in a live room. That weakness does not kill the format. It means revision quality cannot be scored by vibe. It needs anchors.&lt;/p&gt;

&lt;p&gt;A strong revision looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the candidate expands scope because a failing test or code path proves the original map incomplete;&lt;/li&gt;
&lt;li&gt;the candidate resists model overreach and explains why the original boundary still holds;&lt;/li&gt;
&lt;li&gt;the candidate updates the &lt;code&gt;AGENTS.md&lt;/code&gt; or task plan explicitly rather than silently changing the story;&lt;/li&gt;
&lt;li&gt;the candidate distinguishes "the agent found a file" from "the task required that file."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A weak revision looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the candidate ignores evidence because it threatens the original plan;&lt;/li&gt;
&lt;li&gt;the candidate silently changes the story after the model output;&lt;/li&gt;
&lt;li&gt;the candidate cannot explain why the plan changed;&lt;/li&gt;
&lt;li&gt;the candidate says "the AI found it" without understanding the dependency;&lt;/li&gt;
&lt;li&gt;the candidate accepts a patch that violates their own registered constraint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scoring instrument should make those anchors explicit. Multiple reviewers would help. A written delta between the preregistration, the patch, the tests, and the defense would help more.&lt;/p&gt;
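
&lt;p&gt;As one possible starting point for that written delta, the file-level part can be computed mechanically. The sketch below is a minimal illustration, not a scoring instrument: the function name &lt;code&gt;recordDelta&lt;/code&gt;, the result fields, and the file paths are all hypothetical, and it assumes the pre-agent record lists the files the candidate expected to matter and the files they declared out of scope.&lt;/p&gt;

```typescript
// Hypothetical result shape: names are illustrative, not part of any real tool.
interface RecordDelta {
  unplanned: string[];        // touched by the patch, absent from the registered context map
  untouched: string[];        // registered as relevant, never changed
  violatedNonGoals: string[]; // touched despite being registered as out of scope
}

// Compares what the candidate committed to before using the agent with what
// the final patch changed. A file declared out of scope and then touched
// appears in both `unplanned` and `violatedNonGoals`.
function recordDelta(
  registered: string[],
  outOfScope: string[],
  touched: string[],
): RecordDelta {
  return {
    unplanned: touched.filter((file) => !registered.includes(file)),
    untouched: registered.filter((file) => !touched.includes(file)),
    violatedNonGoals: touched.filter((file) => outOfScope.includes(file)),
  };
}

// Example with assumed paths: the candidate registered the validator,
// ruled out the router, and the final patch touched both.
const delta = recordDelta(
  ["src/validator.ts"],
  ["src/router.ts"],
  ["src/validator.ts", "src/router.ts"],
);
console.log(delta);
```

&lt;p&gt;Each entry in the output is a question for the defense, not a verdict. An unplanned file may reflect an honest expansion backed by a failing test, or it may reflect a patch the candidate accepted without noticing that it broke their own registered constraint. The delta only makes sure the question gets asked.&lt;/p&gt;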

&lt;p&gt;This is a proposal grounded in adjacent evidence, not a proven instrument.&lt;/p&gt;

&lt;p&gt;That sentence has to stay. Without it, the essay becomes the thing it criticizes: a confident performance standing in for evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Version
&lt;/h2&gt;

&lt;p&gt;A usable interview might look like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are given a small TypeScript service with a failing edge case in request validation. You may inspect the repository manually. Before using the provided AI assistant, write a short pre-agent record: task summary, context map, agent instructions, verification plan, risk note, and definition of done. After submitting that record, you may use the assistant inside the provided environment. Your final submission should include the pre-agent record, the patch, tests run, and a defense of how your theory changed or held.&lt;/p&gt;
&lt;/blockquote&gt;
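
&lt;p&gt;To make the six parts of that record concrete, here is a minimal sketch of how an interview environment might represent and seal it. Everything in it is an assumption for illustration: the type names, the field names, the &lt;code&gt;seal&lt;/code&gt; function, and the file paths in the example are hypothetical, and a real environment would also need tamper-evident storage that this sketch does not attempt.&lt;/p&gt;

```typescript
// Hypothetical shape for the pre-agent record. The six fields mirror the
// parts named in the prompt above; the structure itself is an assumption.
interface PreAgentDraft {
  taskSummary: string;
  contextMap: string[];       // files the candidate expects to matter
  agentInstructions: string;  // for example, the text of an AGENTS.md
  verificationPlan: string[]; // tests or checks the candidate commits to running
  riskNote: string;
  definitionOfDone: string;
}

interface SealedRecord extends PreAgentDraft {
  readonly sealedAt: string;  // ISO 8601 timestamp taken at submission
}

// Lists the parts that are still empty, so the assistant can stay locked
// until the candidate has committed to every part of the record.
function missingParts(draft: PreAgentDraft): string[] {
  const missing: string[] = [];
  if (draft.taskSummary.trim() === "") missing.push("taskSummary");
  if (draft.contextMap.length === 0) missing.push("contextMap");
  if (draft.agentInstructions.trim() === "") missing.push("agentInstructions");
  if (draft.verificationPlan.length === 0) missing.push("verificationPlan");
  if (draft.riskNote.trim() === "") missing.push("riskNote");
  if (draft.definitionOfDone.trim() === "") missing.push("definitionOfDone");
  return missing;
}

// Timestamps a complete record. Object.freeze is shallow: it blocks
// reassigning fields, but the arrays inside remain mutable.
function seal(draft: PreAgentDraft, now: Date): SealedRecord {
  const missing = missingParts(draft);
  if (missing.length > 0) {
    throw new Error("Record incomplete: " + missing.join(", "));
  }
  return Object.freeze({ ...draft, sealedAt: now.toISOString() });
}

// Example with assumed file paths and wording.
const example = seal(
  {
    taskSummary: "Fix the failing edge case in request validation",
    contextMap: ["src/validator.ts", "test/validator.test.ts"],
    agentInstructions: "Change only the validator; leave the router alone.",
    verificationPlan: ["add a regression test for the edge case", "run the full suite"],
    riskNote: "The error shape could surface at the router; confirm before ruling it out.",
    definitionOfDone: "The new test passes and no existing test is modified.",
  },
  new Date(),
);
console.log(example.sealedAt);
```

&lt;p&gt;The timestamp is what matters in this sketch. It is what lets a reviewer later separate what the candidate committed to from what they learned after the assistant started producing output.&lt;/p&gt;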

&lt;p&gt;Notice what is missing from the scored prompt: no sacred five-file limit, no sacred ten-turn budget, no raw token target. The environment may still impose practical limits. Those limits are scaffolding. They are not the skill.&lt;/p&gt;

&lt;p&gt;The candidate should be judged on questions like: Did their initial context map point toward the files that mattered? Did they notice when their first theory was wrong? Did they preserve the correct boundary? Did they define a meaningful test? Did they keep the change smaller than the agent wanted? Did they catch model output that violated the repository's conventions? Could they explain the remaining risk?&lt;/p&gt;

&lt;p&gt;A strong candidate may use more context than expected because the first map was incomplete. That should not count against them if the expansion was justified. A weak candidate may use little context because they prematurely narrowed the task and missed the real owner of the behavior.&lt;/p&gt;

&lt;p&gt;The score is not austerity. The score is judgment under commitment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fidelity, Integrity, and Cost
&lt;/h2&gt;

&lt;p&gt;The practical objections are serious.&lt;/p&gt;

&lt;p&gt;A pre-agent AI work sample is more expensive than a standard coding screen. It requires a curated repository, a realistic task, a standardized environment, logging, a rubric, and trained reviewers. It is not cheap at scale.&lt;/p&gt;

&lt;p&gt;There is also a cheating problem. A remote interview can be compromised by a second machine, an off-screen model, private prompt libraries, or outside help. A locked-down environment improves integrity but reduces realism. Let candidates use their real workflow and the interview becomes more faithful to the job, but less comparable. Force every candidate into the same sandbox and comparability improves, but the workflow becomes artificial. This is not a detail. It is the design problem.&lt;/p&gt;

&lt;p&gt;The practical answer is not to use this format as a mass filter. Use cheaper screens earlier. Use the pre-agent work sample later, when the candidate is far enough along that the stronger signal is worth the cost. Keep the task bounded. Publish expectations. Provide the same environment. Do not score model-specific tricks. Pay candidates if the work becomes long enough to resemble real labor.&lt;/p&gt;

&lt;p&gt;There is a quieter objection, and it is the one a skeptical hiring manager will actually raise. The comparison class for this format is not only the AI ban and the AI free-for-all. The real incumbent is the structured take-home followed by a code-review debrief: give the candidate a repository and a task, let them work however they like, then sit down and ask them to defend their choices. That format already captures much of what this essay prizes — scoping, restraint, verification, and the ability to explain a diff. It is cheaper. It is partly validated. And it asks for judgment in something close to the candidate's real workflow.&lt;/p&gt;

&lt;p&gt;What the take-home debrief lacks is the sealed envelope. Because the debrief happens after the work, the candidate can narrate a clean story backward, explaining the choices they appear to have made rather than the ones they actually committed to. The pre-agent record is the one thing the debrief cannot reproduce: a timestamped commitment made before the evidence arrived. So the entire incremental case for this format reduces to a single empirical question. How much is that pre-commitment worth, net of the cost of forcing an artificial sequence onto work that is naturally interleaved? This essay argues the envelope is worth a great deal, because backward rationalization is exactly the failure the debrief cannot detect. But that is an argument, not a measurement, and it is the first thing a pilot should test: run the pre-agent work sample against a take-home debrief on the same candidates, and see whether the commitment adds predictive signal the debrief misses. If it does not, the cheaper instrument wins.&lt;/p&gt;

&lt;p&gt;The format also has a structural blind spot worth naming. It mandates AI use after the pre-agent record, which means it cannot assess one real judgment its own evidence implies matters: deciding that a task is faster or safer done by hand. METR's result — experienced developers slowed down by AI in mature repositories — is partly a finding about engineers who should have declined the tool and did not. An interview that requires the tool forecloses that call. A mature version might let candidates justify not invoking the agent for part of the task, and score that too.&lt;/p&gt;

&lt;p&gt;Fairness needs a harder treatment than "standardize the tools." AI-workflow familiarity is unevenly distributed. Some candidates have daily access to frontier tools. Others do not. Some have worked in companies that encourage agentic coding. Others have been forbidden to use it. Some are fluent in the current English-heavy style of machine instruction. Others may have the same engineering judgment and less practice encoding it into agent-facing documents.&lt;/p&gt;

&lt;p&gt;Canva's own report cuts both ways here. It says candidates with limited AI experience struggled not because they could not code, but because they lacked the judgment to guide AI effectively and identify when suggestions were suboptimal. That supports this essay's thesis directly. It also sharpens the fairness problem: low AI experience may track unequal access, unequal workplace norms, and unequal opportunity to practice.&lt;/p&gt;

&lt;p&gt;So the fairer version of the interview scores durable behaviors: context selection, verification, architectural restraint, revision, and defense. For junior candidates, it should not over-index on sophisticated agent orchestration. For senior candidates, it can demand more explicit scoping and review judgment. For staff-level candidates, the relevant task may be designing the human-AI workflow itself. The format has to be calibrated by level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Scaffold and the Cage
&lt;/h2&gt;

&lt;p&gt;A bad hiring process becomes a cage. It traps candidates inside a ritual that serves the institution more than the work. LeetCode became that for many engineers: a narrow game standing in for a broad craft. It rewarded speed, rehearsal, and pattern recognition while missing slower forms of engineering judgment.&lt;/p&gt;

&lt;p&gt;An AI interview can become the same kind of cage. It can reward prompt theater, tool privilege, memorized templates, synthetic confidence, and the polished performance of judgment without judgment underneath. Prompt theater asks whether a candidate can make the model say useful things. The better question is whether they can make the work safe enough for a model to touch.&lt;/p&gt;

&lt;p&gt;The better version is a scaffold. A scaffold does not do the work. It makes the work possible. It gives shape, support, access, and constraint. The pre-agent record is that scaffold. It asks the candidate to commit before automation begins: the relevant context, the non-goals, the tests, the standards, the invariants, the risks, the stopping condition.&lt;/p&gt;

&lt;p&gt;That commitment will sometimes be wrong. Good. The point is not to worship the first map. The point is to see how the candidate handles the distance between the map and the territory.&lt;/p&gt;

&lt;p&gt;The future coding interview should not pretend AI does not exist. It should also not surrender the assessment to the machine. It should use the machine to reveal the human decision that matters most: how the candidate turns ambiguity into bounded, reviewable, verifiable work.&lt;/p&gt;

&lt;p&gt;Before the first token, the candidate has not written the answer.&lt;/p&gt;

&lt;p&gt;They have sealed the envelope.&lt;/p&gt;

&lt;p&gt;Then the interview begins.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>interview</category>
      <category>career</category>
      <category>agents</category>
    </item>
    <item>
      <title>The Scaffold and the Cage: Vibe Coding, Enabled Coding, and the Fight for Judgment</title>
      <dc:creator>Connor Hickey</dc:creator>
      <pubDate>Sat, 30 May 2026 15:36:45 +0000</pubDate>
      <link>https://dev.to/conalh/the-scaffold-and-the-cage-vibe-coding-enabled-coding-and-the-fight-for-judgment-4ljd</link>
      <guid>https://dev.to/conalh/the-scaffold-and-the-cage-vibe-coding-enabled-coding-and-the-fight-for-judgment-4ljd</guid>
      <description>&lt;p&gt;The phrase &lt;em&gt;vibe coding&lt;/em&gt; has become a convenient way to describe a strange new relationship between humans, machines, and software. At its simplest, vibe coding means telling an AI system what you want and letting it produce the code. The human provides intent, mood, direction, and correction. The machine produces implementation. The result may be a game prototype, a tool, a website, a mod, a script, or an entire application. The person may not understand every line. They may not even pretend to. They describe the desired artifact, test whether it feels right, and keep prompting until the thing seems to work.&lt;/p&gt;

&lt;p&gt;The term itself is recent. Andrej Karpathy coined &lt;em&gt;vibe coding&lt;/em&gt; in a widely shared post in February 2025, describing a way of working in which you trust the model, stop reading the diffs, and "forget that the code even exists" (Karpathy, 2025); within the year the phrase had spread far enough to be named Collins Dictionary's Word of the Year (Collins, 2025). Karpathy was candid about what the mode gives up — in its original sense, vibe coding meant precisely &lt;em&gt;not&lt;/em&gt; reviewing the output.&lt;/p&gt;

&lt;p&gt;That description is useful, but it is also too blunt. It collapses many different practices into one label. It treats the person who blindly accepts generated code the same as the person who uses an agent to learn, debug, test, and gradually understand a system they could not have built alone. It also risks turning "vibe coder" into a social category — almost an insult — rather than a description of a method. The term can imply that someone is merely pretending to code, that they are outsourcing the real work while borrowing the identity of a programmer.&lt;/p&gt;

&lt;p&gt;I am not sure that label fits me. At least, not cleanly.&lt;/p&gt;

&lt;p&gt;I do not experience agentic coding as pretending to be a programmer. I experience it as finally being able to stay inside the programming loop long enough to become one.&lt;/p&gt;

&lt;p&gt;That, at least, is the story I want to tell. A good part of this essay is an attempt to find out whether the story is true, or whether it is the most comfortable thing I could believe about a tool I have come to depend on.&lt;/p&gt;

&lt;p&gt;The distinction matters. For me, and likely for many others, AI-assisted or agentic coding is not simply a shortcut around skill. It is a scaffold that makes skill reachable. It lowers the activation barrier. It helps manage the blank page, the syntax wall, the debugging spiral, the architecture fog, and the working-memory demands that make programming difficult to sustain. This is especially significant for people with ADHD or other executive-function challenges. Coding is not only a technical activity; it is also a cognitive endurance task. It requires attention, sequencing, planning, error tolerance, working memory, and the ability to return to a problem after repeated failure. Agentic coding changes the shape of that task.&lt;/p&gt;

&lt;p&gt;The more interesting question, then, is not whether AI wrote the code. That question is already becoming less useful. The better question is: who owns the intent, the judgment, and the resulting system?&lt;/p&gt;

&lt;p&gt;Coding is shifting from line production to system stewardship. In that shift, the meaningful boundary is no longer between human-written and AI-written code. The boundary lies between artifacts the human can own and artifacts the human merely accepts.&lt;/p&gt;

&lt;p&gt;This essay began as a defense of the purple space between vibe coding and genuine ownership: the space where an agent writes more of the code than the human could comfortably write alone, but the human is still learning, testing, questioning, and moving toward understanding the system. I still think that space exists. But I no longer think it is a natural developmental stage. Purple is not a conveyor belt from dependence to competence. It is a fork. One path uses the agent as a scaffold and deliberately preserves the difficulty required to build judgment. The other uses the agent as a cage, removing so much friction that the user gains fluency without ownership. The difference is not whether the machine writes the code. The difference is whether the human refuses to surrender evaluation.&lt;/p&gt;

&lt;p&gt;Throughout, I will use three colors as shorthand. Red is vibe coding in the narrow sense: the human expresses desire and accepts machine output with minimal understanding. Blue is enabled coding: the human leans on agents heavily but keeps conceptual ownership, verification responsibility, and the ability to reason about the system. Purple is the contested space between them — and the rest of this essay is an argument about which way it points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe Coding as Red: Desire Without Ownership
&lt;/h2&gt;

&lt;p&gt;Vibe coding begins with desire. The human says, in natural language, what they want the software to do. The prompt may be specific or vague. It may describe an interface, a mechanic, a workflow, a tool, or a feeling. "Make me a basic platformer controller." "Build a save system." "Create an inventory UI." "Fix this bug." "Make it feel smoother." "Add juice." "Make the enemy smarter." "Make this look like a real app."&lt;/p&gt;

&lt;p&gt;The agent responds with code. The human runs it. Something breaks. The human pastes the error back. The agent patches. The human tries again. Eventually the thing works, or appears to work. The loop continues.&lt;/p&gt;

&lt;p&gt;There is nothing inherently wrong with this process. In low-risk contexts, it can be playful, productive, and creatively liberating. A solo developer can prototype faster. A non-programmer can test an idea. A designer can make an interactive sketch. A student can get unstuck. A person who would normally never touch code can suddenly make a working artifact.&lt;/p&gt;

&lt;p&gt;The risk appears when the artifact becomes detached from human understanding. In the red zone, the user accepts code because it appears to work, not because they understand why it works. The program becomes opaque. The user's standard of correctness is surface behavior: the button clicks, the scene loads, the function returns something plausible, the error disappears. The agent becomes the only participant with any apparent model of the implementation, and even that model may be unstable or hallucinated.&lt;/p&gt;

&lt;p&gt;This matters because software is not only output. Software has consequences. It stores data, moves money, exposes private information, controls experiences, shapes user behavior, and breaks in ways that can be subtle. Even in small projects, code accumulates. A prototype becomes a tool. A tool becomes infrastructure. A quick fix becomes an architectural dependency. The more a system grows, the more dangerous it becomes for the human to remain outside the logic of the thing they are building.&lt;/p&gt;

&lt;p&gt;In red, the human says: "It works, so I accept it."&lt;/p&gt;

&lt;p&gt;That may be enough for a disposable prototype. It is not enough for ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enabled Coding as Blue: Acceleration With Ownership
&lt;/h2&gt;

&lt;p&gt;Enabled coding looks similar from the outside. The human still uses an agent. The agent may still write most of the code. The human may still describe changes in natural language. The workflow may still include copy-pasting errors, asking for patches, and iterating quickly.&lt;/p&gt;

&lt;p&gt;The difference is not the amount of AI involvement. The difference is the human's relationship to the artifact.&lt;/p&gt;

&lt;p&gt;Enabled coding means the agent reduces the execution burden while the human retains responsibility for direction, comprehension, verification, and maintenance. The human does not need to type every line to own the system. They do need to understand the relevant behavior well enough to make decisions about it.&lt;/p&gt;

&lt;p&gt;In blue, the human asks different questions.&lt;/p&gt;

&lt;p&gt;Why did you choose this pattern?&lt;br&gt;
What files did you change?&lt;br&gt;
What assumption does this function make?&lt;br&gt;
What happens if the input is null?&lt;br&gt;
What breaks if there are two players instead of one?&lt;br&gt;
Is this state stored globally?&lt;br&gt;
Can this be simplified?&lt;br&gt;
Can we add a test?&lt;br&gt;
Can you explain this like I am going to maintain it next month?&lt;/p&gt;

&lt;p&gt;These questions change the role of the agent. The agent is no longer just a code vending machine. It becomes a pair programmer, tutor, debugger, explainer, and implementation accelerator. It can still be wrong, but its wrongness becomes part of a review process rather than a hidden liability.&lt;/p&gt;

&lt;p&gt;Enabled coding does not require total mastery. That would be an unrealistic standard. No programmer understands every layer of the stack they use. Professional developers rely on compilers, engines, frameworks, libraries, documentation, autocomplete, forums, package managers, and abstractions they do not fully control. The question is not whether the human has absolute knowledge. The question is whether the human has enough situated understanding to responsibly guide, test, and maintain the system.&lt;/p&gt;

&lt;p&gt;This is not only how I would like experienced developers to work; it appears to be how they actually do. When researchers observed and surveyed professional developers using AI agents through 2025, they found that the experienced ones do not vibe code at all. They plan the task, supervise the agent closely, and review its output rigorously, holding onto authority over design and implementation out of a refusal to compromise on software quality (Huang et al., 2025). Expertise, in agentic coding, expresses itself not as faster acceptance but as more disciplined control.&lt;/p&gt;

&lt;p&gt;This is where the traditional gatekeeping around programming starts to break down. If programming is defined narrowly as manually producing lines of syntax, then AI-generated code seems to threaten the identity of the programmer. But if programming is understood as designing, reasoning about, testing, maintaining, and evolving computational systems, then agentic tools do not erase programming. They shift its center of gravity.&lt;/p&gt;

&lt;p&gt;The coder becomes less like a typist and more like a system steward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Purple Zone: Scaffolded Ownership
&lt;/h2&gt;

&lt;p&gt;Between red and blue is purple.&lt;/p&gt;

&lt;p&gt;Purple is the state where the agent writes more code than the human could comfortably write alone, but the human is not merely accepting magic. The human is directing, testing, questioning, and learning. They may not understand the implementation immediately, but they do not treat incomprehension as the final state. They use the agent to move toward understanding.&lt;/p&gt;

&lt;p&gt;This is the zone where many new programmers probably live now. It is also where many solo builders, indie developers, modders, designers, domain experts, and neurodivergent creators may find themselves. They are not traditional programmers in the old sense, but they are not non-programmers either. They are becoming capable through collaboration with a machine.&lt;/p&gt;

&lt;p&gt;Purple is easy to dismiss because it looks messy. The person may ask naive questions. They may rely heavily on the agent. They may struggle to explain the code at first. They may use imprecise language. They may build something that works before they fully understand why it works. To an experienced programmer, this can look like incompetence wearing a productivity mask.&lt;/p&gt;

&lt;p&gt;But that judgment, I want to argue, misses the developmental nature of the process. A beginner using an agent is not necessarily bypassing learning. They may be entering learning from the other side. Instead of spending weeks blocked by syntax, setup, and error messages, they can start with a functioning artifact and then interrogate it. They can ask the agent to explain the architecture. They can trace the data flow. They can request comments. They can break the code and repair it. They can compare implementations. They can ask why one approach is better than another. They can move from outcome to mechanism.&lt;/p&gt;

&lt;p&gt;That is not fake programming. It is scaffolded programming — and the word is not loose. In developmental psychology, &lt;em&gt;scaffolding&lt;/em&gt; (Wood, Bruner, &amp;amp; Ross, 1976) names the temporary support a more capable partner supplies so that a learner can accomplish something that would be "beyond his unassisted efforts," within what Vygotsky (1978) called the zone of proximal development: the distance between what a learner can do alone and what they can do with help. But the concept carries a condition that is easy to forget. The defining feature of a scaffold, in that literature, is that it &lt;em&gt;fades&lt;/em&gt; — it is deliberately withdrawn as the learner's competence grows. A scaffold that is never removed is not a scaffold. It is a permanent prop, and the building never learns to stand.&lt;/p&gt;

&lt;p&gt;The distinction depends on whether the scaffold becomes a bridge or a cage. If the user remains dependent on the agent for every change, every bug, and every explanation, purple collapses back into red. The artifact remains opaque. The user can produce software but cannot own it. But if the agent helps the user build a mental model, purple moves toward blue. The user becomes more capable over time.&lt;/p&gt;

&lt;p&gt;I have now used the word &lt;em&gt;if&lt;/em&gt; twice in a single paragraph, and I want to flag that, because everything optimistic in this essay is hiding inside those conditionals. I have asserted that the scaffold &lt;em&gt;can&lt;/em&gt; become a bridge. I have not yet given any reason to believe it &lt;em&gt;tends&lt;/em&gt; to. That is the work the rest of the essay has to do, and it is harder than I would like it to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  ADHD, Executive Function, and the Programming Loop
&lt;/h2&gt;

&lt;p&gt;The ADHD angle is not incidental. It may be central.&lt;/p&gt;

&lt;p&gt;Programming is often described as a logic skill, but in practice it is also an executive-function gauntlet. A programming task requires the developer to hold multiple layers of information in mind: the goal, the current bug, the relevant files, the syntax, the architecture, the runtime behavior, the error messages, the edge cases, and the next step. The developer has to break large tasks into smaller tasks. They have to tolerate delayed gratification. They have to recover from repeated failure. They have to remember what they were doing before the last error interrupted them.&lt;/p&gt;

&lt;p&gt;For someone with ADHD, these demands can become the real barrier. The problem is not always lack of intelligence or lack of interest. It can be task initiation, sequencing, working memory, context switching, emotional regulation, and persistence through friction. Programming creates friction constantly. One missing semicolon, one broken dependency, one unclear error, one setup issue, one file in the wrong folder — any of these can derail momentum.&lt;/p&gt;

&lt;p&gt;Agentic coding can function as an external executive system. It can hold context. It can summarize the next step. It can break a feature into smaller chunks. It can explain an error without the shame spiral of feeling stupid. It can offer a concrete first move when the blank page is too abstract. It can convert "I want this mechanic" into "start by creating these files and these functions." It can keep the loop alive.&lt;/p&gt;

&lt;p&gt;For me, that matters more than I know how to say in an essay that is trying to stay analytical. The agent does not simply make coding faster. It makes coding &lt;em&gt;reachable&lt;/em&gt;. It lets me remain in contact with the work long enough to build understanding. Instead of falling out of the loop every time the task becomes too abstract or too fragmented, I can use the agent as a stabilizer. It gives me a way back in.&lt;/p&gt;

&lt;p&gt;This reframes the ethics of AI-assisted coding. The public conversation often treats AI coding as a question of laziness, authenticity, or cheating. Those frames are too narrow. For some people, agentic coding is closer to access technology: an external support for task initiation, sequencing, working memory, and recovery from failure. It does not remove the need for judgment, effort, or learning. It changes the conditions under which those things become possible.&lt;/p&gt;

&lt;p&gt;That is the strongest version of my case, and I believe it. Which is exactly why I have to attack it now, because I notice that I have arranged the argument so that no one is allowed to question it. I have wrapped the claim in the language of disability and accessibility, and that language has a way of ending conversations. To doubt an accessibility tool feels like doubting the person who needs it. But I am not interested in an argument that wins by becoming unfalsifiable. So I have to ask the question the accessibility framing is designed to make me feel bad for asking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counterclaim: Why the Scaffold Might Be the Cage
&lt;/h2&gt;

&lt;p&gt;Here is the objection, and I am going to give it every advantage.&lt;/p&gt;

&lt;p&gt;The danger of agentic coding is not that it removes labor. The danger is that it may remove the specific forms of labor through which judgment is formed.&lt;/p&gt;

&lt;p&gt;There is a comforting word for the support I described a moment ago: a prosthetic. An external stand-in for a capacity I struggle to supply on my own. But if I let myself reach for that word, I inherit its darker half. A prosthetic is not a teacher. It is a substitute. We do not expect a prosthetic limb to grow a real one underneath it. A wheelchair is not a stage in learning to walk. So the moment I call the agent a prosthetic, I may be smuggling in good news that the metaphor does not actually contain. The honest version of the prosthetic framing is not hopeful at all. It is the picture of a permanent substitution for a capacity that will never develop — &lt;em&gt;because&lt;/em&gt; the substitution removes the very stimulus that would have developed it.&lt;/p&gt;

&lt;p&gt;Look again at what I praised, and notice what it costs.&lt;/p&gt;

&lt;p&gt;Every executive function the agent supplies is one I do not exercise. It holds the context, so my working memory never has to stretch to hold it. It sequences the next step, so I never build the muscle of decomposition. It absorbs the failure spiral, so I am never the one who sits in the wreckage of a broken build until I understand why it broke. The agent does not strengthen these capacities by performing them for me, any more than a forklift strengthens my back. It performs them &lt;em&gt;instead&lt;/em&gt; of me. And a function that is always performed for you is a function that quietly disappears.&lt;/p&gt;

&lt;p&gt;The learning sciences have an uncomfortable name for the thing I have been treating as pure cost: &lt;em&gt;desirable difficulty&lt;/em&gt; (Bjork, 1994; Bjork &amp;amp; Bjork, 2011). The finding, roughly, is that conditions which make a task feel harder and slower in the moment often produce more durable learning, and conditions which make a task feel fluent and easy often produce the &lt;em&gt;illusion&lt;/em&gt; of learning without the substance. Underneath it lies a distinction Bjork draws between &lt;em&gt;performance&lt;/em&gt; — how well you can execute right now, with support in place — and &lt;em&gt;learning&lt;/em&gt; — the durable capability that remains once the support is gone. The two routinely move in opposite directions, which is exactly why fluency in the moment is such an unreliable signal of competence acquired. Struggle is not a bug in the process of becoming competent. In many cases struggle is the process. There is a related and equally inconvenient result, the generation effect: across decades of experiments, people remember and understand material they generate themselves far better than the same material merely shown to them (Slamecka &amp;amp; Graf, 1978). Reading a correct solution feels like understanding. It is not. It is recognition wearing understanding's clothes.&lt;/p&gt;

&lt;p&gt;Now consider what an agentic coding tool actually is, mechanically. It is a fluency-maximizing machine. Its entire value proposition is the removal of difficulty. That is the product. That is what I am paying for, in money and in dependence. So if the difficulty was where the learning lived, then the tool is not protecting my learning. It is optimizing it away, and presenting me with the pleasant sensation of competence as the receipt. This gap between sensation and fact is measurable. In a 2025 randomized controlled trial, experienced open-source developers predicted that AI tools would speed them up, and reported afterward that the tools &lt;em&gt;had&lt;/em&gt; sped them up — yet they actually completed their tasks roughly nineteen percent &lt;em&gt;slower&lt;/em&gt; with the tools than without them (Becker et al., 2025). The feeling of acceleration and the fact of it had come apart, and the people inside the experiment could not detect the difference. If fluency can hide a slowdown that large, it can certainly hide the smaller, slower divergence between understanding a system and merely operating one.&lt;/p&gt;

&lt;p&gt;There is an older version of this worry, from outside software, named the &lt;em&gt;ironies of automation&lt;/em&gt; (Bainbridge, 1983). Bainbridge's observation was that when you automate the routine parts of a task and leave the human responsible for the rest, you erode the operator's skill at exactly the moments automation fails and a human must take over — so the more reliable the automation, the less prepared the human it ultimately depends on. Aviation has tested this directly, and the result is precise about &lt;em&gt;which&lt;/em&gt; skills go. When researchers had airline pilots fly routine and non-routine scenarios in a Boeing 747 simulator at varying levels of automation, they found that the manual control skills — the stick-and-rudder motor skills — held up reasonably well, but the &lt;em&gt;cognitive&lt;/em&gt; skills of manual flight, the knowing-what-to-attend-to and deciding-what-to-do, were the ones that decayed under reliance on automation (Casner, Geven, Recker, &amp;amp; Schooler, 2014; see also Ebbatson, Harris, Huddlestone, &amp;amp; Sears, 2010). The hands remembered the airplane. The judgment did not. Generated code threatens to fail along the same fault line. The agent handles the ordinary. The human is summoned only for the catastrophe — the subtle data corruption, the security hole, the architectural dead end that no further prompting can patch. The mechanical skill of producing code may well survive; it is the judgment that quietly hollows out, invisibly, behind the comfortable hum of things mostly working.&lt;/p&gt;

&lt;p&gt;This is the point where my essay is in real trouble, and I want to name precisely how, because it is worse than a missing caveat.&lt;/p&gt;

&lt;p&gt;Rehabilitation science draws exactly the distinction that exposes the problem. Assistive and rehabilitative technologies are not one category but two, with opposite definitions of success (Cook &amp;amp; Polgar, 2015). Some assistive technology is &lt;em&gt;compensatory&lt;/em&gt;: a wheelchair, glasses, a hearing aid. You will use it permanently, and that is completely fine, because independence was never about legs or unaided eyes. Permanent dependence on a wheelchair is not a failure of the wheelchair. It is the wheelchair working. But some assistive technology is &lt;em&gt;rehabilitative&lt;/em&gt;: a course of physical therapy, training wheels, a scaffold around a building under construction. Its whole purpose is to be outgrown. Permanent dependence on a rehab program is not a success. It is the rehab failing.&lt;/p&gt;

&lt;p&gt;Here is the bind. My entire gradient — red to purple to blue, &lt;em&gt;movement&lt;/em&gt;, &lt;em&gt;becoming&lt;/em&gt;, "a steward of the system" — is a rehabilitative claim. I am promising that you outgrow the scaffold. The word "scaffold" gave it away. So I do not get to retreat to the comfortable wheelchair defense when challenged — "it's a prosthetic, dependence is fine" — because the wheelchair defense abandons my thesis. I committed to the harder claim: that the tool is a bridge you cross and leave behind. And the harder claim is exactly the one the entire deskilling literature suggests is &lt;em&gt;least&lt;/em&gt; likely to come true, because the easier and more fluent a scaffold makes the work, the less reason and less stimulus there is to ever step off it.&lt;/p&gt;

&lt;p&gt;And the ADHD framing, which I leaned on as my strongest card, may be my weakest. Because if executive function is genuinely the barrier, then a tool that supplies executive function on demand removes the only conditions under which executive function gets practiced. The story I told — "it keeps me in the loop long enough to learn" — assumes the time in the loop is spent learning. But it might be spent &lt;em&gt;being carried.&lt;/em&gt; The frictionless loop is not obviously a classroom. It may just be a more comfortable room in the same cage.&lt;/p&gt;

&lt;p&gt;I am not going to pretend this objection is weak. It is the true center of the question, and most of the optimistic writing about AI and coding, including my own first draft of this essay, simply walks around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Survives, and Under What Condition
&lt;/h2&gt;

&lt;p&gt;I do not think the objection is fatal. But it changes what I am allowed to claim, and it forces me to give up ground I would rather have kept.&lt;/p&gt;

&lt;p&gt;The counterclaim forces a narrower definition of enabled coding. I can no longer define it as AI-assisted production that happens to make me feel more capable. Nor can I define it as any process where the agent helps me stay in motion. Motion is not growth. The only defensible definition left is this: enabled coding is agentic coding in which production may be delegated, but evaluation is not.&lt;/p&gt;

&lt;p&gt;First, the concession, and it is a real one. For the executive-function layer specifically — initiation, sequencing, the working-memory juggling, the chunking of a feature into files — I will grant the compensatory reading and stop pretending it is rehabilitative. I do not need to internalize the ability to break a task into the right four files, and I probably will not, and I have decided that is acceptable, the way a writer does not need to internalize manuscript formatting to be a writer. Ownership was never made of those parts. So if those particular muscles atrophy, they cost me nothing I needed to keep. The skeptic is right about them, and being right about them turns out not to matter.&lt;/p&gt;

&lt;p&gt;The real question is not about executive function at all. It is about &lt;em&gt;judgment&lt;/em&gt;. And here I have to relocate the entire argument.&lt;/p&gt;

&lt;p&gt;Ownership, when I am honest about what it consists of, is not the ability to type or to sequence. It is the ability to evaluate. To look at a working solution and know whether it is also a correct one. To recognize the specific texture of an agent that has stopped reasoning and started guessing. To reject a patch that passes every visible test but quietly corrupts the architecture. To come back next week, reopen the project, find the relevant part, and make a controlled change without starting from zero. That faculty — judgment — is what blue is actually made of. Everything else is logistics.&lt;/p&gt;

&lt;p&gt;So the only question worth arguing is narrow and brutal: is the agent compensatory or rehabilitative &lt;em&gt;with respect to judgment&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;And here the objection bites hardest, because I cannot wave it off. Judgment is built by consequence. It is the residue of having been wrong and having had to find out why. If the agent absorbs the failure spiral — and absorbing the failure spiral is exactly what I praised it for in the ADHD section — then it may absorb the error-and-consequence loop that is, as far as anyone knows, the only way judgment forms. I have to admit the painful symmetry: the feature that makes the tool an accessibility device is the same feature that threatens the one capacity I cannot afford to lose.&lt;/p&gt;

&lt;p&gt;This is why I can no longer claim that purple &lt;em&gt;tends&lt;/em&gt; toward blue. On the frictionless path — the path the tool is engineered to make easy — purple does not tend toward blue. It tends toward a deeper red. Judgment atrophies precisely as the skeptic predicts, and the loss is masked by the pleasant fluency of a system that keeps mostly working.&lt;/p&gt;

&lt;p&gt;What survives is smaller, and conditional, and I think true: judgment can still form, &lt;em&gt;if&lt;/em&gt; the human refuses to offload evaluation even while offloading production. Those two things are separable, and the separation is the whole game. I can let the agent write every line and still insist on being the one who decides whether the line deserves to exist. But that insistence is not natural. It runs directly against the grain of a tool whose entire design is to make insistence feel unnecessary, even rude — the friend who finishes your sentences so smoothly you forget you had one.&lt;/p&gt;

&lt;p&gt;Which means the practices of ownership are not safety advice bolted onto an optimistic essay. They are the essay's actual engine, and I had them filed under the wrong heading.&lt;/p&gt;

&lt;p&gt;Read the diff.&lt;br&gt;
Ask for the explanation, then check it against the behavior instead of trusting it.&lt;br&gt;
Run the code yourself.&lt;br&gt;
Write the test before you believe the fix.&lt;br&gt;
Keep changes small enough to understand.&lt;br&gt;
Ask what breaks when the input is null, when there are two players, when the network is gone.&lt;br&gt;
Refactor deliberately.&lt;br&gt;
Return to old code and find out, honestly, whether you still understand it — and treat the answer as data about yourself, not the code.&lt;/p&gt;

&lt;p&gt;Reframe these correctly and they are not hygiene. They are &lt;em&gt;deliberately reintroduced difficulty.&lt;/em&gt; Reading the diff is refusing to offload comprehension. Writing the test is refusing to offload the definition of correct. Asking what breaks when the input is null is manufacturing, by hand, the edge-case confrontation that the happy path would otherwise have spared me. Each practice is a conscious reinjection of the friction the agent removed — and crucially, friction placed back exactly where the learning lives, rather than scattered at random across syntax and setup, where it never belonged. That is the form the optimism has to take after the objection. Not "the scaffold becomes a bridge." Rather: the scaffold becomes a bridge &lt;em&gt;only for the person who keeps rebuilding, by hand, the difficulty the scaffold was selling them relief from.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Boundary: When Red Becomes Blue
&lt;/h2&gt;

&lt;p&gt;With that correction in place, the old account of the boundary still stands, but it reads differently now. The movement from vibe coding to enabled coding is not a single moment, and it is not a current that carries you. It is a set of practices performed against the grain.&lt;/p&gt;

&lt;p&gt;A red prompt says: "Fix this."&lt;/p&gt;

&lt;p&gt;A purple prompt says: "Here is the bug, here is what I expected, here is what happened, and here are the files that might be involved. Help me inspect the cause."&lt;/p&gt;

&lt;p&gt;A blue prompt says: "The problem seems to be that this state is updated before the event listener finishes. Propose a minimal patch, explain the tradeoff, and include a regression test."&lt;/p&gt;

&lt;p&gt;The difference is not vocabulary. The difference is whether a mental model exists behind the words, and a mental model is the one thing the tool cannot hand you, because building it is the difficulty the tool removes.&lt;/p&gt;

&lt;p&gt;The final test is delayed ownership. Can the person come back next week, reopen the project, understand the relevant parts, and continue? Can they debug without starting from zero? Can they explain the system well enough to improve it? If yes, the code is no longer merely something they accepted. It is something they are beginning to own.&lt;/p&gt;

&lt;p&gt;But notice what that test really measures. It measures whether the friction got put back. The person who can return and continue is not the person the tool produced by default. It is the person who insisted on understanding things the tool was willing to understand for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risks: When the Agent Owns the System
&lt;/h2&gt;

&lt;p&gt;Everything in the standard risk inventory is real. AI-generated code can be insecure, inefficient, brittle, overcomplicated, or subtly wrong. It can introduce dependencies without explaining why. It can solve the local bug while damaging the larger design. It can pass the visible test path and fail under edge cases. It can invent APIs. It can confidently explain false reasoning. It can encourage the user to move faster than their understanding.&lt;/p&gt;

&lt;p&gt;But after the counterclaim, I no longer think the central danger is in the code. The central danger is in the user. The risk is not primarily that the agent produces a bad artifact. It is that the agent produces a person who feels like an owner and is not one — a person whose sense of competence is calibrated to fluency rather than understanding, and who therefore cannot tell the difference between a system they command and a system that merely behaves, until the day it stops behaving.&lt;/p&gt;

&lt;p&gt;A person can ship code they do not understand. They can collect users, data, payments, or trust with a system they cannot maintain. They can build a game or an app that becomes impossible to extend because every feature was patched into existence through disconnected prompts. They can become dependent on the agent as a repair oracle, unable to distinguish a good fix from a bad one — which is just another way of saying their judgment never formed, masked by years of things mostly working.&lt;/p&gt;

&lt;p&gt;The practices above do not eliminate this. They reintroduce friction in the right places, slowing the user down just enough to keep a responsible human in the loop. That is the most they can do, and it only works if the user actually does them, against the tool's every incentive to skip them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Against Traditional Gatekeeping
&lt;/h2&gt;

&lt;p&gt;None of this rehabilitates the old gatekeeping, and I want to be careful not to let a sobering objection curdle into nostalgia.&lt;/p&gt;

&lt;p&gt;The old image of programming centers on manual authorship: a programmer is someone who knows the language, writes the lines, fixes the errors, and builds the system through direct control. In that model, AI assistance looks like contamination. But programming was never only manual authorship. It has always involved layers of abstraction — engines no one fully understands, libraries no one wrote, operating systems and compilers whose output is rarely inspected. A developer using a game engine or a web framework is already delegating enormous amounts of behavior to code they did not author. The question has always been how well the developer can reason &lt;em&gt;within&lt;/em&gt; those abstractions.&lt;/p&gt;

&lt;p&gt;Agentic coding adds a new abstraction layer: natural language as an interface to implementation. That layer is unstable and risky, but it is still an abstraction layer, and rejecting it outright because it changes the shape of labor would repeat an old mistake — confusing the tools of programming with its essence.&lt;/p&gt;

&lt;p&gt;The essence is not typing. The essence is judgment under stewardship: forming an intention, translating it into a computational system, evaluating whether the system behaves correctly, and maintaining it as requirements change. AI can participate in all of that. It can do most of the line-level production. The human's role does not disappear — &lt;em&gt;unless the human surrenders the evaluation.&lt;/em&gt; That, and not the volume of AI involvement, is the line between an enabled coder and a vibe coder. And after everything above, I have to add that the surrender is not a single choice. It is the default outcome of a frictionless tool, and resisting it is a daily, unnatural act.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Bridge You Have to Carry Across
&lt;/h2&gt;

&lt;p&gt;Vibe coding asks whether the machine can make software from my desire. The question I began with was whether the machine can help me become the kind of person who can own the software I desired.&lt;/p&gt;

&lt;p&gt;The honest answer is harder than the one I wanted to write. The machine can make me someone who &lt;em&gt;appears&lt;/em&gt; to own it, instantly. And that appearance is precisely the danger, because it is indistinguishable from the real thing — to me most of all — right up until the moment the system breaks and demands that I be the one who actually understands it.&lt;/p&gt;

&lt;p&gt;The bridge from red to blue exists. I am now fairly sure of that. But the agent does not walk me across it, and it does not pull me toward it. Its gravity runs the other way, toward the comfortable, fluent, hollowing cage, because removing difficulty is what it is &lt;em&gt;for.&lt;/em&gt; The only way across is to carry, by hand and on purpose, the very weight the agent kept offering to take — to read what it would have let me skim, to struggle where it would have let me coast, to be wrong in the specific ways that build judgment instead of letting the wrongness be quietly absorbed and patched.&lt;/p&gt;

&lt;p&gt;So I will not say that a new kind of programmer is being formed by this technology. By default, the technology forms passive consumers, and dresses them in the feeling of mastery. What is true is smaller and entirely conditional: the technology makes available a path that a disciplined minority can take, against its grain, by manufacturing the difficulty it was built to remove.&lt;/p&gt;

&lt;p&gt;Purple, I have to admit at the end, is not a stage you pass through on the way to blue. It is a fork, and it is a place you can fall back from at any moment. The well-lit, frictionless path leads back to red. The other path is uphill, and you build it yourself, out of the difficulty you choose to keep.&lt;/p&gt;

&lt;p&gt;A vibe coder accepts the artifact.&lt;/p&gt;

&lt;p&gt;An enabled coder refuses to stop understanding it, even when the machine has made understanding optional.&lt;/p&gt;

&lt;p&gt;And because the machine is built to make understanding feel optional, it will win that argument whenever the user stops actively resisting it. That is why enabled coding cannot simply mean coding with help. It has to mean coding under a discipline: the discipline of keeping judgment human when production no longer has to be.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Bainbridge, L. (1983). Ironies of automation. &lt;em&gt;Automatica, 19&lt;/em&gt;(6), 775–779. &lt;a href="https://doi.org/10.1016/0005-1098(83)90046-8" rel="noopener noreferrer"&gt;https://doi.org/10.1016/0005-1098(83)90046-8&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Becker, J., et al. (2025). &lt;em&gt;Measuring the impact of early-2025 AI on experienced open-source developer productivity.&lt;/em&gt; METR. (Reported finding: experienced developers expected and perceived a speedup from AI tools while completing tasks ~19% slower with them.)&lt;/p&gt;

&lt;p&gt;Bjork, E. L., &amp;amp; Bjork, R. A. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In M. A. Gernsbacher, R. W. Pew, L. M. Hough, &amp;amp; J. R. Pomerantz (Eds.), &lt;em&gt;Psychology and the real world: Essays illustrating fundamental contributions to society&lt;/em&gt; (pp. 56–64). Worth Publishers.&lt;/p&gt;

&lt;p&gt;Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe &amp;amp; A. Shimamura (Eds.), &lt;em&gt;Metacognition: Knowing about knowing&lt;/em&gt; (pp. 185–205). MIT Press.&lt;/p&gt;

&lt;p&gt;Casner, S. M., Geven, R. W., Recker, M. P., &amp;amp; Schooler, J. W. (2014). The retention of manual flying skills in the automated cockpit. &lt;em&gt;Human Factors, 56&lt;/em&gt;(8), 1506–1516. &lt;a href="https://doi.org/10.1177/0018720814535628" rel="noopener noreferrer"&gt;https://doi.org/10.1177/0018720814535628&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Collins Dictionary. (2025). &lt;em&gt;The Collins Word of the Year 2025.&lt;/em&gt; HarperCollins.&lt;/p&gt;

&lt;p&gt;Cook, A. M., &amp;amp; Polgar, J. M. (2015). &lt;em&gt;Assistive technologies: Principles and practice&lt;/em&gt; (4th ed.). Elsevier/Mosby.&lt;/p&gt;

&lt;p&gt;Ebbatson, M., Harris, D., Huddlestone, J., &amp;amp; Sears, R. (2010). The relationship between manual handling performance and recent flying experience in air transport pilots. &lt;em&gt;Ergonomics, 53&lt;/em&gt;(2), 268–277. &lt;a href="https://doi.org/10.1080/00140130903342349" rel="noopener noreferrer"&gt;https://doi.org/10.1080/00140130903342349&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Huang, R., et al. (2025). &lt;em&gt;Professional software developers don't vibe, they control: AI agent use for coding in 2025.&lt;/em&gt; arXiv preprint arXiv:2512.14012. &lt;a href="https://arxiv.org/abs/2512.14012" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2512.14012&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Karpathy, A. (2025, February 2). &lt;em&gt;There's a new kind of coding I call "vibe coding"&lt;/em&gt; [Post]. X. &lt;a href="https://x.com/karpathy/status/1886192184808149383" rel="noopener noreferrer"&gt;https://x.com/karpathy/status/1886192184808149383&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Slamecka, N. J., &amp;amp; Graf, P. (1978). The generation effect: Delineation of a phenomenon. &lt;em&gt;Journal of Experimental Psychology: Human Learning and Memory, 4&lt;/em&gt;(6), 592–604. &lt;a href="https://doi.org/10.1037/0278-7393.4.6.592" rel="noopener noreferrer"&gt;https://doi.org/10.1037/0278-7393.4.6.592&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vygotsky, L. S. (1978). &lt;em&gt;Mind in society: The development of higher psychological processes.&lt;/em&gt; Harvard University Press.&lt;/p&gt;

&lt;p&gt;Wood, D., Bruner, J. S., &amp;amp; Ross, G. (1976). The role of tutoring in problem solving. &lt;em&gt;Journal of Child Psychology and Psychiatry, 17&lt;/em&gt;(2), 89–100. &lt;a href="https://doi.org/10.1111/j.1469-7610.1976.tb00381.x" rel="noopener noreferrer"&gt;https://doi.org/10.1111/j.1469-7610.1976.tb00381.x&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>discuss</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>I built a local-first Apple Health recovery briefing that shows its math</title>
      <dc:creator>Connor Hickey</dc:creator>
      <pubDate>Tue, 26 May 2026 19:00:09 +0000</pubDate>
      <link>https://dev.to/conalh/i-built-a-local-first-apple-health-recovery-briefing-that-shows-its-math-577i</link>
      <guid>https://dev.to/conalh/i-built-a-local-first-apple-health-recovery-briefing-that-shows-its-math-577i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbls20b7tva9nz1500st2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbls20b7tva9nz1500st2.png" alt=" " width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://conalh.github.io/recovery-trail/" rel="noopener noreferrer"&gt;recovery-trail&lt;/a&gt;, a small local-first Apple Health viewer that answers one specific training question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should I push today, or should I pull back?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The part I like most is what it does &lt;em&gt;not&lt;/em&gt; use.&lt;/p&gt;

&lt;p&gt;There is no backend. There is no account. There is no upload. There is no LLM reading your health data and inventing a confident-sounding paragraph.&lt;/p&gt;

&lt;p&gt;It is just a browser app, a web worker, some trend math, and a rule trace.&lt;/p&gt;

&lt;p&gt;You drop in an Apple Health &lt;code&gt;export.xml&lt;/code&gt;. The app parses HRV, resting heart rate, sleep, and workout load locally, then produces a verdict: &lt;code&gt;standard&lt;/code&gt;, &lt;code&gt;caution&lt;/code&gt;, or &lt;code&gt;deload&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The app is live here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://conalh.github.io/recovery-trail/" rel="noopener noreferrer"&gt;https://conalh.github.io/recovery-trail/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The source is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Conalh/recovery-trail" rel="noopener noreferrer"&gt;https://github.com/Conalh/recovery-trail&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is also a sample-data button, so you can try it without using your own export.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not use an LLM?
&lt;/h2&gt;

&lt;p&gt;I like LLMs. I use them a lot. This was not a good place for one.&lt;/p&gt;

&lt;p&gt;Recovery data is already noisy. HRV can bounce around. Sleep data can be wrong. Workout load can look scary if you only look at one week. If the final layer is also probabilistic, you now have noise on top of noise.&lt;/p&gt;

&lt;p&gt;For this project, I wanted the opposite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;same input, same verdict&lt;/li&gt;
&lt;li&gt;every fired rule visible&lt;/li&gt;
&lt;li&gt;every threshold inspectable&lt;/li&gt;
&lt;li&gt;no natural-language explanation that hides the actual reason&lt;/li&gt;
&lt;li&gt;no model touching private health data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the app is closer to a tiny rules engine than a chatbot.&lt;/p&gt;

&lt;p&gt;The UI can still produce a narrative line, but the verdict itself comes from deterministic logic gates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did HRV trend down?&lt;/li&gt;
&lt;li&gt;did resting HR trend up?&lt;/li&gt;
&lt;li&gt;did sleep drop?&lt;/li&gt;
&lt;li&gt;did acute workload spike relative to chronic load?&lt;/li&gt;
&lt;li&gt;did enough signals fire across enough metrics to call it a recovery stack issue?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That phrase, "logic gates," is the part that makes it fun. The app does not "feel" like it works because it generated a plausible answer. It works because the pieces line up: signal in, math applied, rule fired, evidence shown.&lt;/p&gt;

&lt;h2&gt;
  
  
  The data stays in the browser
&lt;/h2&gt;

&lt;p&gt;Apple Health exports can be huge. A real &lt;code&gt;export.xml&lt;/code&gt; can be hundreds of megabytes. I did not want the app to freeze while reading it, and I definitely did not want to upload that file anywhere.&lt;/p&gt;

&lt;p&gt;So parsing happens in a web worker.&lt;/p&gt;

&lt;p&gt;The worker streams the file with &lt;code&gt;file.stream().pipeThrough(new TextDecoderStream())&lt;/code&gt;, keeps a sliding text buffer, and scans only for the record types and elements the app needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;HKQuantityTypeIdentifierHeartRateVariabilitySDNN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HKQuantityTypeIdentifierRestingHeartRate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HKCategoryTypeIdentifierSleepAnalysis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Workout&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives the UI progress updates while the worker extracts just enough structure for the rule layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HRV samples&lt;/li&gt;
&lt;li&gt;resting heart rate samples&lt;/li&gt;
&lt;li&gt;sleep intervals&lt;/li&gt;
&lt;li&gt;workout durations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is intentionally boring architecture. The file never leaves the page. The worker returns a normalized in-memory object. The main React app evaluates it and renders the briefing.&lt;/p&gt;
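&lt;p&gt;To make the sliding-buffer idea concrete, here is a minimal sketch, not the code in the repo. It assumes quantity samples arrive as &lt;code&gt;Record&lt;/code&gt; elements carrying &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;startDate&lt;/code&gt;, and &lt;code&gt;value&lt;/code&gt; attributes, and it handles numeric samples only. The &lt;code&gt;Sample&lt;/code&gt; shape and the &lt;code&gt;createRecordScanner&lt;/code&gt; and &lt;code&gt;attr&lt;/code&gt; names are made up for illustration.&lt;/p&gt;

```typescript
// Hypothetical shape for one extracted sample; not the project's real types.
interface Sample {
  type: string;
  startDate: string;
  value: number;
}

// Read one attribute out of the text of a single opening tag.
function attr(tag: string, attrName: string): string | null {
  const m = new RegExp("\\s" + attrName + '="([^"]*)"').exec(tag);
  return m === null ? null : (m[1] ?? null);
}

// Accepts decoded text chunks of any size and keeps only the unfinished tail.
function createRecordScanner(wantedTypes: string[]) {
  let buffer = "";
  const samples: Sample[] = [];

  function push(chunk: string): void {
    buffer += chunk;
    // Text before the last closing bracket holds only complete tags. Whatever
    // follows may be a tag cut in half by the chunk boundary, so it is kept.
    const cut = buffer.lastIndexOf(">");
    if (cut === -1) return;
    const complete = buffer.slice(0, cut);
    buffer = buffer.slice(cut + 1);

    for (const tag of complete.split(">")) {
      if (tag.indexOf("Record ") === -1) continue;
      const type = attr(tag, "type");
      const startDate = attr(tag, "startDate");
      const value = attr(tag, "value");
      if (type === null || startDate === null || value === null) continue;
      if (wantedTypes.indexOf(type) === -1) continue;
      samples.push({ type, startDate, value: Number(value) });
    }
  }

  return { push, samples };
}
```

&lt;p&gt;In a worker, each decoded chunk from the stream would go to &lt;code&gt;push&lt;/code&gt;, and &lt;code&gt;samples&lt;/code&gt; would be read once the stream ends. Memory stays bounded because only the unfinished tail of the text is carried between chunks.&lt;/p&gt;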

&lt;h2&gt;
  
  
  The core idea: levels and trends are different
&lt;/h2&gt;

&lt;p&gt;One thing I wanted to avoid was treating every metric as a simple red/yellow/green status.&lt;/p&gt;

&lt;p&gt;For recovery, a level and a trend are different signals.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HRV can be below baseline right now.&lt;/li&gt;
&lt;li&gt;HRV can also be trending downward even if it has not crossed a scary threshold yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those should not be collapsed into the same check.&lt;/p&gt;

&lt;p&gt;So the app runs both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;level rules: where is the metric now?&lt;/li&gt;
&lt;li&gt;trend rules: where is the metric going?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Level rules catch things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;7-day HRV below a 28-day baseline&lt;/li&gt;
&lt;li&gt;7-day resting HR above a 28-day baseline&lt;/li&gt;
&lt;li&gt;sleep below target&lt;/li&gt;
&lt;li&gt;acute:chronic workload ratio too high&lt;/li&gt;
&lt;/ul&gt;
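&lt;p&gt;As a rough sketch of those level checks (again, not the repo's code), the snippet below compares a 7-day mean with a 28-day baseline and computes an acute:chronic ratio. The function names, the &lt;code&gt;DailySeries&lt;/code&gt; shape, and every threshold in &lt;code&gt;LIMITS&lt;/code&gt; are assumptions picked only to make the example concrete.&lt;/p&gt;

```typescript
// Illustrative level checks. Every threshold below is a placeholder assumption.
const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((s, v) => s + v, 0) / xs.length;

// Relative change of the 7-day mean against the 28-day baseline mean.
function changeVsBaseline(daily: number[]): number {
  const baseline = mean(daily.slice(-28));
  return baseline === 0 ? 0 : (mean(daily.slice(-7)) - baseline) / baseline;
}

// Acute:chronic workload ratio: load in the last 7 days divided by the
// average weekly load over the last 28 days.
function acuteChronicRatio(dailyLoad: number[]): number {
  const sum = (xs: number[]): number => xs.reduce((s, v) => s + v, 0);
  const chronicWeekly = sum(dailyLoad.slice(-28)) / 4;
  return chronicWeekly === 0 ? 0 : sum(dailyLoad.slice(-7)) / chronicWeekly;
}

// Hypothetical input: one value per day, oldest first.
interface DailySeries {
  hrvMs: number[];
  restingHr: number[];
  sleepHours: number[];
  loadMinutes: number[];
}

// Hypothetical thresholds, chosen only for the example.
const LIMITS = { hrvDrop: -0.1, rhrRise: 0.05, sleepTarget: 7, acwrMax: 1.5 };

// Returns the ids of the level rules that fired.
function levelRules(d: DailySeries): string[] {
  const fired: string[] = [];
  if (LIMITS.hrvDrop > changeVsBaseline(d.hrvMs)) fired.push("hrv-below-baseline");
  if (changeVsBaseline(d.restingHr) > LIMITS.rhrRise) fired.push("rhr-above-baseline");
  if (LIMITS.sleepTarget > mean(d.sleepHours.slice(-7))) fired.push("sleep-below-target");
  if (acuteChronicRatio(d.loadMinutes) > LIMITS.acwrMax) fired.push("load-spike");
  return fired;
}
```

&lt;p&gt;Each check only answers "where is the metric now?" and says nothing about direction.&lt;/p&gt;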

&lt;p&gt;Trend rules catch direction of travel.&lt;/p&gt;

&lt;p&gt;The trend logic is ported from my trainer-facing companion project, &lt;a href="https://github.com/Conalh/fit-ontology" rel="noopener noreferrer"&gt;fit-ontology&lt;/a&gt;, and uses two windows side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dual-window trend detector
&lt;/h2&gt;

&lt;p&gt;Each recovery signal runs two slope estimators:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Acute: 7-day ordinary least squares slope on the raw daily series.&lt;/li&gt;
&lt;li&gt;Chronic: 28-day EWMA-smoothed series, then ordinary least squares on that smoothed line.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both slopes are normalized by the 28-day baseline standard deviation, so the threshold is in SD/day rather than raw units.&lt;/p&gt;

&lt;p&gt;That matters because "2 ms of HRV per day" and "2 bpm of resting heart rate per day" should not be interpreted on the same raw scale. Normalizing makes the trend detector speak in relative movement against the user's own baseline.&lt;/p&gt;

&lt;p&gt;The 7-day detector is responsive but noisy. The 28-day detector is slower but better at catching sustained drift.&lt;/p&gt;
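&lt;p&gt;A minimal sketch of the two estimators, under stated assumptions, might look like this. The function names and the EWMA smoothing factor &lt;code&gt;alpha&lt;/code&gt; are illustrative, not the values used in the repo or in fit-ontology.&lt;/p&gt;

```typescript
// Ordinary least squares slope of y against the day index 0..n-1 (units/day).
function olsSlope(y: number[]): number {
  const n = y.length;
  if (2 > n) return 0;
  const meanX = (n - 1) / 2;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let num = 0;
  let den = 0;
  y.forEach((v, i) => {
    num += (i - meanX) * (v - meanY);
    den += (i - meanX) ** 2;
  });
  return num / den;
}

// Exponentially weighted moving average, seeded with the first value.
// alpha is a hypothetical smoothing factor.
function ewma(y: number[], alpha: number): number[] {
  let prev = 0;
  return y.map((v, i) => {
    prev = i === 0 ? v : alpha * v + (1 - alpha) * prev;
    return prev;
  });
}

// Sample standard deviation; 0 when there are fewer than two points.
function stdDev(y: number[]): number {
  const n = y.length;
  if (2 > n) return 0;
  const m = y.reduce((s, v) => s + v, 0) / n;
  const ss = y.reduce((s, v) => s + (v - m) ** 2, 0);
  return Math.sqrt(ss / (n - 1));
}

// Both slopes are expressed in baseline standard deviations per day.
interface DualSlope {
  acute: number;
  chronic: number;
}

// daily holds one value per day, oldest first.
function dualWindowSlopes(daily: number[], alpha = 0.25): DualSlope {
  const last28 = daily.slice(-28);
  const last7 = daily.slice(-7);
  const sd = stdDev(last28);
  if (sd === 0) return { acute: 0, chronic: 0 }; // flat baseline: no trend
  return {
    acute: olsSlope(last7) / sd,
    chronic: olsSlope(ewma(last28, alpha)) / sd,
  };
}
```

&lt;p&gt;Because both numbers are divided by the same baseline standard deviation, one threshold in SD/day can be applied to HRV and resting heart rate alike.&lt;/p&gt;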

&lt;p&gt;Then a combiner resolves the two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;acute-only fires: demote one band&lt;/li&gt;
&lt;li&gt;chronic-only severe fires: surface as mild&lt;/li&gt;
&lt;li&gt;chronic stronger than acute: promote one band&lt;/li&gt;
&lt;li&gt;chronic confirms acute: trust the acute band&lt;/li&gt;
&lt;li&gt;neither fires: no trend rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives the app a useful personality: it reacts to sharp changes, but it does not panic just because one short window got weird.&lt;/p&gt;

&lt;p&gt;There is one more safety check. If the composite recovery score is still high, trend signals get demoted again. In other words, if the level picture is excellent, borderline trend math should not dominate the final verdict.&lt;/p&gt;

&lt;p&gt;That is the kind of rule I like because it is explicit. You can argue with it. You can tune it. You can test it. It is not hidden behind a prompt.&lt;/p&gt;
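&lt;p&gt;One way to write that combiner down is sketched below. It assumes a four-step band scale (none, mild, moderate, severe), reads "chronic stronger than acute" as the chronic band being higher than a firing acute band, and ignores chronic-only signals weaker than severe. All three are assumptions of this sketch, as are the names.&lt;/p&gt;

```typescript
// Assumed band scale: 0 = none, 1 = mild, 2 = moderate, 3 = severe.
type Band = 0 | 1 | 2 | 3;

// Keep a band inside the 0..3 range after promoting or demoting it.
const clampBand = (n: number): Band => Math.max(0, Math.min(3, n)) as Band;

function combineTrend(acute: Band, chronic: Band, recoveryScoreHigh: boolean): Band {
  let band: Band;
  if (acute === 0) {
    // Neither fires, or chronic-only. A severe chronic-only signal surfaces
    // as mild. Dropping weaker chronic-only signals is an assumption here.
    band = chronic === 3 ? 1 : 0;
  } else if (chronic === 0) {
    band = clampBand(acute - 1); // acute-only: demote one band
  } else if (chronic > acute) {
    band = clampBand(acute + 1); // chronic stronger than acute: promote one band
  } else {
    band = acute; // chronic confirms acute: trust the acute band
  }
  // Safety check: a high composite recovery score demotes the result once more.
  return recoveryScoreHigh ? clampBand(band - 1) : band;
}
```

&lt;p&gt;For example, &lt;code&gt;combineTrend(2, 0, false)&lt;/code&gt; returns &lt;code&gt;1&lt;/code&gt;: a moderate acute signal with no chronic support is surfaced as mild.&lt;/p&gt;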

&lt;h2&gt;
  
  
  The verdict is just the maximum fired severity
&lt;/h2&gt;

&lt;p&gt;After the rules run, the final verdict is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no caution-level or deload-level rule fired: &lt;code&gt;standard&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;caution-level rule: &lt;code&gt;caution&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;deload-level rule: &lt;code&gt;deload&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If enough rules fire across enough different metrics, the app synthesizes a meta-rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Recovery stack is down across the board.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That rule is not mystical. It only appears when at least three rules fire across at least three metrics. It is just a framing layer so the UI can say, "this is not one isolated marker."&lt;/p&gt;

&lt;p&gt;The important part is that the rule list stays visible. A verdict without evidence is not very useful. A verdict with every fired rule, window badge, and slope value is something you can actually inspect.&lt;/p&gt;
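&lt;p&gt;The verdict step is small enough to sketch in a few lines. The &lt;code&gt;FiredRule&lt;/code&gt; shape, the metric names, and the function names below are hypothetical. Only the maximum-severity rule and the three-rules-across-three-metrics condition come from the description above.&lt;/p&gt;

```typescript
type Severity = "standard" | "caution" | "deload";

// Hypothetical shape for a fired rule; field names are assumptions.
interface FiredRule {
  id: string;
  metric: "hrv" | "restingHr" | "sleep" | "load";
  severity: Severity;
}

// Ordering used to compare severities.
const RANK: { [K in Severity]: number } = { standard: 0, caution: 1, deload: 2 };

// The verdict is the worst severity among the fired rules.
function verdict(rules: FiredRule[]): Severity {
  return rules.reduce(
    (worst: Severity, r: FiredRule): Severity =>
      RANK[r.severity] > RANK[worst] ? r.severity : worst,
    "standard" as Severity
  );
}

// The meta-rule applies when at least three rules fire across at least
// three distinct metrics.
function isRecoveryStackDown(rules: FiredRule[]): boolean {
  const metrics = rules
    .map((r) => r.metric)
    .filter((m, i, all) => all.indexOf(m) === i);
  if (3 > rules.length || 3 > metrics.length) return false;
  return true;
}
```

&lt;p&gt;Keeping the fired rules as plain data is what lets the UI list them next to the verdict instead of showing the verdict alone.&lt;/p&gt;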

&lt;h2&gt;
  
  
  The interface is a briefing, not a dashboard dump
&lt;/h2&gt;

&lt;p&gt;The main view is intentionally compact.&lt;/p&gt;

&lt;p&gt;Four rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HRV&lt;/li&gt;
&lt;li&gt;resting HR&lt;/li&gt;
&lt;li&gt;sleep&lt;/li&gt;
&lt;li&gt;load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each row shows daily cells against baseline. Teal means better than baseline. Rust means worse. The right side shows today's value and percentage change.&lt;/p&gt;

&lt;p&gt;Below the heatmap, the app writes a short narrative line like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Through 5/19 everything was at baseline. Then all four metrics rolled over at once and stayed there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then it lists the rules that fired.&lt;/p&gt;

&lt;p&gt;Each trend rule shows whether the 7-day detector, 28-day detector, or both fired. It also prints the raw SD/day slope numbers. Tapping a rule focuses the relevant metric row. Tapping a metric expands it into a small SVG line chart with the baseline overlay.&lt;/p&gt;

&lt;p&gt;There are no charting dependencies. The sparklines and metric chart are hand-rolled SVG because the data shape is narrow and the interaction is specific.&lt;/p&gt;
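&lt;p&gt;For a sense of how little code a hand-rolled sparkline needs, here is a minimal sketch that turns a series into the &lt;code&gt;d&lt;/code&gt; attribute of an SVG &lt;code&gt;path&lt;/code&gt;. The function name and the min/max scaling are my own choices for the example; the app's chart code may be organized differently.&lt;/p&gt;

```typescript
// Build the "d" attribute for an SVG path from a series of values.
// Hypothetical helper for illustration, not the app's actual chart code.
function sparklinePath(values: number[], width: number, height: number): string {
  if (values.length === 0) return "";
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min || 1; // avoid dividing by zero on a flat series
  const step = values.length > 1 ? width / (values.length - 1) : 0;
  return values
    .map((value, index) => {
      const x = index * step;
      // The SVG y axis points down, so larger values get a smaller y.
      const y = height - ((value - min) / range) * height;
      const command = index === 0 ? "M" : "L";
      return `${command}${x.toFixed(1)},${y.toFixed(1)}`;
    })
    .join(" ");
}

console.log(sparklinePath([0, 5, 10], 100, 20));
```

&lt;p&gt;The returned string can be dropped straight into a React &lt;code&gt;path&lt;/code&gt; element, and a baseline overlay is just a second path or a horizontal line at the scaled baseline value.&lt;/p&gt;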

&lt;p&gt;The project is built with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vite&lt;/li&gt;
&lt;li&gt;React 19&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;Tailwind&lt;/li&gt;
&lt;li&gt;a single web worker&lt;/li&gt;
&lt;li&gt;zero backend&lt;/li&gt;
&lt;li&gt;zero analytics&lt;/li&gt;
&lt;li&gt;zero tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What worked out
&lt;/h2&gt;

&lt;p&gt;The satisfying part is that the deterministic approach still feels alive.&lt;/p&gt;

&lt;p&gt;The sample data is synthetic, but it is shaped like a real recovery slide: HRV down, resting HR up, sleep down, and load elevated. The app catches the rollover and produces a deload verdict. More importantly, it shows why.&lt;/p&gt;

&lt;p&gt;That is the win for me.&lt;/p&gt;

&lt;p&gt;Not "AI says you should rest."&lt;/p&gt;

&lt;p&gt;More like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;HRV is off baseline across both the 7-day and 28-day windows. Resting HR is rising in the acute window. Sleep is below target. Multiple metrics are down at once.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is less magical and more useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things I would still improve
&lt;/h2&gt;

&lt;p&gt;The current version is exploratory. It is not medical advice, and it is not trying to prescribe training for everyone.&lt;/p&gt;

&lt;p&gt;The next pieces I would like to improve are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more fixture coverage around the trend combiner&lt;/li&gt;
&lt;li&gt;better handling for sparse Apple Health exports&lt;/li&gt;
&lt;li&gt;clearer language for "standard" days, not just bad days&lt;/li&gt;
&lt;li&gt;optional export/share of the rule trace&lt;/li&gt;
&lt;li&gt;more tuning against real-world edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also a small lint cleanup left in the repo right now, even though the production build passes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://conalh.github.io/recovery-trail/" rel="noopener noreferrer"&gt;https://conalh.github.io/recovery-trail/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Conalh/recovery-trail" rel="noopener noreferrer"&gt;https://github.com/Conalh/recovery-trail&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "Try with sample data" if you just want to see the briefing.&lt;/p&gt;

&lt;p&gt;If you use your own Apple Health export, the data stays local. The browser parses it, the rules run client-side, and the result is just a visible trail of math.&lt;/p&gt;

&lt;p&gt;That is the whole point of the project: not a black box, not a model, just an explainable recovery trail.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>react</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I built a tool to catch AI coding agents misbehaving — and put zero AI in it</title>
      <dc:creator>Connor Hickey</dc:creator>
      <pubDate>Sun, 24 May 2026 16:00:31 +0000</pubDate>
      <link>https://dev.to/conalh/i-built-a-tool-to-catch-ai-coding-agents-misbehaving-and-put-zero-ai-in-it-1lg3</link>
      <guid>https://dev.to/conalh/i-built-a-tool-to-catch-ai-coding-agents-misbehaving-and-put-zero-ai-in-it-1lg3</guid>
      <description>&lt;p&gt;I lean on AI coding agents hard. Claude Code, Cursor, Codex: I drive them fast to ship fast. That's not a confession; it's the whole reason this project exists. If you push these tools to their limits every day, you stop seeing them as magic and start seeing exactly where they break.&lt;/p&gt;

&lt;p&gt;And the thing I kept noticing is this: &lt;strong&gt;they never break in the chat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the conversation, the agent looks great. It explains its plan, it sounds reasonable, it agrees with all your constraints. The problem shows up later — in the diff, after the fact, when you're tired and the PR is green and you just want to merge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually goes wrong
&lt;/h2&gt;

&lt;p&gt;A short, real list of things I watched coding agents do, none of which looked wrong in the chat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quietly widened its own permissions — edited the agent's settings allowlist to grant access it didn't have at the start of the session.&lt;/li&gt;
&lt;li&gt;Contradicted its own config — one file said "never touch the network," another granted a network tool, and nothing reconciled the two.&lt;/li&gt;
&lt;li&gt;Made an undeclared outbound network call, tucked into an unrelated change.&lt;/li&gt;
&lt;li&gt;Opened a PR titled &lt;code&gt;fix: typo in README&lt;/code&gt; that touched a dozen unrelated files.&lt;/li&gt;
&lt;li&gt;Left a session transcript showing it had read an SSH key and piped &lt;code&gt;curl&lt;/code&gt; to a shell.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those would sail through code review. Not because the reviewer is careless — because &lt;strong&gt;nobody is looking for this class of problem.&lt;/strong&gt; Human reviewers look for bugs and style. They don't diff your agent's permission allowlist between the base and head of a branch, and they don't cross-check three different agent config files for contradictions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix everyone reaches for (and why it's wrong here)
&lt;/h2&gt;

&lt;p&gt;The first instinct is to prompt harder. Add more rules to your &lt;code&gt;CLAUDE.md&lt;/code&gt;, write a stricter system prompt, tell the agent not to do the bad things.&lt;/p&gt;

&lt;p&gt;It doesn't work, and the reason is structural: &lt;strong&gt;better instructions going in don't catch what actually came out.&lt;/strong&gt; The agent that widened its own permissions didn't misread a rule. The gap is between what it said and what it shipped, and you can't close that gap from the input side.&lt;/p&gt;

&lt;p&gt;The next instinct is LLM-as-judge: have a second model read the diff and flag problems. And this is where I made the call the whole project hangs on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I put no LLM in the analysis path. None.&lt;/strong&gt; The thing that reviews your agent's work has zero AI in it.&lt;/p&gt;

&lt;p&gt;That sounds backwards for an &lt;em&gt;AI&lt;/em&gt;-governance tool, so let me defend it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why deterministic, not probabilistic
&lt;/h2&gt;

&lt;p&gt;This runs as a &lt;strong&gt;CI gate&lt;/strong&gt; — it can fail your build. The moment something is allowed to block a merge, it has to clear a much higher bar than "usually right":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. It has to be reproducible.&lt;/strong&gt; Same diff, same verdict, every time. An LLM judge can give you a different answer across runs, across temperatures, and across model updates you never opted into. You cannot gate a build on a coin flip, however weighted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. It can't hallucinate a finding.&lt;/strong&gt; A deterministic checker flags a permission escalation because the allowlist &lt;em&gt;literally changed&lt;/em&gt; from X to Y — and it can point at the exact line. An LLM judge can invent a "critical" issue that isn't there. The first time your gate blocks a legitimate PR over a hallucinated problem, the team stops trusting it — and a governance tool nobody trusts gets switched off inside a week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. It runs everywhere, for free, in seconds.&lt;/strong&gt; No API key, no rate limit, no token budget, no network round-trip on every pull request. It's just code reading code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Nothing leaves the machine.&lt;/strong&gt; All analysis runs locally against your checked-out repo. Your proprietary code and your agent transcripts never get shipped to a third-party model. For a lot of teams that isn't a nice-to-have — it's the line between "can adopt this" and "can't."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Every finding is auditable.&lt;/strong&gt; Not "the model thought this looked risky," but "this config key changed, here's the before and after, here's the rule it tripped." That's what makes a finding defensible in review instead of the start of an argument.&lt;/p&gt;
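&lt;p&gt;To make the properties above concrete, here is a minimal sketch of the kind of check they describe: compare an allowlist at base and head and report exactly what was added. The &lt;code&gt;AllowlistFinding&lt;/code&gt; shape, the rule name, and the sample entries are made up for illustration and are not the suite's real schema or output.&lt;/p&gt;

```typescript
// Hypothetical finding shape for illustration; not the suite's actual Finding schema.
interface AllowlistFinding {
  rule: string;
  severity: "critical";
  added: string;
  message: string;
}

// Report every entry present in the head allowlist but absent from the base.
// Sorting the additions keeps the output identical for identical inputs.
function diffAllowlist(base: string[], head: string[]): AllowlistFinding[] {
  const before = new Set(base);
  return head
    .filter((entry) => !before.has(entry))
    .sort()
    .map((entry) => ({
      rule: "permission-escalation",
      severity: "critical" as const,
      added: entry,
      message: `allowlist gained "${entry}" between base and head`,
    }));
}

// Made-up allowlist entries, only to show the before and after.
const baseAllow = ["Read", "Edit"];
const headAllow = ["Read", "Edit", "Bash(curl:*)"];

console.log(diffAllowlist(baseAllow, headAllow));
```

&lt;p&gt;There is no judgment call anywhere in that function: the finding exists because a string is in one list and not the other, which is what makes it reproducible and easy to defend in review.&lt;/p&gt;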

&lt;h2&gt;
  
  
  How it's built
&lt;/h2&gt;

&lt;p&gt;It started as one small deterministic check — &lt;em&gt;does this PR's diff quietly change what the agent is allowed to do?&lt;/em&gt; — and grew into a suite of eight packages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A shared core library&lt;/strong&gt; does the unglamorous, load-bearing work: JSON/JSONC/TOML parsing, shell tokenization, normalizing MCP server commands into a canonical form, and a single &lt;code&gt;Finding&lt;/code&gt; schema — frozen at v1.0 — that everything else speaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five focused detectors&lt;/strong&gt;, each catching one class of drift: config/permission drift between base and head, contradictions across agent config files, network and subprocess capability signals in a diff, mismatches between a PR's stated task and its actual changes, and risky behavior recorded in session transcripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A live monitor&lt;/strong&gt; that watches an agent's trajectory in real time in the terminal — for when you want to see it as it happens, not just at PR time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A meta-reviewer&lt;/strong&gt; that consolidates the PR-time detectors into one deduplicated, severity-sorted verdict and fails CI on anything critical — so the whole suite reports as a single pass/fail check instead of five noisy ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hardest engineering wasn't any single detector. It was the schema. Getting five tools that look at completely different things — config files, diffs, transcripts — to emit findings in one shape a meta-reviewer can merge, dedupe, and rank is the part that took the most design. That boring shared library is the reason the suite feels like one tool instead of five loose scripts.&lt;/p&gt;
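&lt;p&gt;A simplified sketch of that merge step, under stated assumptions: the &lt;code&gt;Finding&lt;/code&gt; fields, the four severity levels, the dedupe key, and the tool names below are hypothetical stand-ins, not the frozen v1.0 schema.&lt;/p&gt;

```typescript
// Hypothetical, simplified shape; the suite's real schema is not reproduced here.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  tool: string;
  ruleId: string;
  file: string;
  severity: Severity;
  message: string;
}

// Severity order, most severe first.
const RANK: Severity[] = ["critical", "high", "medium", "low"];

// Merge per-detector batches, drop duplicates, sort by severity, and decide pass/fail.
function consolidate(batches: Finding[][]): { findings: Finding[]; failed: boolean } {
  const seen = new Set(); // dedupe keys already emitted
  const merged: Finding[] = [];
  for (const batch of batches) {
    for (const finding of batch) {
      // Assumed dedupe key: same rule, file, and message, regardless of which tool reported it.
      const key = [finding.ruleId, finding.file, finding.message].join("|");
      if (seen.has(key)) continue;
      seen.add(key);
      merged.push(finding);
    }
  }
  // Array.prototype.sort is stable, so first-seen order is kept within a severity.
  merged.sort((a, b) => RANK.indexOf(a.severity) - RANK.indexOf(b.severity));
  const failed = merged.some((finding) => finding.severity === "critical");
  return { findings: merged, failed };
}

const report = consolidate([
  [{ tool: "task-drift", ruleId: "scope", file: "src/net/client.ts", severity: "medium", message: "outside stated task" }],
  [{ tool: "config-drift", ruleId: "perm", file: "settings.json", severity: "critical", message: "allowlist widened" }],
  [{ tool: "contradiction", ruleId: "perm", file: "settings.json", severity: "critical", message: "allowlist widened" }],
]);

console.log(report.failed, report.findings.map((finding) => finding.ruleId));
```

&lt;p&gt;The merge itself is trivial once every detector emits the same shape. The design work is in agreeing on that shape.&lt;/p&gt;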

&lt;h2&gt;
  
  
  Where determinism falls short (being honest)
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend rules beat models at everything. Determinism only catches what you've written a rule for. Genuinely novel misbehavior that doesn't match a known pattern walks straight through.&lt;/p&gt;

&lt;p&gt;The sharpest example is in my own suite: the detector that checks whether a PR's diff matches its stated task. "Does this change match this description" is a genuinely &lt;em&gt;semantic&lt;/em&gt; question, and the deterministic version approximates it with heuristics — file scope, paths touched, keyword overlap. That's cruder than what a model could assess, and I'll own it.&lt;/p&gt;
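&lt;p&gt;To show how crude, here is a toy version of one such heuristic: flag changed paths that share no keyword with the PR title. The stop-word list, the function names, and the "no shared keyword" rule are assumptions for the example; the real detector's scoring is not shown in this post.&lt;/p&gt;

```typescript
// Hypothetical heuristic for illustration; not the detector's actual logic.
const STOP_WORDS = new Set(["fix", "feat", "chore", "in", "the", "a", "an", "of", "to", "and"]);

// Lowercase, split on anything that is not a letter or digit, drop noise words.
function keywords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 1)
    .filter((word) => !STOP_WORDS.has(word));
}

// Return the changed paths that share no keyword with the PR title.
function pathsOutsideStatedTask(title: string, changedPaths: string[]): string[] {
  const wanted = new Set(keywords(title));
  return changedPaths.filter(
    (path) => !keywords(path).some((word) => wanted.has(word))
  );
}

console.log(
  pathsOutsideStatedTask("fix: typo in README", [
    "README.md",
    "src/net/client.ts",
    "docs/readme-assets/logo.svg",
  ])
);
```

&lt;p&gt;It catches the "typo fix that touches the network client" case and knows nothing about meaning, which is exactly the limitation described above.&lt;/p&gt;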

&lt;p&gt;So the position isn't "LLMs are bad." It's &lt;strong&gt;deterministic where it gates, probabilistic where it advises.&lt;/strong&gt; The reproducible, no-hallucination checker is the only thing allowed to fail your build. If an LLM layer ever goes on top, it belongs in an &lt;em&gt;advisory&lt;/em&gt;, non-blocking role — surfacing fuzzy concerns for a human to weigh, never silently blocking a merge on a probability. The gate stays deterministic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proving it works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fud5exod1abvwar9lvj3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fud5exod1abvwar9lvj3m.png" alt=" " width="799" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claims are cheap, so I shipped a demo: a deliberately "rogue" pull request that commits every category of drift at once — escalated permissions, contradictory configs, an undeclared network call, a &lt;code&gt;fix typo&lt;/code&gt; PR touching unrelated files, and a transcript reading SSH keys and piping &lt;code&gt;curl&lt;/code&gt; to a shell. Every tool fires, the meta-reviewer folds them into one comment, and the CI check goes red on the critical findings. It doubles as an eval harness: change a detector, re-run the rogue PR, see what still gets caught.&lt;/p&gt;

&lt;p&gt;It went from nothing to a shipped v1.0 in a matter of days — self-taught, working solo — and it's all open: code, demo, and docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;github.com/Conalh&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running coding agents against real repositories, and the "green PR, tired reviewer, just merge it" moment makes you a little nervous — that nervousness is the exact bug I was trying to fix.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>typescript</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
