<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vitaly D.</title>
    <description>The latest articles on DEV Community by Vitaly D. (@t3chn).</description>
    <link>https://dev.to/t3chn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3804154%2F274af366-4aae-4ab3-926a-3f67b9a2674b.jpeg</url>
      <title>DEV Community: Vitaly D.</title>
      <link>https://dev.to/t3chn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/t3chn"/>
    <language>en</language>
    <item>
      <title>AI Agents Need Permission Boundaries, Not Personalities</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Wed, 08 Apr 2026 16:50:39 +0000</pubDate>
      <link>https://dev.to/t3chn/ai-agents-need-permission-boundaries-not-personalities-2g6f</link>
      <guid>https://dev.to/t3chn/ai-agents-need-permission-boundaries-not-personalities-2g6f</guid>
      <description>&lt;p&gt;Most agent tooling mistakes coordination for reliability. It gives you more&lt;br&gt;
roles, more agents, more orchestration, and more shell theater. The demo gets&lt;br&gt;
more impressive. The system does not necessarily get easier to trust.&lt;/p&gt;

&lt;p&gt;That tradeoff used to be tolerable when humans still carried the real model of&lt;br&gt;
the work in their heads. A messy runtime could end in a decent result because a&lt;br&gt;
human operator could reconstruct intent, inspect the diff, and override weak&lt;br&gt;
process with judgment. That stops scaling once generation becomes cheap, fast,&lt;br&gt;
and constant. The bottleneck is no longer code generation. It is trust.&lt;/p&gt;

&lt;p&gt;That is why the most interesting agent systems are not the ones with the most&lt;br&gt;
personalities. They are the ones that make planning, execution, and&lt;br&gt;
verification legible as different kinds of authority.&lt;/p&gt;

&lt;p&gt;That is the bet behind &lt;a href="https://github.com/heurema/specpunk" rel="noopener noreferrer"&gt;specpunk&lt;/a&gt;, now&lt;br&gt;
being reset into &lt;code&gt;punk&lt;/code&gt;. The project is explicit about the reset. It is not&lt;br&gt;
polishing a launched product. It is rebuilding around a stricter shape: one&lt;br&gt;
CLI, one vocabulary, one runtime, and three hard modes - &lt;code&gt;plot&lt;/code&gt;, &lt;code&gt;cut&lt;/code&gt;, and&lt;br&gt;
&lt;code&gt;gate&lt;/code&gt;. That matters because those modes are not style presets. They are&lt;br&gt;
permission boundaries.&lt;/p&gt;
&lt;h2&gt;The coordination trap&lt;/h2&gt;

&lt;p&gt;A lot of agent tooling still assumes that better software delivery comes from&lt;br&gt;
adding more orchestration surfaces. The pattern usually looks familiar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one agent plans&lt;/li&gt;
&lt;li&gt;another agent implements&lt;/li&gt;
&lt;li&gt;another agent reviews&lt;/li&gt;
&lt;li&gt;a shell coordinates them&lt;/li&gt;
&lt;li&gt;a chat transcript becomes the history of what happened&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can produce useful work. It can also produce confidence theater.&lt;br&gt;
Coordination is not the same thing as ground truth.&lt;/p&gt;

&lt;p&gt;If the runtime cannot answer four basic questions, it is not a trust system&lt;br&gt;
yet:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What exactly was approved?&lt;/li&gt;
&lt;li&gt;What actually ran?&lt;/li&gt;
&lt;li&gt;What state is authoritative now?&lt;/li&gt;
&lt;li&gt;What proof exists for the final decision?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;More agents do not answer those questions. More roles do not answer those&lt;br&gt;
questions. A fancier shell does not answer those questions. At best, those&lt;br&gt;
things improve throughput or ergonomics. At worst, they multiply ambiguity.&lt;br&gt;
That is the trap: agent runtimes optimize for visible activity instead of&lt;br&gt;
enforceable structure.&lt;/p&gt;
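&lt;p&gt;To make the four questions concrete, here is a minimal sketch of a run record that answers them directly instead of leaving them to be inferred from logs. The type and field names are illustrative, not any real tool's schema:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical sketch: a run record that answers the four trust
# questions as data. Field names are illustrative assumptions.
@dataclass
class RunRecord:
    approved_contract: str    # 1. what exactly was approved
    executed_commands: list   # 2. what actually ran
    authoritative_state: str  # 3. what state is authoritative now
    proof_artifacts: list     # 4. what proof backs the final decision

    def is_auditable(self):
        # A run is auditable only when all four answers are present.
        return bool(
            self.approved_contract
            and self.executed_commands
            and self.authoritative_state
            and self.proof_artifacts
        )

record = RunRecord(
    approved_contract="contract-42, revision 3",
    executed_commands=["pytest", "ruff check ."],
    authoritative_state="runs/42/state.json",
    proof_artifacts=["runs/42/proofpack.tar.gz"],
)
```

&lt;p&gt;If any of the four fields is empty, the run is not a trust artifact yet, no matter how good the transcript looks.&lt;/p&gt;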
&lt;h2&gt;What trustworthy agent work actually needs&lt;/h2&gt;

&lt;p&gt;If you strip away the theater, trustworthy agent work needs a smaller set of&lt;br&gt;
primitives than most tools expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;approved intent&lt;/li&gt;
&lt;li&gt;bounded execution&lt;/li&gt;
&lt;li&gt;durable work state&lt;/li&gt;
&lt;li&gt;a clear decision surface&lt;/li&gt;
&lt;li&gt;proof-bearing artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That list is more important than any model roster. An agent can be brilliant&lt;br&gt;
and still untrustworthy if it is allowed to plan, mutate, and self-validate&lt;br&gt;
inside one fuzzy surface. The failure mode is not only bad code. It is&lt;br&gt;
unfalsifiable process.&lt;/p&gt;

&lt;p&gt;A human operator should not have to reconstruct the truth by reading prompts,&lt;br&gt;
shell chatter, and commit residue. The runtime should already have a durable&lt;br&gt;
answer. This is where &lt;code&gt;punk&lt;/code&gt; starts from a stronger premise than most agent&lt;br&gt;
systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the shape of the runtime matters more than the number of agents inside it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why &lt;code&gt;punk&lt;/code&gt; resets the shape&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;specpunk&lt;/code&gt; docs are unusually clear about what is being built. &lt;code&gt;punk&lt;/code&gt; is&lt;br&gt;
becoming a local-first engineering runtime with one CLI, one vocabulary, one&lt;br&gt;
artifact chain, and one state truth.&lt;/p&gt;

&lt;p&gt;The canonical object chain in the docs is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Project
  -&amp;gt; Goal
    -&amp;gt; Feature
      -&amp;gt; Contract
        -&amp;gt; Task
          -&amp;gt; Run
            -&amp;gt; Receipt
            -&amp;gt; DecisionObject
            -&amp;gt; Proofpack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is already a different product claim from "we coordinate a bunch of coding&lt;br&gt;
agents for you." The center is not the agent. The center is the artifact&lt;br&gt;
chain.&lt;/p&gt;

&lt;p&gt;That choice has consequences. It means the runtime is trying to preserve&lt;br&gt;
continuity across attempts, retries, verification steps, and future&lt;br&gt;
inspection. A &lt;code&gt;Feature&lt;/code&gt; survives beyond one implementation pass. A &lt;code&gt;Contract&lt;/code&gt;&lt;br&gt;
is explicit. A &lt;code&gt;Run&lt;/code&gt; is one concrete attempt. A &lt;code&gt;DecisionObject&lt;/code&gt; is written&lt;br&gt;
only by &lt;code&gt;gate&lt;/code&gt;. A &lt;code&gt;Proofpack&lt;/code&gt; is the final audit bundle. This is a reliability&lt;br&gt;
architecture, not a chat architecture.&lt;/p&gt;
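&lt;p&gt;A minimal sketch of that chain as plain data types helps show what "the center is the artifact chain" means. Only the nesting mirrors the docs; every field name and type here is an assumption for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the documented object chain.
# Field names and types are illustrative assumptions.

@dataclass
class Receipt:
    evidence: dict            # per-run verification evidence

@dataclass
class DecisionObject:
    verdict: str
    written_by: str           # per the docs, only gate writes this

@dataclass
class Proofpack:
    bundle_path: str          # the final audit bundle

@dataclass
class Run:
    attempt: int                               # one concrete attempt
    receipt: Optional[Receipt] = None
    decision: Optional[DecisionObject] = None

@dataclass
class Contract:
    text: str                                  # explicit, approved intent
    runs: list = field(default_factory=list)

@dataclass
class Feature:
    name: str                                  # survives beyond one pass
    contracts: list = field(default_factory=list)

@dataclass
class Goal:
    intent: str
    features: list = field(default_factory=list)

@dataclass
class Project:
    name: str
    goals: list = field(default_factory=list)
```

&lt;p&gt;Nothing in this shape depends on which agent did the work. That is the point: the objects outlive the session.&lt;/p&gt;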

&lt;h2&gt;Substrate first, shell second&lt;/h2&gt;

&lt;p&gt;The strongest idea in the current &lt;code&gt;punk&lt;/code&gt; design is the split between two&lt;br&gt;
layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a correctness substrate&lt;/li&gt;
&lt;li&gt;an operator shell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The substrate owns durable truth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;project identity&lt;/li&gt;
&lt;li&gt;goal intake&lt;/li&gt;
&lt;li&gt;contract&lt;/li&gt;
&lt;li&gt;scope&lt;/li&gt;
&lt;li&gt;workspace isolation&lt;/li&gt;
&lt;li&gt;run state&lt;/li&gt;
&lt;li&gt;decision objects&lt;/li&gt;
&lt;li&gt;proof artifacts&lt;/li&gt;
&lt;li&gt;the ledger&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shell owns ergonomics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;punk init&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;punk start&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;punk go --fallback-staged&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;blocked and recovery UX&lt;/li&gt;
&lt;li&gt;generated repo-local guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This may sound obvious, but most agent systems blur these two layers almost&lt;br&gt;
immediately. The shell becomes a hidden policy engine. Safety semantics leak&lt;br&gt;
into prompts. Output formatting starts pretending to be state. Eventually&lt;br&gt;
nobody can tell whether a behavior is enforced by the runtime or merely&lt;br&gt;
suggested by the interface.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;punk&lt;/code&gt; is trying to stop that drift early. The rule in the architecture docs is&lt;br&gt;
simple and important: the shell may compose substrate operations, but it must&lt;br&gt;
not become a second source of truth. That is the kind of rule that keeps a tool&lt;br&gt;
honest as it grows.&lt;/p&gt;
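&lt;p&gt;A toy sketch of that rule, with hypothetical names: the shell composes substrate operations and derives its summaries from substrate state, never from a private copy of its own:&lt;/p&gt;

```python
# Illustrative only: the layering rule, not the real punk API.

class Substrate:
    """Owns durable truth: the only place state is written or read."""
    def __init__(self):
        self._ledger = []

    def record(self, event):
        self._ledger.append(event)

    def status(self):
        return self._ledger[-1] if self._ledger else "empty"

class Shell:
    """Owns ergonomics: composes substrate calls, holds no state."""
    def __init__(self, substrate):
        self.substrate = substrate   # no second source of truth

    def start(self, goal):
        self.substrate.record(f"goal accepted: {goal}")
        # Summaries are derived from the substrate, never invented here.
        return self.substrate.status()
```

&lt;p&gt;The shell can get prettier indefinitely without ever becoming load-bearing. That is the property worth protecting.&lt;/p&gt;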

&lt;h2&gt;&lt;code&gt;plot&lt;/code&gt;, &lt;code&gt;cut&lt;/code&gt;, and &lt;code&gt;gate&lt;/code&gt; are not vibes&lt;/h2&gt;

&lt;p&gt;The three canonical modes in &lt;code&gt;punk&lt;/code&gt; are easy to misunderstand if you have seen&lt;br&gt;
too many agent UIs. &lt;code&gt;plot&lt;/code&gt;, &lt;code&gt;cut&lt;/code&gt;, and &lt;code&gt;gate&lt;/code&gt; are not there to make the tool&lt;br&gt;
feel cinematic. They exist to separate authority.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;plot&lt;/code&gt; shapes work, inspects the repo, drafts and refines contracts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cut&lt;/code&gt; executes bounded changes in an isolated VCS context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gate&lt;/code&gt; verifies results, writes the final decision, and emits proof&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The docs explicitly say these are hard permission boundaries, not tone&lt;br&gt;
presets. That is a serious design choice.&lt;/p&gt;

&lt;p&gt;A lot of agent failures come from collapsing these phases into a single&lt;br&gt;
conversational loop. The same surface interprets intent, changes code, judges&lt;br&gt;
its own result, and narrates success. Even when the final answer sounds&lt;br&gt;
careful, the trust boundary is weak because the roles are merged at the runtime&lt;br&gt;
level.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;punk&lt;/code&gt; moves in the opposite direction. Only &lt;code&gt;gate&lt;/code&gt; writes the final&lt;br&gt;
&lt;code&gt;DecisionObject&lt;/code&gt;. Only approved contracts should reach &lt;code&gt;cut&lt;/code&gt;. The event log and&lt;br&gt;
derived views hold runtime truth, not the shell summary. That is what&lt;br&gt;
permission boundaries look like in practice: not "agent A is the planner" and&lt;br&gt;
"agent B is the reviewer," but real authority boundaries and real artifact&lt;br&gt;
ownership.&lt;/p&gt;
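&lt;p&gt;In code terms, a hard boundary is an authority check the runtime enforces rather than a tone the prompt suggests. A hypothetical sketch (function and field names are illustrative):&lt;/p&gt;

```python
# Illustrative sketch of runtime-enforced authority boundaries.

class AuthorityError(Exception):
    pass

def enter_cut(contract):
    # Only approved contracts may reach cut.
    if not contract.get("approved"):
        raise AuthorityError("cut requires an approved contract")
    return {"mode": "cut", "contract": contract["id"]}

def write_decision(mode, verdict):
    # Only gate may write the final DecisionObject.
    if mode != "gate":
        raise AuthorityError("only gate writes DecisionObject")
    return {"decision": verdict, "written_by": "gate"}
```

&lt;p&gt;The denial path is a runtime error, not a politely worded instruction. That difference is the whole design.&lt;/p&gt;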

&lt;h2&gt;Durable work state matters more than chat history&lt;/h2&gt;

&lt;p&gt;Another strong thread in the design is the work ledger idea. Most agent&lt;br&gt;
sessions leave behind a bad form of memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shell logs&lt;/li&gt;
&lt;li&gt;chat transcripts&lt;/li&gt;
&lt;li&gt;commits&lt;/li&gt;
&lt;li&gt;maybe a branch name&lt;/li&gt;
&lt;li&gt;maybe a PR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is enough until something blocks, fails, gets superseded, or needs to&lt;br&gt;
continue later. Then everybody starts asking the same questions: what is the&lt;br&gt;
active contract, what was the latest run, did verification block or escalate,&lt;br&gt;
and what should happen next?&lt;/p&gt;

&lt;p&gt;If the only answer is "read the last few screens of terminal output," the&lt;br&gt;
runtime is weak. The &lt;code&gt;punk&lt;/code&gt; docs push toward a &lt;code&gt;WorkLedgerView&lt;/code&gt; that can answer&lt;br&gt;
those questions directly. That is the right instinct. Agents do not only need&lt;br&gt;
context to act. Operators need durable work state to continue. Again, the move&lt;br&gt;
is the same: replace inference with explicit structure.&lt;/p&gt;
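&lt;p&gt;One way to picture a ledger view: a projection folded from an append-only event log, so the answers are derived rather than remembered. The event shapes here are assumptions, not the documented &lt;code&gt;WorkLedgerView&lt;/code&gt; schema:&lt;/p&gt;

```python
# Hypothetical sketch: deriving operator answers from an event log.

def work_ledger_view(events):
    view = {"active_contract": None, "latest_run": None,
            "blocked": False, "next_step": "plan"}
    for e in events:
        if e["type"] == "contract_approved":
            view["active_contract"] = e["id"]
            view["next_step"] = "cut"
        elif e["type"] == "run_started":
            view["latest_run"] = e["id"]
        elif e["type"] == "verification_blocked":
            view["blocked"] = True
            view["next_step"] = "resolve blocker"
    return view
```

&lt;p&gt;No screen-scraping of terminal output; the view is recomputable from durable state at any time.&lt;/p&gt;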

&lt;h2&gt;Why the one-face shell still matters&lt;/h2&gt;

&lt;p&gt;None of this means the UX should be ugly. In fact, the &lt;code&gt;punk&lt;/code&gt; docs make another&lt;br&gt;
smart move: they argue for a one-face operator shell. The normal user should be&lt;br&gt;
able to give a plain goal and get back one concise progress or blocker summary&lt;br&gt;
plus one obvious next step.&lt;/p&gt;

&lt;p&gt;That is good design, but the key is what comes underneath it. A clean shell is&lt;br&gt;
valuable only if it sits on top of a substrate that already knows what is&lt;br&gt;
authoritative. Otherwise one-face UX becomes a prettier way to hide ambiguity.&lt;br&gt;
That is why the substrate-versus-shell split matters so much. &lt;code&gt;punk&lt;/code&gt; is not&lt;br&gt;
rejecting ergonomics. It is refusing to let ergonomics pretend to be truth.&lt;/p&gt;

&lt;h2&gt;A better shape for agent engineering&lt;/h2&gt;

&lt;p&gt;The most interesting thing about &lt;code&gt;punk&lt;/code&gt; is not that it might someday&lt;br&gt;
orchestrate multiple models, run councils, or improve skills through eval&lt;br&gt;
loops. The interesting thing is the order of operations.&lt;/p&gt;

&lt;p&gt;Before any higher-level feature, the project is trying to get the runtime shape&lt;br&gt;
right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one vocabulary&lt;/li&gt;
&lt;li&gt;explicit artifact chain&lt;/li&gt;
&lt;li&gt;bounded modes&lt;/li&gt;
&lt;li&gt;durable state&lt;/li&gt;
&lt;li&gt;proof before acceptance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right order. If you get the shape wrong, every later feature&lt;br&gt;
inherits ambiguity. Councils become opinion aggregators instead of structured&lt;br&gt;
advisory mechanisms. Skills become prompt folklore instead of evidence-backed&lt;br&gt;
overlays. Shell UX becomes theater instead of control.&lt;/p&gt;

&lt;p&gt;If you get the shape right, those later layers have something solid to attach&lt;br&gt;
to. That is why &lt;code&gt;punk&lt;/code&gt; is worth paying attention to even in a rebuild phase. It&lt;br&gt;
is making an architectural claim that more agent tooling should make&lt;br&gt;
explicitly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;reliability does not come from adding more agent personalities. It comes&lt;br&gt;
from enforcing boundaries between intent, execution, verification, and&lt;br&gt;
proof.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More agents do not create ground truth. More roles do not create safety. If&lt;br&gt;
planning, execution, and verification are not separated by hard boundaries, the&lt;br&gt;
runtime scales ambiguity, not trust. That is the real reason this design reset&lt;br&gt;
matters.&lt;/p&gt;

&lt;h2&gt;Sources&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk" rel="noopener noreferrer"&gt;specpunk on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk/blob/main/docs/product/VISION.md" rel="noopener noreferrer"&gt;punk Vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk/blob/main/docs/product/ARCHITECTURE.md" rel="noopener noreferrer"&gt;punk Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk/blob/main/docs/product/CLI.md" rel="noopener noreferrer"&gt;punk CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk/blob/main/docs/research/2026-04-03-specpunk-identity-and-layering.md" rel="noopener noreferrer"&gt;Specpunk Identity and Layering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk/blob/main/docs/research/2026-04-03-specpunk-one-face-operator-shell.md" rel="noopener noreferrer"&gt;Specpunk One-Face Operator Shell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/specpunk/blob/main/docs/research/2026-04-03-specpunk-work-ledger.md" rel="noopener noreferrer"&gt;Specpunk Work Ledger&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>contextengineering</category>
      <category>agents</category>
      <category>architecture</category>
      <category>verification</category>
    </item>
    <item>
      <title>My AI Agent Said 'Done.' It Skipped an Entire Acceptance Criterion.</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:32:54 +0000</pubDate>
      <link>https://dev.to/t3chn/my-ai-agent-said-done-it-skipped-an-entire-acceptance-criterion-46f9</link>
      <guid>https://dev.to/t3chn/my-ai-agent-said-done-it-skipped-an-entire-acceptance-criterion-46f9</guid>
      <description>&lt;p&gt;Last week, our pipeline produced a proofpack with &lt;code&gt;decision: HUMAN_REVIEW&lt;/code&gt;. The contract had 10 acceptance criteria. The engineer agent created all the new files, build passed, tests passed, three independent reviewers ran. Everything looked correct — except AC18.3, which required rewriting an existing endpoint's response schema. The engineer never touched &lt;code&gt;health.go&lt;/code&gt;. The pipeline said SUCCESS.&lt;/p&gt;

&lt;p&gt;That should have been impossible.&lt;/p&gt;

&lt;h2&gt;Why "SUCCESS" was wrong&lt;/h2&gt;

&lt;p&gt;The pipeline had four verification layers: mechanic checks (lint, typecheck, tests), holdout scenarios (blind tests the engineer never sees), multi-model code review (Claude + Codex + Gemini), and a synthesizer that combines everything into a final verdict.&lt;/p&gt;

&lt;p&gt;The engineer agent has its own repair loop — three attempts to make all acceptance criteria pass. It runs verify commands for each AC, fixes failures, retries. After three attempts, it reports status to the orchestrator.&lt;/p&gt;

&lt;p&gt;Here is the gap: the orchestrator checked &lt;code&gt;execute_log.json&lt;/code&gt; for &lt;code&gt;status: SUCCESS&lt;/code&gt; and moved on. It trusted the engineer's self-reported status. The engineer reported success because its verify command for AC18.3 was &lt;code&gt;grep freshness_ms health.go&lt;/code&gt; — a presence check, not a behavioral check. The string did not exist, the grep failed silently, and the engineer moved on without implementing the criterion.&lt;/p&gt;

&lt;p&gt;We had review. We had iteration. We had a proof artifact. What we did not have was independent boundary verification.&lt;/p&gt;
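&lt;p&gt;The gap is easiest to see side by side. A hedged sketch of the two kinds of check (the &lt;code&gt;freshness_ms&lt;/code&gt; field follows the article's example; both helper functions are hypothetical):&lt;/p&gt;

```python
# Illustrative only: a presence check versus a behavioral check.

def presence_check(source_text):
    # Weak: passes whenever the string appears anywhere, for any
    # reason, including a comment or a TODO.
    return "freshness_ms" in source_text

def behavioral_check(response):
    # Stronger: asserts the actual response schema the acceptance
    # criterion demands, against a real response object.
    return (
        isinstance(response.get("freshness_ms"), int)
        and response["freshness_ms"] >= 0
    )
```

&lt;p&gt;A presence check answers "does this string occur"; a behavioral check answers "does the endpoint do what the criterion says." Only the second is evidence.&lt;/p&gt;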

&lt;h2&gt;The missing trust boundary&lt;/h2&gt;

&lt;p&gt;The pattern is familiar. A developer says "tests pass" and pushes to main. CI runs the same tests — independently. The developer's claim and the verification live in different trust domains. If CI only checked the developer's test output file instead of running tests itself, nobody would trust it.&lt;/p&gt;

&lt;p&gt;Our pipeline had exactly this flaw. The engineer agent both implemented the code and reported whether it succeeded. The orchestrator consumed that report without re-running the checks. The implementer was grading its own homework.&lt;/p&gt;

&lt;p&gt;This is not specific to Signum. If your workflow lets a coding agent say "done" and your pipeline checks artifacts emitted by that same agent without independent re-execution, you have the same trust problem. It does not matter how many reviewers you add downstream — reviewers audit the code that exists, not the code that should exist.&lt;/p&gt;

&lt;h2&gt;What we changed: boundary verification&lt;/h2&gt;

&lt;p&gt;The fix was not another reviewer or another retry. It was a trust boundary — a deterministic verifier that runs after the engineer finishes and before the audit begins. The verifier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Captures a cryptographic snapshot of the workspace before execution starts&lt;/li&gt;
&lt;li&gt;After the engineer finishes, independently re-runs every acceptance criterion's verify command via a sandboxed DSL runner&lt;/li&gt;
&lt;li&gt;Checks scope integrity: are all promised files present? Are there out-of-scope modifications?&lt;/li&gt;
&lt;li&gt;Writes an append-only receipt with per-AC evidence, artifact hashes, and a chain linking back to the pre-execution snapshot&lt;/li&gt;
&lt;li&gt;Gates the transition to audit: if any visible AC lacks independent evidence, the pipeline blocks&lt;/li&gt;
&lt;/ol&gt;
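&lt;p&gt;Steps 1 and 3 can be sketched with content hashes and a set diff. This is illustrative, not Signum's implementation:&lt;/p&gt;

```python
import hashlib

# Sketch: snapshot before execution, scope-integrity diff after.
# Real verifier details are assumptions.

def snapshot(files):
    # files: {path: content_bytes}. Hashing every file before the
    # engineer runs makes later changes provable, not arguable.
    return {p: hashlib.sha256(b).hexdigest() for p, b in files.items()}

def scope_diff(before, after, allowed_paths):
    changed = [p for p in after if before.get(p) != after[p]]
    out_of_scope = [p for p in changed if p not in allowed_paths]
    missing = [p for p in allowed_paths if p not in after]
    return {"out_of_scope": out_of_scope, "missing_promised": missing}
```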

&lt;p&gt;The orchestrator no longer reads the engineer's &lt;code&gt;execute_log.json&lt;/code&gt; to decide whether ACs passed. It reads the receipt. The receipt is written by a verifier that shares no state with the engineer. The engineer cannot influence what the receipt contains.&lt;/p&gt;

&lt;p&gt;Each repair iteration in the audit loop also runs boundary verification before the candidate proceeds to review. The receipt chain is append-only — iteration 2's receipt references iteration 1's hash, making the full sequence tamper-evident.&lt;/p&gt;
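&lt;p&gt;The tamper-evidence property comes from ordinary hash chaining: each receipt embeds the hash of the previous one. A minimal sketch, assuming JSON receipts rather than Signum's actual format:&lt;/p&gt;

```python
import hashlib
import json

# Sketch of an append-only, tamper-evident receipt chain.
# Field names are assumptions.

def receipt_hash(receipt):
    payload = json.dumps(receipt, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_receipt(chain, evidence):
    # Each new receipt references the hash of the previous one, so
    # rewriting history invalidates every later link.
    prev = receipt_hash(chain[-1]) if chain else None
    chain.append({"iteration": len(chain) + 1,
                  "evidence": evidence,
                  "prev_hash": prev})
    return chain
```

&lt;p&gt;Editing any earlier receipt changes its hash, which no longer matches the &lt;code&gt;prev_hash&lt;/code&gt; recorded downstream.&lt;/p&gt;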

&lt;h2&gt;What it catches&lt;/h2&gt;

&lt;p&gt;Three failure modes that previously passed silently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skipped criteria.&lt;/strong&gt; The engineer claims success but never touched the relevant file. The verifier runs the AC's check and finds no evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacuous verification.&lt;/strong&gt; The verify command is too weak (a grep for a string that could appear anywhere). On medium and high risk contracts, the verifier classifies the evidence strength and blocks on exit-only checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope drift.&lt;/strong&gt; The engineer modifies files outside the contract's scope, or promises a new file but never creates it. The snapshot diff catches both.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What it does not prove&lt;/h2&gt;

&lt;p&gt;Boundary verification is not semantic verification. It confirms that each acceptance criterion's check command exited with status zero. It does not confirm that the implementation is correct in any deeper sense.&lt;/p&gt;

&lt;p&gt;Some limitations are fundamental:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A well-crafted but subtly wrong implementation can still pass all verify commands. The receipt proves the check ran and passed, not that the check was sufficient.&lt;/li&gt;
&lt;li&gt;Manual acceptance criteria (where no automated check exists) skip the verifier entirely. The receipt marks them as unverified — the synthesizer cannot issue AUTO_OK if manual ACs exist.&lt;/li&gt;
&lt;li&gt;Stricter verification means more false blocks. A flaky verify command will halt the pipeline even when the implementation is correct.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The receipt chain closes the trust gap between claiming and proving. It does not close the gap between proving and being right.&lt;/p&gt;

&lt;h2&gt;The broader question&lt;/h2&gt;

&lt;p&gt;Every AI coding workflow has a version of this problem. The agent generates code, runs checks, reports results. At some point, a human or a system must decide: is this done?&lt;/p&gt;

&lt;p&gt;The answer depends on what evidence you require. Self-reported status is the weakest. Test results are stronger but can be gamed by weak tests. Independent re-execution against a pre-declared contract is stronger still — but only as strong as the contract itself.&lt;/p&gt;

&lt;p&gt;Two questions worth asking about your own workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Who verifies your acceptance criteria — the same agent that implemented them, or an independent process?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Would you accept more false blocks in exchange for fewer false successes?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We chose more false blocks. The alternative was worse.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Signum is an open-source Claude Code plugin for contract-first AI development. The receipt chain shipped in &lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;v4.15.1&lt;/a&gt;. The bug described here is &lt;a href="https://github.com/heurema/signum/issues/10" rel="noopener noreferrer"&gt;issue #10&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>verification</category>
      <category>trustboundary</category>
    </item>
    <item>
      <title>Your AI Spec Is Already Stale</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:39:23 +0000</pubDate>
      <link>https://dev.to/t3chn/your-ai-spec-is-already-stale-3h0a</link>
      <guid>https://dev.to/t3chn/your-ai-spec-is-already-stale-3h0a</guid>
      <description>&lt;p&gt;I maintain 12 Claude Code plugins. Each has a &lt;code&gt;project.intent.md&lt;/code&gt; -- a structured spec that tells the agent what the project does, what it doesn't do, and who it's for. The agent reads it at the start of every task.&lt;/p&gt;

&lt;p&gt;Last week I ran a reverse diff -- code signals vs. existing spec -- on two projects. Both had drift. One had been wrong for three versions.&lt;/p&gt;

&lt;h2&gt;The problem with specs in AI-assisted codebases&lt;/h2&gt;

&lt;p&gt;Traditional docs debt is annoying but survivable. A stale README means a developer spends 10 extra minutes figuring things out. They have context, judgment, and access to &lt;code&gt;git log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An AI agent reading a stale spec has none of that. It treats the spec as ground truth. If &lt;code&gt;project.intent.md&lt;/code&gt; says the scoring formula is &lt;code&gt;source_weight + keyword_density * 0.2 + release_boost&lt;/code&gt;, the agent will write code that assumes those variables exist. Even if the actual implementation changed to &lt;code&gt;source_weight + min(points/500, 3.0)&lt;/code&gt; two versions ago.&lt;/p&gt;

&lt;p&gt;This isn't docs debt. It's an execution bug hiding in plain text.&lt;/p&gt;

&lt;h2&gt;What I found&lt;/h2&gt;

&lt;h3&gt;Herald: v1 ghosts in a v2 codebase&lt;/h3&gt;

&lt;p&gt;Herald is a news digest plugin. It went through a major rewrite from v1 (JSONL pipeline) to v2 (SQLite pipeline). The &lt;code&gt;project.intent.md&lt;/code&gt; was generated from v1 docs and never updated.&lt;/p&gt;

&lt;p&gt;I ran the code scanner against the existing spec. The diff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring formula -- wrong since v2.0:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;EXISTING &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glossary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;source_weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;keyword_density&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;release_boost&lt;/span&gt;

&lt;span class="n"&gt;ACTUAL&lt;/span&gt; &lt;span class="nc"&gt;CODE &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;herald&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;source_weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Story&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_article_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;keyword_density&lt;/code&gt; variable doesn't exist in v2. An agent writing a scoring-related feature would reference a ghost API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication -- wrong threshold:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;EXISTING &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glossary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;Jaccard&lt;/span&gt; &lt;span class="n"&gt;trigram&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;

&lt;span class="n"&gt;ACTUAL&lt;/span&gt; &lt;span class="nc"&gt;CODE &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;herald&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;SequenceMatcher&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not just a different number -- a different algorithm. Jaccard trigrams vs. Python's &lt;code&gt;SequenceMatcher&lt;/code&gt;. An agent tuning dedup behavior would look for the wrong function.&lt;/p&gt;
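&lt;p&gt;The two measures really are different animals, which is why the spec drift matters. A quick sketch of both, for illustration (the 0.85 and 0.65 thresholds come from the spec and code above; the trigram helper is mine):&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Illustrative comparison of the two similarity measures the spec
# confused: Jaccard over character trigrams vs. SequenceMatcher.

def jaccard_trigram(a, b):
    # Overlap of 3-character shingles, order-insensitive.
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    if not ta or not tb:
        return 0.0
    return len(ta.intersection(tb)) / len(ta.union(tb))

def seq_ratio(a, b):
    # Longest-matching-block ratio, order-sensitive.
    return SequenceMatcher(None, a, b).ratio()
```

&lt;p&gt;Because the algorithms score the same pair differently, a threshold tuned for one is meaningless for the other. An agent reading the stale glossary would tune the wrong knob.&lt;/p&gt;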

&lt;p&gt;&lt;strong&gt;Pipeline orchestration -- dead reference:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXISTING (glossary):
  "run.sh orchestrator acquires POSIX lockfile, calls collect.py then analyze.py"

ACTUAL CODE (herald/cli.py):
  herald.cli run → pipeline.py → collect → ingest → cluster → project
  No run.sh. No lockfile. No analyze.py.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three of three architectural facts in the glossary were stale. The agent would reference files that don't exist.&lt;/p&gt;

&lt;h3&gt;Delve: missing shipped features&lt;/h3&gt;

&lt;p&gt;Delve is a deep research orchestrator. Its &lt;code&gt;project.intent.md&lt;/code&gt; was 3 days old -- written at v0.7, now at v0.8.1. Two shipped features were missing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADDED to Core Capabilities:
  + Token-efficient pipeline with trafilatura-based content extraction
    (45-60% input token reduction)
  + Stage 0.5 CONTEXTUALIZE: local context enrichment before web SCAN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus two entire sections were absent:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADDED sections:
  + Success Criteria (derived from quality thresholds in reference.md)
  + Personas (3 user types inferred from README usage patterns)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent scoping a new feature for Delve wouldn't know about CONTEXTUALIZE. It might re-implement local context enrichment from scratch, duplicating a shipped capability.&lt;/p&gt;

&lt;h2&gt;Why AI makes this worse&lt;/h2&gt;

&lt;p&gt;In a human-only workflow, specs rot slowly. Developers write code, docs lag behind, someone eventually updates the README. The feedback loop is months.&lt;/p&gt;

&lt;p&gt;With AI agents, the loop compresses. An agent can ship 5 features in a day. Each feature may add capabilities, change interfaces, or remove dead code. The spec was accurate at 9 AM and wrong by 5 PM.&lt;/p&gt;

&lt;p&gt;The multiplier effect: agents don't just read stale specs -- they write code that assumes the stale spec is correct, which then gets reviewed by another agent that also reads the stale spec. Confirmation bias at machine speed.&lt;/p&gt;

&lt;h2&gt;The reverse diff&lt;/h2&gt;

&lt;p&gt;The fix isn't "update your docs more often." That's aspirational advice that doesn't scale. The fix is a machine-readable reverse diff: scan the code, derive what the spec should say, compare it to what it actually says.&lt;/p&gt;

&lt;p&gt;Here's what the diff output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;SECTION: Goal
STATUS: UNCHANGED
REASON: Fresh derivation semantically matches existing content.

SECTION: Core Capabilities
STATUS: UPDATED
EXISTING: 6 items
PROPOSED: 8 items (+2 new capabilities from shipped commits)
EVIDENCE: git log afe0e42, 90e5ded
CONFIDENCE: high

SECTION: Non-Goals
STATUS: UNCHANGED
REASON: All 5 items still supported by docs/how-it-works.md Limitations.

SECTION: Success Criteria
STATUS: ADDED
REASON: Section absent from existing intent; quality thresholds in reference.md.

SECTION: Personas
STATUS: ADDED
REASON: Section absent; 3 user types inferred from README usage patterns.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each section is classified independently: &lt;code&gt;UNCHANGED&lt;/code&gt;, &lt;code&gt;UPDATED&lt;/code&gt;, &lt;code&gt;ADDED&lt;/code&gt;, or &lt;code&gt;REMOVED&lt;/code&gt;. The developer reviews per-section, not per-file. Unchanged sections are auto-accepted -- you only see what actually drifted.&lt;/p&gt;

&lt;p&gt;The key design decision: &lt;strong&gt;when in doubt, UNCHANGED beats UPDATED.&lt;/strong&gt; If the existing content contains facts not derivable from code signals -- manual edits, domain knowledge, judgment calls -- the system preserves them. It only flags drift it can prove from code.&lt;/p&gt;
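
&lt;p&gt;The decision rule can be sketched in a few lines of Python. This is a hypothetical illustration of the policy above, not Signum's actual implementation; "provable" stands in for evidence derived from code signals:&lt;/p&gt;

```python
# Hypothetical sketch of the per-section classification rule -- not
# Signum's actual code. 'provable' means the difference is backed by
# evidence derived from code signals (git log, manifests, docs).

def classify_section(existing, derived, provable):
    """Classify one spec section as UNCHANGED / UPDATED / ADDED / REMOVED."""
    if existing is None:
        return "ADDED" if derived is not None else "UNCHANGED"
    if derived is None:
        # Only drop a section when its absence is provable from code.
        return "REMOVED" if provable else "UNCHANGED"
    if existing == derived:
        return "UNCHANGED"
    # When in doubt, UNCHANGED beats UPDATED: manual edits and domain
    # knowledge that code signals cannot derive are preserved.
    return "UPDATED" if provable else "UNCHANGED"
```

&lt;p&gt;The asymmetry is the point: an &lt;code&gt;UPDATED&lt;/code&gt; verdict requires evidence, while &lt;code&gt;UNCHANGED&lt;/code&gt; is the safe default.&lt;/p&gt;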

&lt;h2&gt;
  
  
  What this means for your projects
&lt;/h2&gt;

&lt;p&gt;If you use &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;project.intent.md&lt;/code&gt;, agent instructions, or any structured context that an AI reads as ground truth:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Treat spec accuracy as a correctness property&lt;/strong&gt;, not a hygiene task. A wrong spec is a wrong input to every agent session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automate the reverse direction.&lt;/strong&gt; You probably have CI that checks code against specs (tests, linting, contracts). You probably don't have anything that checks specs against code. That's the gap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diff semantically, not textually.&lt;/strong&gt; A cosmetic reword shouldn't trigger a review. A missing capability should. The scanner needs to understand what matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run it after shipping, not before.&lt;/strong&gt; The spec drifts after the code ships, not before. Check intent freshness as a post-deploy step, not a pre-commit hook.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
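
&lt;p&gt;As a sketch, a post-deploy freshness check could hang off the deployment event in CI. The workflow below is hypothetical -- the script path and job names are illustrative, not part of Signum:&lt;/p&gt;

```yaml
# Hypothetical post-deploy spec-freshness workflow (GitHub Actions).
name: spec-freshness
on:
  deployment_status:
jobs:
  reverse-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check spec against shipped code
        run: ./scripts/check-intent-drift.sh   # illustrative script name
```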

&lt;h2&gt;
  
  
  The implementation
&lt;/h2&gt;

&lt;p&gt;I built this as &lt;code&gt;--actualize&lt;/code&gt; mode in &lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;Signum&lt;/a&gt;'s &lt;code&gt;/signum init&lt;/code&gt; command. It reuses the same scanner that bootstraps new projects -- same signal hierarchy, same evidence tracking -- but produces a diff instead of a full rewrite.&lt;/p&gt;

&lt;p&gt;The scanner reads authoritative docs, README, package manifests, git history, and entrypoints. The synthesizer compares each section against existing intent and classifies it. The command presents changes one section at a time and writes only what you accept.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/signum init &lt;span class="nt"&gt;--actualize&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a Claude Code plugin. The scanner is deterministic (bash, no LLM). The diff is LLM-produced (semantic comparison, not byte-level). The write is human-confirmed (no auto-apply).&lt;/p&gt;




&lt;p&gt;I caught 6 factual errors in Herald's glossary and 2 missing capabilities in Delve's intent. Both had been accurate when written. Both drifted within days.&lt;/p&gt;

&lt;p&gt;If your agents read structured context, check when it was last verified -- not when it was last edited.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;github.com/heurema/signum&lt;/a&gt; (v4.11.0)&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>agents</category>
      <category>specs</category>
    </item>
    <item>
      <title>What a Formal Verification Agent Taught Me About Code Audit</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:52:15 +0000</pubDate>
      <link>https://dev.to/t3chn/what-a-formal-verification-agent-taught-me-about-code-audit-2joo</link>
      <guid>https://dev.to/t3chn/what-a-formal-verification-agent-taught-me-about-code-audit-2joo</guid>
      <description>&lt;p&gt;The morning digest surfaced &lt;a href="https://mistral.ai/news/leanstral" rel="noopener noreferrer"&gt;Leanstral&lt;/a&gt; -- Mistral's open-source agent for formal verification in Lean 4. A mixture-of-experts model (119B total, 6.5B active per token) that scores within 80% of Claude Opus on the FLTEval theorem-proving benchmark at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;I don't need Lean 4. But the agent's architecture proved useful: multi-attempt proof search, diagnostic feedback loops, structured verification. Three of these patterns transferred well to my code audit pipeline; three more improvements came out of the same design session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A word on Signum.&lt;/strong&gt; &lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;Signum&lt;/a&gt; is a plugin for Claude Code that turns feature requests into verifiable artifacts. It works in four phases: a Contractor agent writes a contract (spec + acceptance criteria), an Engineer agent implements it, three independent AI models audit the result (Claude for semantics, Codex for security, Gemini for performance), and a Synthesizer produces a final verdict. The pipeline iterates: if the audit finds issues, the Engineer gets a repair brief and tries again. The output is a proofpack -- a self-contained bundle of contract, diff, review findings, and decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Patterns from Leanstral
&lt;/h2&gt;

&lt;p&gt;Leanstral works through an &lt;a href="https://github.com/oOo0oOo/lean-lsp-mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; that connects the agent to Lean 4's Language Server Protocol. Its five-phase loop -- discover proof gaps, analyze subgoals, search the library for relevant lemmas, synthesize a tactic, check diagnostics -- is a structured generate-verify cycle. Three elements mapped to Signum:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Verification before review.&lt;/strong&gt; Lean doesn't just check "does the proof compile" -- it verifies that the proof actually type-checks under the kernel. In Signum, the analogue became a policy scanner: a deterministic grep on the diff that runs &lt;em&gt;before&lt;/em&gt; the LLM reviewers, catching security and unsafe patterns at zero token cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel attempts.&lt;/strong&gt; Lean's &lt;code&gt;multi_attempt&lt;/code&gt; tool substitutes several tactics at one position and compares the resulting goal states. In Signum, this became parallel repair lanes -- two Engineer agents working in isolated git worktrees with different fix strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Typed diagnostics.&lt;/strong&gt; Lean LSP returns structured error objects (file, line, message, severity), not raw text. In Signum, the mechanic phase now returns a hybrid format with typed findings instead of a flat "regressions: true/false" boolean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Policy Scanner
&lt;/h2&gt;

&lt;p&gt;The cheapest improvement. Between the mechanic step (lint, typecheck, tests) and the multi-model code review, a new bash script scans the unified diff for known dangerous patterns. 195 lines, zero LLM cost.&lt;/p&gt;

&lt;p&gt;It scans only addition lines, matching 12 patterns in three categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; (blocks the pipeline): &lt;code&gt;eval&lt;/code&gt;, subprocess with &lt;code&gt;shell=True&lt;/code&gt;, &lt;code&gt;innerHTML&lt;/code&gt;, SQL string concatenation, weak crypto&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsafe&lt;/strong&gt; (flagged for review): &lt;code&gt;TODO&lt;/code&gt;/&lt;code&gt;FIXME&lt;/code&gt;/&lt;code&gt;HACK&lt;/code&gt; markers, debug statements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency&lt;/strong&gt; (flagged for review): new entries in package managers -- but only when the file is actually a manifest (&lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;Cargo.toml&lt;/code&gt;, etc.), not a README or test fixture&lt;/li&gt;
&lt;/ul&gt;
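
&lt;p&gt;A minimal sketch of such a scan, with a pattern list far shorter than the real 195-line script (the patterns and file names here are illustrative):&lt;/p&gt;

```shell
# Sketch of an addition-line policy scan -- illustrative, not the
# actual 195-line scanner.
set -euo pipefail

scan_diff() {
  local diff_file="$1"
  # Fail closed: a missing diff is an error, not "zero findings".
  [ -f "$diff_file" ] || { echo "error: $diff_file not found"; return 2; }

  # Addition lines only: start with '+' but skip the '+++' file header.
  local additions
  additions=$(grep -E '^\+[^+]' "$diff_file" || true)

  local findings=0
  for pattern in 'eval\(' 'shell=True' 'innerHTML' 'hashlib\.md5'; do
    if printf '%s\n' "$additions" | grep -Eq "$pattern"; then
      echo "SECURITY: '$pattern' found in added lines"
      findings=$((findings + 1))
    fi
  done
  [ "$findings" -eq 0 ]   # nonzero findings block the pipeline
}

# Demo: a diff adding a shell=True call gets flagged.
printf '%s\n' '+++ b/app.py' '+subprocess.run(cmd, shell=True)' > sample.diff
scan_diff sample.diff || echo "scan blocked the pipeline"
```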

&lt;p&gt;Three design decisions came from asking Codex (GPT) and Gemini independently and comparing their answers -- a process I call an "arbiter panel." All three models (Claude, Codex, Gemini) agreed on each:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fail-closed&lt;/strong&gt; on missing input. If the diff file doesn't exist, the scanner exits with an error rather than silently producing zero findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest-only filtering&lt;/strong&gt; for dependency patterns. Without it, any JSON key-value pair in any file triggers a false positive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curated sinks&lt;/strong&gt; over broad regexes. A short list of known-dangerous calls (&lt;code&gt;subprocess.call&lt;/code&gt;, &lt;code&gt;child_process.spawn&lt;/code&gt;) beats a generic pattern that matches harmless calls like &lt;code&gt;db.query()&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Typed Diagnostics
&lt;/h2&gt;

&lt;p&gt;Before this change, the mechanic report was a flat summary: lint passed or failed, tests passed or failed, regressions yes or no. The Engineer agent in repair mode received this as a blob and had to guess which file and line to fix.&lt;/p&gt;

&lt;p&gt;Now the mechanic produces a hybrid format: always a summary per check, plus per-file findings when the runner supports structured output. Each finding carries an &lt;code&gt;origin&lt;/code&gt; field -- &lt;code&gt;"structured"&lt;/code&gt; for JSON output (ruff, eslint), &lt;code&gt;"stable_text"&lt;/code&gt; for parseable text (tsc, mypy), or &lt;code&gt;"none"&lt;/code&gt; for summary only. The pipeline gates on the summary; findings are hints, not source of truth.&lt;/p&gt;
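
&lt;p&gt;A hypothetical finding in this hybrid shape -- field names other than &lt;code&gt;origin&lt;/code&gt; are illustrative:&lt;/p&gt;

```json
{
  "check": "lint",
  "summary": "failed",
  "findings": [
    {
      "file": "src/mechanic.sh",
      "line": 87,
      "severity": "error",
      "message": "unused variable 'out'",
      "origin": "structured"
    }
  ]
}
```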

&lt;p&gt;An aside on catching bugs with the pipeline itself: Claude Opus found a critical issue on the very first review of this feature. A &lt;code&gt;|| true&lt;/code&gt; after a command substitution silently masked the exit code, making the return value always zero -- one token that killed regression detection for all eight supported runners. The iterative repair loop fixed it in a single pass, exactly the kind of convergence the system is built for.&lt;/p&gt;
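
&lt;p&gt;The bug class is easy to reproduce in isolation (this is an illustration, not the actual mechanic script):&lt;/p&gt;

```shell
# Minimal reproduction of the '|| true' bug class -- illustrative only.
set -e
run_tests() { echo "1 test failed"; return 1; }

# Buggy: '|| true' protects against 'set -e' but also erases the status.
output=$(run_tests) || true
masked=$?   # 0 -- the failure is invisible to the caller

# Fixed: capture the status before neutralizing the failure.
status=0
output=$(run_tests) || status=$?   # failure recorded, set -e satisfied

echo "masked=$masked captured=$status"
```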

&lt;h2&gt;
  
  
  Parallel Repair Lanes
&lt;/h2&gt;

&lt;p&gt;The most complex change. Previously, the repair loop was sequential: one Engineer attempt, audit, next attempt. Now it spawns two Engineers in parallel, each in an isolated git worktree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lane A&lt;/strong&gt;: "Fix with minimal targeted changes. Patch only the flagged lines."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lane B&lt;/strong&gt;: "Fix by addressing the root cause. May touch more files."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both receive the same repair brief. After both complete, the pipeline runs lightweight checks (lint, typecheck, tests, hidden validation scenarios) on each lane, scores them, and sends only the winner through the full three-model review. If the winner still has serious findings, the runner-up also gets reviewed before the iteration is declared failed.&lt;/p&gt;

&lt;p&gt;Same principle as Lean's &lt;code&gt;multi_attempt&lt;/code&gt;: explore the solution space in parallel, select the best candidate, verify once.&lt;/p&gt;
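
&lt;p&gt;The select-then-verify step can be sketched as follows; the check names and weights are hypothetical, not Signum's actual scoring:&lt;/p&gt;

```python
# Hypothetical lane-selection sketch for the parallel repair step --
# check names and weights are illustrative, not Signum's scoring.

def score_lane(checks):
    """checks maps check name to pass/fail; higher score is better."""
    weights = {"tests": 4, "hidden_scenarios": 3, "typecheck": 2, "lint": 1}
    return sum(w for name, w in weights.items() if checks.get(name, False))

def pick_winner(lane_a, lane_b):
    """Only the winner goes through the full three-model review.
    Ties favor lane A, the minimal-change strategy."""
    if score_lane(lane_a) >= score_lane(lane_b):
        return "A", lane_a
    return "B", lane_b
```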

&lt;h2&gt;
  
  
  Three More Changes
&lt;/h2&gt;

&lt;p&gt;These came from the same design session but aren't directly Leanstral-inspired:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic strategy injection.&lt;/strong&gt; The Contractor agent now classifies the task type (bugfix, feature, refactor, security) via keyword scan and generates a strategy hint in the contract. The Engineer reads it as a process guide -- "reproduce bug with a test first" for bugfixes, "find all occurrences of the pattern, not just the reported one" for security fixes. Informational only; it doesn't block the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context retrieval for reviewers.&lt;/strong&gt; A new pre-review step gathers git history (last commit per modified file), issue references (parsed from the goal text), and the project's intent document. This context is injected only into the Claude reviewer -- Codex and Gemini remain isolated (goal + diff only), preserving their value as independent validators. The intent is to reduce false positives by giving the semantic reviewer context about &lt;em&gt;why&lt;/em&gt; the code looks the way it does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval UX.&lt;/strong&gt; A small fix: the contract approval display now uses markdown formatting instead of fragmented bash output. The goal text is never truncated, the summary is a compact table, and warnings are grouped.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Each feature went through the full pipeline: design panel, contract, implementation, three-model audit, iterative repair. Of six runs, only one passed on the first attempt (the simplest change). The rest required two to three iterations.&lt;/p&gt;

&lt;p&gt;The pattern that emerged: the Engineer's first pass satisfies all acceptance criteria, but code review surfaces real bugs -- exit code masking, race conditions on shared file paths, missing field mappings. The iterative loop fixes them in one or two passes. In this sample of six changes, the system behaved as intended: not a gatekeeping checkpoint, but a convergence loop.&lt;/p&gt;

&lt;p&gt;The full session: 7 commits, roughly 1,900 lines of changes, 5 design panels, over 15 multi-model review rounds. It started from one line in a morning news digest.&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>signum</category>
      <category>verification</category>
    </item>
    <item>
      <title>Switching AI CLIs Without Losing 32 Skills: Why I Built nex</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:46:14 +0000</pubDate>
      <link>https://dev.to/t3chn/switching-ai-clis-without-losing-32-skills-why-i-built-nex-6e4</link>
      <guid>https://dev.to/t3chn/switching-ai-clis-without-losing-32-skills-why-i-built-nex-6e4</guid>
      <description>&lt;p&gt;I use three AI coding CLIs daily: Claude Code, Codex CLI, and Gemini CLI. Each has plugins, skills, and custom workflows I've built over months. When I wanted to try Codex as my primary tool for a week, the migration looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manually recreate 32 symlinks&lt;/li&gt;
&lt;li&gt;Adapt plugin layouts to each CLI's format&lt;/li&gt;
&lt;li&gt;Track which version is installed where&lt;/li&gt;
&lt;li&gt;Hope nothing drifts while I'm not looking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built nex to make this a one-liner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: skills are trapped
&lt;/h2&gt;

&lt;p&gt;AI coding agents are converging on similar concepts -- skills (reusable instructions), plugins (tools + hooks), and agents (autonomous workers). But each CLI implements them differently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Codex CLI&lt;/th&gt;
&lt;th&gt;Gemini CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/skills/SKILL.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/skills/SKILL.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.gemini/skills/SKILL.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plugins&lt;/td&gt;
&lt;td&gt;marketplace + &lt;code&gt;.claude-plugin/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AGENTS.md&lt;/code&gt; tree-walk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-extension.json&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;settings.json&lt;/code&gt; per profile&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;config.toml&lt;/code&gt; with profiles&lt;/td&gt;
&lt;td&gt;&lt;code&gt;settings.json&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discovery&lt;/td&gt;
&lt;td&gt;marketplace clone + cache&lt;/td&gt;
&lt;td&gt;directory scan&lt;/td&gt;
&lt;td&gt;&lt;code&gt;context.fileName&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The skill format (SKILL.md with YAML frontmatter) is actually the same across all three -- thanks to the &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; open standard. But the discovery and installation mechanics are completely different.&lt;/p&gt;
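
&lt;p&gt;A skill in this shared format is a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; roughly like the sketch below; frontmatter fields beyond &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; vary by tool:&lt;/p&gt;

```markdown
---
name: my-skill
description: One-line summary the agent uses to decide when to load this skill.
---

# my-skill

Step-by-step instructions the agent follows once the skill is loaded.
```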

&lt;p&gt;If you've built 12 custom plugins with skills, agents, and hooks, switching your primary CLI means rebuilding your entire setup. That's vendor lock-in through installation friction, not through format incompatibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  What nex does
&lt;/h2&gt;

&lt;p&gt;nex is a Rust CLI (~5000 LOC) that manages the installation layer. It doesn't change how skills work -- it handles where they live.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nex &lt;span class="nb"&gt;install &lt;/span&gt;signum
&lt;span class="go"&gt;  Installing signum v4.8.0...
  [OK] Claude Code  ~/.claude/plugins/signum
  [OK] Codex        ~/.agents/skills/signum
  [OK] Gemini       ~/.agents/skills/signum
  Installed for 3 platforms.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command, all three CLIs get the skill.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seeing everything at once
&lt;/h3&gt;

&lt;p&gt;Before nex, I had no single view of what was installed where:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nex list
&lt;span class="go"&gt;PLUGIN           VERSION    EMPORIUM   CC     CODEX  DEV
────────────────────────────────────────────────────────
signum           4.8.0      v4.8.0     ✓      ✓       -
herald           2.1.0      v2.1.0     ✓       -      dev→~/...
delve            0.8.1      v0.8.1     ✓       -      dev→~/...
arbiter          0.3.0      v0.3.0     ✓       -      dev→~/...
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;32 plugins visible across all platforms. Previously, nex could see only one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drift detection
&lt;/h3&gt;

&lt;p&gt;Version drift between platforms is silent and dangerous. &lt;code&gt;nex check&lt;/code&gt; catches it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nex check
&lt;span class="go"&gt;PLUGIN           EMPORIUM     CC CACHE     CODEX      STATUS
──────────────────────────────────────────────────────────────
herald           v2.1.0       v2.0.0        -          UPDATE ↑
signum           v4.8.0       v4.8.0       linked     OK
delve            v0.8.1       v0.8.1        -          OK (dev override)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Profiles as desired state
&lt;/h3&gt;

&lt;p&gt;The killer feature: TOML profiles that declare which skills should be active per CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~/.nex/profiles/work.toml&lt;/span&gt;
&lt;span class="nn"&gt;[plugins]&lt;/span&gt;
&lt;span class="py"&gt;enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"signum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"herald"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"delve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"arbiter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"content-ops"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"anvil"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"forge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"genesis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"glyph"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"reporter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sentinel"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[dev]&lt;/span&gt;
&lt;span class="py"&gt;herald&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"~/personal/skill7/devtools/herald"&lt;/span&gt;
&lt;span class="py"&gt;delve&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"~/personal/skill7/devtools/delve"&lt;/span&gt;

&lt;span class="nn"&gt;[platforms]&lt;/span&gt;
&lt;span class="py"&gt;claude-code&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;codex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gemini&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nex profile apply work
&lt;span class="go"&gt;  [OK] signum  - Codex/Gemini symlink exists
  [NEW] herald  - symlink created
  [NEW] delve  - symlink created
&lt;/span&gt;&lt;span class="c"&gt;  ...
&lt;/span&gt;&lt;span class="go"&gt;  Profile 'work' applied and set as active.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switching from Claude Code to Codex as primary? &lt;code&gt;nex profile apply work&lt;/code&gt; ensures all your skills are there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: layered source of truth
&lt;/h2&gt;

&lt;p&gt;The hardest design decision was ownership. Who owns the plugin state?&lt;/p&gt;

&lt;p&gt;Each platform already tracks its own installations internally. Claude Code has &lt;code&gt;installed_plugins.json&lt;/code&gt;. Codex discovers skills by scanning directories. Gemini reads extension configs. If nex tried to be the single source of truth for everything, it would constantly fight with the platforms' own state management.&lt;/p&gt;

&lt;p&gt;Instead, nex uses a layered model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│            nex CLI              │
│  Catalog │ Profiles │ Adapters  │
└────┬─────┴────┬─────┴──┬──┬──┬─┘
     │          │        │  │  │
     ▼          ▼        ▼  ▼  ▼
  emporium   ~/.nex/    CC Cdx Gem
  (catalog)  profiles  (ro)(rw)(rw)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1: Catalog&lt;/strong&gt; -- nex owns the emporium marketplace (our plugin registry). It's the source of truth for what versions exist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2: Platform runtime&lt;/strong&gt; -- each CLI owns its own state. nex reads Claude Code state (read-only, never writes to CC files), and manages Codex/Gemini symlinks directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3: Profiles&lt;/strong&gt; -- nex owns the desired state. Profiles declare intent; &lt;code&gt;nex profile apply&lt;/code&gt; reconciles reality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design came from an arbiter panel where I asked Codex and Gemini for their recommendation. Both proposed "layered SSoT" independently. The key insight from Codex: "Don't pick one global source of truth. Pick a source of truth per layer."&lt;/p&gt;

&lt;h2&gt;
  
  
  Release automation
&lt;/h2&gt;

&lt;p&gt;When I release a new version of a plugin, I don't want to manually update versions, changelogs, tags, and marketplace refs. &lt;code&gt;nex release&lt;/code&gt; handles the full pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nex release patch &lt;span class="nt"&gt;--execute&lt;/span&gt;
&lt;span class="go"&gt;  [OK] BUMP        .claude-plugin/plugin.json
  [OK] CHANGELOG   inserted [2.1.0] section
  [OK] DOCS        README.md version updated
  [OK] COMMIT      "release: v2.1.0"
  [OK] TAG         v2.1.0
  [OK] PUSH        origin/main
  [OK] PROPAGATE   emporium marketplace ref → v2.1.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nine stages, dry-run by default. The &lt;code&gt;DOCS&lt;/code&gt; stage (new in v0.9.0) auto-generates changelog entries from &lt;code&gt;git log&lt;/code&gt; and syncs SKILL.md descriptions with plugin.json.&lt;/p&gt;

&lt;h2&gt;
  
  
  Health monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;nex doctor&lt;/code&gt; runs 11 checks across all platforms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nex doctor
&lt;span class="go"&gt;  [OK]   signum
  [WARN] herald  emporium_drift: emporium=v2.1.0 but CC cache=v2.0.0
  [WARN] signum  duplicate: found in 3 locations: dev symlink, emporium cache, nex-devtools
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It catches duplicates, stale symlinks, orphan caches, version drift, and deprecated marketplace artifacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Read-only integration is the right default.&lt;/strong&gt; My first instinct was to have nex write directly to Claude Code's &lt;code&gt;installed_plugins.json&lt;/code&gt;. An arbiter panel (Codex + Gemini) convinced me otherwise: writing to another tool's internal state creates race conditions and breaks on format changes. Read-only + filesystem discovery is more resilient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profiles are more useful than auto-sync.&lt;/strong&gt; I initially wanted nex to automatically keep all platforms in sync. But different contexts need different skill sets. My &lt;code&gt;work&lt;/code&gt; profile has 11 plugins; &lt;code&gt;personal&lt;/code&gt; has 2. Explicit profiles are better than implicit mirroring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The skill format war is already over.&lt;/strong&gt; SKILL.md with YAML frontmatter works in Claude Code, Codex, Gemini, and 20+ other tools. The installation layer is the real fragmentation -- and that's what nex fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;nex is open source (MIT), written in Rust, and works on macOS and Linux.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
cargo &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--git&lt;/span&gt; https://github.com/heurema/nex

&lt;span class="c"&gt;# Or download binary&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/heurema/nex/releases/latest/download/nex-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-apple-darwin&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; nex
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x nex &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mv &lt;/span&gt;nex ~/.local/bin/

&lt;span class="c"&gt;# Get started&lt;/span&gt;
nex list              &lt;span class="c"&gt;# see all plugins across platforms&lt;/span&gt;
nex check             &lt;span class="c"&gt;# detect version drift&lt;/span&gt;
nex doctor            &lt;span class="c"&gt;# health check&lt;/span&gt;
nex status            &lt;span class="c"&gt;# cross-platform overview&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/heurema/nex" rel="noopener noreferrer"&gt;heurema/nex&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>claudecode</category>
      <category>rust</category>
      <category>cli</category>
    </item>
    <item>
      <title>One Pass Isn't Enough: How Signum Learned to Fix Its Own Code</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Sun, 15 Mar 2026 18:04:12 +0000</pubDate>
      <link>https://dev.to/t3chn/one-pass-isnt-enough-how-signum-learned-to-fix-its-own-code-3ngh</link>
      <guid>https://dev.to/t3chn/one-pass-isnt-enough-how-signum-learned-to-fix-its-own-code-3ngh</guid>
      <description>&lt;p&gt;The first version of Signum ran in a single pass: CONTRACT → EXECUTE → AUDIT → PACK. If the audit found a problem — block. Human deals with it.&lt;/p&gt;

&lt;p&gt;An honest process, but a limited one. Imagine code review where the reviewer can only comment and the author can't fix anything. The finding goes back to the queue, context is lost, the cycle restarts from scratch. Signum v4.6 closes this gap: the pipeline now loops at three levels — code, contract, and project context — before producing the final proofpack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: one-shot verification
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://ctxt.dev/posts/en/signum-contract-first-ai-dev" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt; I covered the contract as ground truth and &lt;a href="https://ctxt.dev/posts/en/signum-proofpack-ai-proof" rel="noopener noreferrer"&gt;proofpack as a verification artifact&lt;/a&gt;. The architecture worked: spec → blinded implementation → multi-model audit → proof artifact. But production use revealed a pattern.&lt;/p&gt;

&lt;p&gt;Most audit findings in our early runs weren't architectural issues. They were small: a missed edge case in error handling, a forgotten &lt;code&gt;null&lt;/code&gt; check, a test that doesn't cover one of the acceptance criteria. Things the engineer agent could fix in seconds — if it got the chance.&lt;/p&gt;

&lt;p&gt;Instead, Signum issued &lt;code&gt;AUTO_BLOCK&lt;/code&gt;: a human looked at the finding and restarted the pipeline. Full contract rebuild, full implementation, full audit — for a bug that's a one-line fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loop 1: code — iterative audit
&lt;/h2&gt;

&lt;p&gt;Signum v4.6 adds a repair loop that bridges AUDIT and EXECUTE. When the audit finds MAJOR or CRITICAL findings, instead of blocking, it sends the engineer back to fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AUDIT → findings → re-enter EXECUTE (repair) → AUDIT → ... → PACK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the first audit pass (mechanic + Claude + Codex + Gemini), if there are actionable findings, the engineer agent receives &lt;code&gt;repair_brief.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"findings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"F-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MAJOR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/api/tokens.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Missing error response when rate limit storage is unavailable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codex"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important: &lt;code&gt;repair_brief.json&lt;/code&gt; contains only observed defect symptoms from visible criteria and deterministic checks. Holdout failures are reported as behavioral observations ("function returns 200 when 429 expected") without revealing the hidden acceptance criteria. The data-level blinding from the original contract remains intact — the engineer never sees raw holdout text.&lt;/p&gt;

&lt;p&gt;The engineer fixes. The full audit reruns — not on the repair diff, but on the entire implementation from baseline. Then PACK produces the final proofpack as before.&lt;/p&gt;

&lt;p&gt;Key decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best-of-N, not last-of-N.&lt;/strong&gt; The pipeline stores each iteration's artifacts in &lt;code&gt;.signum/iterations/NN/&lt;/code&gt;. If iteration 3 is worse than iteration 2 (the repair broke something else), Signum rolls back to the best candidate. No blind faith that the latest fix is the best one.&lt;/p&gt;
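&lt;p&gt;The selection step can be sketched as a scoring pass over the stored iteration directories. A minimal sketch, assuming one &lt;code&gt;audit.json&lt;/code&gt; per iteration and a simple severity weighting — both are illustrative, not Signum's actual schema:&lt;/p&gt;

```python
# Hypothetical best-of-N selection over .signum/iterations/NN/.
# The audit.json layout and severity weights are assumptions.
import json
from pathlib import Path

SEVERITY_WEIGHT = {"CRITICAL": 10, "MAJOR": 3, "MINOR": 1}

def iteration_score(audit_path: Path) -> int:
    """Lower is better: weighted sum of open findings."""
    findings = json.loads(audit_path.read_text())["findings"]
    return sum(SEVERITY_WEIGHT.get(f["severity"], 1) for f in findings)

def best_iteration(root: Path) -> Path:
    """Pick the iteration with the fewest weighted findings,
    preferring the earlier one on ties (the later repair added nothing)."""
    candidates = sorted(root.glob("[0-9][0-9]"))
    return min(candidates, key=lambda d: (iteration_score(d / "audit.json"), d.name))
```

The tie-break toward earlier iterations matters: if a repair neither helped nor hurt, the smaller diff wins.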

&lt;p&gt;&lt;strong&gt;Diff progression.&lt;/strong&gt; On the first pass, reviewers see the full patch. On pass 2+, they see the full patch plus the iteration delta with an instruction to focus on what changed in the repair. This saves tokens and reduces noise. If the delta exceeds 80% of the full patch — fallback to full-diff-only (the repair is too large to review incrementally).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early stop.&lt;/strong&gt; If two consecutive iterations show no improvement — stop. Maximum 20 iterations (configurable via &lt;code&gt;SIGNUM_AUDIT_MAX_ITERATIONS&lt;/code&gt;). In practice, convergence happens in 2-3 passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding fingerprints.&lt;/strong&gt; Each finding gets a fingerprint based on file, line range, and issue type. Between iterations, Signum classifies every finding as resolved, persisting, or new. The synthesizer uses this to evaluate actual progress — not just "fewer findings" but "which specific issues were fixed and which appeared."&lt;/p&gt;
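&lt;p&gt;A sketch of the fingerprint-and-classify step. The fingerprint inputs (file, line region, issue type) follow the description above; the exact region bucketing is an assumption:&lt;/p&gt;

```python
# Hypothetical finding fingerprints: stable across small line shifts
# because the line number is bucketed into a region.
import hashlib

def fingerprint(file: str, line: int, issue_type: str, bucket: int = 10) -> str:
    """Same file + same ~10-line region + same issue type -> same id."""
    region = line // bucket
    raw = f"{file}|{region}|{issue_type}"
    return hashlib.sha256(raw.encode()).hexdigest()[:12]

def classify(previous: set, current: set) -> dict:
    """Split findings into resolved / persisting / new between iterations."""
    return {
        "resolved": previous.difference(current),
        "persisting": previous.intersection(current),
        "new": current.difference(previous),
    }
```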

&lt;p&gt;&lt;strong&gt;Hallucination filtering.&lt;/strong&gt; If a reviewer cites a line that doesn't exist in the diff or references a file outside scope, the finding is discarded. This is the same mechanism described in the &lt;a href="https://ctxt.dev/posts/en/heurema-ecosystem" rel="noopener noreferrer"&gt;ecosystem post&lt;/a&gt;: every AI finding is validated against the actual diff before it enters the repair loop.&lt;/p&gt;
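&lt;p&gt;The filter itself is mechanical. A minimal sketch, assuming each finding carries &lt;code&gt;file&lt;/code&gt; and &lt;code&gt;line&lt;/code&gt; and the diff has been parsed into a map of files to changed line numbers:&lt;/p&gt;

```python
# Hypothetical hallucination filter: discard any finding that points
# outside the actual diff. diff_index maps file -> set of changed lines.
def filter_findings(findings, diff_index):
    valid = []
    for f in findings:
        lines = diff_index.get(f["file"])
        if lines is not None and f["line"] in lines:
            valid.append(f)
    return valid
```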

&lt;h2&gt;
  
  
  Loop 2: contract — self-critique
&lt;/h2&gt;

&lt;p&gt;The code loop fixes implementation. But what if the problem is upstream — in the contract itself? A perfect implementation of a flawed spec is still a failure.&lt;/p&gt;

&lt;p&gt;For medium and high-risk tasks, the contractor now runs a 4-pass self-critique before showing the contract to the human:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguity review&lt;/strong&gt; — scans goal, acceptance criteria, and scope for ambiguous phrasing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing-input review&lt;/strong&gt; — checks for missing preconditions, records clarification decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contradiction review&lt;/strong&gt; — detects contradictions between goal, scope, and risk level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage review&lt;/strong&gt; — reconstructs the goal from acceptance criteria, checks coverage, documents assumption provenance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Maximum 2 auto-revision rounds. If the verdict remains &lt;code&gt;"no-go"&lt;/code&gt; after round 2 — escalation to human. Low-risk tasks skip all 4 passes entirely.&lt;/p&gt;
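&lt;p&gt;The control flow of this loop can be sketched in a few lines. The review passes themselves are LLM calls, modeled here as a callable; only the limits (2 auto-revision rounds, then human escalation) come from the actual design:&lt;/p&gt;

```python
# Hypothetical self-critique loop skeleton. run_review(contract) returns
# (verdict, findings); revise(contract, findings) applies fixes.
def critique_loop(contract, run_review, revise, max_rounds=2):
    for round_no in range(max_rounds + 1):  # initial review + up to 2 revisions
        verdict, findings = run_review(contract)
        if verdict == "go":
            return "go", contract
        if round_no == max_rounds:
            break  # still "no-go" after round 2: stop revising
        contract = revise(contract, findings)
    return "escalate", contract  # hand off to a human
```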

&lt;p&gt;The result is written to the contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"readinessForPlanning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"go"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"All ambiguities resolved. AC3 coverage gap closed in round 1."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ambiguityCandidates"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contradictionsFound"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clarificationDecisions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The human sees both the verdict and the full path to it. Not "the contractor decided the contract is good" — but what problems were found and how they were resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loop 3: project — shared context across contracts
&lt;/h2&gt;

&lt;p&gt;The code loop iterates within a single task. The contract loop iterates within a single spec. But previous Signum versions treated contracts in isolation — each task was its own universe. In a real project, tasks are connected: they touch the same files, depend on the same decisions, use the same terminology.&lt;/p&gt;

&lt;p&gt;Three new layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project intent.&lt;/strong&gt; A &lt;code&gt;project.intent.md&lt;/code&gt; file at the project root — goal, capabilities, non-goals, personas. The contractor reads it before generating a contract. Project non-goals become scope constraints on the contract. For medium and high-risk tasks, missing intent is a blocking question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Glossary.&lt;/strong&gt; &lt;code&gt;project.glossary.json&lt;/code&gt; defines canonical terms and forbidden synonyms. &lt;code&gt;glossary_check&lt;/code&gt; scans contracts for alias usage, &lt;code&gt;terminology_consistency_check&lt;/code&gt; catches synonym proliferation across active contracts. Both are WARN, not block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-contract coherence.&lt;/strong&gt; &lt;code&gt;overlap_check&lt;/code&gt; detects inScope overlap between active contracts (two contracts touching the same file — conflict?). &lt;code&gt;assumption_check&lt;/code&gt; flags contradictions in assumptions across related contracts. &lt;code&gt;adr_check&lt;/code&gt; warns when relevant ADRs exist but aren't referenced in the contract.&lt;/p&gt;

&lt;p&gt;Plus &lt;strong&gt;upstream staleness detection&lt;/strong&gt;: the contractor hashes the contents of &lt;code&gt;project.intent.md&lt;/code&gt; and the glossary at contract creation. If upstream files change by execution time — warning (or block, if &lt;code&gt;stalenessPolicy: "block"&lt;/code&gt; is configured).&lt;/p&gt;
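&lt;p&gt;The staleness check is a plain hash comparison. A sketch, assuming the recorded hashes are stored with the contract (the file names match the post; the stored structure is an assumption):&lt;/p&gt;

```python
# Hypothetical upstream staleness detection: hash intent and glossary at
# contract creation, re-hash at execution time.
import hashlib
from pathlib import Path

UPSTREAM = ["project.intent.md", "project.glossary.json"]

def snapshot(root: Path) -> dict:
    """Content hash of each upstream artifact that exists."""
    return {
        name: hashlib.sha256((root / name).read_bytes()).hexdigest()
        for name in UPSTREAM if (root / name).exists()
    }

def stale_files(recorded: dict, root: Path) -> list:
    """Files whose content changed since the contract was created."""
    current = snapshot(root)
    return [n for n, h in recorded.items() if current.get(n) != h]
```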

&lt;h2&gt;
  
  
  Architecture v4.6.1: checks as standalone scripts
&lt;/h2&gt;

&lt;p&gt;A bonus from the latest refactoring: 6 inline checks that lived inside the orchestrator are now extracted into standalone testable scripts in &lt;code&gt;lib/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lib/glossary-check.sh      — forbidden synonym scan
lib/terminology-check.sh   — cross-contract term proliferation
lib/overlap-check.sh       — inScope overlap detection
lib/assumption-check.sh    — assumption contradiction detection
lib/adr-check.sh           — ADR relevance check
lib/staleness-check.sh     — upstream artifact staleness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All scripts: JSON stdout, stderr for diagnostics, exit 0 for any check result (non-zero only for infrastructure errors). The orchestrator calls scripts and decides whether to block or warn. Separation of concerns: the script checks, the orchestrator decides.&lt;/p&gt;
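&lt;p&gt;The orchestrator side of that contract fits in a few lines: run the script, treat a non-zero exit as an infrastructure error, and decide warn-vs-block from the parsed JSON. A sketch in Python for illustration — the real orchestrator is not necessarily structured this way, and the &lt;code&gt;findings&lt;/code&gt; field is an assumption:&lt;/p&gt;

```python
# Hypothetical orchestrator wrapper around a lib/ check script.
# Script emits JSON on stdout, diagnostics on stderr; non-zero exit
# means the check itself broke, not that the check failed.
import json
import subprocess

def run_check(script: str, policy: str = "warn") -> str:
    """Run one check; return 'ok', 'warn', or 'block'."""
    proc = subprocess.run([script], capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"{script} infrastructure error: {proc.stderr.strip()}")
    result = json.loads(proc.stdout)
    if result.get("findings"):
        return "block" if policy == "block" else "warn"
    return "ok"
```

The policy lives entirely in the caller: the same script output yields a warning or a block depending on configuration, which is exactly the separation the design aims for.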

&lt;h2&gt;
  
  
  What this changes
&lt;/h2&gt;

&lt;p&gt;Signum v3 answered "is this correct?" with a binary yes/no. v4.6 answers "can this be made correct?" — and if yes, does it.&lt;/p&gt;

&lt;p&gt;In our early runs, v4.6 brought a significant share of the tasks that v3 blocked with AUTO_BLOCK to AUTO_OK in 2-3 iterations, without human involvement. The tasks that still block tend to be real spec or architecture problems — exactly what should escalate to a human.&lt;/p&gt;

&lt;p&gt;Verification isn't a gate at the end of the pipeline. It's a loop. The same principle as human code review: finding → fix → re-check. The difference is that AI can run this loop in seconds, not days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;signum on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ctxt.dev/posts/en/signum-contract-first-ai-dev" rel="noopener noreferrer"&gt;The Contract Is the Context&lt;/a&gt; — first post in series&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ctxt.dev/posts/en/signum-proofpack-ai-proof" rel="noopener noreferrer"&gt;AI Writes Code. Where's the Proof?&lt;/a&gt; — second post in series&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skill7.dev/development/signum" rel="noopener noreferrer"&gt;skill7.dev/development/signum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/emporium" rel="noopener noreferrer"&gt;emporium — plugin marketplace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>verification</category>
      <category>ai</category>
    </item>
    <item>
      <title>Skillpulse: Your AI Skills Are Flying Blind Without Telemetry</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Wed, 11 Mar 2026 11:48:22 +0000</pubDate>
      <link>https://dev.to/t3chn/skillpulse-your-ai-skills-are-flying-blind-without-telemetry-3b76</link>
      <guid>https://dev.to/t3chn/skillpulse-your-ai-skills-are-flying-blind-without-telemetry-3b76</guid>
      <description>&lt;p&gt;You install 16 skills. You see them fire. But here's the question nobody asks: &lt;strong&gt;did the model actually follow them?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I reviewed every telemetry tool in the Claude Code ecosystem -- built-in OTel, claude_telemetry, claude-code-otel, even the skills.sh platform metrics. None of them track skill adherence. They tell you a skill was loaded, but not whether the model executed its instructions.&lt;/p&gt;

&lt;p&gt;So I built skillpulse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;Claude Code's built-in OpenTelemetry (via &lt;code&gt;CLAUDE_CODE_ENABLE_TELEMETRY=1&lt;/code&gt;) captures general metrics: session duration, tokens, cost, tool calls. With &lt;code&gt;OTEL_LOG_TOOL_DETAILS=1&lt;/code&gt;, it even records &lt;code&gt;skill_name&lt;/code&gt; in tool result events. But that's a load signal, not a follow signal.&lt;/p&gt;

&lt;p&gt;The difference matters. A skill can load successfully and be completely ignored by the model. Without tracking adherence, you're optimizing in the dark.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Tracks loading&lt;/th&gt;
&lt;th&gt;Tracks following&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Built-in OTel&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude_telemetry&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-code-otel&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;skills.sh&lt;/td&gt;
&lt;td&gt;Install count only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;skillpulse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (planned)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Skillpulse is a &lt;code&gt;PostToolUse&lt;/code&gt; hook that fires on every &lt;code&gt;Skill&lt;/code&gt; tool call. It writes one JSONL line per activation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skill_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"signum:signum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-11T08:48:18Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20260311_114818_93074"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"loaded"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"followed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugin_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"skillpulse"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation is 60 lines of bash. Some design choices:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2-second watchdog.&lt;/strong&gt; PostToolUse hooks run synchronously -- a hung hook blocks the entire session. Skillpulse spawns a self-kill watchdog that sends SIGKILL after 2 seconds. In practice, the hook finishes in &amp;lt;50ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill-only filter.&lt;/strong&gt; PostToolUse fires for every tool call -- Read, Write, Bash, everything. Skillpulse checks &lt;code&gt;tool_name == "Skill"&lt;/code&gt; and exits immediately for anything else. Zero overhead on non-skill calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Append-only JSONL.&lt;/strong&gt; No database, no rotation, no config. One file at &lt;code&gt;~/.local/share/emporium/activation.jsonl&lt;/code&gt;. Survives crashes, easy to inspect, trivial to back up.&lt;/p&gt;
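&lt;p&gt;The hook's core logic, shown here in Python for illustration (the real implementation is bash), is just a filter plus an append. The field names match the JSONL entry above; the function shape is an assumption:&lt;/p&gt;

```python
# Hypothetical Python rendering of the skillpulse hook body:
# skip non-Skill tool calls, append one JSONL line per activation.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG = Path.home() / ".local/share/emporium/activation.jsonl"

def record_activation(tool_name: str, skill_id: str, session_id: str,
                      log_path: Path = LOG) -> bool:
    if tool_name != "Skill":
        return False  # zero work for Read/Write/Bash/etc.
    entry = {
        "skill_id": skill_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_id": session_id,
        "loaded": True,
        "followed": None,  # adherence tracking is deferred
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return True
```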

&lt;h2&gt;
  
  
  What I learned from 4 entries
&lt;/h2&gt;

&lt;p&gt;Yes, four. Skillpulse was created on March 4th and then... not installed. Classic. I fixed that today.&lt;/p&gt;

&lt;p&gt;But even 4 entries told me something useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Skill                       Acts  Sess  Load%  Age
-----------------------------------------------------
herald:news-digest             2     2  100%   7d
arbiter                        1     1  100%   7d
signum:signum                  1     1  100%   7d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three skills account for all activations. I have 16 installed. That's an 81% dormancy rate. Most of my skills are dead weight consuming context tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The aggregator
&lt;/h2&gt;

&lt;p&gt;A 90-line Python script reads the JSONL and produces per-skill stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 scripts/aggregate.py          &lt;span class="c"&gt;# table output&lt;/span&gt;
python3 scripts/aggregate.py &lt;span class="nt"&gt;--json&lt;/span&gt;   &lt;span class="c"&gt;# for pipelines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No dependencies. Reads timestamps, groups by skill_id, computes frequency, unique sessions, loaded rate, and days since last activation.&lt;/p&gt;
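&lt;p&gt;The aggregation step can be sketched as a single pass over the JSONL. This mirrors the stats listed above (activations, unique sessions, loaded rate, age) but is a simplified illustration, not the script itself:&lt;/p&gt;

```python
# Hypothetical per-skill aggregation over activation.jsonl lines.
import json
from collections import defaultdict
from datetime import datetime, timezone

def aggregate(jsonl_lines):
    stats = defaultdict(lambda: {"acts": 0, "sessions": set(), "loaded": 0, "last": None})
    for line in jsonl_lines:
        e = json.loads(line)
        s = stats[e["skill_id"]]
        s["acts"] += 1
        s["sessions"].add(e["session_id"])
        s["loaded"] += int(bool(e.get("loaded")))
        ts = datetime.strptime(e["timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        s["last"] = max(s["last"], ts) if s["last"] else ts
    return {
        skill: {
            "acts": s["acts"],
            "sessions": len(s["sessions"]),
            "load_pct": round(100 * s["loaded"] / s["acts"]),
            "age_days": (datetime.now(timezone.utc) - s["last"]).days,
        }
        for skill, s in stats.items()
    }
```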

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;Skillpulse is Phase 1 of a larger pipeline I'm calling EvoSkill -- skills that improve themselves based on usage data.&lt;/p&gt;

&lt;p&gt;The pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skillpulse (log)
    |
aggregator (stats)
    |
bench (test against tasks)
    |
evolver (propose improvements)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;followed&lt;/code&gt; field is currently null -- it needs tool-pattern fingerprinting to determine if the model executed a skill's expected behavior. That's the hard part, and I'm deliberately deferring it until I have enough activation data to validate the approach.&lt;/p&gt;

&lt;p&gt;Some things I'm explicitly skipping at my current scale (&amp;lt;20 sessions/week, 16 skills):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hotelling T-squared drift detection&lt;/strong&gt; -- need 50+ trajectories per skill&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bayesian calibration&lt;/strong&gt; -- need labeled outcome data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical routing&lt;/strong&gt; -- relevant at 50+ skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated gates&lt;/strong&gt; -- human review is my gate for now&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The research backing this is in &lt;a href="https://arxiv.org/abs/2603.02766" rel="noopener noreferrer"&gt;EvoSkill&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2603.01145" rel="noopener noreferrer"&gt;AutoSkill&lt;/a&gt;, and the &lt;a href="https://arxiv.org/abs/2601.04170" rel="noopener noreferrer"&gt;ASI drift framework&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add-json skillpulse &lt;span class="s1"&gt;'{"source": {"source": "url", "url": "https://github.com/heurema/skillpulse.git"}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/heurema/skillpulse" rel="noopener noreferrer"&gt;github.com/heurema/skillpulse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All data stays local. MIT licensed. Zero dependencies beyond bash and jq.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>claudecode</category>
      <category>telemetry</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Environment is context: security auditing for AI agent workstations</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:25:37 +0000</pubDate>
      <link>https://dev.to/t3chn/environment-is-context-security-auditing-for-ai-agent-workstations-1li8</link>
      <guid>https://dev.to/t3chn/environment-is-context-security-auditing-for-ai-agent-workstations-1li8</guid>
      <description>&lt;p&gt;We talk a lot about prompts, tools, and evals. But almost nobody audits the environment where the AI agent actually runs.&lt;/p&gt;

&lt;p&gt;The agent sees your &lt;code&gt;.env&lt;/code&gt; files. Your &lt;code&gt;.mcp.json&lt;/code&gt; with hardcoded tokens. Your &lt;code&gt;settings.json&lt;/code&gt; with &lt;code&gt;"permissions": "allow"&lt;/code&gt;. Your plugins, hooks, configs. All of this is operational context, and it directly determines what the agent can do. If an API key sits in plaintext - the agent will read it. If no &lt;code&gt;PreToolUse&lt;/code&gt; hook is configured - any Bash command runs unfiltered. If &lt;code&gt;.claudeignore&lt;/code&gt; is missing - the agent reads every file in the project.&lt;/p&gt;

&lt;p&gt;These are not hypothetical risks. This is the default configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The attack surface nobody measures
&lt;/h2&gt;

&lt;p&gt;Run a mental audit of your workstation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets.&lt;/strong&gt; How many &lt;code&gt;.env&lt;/code&gt; files do your projects have? Are they in &lt;code&gt;.gitignore&lt;/code&gt;? Any secrets in git history? When you launch Claude Code, the shell already contains &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; - the agent can run &lt;code&gt;printenv&lt;/code&gt; and see everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP servers.&lt;/strong&gt; Open &lt;code&gt;.mcp.json&lt;/code&gt;. Tokens right there in JSON? Server versions unpinned? No &lt;code&gt;allowedTools&lt;/code&gt; to restrict available tools? Every MCP server is a child process that inherits all environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hooks.&lt;/strong&gt; Is there a &lt;code&gt;PreToolUse&lt;/code&gt; hook filtering dangerous Bash commands? What about subagents? Claude Code doesn't inherit parent hooks in subagents - that's a documented bug, not a feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust boundaries.&lt;/strong&gt; Do you have &lt;code&gt;.claudeignore&lt;/code&gt;? Is permission mode &lt;code&gt;default&lt;/code&gt; or &lt;code&gt;acceptEdits&lt;/code&gt;? How many plugins are installed and which ones have &lt;code&gt;hooks&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Each of these questions is binary: yes or no, safe or not. They can be checked deterministically, without an LLM, without interpretation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment as context
&lt;/h2&gt;

&lt;p&gt;In context engineering, we turn the implicit into the explicit. Prompts, instructions, tools - everything becomes structured context that shapes agent behavior.&lt;/p&gt;

&lt;p&gt;But the runtime environment is also context. When an agent launches in a shell with &lt;code&gt;direnv&lt;/code&gt;-loaded secrets, it gets access not because you designed it that way, but because nobody checked. When an MCP server starts without &lt;code&gt;allowedTools&lt;/code&gt;, the agent gets access to every tool - not because it's needed, but because that's the default.&lt;/p&gt;

&lt;p&gt;Workstation security posture is implicit context. And as long as it's implicit, you can't manage it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sentinel: deterministic audit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/heurema/sentinel" rel="noopener noreferrer"&gt;Sentinel&lt;/a&gt; is a Claude Code plugin that runs 18 checks across six categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Checks&lt;/th&gt;
&lt;th&gt;What it looks for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;secrets&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Plaintext keys, &lt;code&gt;.env&lt;/code&gt; without &lt;code&gt;.gitignore&lt;/code&gt;, secrets in git history, runtime env vars, dotfiles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcp&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Tokens in &lt;code&gt;.mcp.json&lt;/code&gt;, missing &lt;code&gt;allowedTools&lt;/code&gt;, unpinned server versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;plugins&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Registry drift, scope leakage, unverified plugins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hooks&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Missing &lt;code&gt;PreToolUse&lt;/code&gt; guard, subagent hook gap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trust&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;.claudeignore&lt;/code&gt;, broad permissions, injection surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;config&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Insecure defaults, stale sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each check is a standalone POSIX sh script outputting JSON. No LLM. No heuristics. &lt;code&gt;grep&lt;/code&gt; finds a plaintext token in &lt;code&gt;.mcp.json&lt;/code&gt; - or it doesn't. &lt;code&gt;stat&lt;/code&gt; checks file permissions - 600 or not. Results are reproducible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOAD &amp;gt; VALIDATE &amp;gt; PLAN &amp;gt; RUN &amp;gt; NORMALIZE &amp;gt; ASSESS &amp;gt; PERSIST &amp;gt; RENDER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is a JSON report and a terminal scorecard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  sentinel audit - run_20260311T120000Z
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Category     Score   Checks
  secrets       40/100  ██░░░░░░░░  2/5 pass
  mcp           67/100  ██████░░░░  2/3 pass
  plugins      100/100  ██████████  3/3 pass
  hooks          0/100  ░░░░░░░░░░  0/2 pass
  trust         60/100  ██████░░░░  2/3 pass
  config        50/100  █████░░░░░  1/2 pass

  Total: 47/100    Verdict: FAIL
  Reliability: 1.00 (18/18 checks ran)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two independent metrics: &lt;strong&gt;score&lt;/strong&gt; (0-100) for security posture, &lt;strong&gt;reliability&lt;/strong&gt; (0.0-1.0) for how much of the audit actually ran. A score of 95 with reliability 0.4 is not trustworthy.&lt;/p&gt;
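&lt;p&gt;A sketch of how the two metrics stay independent: score is computed only over checks that produced a result, reliability is the fraction of checks that ran at all. The status values are illustrative:&lt;/p&gt;

```python
# Hypothetical scorecard computation; each result is
# {"status": "pass" | "fail" | "error"} where "error" means
# the check could not run (infrastructure failure).
def scorecard(results):
    ran = [r for r in results if r["status"] != "error"]
    passed = sum(1 for r in ran if r["status"] == "pass")
    score = round(100 * passed / len(ran)) if ran else 0
    reliability = round(len(ran) / len(results), 2) if results else 0.0
    return {"score": score, "reliability": reliability}
```

Separating the two is what makes "95 with reliability 0.4" visibly suspect: a high score over a small fraction of the audit says little.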

&lt;h2&gt;
  
  
  What I found
&lt;/h2&gt;

&lt;p&gt;The first sentinel run on my workstation scored 47 out of 100. Real findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 plaintext &lt;code&gt;.env&lt;/code&gt; files with API keys across 4 work contexts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;, and 12 more secrets accessible via &lt;code&gt;printenv&lt;/code&gt; in the current shell&lt;/li&gt;
&lt;li&gt;3 MCP servers with tokens in &lt;code&gt;.mcp.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Zero &lt;code&gt;PreToolUse&lt;/code&gt; hooks - any Bash command runs without filtering&lt;/li&gt;
&lt;li&gt;Missing &lt;code&gt;.claudeignore&lt;/code&gt; in several projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were accidental. This is the result of standard setup: install Claude Code, add MCP servers, start working. Environment security is not what you think about during installation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remediation: not just finding, but fixing
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/sentinel-fix &amp;lt;run_id&amp;gt;&lt;/code&gt; walks through each FAIL/WARN finding and shows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What&lt;/strong&gt; - problem description with redacted evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt; - risk explanation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How&lt;/strong&gt; - specific command to fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Commands come from the check registry, not generated by an LLM. &lt;code&gt;sentinel-fix&lt;/code&gt; never auto-executes - it only suggests. Risk badge for each action: &lt;code&gt;safe&lt;/code&gt;, &lt;code&gt;caution&lt;/code&gt;, &lt;code&gt;dangerous&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After fixing, run &lt;code&gt;/sentinel-diff&lt;/code&gt; to compare reports. Each finding has a stable &lt;code&gt;finding_id&lt;/code&gt; (SHA-256 of &lt;code&gt;check_id|category|evidence_paths&lt;/code&gt;), enabling tracking: new issues, resolved issues, status changes.&lt;/p&gt;
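&lt;p&gt;The stable id and the diff logic it enables can be sketched directly from the formula above. Sorting the evidence paths before hashing is an assumption added for determinism:&lt;/p&gt;

```python
# Stable finding_id per the post: SHA-256 over check_id|category|evidence,
# plus the run-to-run diff it enables.
import hashlib

def finding_id(check_id: str, category: str, evidence_paths: list) -> str:
    raw = "|".join([check_id, category] + sorted(evidence_paths))
    return hashlib.sha256(raw.encode()).hexdigest()

def diff_runs(prev_ids: set, curr_ids: set) -> dict:
    return {
        "new": curr_ids.difference(prev_ids),
        "resolved": prev_ids.difference(curr_ids),
        "persisting": prev_ids.intersection(curr_ids),
    }
```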

&lt;h2&gt;
  
  
  What this means for context engineering
&lt;/h2&gt;

&lt;p&gt;We spend effort ensuring the agent gets the right system prompt, the right tools, the right documentation. But the runtime environment is context too - just unmanaged. A plaintext secret in &lt;code&gt;.env&lt;/code&gt; is not a security problem in a vacuum. It's implicit context that determines what the agent &lt;em&gt;can&lt;/em&gt; do, beyond what it's &lt;em&gt;supposed&lt;/em&gt; to do.&lt;/p&gt;

&lt;p&gt;A security audit for the AI workstation is not paranoia. It's the same practice as dependency checking, linting, CI pipelines. There just wasn't a tool for this new class of risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin add heurema/sentinel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/sentinel              &lt;span class="c"&gt;# full audit&lt;/span&gt;
/sentinel &lt;span class="nt"&gt;--deep&lt;/span&gt;       &lt;span class="c"&gt;# audit + LLM risk explanation&lt;/span&gt;
/sentinel-fix &amp;lt;run_id&amp;gt; &lt;span class="c"&gt;# guided remediation&lt;/span&gt;
/sentinel-diff         &lt;span class="c"&gt;# compare with previous audit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/sentinel" rel="noopener noreferrer"&gt;sentinel on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skill7.dev/devtools/sentinel" rel="noopener noreferrer"&gt;skill7.dev/devtools/sentinel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/en/signum-contract-first-ai-dev"&gt;Contract is context: Signum&lt;/a&gt; - AI code verification&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/en/heurema-ecosystem"&gt;11 plugins, one marketplace: the heurema ecosystem&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>security</category>
      <category>agents</category>
    </item>
    <item>
      <title>Research Agents Lie. The Fix Is Adversarial Verification.</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:03:44 +0000</pubDate>
      <link>https://dev.to/t3chn/research-agents-lie-the-fix-is-adversarial-verification-13ne</link>
      <guid>https://dev.to/t3chn/research-agents-lie-the-fix-is-adversarial-verification-13ne</guid>
      <description>&lt;p&gt;You asked an AI research assistant a detailed question and got a confident multi-page answer with citations. Some of those citations don't exist. Several facts contradict each other. The synthesis reads well — it's structured, well-argued, fluent. It's also built on claims no one verified.&lt;/p&gt;

&lt;p&gt;This is not an edge case. It's the default behavior of every research agent I've looked at.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Research agents optimize for coherence, not correctness. The workflow is always some variation of: gather sources → read and chunk → synthesize. The final output is shaped by what reads well together, not what's actually true.&lt;/p&gt;

&lt;p&gt;The failure mode is subtle. You get a report that passes casual inspection. No obvious hallucinations, reasonable citations, plausible numbers. But if you trace the actual claims — "X was released in 2023", "Y's accuracy is 94%", "Z approach outperforms alternatives by 40%" — a significant fraction are wrong, unverifiable, or sourced from a single origin that all the other citations are copying.&lt;/p&gt;

&lt;p&gt;This is worse than no research. It produces false confidence. You walk away with a mental model that has errors baked in at the foundation level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The landscape
&lt;/h2&gt;

&lt;p&gt;I went through seven research systems to understand what they actually do: four OSS frameworks - node-deepresearch, deep-research (dzhng), GPT-Researcher (assafelovic), STORM (Stanford) - plus the commercial ones: OpenAI Deep Research, Gemini Deep Research, Perplexity.&lt;/p&gt;

&lt;p&gt;Every one of them follows the same pattern: decompose topic → search and fetch → synthesize. Some have multi-step retrieval, some have recursive query expansion, some have beautiful citation formatting. None verify claims adversarially after synthesis.&lt;/p&gt;

&lt;p&gt;Perplexity leads on speed and gets 93.9% on SimpleQA. OpenAI and Gemini lead on depth. GPT-Researcher won CMU's DeepResearchGym benchmark. These are real achievements. But the benchmark question is "did the final report contain the right answer?" — not "what percentage of atomic claims in the report are independently verified?"&lt;/p&gt;

&lt;p&gt;That's the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delve
&lt;/h2&gt;

&lt;p&gt;Delve is a Claude Code plugin built as a pure SKILL.md file. No binaries, no scripts — just orchestration logic and reference prompts. Five stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SCAN → DECOMPOSE → DIVE → VERIFY → SYNTHESIZE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first three stages are table stakes: scan existing sources and memory, decompose the topic into 2-6 independent sub-questions, dispatch parallel research subagents (2-6 depending on &lt;code&gt;--depth&lt;/code&gt;) to investigate each one. Standard research pipeline, done well.&lt;/p&gt;

&lt;p&gt;The fourth stage is where it differs.&lt;/p&gt;

&lt;h3&gt;
  
  
  VERIFY: adversarial claim-level checking
&lt;/h3&gt;

&lt;p&gt;After DIVE completes, before synthesis touches anything, VERIFY runs independently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Claim extraction.&lt;/strong&gt; All dive outputs get decomposed into atomic claims. Not summaries — individual assertions. "X library achieves 94% accuracy on benchmark Y." "Project Z was last updated in 2024." "The approach outperforms alternatives by a factor of 3." Each claim gets a &lt;code&gt;c_&amp;lt;hash&amp;gt;&lt;/code&gt; identifier and a classification: factual, quantitative, time-sensitive, methodology, opinion.&lt;/p&gt;
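&lt;p&gt;The claim IDs and classes could be derived along these lines - a sketch in which the keyword rules stand in for the model's judgment (the skill itself is prompt orchestration, not Python), and the ID is a content hash with a &lt;code&gt;c_&lt;/code&gt; prefix:&lt;/p&gt;

```python
import hashlib
import re

def claim_id(text):
    # "c_" prefix plus a truncated content hash, per the identifier scheme above.
    return "c_" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:10]

def classify(text):
    # Crude keyword heuristics standing in for the model's classification
    # (the methodology class is omitted for brevity).
    if re.search(r"\b(19|20)\d{2}\b", text):
        return "time-sensitive"
    if re.search(r"\d+(\.\d+)?%|\bfactor of\b", text):
        return "quantitative"
    if re.search(r"\b(best|better|worse|should)\b", text):
        return "opinion"
    return "factual"
```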

&lt;p&gt;&lt;strong&gt;Step 2: Adversarial verification.&lt;/strong&gt; Independent subagents receive batches of claims. The prompt framing is explicit: find flaws, don't confirm. Crucially, these agents do not see the original research context — no anchoring to the synthesis they're checking. They go look for independent evidence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;verified&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;contested&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rejected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uncertain&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each verdict includes evidence and sources. Source independence is checked: three blog posts copying the same press release don't count as three confirmations.&lt;/p&gt;
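&lt;p&gt;A rough approximation of that independence check: count distinct hosts instead of raw citations. Real independence needs provenance analysis - syndicated copies of one press release live on many domains - so treat this sketch as a floor, not the actual verifier:&lt;/p&gt;

```python
from urllib.parse import urlparse

def independent_confirmations(urls):
    # Count distinct hosts, not raw citations: three posts on one domain
    # collapse to a single source.
    hosts = set()
    for url in urls:
        host = urlparse(url).hostname or ""
        hosts.add(host.removeprefix("www."))
    return len(hosts)
```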

&lt;p&gt;&lt;strong&gt;Step 3: Synthesis with explicit provenance.&lt;/strong&gt; Contested claims get both sides presented with evidence. Rejected claims are excluded or flagged. If more than 30% of claims are contested, the output is labeled &lt;code&gt;draft&lt;/code&gt;. The report includes a Methodology section showing which stages ran, how many agents, timing, and the quality verdict.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality model
&lt;/h3&gt;

&lt;p&gt;The output carries two orthogonal labels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verification status&lt;/strong&gt;: &lt;code&gt;verified&lt;/code&gt; (≥80% claims checked, 0 failed P0) / &lt;code&gt;partially-verified&lt;/code&gt; / &lt;code&gt;unverified&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion status&lt;/strong&gt;: &lt;code&gt;complete&lt;/code&gt; / &lt;code&gt;incomplete&lt;/code&gt; / &lt;code&gt;draft&lt;/code&gt; / &lt;code&gt;synthesis_only&lt;/code&gt; / &lt;code&gt;no_evidence&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is honest accounting. If verification was skipped because sources were unavailable, you know. If the report is flagged &lt;code&gt;draft&lt;/code&gt; because the landscape is contested, you know that too.&lt;/p&gt;
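&lt;p&gt;The two labels could be computed along these lines. The 80% and 30% thresholds come from the text; the function names are mine, and the remaining completion states (&lt;code&gt;incomplete&lt;/code&gt;, &lt;code&gt;synthesis_only&lt;/code&gt;, &lt;code&gt;no_evidence&lt;/code&gt;) depend on runtime conditions not modeled here:&lt;/p&gt;

```python
def verification_status(total_claims, checked, failed_p0):
    # "verified" needs at least 80% of claims checked and zero failed P0 checks.
    if total_claims and checked / total_claims >= 0.8 and failed_p0 == 0:
        return "verified"
    if checked:
        return "partially-verified"
    return "unverified"

def completion_status(contested, total_claims):
    # One rule from the text: over 30% contested claims demotes the report to draft.
    if total_claims and contested / total_claims > 0.30:
        return "draft"
    return "complete"
```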

&lt;h3&gt;
  
  
  Pipeline diagram
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;/delve "autoresearch landscape" --depth medium

SCAN     [~25s]   → 12 sources found, decision: full-run
DECOMPOSE [~5s]   → 4 sub-questions decomposed
         ↓ HITL checkpoint: approve/edit sub-questions
DIVE     [~4min]  → 4 agents in parallel (background)
         ↓ all P0 completed, coverage 1.0
VERIFY   [~90s]   → 47 claims extracted, 3 agents
         ↓ 42 verified, 4 uncertain, 1 rejected
SYNTHESIZE[~15s]  → report written
         ↓
docs/research/2026-03-10-autoresearch-landscape-a1b2.md
quality: verified / complete
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/delve &lt;span class="s2"&gt;"autoresearch landscape"&lt;/span&gt; &lt;span class="nt"&gt;--depth&lt;/span&gt; medium
/delve &lt;span class="s2"&gt;"WASM runtimes for edge"&lt;/span&gt; &lt;span class="nt"&gt;--quick&lt;/span&gt;          &lt;span class="c"&gt;# scan + synthesize only, ~90s&lt;/span&gt;
/delve &lt;span class="s2"&gt;"security audit approach X"&lt;/span&gt; &lt;span class="nt"&gt;--providers&lt;/span&gt; claude  &lt;span class="c"&gt;# single-model, sensitive topic&lt;/span&gt;
/delve resume                                    &lt;span class="c"&gt;# resume interrupted run&lt;/span&gt;
/delve status                                    &lt;span class="c"&gt;# list recent runs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resume support is file-based with &lt;code&gt;events.jsonl&lt;/code&gt; as the canonical log. If the orchestrator crashes mid-DIVE, &lt;code&gt;/delve resume&lt;/code&gt; reuses completed worker outputs and continues from where it stopped.&lt;/p&gt;
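&lt;p&gt;A sketch of the replay step, assuming an illustrative one-JSON-object-per-line schema with &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;worker&lt;/code&gt; fields (the real log format isn't documented here):&lt;/p&gt;

```python
import json

def completed_workers(log_path):
    # Replay the canonical event log and collect workers whose completion
    # event was recorded; resume can then skip re-dispatching them.
    done = set()
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "worker_completed":
                done.add(event["worker"])
    return done
```

&lt;p&gt;Treating the append-only log as truth and deriving state by replay is what makes resume safe: a crash can lose in-flight work but never corrupt the record of what finished.&lt;/p&gt;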

&lt;h2&gt;
  
  
  The design insight
&lt;/h2&gt;

&lt;p&gt;The standard framing is "research is a retrieval problem." Add more sources, better chunking, smarter query expansion. This produces marginal improvements on the coherence metric while leaving the correctness problem untouched.&lt;/p&gt;

&lt;p&gt;Delve treats research as a verification problem. The VERIFY stage adds 40-60% to total run time. The tradeoff is explicit: you get a report where the trust model is different. Not "the AI synthesized this confidently" but "these claims were checked by agents with adversarial prompting and independent access."&lt;/p&gt;

&lt;p&gt;That said, an honest admission: verification quality depends on what's available. Some domains have sparse or low-quality web coverage. Time-sensitive facts from internal systems or paywalled sources may come back &lt;code&gt;uncertain&lt;/code&gt; regardless of how many agents look. The quality model makes this explicit rather than hiding it behind confident prose.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--providers claude&lt;/code&gt; flag handles sensitive topics: single-model mode where external subagent dispatch is blocked. Maximum verification label in that mode is &lt;code&gt;partially-verified&lt;/code&gt; — same-model verification isn't structurally independent, and the report says so.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin marketplace add heurema/emporium
claude plugin &lt;span class="nb"&gt;install &lt;/span&gt;delve@emporium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/heurema/delve" rel="noopener noreferrer"&gt;github.com/heurema/delve&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Plugin page: &lt;a href="https://skill7.dev/plugins/delve" rel="noopener noreferrer"&gt;skill7.dev/plugins/delve&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>agents</category>
      <category>deepresearch</category>
    </item>
    <item>
      <title>From Plugin to Product: How Herald Became Sift and Why the Data Model Changed Everything</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Tue, 10 Mar 2026 11:13:15 +0000</pubDate>
      <link>https://dev.to/t3chn/from-plugin-to-product-how-herald-became-sift-and-why-the-data-model-changed-everything-46oo</link>
      <guid>https://dev.to/t3chn/from-plugin-to-product-how-herald-became-sift-and-why-the-data-model-changed-everything-46oo</guid>
      <description>

&lt;p&gt;Herald was a Python plugin that collected RSS feeds and Hacker News, clustered articles by title similarity, and generated Markdown briefs. It ran locally, required zero API keys, and did exactly what it was supposed to do.&lt;/p&gt;

&lt;p&gt;Then we tried to make it useful for real work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Broke
&lt;/h2&gt;

&lt;p&gt;Herald's core assumption was that articles are the primary unit. You collect articles, deduplicate by URL, cluster by title similarity, score by source weight and recency, and project a Markdown brief. This works for a developer reading morning news.&lt;/p&gt;

&lt;p&gt;It doesn't work when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An agent needs to know whether "Coinbase lists TOKEN" and "TOKEN now available on Coinbase" are the same real-world fact&lt;/li&gt;
&lt;li&gt;You need confidence levels, not just scores - how many independent sources confirm this event?&lt;/li&gt;
&lt;li&gt;The system must update when new evidence arrives, not just when the next cron runs&lt;/li&gt;
&lt;li&gt;A downstream automation needs typed fields (&lt;code&gt;assets: ["BTC"]&lt;/code&gt;, &lt;code&gt;event_type: "listing"&lt;/code&gt;) instead of parsing Markdown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamental problem: Herald modeled content. The world it was trying to represent contained events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Articles vs Events
&lt;/h2&gt;

&lt;p&gt;In Herald, the main object was a &lt;code&gt;Story&lt;/code&gt; - a cluster of articles with similar titles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Story: "Python 3.14 Released"
  - Article from HN (score: 342)
  - Article from Simon Willison's blog
  - Article from Python.org
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster was the output. The articles were the atoms.&lt;/p&gt;

&lt;p&gt;In Sift, the main object is an &lt;code&gt;Event&lt;/code&gt; - a structured fact pattern with provenance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"evt_2026030801"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bitcoin ETF daily inflow hits $1.2B record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"market_milestone"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"BTC"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"etf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"institutional"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"importance_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.93&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_cluster_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"published_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-08T14:22:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The event is the truth. The articles that support it are evidence. This distinction matters because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Events can be updated.&lt;/strong&gt; When a new article confirms or contradicts an event, the confidence score changes. Herald's stories were frozen after clustering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events have typed metadata.&lt;/strong&gt; &lt;code&gt;assets&lt;/code&gt;, &lt;code&gt;topics&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt; are queryable fields, not bag-of-words extracted from titles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events separate importance from confidence.&lt;/strong&gt; A rumor about a Bitcoin ETF approval is high-importance but low-confidence. Herald couldn't express this - a story was either in the brief or it wasn't.&lt;/li&gt;
&lt;/ol&gt;
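&lt;p&gt;An upsert along these lines is what makes the first point possible. Sketched in Python for brevity (Sift itself is Go), with saturating step sizes that are illustrative assumptions, not Sift's published scoring formula:&lt;/p&gt;

```python
def upsert_evidence(event, source_host, confirms=True):
    # Fold one new article into an event: grow the source cluster and move
    # confidence toward 1.0 on confirmation, or halve it on contradiction.
    sources = event.setdefault("sources", set())
    if source_host in sources:
        return event  # same outlet again adds no independent evidence
    sources.add(source_host)
    event["source_cluster_size"] = len(sources)
    if confirms:
        # each new independent confirmation closes half the remaining gap to 1.0
        event["confidence_score"] += 0.5 * (1.0 - event["confidence_score"])
    else:
        event["confidence_score"] *= 0.5
    return event
```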

&lt;h2&gt;
  
  
  JSON as Truth, Markdown as Projection
&lt;/h2&gt;

&lt;p&gt;Herald's output was a Markdown file. That was the product. Agents read Markdown, humans read Markdown, done.&lt;/p&gt;

&lt;p&gt;Sift inverts this. The canonical record is a typed JSON event. Everything else is a projection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The human digest? A Markdown rendering of the top events in a time window.&lt;/li&gt;
&lt;li&gt;The agent context? The same JSON, filtered by asset and topic.&lt;/li&gt;
&lt;li&gt;The WebSocket stream? Push notifications when an event is upserted.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;llms.txt&lt;/code&gt;? A static slice for LLM-friendly discovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical purity. It's operational: when the API returns an event, the browser workspace and the CLI both render from the same record. There's no "browser version" and "agent version" of the truth.&lt;/p&gt;
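&lt;p&gt;Both projections can render from one record - a sketch (again Python rather than Sift's Go; the helper names are mine, the fields come from the example event above):&lt;/p&gt;

```python
def to_digest_line(event):
    # Human projection: one Markdown bullet rendered from the canonical record.
    assets = ", ".join(event["assets"])
    return "- **{}** ({}) confidence {:.2f}, {} sources".format(
        event["title"], assets, event["confidence_score"],
        event["source_cluster_size"])

def for_agent(events, asset=None, min_confidence=0.0):
    # Agent projection: the same records, filtered by typed fields
    # instead of re-parsed prose.
    out = []
    for e in events:
        if asset is not None and asset not in e["assets"]:
            continue
        if e["confidence_score"] >= min_confidence:
            out.append(e)
    return out
```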

&lt;h2&gt;
  
  
  Python to Go
&lt;/h2&gt;

&lt;p&gt;Herald was ~1,200 lines of Python. Sift is ~7,000 lines of Go. The rewrite wasn't about performance benchmarks.&lt;/p&gt;

&lt;p&gt;Three things drove the language change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single binary deployment.&lt;/strong&gt; Sift Pro is a hosted service running on a Linux node. &lt;code&gt;go build&lt;/code&gt; produces one binary. No virtualenv, no pip, no runtime. The systemd unit file is trivial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared pipeline.&lt;/strong&gt; The same Go packages (&lt;code&gt;internal/pipeline&lt;/code&gt;, &lt;code&gt;internal/event&lt;/code&gt;, &lt;code&gt;internal/ingest&lt;/code&gt;) power both the local &lt;code&gt;sift&lt;/code&gt; CLI and the hosted &lt;code&gt;siftd&lt;/code&gt; server. In Python, sharing code between a CLI tool and an async web server meant fighting import paths and event loops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Concurrency for real-time.&lt;/strong&gt; Sift's hosted mode runs a scheduler, HTTP API, and WebSocket broadcaster in one process. Go's goroutines and channels made this straightforward. Python's asyncio could do it, but the cognitive overhead was higher for a small team.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The trade-off: Go's type system caught things earlier but made rapid prototyping slower. Herald's first version was built in a day. Sift's v0 took a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Free + Hosted Pro
&lt;/h2&gt;

&lt;p&gt;Herald was local-only by design. Sift keeps a local tier and adds a hosted one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sift Free&lt;/strong&gt; (local CLI):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQLite storage under &lt;code&gt;~/.sift/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;User-controlled sync schedule&lt;/li&gt;
&lt;li&gt;Same event model, same digest projections&lt;/li&gt;
&lt;li&gt;Full ownership of your data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sift Pro&lt;/strong&gt; ($5/mo):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hosted Postgres store with 30-day retention&lt;/li&gt;
&lt;li&gt;Autonomous sync every 5 minutes&lt;/li&gt;
&lt;li&gt;Authenticated REST API (&lt;code&gt;/v1/events&lt;/code&gt;, &lt;code&gt;/v1/digests&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;WebSocket stream for real-time updates&lt;/li&gt;
&lt;li&gt;Zitadel-backed accounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The split matters because the free tier is a real product, not a crippled teaser. A developer who wants local crypto news intelligence gets it. A developer who wants always-on event delivery for their agents pays for the hosted runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agents Actually Need
&lt;/h2&gt;

&lt;p&gt;The deeper lesson from Herald to Sift is about what agents need from a news system.&lt;/p&gt;

&lt;p&gt;Herald gave agents Markdown. It was human-readable, which seemed like a feature. But agents don't need prose. They need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Typed records&lt;/strong&gt; they can filter without parsing natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence signals&lt;/strong&gt; so they can decide whether to act on a report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable IDs&lt;/strong&gt; so they can reference events across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push delivery&lt;/strong&gt; so they don't poll for updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance&lt;/strong&gt; so they can trace a claim to its sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a context engineering problem. The question isn't "what text do I feed the model." It's "what structured context does the agent need to make a decision."&lt;/p&gt;

&lt;p&gt;Herald's Markdown brief was a human projection pretending to be agent context. Sift's JSON events are agent context that happens to have a human projection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Provenance Rule
&lt;/h2&gt;

&lt;p&gt;One principle from Sift's manifesto drove more design decisions than any other: no claim without provenance.&lt;/p&gt;

&lt;p&gt;Every event tracks which sources contributed to it. The &lt;code&gt;source_cluster_size&lt;/code&gt; field tells you how many independent sources confirmed the event. The &lt;code&gt;confidence_score&lt;/code&gt; is derived from source agreement, not from a language model's guess.&lt;/p&gt;

&lt;p&gt;This means Sift can honestly say: "7 sources reported this ETF milestone, confidence 0.93" vs "1 blog mentioned this rumor, confidence 0.41." Herald couldn't distinguish these - both would appear as stories with different scores, but the scoring didn't separate importance from evidence quality.&lt;/p&gt;

&lt;p&gt;The practical impact: downstream agents can set thresholds. "Only act on events with confidence &amp;gt; 0.8 and source_cluster_size &amp;gt; 3." That's a policy an automation can enforce. "Only act on stories with score &amp;gt; 50" is a guess.&lt;/p&gt;
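&lt;p&gt;Such a policy is a few lines once the fields are typed. A sketch using the thresholds above (the &lt;code&gt;watch&lt;/code&gt; tier for high-importance, low-confidence rumors is my addition, not Sift's API):&lt;/p&gt;

```python
def policy(event, min_confidence=0.8, min_sources=3):
    # The rule from the text: act only when confidence and independent
    # source count both clear their thresholds.
    evidenced = (event["confidence_score"] > min_confidence
                 and event["source_cluster_size"] > min_sources)
    if evidenced:
        return "act"
    # High-importance but thinly evidenced: surface it, don't act on it.
    if event["importance_score"] > 0.8:
        return "watch"
    return "ignore"
```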

&lt;h2&gt;
  
  
  What Stayed the Same
&lt;/h2&gt;

&lt;p&gt;Not everything changed. The core insight from Herald survived intact: clustering related reports into a single unit is the most valuable transformation in a news pipeline. Whether you call it a story or an event, deduplication-by-meaning is what turns 45 articles into 27 actionable items.&lt;/p&gt;

&lt;p&gt;The scoring formula changed, but the principle didn't: source weight matters, recency matters, cross-source confirmation matters.&lt;/p&gt;

&lt;p&gt;And the local-first instinct survived. Sift Pro exists because some users need it, not because local-first was wrong. The free CLI proves the data model works without a cloud dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Sift is live at &lt;a href="https://skill7.dev/sift" rel="noopener noreferrer"&gt;skill7.dev/sift&lt;/a&gt;. The local CLI is open source.&lt;/p&gt;

&lt;p&gt;Herald remains available as a Claude Code plugin for developers who want configurable, multi-topic news intelligence without accounts or subscriptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://skill7.dev/sift" rel="noopener noreferrer"&gt;Sift on skill7.dev&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/en/herald-v2-local-news-intelligence"&gt;Herald v2: Local-First News Intelligence for AI Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/herald" rel="noopener noreferrer"&gt;Herald on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>contextengineering</category>
      <category>agents</category>
      <category>go</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Spec-Gated Delivery: Why PR Review Is the Wrong Trust Checkpoint for AI Code</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Fri, 06 Mar 2026 10:27:26 +0000</pubDate>
      <link>https://dev.to/t3chn/spec-gated-delivery-why-pr-review-is-the-wrong-trust-checkpoint-for-ai-code-1hfg</link>
      <guid>https://dev.to/t3chn/spec-gated-delivery-why-pr-review-is-the-wrong-trust-checkpoint-for-ai-code-1hfg</guid>
      <description>&lt;p&gt;AI made writing code mass-affordable. It did not make trusting code any cheaper.&lt;/p&gt;

&lt;p&gt;The standard pipeline today: issue or prompt, AI writes code, AI or human reviews the PR, merge. This was always imperfect, but it scaled when humans wrote every line and the reviewer could reason about intent. It breaks when most of the diff was generated in seconds and the reviewer has to reverse-engineer intent from the output.&lt;/p&gt;

&lt;p&gt;The bottleneck moved. Generation is cheap. Verification is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PR Review Trap
&lt;/h2&gt;

&lt;p&gt;PR review is a late, expensive, probabilistic checkpoint. By the time you're looking at a diff, the code exists, the tests exist, the commit message exists. You're pattern-matching against "does this look right" - not against a specification of what "right" means.&lt;/p&gt;

&lt;p&gt;Three failure modes compound:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The issue isn't a spec.&lt;/strong&gt; "Add rate limiting" has ten valid implementations. The reviewer is comparing the diff against their mental model, not against a shared artifact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI reviewing AI without ground truth is circular.&lt;/strong&gt; Running three models on a diff gives you three opinions. They can catch bugs and style issues. But without a formalized spec to check against, they're comparing the diff to their own assumptions - not to verified intent. Multi-model review becomes useful when it's anchored to a concrete spec (what Signum does in its audit phase), not when it replaces one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weak evidence survives the merge.&lt;/strong&gt; After the PR closes, what remains? Review comments, approval checkmarks, linked issues, CI logs. Some teams have richer artifacts - test reports, CODEOWNERS traces, provenance attestations. But even in mature pipelines, there's rarely a single machine-readable artifact that ties the change to a pre-approved, verified intent with holdout results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Shift
&lt;/h2&gt;

&lt;p&gt;The primary trust artifact should not be a diff. It should be an approved specification.&lt;/p&gt;

&lt;p&gt;The primary gate should not be "does this look right to a reviewer." It should be "does this pass deterministic checks against the approved intent."&lt;/p&gt;

&lt;p&gt;The primary evidence should not be review comments. It should be a signed conformance artifact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Approved intent -&amp;gt; blinded execution -&amp;gt; deterministic verification -&amp;gt; signed evidence -&amp;gt; decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't a theory. It's an operational pattern. The pieces exist today: typed specs, deterministic test runners, holdout test sets, attestation primitives (DSSE, in-toto, SLSA). Some teams have built parts of this internally. But there's no standard open stack that combines spec gating, holdout governance, and signed conformance evidence into a single delivery pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Spec Gate Actually Checks
&lt;/h2&gt;

&lt;p&gt;A spec gate doesn't prove code is correct in the general case. That's formal verification, a different problem. It proves something narrower and more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which contract was approved, by whom, when&lt;/li&gt;
&lt;li&gt;Which commit was verified against it&lt;/li&gt;
&lt;li&gt;Which deterministic checks ran and their results&lt;/li&gt;
&lt;li&gt;Which holdout checks (invisible to the implementing agent) passed or failed&lt;/li&gt;
&lt;li&gt;That the evidence bundle wasn't tampered with after the fact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trust model depends on who controls the verifier, who seals the holdouts, and whether the implementing agent can influence the evidence chain. In Signum's case: holdouts are sealed at contract approval time, the engineer agent receives a filtered contract, and the proofpack is hashed against the original. This doesn't make it tamper-proof in all threat models, but it raises the bar beyond "the CI runner said pass."&lt;/p&gt;
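&lt;p&gt;The tamper-evidence part of that chain reduces to canonicalize-then-hash. A sketch of just that idea (production attestation wraps the digest in a DSSE signature and a trusted key, not a bare hash):&lt;/p&gt;

```python
import hashlib
import json

def seal(bundle):
    # Canonicalize (sorted keys, no whitespace) then hash; the digest is
    # recorded at contract-approval time, outside the agent's reach.
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify(bundle, sealed_digest):
    # Any post-hoc edit to the bundle changes the digest and fails the check.
    return seal(bundle) == sealed_digest
```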

&lt;p&gt;This is proof of conformance + proof of process. Not proof of correctness.&lt;/p&gt;

&lt;p&gt;What it explicitly does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That the spec itself is perfect&lt;/li&gt;
&lt;li&gt;That holdout checks cover every edge case&lt;/li&gt;
&lt;li&gt;That no unknown class of defect exists&lt;/li&gt;
&lt;li&gt;That LLM-judged checks equal formal verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Saying this out loud matters. The moment you claim more than you deliver, you're selling snake oil.&lt;/p&gt;

&lt;h2&gt;
  
  
  Holdouts: The Key Mechanism
&lt;/h2&gt;

&lt;p&gt;The most powerful idea in spec-gated delivery is holdout criteria - acceptance checks the implementing agent never sees.&lt;/p&gt;

&lt;p&gt;You write ten acceptance criteria. Three are marked holdout. The agent receives seven. It implements, writes tests, passes everything it can see. Then CI runs the holdout checks against the finished code.&lt;/p&gt;

&lt;p&gt;If the agent forgot to handle counter reset on window expiry, or missed the edge case where the input is empty, the holdout catches it - not because a reviewer spotted it, but because a criterion existed before implementation started.&lt;/p&gt;

&lt;p&gt;Important: holdout criteria must be consequences of the visible contract, not secretly added requirements. If the visible spec says "rate limit POST /api/tokens at 5/min," a holdout that checks counter reset after window expiry is a valid derivation. A holdout that adds a new endpoint is not - that's an undisclosed requirement.&lt;/p&gt;
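&lt;p&gt;Mechanically, the split is trivial - the governance around it is the hard part. A sketch with a hypothetical &lt;code&gt;holdout&lt;/code&gt; flag on each criterion:&lt;/p&gt;

```python
def split_contract(criteria):
    # Partition acceptance criteria: the implementing agent receives only
    # the visible ones; sealed holdouts run in CI after implementation.
    visible = [c for c in criteria if not c.get("holdout")]
    sealed = [c for c in criteria if c.get("holdout")]
    return visible, sealed
```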

&lt;p&gt;This is the difference between "review found a bug" and "the spec author anticipated the failure mode."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Boundaries
&lt;/h2&gt;

&lt;p&gt;Spec-gated delivery has real limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spec quality is the ceiling.&lt;/strong&gt; Bad specs produce false confidence. A spec gate that passes a weak contract is worse than no gate at all, because it creates an illusion of verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not everything is deterministically verifiable.&lt;/strong&gt; UX, performance under real load, security posture - these require human judgment or specialized tooling. The system must honestly label each criterion as &lt;code&gt;deterministic&lt;/code&gt;, &lt;code&gt;heuristic&lt;/code&gt;, or &lt;code&gt;manual&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Holdouts require domain expertise to write.&lt;/strong&gt; The value of a holdout is proportional to how well it anticipates failure modes. This is a human skill.&lt;/li&gt;
&lt;/ul&gt;
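&lt;p&gt;Honest labeling can be made mechanical. A sketch of the separation, using the three labels from above; the records and the policy function are illustrative, not Signum's API:&lt;/p&gt;

```python
# Separate what is proven from what is guessed, per criterion kind.
criteria = [
    {"id": "AC1", "kind": "deterministic", "passed": True},   # typecheck, named test
    {"id": "AC2", "kind": "heuristic",     "passed": True},   # LLM-judged review
    {"id": "AC3", "kind": "manual",        "passed": None},   # UX sign-off pending
]

proven  = [c["id"] for c in criteria if c["kind"] == "deterministic" and c["passed"]]
guessed = [c["id"] for c in criteria if c["kind"] == "heuristic" and c["passed"]]

# Only deterministic results count as proof; the rest is labeled, not laundered.
print(proven, guessed)  # ['AC1'] ['AC2']
```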

&lt;p&gt;The moat is not in AI code generation. It's not in AI review. It's in the verification and evidence layer: spec quality gates, holdout governance, deterministic verifier mapping, signed conformance artifacts, and policy engines that separate what's proven from what's guessed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Now
&lt;/h2&gt;

&lt;p&gt;Three mature streams converged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI coding is mass-market.&lt;/strong&gt; Copilot, Claude Code, Cursor - teams generate more code than they can review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract-driven workflows entered the market.&lt;/strong&gt; Kiro, Spec Kit, amp - the idea that specs should precede implementation is no longer academic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attestation infrastructure is maturing.&lt;/strong&gt; SLSA, Sigstore, in-toto provide useful primitives for signed provenance. Key management and verifier trust remain hard problems, but the building blocks exist for teams willing to invest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But there's no standard open stack that assembles spec gating, holdout governance, and signed evidence into a single delivery pipeline where the spec is the gate, not the diff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes
&lt;/h2&gt;

&lt;p&gt;When spec-gated delivery works, code review stops being the primary truth and becomes a secondary audit. The PR is still useful - for knowledge sharing, for catching spec gaps, for mentoring. But the trust decision moves earlier: to the moment the spec is approved and the holdouts are sealed.&lt;/p&gt;

&lt;p&gt;This is the most important shift. Not "better AI review." Not "more reviewers." A different trust model entirely.&lt;/p&gt;

&lt;p&gt;The formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Approved intent -&amp;gt; blinded execution -&amp;gt; deterministic verification -&amp;gt; signed evidence -&amp;gt; human/CI decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the evidence artifact says the code conforms to the approved spec, including holdout criteria the agent couldn't see, and the attestation chain is intact - that's a stronger signal than any number of review comments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;We built this into &lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;Signum&lt;/a&gt;, a Claude Code plugin. Spec quality gate, holdout scenarios, multi-model audit, signed proofpack. It's early and opinionated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin marketplace add heurema/emporium
claude plugin &lt;span class="nb"&gt;install &lt;/span&gt;signum@emporium
/signum &lt;span class="s2"&gt;"your task description"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting part isn't the tool. It's the question: if you could gate every AI-generated change on a pre-approved, deterministically verified spec - would you still put PR review at the center of your trust model?&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://slsa.dev/" rel="noopener noreferrer"&gt;SLSA - Supply-chain Levels for Software Artifacts&lt;/a&gt; - framework for software supply chain integrity&lt;/li&gt;
&lt;li&gt;&lt;a href="https://in-toto.io/" rel="noopener noreferrer"&gt;in-toto - A framework for securing the software supply chain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;Signum on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/posts/en/signum-contract-first-ai-dev"&gt;The Contract Is the Context&lt;/a&gt; - previous post on Signum's contract-first pipeline&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skill7.dev/development/signum" rel="noopener noreferrer"&gt;skill7.dev/development/signum&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>contextengineering</category>
      <category>verification</category>
      <category>agents</category>
      <category>softwaredelivery</category>
    </item>
    <item>
      <title>AI Writes Code. Where Is the Proof?</title>
      <dc:creator>Vitaly D.</dc:creator>
      <pubDate>Thu, 05 Mar 2026 08:47:23 +0000</pubDate>
      <link>https://dev.to/t3chn/ai-writes-code-where-is-the-proof-13l8</link>
      <guid>https://dev.to/t3chn/ai-writes-code-where-is-the-proof-13l8</guid>
      <description>&lt;p&gt;AI generated a function in seconds. Three models reviewed it. All said "looks good." Question: where is the artifact that confirms this?&lt;/p&gt;

&lt;p&gt;Not "a model approved it" - where is the machine-readable evidence of &lt;em&gt;what&lt;/em&gt; was checked, &lt;em&gt;against what&lt;/em&gt;, and &lt;em&gt;with what result&lt;/em&gt;? In the software supply chain, this artifact is called an attestation. For AI-generated code, it doesn't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proofpack: what's inside
&lt;/h2&gt;

&lt;p&gt;Here's what &lt;code&gt;proofpack.json&lt;/code&gt; looks like after a &lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;Signum&lt;/a&gt; run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-04T14:23:07Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"runId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"signum-2026-03-04-a7f3c1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AUTO_OK"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"overall"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"auditChain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contractSha256"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e3b0c44298fc1c14..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"approvedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-04T14:01:12Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"baseCommit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8a4f2dc"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sha256"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a1b2c3d4..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fullSha256"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"f5e6d7c8..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"diff"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sha256"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"9f8e7d6c..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sizeBytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4820&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mechanic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"holdout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reviews"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"claude"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"codex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gemini"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"present"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two hashes for the contract - &lt;code&gt;sha256&lt;/code&gt; (redacted version without holdout criteria) and &lt;code&gt;fullSha256&lt;/code&gt; (original). Base commit captured before implementation starts. Three independent reviews. Holdout results separate, because the engineer never saw those criteria.&lt;/p&gt;
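&lt;p&gt;How the two contract hashes can be derived, sketched with &lt;code&gt;hashlib&lt;/code&gt;. The canonicalization (sorted keys, compact separators) is an assumption for the example; the post does not specify Signum's actual rules.&lt;/p&gt;

```python
# One contract, two digests: the redacted version the engineer saw,
# and the full version including holdout criteria.
import hashlib, json

full_contract = {"criteria": ["visible-1", "visible-2", "holdout-1"]}
redacted = {"criteria": [c for c in full_contract["criteria"] if not c.startswith("holdout")]}

def digest(obj):
    # Canonical JSON so the hash is stable across serializations.
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(data).hexdigest()

full_sha = digest(full_contract)   # "fullSha256" in the proofpack
redacted_sha = digest(redacted)    # "sha256": what the engineer actually saw
assert full_sha != redacted_sha
```

&lt;p&gt;Publishing both digests lets an auditor confirm the engineer worked from a strict subset of the approved contract without revealing the holdouts themselves.&lt;/p&gt;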

&lt;p&gt;CI gates on this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DECISION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.decision'&lt;/span&gt; .signum/proofpack.json&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DECISION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"AUTO_OK"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Signum: &lt;/span&gt;&lt;span class="nv"&gt;$DECISION&lt;/span&gt;&lt;span class="s2"&gt; - blocking merge"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No need to parse three models' logs. One file, one field, deterministic gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broken present
&lt;/h2&gt;

&lt;p&gt;The AI code review industry operates at one level - the diff. A model looks at a patch and says what it thinks. The problem isn't model quality; it's the absence of a definition for "correct."&lt;/p&gt;

&lt;p&gt;CodeRabbit's own measurements&lt;sup id="fnref1"&gt;1&lt;/sup&gt; show 46% useful comments. Copilot Code Review, tested against 117 files with known vulnerabilities, found zero security issues&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. This isn't an indictment of specific tools - it's a consequence of architecture: review without a contract is bounded by what the reviewer considers "reasonable."&lt;/p&gt;

&lt;p&gt;The problem runs deeper. Even when a model finds a bug, the result is a PR comment. Not a machine-readable artifact, not a verification chain, not something CI can gate on. Between "a model left a comment" and "the code is verified" lies a chasm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four layers: how a proofpack is built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ctxt.dev/posts/en/signum-contract-first-ai-dev/" rel="noopener noreferrer"&gt;In the previous post&lt;/a&gt; I covered the contract as the source of truth. Here - how four layers together produce a verifiable artifact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CONTRACT.&lt;/strong&gt; The spec is formalized before implementation begins. Graded across 6 dimensions (A-F). Codex and Gemini validate for gaps. Holdout scenarios are generated - hidden acceptance criteria the engineer won't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXECUTE.&lt;/strong&gt; The engineer works with &lt;code&gt;contract-engineer.json&lt;/code&gt;, from which holdout criteria are physically removed - not hidden by instruction, but deleted from the file. Baseline (lint, typecheck, test names) is captured before the first line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AUDIT.&lt;/strong&gt; The Mechanic runs deterministic checks with zero LLM: linter, typechecker, new test failures by name (not exit code). Then Claude, Codex, and Gemini review the diff independently, in parallel, without seeing each other's assessments. Holdout criteria run against the result. The Synthesizer aggregates: deterministic policy + confidence score.&lt;/p&gt;
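&lt;p&gt;"New test failures by name, not exit code" is a set difference against the baseline captured before implementation. A minimal sketch; the test names are invented and real failure sets would come from the runner's report:&lt;/p&gt;

```python
# Baseline failures captured before the first line of code was written.
baseline_failures = {"test_expired_token"}                        # known-red already
current_failures  = {"test_expired_token", "test_rate_limit_reset"}

# Only failures absent from the baseline are blamed on this change.
new_failures = current_failures - baseline_failures
assert new_failures == {"test_rate_limit_reset"}
```

&lt;p&gt;A plain non-zero exit code would flag the pre-existing failure too; diffing by name attributes only the regressions this change introduced.&lt;/p&gt;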

&lt;p&gt;&lt;strong&gt;PACK.&lt;/strong&gt; All artifacts embed into &lt;code&gt;proofpack.json&lt;/code&gt;. SHA-256 chains: approved contract → timestamp → base commit → diff → audit results. This isn't a log - it's an attestation.&lt;/p&gt;
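&lt;p&gt;The chaining idea can be shown in miniature. Field values echo the proofpack excerpt above, but the chaining scheme itself is illustrative, not Signum's wire format:&lt;/p&gt;

```python
# Each link binds the digest of the previous one, so editing any earlier
# artifact invalidates every digest after it: tamper-evident, not tamper-proof.
import hashlib

def sha256(text):
    return hashlib.sha256(text.encode()).hexdigest()

contract_sha = sha256("approved contract text")
diff_sha = sha256("unified diff of the change")

link1 = sha256(contract_sha + "2026-03-04T14:01:12Z")  # approval event
link2 = sha256(link1 + "8a4f2dc")                      # base commit
link3 = sha256(link2 + diff_sha)                       # implementation
print(link3[:12])
```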

&lt;p&gt;Key decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data-level blinding&lt;/strong&gt;, not instruction-level. The engineer cannot infer holdout criteria from context or file structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model audit&lt;/strong&gt;: 3 vendors, 3 independent assessments. Not one model checking itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible artifacts&lt;/strong&gt; for humans and CI, not trust in model judgment. The proofpack exists as a file - you can inspect it, archive it, audit it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Threat model: what proofpack protects and what it doesn't
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Protects against:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementation doesn't match spec → holdout criteria catch it&lt;/li&gt;
&lt;li&gt;Rubber-stamp review (one model checking itself) → 3 independent reviewers&lt;/li&gt;
&lt;li&gt;No audit trail → SHA-256 chain with timestamps&lt;/li&gt;
&lt;li&gt;Optimizing for known tests → data-level blinding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Does not protect against:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad spec. Garbage in - verified garbage out. The quality gate (A-F) reduces risk but doesn't eliminate it.&lt;/li&gt;
&lt;li&gt;Model collusion. Theoretically possible. 3 vendors (Anthropic, OpenAI, Google) mitigate but don't exclude.&lt;/li&gt;
&lt;li&gt;Formal correctness. A proofpack is process integrity, not mathematical proof. SLSA doesn't prove your code is bug-free either - it proves the build wasn't tampered with.&lt;/li&gt;
&lt;li&gt;Malicious spec author. If a human intentionally hides requirements, the system won't help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More precisely: a proofpack is not proof of correctness, but proof of process. The distinction matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related work
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://slsa.dev" rel="noopener noreferrer"&gt;SLSA&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Build provenance attestation&lt;/td&gt;
&lt;td&gt;No AI code generation awareness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://in-toto.io" rel="noopener noreferrer"&gt;in-toto&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Software supply chain layout&lt;/td&gt;
&lt;td&gt;Build-time only, no spec → code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sigstore.dev" rel="noopener noreferrer"&gt;Sigstore&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Code signing + transparency log&lt;/td&gt;
&lt;td&gt;Identity, not correctness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeRabbit&lt;/td&gt;
&lt;td&gt;AI diff review&lt;/td&gt;
&lt;td&gt;No contract, holdouts, proof artifact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot Code Review&lt;/td&gt;
&lt;td&gt;AI PR review&lt;/td&gt;
&lt;td&gt;Diff-level, single model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qodo&lt;/td&gt;
&lt;td&gt;AI testing + compliance&lt;/td&gt;
&lt;td&gt;Closer, but no multi-model audit or proofpack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Spec Kit&lt;/td&gt;
&lt;td&gt;Spec-as-input for Copilot&lt;/td&gt;
&lt;td&gt;Spec → code, but no verification loop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What's genuinely new: the four-layer chain from spec through blinded execution and adversarial audit to a tamper-evident artifact. No existing tool connects all four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proof artifacts - the missing primitive
&lt;/h2&gt;

&lt;p&gt;The software supply chain industry spent years making builds verifiable. SLSA, in-toto, Sigstore - all address the same principle: don't trust, verify, and leave an artifact for audit.&lt;/p&gt;

&lt;p&gt;AI code generation gets by without this. A model writes code, another model leaves a PR comment, a human clicks merge. Nothing machine-readable remains. Proofpack is one implementation; the pattern matters more than the tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/heurema/signum" rel="noopener noreferrer"&gt;signum on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ctxt.dev/posts/en/signum-contract-first-ai-dev/" rel="noopener noreferrer"&gt;The contract is the context&lt;/a&gt; - previous post&lt;/li&gt;
&lt;li&gt;&lt;a href="https://slsa.dev/spec/v1.0/" rel="noopener noreferrer"&gt;SLSA specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://in-toto.io" rel="noopener noreferrer"&gt;in-toto framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sigstore.dev" rel="noopener noreferrer"&gt;Sigstore&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;CodeRabbit, "How We Measure Review Quality", 2025. Self-reported metric from their blog. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Copilot Code Review test against 117 files containing known vulnerabilities (SQL injection, XSS, command injection). None of the vulnerabilities were flagged. Results depend on configuration and sample. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>contextengineering</category>
      <category>claudecode</category>
      <category>verification</category>
      <category>proofpack</category>
    </item>
  </channel>
</rss>
