<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Werner Kasselman</title>
    <description>The latest articles on DEV Community by Werner Kasselman (@wernerk_au).</description>
    <link>https://dev.to/wernerk_au</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891657%2F74cf21db-2405-4ca2-a8a1-cd612b022882.png</url>
      <title>DEV Community: Werner Kasselman</title>
      <link>https://dev.to/wernerk_au</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wernerk_au"/>
    <language>en</language>
    <item>
      <title>What's new in llm-cli-gateway</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Tue, 19 May 2026 04:27:38 +0000</pubDate>
      <link>https://dev.to/wernerk_au/whats-new-in-llm-cli-gateway-58b8</link>
      <guid>https://dev.to/wernerk_au/whats-new-in-llm-cli-gateway-58b8</guid>
      <description>&lt;p&gt;A few weeks ago I wrote &lt;a href="https://medium.com/@wernerk/why-cli-wrapping-beats-api-proxying-for-multi-llm-development-1ddd492c7153" rel="noopener noreferrer"&gt;Why CLI Wrapping Beats API Proxying for Multi-LLM Development&lt;/a&gt;, the case for spawning &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, and &lt;code&gt;gemini&lt;/code&gt; as child processes instead of proxying to their APIs. Three things have changed since I published that piece. Two of them fix real limitations I named at the time, and one is a new capability that I wish had been there from the start. Together they seemed worth a follow-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex sessions are now real, not bookkeeping
&lt;/h2&gt;

&lt;p&gt;In the original post I said llm-cli-gateway uses real CLI continuity flags, "&lt;code&gt;--continue&lt;/code&gt; and &lt;code&gt;--resume&lt;/code&gt;, not bookkeeping". That was true for Claude and Gemini. For Codex it was, frankly, not quite there.&lt;/p&gt;

&lt;p&gt;Codex did not have a documented resume mechanism at the time. So when you opened a Codex session through the gateway, the session record was real (UUID, created/lastUsed timestamps, the active-session-per-CLI invariant) but the &lt;code&gt;codex&lt;/code&gt; process itself started fresh on every request. The gateway tagged subsequent requests as belonging to a session, and you could see the session in &lt;code&gt;session_list&lt;/code&gt;, but Codex itself knew nothing about it.&lt;/p&gt;

&lt;p&gt;Codex shipped &lt;code&gt;exec resume &amp;lt;session-id&amp;gt;&lt;/code&gt; and &lt;code&gt;exec resume --last&lt;/code&gt;, and the gateway now wires both. If you pass a real Codex session UUID (the kind that lives in &lt;code&gt;~/.codex/sessions/&lt;/code&gt;), &lt;code&gt;codex_request&lt;/code&gt; invokes &lt;code&gt;exec resume&lt;/code&gt; and you get genuine continuity, the same tool-use history, file context, and partial work the CLI itself preserves. &lt;code&gt;resumeLatest: true&lt;/code&gt; pins to the most recent session without you having to look the UUID up.&lt;/p&gt;

&lt;p&gt;Two caveats worth naming up front. First, only real Codex UUIDs are accepted; gateway-issued &lt;code&gt;gw-*&lt;/code&gt; IDs are rejected on resume, because there is no Codex-side session for them to attach to. Second, &lt;code&gt;--full-auto&lt;/code&gt; is dropped on resume, which is a Codex constraint and not something the gateway can paper over. The trade-off is reasonable: you keep the continuity but need to restate the approval policy.&lt;/p&gt;
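
&lt;p&gt;As a rough illustration of the first caveat, here is a minimal Python sketch of how a resume command line could be assembled. The function name &lt;code&gt;codex_resume_args&lt;/code&gt; and its parameters are hypothetical and are not the gateway's actual code; only &lt;code&gt;exec resume&lt;/code&gt;, &lt;code&gt;--last&lt;/code&gt;, and the &lt;code&gt;gw-*&lt;/code&gt; rejection come from the behaviour described above, and prompt handling is left out.&lt;/p&gt;

```python
import uuid


def codex_resume_args(session_id=None, resume_latest=False):
    """Build an argv list for resuming a Codex session (hypothetical helper)."""
    if resume_latest:
        # Corresponds to resumeLatest: true, which pins to the most recent session.
        return ["codex", "exec", "resume", "--last"]
    if session_id is None:
        raise ValueError("need a Codex session UUID or resume_latest=True")
    if session_id.startswith("gw-"):
        # Gateway-issued IDs have no Codex-side session to attach to.
        raise ValueError("gateway-issued gw-* IDs cannot be resumed")
    uuid.UUID(session_id)  # raises ValueError for anything that is not a UUID
    return ["codex", "exec", "resume", session_id]
```

&lt;p&gt;Validating before spawning means a bad ID fails fast inside the gateway instead of surfacing as an opaque CLI error.&lt;/p&gt;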

&lt;p&gt;Codex now sits where Claude and Gemini sit. The bullet that said "Session continuity using real CLI flags, not bookkeeping" is now true for all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grok makes four, on purpose
&lt;/h2&gt;

&lt;p&gt;xAI shipped an official Grok CLI (the &lt;code&gt;grok-build&lt;/code&gt; TUI) and I added it as the fourth provider. The tools mirror the others one-for-one, &lt;code&gt;grok_request&lt;/code&gt; and &lt;code&gt;grok_request_async&lt;/code&gt;, sessions through &lt;code&gt;--resume&lt;/code&gt; / &lt;code&gt;--continue&lt;/code&gt;, model registry entries, self-update via &lt;code&gt;grok update&lt;/code&gt;, the same circuit-breaker and approval-gate plumbing, the same flight recorder, the same metrics. Auth follows the same shape, a prior &lt;code&gt;grok login&lt;/code&gt; (OAuth) or a &lt;code&gt;GROK_CODE_XAI_API_KEY&lt;/code&gt; environment variable, with &lt;code&gt;GROK_DEFAULT_MODEL&lt;/code&gt;, &lt;code&gt;GROK_MODELS&lt;/code&gt;, and &lt;code&gt;GROK_MODEL_ALIASES&lt;/code&gt; all honoured.&lt;/p&gt;

&lt;p&gt;The interesting question is not whether to add Grok (the parity work is mechanical) but why. The case is consensus diversity.&lt;/p&gt;

&lt;p&gt;Claude, Codex, and Gemini cover Anthropic, OpenAI, and Google. That lineup is well-suited for parallel review work, but it is three of the same kind of organisation, three model families that share a lot of training data lineage and a lot of post-training tendencies. When you ask all three to red-team the same change, the disagreements are real, but the agreements are sometimes less informative than they look, because you are sampling three points from a narrower distribution than the org names suggest.&lt;/p&gt;

&lt;p&gt;Grok's training lineage sits outside the OpenAI/Anthropic/Google adjacent triangle. So when a four-way consensus check returns 4/4 agreement on a security finding, the signal is stronger than 3/3. And when Grok dissents alone, that is a data point worth reading, not a vote to discard. The value is not that Grok is better at reviews than the others (I do not believe that, and the workflows do not assume it). The value is independence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Durable job results and auto-dedup
&lt;/h2&gt;

&lt;p&gt;This is the change that came from running the gateway against real work for a few months and watching the same failure happen over and over.&lt;/p&gt;

&lt;p&gt;The original architecture had a soft spot. Async jobs run long, sometimes longer than the orchestrating agent's polling window. The agent gives up, reissues the request, and the whole Codex or Claude invocation starts over. The CLI work you just paid 90 seconds for is thrown away and replaced with a second 90-second run that does exactly the same thing. I lost track of how much wall time this cost me before I sat down and fixed it properly.&lt;/p&gt;

&lt;p&gt;The fix is two pieces, both wired into the existing flight recorder SQLite database at &lt;code&gt;~/.llm-cli-gateway/logs.db&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every async job persists&lt;/strong&gt; to a new &lt;code&gt;jobs&lt;/code&gt; table on every state transition (start, throttled output flush, completion). &lt;code&gt;llm_job_status&lt;/code&gt; and &lt;code&gt;llm_job_result&lt;/code&gt; transparently fall back to the durable store when the in-memory job is gone, so a caller can collect a result regardless of how long ago the work finished. Retention defaults to 30 days, configurable via &lt;code&gt;LLM_GATEWAY_JOB_RETENTION_DAYS&lt;/code&gt;. Jobs still "running" when the gateway stops are marked &lt;code&gt;orphaned&lt;/code&gt; on next boot, and the partial output stays readable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identical requests within a dedup window short-circuit&lt;/strong&gt; onto the existing running or completed job. The default window is 1 hour, configurable via &lt;code&gt;LLM_GATEWAY_DEDUP_WINDOW_MS&lt;/code&gt;. The "polling timed out, reissue, run it all again" loop is structurally gone. For the case where the prior result is actually wrong and you want a fresh invocation rather than a re-attach, every request tool accepts &lt;code&gt;forceRefresh: true&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
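
&lt;p&gt;To make the dedup idea concrete, here is a minimal Python sketch using an in-memory SQLite database. The column names, the &lt;code&gt;dedup_key&lt;/code&gt; hashing scheme, and the &lt;code&gt;find_reusable_job&lt;/code&gt; helper are all assumptions for illustration; the gateway's real &lt;code&gt;jobs&lt;/code&gt; schema and matching rules may differ.&lt;/p&gt;

```python
import hashlib
import json
import sqlite3

# Hypothetical schema: the post names a jobs table but not its columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id TEXT PRIMARY KEY,
    dedup_key TEXT NOT NULL,
    status TEXT NOT NULL,
    created_ms INTEGER NOT NULL
)
"""


def dedup_key(cli, prompt, options):
    """Hash a canonical form of the request so identical requests share a key."""
    canonical = json.dumps(
        {"cli": cli, "prompt": prompt, "options": options}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_reusable_job(db, key, now_ms, window_ms, force_refresh=False):
    """Return the newest running or completed job id inside the window, else None."""
    if force_refresh:
        # The caller explicitly wants a fresh invocation, not a re-attach.
        return None
    row = db.execute(
        "SELECT id FROM jobs WHERE dedup_key = ? "
        "AND status IN ('running', 'completed') "
        "AND created_ms BETWEEN ? AND ? "
        "ORDER BY created_ms DESC LIMIT 1",
        (key, now_ms - window_ms, now_ms),
    ).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
key = dedup_key("codex", "review this diff", {"model": "default"})
db.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", ("job-1", key, "running", 1000))
```

&lt;p&gt;With a one-hour window (3,600,000 ms), a second identical request shortly after the first would re-attach to &lt;code&gt;job-1&lt;/code&gt; instead of spawning a new CLI process.&lt;/p&gt;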

&lt;p&gt;The change moves the gateway closer to what I wanted it to be from the start, a durable result-collection layer for CLI agents rather than a thin process spawner that hopes the caller is still listening when the CLI finishes. 20 new tests cover persistence, dedup, restart-orphan, retention, and Grok parity, and the full suite passes at 322 tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about the original argument
&lt;/h2&gt;

&lt;p&gt;Nothing, actually. The thesis from the first post still stands, that CLI wrapping gives you capabilities (real file access, real test execution, real session state) that API proxying fundamentally cannot. These three updates strengthen the same case rather than contradict it.&lt;/p&gt;

&lt;p&gt;What they fix is the gap between the thesis and the implementation. Codex sessions now carry the same real-CLI continuity as Claude and Gemini. The consensus pattern now has a fourth, vendor-independent voice. And the long-running-job failure mode that always threatened to undercut the whole CLI-spawning approach is gone, because the result lives on disk regardless of who is or is not still polling for it.&lt;/p&gt;

&lt;p&gt;If you are evaluating llm-cli-gateway against an API proxy, the comparison is slightly different now than it was in March, on three specific axes. That seemed worth writing down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Mistral shipped Mistral Vibe, their official open-source CLI coding agent, powered by Devstral 2. I will be adding it next for even more diversity!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;llm-cli-gateway is MIT licensed. npm: &lt;a href="https://npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt; | GitHub: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;verivus-oss/llm-cli-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>cli</category>
    </item>
    <item>
      <title>Here's what stopped breaking, when you make LLM agents author in two formats</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Wed, 06 May 2026 04:39:27 +0000</pubDate>
      <link>https://dev.to/wernerk_au/i-make-llm-agents-author-in-two-formats-heres-what-stopped-breaking-4i0j</link>
      <guid>https://dev.to/wernerk_au/i-make-llm-agents-author-in-two-formats-heres-what-stopped-breaking-4i0j</guid>
      <description>&lt;p&gt;LLM agents will happily produce a thousand lines of plausible Markdown describing work that doesn't compile, isn't tested, and contradicts a decision the same agent wrote down two files earlier. If you want to review their output without re-reading every paragraph, some of the work product has to be machine-checkable.&lt;/p&gt;

&lt;p&gt;You also can't push everything into a schema. Intent, tradeoffs, the alternative you rejected: that material dies in JSON. The interesting question is the boundary. What belongs in prose, what belongs in structure, and what falls out when you draw the line in the wrong place.&lt;/p&gt;

&lt;p&gt;I landed on this after running it for real. I introduced the runtime layer later, when I expanded this to multiple repos and saw that the flat files had stopped scaling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The split
&lt;/h2&gt;

&lt;p&gt;Every unit of agent work produces three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Narrative.&lt;/strong&gt; Markdown specs, designs, plans, notes. The human-readable record: intent, tradeoffs, what was rejected, context a future reader needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure.&lt;/strong&gt; TOML files encoding the work itself: a dependency DAG, a traceability map (&lt;code&gt;INT → FEAT → REQ → DEC → IMP → CODE → TEST → OUT&lt;/code&gt;), and a review-readiness bundle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence.&lt;/strong&gt; Review artifacts that answer &lt;em&gt;"is this actually reviewable, and does the claim match the proof?"&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Markdown carries what structure can't. Intent and reasoning. Why the design has this shape. What was rejected. What the author worried about. Schema fields can't express ambivalence. Specs change during brainstorm and review, and prose is the right medium for that conversation; forcing every change through schema churn throttles thinking. Six months later, the reviewer needs narrative, not a graph.&lt;/p&gt;

&lt;p&gt;TOML carries what prose can't reliably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine-checkable invariants. &lt;code&gt;blocks&lt;/code&gt; is the exact inverse of &lt;code&gt;depends_on&lt;/code&gt;. Every &lt;code&gt;ART:&lt;/code&gt; has exactly one producer. Every &lt;code&gt;consumes&lt;/code&gt; matches a &lt;code&gt;produces&lt;/code&gt;. These are enforced by validators, not by hoping a human noticed.&lt;/li&gt;
&lt;li&gt;Graph queries. &lt;em&gt;What's ready to start? What's the critical path? Which units conflict on files? Which &lt;code&gt;REQ:&lt;/code&gt; has no downstream &lt;code&gt;TEST:&lt;/code&gt;?&lt;/em&gt; These are queries over structure, not reading comprehension.&lt;/li&gt;
&lt;li&gt;Stable identifiers. Prose drifts. &lt;code&gt;U07a&lt;/code&gt;, &lt;code&gt;REQ:auth-001&lt;/code&gt;, &lt;code&gt;ART:schema-v2&lt;/code&gt; don't.&lt;/li&gt;
&lt;li&gt;Diff-readable state. A status transition is a one-line diff, not a paragraph to re-read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frame the split as narrative vs. structure, each in the medium that protects its own invariants. Calling it "docs vs. config" gets it wrong because both formats are doing real review-time work; one of them just gets to be checked by &lt;code&gt;python -m&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TOML and not YAML or JSON
&lt;/h2&gt;

&lt;p&gt;I picked TOML deliberately. YAML loses on parse ambiguity. The &lt;code&gt;country: NO&lt;/code&gt; problem (Norway gets parsed as the boolean &lt;code&gt;false&lt;/code&gt; under YAML 1.1) is real and gets worse when an LLM is generating the file under time pressure. JSON loses on the human-authoring axis: trailing commas explode, every string needs quotes, comments are forbidden. TOML parses unambiguously, reads cleanly enough to author and review by hand, and ships in the Python stdlib (&lt;code&gt;tomllib&lt;/code&gt; since 3.11), so my validators stay dependency-light.&lt;/p&gt;

&lt;p&gt;For agent-authored, human-reviewed structure, TOML is the boring choice. It wins because it's boring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three review pillars came from failure data
&lt;/h2&gt;

&lt;p&gt;The review-readiness package didn't exist on day one. I added it after running an iteration-chain analysis across seven real review cycles and finding that almost every re-review came from one of three deficiencies, in the same order, over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing prerequisite artifacts.&lt;/strong&gt; Review blocked not on conceptual disagreement but on the absence of required planning docs, cross-links, prior diagrams, or test plans. The reviewer couldn't judge readiness because the artifact class wasn't actually complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous contracts.&lt;/strong&gt; Ordering rules, normalization, precedence, fallback, schema shape: reviewers had to infer semantics the author never wrote down. Every inference round added a re-review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overclaimed completeness.&lt;/strong&gt; "Ready for implementation." "Production ready." "All findings resolved." Unbacked by proof, or backed by proof narrower than the claim. Each one cost another round.&lt;/p&gt;

&lt;p&gt;Three failure modes, three artifacts. A &lt;em&gt;readiness gate&lt;/em&gt; answers whether the artifact class is complete enough to review at all, and blocks opening a review until it passes. A &lt;em&gt;contract declaration&lt;/em&gt; makes behavioral semantics explicit up front so reviewers never have to invent them. An &lt;em&gt;evidence matrix&lt;/em&gt; binds every strong claim to a concrete proof artifact, a stated scope, and a list of known exclusions; a claim broader than its evidence fails validation.&lt;/p&gt;
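
&lt;p&gt;As a sketch of the evidence-matrix rule, the check below treats a claim and its proof as sets of covered areas and fails any claim whose scope is wider than its proof. The row fields (&lt;code&gt;claim_scope&lt;/code&gt;, &lt;code&gt;proof_scope&lt;/code&gt;, and so on) and both function names are invented for this example and are not the actual template.&lt;/p&gt;

```python
# Hypothetical evidence-matrix rows; field names are illustrative only.
MATRIX = [
    {
        "claim": "auth flow tested",
        "claim_scope": {"login", "logout"},
        "proof": "tests/test_auth.py",
        "proof_scope": {"login", "logout"},
        "exclusions": ["SSO"],
    },
    {
        "claim": "production ready",
        "claim_scope": {"login", "billing"},
        "proof": "tests/test_auth.py",
        "proof_scope": {"login"},
        "exclusions": [],
    },
]


def overclaims(matrix):
    """Return (claim, uncovered areas) for claims not fully backed by proof."""
    failed = []
    for row in matrix:
        uncovered = row["claim_scope"] - row["proof_scope"]
        if not row["proof"] or uncovered:
            failed.append((row["claim"], sorted(uncovered)))
    return failed


def exit_code(matrix):
    """0 only when every claim is backed; any overclaim is a failure."""
    return 1 if overclaims(matrix) else 0
```

&lt;p&gt;Here the second row fails because "billing" is claimed but not proven, so the right fix is to narrow the claim or add proof.&lt;/p&gt;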

&lt;p&gt;The workflow is strict and intentionally rude. Fill the readiness gate first; if blocked, don't open review. Fill the contract second; vague statements get rejected. Fill the evidence matrix last; if a claim can't be backed by proof and bounded exclusions, downgrade the claim. Don't stretch the proof.&lt;/p&gt;

&lt;p&gt;The validator's exit code is authoritative. No human override of a failed validation without updating the file to pass cleanly. I made this rule on purpose, because &lt;em&gt;"it's close enough"&lt;/em&gt; was the phrase that caused most of the re-reviews I measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where flat TOML stopped working
&lt;/h2&gt;

&lt;p&gt;Flat TOML works great for authoring and validation. It stopped working the moment agents started mutating state during execution.&lt;/p&gt;

&lt;p&gt;The hand-calculated &lt;code&gt;[computed]&lt;/code&gt; sections were the first thing to rot. Critical path, conflict groups, progress percentages: all derived values, all authored by hand, all stale the moment a unit advanced. A human spots the inconsistency on re-read. An agent doesn't.&lt;/p&gt;

&lt;p&gt;Editing &lt;code&gt;status = "in_progress"&lt;/code&gt; in a text file leaves no record of when, by whom, from what prior state, against what evidence. For process control, "who moved this to done, and on what proof?" is not optional.&lt;/p&gt;

&lt;p&gt;There was no programmatic query layer either. &lt;em&gt;"Which tier-1 units are runnable right now?"&lt;/em&gt; required parsing TOML, walking the graph in Python, and rebuilding the same derivations every time.&lt;/p&gt;

&lt;p&gt;And flat files don't compose across a fleet. Once more than one repo is under the same policy regime, per-repo TOML is the wrong shape for fleet-wide gating, policy packs, exception lifecycles, and release trains.&lt;/p&gt;

&lt;p&gt;So I added a runtime layer, additively. The templates and validators didn't change.&lt;/p&gt;

&lt;p&gt;A per-repository runtime imports a filled TOML file once. After that, an embedded SurrealDB is the source of truth. Status transitions go through a typed API with validation. Every change persists with timestamps and actor identity. Computed values become live queries instead of hand-edited fields. You can still export a TOML snapshot for human review, but it's a derived artifact, not the authority.&lt;/p&gt;

&lt;p&gt;A fleet-wide control plane (FastAPI + Postgres) handles policy packs, signed snapshot intake, exception lifecycles, and release-train readiness across many repos. There's no flat-file counterpart; the multi-repo problem just isn't expressible in per-repo files.&lt;/p&gt;

&lt;p&gt;The practical rule: TOML is the authoring medium and the interchange format. The database is the runtime authority. The TOML file you imported is stale from the first state transition onward. Treat it like a git tag — a snapshot in time, not live state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually get
&lt;/h2&gt;

&lt;p&gt;Four things, none of which prose-only or structure-only would deliver alone.&lt;/p&gt;

&lt;p&gt;Parallel agent execution without stepping on each other, because the DAG encodes &lt;code&gt;depends_on&lt;/code&gt;, &lt;code&gt;blocks&lt;/code&gt;, and &lt;code&gt;files_modify&lt;/code&gt; conflict groups explicitly. Agents pick runnable units from the same layer and the system knows who may run concurrently.&lt;/p&gt;

&lt;p&gt;Traceability from intent to test. Every requirement has a downstream realization path through implementation, code, and test. Unverified requirements and unmapped code surface as computed gaps in a query, not as gut feeling six weeks into review.&lt;/p&gt;

&lt;p&gt;Reviews that fail at the right boundary. Readiness gates block un-reviewable work before a reviewer sees it. Explicit contracts stop the semantic-inference spiral. Evidence matrices stop overclaimed completeness from reaching review at all.&lt;/p&gt;

&lt;p&gt;State that is queryable, auditable, versioned, and composable across repos. Single-repo: &lt;em&gt;"what's ready now?"&lt;/em&gt; in one query. Fleet-wide: &lt;em&gt;"is this release train green across every repo under policy?"&lt;/em&gt; — also one query, against the control plane.&lt;/p&gt;
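
&lt;p&gt;The first of those four points, picking runnable units that will not collide on files, can be sketched in plain Python. The unit dictionaries and the greedy batching strategy are assumptions for illustration; the runtime described above answers this with database queries instead.&lt;/p&gt;

```python
# Hypothetical unit records using the field names mentioned in the post.
UNITS = {
    "U01": {"status": "done", "depends_on": [], "files_modify": ["a.py"]},
    "U02": {"status": "pending", "depends_on": ["U01"], "files_modify": ["b.py"]},
    "U03": {"status": "pending", "depends_on": ["U01"], "files_modify": ["b.py", "c.py"]},
    "U04": {"status": "pending", "depends_on": ["U02"], "files_modify": ["d.py"]},
}


def runnable_units(units):
    """Pending units whose dependencies are all done, in name order."""
    done = {name for name, unit in units.items() if unit["status"] == "done"}
    return [
        name
        for name, unit in sorted(units.items())
        if unit["status"] == "pending" and set(unit["depends_on"]).issubset(done)
    ]


def concurrent_batch(units, ready):
    """Greedily pick ready units whose files_modify sets do not overlap."""
    claimed = set()
    batch = []
    for name in ready:
        files = set(units[name]["files_modify"])
        if files.isdisjoint(claimed):
            batch.append(name)
            claimed |= files
    return batch


ready = runnable_units(UNITS)
batch = concurrent_batch(UNITS, ready)
```

&lt;p&gt;In this sample, &lt;code&gt;U02&lt;/code&gt; and &lt;code&gt;U03&lt;/code&gt; are both ready, but only one of them may run at a time because both modify &lt;code&gt;b.py&lt;/code&gt;.&lt;/p&gt;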

&lt;h2&gt;
  
  
  Operating rules
&lt;/h2&gt;

&lt;p&gt;Distilled from getting this wrong before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Author narrative in Markdown. Author structure in TOML. Don't mix.&lt;/li&gt;
&lt;li&gt;Validator exit code 0 is the only pass signal. No manual override.&lt;/li&gt;
&lt;li&gt;Don't edit state fields by hand once they're in the runtime. Use the API.&lt;/li&gt;
&lt;li&gt;Don't claim "complete," "production-ready," or "all findings resolved" without an evidence matrix. If the matrix is thin, the claim is wrong.&lt;/li&gt;
&lt;li&gt;When behavior depends on ordering, fallback, normalization, precedence, or authority, write the contract before review, not during.&lt;/li&gt;
&lt;li&gt;Computed fields belong to the runtime. Don't hand-calculate them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Worked example: this article
&lt;/h2&gt;

&lt;p&gt;I dogfood the same split. The Anti-AI-Tell style guide (&lt;code&gt;mr-k-man/llm-tips&lt;/code&gt; on GitHub) is Markdown: rationale, evidence base, the prose rules humans read. The matching contract is TOML — 49 machine-checkable rules with regexes, density thresholds, and applicability tags. And the audit workflow is a 10-unit DAG, also in TOML, that orchestrates inventory, scan, triage, fix, and regression as discrete units that run in parallel where the dependency graph permits.&lt;/p&gt;

&lt;p&gt;I ran the DAG on this article before publishing.&lt;/p&gt;

&lt;p&gt;The pre-fix audit found two hits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AIS:ST02&lt;/code&gt; structural: tricolon-fraction 60% (3 of 5 single-token enumerations were three-item).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AIS:F03&lt;/code&gt; formatting: inline-bold density 1.43 per 200 words (10 bolds in 1398 words; budget 7).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weighted score: 1.0 + 0.25 = 1.25. The rewrite threshold is 3, so this routed to surgical-edit, not rewrite-from-scratch.&lt;/p&gt;

&lt;p&gt;Three line-level edits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stripped four bullet-label &lt;code&gt;**&lt;/code&gt; markers in the "TOML carries..." list. The bullets already carry the structure; the bold was decoration.&lt;/li&gt;
&lt;li&gt;Expanded a three-item prerequisite-artifacts list (docs, cross-links, test plans) to four by adding "prior diagrams".&lt;/li&gt;
&lt;li&gt;Expanded a three-item adjective list (queryable, auditable, composable) to four by adding "versioned". The added word is true: the runtime persists history.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Regression scan: zero hits. Tricolon fraction 1 of 5 (20%, under the 30% threshold). Bold density 0.86 per 200 words (under 1.0). Linter exit 0.&lt;/p&gt;
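
&lt;p&gt;The density figures above are simple arithmetic and can be reproduced in a few lines. The function name is hypothetical, and deriving the budget as word count divided by 200 times the 1.0 threshold is my reading of the numbers, not something taken from the linter source.&lt;/p&gt;

```python
def density_per_200_words(hits, words):
    """Occurrences per 200 words, rounded to two decimals."""
    return round(hits / words * 200, 2)


WORDS = 1398
bold_before = density_per_200_words(10, WORDS)     # ten inline bolds pre-fix
bold_after = density_per_200_words(10 - 4, WORDS)  # four bullet-label bolds stripped
bold_budget = round(WORDS / 200 * 1.0)             # assumed: threshold of 1.0 per 200 words

tricolon_before = 3 / 5  # three of five enumerations were three-item
tricolon_after = 1 / 5   # two of them were expanded to four items
```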

&lt;p&gt;You're reading the post-fix version. Everything is in &lt;a href="https://github.com/mr-k-man/llm-tips" rel="noopener noreferrer"&gt;&lt;code&gt;mr-k-man/llm-tips&lt;/code&gt;&lt;/a&gt; on GitHub: the source guide at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/style_guide.md" rel="noopener noreferrer"&gt;&lt;code&gt;style_guide.md&lt;/code&gt;&lt;/a&gt;, the contract at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/tools/style_policy.toml" rel="noopener noreferrer"&gt;&lt;code&gt;tools/style_policy.toml&lt;/code&gt;&lt;/a&gt;, the linter at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/tools/lint_writing_style.py" rel="noopener noreferrer"&gt;&lt;code&gt;tools/lint_writing_style.py&lt;/code&gt;&lt;/a&gt;, and the audit DAG at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/tools/audit_dag.toml" rel="noopener noreferrer"&gt;&lt;code&gt;tools/audit_dag.toml&lt;/code&gt;&lt;/a&gt;. MIT-licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;If you put LLM agents on real work, decide which invariants you want a validator to enforce and which you want a human reviewer to negotiate. Draw that line on purpose. Then accept that flat files have a ceiling: the moment your agents start mutating state, something has to own the audit trail and the live derivations, and a text file isn't it.&lt;/p&gt;

&lt;p&gt;Narrative carries judgement; structure carries invariants. Force either of them to carry live state and you'll lose the audit trail inside a week.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>the next software stack needs more than code generation</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Wed, 22 Apr 2026 04:55:17 +0000</pubDate>
      <link>https://dev.to/wernerk_au/the-next-software-stack-needs-more-than-code-generation-3aep</link>
      <guid>https://dev.to/wernerk_au/the-next-software-stack-needs-more-than-code-generation-3aep</guid>
      <description>&lt;p&gt;Most people in software are staring at the wrong milestone. Models write API handlers, unit tests, and migrations fast enough that typing isn't the limiting factor anymore. In a world of high-concurrency agents, the act of writing code is no longer the bottleneck. That part of the problem is finished.&lt;/p&gt;

&lt;p&gt;The real trouble starts the moment that code lands. Why was this change made? Which requirement forced it? And who actually checked the risky paths in the auth flow? You can still answer those questions today, but it takes a kind of technical archaeology—digging through PR threads, Slack messages, and documentation that was out of date the day it was written. That workflow held up while humans set the pace. It breaks the moment you stop being the bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  the velocity trap
&lt;/h3&gt;

&lt;p&gt;Most teams run AI-assisted development through a loop of prompt, branch, code, review, and merge. At low volume, it holds up. Then usage increases. You start seeing changes that look fine but carry no clear origin story. A feature flag shows up in production with a name nobody recognizes. An environment variable gets added "just to make something work" and stays there for six months because nobody is sure what it’s gating.&lt;/p&gt;

&lt;p&gt;Then we have a growing crowd of "psychosis coders" who think they are shipping masterpieces because they saw an agent move a cursor. They hit approve the second the diff looks plausible, never noticing the trail of empty TODO comments, shallow mocks, and tests that don't actually assert anything meaningful. They are shipping "passable" trash masquerading as velocity.&lt;/p&gt;

&lt;p&gt;Maintaining real quality at agentic speeds requires a gauntlet. In my own work, I have to run Model B against Model A like a caffeine-fueled nitpicker for ten rounds just to reach consensus. Then Model C does the same dance. This cross-model review is mandatory to maintain velocity without the system collapsing into a pile of actual slop.&lt;/p&gt;

&lt;p&gt;But even this gauntlet is a patch, not a solution. We are burning a mountain of tokens to force quality through a pipe that was never meant to handle it. This is "Approval Theater" as a survival strategy. And no, neither your carefully crafted Markdown files, nor prompt engineering, nor harness stacking solves this.&lt;/p&gt;

&lt;h3&gt;
  
  
  why clean merges still fail
&lt;/h3&gt;

&lt;p&gt;Agent A updates &lt;code&gt;PricingEngine::price()&lt;/code&gt; to apply a discount based on &lt;code&gt;User::join_date&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Agent B removes &lt;code&gt;join_date&lt;/code&gt; from &lt;code&gt;User&lt;/code&gt; and introduces a &lt;code&gt;UserMetadata&lt;/code&gt; lookup that returns &lt;code&gt;Option&amp;lt;NaiveDate&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The pricing path now depends on a value that may not exist. In the failure case, the lookup returns &lt;code&gt;None&lt;/code&gt;, and a later fallback resolves that missing value to &lt;code&gt;Money::default()&lt;/code&gt;, producing &lt;code&gt;0.00&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Both changes compile. Both pass their unit tests. Because they don't touch the same lines of code, Git merges them without a single conflict.&lt;/p&gt;

&lt;p&gt;In production, the pricing logic fails. Revenue doesn't drop to zero. That would be obvious. It becomes inconsistent instead. Some users are charged correctly. Others hit the missing metadata path and get a zero price. Support tickets appear first. Finance notices the reconciliation mismatch three weeks later.&lt;/p&gt;

&lt;p&gt;You're left trying to unwind two changes that were never evaluated together. Each was correct in isolation; the failure only existed in the interaction. A human developer might have caught that by holding the context in their head, but that assumption doesn't scale when dozens of agents are moving at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  the idempotency crisis
&lt;/h3&gt;

&lt;p&gt;There is a deeper, uglier problem with agents and Git: retries. When a prompt fails or a network timeout hits, an agent often tries again. In a standard Git flow, this leads to double-commits, "dirty" working directories, or a messed-up HEAD state that requires a human to untangle. Then come additional worktrees, agents that don't check whether they're on the right branch in the right tree, and agents that pollute the repository root with Markdown files instead of sticking to the documentation paths you've specified.&lt;/p&gt;

&lt;p&gt;Git wasn't built for idempotent operations from a thousand concurrent workers. It was built for a human at a terminal who can see when a command failed. If the next stack doesn't have request-level idempotency built into the storage layer, you aren't building a system; you're building a race condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  files are the wrong primitive now
&lt;/h3&gt;

&lt;p&gt;Git shows you what changed in the text, but it doesn't show you why. You see two files modified, but you can’t see the requirement that triggered the edit. We review diffs and guess at intent. &lt;/p&gt;

&lt;p&gt;Agents don't operate on files; they operate on relationships. A discount rule depends on a user attribute; a billing flow depends on an auth decision. When we take that rich graph of intent and flatten it into files, we lose the fidelity of the work. This mismatch leads to "clean" merges that are semantically murky, repeated edits to the same symbols, and retries that converge on something other than what we actually meant to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  building the floor
&lt;/h3&gt;

&lt;p&gt;I'm building a stack that treats intent as the primary object, not the diff. It's not one tool. It's a set of components doing work Git was never designed for.&lt;/p&gt;

&lt;p&gt;aivcs is the version control core: a 9-crate Rust workspace. It uses blake3 for content-addressed hashing and groups changes around intent as an Episode instead of scattering them across commits. An Episode carries the requirement that triggered the change, the symbols actually touched, and the evidence (tests, benchmarks, profiles) attached when the work lands. It can import Git history as a baseline and export structured Episodes back into a branch, so teams don’t have to migrate all at once.&lt;/p&gt;

&lt;p&gt;trstr is the parsing layer. It’s spec-grounded, not grammar-by-example. When an agent edits a symbol, the system knows what that symbol is, not just which bytes moved. Tree-sitter is built for editor features. This needs stricter guarantees.&lt;/p&gt;

&lt;p&gt;sqry handles symbol-level indexing. It builds the graph from a rule like “apply a legacy discount” to every call site, call chain, and dependent type that touches it. That’s what lets an Episode carry semantic scope instead of a file list. It’s also how you catch the &lt;code&gt;PricingEngine&lt;/code&gt; / &lt;code&gt;UserMetadata&lt;/code&gt; class of failure before merge.&lt;/p&gt;

&lt;p&gt;wsmux is the concurrency layer: a CRDT over the code graph. When dozens of agents edit the same repository, the merge surface isn’t text. It’s operations on symbols and relationships. wsmux makes those edits converge instead of producing two clean merges that disagree at runtime.&lt;/p&gt;

&lt;p&gt;The storage layer is idempotent by construction. The same operation with the same content and intent resolves to the same Episode. Retries don’t duplicate work. A thousand workers hitting a flaky network stop being a race condition.&lt;/p&gt;
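
&lt;p&gt;The idea of idempotency by construction can be illustrated with a content-addressed store in Python. This is a toy sketch, not aivcs code: &lt;code&gt;EpisodeStore&lt;/code&gt; and &lt;code&gt;episode_id&lt;/code&gt; are hypothetical names, and &lt;code&gt;hashlib.blake2b&lt;/code&gt; stands in for blake3, which is not in the Python standard library.&lt;/p&gt;

```python
import hashlib
import json


def episode_id(intent, operations):
    """Derive a stable id from the intent and operations (blake2b as a stand-in)."""
    payload = json.dumps(
        {"intent": intent, "operations": operations}, sort_keys=True
    ).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


class EpisodeStore:
    """Toy in-memory store where a retried write resolves to the same entry."""

    def __init__(self):
        self._episodes = {}

    def put(self, intent, operations):
        eid = episode_id(intent, operations)
        # setdefault keeps the first write; an identical retry changes nothing.
        self._episodes.setdefault(eid, {"intent": intent, "operations": operations})
        return eid

    def __len__(self):
        return len(self._episodes)


store = EpisodeStore()
first = store.put("REQ:legacy-discount", [{"symbol": "PricingEngine::price", "op": "edit"}])
retry = store.put("REQ:legacy-discount", [{"symbol": "PricingEngine::price", "op": "edit"}])
```

&lt;p&gt;Because the id is a function of content and intent, the retry returns the same id and the store still holds one Episode.&lt;/p&gt;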

&lt;p&gt;This doesn’t replace Git. It sits alongside it.&lt;/p&gt;

&lt;p&gt;The goal is simple: when something changes, you can answer why without digging through history. Decisions travel with the change. Evidence is attached when the change is made, not reconstructed later.&lt;/p&gt;

&lt;p&gt;The system remembers what changed. It should also remember why.&lt;/p&gt;

&lt;p&gt;The bottleneck moved. The stack didn’t. That gap is where the risk lives.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>architecture</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
