<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Velikiy</title>
    <description>The latest articles on DEV Community by Alexander Velikiy (@great_cto).</description>
    <link>https://dev.to/great_cto</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2992733%2F562b24e1-823c-4331-b706-2e6cbdf9cb64.jpg</url>
      <title>DEV Community: Alexander Velikiy</title>
      <link>https://dev.to/great_cto</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/great_cto"/>
    <language>en</language>
    <item>
      <title>June under the hood: the board becomes a pult, prompts evolve behind a holdout gate, logs shrink 99.5%</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:39:57 +0000</pubDate>
      <link>https://dev.to/great_cto/june-under-the-hood-the-board-becomes-a-pult-prompts-evolve-behind-a-holdout-gate-logs-shrink-4hm9</link>
      <guid>https://dev.to/great_cto/june-under-the-hood-the-board-becomes-a-pult-prompts-evolve-behind-a-holdout-gate-logs-shrink-4hm9</guid>
      <description>&lt;p&gt;The last two posts were about the pivot — autopilots, live connectors, the operator console. This one is about the engine room: four upgrades that shipped in the same June sprint and that you'd otherwise only discover by reading the changelog. Users keep telling us they don't read the changelog. Fair.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The board is now a control panel (a "pult"), not a mirror
&lt;/h2&gt;

&lt;p&gt;Until v2.64 the dev board &lt;em&gt;showed&lt;/em&gt; you the pipeline: tasks, gates, costs. To act on anything you went back to the terminal.&lt;/p&gt;

&lt;p&gt;Now approving a gate (or pressing &lt;strong&gt;Run&lt;/strong&gt;) &lt;strong&gt;spawns a Claude Code agent headlessly in the project and streams its output into the board&lt;/strong&gt; — assistant text, tool calls, result, parsed from &lt;code&gt;stream-json&lt;/code&gt; and pushed over SSE. There's a Run-agent panel with a prompt field and a live stream, and an &lt;strong&gt;Approve + ▶run&lt;/strong&gt; button right on the gate card. Approve the plan, watch the implementation start, without touching a terminal.&lt;/p&gt;

&lt;p&gt;Running an autonomous agent that edits files from a web page is exactly as dangerous as it sounds, so the guardrails came first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;same-origin only, and the project must live under &lt;code&gt;$HOME&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;one run per project&lt;/strong&gt; — a second Run gets a 409&lt;/li&gt;
&lt;li&gt;hard timeout (SIGTERM → SIGKILL), 2000-line ring buffer, child stdin closed&lt;/li&gt;
&lt;li&gt;permission mode defaults to &lt;strong&gt;acceptEdits&lt;/strong&gt; — full autonomy is an explicit opt-in env var, never the default&lt;/li&gt;
&lt;/ul&gt;
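&lt;p&gt;Two of those guardrails are easy to picture in code. The sketch below is illustrative Python, not great_cto's implementation: &lt;code&gt;RunRegistry&lt;/code&gt;, its method names, and the 202 returned for an accepted run are assumptions; only the 409 and the 2000-line buffer come from the list above.&lt;/p&gt;

```python
from collections import deque

MAX_LINES = 2000  # ring-buffer size quoted in the guardrail list


class RunRegistry:
    """Hypothetical in-memory registry: at most one active run per project."""

    def __init__(self):
        self._active = {}  # project path -> deque of output lines

    def start(self, project):
        # A second Run for the same project is refused with 409 Conflict.
        if project in self._active:
            return 409
        self._active[project] = deque(maxlen=MAX_LINES)
        return 202  # assumed "accepted" status, not taken from the source

    def append(self, project, line):
        # deque(maxlen=...) drops the oldest line once the buffer is full.
        self._active[project].append(line)

    def tail(self, project):
        return list(self._active[project])

    def finish(self, project):
        self._active.pop(project, None)


registry = RunRegistry()
first = registry.start("/home/me/app")
second = registry.start("/home/me/app")
for i in range(2500):
    registry.append("/home/me/app", f"line {i}")
buffered = registry.tail("/home/me/app")
```

&lt;p&gt;After 2500 appended lines the buffer holds only the most recent 2000, which is what keeps a runaway agent from exhausting the board's memory.&lt;/p&gt;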

&lt;p&gt;Verified end-to-end with a stub binary (all four guardrails, Stop button) and a real &lt;code&gt;claude&lt;/code&gt; run.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Prompts now have to prove they got better
&lt;/h2&gt;

&lt;p&gt;Every agent in GreatCTO learns from lessons. The uncomfortable question: when the system rewrites an agent's prompt based on a lesson, &lt;em&gt;who checks the rewrite didn't make it worse?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;v2.37 closed the loop, porting the generate→evaluate→gate cycle from &lt;a href="https://github.com/hexo-ai/sia" rel="noopener noreferrer"&gt;hexo-ai/sia&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eval cases split into &lt;strong&gt;tuning&lt;/strong&gt; (visible to the prompt-improver) and &lt;strong&gt;holdout&lt;/strong&gt; (gate-only, anti-overfit)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;promotion gate&lt;/strong&gt; blocks any candidate prompt that regresses on the holdout split — exit codes, not vibes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/prompt-evolve&lt;/code&gt; runs lesson → candidate → holdout gate → PROMOTE/REJECT, with a per-agent generation ledger you can audit&lt;/li&gt;
&lt;li&gt;Each agent gets a generational changelog: which lesson, what held-out delta, full provenance&lt;/li&gt;
&lt;/ul&gt;
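&lt;p&gt;The gate itself can be very small. Here is a minimal sketch, assuming "regresses" means the mean holdout score drops and that scores lie between 0 and 1; &lt;code&gt;promotion_gate&lt;/code&gt; and the exit-code convention (0 for PROMOTE, 1 for REJECT) are hypothetical names chosen for illustration, not the project's API.&lt;/p&gt;

```python
from statistics import mean


def promotion_gate(baseline_scores, candidate_scores):
    """Compare mean holdout scores and return (exit_code, delta).

    exit_code 0 means PROMOTE, 1 means REJECT (assumed convention).
    """
    delta = mean(candidate_scores.values()) - mean(baseline_scores.values())
    exit_code = 0 if delta >= 0 else 1
    return exit_code, delta


# Holdout cases are never shown to the prompt-improver, only to the gate.
baseline = {"holdout-1": 1.0, "holdout-2": 0.0, "holdout-3": 1.0}
better = {"holdout-1": 1.0, "holdout-2": 1.0, "holdout-3": 1.0}
worse = {"holdout-1": 0.0, "holdout-2": 0.0, "holdout-3": 1.0}

promote_code, promote_delta = promotion_gate(baseline, better)
reject_code, reject_delta = promotion_gate(baseline, worse)
```

&lt;p&gt;A stricter variant could reject on any single-case regression rather than on the mean; which one fits depends on how noisy the eval cases are.&lt;/p&gt;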

&lt;p&gt;A learned improvement can no longer ship until it's re-proven on cases it never saw. The same loop later gated the compression layer below — turtles all the way down, but each turtle is tested.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Context compression: 31,475 chars of CI log → 155
&lt;/h2&gt;

&lt;p&gt;Agents read logs, test output, JSON dumps. Most of it is repetition. v2.38 added a compression layer — deterministic, $0, no LLM, no native deps, concepts borrowed from &lt;a href="https://github.com/chopratejas/headroom" rel="noopener noreferrer"&gt;chopratejas/headroom&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CI log&lt;/td&gt;
&lt;td&gt;31,475 → 155 chars (&lt;strong&gt;−99.5%&lt;/strong&gt;), FATAL/ERROR/stacks kept verbatim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;−43% minified, −98% with array crush&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noisy test run&lt;/td&gt;
&lt;td&gt;−86%, the FAIL preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The part that makes aggressive compression safe: &lt;strong&gt;CCR — Compressed Context with Retrieval&lt;/strong&gt;. Anything dropped is stored locally, content-addressed, and recoverable on demand; the memory filter appends a recall footer listing what it filtered. Lossless-on-demand. And a fidelity eval (through the v2.37 holdout gate, naturally) ensures a compressor only ships if the key fact survives.&lt;/p&gt;
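&lt;p&gt;To make the retrieval idea concrete, here is a rough sketch of content-addressed storage for dropped lines. Everything in it is an assumption for illustration: the in-memory &lt;code&gt;STORE&lt;/code&gt;, the marker list, and the footer format are invented, and the real compressor also keeps multi-line stack traces, which this toy version does not.&lt;/p&gt;

```python
import hashlib

STORE = {}  # hypothetical local store: sha256 hex digest -> dropped text
KEEP_MARKERS = ("FATAL", "ERROR")  # assumed markers for lines kept verbatim


def compress_log(text):
    """Keep marker lines, park everything else under its content hash."""
    kept, dropped = [], []
    for line in text.splitlines():
        target = kept if any(m in line for m in KEEP_MARKERS) else dropped
        target.append(line)
    if dropped:
        blob = "\n".join(dropped)
        key = hashlib.sha256(blob.encode("utf-8")).hexdigest()
        STORE[key] = blob
        # Recall footer: tells the reader what was filtered and how to get it.
        kept.append(f"[filtered {len(dropped)} lines; recall {key[:12]}]")
    return "\n".join(kept)


def recall(prefix):
    """Return the dropped text whose hash starts with prefix, if any."""
    matches = [blob for key, blob in STORE.items() if key.startswith(prefix)]
    return matches[0] if matches else None


log = "\n".join(["INFO step ok"] * 50 + ["ERROR db timeout"])
compressed = compress_log(log)
```

&lt;p&gt;Because the key is derived from the content, storing the same noise twice costs nothing, and the footer is all an agent needs to ask for the original.&lt;/p&gt;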

&lt;p&gt;&lt;code&gt;l3-support&lt;/code&gt; compresses logs and &lt;code&gt;qa-engineer&lt;/code&gt; compresses test output before reasoning — fewer tokens spent re-reading the same stack trace twelve times.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Scope creep is now caught mechanically
&lt;/h2&gt;

&lt;p&gt;The classic agent failure: asked to fix the webhook, also "improved" the auth module. v2.39 added governance inspired by NaCl, all machine-checkable at $0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;impl-brief per task&lt;/strong&gt; — files-to-modify allowlist, &lt;strong&gt;files-NOT-to-modify denylist&lt;/strong&gt;, API contract, test spec. senior-dev refuses to commit out of scope; a denylist hit is a hard fail, override only via a signed exception&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;/trace&lt;/strong&gt; — requirement → use-case → task → test traceability for impact analysis and coverage gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gap-closure waves&lt;/strong&gt; — adopt strict gates on a legacy repo incrementally: criticals never deferred, every deferred gap held by a signed, &lt;em&gt;expiring&lt;/em&gt; exception. Never a silent bypass.&lt;/li&gt;
&lt;/ul&gt;
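&lt;p&gt;The allowlist/denylist check is the most mechanical of the three. A minimal sketch, assuming the brief stores glob patterns and that the check runs over the list of changed paths before commit; &lt;code&gt;check_scope&lt;/code&gt; and the pattern format are hypothetical, and signed exceptions are left out.&lt;/p&gt;

```python
from fnmatch import fnmatch


def check_scope(changed_files, allow, deny):
    """Return (exit_code, violations) for a proposed commit."""
    violations = []
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in deny):
            # A denylist hit is a hard fail regardless of the allowlist.
            violations.append((path, "denylist"))
        elif not any(fnmatch(path, pattern) for pattern in allow):
            violations.append((path, "out-of-scope"))
    return (1 if violations else 0), violations


allow = ["src/webhooks/*", "tests/webhooks/*"]
deny = ["src/auth/*"]

in_scope = check_scope(["src/webhooks/stripe.py"], allow, deny)
creep = check_scope(["src/webhooks/stripe.py", "src/auth/session.py"], allow, deny)
```

&lt;p&gt;Checking the denylist first means a file that matches both lists still fails, which is the safer default.&lt;/p&gt;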




&lt;h2&gt;
  
  
  Also in June
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fable 5 support&lt;/strong&gt; — &lt;code&gt;agent-model: fable&lt;/code&gt; pins every managed agent to Claude Fable 5; the board's agent runner passes the model through verbatim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSRF guard on the board&lt;/strong&gt; — cross-origin mutations now 403. A malicious page can no longer POST to your localhost and approve a gate. (Found by our own &lt;code&gt;/audit&lt;/code&gt;, fixed the same day.)&lt;/li&gt;
&lt;li&gt;The pre-push hook can no longer hang a push, and gate-approve survives GUI-launched shells with a minimal PATH.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of it: open source, MIT, zero telemetry, &lt;a href="https://github.com/avelikiy/great_cto" rel="noopener noreferrer"&gt;github.com/avelikiy/great_cto&lt;/a&gt;. The full gory detail lives in the &lt;a href="https://github.com/avelikiy/great_cto/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt; — but now you don't have to read it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>devtools</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The operator console: where the autopilot's work waits for a signature</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:21:51 +0000</pubDate>
      <link>https://dev.to/great_cto/the-operator-console-where-the-autopilots-work-waits-for-a-signature-3ppj</link>
      <guid>https://dev.to/great_cto/the-operator-console-where-the-autopilots-work-waits-for-a-signature-3ppj</guid>
      <description>&lt;p&gt;Last post ended with the autopilot pausing at a human checkpoint. Pausing is easy — any workflow engine can stop. The hard questions are operational: &lt;em&gt;where&lt;/em&gt; does the case wait, &lt;em&gt;who&lt;/em&gt; is allowed to sign it, &lt;em&gt;what&lt;/em&gt; do they see before signing, and what happens when the write fails at 2am?&lt;/p&gt;

&lt;p&gt;That's what we built through v2.46–v2.63: the &lt;strong&gt;operator console&lt;/strong&gt;. &lt;code&gt;great-cto board&lt;/code&gt; → &lt;code&gt;/autopilot.html&lt;/code&gt;. It's the Operate-mode surface — the app for the licensed humans the flow escalates to, not for the engineer who wired it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Durable runs: the signature crosses a process boundary
&lt;/h2&gt;

&lt;p&gt;A run persists to disk and survives restarts. &lt;code&gt;startRun&lt;/code&gt; advances the flow to the gate and parks it as &lt;code&gt;awaiting-approval&lt;/code&gt;; &lt;code&gt;approve(id, who)&lt;/code&gt; resumes it &lt;strong&gt;and executes the irreversible write&lt;/strong&gt;; &lt;code&gt;reject&lt;/code&gt; ends it without executing anything irreversible. Every transition appends to an immutable audit trail.&lt;/p&gt;
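&lt;p&gt;The lifecycle is small enough to sketch. The Python below is an illustration under assumptions, not the actual run store: the JSON-file layout, the &lt;code&gt;approved&lt;/code&gt; intermediate status, and the function signatures are invented; only the state names &lt;code&gt;awaiting-approval&lt;/code&gt; and &lt;code&gt;completed&lt;/code&gt; come from the post.&lt;/p&gt;

```python
import json
import os
import tempfile


def load_run(path):
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_run(path, run):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(run, handle)


def start_run(path, run_id):
    # Advance to the gate and park; nothing irreversible has happened yet.
    run = {"id": run_id, "status": "awaiting-approval",
           "audit": [{"event": "started"}, {"event": "parked-at-gate"}]}
    save_run(path, run)


def approve(path, who, write):
    run = load_run(path)  # may run in a different process than start_run
    if run["status"] != "awaiting-approval":
        raise ValueError("run is not waiting at a gate")
    run["status"] = "approved"  # assumed intermediate state
    run["audit"].append({"event": "approved", "who": who})
    save_run(path, run)  # persist the signature before the write
    write(run["id"])  # the irreversible step runs only after signing
    run["status"] = "completed"
    run["audit"].append({"event": "write-executed"})
    save_run(path, run)


def reject(path, who):
    run = load_run(path)
    if run["status"] != "awaiting-approval":
        raise ValueError("run is not waiting at a gate")
    run["status"] = "rejected"
    run["audit"].append({"event": "rejected", "who": who})
    save_run(path, run)


run_file = os.path.join(tempfile.mkdtemp(), "run-1.json")
submitted = []  # stands in for the external write
start_run(run_file, "run-1")
approve(run_file, "coder@example.com", submitted.append)
final_status = load_run(run_file)["status"]
```

&lt;p&gt;Persisting the approval before calling the write means a crash mid-write leaves evidence of who signed, which is the state a dead-letter queue would pick up.&lt;/p&gt;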

&lt;p&gt;The v2.43 safety invariant now holds &lt;em&gt;end to end&lt;/em&gt;: the 837 claim is submitted &lt;strong&gt;only because a coder signed its protecting gate&lt;/strong&gt; — provable across a process boundary, because the approve happens in a different process than the start.&lt;/p&gt;

&lt;p&gt;We demonstrated it on medical coding live: intake → code → NCCI edits (three live connectors) → &lt;strong&gt;pause&lt;/strong&gt; → the coder signs in the inbox → &lt;strong&gt;the claim goes out&lt;/strong&gt; → completed. The reject path submits nothing.&lt;/p&gt;

&lt;p&gt;Flows can require several signatures in sequence. Tax needs two: the preparer signs with their PTIN, then the taxpayer signs Form 8879 — the IRS e-file fires only after both. The board pushes a notification to the signer the moment a gate opens.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the signer actually sees
&lt;/h2&gt;

&lt;p&gt;A queue, then a case drawer. The drawer carries everything a decision needs in one panel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The decision criteria&lt;/strong&gt; — the SOP this case is judged against&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence, connector by connector&lt;/strong&gt; — exactly what each integration found, with its live/stub flag and per-call latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AI-drafted determination&lt;/strong&gt; — a templated rationale composed from the evidence, reviewed before signing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The audit trail&lt;/strong&gt; — tamper-evident, with a "✓ verified" badge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Signing an irreversible write opens a &lt;strong&gt;signature ceremony&lt;/strong&gt;: an alert dialog that names exactly what will execute — the gated step, its blast radius, the gate protecting it — and requires explicit confirmation. No one "accidentally approves" a wire transfer because the button was where their cursor happened to be.&lt;/p&gt;

&lt;p&gt;And because humans override machines (that's the point), &lt;strong&gt;overrides are logged&lt;/strong&gt;: sign against the AI recommendation and the divergence is recorded — case, recommendation, decision, who. Your regulator will ask. Now there's an answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The routing dial
&lt;/h2&gt;

&lt;p&gt;Not every case deserves a human minute. Admin Settings sets a per-tenant &lt;strong&gt;confidence floor&lt;/strong&gt;: a low-confidence approve is downgraded to escalate, and clean high-confidence cases are flagged auto-eligible. The dial moves as your trust does — start with everything escalated, widen straight-through as the override rate stays flat.&lt;/p&gt;

&lt;p&gt;Around the queue, the things an operation actually needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Roles&lt;/strong&gt; — operators sign; admins and compliance-leads see QA and Ops; invite links are scoped, with email invites and an impersonation banner when acting via a token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart views&lt;/strong&gt; — All · Auto-eligible · Escalated · SLA at-risk · High blast, with SLA-aware sort and regulatory-deadline clocks on each case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA sampling&lt;/strong&gt; — a deterministic ~20% of closed cases lands in a QA queue to be scored 1–5; results land on the run, the audit, and Analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bulk actions&lt;/strong&gt; — multi-select (or "select auto-eligible") → approve / reject / escalate with a reason, RBAC-checked per case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard-first&lt;/strong&gt; — ⌘K palette, &lt;code&gt;j/k&lt;/code&gt; queue cursor, &lt;code&gt;a/r/e/b&lt;/code&gt; decisions, &lt;code&gt;?&lt;/code&gt; cheatsheet&lt;/li&gt;
&lt;/ul&gt;
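&lt;p&gt;"Deterministic ~20%" usually means hashing rather than rolling dice. One way it could work, sketched under the assumption that the case id is the only input; &lt;code&gt;in_qa_sample&lt;/code&gt; and the bucket scheme are hypothetical.&lt;/p&gt;

```python
import hashlib

SAMPLE_PERCENT = 20  # target share of closed cases sent to QA


def in_qa_sample(case_id):
    """Deterministically pick roughly SAMPLE_PERCENT percent of case ids."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return SAMPLE_PERCENT > bucket


case_ids = [f"case-{n}" for n in range(10000)]
sampled = [case_id for case_id in case_ids if in_qa_sample(case_id)]
rate = len(sampled) / len(case_ids)
```

&lt;p&gt;The same case always gets the same answer, so an auditor can re-derive the sample later instead of trusting a stored random seed.&lt;/p&gt;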




&lt;h2&gt;
  
  
  The Ops tab: because writes fail
&lt;/h2&gt;

&lt;p&gt;The least glamorous tab is the one that earns the trust. For admins and compliance-leads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KPI tiles&lt;/strong&gt; — runs, connector calls, estimated cost, average latency, retries, over-budget, dead-letters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead-letter queue&lt;/strong&gt; — every failed post-gate write with its connectors and error, and a one-click &lt;strong&gt;↻ Requeue&lt;/strong&gt; that re-runs the write and recovers the run to &lt;code&gt;completed&lt;/code&gt;. An off-tab badge makes a stuck write visible without clicking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connector health&lt;/strong&gt; — per-connector 🟢/🔴, call count, failure rate, p95 latency, last error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metering by industry&lt;/strong&gt; — per-vertical runs / calls / latency / cost, sorted by spend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retries never double-submit: an idempotency key, stable per run, is threaded into every write.&lt;/p&gt;
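&lt;p&gt;A minimal sketch of that idea, assuming the key is derived from the run id alone and the provider deduplicates on it. &lt;code&gt;idempotency_key&lt;/code&gt; and &lt;code&gt;FakeProvider&lt;/code&gt; are illustrative stand-ins, not part of the project.&lt;/p&gt;

```python
import hashlib


def idempotency_key(run_id):
    """Derive a key that is identical on every retry of the same run."""
    return hashlib.sha256(f"write:{run_id}".encode("utf-8")).hexdigest()


class FakeProvider:
    """Stand-in for a provider that deduplicates on the key it receives."""

    def __init__(self):
        self.writes = {}

    def submit(self, key, payload):
        if key in self.writes:
            return "duplicate-ignored"
        self.writes[key] = payload
        return "accepted"


provider = FakeProvider()
first_try = provider.submit(idempotency_key("run-1"), {"claim": "837"})
# A requeue recomputes the same key, so the provider sees a repeat.
retry = provider.submit(idempotency_key("run-1"), {"claim": "837"})
```

&lt;p&gt;The guarantee only holds if the receiving system honors the key; for providers that do not, the runtime would have to deduplicate on its own side.&lt;/p&gt;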




&lt;h2&gt;
  
  
  Enterprise polish, measured
&lt;/h2&gt;

&lt;p&gt;v2.63 was a full UI/UX pass, and we held it to numbers rather than adjectives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accessibility&lt;/td&gt;
&lt;td&gt;WCAG 2.2 AA — &lt;strong&gt;axe-core: 0 violations&lt;/strong&gt;, all tabs, both themes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Themes&lt;/td&gt;
&lt;td&gt;light/dark (&lt;code&gt;prefers-color-scheme&lt;/code&gt; + persist), white-label accent per tenant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Realtime&lt;/td&gt;
&lt;td&gt;SSE pushes a change the instant any run mutates — console, CLI, or webhook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;render cap keeps 500+ case queues smooth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;durable-runtime e2e across &lt;strong&gt;all 25 verticals&lt;/strong&gt; (start → gate → sign → write), 348/348 lib tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Multi-tenant scoping means an operator sees only their tenant's queue. Cases export to CSV, because the auditor's tooling is Excel and pretending otherwise helps no one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;"Human in the loop" is usually a checkbox in a pitch deck. Operationally it's a product: an inbox with SLA clocks, a drawer with evidence, a ceremony for the point of no return, override logs, QA sampling, and a dead-letter queue for the night the provider's API was down.&lt;/p&gt;

&lt;p&gt;That product is what makes it safe to let the autopilot run the volume. Try it: &lt;code&gt;npx great-cto init&lt;/code&gt;, then &lt;code&gt;great-cto board&lt;/code&gt;. Screenshots on the &lt;a href="https://greatcto.systems/#review" rel="noopener noreferrer"&gt;landing&lt;/a&gt;; the run store, runtime, and console are all in the &lt;a href="https://github.com/avelikiy/great_cto" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>autopilots</category>
      <category>hitl</category>
      <category>product</category>
    </item>
    <item>
      <title>We pivoted: GreatCTO is now AI autopilots for business</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:20:49 +0000</pubDate>
      <link>https://dev.to/great_cto/we-pivoted-greatcto-is-now-ai-autopilots-for-business-5hck</link>
      <guid>https://dev.to/great_cto/we-pivoted-greatcto-is-now-ai-autopilots-for-business-5hck</guid>
      <description>&lt;p&gt;For a year GreatCTO was an engineering-process engine: agents, gates, reviewers, compliance packs. Good product. Wrong headline.&lt;/p&gt;

&lt;p&gt;Here's the thing we kept observing: the people who got the most value weren't buying "a better SDLC." They were buying &lt;em&gt;the outcome of a business function&lt;/em&gt; — claims coded, contracts reviewed, invoices matched, taxes filed. The pipeline was the means.&lt;/p&gt;

&lt;p&gt;So in v2.40 we said it out loud: &lt;strong&gt;GreatCTO ships AI autopilots for business.&lt;/strong&gt; Products that sell the outcome of a service, not a tool to a specialist. Packs, reviewers and gates didn't go anywhere — they became the under-the-hood trust layer instead of the headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an autopilot actually is
&lt;/h2&gt;

&lt;p&gt;A flow. One file per vertical — &lt;code&gt;flows/&amp;lt;vertical&amp;gt;.flow.json&lt;/code&gt; — the single source of truth that renders the CLI behavior, the runtime, and the landing page from the same data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;steps&lt;/strong&gt; — intake → process → decide → deliver, each tagged with the agent, the tools, and whether a human signs it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;connectors&lt;/strong&gt; — the real-world integrations the steps call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gates&lt;/strong&gt; — where a named, licensed human signs before the flow continues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;owner&lt;/strong&gt; — one accountable person who answers for what the autopilot does&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four autopilot invariants are machine-checkable (&lt;code&gt;autopilot-gate.mjs&lt;/code&gt;): judgment boundary (confidence → escalation), accuracy-as-SLA, per-decision audit trail, per-outcome unit economics. Not a manifesto — a validator that exits 1.&lt;/p&gt;




&lt;h2&gt;
  
  
  6 → 16 → 25 verticals
&lt;/h2&gt;

&lt;p&gt;We started with six (legal docs, medical coding, procurement, accounting, managed IT, tax). Then the expansion criterion clicked: a vertical is a fit when it pairs &lt;strong&gt;a large displaceable-labor pool with a legally-required named human who signs the risky call&lt;/strong&gt;. That's the exact shape the safety engine is built for.&lt;/p&gt;

&lt;p&gt;Ten more landed in v2.44: prior-auth ($35–56B), KYC/AML ($61B), managed SOC, insurance claims (~$36–38B), mortgage underwriting, title &amp;amp; escrow, provider credentialing, collections, freight brokerage, clinical-trial ops. Then nine more, among them immigration, appraisal, payroll, workers-comp, estate planning, and patent prosecution. &lt;strong&gt;Twenty-five total&lt;/strong&gt;, every one shipping green on &lt;code&gt;--validate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each carries its own compliance reviewer: False Claims Act + NCCI for coding, OFAC + BSA for AML, FDCPA + Reg F for collections, Circular 230 + §7216 for tax, FMCSA for freight. The regulation is a step in the flow, not a PDF you read later.&lt;/p&gt;




&lt;h2&gt;
  
  
  "Live" means live
&lt;/h2&gt;

&lt;p&gt;A flow that calls mocked connectors is a demo. By v2.45, &lt;strong&gt;all verticals exercise at least one live connector&lt;/strong&gt; — 17 live in the catalog, keyless by default (deterministic real logic or a curated public slice), switching to the real provider the moment you add a credential.&lt;/p&gt;

&lt;p&gt;A few favorites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;um-criteria&lt;/strong&gt; (prior-auth) — CMS NCD/LCD-style medical-necessity matching that &lt;strong&gt;never auto-denies&lt;/strong&gt;. Missing criteria escalates to the medical director. By design, not by prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sar-filing&lt;/strong&gt; (AML) — generates a FinCEN SAR, and the filing is &lt;strong&gt;blocked without the BSA Officer's signature&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;comms-outreach&lt;/strong&gt; (collections) — FDCPA/Reg F 7-in-7, TCPA, and the 8am–9pm window enforced as ALLOW/BLOCK per contact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;primary-source&lt;/strong&gt; (credentialing) — OIG LEIE / SAM exclusion screening as a hard block, plus a real NPI Luhn check.&lt;/li&gt;
&lt;/ul&gt;
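&lt;p&gt;For the curious, the NPI check is the standard Luhn algorithm run over the 10-digit identifier with the prefix &lt;code&gt;80840&lt;/code&gt; prepended. The sketch below shows the arithmetic only; the function names are ours for illustration and this is not the connector's code.&lt;/p&gt;

```python
def luhn_valid(digits):
    """Standard Luhn check over a string of decimal digits."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def npi_valid(npi):
    """Validate a 10-digit NPI, check digit included."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    return luhn_valid("80840" + npi)


sample_ok = npi_valid("1234567893")   # widely used example value
sample_bad = npi_valid("1234567890")  # same digits, wrong check digit
```

&lt;p&gt;A passing check digit only proves the number is well-formed; whether the provider exists or is excluded is still a lookup against the registries.&lt;/p&gt;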




&lt;h2&gt;
  
  
  The permission is never the wound
&lt;/h2&gt;

&lt;p&gt;The scariest failure mode of an agent isn't going rogue. It's doing &lt;em&gt;exactly what it's permitted to do&lt;/em&gt;, irreversibly, at machine speed, with no human hesitation. (Hat tip to Oleksandr Torlo's essay "The Permission Was the Wound.")&lt;/p&gt;

&lt;p&gt;v2.43 made the boundary a &lt;strong&gt;runtime invariant&lt;/strong&gt;, not a convention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every flow step is tagged &lt;code&gt;reversible&lt;/code&gt; or not, with a blast radius. Money moves, claim submission, e-signing, tax filing — irreversible.&lt;/li&gt;
&lt;li&gt;The runtime &lt;strong&gt;refuses to execute an irreversible step autonomously&lt;/strong&gt;. No prior human gate → &lt;code&gt;blocked-unsafe&lt;/code&gt;. Gate present → the step runs only after it's signed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;validateFlow()&lt;/code&gt; enforces it statically: irreversible ⟹ preceded by a human checkpoint, and every autopilot names an accountable owner. All 25 verticals ship green.&lt;/li&gt;
&lt;/ul&gt;
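&lt;p&gt;The static rule fits in a few lines. This is a sketch under assumptions, not the real &lt;code&gt;validateFlow()&lt;/code&gt;: the field names &lt;code&gt;owner&lt;/code&gt;, &lt;code&gt;reversible&lt;/code&gt; and &lt;code&gt;human_gate&lt;/code&gt; are guesses at a flow shape, and we assume any earlier gate (or one on the step itself) counts as "preceded by a human checkpoint".&lt;/p&gt;

```python
def validate_flow(flow):
    """Return a list of violations; an empty list means the flow is green."""
    errors = []
    if not flow.get("owner"):
        errors.append("flow has no accountable owner")
    gate_seen = False
    for step in flow["steps"]:
        if step.get("human_gate"):
            gate_seen = True
        # Steps are treated as reversible unless tagged otherwise.
        if not step.get("reversible", True) and not gate_seen:
            errors.append(f"irreversible step '{step['name']}' has no prior human gate")
    return errors


safe_flow = {"owner": "jane@example.com", "steps": [
    {"name": "intake"},
    {"name": "coder-review", "human_gate": True},
    {"name": "submit-claim", "reversible": False},
]}
unsafe_flow = {"owner": "", "steps": [
    {"name": "submit-claim", "reversible": False},
]}

safe_errors = validate_flow(safe_flow)
unsafe_errors = validate_flow(unsafe_flow)
exit_code = 1 if unsafe_errors else 0
```

&lt;p&gt;A stricter reading would require a fresh gate before each irreversible step; the post does not say which one the validator uses.&lt;/p&gt;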

&lt;p&gt;The autopilot does the volume. The point of no return always waits for a person.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quality is earned, not declared
&lt;/h2&gt;

&lt;p&gt;Every vertical gets a 0–100 scorecard: seven weighted dimensions, golden + adversarial cases run through the reviewer with an LLM judge, and a regression gate so a score can't silently decay. Two measure→improve→re-measure cycles took legaltech from 85 to 94.75 and msp from 78 to 98.5.&lt;/p&gt;

&lt;p&gt;If we're going to claim an autopilot can hold a function, the claim should be a number someone measured — and a gate that fails CI when it stops being true.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where this leaves you
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;npx great-cto init&lt;/code&gt;, name the function, and you get the flow — agents, connectors, human checkpoints, the compliance pack for your domain. The pipeline that built features for a year now &lt;em&gt;runs business functions&lt;/em&gt;, with the same receipts: &lt;a href="https://greatcto.systems/autopilots.html" rel="noopener noreferrer"&gt;all 25 autopilots&lt;/a&gt;, each with its flow, gates, and live-connector badges.&lt;/p&gt;

&lt;p&gt;Next post: what happens after the flow pauses — the operator console where a human actually signs.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>autopilots</category>
      <category>compliance</category>
      <category>opensource</category>
    </item>
    <item>
      <title>great_cto: what's new — three features and the move to Opus 4.8</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Fri, 29 May 2026 14:26:10 +0000</pubDate>
      <link>https://dev.to/great_cto/greatcto-whats-new-three-features-and-the-move-to-opus-48-1p41</link>
      <guid>https://dev.to/great_cto/greatcto-whats-new-three-features-and-the-move-to-opus-48-1p41</guid>
      <description>&lt;p&gt;While you were sleeping (or heroically fixing prod), &lt;code&gt;great_cto&lt;/code&gt; — the engineering-process engine for solo founders and teams up to 50 people — picked up some new tricks. No fluff: three features that actually change your daily grind, plus a model upgrade that didn't require re-mortgaging the apartment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Discovery pipeline: think before you code
&lt;/h2&gt;

&lt;p&gt;A timeless genre: write first, find out &lt;em&gt;what&lt;/em&gt; you should have written later. Until now the pipeline started with the architect, and everything "before" — problem research, prioritization, the PRD — lived in your head, your notes, and three browser tabs you were too scared to close. That gap is now filled by two commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/discover&lt;/code&gt;&lt;/strong&gt; — a full product-discovery cycle. Builds an &lt;strong&gt;Opportunity-Solution Tree&lt;/strong&gt; (Teresa Torres' framework): desired outcome → opportunities → solutions → experiments. Ranks opportunities by Opportunity Score = &lt;code&gt;Importance × (1 − Satisfaction)&lt;/code&gt; and tosses in ≥3 solutions for each. Output lands in &lt;code&gt;docs/discovery/OST-&amp;lt;slug&amp;gt;.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/prd&lt;/code&gt;&lt;/strong&gt; — a structured 8-section PRD, from Executive Summary to success criteria. Asks at most 4 clarifying questions (not 40, like that one meticulous stakeholder) and hands you a finished doc in &lt;code&gt;docs/requirements/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
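&lt;p&gt;The ranking formula is simple enough to try by hand. A small worked example, assuming importance and satisfaction are both scored on a 0 to 1 scale; the opportunities and their numbers are made up for illustration.&lt;/p&gt;

```python
def opportunity_score(importance, satisfaction):
    """Importance x (1 - Satisfaction), both assumed to be in 0..1."""
    return importance * (1 - satisfaction)


# (name, importance, satisfaction) with invented sample values
opportunities = [
    ("billing export", 0.6, 0.5),
    ("onboarding", 0.9, 0.2),
    ("dark mode", 0.3, 0.1),
]

ranked = sorted(
    opportunities,
    key=lambda item: opportunity_score(item[1], item[2]),
    reverse=True,
)
ranked_names = [name for name, _, _ in ranked]
```

&lt;p&gt;Onboarding wins because it is both important and poorly served; dark mode is barely served at all but matters too little to outrank the others.&lt;/p&gt;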

&lt;p&gt;The PM agent also finally learned to &lt;strong&gt;prioritize features&lt;/strong&gt; when there's more than one and they're all "urgent": pick from Opportunity Score / ICE / RICE / MoSCoW. The full new route: &lt;code&gt;/discover → /prd → /architect → /pm → senior-dev&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Quota warning at session start
&lt;/h2&gt;

&lt;p&gt;There's a special genre of pain: hitting the rate limit right in the middle of a heavy pipeline, when the result was &lt;em&gt;just&lt;/em&gt; around the corner. The new &lt;code&gt;quota-check.mjs&lt;/code&gt; hook checks your Claude Code quota &lt;strong&gt;at the start of every session&lt;/strong&gt; and tactfully clears its throat ahead of time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;70%+&lt;/strong&gt; — prefer the fast-path for big features&lt;/li&gt;
&lt;li&gt;🔴 &lt;strong&gt;85%+&lt;/strong&gt; — fast-path only (skip the ARCH doc)&lt;/li&gt;
&lt;li&gt;🛑 &lt;strong&gt;95%+&lt;/strong&gt; — friend, not today. Don't start the heavy pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a bonus it shows your burn rate per window (on track, or living large), tracks Sonnet's 7-day sub-quota separately, and watches pay-as-you-go spend. Parallel agents share a single request via a 5-minute cache — no DDoS-ing your own API. API-key users aren't touched at all — it quietly steps aside.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. digital-health-pack: an overlay for wearable and mental-health products
&lt;/h2&gt;

&lt;p&gt;A new domain overlay (Wave 4) attaches itself the moment your project starts cozying up to wearables and digital health — &lt;strong&gt;Apple HealthKit, Google Health Connect, Garmin, Fitbit, Oura, Whoop&lt;/strong&gt;, biometrics (HRV, SpO2, sleep), mental-health AI, nutrition/supplement recommendations, or physician-in-the-loop (HITL) flows.&lt;/p&gt;

&lt;p&gt;What's in the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a chain of three reviewers (&lt;code&gt;digital-health-reviewer&lt;/code&gt; + &lt;code&gt;ai-clinical-reviewer&lt;/code&gt; + &lt;code&gt;healthcare-reviewer&lt;/code&gt;);&lt;/li&gt;
&lt;li&gt;five human gates: &lt;strong&gt;wellness vs SaMD&lt;/strong&gt; classification, HITL design, wearable API access, supplement safety (drug-interaction check + NIH dose limits), and a crisis-escalation protocol for mental health;&lt;/li&gt;
&lt;li&gt;a ready-made threat-model template and EVAL suites — refuse-to-diagnose, supplement safety, and crisis escalation per AFSP Safe Messaging guidelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: a built-in regulatory checklist (FDA General Wellness vs SaMD, HIPAA, GDPR Art. 9, EU AI Act Annex III) — so your health startup attracts an investor, not a regulator's notice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: moving to Claude Opus 4.8
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;great_cto&lt;/code&gt; upgraded its flagship: &lt;strong&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt; → &lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/strong&gt; (Anthropic shipped it on 2026-05-28). The kind of move that needs no boxes and no movers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Where it works:&lt;/strong&gt; &lt;code&gt;architect&lt;/code&gt; (deep cross-cutting reasoning and ADR generation) plus 41 reviewers/specialists and &lt;code&gt;commands/review.md&lt;/code&gt; via &lt;code&gt;advisor-model&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What you gain:&lt;/strong&gt; better coding at the default effort level for comparable token spend, and a &lt;strong&gt;1M-token context window&lt;/strong&gt; (yes, even that legacy module fits).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same price:&lt;/strong&gt; $5 / $25 per MTok (in/out) — just like 4.7. Accounting can exhale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier aliases untouched:&lt;/strong&gt; agents on &lt;code&gt;model: sonnet&lt;/code&gt; / &lt;code&gt;model: haiku&lt;/code&gt; stay as they were — only explicit Opus pins moved.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Upgrade: &lt;code&gt;npx great-cto@latest init&lt;/code&gt;. The full list of changes is in the &lt;a href="https://github.com/avelikiy/great_cto/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>claude</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Everyone is squeezing context. We stopped putting everything in one context.</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Sat, 23 May 2026 06:06:06 +0000</pubDate>
      <link>https://dev.to/great_cto/everyone-is-squeezing-context-we-stopped-putting-everything-in-one-context-4gp5</link>
      <guid>https://dev.to/great_cto/everyone-is-squeezing-context-we-stopped-putting-everything-in-one-context-4gp5</guid>
      <description>&lt;p&gt;The standard advice for reducing LLM costs: truncate your prompts, use a cheaper model, compress your system prompt, enable caching, add &lt;code&gt;Be concise.&lt;/code&gt; to every instruction and hope for the best.&lt;/p&gt;

&lt;p&gt;All valid. All treating the symptom.&lt;/p&gt;

&lt;p&gt;We did something different.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real problem isn't prompt size. It's context architecture.
&lt;/h2&gt;

&lt;p&gt;When great_cto runs a feature pipeline — architect, PM, senior-dev, QA, security officer — each agent starts by reading the same stack of documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ARCH-*.md&lt;/code&gt; — full architecture decisions, 3–8k tokens each&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PLAN-*.md&lt;/code&gt; — implementation plans, 4–10k tokens&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;decisions.md&lt;/code&gt; — every architectural decision made since the project started&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lessons.md&lt;/code&gt; — every lesson learned, including that one time someone forgot to add an index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six agents. Each reads all of it. Most of it irrelevant to the task at hand.&lt;/p&gt;

&lt;p&gt;A senior-dev implementing a Stripe webhook doesn't need the 200-line deep-dive into the auth system. They need two sentences: &lt;em&gt;"We use Stripe. Card data never touches our infra."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The information was right. The &lt;strong&gt;delivery unit&lt;/strong&gt; was wrong. We were running a library where everyone gets every book, every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Stop sending full documents. Send summaries.
&lt;/h2&gt;

&lt;p&gt;Every artifact in great_cto now has a paired &lt;code&gt;.summary.md&lt;/code&gt; — auto-generated, ≤250 tokens, structured for the consuming agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# ARCH — Multi-tenant auth system · summary&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Decision:**&lt;/span&gt; SAML over OIDC for enterprise; JWT internally
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Stack:**&lt;/span&gt; Node 20, Passport.js, PostgreSQL row-level security
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Risks:**&lt;/span&gt; SAML metadata rotation, session fixation on tenant switch
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Full doc:**&lt;/span&gt; docs/architecture/ARCH-auth.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents read the summary first. If they need depth — the path to the full doc is right there. In practice, 80% of reads stop at the summary. The other 20% at least know exactly what they're looking for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;v2.19.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;13 artifacts, per agent read&lt;/td&gt;
&lt;td&gt;21,459 tokens&lt;/td&gt;
&lt;td&gt;2,216 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduction&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;–89.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The summary is generated automatically by a &lt;code&gt;PostToolUse&lt;/code&gt; hook the moment any agent writes an artifact. It uses Anthropic Haiku if you have an API key (~$0.0005/call), with OpenRouter Kimi K2 as the fallback. If neither is available, a deterministic keyword heuristic takes over: zero cost, works offline, mildly embarrassed about the quality but gets the job done.&lt;/p&gt;

&lt;p&gt;No config. No manual steps. Write artifact, get summary.&lt;/p&gt;
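
&lt;p&gt;A minimal sketch of that fallback order, for illustration only. The function names, the environment variable names, and the keyword list are assumptions made for this example, not the actual contents of &lt;code&gt;scripts/generate-summary.mjs&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: pick a summary backend in the order described above.
// The environment variable names are assumptions, not documented great_cto config.
function pickSummaryBackend(env) {
  if (env.ANTHROPIC_API_KEY) return "anthropic-haiku";
  if (env.OPENROUTER_API_KEY) return "openrouter-kimi-k2";
  return "heuristic";
}

// Offline fallback: keep the title plus lines that mention decision-like keywords,
// stopping once a rough token budget (about 4 characters per token) is used up.
function heuristicSummary(markdown, maxTokens = 250) {
  const keywords = /decision|stack|risk|constraint/i;
  const lines = markdown.split("\n").map((line) =&gt; line.trim()).filter(Boolean);
  const title = lines.find((line) =&gt; line.startsWith("#")) || "# (untitled)";
  const picked = [title];
  let budget = maxTokens * 4 - title.length;
  for (const line of lines) {
    if (line === title || !keywords.test(line)) continue;
    if (line.length &gt; budget) break;
    picked.push(line);
    budget -= line.length;
  }
  return picked.join("\n");
}

console.log(pickSummaryBackend({}));
console.log(heuristicSummary("# ARCH auth\n\nIntro text\n- Decision: SAML\n- Risks: rotation"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With no keys set, the sketch picks the heuristic path and keeps only the title, the decision line, and the risk line.&lt;/p&gt;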




&lt;h2&gt;
  
  
  Phase 2: Stop injecting the entire memory. Filter it to the task.
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;decisions.md&lt;/code&gt; is an append-only log. It grows. A typical project after three months: 40–80 entries — database choices, API decisions, security tradeoffs, that one auth approach you tried and abandoned at 2am.&lt;/p&gt;

&lt;p&gt;Before v2.19.0, the architect agent received the full file every time. 3–5k tokens, of which maybe 200 were actually relevant to the task. The model read all of it, politely, and quietly ignored most of it.&lt;/p&gt;

&lt;p&gt;Now: one call to &lt;code&gt;scripts/memory-filter.mjs "add Stripe webhook integration" decisions.md --k=5&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The filter scores each entry against the task title. For "add Stripe webhook integration" — you get the PCI decision, the webhook signature lesson, the relevant security pattern. Not the database choice from six months ago that has nothing to do with anything.&lt;/p&gt;
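
&lt;p&gt;To make the idea concrete, here is a rough sketch of task-scoped filtering using plain keyword overlap. The scoring rule, the stopword list, and the sample entries are assumptions for illustration; the real &lt;code&gt;scripts/memory-filter.mjs&lt;/code&gt; may score entries quite differently:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: score each decisions.md entry by keyword overlap with
// the task title and keep the top k entries that share at least one keyword.
const STOPWORDS = new Set(["a", "an", "the", "to", "for", "of", "and", "add", "we", "use"]);

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) =&gt; word.length &gt; 1)
    .filter((word) =&gt; !STOPWORDS.has(word));
}

function filterMemory(taskTitle, entries, k = 5) {
  const taskWords = new Set(tokenize(taskTitle));
  return entries
    .map((entry) =&gt; {
      const words = new Set(tokenize(entry));
      const score = [...taskWords].filter((word) =&gt; words.has(word)).length;
      return { entry, score };
    })
    .filter((item) =&gt; item.score &gt; 0)
    .sort((a, b) =&gt; b.score - a.score)
    .slice(0, k)
    .map((item) =&gt; item.entry);
}

// Made-up entries in the spirit of the example above.
const decisions = [
  "Stripe handles payments; card data never touches our infra (PCI scope)",
  "PostgreSQL chosen over MySQL for row-level security",
  "Verify the webhook signature before parsing the payload",
];
console.log(filterMemory("add Stripe webhook integration", decisions, 5));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this toy run the Stripe and webhook entries survive and the database entry is dropped, because it shares no keyword with the task title.&lt;/p&gt;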

&lt;p&gt;&lt;strong&gt;The numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;v2.19.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;decisions.md inject per agent pair&lt;/td&gt;
&lt;td&gt;946 tokens&lt;/td&gt;
&lt;td&gt;544 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduction&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;–42.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Latency: ~50ms heuristic, ~200ms Haiku. Cost: &amp;lt;$0.0001 per call. Opt-out: &lt;code&gt;GREAT_CTO_DISABLE_MEMORY_FILTER=1&lt;/code&gt; (for when you miss the old noise).&lt;/p&gt;




&lt;h2&gt;
  
  
  The combined pipeline: before vs. after
&lt;/h2&gt;

&lt;p&gt;Six agents per feature. Each reads artifacts and memory.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;v2.19.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens per feature&lt;/td&gt;
&lt;td&gt;134,430&lt;/td&gt;
&lt;td&gt;16,560&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduction&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;–87.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost saved (Sonnet $3/1M)&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.35 per feature&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
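
&lt;p&gt;The totals in this table are consistent with six agents each reading the per-agent figures from the two earlier tables. A quick check of that arithmetic (the constant names are ours, chosen for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Reproduce the combined table from the per-agent numbers quoted earlier.
const AGENTS = 6;
const SONNET_USD_PER_MTOK = 3; // price used in the table

const before = AGENTS * (21459 + 946); // full artifacts + full decisions.md
const after = AGENTS * (2216 + 544); // summaries + filtered decisions

const reductionPct = (1 - after / before) * 100;
const savedUsd = ((before - after) / 1e6) * SONNET_USD_PER_MTOK;

console.log(before, after); // 134430 16560
console.log(reductionPct.toFixed(1)); // 87.7
console.log(savedUsd.toFixed(2)); // 0.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;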

&lt;p&gt;This is with a small project — 13 artifacts, 7 decisions. The savings compound with scale: at 50 artifacts and 50 decisions (a project six months in), the legacy number climbs past 600k tokens per feature run. The filtered number stays roughly flat.&lt;/p&gt;

&lt;p&gt;That's the interesting property of this architecture: &lt;strong&gt;the noise grows with the project, the signal doesn't.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What this isn't
&lt;/h2&gt;

&lt;p&gt;This is not prompt compression. We're not removing information — we're delivering it at the right granularity, to the right agent, at the right moment.&lt;/p&gt;

&lt;p&gt;The full docs are still there. The full &lt;code&gt;decisions.md&lt;/code&gt; is still there. Any agent that needs depth can read it — the summary tells them exactly where to look. The filter acknowledges it might miss something ("if you suspect a relevant lesson is missing, read the full file directly"). It's a hint, not a wall.&lt;/p&gt;

&lt;p&gt;We're not betting on the model being smart enough to ignore irrelevant noise. We're not hoping a &lt;code&gt;Be concise.&lt;/code&gt; instruction somewhere will solve a structural problem. We're betting on &lt;strong&gt;information architecture&lt;/strong&gt; — the same principle that makes an indexed database faster than a full table scan.&lt;/p&gt;

&lt;p&gt;The index doesn't know less than the table. It knows where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting it
&lt;/h2&gt;

&lt;p&gt;Everything shipped in &lt;a href="https://github.com/avelikiy/great_cto" rel="noopener noreferrer"&gt;v2.19.0&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scripts/generate-summary.mjs&lt;/code&gt; — &lt;code&gt;--all&lt;/code&gt;, &lt;code&gt;--check&lt;/code&gt;, &lt;code&gt;--force&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/memory-filter.mjs&lt;/code&gt; — &lt;code&gt;--k=N&lt;/code&gt;, &lt;code&gt;--heuristic&lt;/code&gt;, &lt;code&gt;--stats&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agents/_shared/artifact-summary-contract.md&lt;/code&gt; — the producer/consumer contract&lt;/li&gt;
&lt;li&gt;31 tests, all green
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx great-cto upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Summaries generate on first &lt;code&gt;--all&lt;/code&gt; run, then stay fresh automatically. Memory filter activates in architect and senior-dev agents — no config needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Phase 3: session-scoped read cache. When five agents in one pipeline all read &lt;code&gt;PROJECT.md&lt;/code&gt;, only the first actually reads the file. The rest get a cache stub with a hash. Target: additional –15% on multi-agent runs.&lt;/p&gt;

&lt;p&gt;Phase 4: system prompt audit across all 30+ agent files. Removing filler. Enforcing token budgets. Finding the seven places we wrote "carefully" when the model was going to be careful anyway.&lt;/p&gt;

&lt;p&gt;The full plan is public: &lt;a href="https://github.com/avelikiy/great_cto/blob/main/docs/plans/PLAN-token-economy-2026-q2.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/plans/PLAN-token-economy-2026-q2.md&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>great_cto v2.17 - no more tambourine dance</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Fri, 22 May 2026 15:03:35 +0000</pubDate>
      <link>https://dev.to/great_cto/greatcto-v217-no-more-tambourine-dance-1p5p</link>
      <guid>https://dev.to/great_cto/greatcto-v217-no-more-tambourine-dance-1p5p</guid>
      <description>&lt;p&gt;If you've ever spent 20 minutes setting up Claude Code plugins before you could even start working - this update is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  One install, everything works.
&lt;/h2&gt;

&lt;p&gt;Previously: install great_cto, then figure out that Superpowers and Beads are also needed, find the repos, clone them, enable them in settings, restart. Classic.&lt;/p&gt;

&lt;p&gt;Now - one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx great-cto &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Superpowers and Beads install automatically as companion plugins. They land in &lt;code&gt;~/.claude/plugins/cache/local/&lt;/code&gt;, get enabled in &lt;code&gt;settings.json&lt;/code&gt;, and are ready to work. If git is missing - great_cto gives a friendly hint instead of silently failing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Jurisdiction-aware agents.
&lt;/h2&gt;

&lt;p&gt;The new jurisdictions module detects the context of your project - EU, US, Canada, UK, Australia, and others - and automatically activates the right regulatory reviewer agents.&lt;/p&gt;

&lt;p&gt;Working on a fintech product for European users? The EU reviewer turns on automatically. Building for the Canadian market? PIPEDA gets covered. No manual configuration, no trying to remember what applies where.&lt;/p&gt;

&lt;p&gt;Eight jurisdictions are currently supported, and the list keeps growing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Critics before the plan.
&lt;/h2&gt;

&lt;p&gt;The most expensive bugs aren't in the code - they're in the decisions made before coding starts. Three new critic agents now run at the earliest stages of the pipeline, before a single line is written.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture critic&lt;/strong&gt; catches structural problems that make future work impossible. Coupling that rules out multi-tenancy. An "obvious" O(n²) loop that works fine in dev and falls apart at scale. These aren't bugs - they're constraints that quietly close off entire solution spaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec critic&lt;/strong&gt; catches "we solved the wrong problem" - the worst class of bug, because there's no way to unit-test for it. By the time the code works correctly, it may be doing entirely the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema critic&lt;/strong&gt; catches the migration that will lock up a 50M-row table 10 minutes after deploy. A NOT NULL column without a default. An index added without CONCURRENTLY. The kind of change that looks clean in a code review and becomes an incident.&lt;/p&gt;

&lt;p&gt;Previously, critics only appeared from the Plan stage onward. Now they cover the three positions where a mistake is most expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  llm-leash UI: 16 new features.
&lt;/h2&gt;

&lt;p&gt;llm-leash is the great_cto admin board - a local web UI that shows what your AI agents are doing, what they've spent, what passed review, and what needs your attention. Think of it as a control panel for the agent pipeline.&lt;/p&gt;

&lt;p&gt;This release adds 16 new features to the board. The most useful ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cmd-K&lt;/strong&gt; - global command palette for navigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issues subtab&lt;/strong&gt; - all security and compliance findings in one place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session timeline&lt;/strong&gt; - visual history of what happened and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topology graph&lt;/strong&gt; - shows agent dependencies. Useful when you have 5+ parallel agents running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HITL diff&lt;/strong&gt; - human-in-the-loop review of agent changes before they're applied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPA config&lt;/strong&gt; - Open Policy Agent integration for compliance rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC2 export&lt;/strong&gt; - one-click audit trail for compliance officers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule comparison&lt;/strong&gt; - compare policy versions side by side.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Companion plugins out of the box.
&lt;/h2&gt;

&lt;p&gt;A bit more detail on how the Superpowers + Beads bundle works, since the architecture is non-obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Superpowers&lt;/strong&gt; - a methodology plugin. It gives Claude Code skills: &lt;code&gt;/brainstorm&lt;/code&gt;, &lt;code&gt;/write-plan&lt;/code&gt;, &lt;code&gt;/execute-plan&lt;/code&gt;, code review workflow, TDD cycle, parallel agent execution. Without it, Claude acts on vibes. With it - on a structured plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beads&lt;/strong&gt; - a git-native task tracker. Tasks live as commits, survive session restarts, have dependencies and blockers. Claude creates and closes them autonomously as it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;great_cto&lt;/strong&gt; - the orchestration layer. It routes requests to the right agents, enforces reviewers based on archetype and jurisdiction, manages the agent pipeline.&lt;/p&gt;

&lt;p&gt;Together: you describe what needs to be done, great_cto breaks it into a plan, Beads tracks it, Superpowers enforces methodology, the right reviewer agents plug in automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx great-cto &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/great-cto" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/great-cto&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/avelikiy/great_cto" rel="noopener noreferrer"&gt;https://github.com/avelikiy/great_cto&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback and PRs welcome.&lt;/p&gt;

</description>
      <category>claudeai</category>
      <category>devtools</category>
      <category>ai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>AI Agents Work While You Sleep — Now They Can Wake You Up</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Mon, 18 May 2026 08:22:47 +0000</pubDate>
      <link>https://dev.to/great_cto/ai-agents-work-while-you-sleep-now-they-can-wake-you-up-5ea2</link>
      <guid>https://dev.to/great_cto/ai-agents-work-while-you-sleep-now-they-can-wake-you-up-5ea2</guid>
      <description>&lt;p&gt;Let me describe a Tuesday evening.&lt;/p&gt;

&lt;p&gt;I fire off &lt;code&gt;/start "refactor billing module"&lt;/code&gt;, the pipeline kicks in, six AI agents start doing their thing, and I think: great, I've got an hour. I'll cook pasta.&lt;/p&gt;

&lt;p&gt;I cook pasta. I eat pasta. I do the dishes. I put on an episode of something. I come back.&lt;/p&gt;

&lt;p&gt;The pipeline has been &lt;strong&gt;waiting for my approval for 54 minutes&lt;/strong&gt;. The senior-dev agent is sitting there, doing absolutely nothing, blocked on a &lt;code&gt;gate:plan&lt;/code&gt; that needed one click from me. Fifty-four minutes of human absence. Zero pasta to show for it.&lt;/p&gt;

&lt;p&gt;This is the core tension of running an AI pipeline: the whole point is that it works &lt;em&gt;while you're not watching&lt;/em&gt;. But the moment it needs you, it needs you immediately — and it has no way to tell you that.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we added
&lt;/h2&gt;

&lt;p&gt;Two things, both live in the board's &lt;strong&gt;Notifications&lt;/strong&gt; settings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email alerts&lt;/strong&gt; — you enter your email, click a verification link, done. From that point on, five specific events send you an email:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A P0 incident opens (the pipeline found a production fire)&lt;/li&gt;
&lt;li&gt;A gate has been waiting for your approval for more than 30 minutes&lt;/li&gt;
&lt;li&gt;A gate is actively blocking the pipeline right now&lt;/li&gt;
&lt;li&gt;Your monthly AI spend crosses the limit you set&lt;/li&gt;
&lt;li&gt;Monday morning weekly digest — what got done, what it cost, how many gates passed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Browser push notifications&lt;/strong&gt; — desktop notifications, the same kind you get from Slack or email. You enable them once in the board settings, the browser asks for permission, and that's it. No app to install. No Firebase. No account anywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why exactly these five triggers
&lt;/h2&gt;

&lt;p&gt;The first version of this feature had fifteen triggers. It was immediately annoying. Every time an agent sneezed, my phone buzzed.&lt;/p&gt;

&lt;p&gt;The honest answer to "what do I actually need to know about right now" is surprisingly short:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Something broke.&lt;/strong&gt; Not "an agent is running" or "a review started" — the actual situation where production is on fire and the pipeline found it before I did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'm blocking progress.&lt;/strong&gt; The pipeline stopped and is waiting for me. The longer I wait, the more time I've wasted running all those agents. If it's been 30 minutes, I definitely didn't see the gate notification in the terminal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'm about to overspend.&lt;/strong&gt; LLM costs are real. A runaway pipeline on a big refactor can quietly rack up $20–30 if nobody's watching. A cost alert at $15 is much better than discovering $40 on the invoice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly summary.&lt;/strong&gt; Not urgent, but useful. Monday morning coffee + what your AI team accomplished this week = a surprisingly good way to start the day.&lt;/p&gt;

&lt;p&gt;That's the whole list. No alerts for "agent started", "agent completed", "reviewer disagreed with another reviewer", or any of the other events that feel important but mostly produce noise.&lt;/p&gt;
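
&lt;p&gt;As a rough illustration of how small that trigger set is, the state-driven checks fit in one function. The state shape, field names, and alert labels below are hypothetical, invented for this sketch; the board's real data model is not shown in this post. The weekly digest is schedule-driven, so it is left out:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch of the state-driven triggers; every field name is an assumption.
const STALE_GATE_MS = 30 * 60 * 1000; // "waiting for more than 30 minutes"

function dueAlerts(state, now) {
  const alerts = [];
  if (state.openP0Incidents &gt; 0) alerts.push("p0-incident");
  for (const gate of state.pendingGates) {
    if (gate.blocking) alerts.push("gate-blocking:" + gate.name);
    if (now - gate.waitingSince &gt; STALE_GATE_MS) alerts.push("gate-stale:" + gate.name);
  }
  if (state.monthSpendUsd &gt; state.monthLimitUsd) alerts.push("spend-limit");
  return alerts;
}

// The pasta scenario: a blocking plan gate that has been waiting for 54 minutes.
const pastaEvening = {
  openP0Incidents: 0,
  pendingGates: [{ name: "plan", waitingSince: 0, blocking: true }],
  monthSpendUsd: 9,
  monthLimitUsd: 15,
};
console.log(dueAlerts(pastaEvening, 54 * 60 * 1000));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;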




&lt;h2&gt;
  
  
  The email setup
&lt;/h2&gt;

&lt;p&gt;Open the board → Settings → Notifications. Enter your email. You get one verification email — click the link, and you're verified forever. No repeated sign-ins, no token rotation, no dashboard at some third-party service.&lt;/p&gt;

&lt;p&gt;Under the hood the board sends emails through a relay we run at greatcto.systems. We're using Resend on the free tier (100 emails/day), which is more than enough for a solo developer who isn't actively burning the place down.&lt;/p&gt;

&lt;p&gt;Why a relay and not direct SMTP? Because storing an SMTP password inside a local tool that lives on your laptop is a disaster waiting to happen. The relay holds the credentials; your board just sends an HTTPS request. If the relay is down, you miss a notification. That's fine. You're not building a hospital.&lt;/p&gt;




&lt;h2&gt;
  
  
  The push notifications
&lt;/h2&gt;

&lt;p&gt;This one was more fun to build.&lt;/p&gt;

&lt;p&gt;Browser push notifications sound simple — they're everywhere, every website pesters you with them — but implementing them correctly from scratch is genuinely involved. There's a spec called VAPID that requires signing cryptographic tokens with elliptic curve keys, and basically every tutorial says "just use the &lt;code&gt;web-push&lt;/code&gt; npm package."&lt;/p&gt;

&lt;p&gt;We couldn't. The board server is intentionally zero-dependency — no npm packages, no node_modules, nothing. It's a single file that you can read start to finish in an afternoon. Adding a library for notifications would mean adding a library for notifications &lt;em&gt;and everything that library depends on&lt;/em&gt;, which is how you end up with 47 packages installed to send one HTTP request.&lt;/p&gt;

&lt;p&gt;So we implemented it ourselves using only what Node.js ships with.&lt;/p&gt;

&lt;p&gt;The fun gotcha: somewhere deep in the VAPID spec, the signature format that Node produces by default is &lt;em&gt;not&lt;/em&gt; the format that browsers expect. One is DER-encoded (an old ASN.1 format from the 80s), the other is just raw bytes. Our first test push hit the browser's push service and got a 401 back. Fifteen minutes of reading specs later, we found the conversion, fixed it, and every subsequent push worked perfectly.&lt;/p&gt;
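
&lt;p&gt;A small sketch of that mismatch using only Node built-ins. This is not the board's actual code; it shows one way to get the raw form, via the &lt;code&gt;dsaEncoding: "ieee-p1363"&lt;/code&gt; option of &lt;code&gt;crypto.sign&lt;/code&gt;. The key pair and the signing input are throwaway placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const crypto = require("node:crypto");

// Throwaway P-256 key pair for the demo; a real VAPID setup keeps a long-lived pair.
const { privateKey, publicKey } = crypto.generateKeyPairSync("ec", { namedCurve: "prime256v1" });

// Placeholder input; a real VAPID JWT signs base64url(header) + "." + base64url(claims).
const signingInput = Buffer.from("header.claims");

// Default output: an ASN.1 DER SEQUENCE of two integers, so the length varies.
const derSignature = crypto.sign("sha256", signingInput, privateKey);

// ES256 in a JWT expects the raw r || s concatenation: exactly 64 bytes for P-256.
const rawSignature = crypto.sign("sha256", signingInput, {
  key: privateKey,
  dsaEncoding: "ieee-p1363",
});

// The raw signature verifies as long as the same encoding is declared on verify.
const rawIsValid = crypto.verify(
  "sha256",
  signingInput,
  { key: publicKey, dsaEncoding: "ieee-p1363" },
  rawSignature
);

console.log(derSignature[0].toString(16)); // 30, the DER SEQUENCE tag
console.log(rawSignature.length); // 64
console.log(rawIsValid); // true
console.log(rawSignature.toString("base64url")); // the form that goes into the JWT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sending the DER bytes where the raw 64 bytes are expected produces a token the push service cannot verify, which matches the rejected first push described above.&lt;/p&gt;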

&lt;p&gt;The end result: you toggle the switch in the board, your browser asks "Allow notifications?", you click Allow, and from that point on your desktop shows a native notification whenever any of the five triggers fire. Same notification you'd get from a Slack message. No third-party service involved.&lt;/p&gt;




&lt;h2&gt;
  
  
  The notification drawer
&lt;/h2&gt;

&lt;p&gt;Push and email are for when you're away from the keyboard. The &lt;strong&gt;in-app drawer&lt;/strong&gt; is for when you're at the keyboard but not watching the terminal.&lt;/p&gt;

&lt;p&gt;Click the bell icon in the top nav. A panel slides out with the last 20 notifications — what fired, when, and whether you've seen it. Unread count shows as a badge on the bell. "Mark all read" lives right there.&lt;/p&gt;

&lt;p&gt;The history persists across restarts. Close the board, reopen it tomorrow morning, your Monday digest is still there.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx great-cto init
npx great-cto board
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to &lt;strong&gt;Settings → Notifications&lt;/strong&gt;. Add an email or enable push (or both). Then just work normally. If a gate waits too long, you'll hear about it.&lt;/p&gt;

&lt;p&gt;The whole thing is open source: &lt;a href="https://github.com/avelikiy/great_cto" rel="noopener noreferrer"&gt;github.com/avelikiy/great_cto&lt;/a&gt; (MIT, free, you pay your own LLM provider).&lt;/p&gt;




&lt;p&gt;The pasta incident has not repeated since.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>webpush</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Real cost breakdown: 10 packs, $0.60 LLM bill, $42K saved per regulated feature</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Sun, 17 May 2026 14:29:04 +0000</pubDate>
      <link>https://dev.to/great_cto/real-cost-breakdown-10-packs-060-llm-bill-42k-saved-per-regulated-feature-10ol</link>
      <guid>https://dev.to/great_cto/real-cost-breakdown-10-packs-060-llm-bill-42k-saved-per-regulated-feature-10ol</guid>
      <description>&lt;p&gt;This is the numbers post. If you read &lt;a href="https://dev.to/blog/ten-compliance-packs-for-ten-regulated-industries"&gt;the ten-packs deep-dive&lt;/a&gt; and walked away wanting the spreadsheet, here it is.&lt;/p&gt;

&lt;p&gt;All numbers below are from real client engagements (anonymized aggregates) plus telemetry from the GreatCTO install base. Not projections. Not vendor-pitch math.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-feature: the $42K → $0.60 + ~30 hours of human review
&lt;/h2&gt;

&lt;p&gt;A single regulated feature in a single industry. Pre-pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Identify which regs apply          ~8h    × $200      = $1,600
Read primary regulation text      ~14h    × $200      = $2,800
Map regulation → stack            ~20h    × $250      = $5,000
Draft threat model                ~32h    × $250      = $8,000
Consent flow + UX                 ~20h    × $180      = $3,600
Implementation                    ~40h    × $180      = $7,200
Internal legal review              ~8h    × $400      = $3,200
External auditor pre-meeting      ~10h    × $350      = $3,500
Revisions                         ~16h    × mixed     = $3,500
Final signoff                      ~4h    × $400      = $1,600
                                  ─────                ─────
                                  ~172h               ~$40K
                                                      (rounded $42K with overhead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM compute (architect+reviewers)  ~$0.60-$1.40 per feature
Human review of LLM output         ~14-18h × mixed     ~$3,800
External auditor pre-meeting       ~6-8h   (lower because tighter document)
Internal legal                     ~8h     (unchanged)
                                   ─────                ─────
                                   ~28-34h              ~$11-14K
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Net saved per feature: ~$28-30K and ~140 hours of human time.&lt;/strong&gt; The LLM bill is a rounding error.&lt;/p&gt;

&lt;p&gt;The $0.60 number is &lt;strong&gt;per feature, not per MVP&lt;/strong&gt;. Some readers conflated these. A small fintech feature on Claude Sonnet costs ~$0.60-$1.40 in LLM calls. A full MVP run with all 10 packs activated and ~30 features costs ~$500-$1,500 in LLM compute. Both numbers are honest; they describe different scopes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-MVP: $287K → $128K (~55% reduction)
&lt;/h2&gt;

&lt;p&gt;A voice-AI MVP, three months of work, traditional team composition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 Product Manager × 3 months × $180/h × 120h/mo  = $64,800
4 Engineers × 3 months × $180/h × 140h/mo        = $302,400
Architecture work (internal or fractional CTO)   = ~$20,000
Security review (external)                       = ~$15,000
Compliance setup (consultant + internal time)    = ~$28,000
Misc (PM tools, hosting trial, design)           = ~$8,000
                                                   ─────────
                                                   ~$438K nominal
                                                   ~$287K after overlap &amp;amp; efficient teaming
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With pipeline + agentic SDLC, same MVP, 6-8 weeks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 Product Manager × 2 months × $180/h × 120h/mo  = $43,200
2 Engineers × 2 months × $180/h × 140h/mo        = $100,800
LLM compute across the whole run                 = ~$1,200
Architecture review (1 sr human, 3 sessions)     = ~$3,000
Security review (external, same)                 = ~$15,000 (unchanged; see "what doesn't compress")
Compliance setup (pipeline output + ~12h review) = ~$5,500
Misc                                             = ~$8,000
                                                   ─────────
                                                   ~$176K nominal
                                                   ~$128K after similar overlap savings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Net: ~$159K saved per MVP, ~45% time saved.&lt;/strong&gt; Most of the saving is &lt;em&gt;not&lt;/em&gt; the LLM bill — it is fewer engineer-months because senior-dev parallelism + auto-review compresses the build phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-quarter / per-runway: the bet that changes
&lt;/h2&gt;

&lt;p&gt;For a founder shipping into one regulated industry (most realistic scenario):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;Pipeline&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MVP time&lt;/td&gt;
&lt;td&gt;3 months&lt;/td&gt;
&lt;td&gt;6-8 weeks&lt;/td&gt;
&lt;td&gt;~1.5 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MVP cost&lt;/td&gt;
&lt;td&gt;$287K&lt;/td&gt;
&lt;td&gt;$128K&lt;/td&gt;
&lt;td&gt;$159K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance setup (4 features, year 1)&lt;/td&gt;
&lt;td&gt;$168K&lt;/td&gt;
&lt;td&gt;$48K&lt;/td&gt;
&lt;td&gt;$120K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Year 1 total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$455K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$176K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$279K&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Equivalent runway months @$50K burn&lt;/td&gt;
&lt;td&gt;9.1 mo&lt;/td&gt;
&lt;td&gt;3.5 mo&lt;/td&gt;
&lt;td&gt;5.6 months recovered&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
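
&lt;p&gt;The Year 1 rows can be re-derived from the per-MVP and per-feature figures. In this sketch the compliance lines are read as 4 features at ~$42K each (traditional) and ~$12K each (pipeline, inside the ~$11-14K range quoted earlier); that per-feature split is our assumption, the table only gives the totals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Re-derive the Year 1 totals and the runway equivalent at a $50K monthly burn.
const BURN_USD_PER_MONTH = 50000;

const rows = {
  traditional: { mvp: 287000, compliance: 4 * 42000 },
  pipeline: { mvp: 128000, compliance: 4 * 12000 }, // assumed ~$12K per feature
};

function yearOne(row) {
  const total = row.mvp + row.compliance;
  return { total, runwayMonths: total / BURN_USD_PER_MONTH };
}

const traditional = yearOne(rows.traditional);
const pipeline = yearOne(rows.pipeline);

console.log(traditional.total, pipeline.total); // 455000 176000
console.log(traditional.total - pipeline.total); // 279000
console.log((traditional.runwayMonths - pipeline.runwayMonths).toFixed(1)); // 5.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;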

&lt;p&gt;For a founder shipping into 10 industries (hypothetical "compliance-heavy AI products" portfolio):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;Pipeline&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Year 1 (10 MVPs × overlap)&lt;/td&gt;
&lt;td&gt;$1.45M&lt;/td&gt;
&lt;td&gt;$580K&lt;/td&gt;
&lt;td&gt;$870K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wall-clock (sequential)&lt;/td&gt;
&lt;td&gt;30 months&lt;/td&gt;
&lt;td&gt;10 months&lt;/td&gt;
&lt;td&gt;20 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wall-clock (with parallelism)&lt;/td&gt;
&lt;td&gt;21 months&lt;/td&gt;
&lt;td&gt;7 months&lt;/td&gt;
&lt;td&gt;14 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 10-industry case is hypothetical — no real founder ships into all 10 simultaneously. But it shows the structural ratio: roughly &lt;strong&gt;60% cost reduction, roughly 67% wall-clock reduction&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM compute: where the money goes
&lt;/h2&gt;

&lt;p&gt;Per-MVP LLM compute, ~$500-$1,500 total, breaks down roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;senior-dev × 4-8 features            ~70%     (code-writing is expensive)
architect (per-feature ARCH.md)      ~12%
specialist reviewers (5 per feature) ~10%     (verdicts are cheap)
pm (decomposition)                   ~3%
qa-engineer (test scaffolds)         ~3%
detection + memory + misc            ~2%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reviewers are roughly 10% of cost despite being 5 of the 8 agents that run. They output verdicts, not code. If your LLM cost is exploding, look at how much code is being &lt;em&gt;generated&lt;/em&gt;, not how many agents are running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware / model-choice ratios
&lt;/h2&gt;

&lt;p&gt;We tested Sonnet 4.6 vs Haiku 4.5 vs Opus 4.5 on the same 23-feature batch:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;LLM cost ratio&lt;/th&gt;
&lt;th&gt;Wall-clock ratio&lt;/th&gt;
&lt;th&gt;architect output quality (human eval, blind)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;0.31×&lt;/td&gt;
&lt;td&gt;0.74×&lt;/td&gt;
&lt;td&gt;"noticeably worse" — 4 of 23 ARCH docs unusable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sonnet 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0× (baseline)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0×&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;acceptable, default&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.5&lt;/td&gt;
&lt;td&gt;5.1×&lt;/td&gt;
&lt;td&gt;1.27×&lt;/td&gt;
&lt;td&gt;"marginally better" — 1 ARCH doc clearly superior&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Conclusion: &lt;strong&gt;Sonnet for everything except deep-reasoning architecture decisions.&lt;/strong&gt; Use Opus only for &lt;code&gt;architect&lt;/code&gt; on greenfield features in unfamiliar territory. Haiku for high-volume worker agents (pair programming, code generation) where the ARCH note is not on the critical path.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does NOT compress
&lt;/h2&gt;

&lt;p&gt;I have called this out before, but here it is item by item:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Compressible?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;External audit cycle (NYC bias auditor, 2-4 weeks)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FDA pre-submission meeting (60-90 days)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IRB approval (clinical trials, 8-12 weeks)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wet-lab validation (drug discovery)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HARA signature (functional safety, 1 calendar moment)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lawyer reading the threat model&lt;/td&gt;
&lt;td&gt;Compresses (LLM-written threat model is faster to read than human-written long-form)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulator phone calls&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Anything that requires another organization's calendar runs at human speed. Internal work compresses 5-25×. External-dependency work does not move.&lt;/p&gt;

&lt;p&gt;For an early-stage AI startup on 18-24 month runway, the bet that changes is the &lt;em&gt;internal&lt;/em&gt; portion. You can now run 3 external compliance cycles per year instead of 1.5, because the internal prep for each one compressed from six weeks to ten days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I underbet
&lt;/h2&gt;

&lt;p&gt;When I started building the packs, I assumed the ROI claim would be "30-40% on compliance cost." The number ended up larger and the shape surprised me — most of the saving is not the LLM compute (it is rounding error) but the &lt;em&gt;fewer engineering-months&lt;/em&gt; the parallelism enables, plus the &lt;em&gt;fewer consulting hours&lt;/em&gt; the LLM-drafted threat model enables.&lt;/p&gt;

&lt;p&gt;If you take one thing from this post: &lt;strong&gt;the LLM compute is not the moat. The pipeline that runs the agents in parallel, gates the right humans at the right scope, and persists memory across incidents is the moat.&lt;/strong&gt; The LLM is the substrate.&lt;/p&gt;




&lt;p&gt;About: I build &lt;a href="https://greatcto.systems" rel="noopener noreferrer"&gt;GreatCTO&lt;/a&gt; — a multi-agent SDLC plugin for Claude Code with 10 compliance packs. MIT, runs locally. Pay your own LLM API. Per-pack numbers (which 10 industries, what each pack does, real consulting-rate comparisons) are in &lt;a href="https://dev.to/blog/ten-compliance-packs-for-ten-regulated-industries"&gt;the W21 deep-dive&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>compliance</category>
      <category>cost</category>
      <category>roi</category>
    </item>
    <item>
      <title>The MTTR -94% claim, with receipts</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Sun, 17 May 2026 14:28:21 +0000</pubDate>
      <link>https://dev.to/great_cto/the-mttr-94-claim-with-receipts-4ncl</link>
      <guid>https://dev.to/great_cto/the-mttr-94-claim-with-receipts-4ncl</guid>
      <description>&lt;p&gt;Earlier posts cite a "median MTTR drop of 94.1% across 47 paired P0 incidents." This post is the receipts. The full methodology is also in &lt;a href="https://github.com/avelikiy/great_cto/blob/main/docs/benchmarks/MTTR.md" rel="noopener noreferrer"&gt;docs/benchmarks/MTTR.md&lt;/a&gt; — this post explains why the number is what it is, and the four cases it does not capture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What got measured
&lt;/h2&gt;

&lt;p&gt;Setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12 production repositories (mix of fintech, voice-AI, clinical, dev-tools).&lt;/li&gt;
&lt;li&gt;P0 incident defined as: user-facing, paged a human, took ≥15 minutes to resolve.&lt;/li&gt;
&lt;li&gt;Window: rolling 6 months. Pre-treatment + post-treatment.&lt;/li&gt;
&lt;li&gt;Treatment: the project installed GreatCTO and started persisting &lt;code&gt;(pattern_hash, detection_order_that_worked, rationale)&lt;/code&gt; after each P0 resolved.&lt;/li&gt;
&lt;li&gt;Outcome: time from page to root-cause identified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I measured detection time, not full resolution time. Resolution depends on rollout speed, blast radius, customer comms — too many confounds. Detection time is the part where memory could conceivably help, and it is the part where humans burn the most calendar hours on recurring bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number
&lt;/h2&gt;

&lt;p&gt;47 paired incidents. "Paired" means: same shape (same &lt;code&gt;pattern_hash&lt;/code&gt;) seen at least twice across the 6-month window, once before persistence, once after.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stat&lt;/th&gt;
&lt;th&gt;Pre&lt;/th&gt;
&lt;th&gt;Post&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Median detection time&lt;/td&gt;
&lt;td&gt;178 min&lt;/td&gt;
&lt;td&gt;11 min&lt;/td&gt;
&lt;td&gt;−94.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean detection time&lt;/td&gt;
&lt;td&gt;224 min&lt;/td&gt;
&lt;td&gt;17 min&lt;/td&gt;
&lt;td&gt;−92.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90th percentile&lt;/td&gt;
&lt;td&gt;412 min&lt;/td&gt;
&lt;td&gt;41 min&lt;/td&gt;
&lt;td&gt;−90.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst case (post)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;89 min&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best case (post)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;4 min&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The distribution is skewed by a couple of near-100% cases (postgres pool exhaustion and a connection-string typo that the agent matched to a prior incident's commit diff and flagged in under 5 minutes). I report the median because it is less misleading than the mean for skewed distributions. The 90th percentile is probably the number you should care about: 412 min down to 41 min is the "still roughly 10× faster on the bad cases" claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the mechanism works
&lt;/h2&gt;

&lt;p&gt;The agent stores, for each resolved incident:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;pattern_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;sha256(normalized_log_signature + topology_hint)&lt;/span&gt;
&lt;span class="na"&gt;detection_order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_pg_pool_size"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_connections"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_query_count"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;rationale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connection_refused&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pool&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;80%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;utilization&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;→&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pool&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exhaustion,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;network"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a new incident, the agent's &lt;em&gt;Step 0&lt;/em&gt; is: hash the current incident's signature and look it up in &lt;code&gt;~/.great_cto/incident_memory.jsonl&lt;/code&gt;. On a pattern hit, it tries the prior &lt;code&gt;detection_order&lt;/code&gt; first. If that identifies the root cause, it logs "memory hit." If it does not, it falls back to systematic exploration.&lt;/p&gt;

&lt;p&gt;There is no inference. The agent is not "smarter" — it is just skipping hypothesis exploration time because someone (you, last time) already paid for that exploration.&lt;/p&gt;
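
&lt;p&gt;A minimal Python sketch of that Step 0 lookup, for readers who want to see the shape of it. This is not the GreatCTO source. The function names are hypothetical, and the file layout (one JSON object per line, carrying the three fields listed above) is an assumption based on the &lt;code&gt;.jsonl&lt;/code&gt; extension.&lt;/p&gt;

```python
import hashlib
import json
from pathlib import Path


def pattern_hash(log_signature: str, topology_hint: str) -> str:
    """sha256 over an already-normalized log signature plus the topology hint."""
    return hashlib.sha256((log_signature + topology_hint).encode("utf-8")).hexdigest()


def remember(memory_file: Path, record: dict) -> None:
    """Append one resolved incident as a single JSON line (assumed file layout)."""
    with memory_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def recall(memory_file: Path, incident_hash: str) -> list:
    """Return the most recent detection_order stored for this hash, or [] on a miss."""
    if not memory_file.exists():
        return []
    found = []
    with memory_file.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("pattern_hash") == incident_hash:
                found = record.get("detection_order", [])  # later entries win
    return found
```

&lt;p&gt;An empty list from &lt;code&gt;recall&lt;/code&gt; corresponds to "no pattern hit": the caller goes straight to systematic exploration. A non-empty list is the order of checks to try first.&lt;/p&gt;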

&lt;h2&gt;
  
  
  ⚠ The 4 honest misses
&lt;/h2&gt;

&lt;p&gt;Memory-based detection is not magic. Four of the 47 cases had pattern matches that pointed in the wrong direction and burned 10-30 minutes before the agent gave up and fell back to systematic exploration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Miss #1.&lt;/strong&gt; Pattern matched on log signature "OOMKilled in worker pool." Prior detection order was "check worker memory limits." Reality: this time, the OOM was a memory leak in a &lt;em&gt;different&lt;/em&gt; worker that pushed the wrong worker over its limit. Agent spent 18 minutes confirming the wrong worker's limits before noticing the leak. Total detection time: 34 minutes vs ~80 minutes baseline. Net positive but ugly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Miss #2.&lt;/strong&gt; Pattern matched "5xx spike from API gateway." Prior cause was upstream DB lag. Reality: this time it was a misconfigured rate-limiter that started rejecting requests after a deploy. Agent ran "check DB lag" for 12 minutes before pivoting. 28 minutes total vs ~140 baseline. Still a win, but called a "miss" because the prior path was wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Miss #3.&lt;/strong&gt; Pattern matched "auth failures after deploy." Prior cause was OAuth client secret rotation. Reality: a clock skew on one node caused JWT signature validation to fail. Agent's prior detection order led it through token store inspection first. 41 minutes total vs ~200 baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Miss #4.&lt;/strong&gt; Worst case. Pattern matched "DNS resolution failures." Prior detection order was "check Route 53 health checks." Reality: a third-party CDN had an outage. The agent's path was completely wrong, did not give up early enough, and a human had to manually override at minute 22. 89 minutes total vs ~150 baseline. Win on absolute time, but I would not call this a "memory worked" case.&lt;/p&gt;

&lt;p&gt;If I report the 47 cases as "94.1% median drop," I owe the audience the 4 cases where the mechanism worked badly. They are 8.5% of the sample. The remaining 91.5% of cases saw memory either help significantly (74%) or be irrelevant (no pattern hit, fell straight to systematic exploration — 17%).&lt;/p&gt;

&lt;h2&gt;
  
  
  How to replicate in your own repo
&lt;/h2&gt;

&lt;p&gt;Three steps, no GreatCTO required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Persist incident memory.&lt;/strong&gt; After each P0 resolves, write &lt;code&gt;(pattern_hash, detection_order, rationale)&lt;/code&gt; to a markdown file in your repo. Plain text. Git-trackable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At incident start, ask your agent to read that file before doing anything else.&lt;/strong&gt; Even Claude Code with no plugins will use the file if you point it at one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track detection time.&lt;/strong&gt; Page-to-RC-identified, in minutes. Spreadsheet is fine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Run for one quarter. If you see a consistent reduction in detection time on recurring patterns, you have your own version of this mechanism. If you do not see reduction, your incidents are too unique or your pattern hash is too coarse.&lt;/p&gt;

&lt;p&gt;The hash I use is &lt;code&gt;sha256(top_3_log_lines_normalized + topology_hint)&lt;/code&gt; where &lt;code&gt;topology_hint&lt;/code&gt; is the service name. This gets ~70% recall on similar incidents and very few false hits. You can tune for your domain.&lt;/p&gt;
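
&lt;p&gt;One way to build such a hash, sketched in Python. The normalization rules here (collapse timestamps, long hex ids, and numbers into placeholder tokens) are my assumption about what "normalized" could mean; the post does not spell them out, and &lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;incident_hash&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import hashlib
import re

# Assumed normalization rules: anything that changes between two occurrences
# of the same incident is replaced by a fixed token before hashing.
_VOLATILE = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[t ]\d{2}:\d{2}:\d{2}\S*"), "TS"),  # ISO-like timestamps
    (re.compile(r"\b[0-9a-f]{8,}\b"), "HEX"),  # request ids, commit hashes
    (re.compile(r"\d+"), "N"),  # any remaining numbers
]


def normalize(line: str) -> str:
    """Lowercase a log line and replace volatile fragments with tokens."""
    line = line.strip().lower()
    for pattern, token in _VOLATILE:
        line = pattern.sub(token, line)
    return re.sub(r"\s+", " ", line)


def incident_hash(log_lines: list, topology_hint: str) -> str:
    """sha256 over the first three normalized log lines plus the service name."""
    top3 = [normalize(entry) for entry in log_lines[:3]]
    payload = "\n".join(top3) + topology_hint
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

&lt;p&gt;Two pages with the same log shape but different timestamps, request ids, and counters then hash to the same value, while the same log shape in a different service does not. Making the rules more aggressive raises recall and also raises the odds of a false hit, which is the tuning knob mentioned above.&lt;/p&gt;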

&lt;h2&gt;
  
  
  What I will not do
&lt;/h2&gt;

&lt;p&gt;Some readers ask for the raw data (anonymized incidents). I will not publish it — even anonymized, customers can be re-identified from incident shapes and timing. I will share the synthetic test cases in &lt;code&gt;tests/incident_memory.test.mjs&lt;/code&gt; and the aggregate statistics in &lt;code&gt;docs/benchmarks/MTTR.md&lt;/code&gt;. That is enough to verify the mechanism without leaking client incident data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;Not an RCT. Observational. Twelve repos is small. The selection bias is real: the repos that adopted GreatCTO early were also the ones with the best L3 culture. A worse team might see a 30% drop instead of 94%.&lt;/p&gt;

&lt;p&gt;The number I would defend to your board: &lt;strong&gt;on recurring incident patterns, memory-driven detection compresses detection time by 5-10× median, with a long tail of near-zero-improvement cases.&lt;/strong&gt; That is more honest than "94%." But "94%" is what shows up in the data.&lt;/p&gt;




&lt;p&gt;About: I build &lt;a href="https://greatcto.systems" rel="noopener noreferrer"&gt;GreatCTO&lt;/a&gt; — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Memory layer source is in &lt;a href="https://github.com/avelikiy/great_cto/blob/main/packages/cli/src/main.ts" rel="noopener noreferrer"&gt;packages/cli/src/memory.ts&lt;/a&gt;. The full benchmark methodology is at &lt;a href="https://github.com/avelikiy/great_cto/blob/main/docs/benchmarks/MTTR.md" rel="noopener noreferrer"&gt;docs/benchmarks/MTTR.md&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mttr</category>
      <category>operations</category>
    </item>
    <item>
      <title>Three days of code, six weeks of compliance — the math behind why</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Sun, 17 May 2026 14:28:15 +0000</pubDate>
      <link>https://dev.to/great_cto/three-days-of-code-six-weeks-of-compliance-the-math-behind-why-48g8</link>
      <guid>https://dev.to/great_cto/three-days-of-code-six-weeks-of-compliance-the-math-behind-why-48g8</guid>
      <description>&lt;p&gt;If you have shipped into a regulated industry, you know this ratio. Engineering ships a feature in three days. Compliance setup around the feature takes six weeks. Some founders get used to it. The right reaction is: &lt;em&gt;the ratio is the bug.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post is for the CEO / CTO who reads &lt;a href="https://dev.to/blog/ten-compliance-packs-for-ten-regulated-industries"&gt;"What $1.4M of compliance work looks like in 14 hours"&lt;/a&gt; and wants to understand the mechanism — why six weeks specifically, and where in those weeks an LLM can save time without anyone getting sued.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the six weeks actually go
&lt;/h2&gt;

&lt;p&gt;I priced this out properly the last three times I lived it as a CTO-for-hire. Numbers below are typical for a voice-AI or fintech feature shipping in 2025-2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Median hours&lt;/th&gt;
&lt;th&gt;Hourly rate&lt;/th&gt;
&lt;th&gt;Subtotal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Identify which regulations apply&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;$200 (senior legal)&lt;/td&gt;
&lt;td&gt;$1,600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read primary regulation text&lt;/td&gt;
&lt;td&gt;12-16&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;~$2,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Map regulation → your stack&lt;/td&gt;
&lt;td&gt;16-24&lt;/td&gt;
&lt;td&gt;$250 (compliance consultant)&lt;/td&gt;
&lt;td&gt;~$5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Draft threat model&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;$250&lt;/td&gt;
&lt;td&gt;$8,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Draft consent flow + UX changes&lt;/td&gt;
&lt;td&gt;16-24&lt;/td&gt;
&lt;td&gt;$180 (senior PM + senior frontend)&lt;/td&gt;
&lt;td&gt;$3,600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implement consent + audit log&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;$180&lt;/td&gt;
&lt;td&gt;$7,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal legal review of threat model&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;$400 (general counsel)&lt;/td&gt;
&lt;td&gt;$3,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External auditor pre-meeting + Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;$350 (specialist)&lt;/td&gt;
&lt;td&gt;$3,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revisions, second pass&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;mixed&lt;/td&gt;
&lt;td&gt;~$3,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final sign-off&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$400&lt;/td&gt;
&lt;td&gt;$1,600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~190 hours&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;mixed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$42,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is a single regulated feature. Multi-jurisdictional (US + EU + India + state-level US) doubles or triples it. Multi-feature (a startup shipping into a regulated industry has 8-15 such features in the first six months) makes the aggregate $300K-$500K of consulting before the product exists in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where an LLM helps
&lt;/h2&gt;

&lt;p&gt;Not all of those 190 hours are equal. Some are mechanical, some require judgment, some require relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanical (can be 80-90% automated):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading primary regulation text. The CFR is plain text. The EU AI Act Annex III is plain text. LLMs read 200 pages faster than any human can think. &lt;strong&gt;Replaces ~12-16 hours.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Mapping regulation to stack. "Does our PCI-DSS scope include the webhook signature verifier?" is a deterministic question with a regex-and-citation answer. &lt;strong&gt;Replaces ~12-18 hours of the 16-24.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Drafting threat model. Each pack has a 200-word template (down from my first 800-word version — auditors politely asked for shorter). LLM fills it in using regulation text + your ARCH.md. &lt;strong&gt;Replaces ~24-28 hours of the 32.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Generating evidence artifacts (decision logs, gate signoffs, audit trail). The pipeline emits these as side effects, not as a separate phase. &lt;strong&gt;Replaces ~6-8 hours.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Judgment (the human call stays; the drafting around it shrinks):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify which regulations apply. Mostly mechanical, but the "is this an edge case" call is human. &lt;strong&gt;Reduces from 8h to ~2-3h of review.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Drafting consent flow UX. Pure product judgment. The LLM writes a &lt;em&gt;first pass&lt;/em&gt; you can react to in 15 minutes instead of authoring from scratch in 4 hours. &lt;strong&gt;Reduces from 16-24h to ~4-6h.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Implementation. Coding is faster with LLM assistance, but the gates are real. &lt;strong&gt;Reduces from 40h to ~10-15h.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Relationship (cannot be automated, and pretending otherwise is malpractice):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal legal review. Your GC has to sign. Their time is your time. &lt;strong&gt;Unchanged at 8h.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;External auditor pre-meeting. The auditor wants a &lt;em&gt;human&lt;/em&gt; on the other end of the phone who can defend the threat model under questioning. The LLM-generated threat model is the document the auditor reads. The conversation about it is yours. &lt;strong&gt;Nominally still 10h, but the auditor reads a tighter document faster, so call it 6-8h net.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;New math:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Old&lt;/th&gt;
&lt;th&gt;New&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Identify regs&lt;/td&gt;
&lt;td&gt;8h&lt;/td&gt;
&lt;td&gt;2-3h&lt;/td&gt;
&lt;td&gt;~6h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read regs&lt;/td&gt;
&lt;td&gt;12-16h&lt;/td&gt;
&lt;td&gt;1-2h&lt;/td&gt;
&lt;td&gt;~13h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Map to stack&lt;/td&gt;
&lt;td&gt;16-24h&lt;/td&gt;
&lt;td&gt;3-4h&lt;/td&gt;
&lt;td&gt;~17h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threat model&lt;/td&gt;
&lt;td&gt;32h&lt;/td&gt;
&lt;td&gt;4-6h&lt;/td&gt;
&lt;td&gt;~27h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consent UX&lt;/td&gt;
&lt;td&gt;16-24h&lt;/td&gt;
&lt;td&gt;4-6h&lt;/td&gt;
&lt;td&gt;~15h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation&lt;/td&gt;
&lt;td&gt;40h&lt;/td&gt;
&lt;td&gt;10-15h&lt;/td&gt;
&lt;td&gt;~28h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal legal&lt;/td&gt;
&lt;td&gt;8h&lt;/td&gt;
&lt;td&gt;8h&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External auditor&lt;/td&gt;
&lt;td&gt;10h&lt;/td&gt;
&lt;td&gt;6-8h&lt;/td&gt;
&lt;td&gt;~3h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revisions&lt;/td&gt;
&lt;td&gt;16h&lt;/td&gt;
&lt;td&gt;6-8h&lt;/td&gt;
&lt;td&gt;~9h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final signoff&lt;/td&gt;
&lt;td&gt;4h&lt;/td&gt;
&lt;td&gt;4h&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~190h&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~50-65h&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~125-140h&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Wall-clock compresses from &lt;strong&gt;six weeks to about ten working days&lt;/strong&gt;, partly because work was removed and partly because the work that remains can run in parallel (the LLM drafts while the auditor pre-meeting is being scheduled).&lt;/p&gt;

&lt;p&gt;Cost compresses from ~$42K to ~$15-18K (LLM bill ~$50-150, human time the rest). Median compression I have measured: &lt;strong&gt;~60% on cost, ~67% on wall-clock&lt;/strong&gt;.&lt;/p&gt;
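
&lt;p&gt;For anyone who wants to check the percentages, the arithmetic is short. The sketch below assumes five-day weeks, so six weeks is 30 working days; &lt;code&gt;reduction&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
def reduction(old: float, new: float) -> float:
    """Percentage reduction from old to new, rounded to one decimal."""
    return round(100 * (old - new) / old, 1)


# Cost: ~$42K down to ~$15-18K, so compute both ends of the range.
cost_low = reduction(42_000, 18_000)
cost_high = reduction(42_000, 15_000)

# Wall-clock: six weeks (assumed 30 working days) down to 10 working days.
wall_clock = reduction(30, 10)

print(cost_low, cost_high, wall_clock)  # 57.1 64.3 66.7
```

&lt;p&gt;The cost range brackets the ~60% figure, and the wall-clock result matches the ~67%.&lt;/p&gt;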

&lt;h2&gt;
  
  
  Why this is not "AI replaces compliance consultants"
&lt;/h2&gt;

&lt;p&gt;The compliance specialist of 2027 is someone who knows which regulation applies in which jurisdiction &lt;em&gt;and&lt;/em&gt; can operate a pipeline to do the reading and templating for them. Same depth of judgment. Five times the productivity.&lt;/p&gt;

&lt;p&gt;That person is going to win market share against the consultant still billing by the hour to read 200 pages of regulation. Not because their judgment is better — it is the same. Because their cost-per-judgment is one-fifth.&lt;/p&gt;

&lt;p&gt;The judgment is the moat. The reading and templating around the judgment has been commoditized. This is the same transition that happened to junior associates in law firms when document-review tools landed in 2010-2015. Senior partners did not disappear; they got faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does not compress
&lt;/h2&gt;

&lt;p&gt;External calendar time. The auditor still books two weeks out. The FDA pre-submission meeting is still 60-90 days. IRB approval is still 8-12 weeks. Internal work compresses 5-25×; external-dependency work does not move.&lt;/p&gt;

&lt;p&gt;If your runway is 18 months and you ship into a regulated industry, the realistic plan is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compress internal compliance work from 6 weeks to 10 days.&lt;/li&gt;
&lt;li&gt;Use the recovered 4 weeks to run the &lt;em&gt;external&lt;/em&gt; cycles in parallel with the next feature.&lt;/li&gt;
&lt;li&gt;End up with one external cycle per quarter, not one every two quarters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That math doubles the number of features that ship through compliance per year for the same runway. For an early-stage AI startup, that is the difference between catching the wave and missing it.&lt;/p&gt;




&lt;p&gt;About: I build &lt;a href="https://greatcto.systems" rel="noopener noreferrer"&gt;GreatCTO&lt;/a&gt; — a multi-agent SDLC plugin for Claude Code with 10 compliance packs. MIT, runs locally. The cost-by-pack breakdown is in &lt;a href="https://dev.to/blog/ten-compliance-packs-for-ten-regulated-industries"&gt;the W21 deep-dive&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>compliance</category>
      <category>startup</category>
      <category>engineeringmanagement</category>
    </item>
    <item>
      <title>How GreatCTO chooses which compliance pack to attach</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Sun, 17 May 2026 14:27:39 +0000</pubDate>
      <link>https://dev.to/great_cto/how-greatcto-chooses-which-compliance-pack-to-attach-o64</link>
      <guid>https://dev.to/great_cto/how-greatcto-chooses-which-compliance-pack-to-attach-o64</guid>
      <description>&lt;p&gt;Every time someone runs &lt;code&gt;npx great-cto init&lt;/code&gt;, the CLI has to decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What kind of project is this? (one of ~25 archetypes)&lt;/li&gt;
&lt;li&gt;Which compliance packs apply on top? (voice / clinical / fintech / lending / 6 more)&lt;/li&gt;
&lt;li&gt;Are any of those guesses wrong enough that the user will get a useless threat model and abandon the tool?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last question is what makes the detection logic interesting. Get it wrong and the first impression is "this is producing nonsense about regulations I don't care about." Get it too conservative and the user has to manually configure packs that &lt;em&gt;should&lt;/em&gt; have auto-attached, defeating the point.&lt;/p&gt;

&lt;p&gt;After four months in production, here is what works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tried first: LLM-based detection
&lt;/h2&gt;

&lt;p&gt;Original design (rejected after 2 weeks): pipe the repo's README, package.json, and top-level directory listing into Claude and ask it to classify.&lt;/p&gt;

&lt;p&gt;Problems, in order of severity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; The first run of &lt;code&gt;init&lt;/code&gt; took 12-18 seconds instead of &amp;lt;1s. Users perceived this as broken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost.&lt;/strong&gt; Roughly $0.04 per &lt;code&gt;init&lt;/code&gt;. Negligible per user, real money at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations.&lt;/strong&gt; Claude classified a Helm chart for an internal Kubernetes operator as "fintech, because the README mentions billing in the Operator's logging section." It does not. The word "billing" appeared once, describing log volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variance.&lt;/strong&gt; Same repo, same prompt, two runs: voice-AI then mlops. Probably temperature noise. Not acceptable for a decision that shapes the rest of the pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Killed it. Went to a regex-based detector. Latency dropped from 15s to 180ms. Cost dropped to $0. Variance dropped to zero.&lt;/p&gt;

&lt;p&gt;The trade-off: regex cannot read intent. It reads tokens. A repo that &lt;em&gt;says&lt;/em&gt; it does voice AI in its README but actually contains a music-recommender model will get the voice pack. That is a false positive I accept because the alternative (LLM in the loop) had its own false positives and was 80× slower.&lt;/p&gt;

&lt;h2&gt;
  
  
  The current detector
&lt;/h2&gt;

&lt;p&gt;Three signal layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — package.json dependencies.&lt;/strong&gt; &lt;code&gt;twilio&lt;/code&gt; / &lt;code&gt;livekit&lt;/code&gt; / &lt;code&gt;deepgram&lt;/code&gt; / &lt;code&gt;elevenlabs&lt;/code&gt; → voice pack. &lt;code&gt;stripe&lt;/code&gt; / &lt;code&gt;plaid&lt;/code&gt; / &lt;code&gt;dwolla&lt;/code&gt; → fintech. &lt;code&gt;tensorflow&lt;/code&gt; / &lt;code&gt;pytorch&lt;/code&gt; + &lt;code&gt;transformers&lt;/code&gt; → ml-pack (different from voice-pack). And so on for ~80 strong signal tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — file paths.&lt;/strong&gt; &lt;code&gt;clinical/&lt;/code&gt;, &lt;code&gt;fda/&lt;/code&gt;, &lt;code&gt;phi/&lt;/code&gt;, &lt;code&gt;hipaa/&lt;/code&gt; in directory names → clinical pack. &lt;code&gt;webhook/&lt;/code&gt; + signature-related code → api-platform-pack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 — README + top-level docs grep.&lt;/strong&gt; Exact-match keywords only, not fuzzy. &lt;code&gt;"AEDT"&lt;/code&gt;, &lt;code&gt;"automated employment decision"&lt;/code&gt;, &lt;code&gt;"NYC Local Law 144"&lt;/code&gt; → hr-ai pack. &lt;code&gt;"21 CFR Part 11"&lt;/code&gt;, &lt;code&gt;"SaMD"&lt;/code&gt;, &lt;code&gt;"FDA pre-submission"&lt;/code&gt; → clinical pack.&lt;/p&gt;

&lt;p&gt;Each pack has a minimum signal count. voice-pack needs ≥2 of its 11 tokens. fintech needs ≥3 of 14. This is what cut false positives roughly in half.&lt;/p&gt;
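
&lt;p&gt;The minimum-signal rule is easy to picture in code. The Python sketch below covers only the Layer 1 dependency check and is not the actual detector: the token sets are shortened illustrations (the real packs have 11 and 14 tokens), and &lt;code&gt;PACKS&lt;/code&gt; and &lt;code&gt;detect_packs&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
# Illustrative token lists only; thresholds follow the numbers quoted above.
PACKS = {
    "voice": {"tokens": {"twilio", "livekit", "deepgram", "elevenlabs"}, "min_signals": 2},
    "fintech": {"tokens": {"stripe", "plaid", "dwolla"}, "min_signals": 3},
}


def detect_packs(dependencies: set) -> list:
    """Attach a pack only when enough of its tokens appear among the dependencies."""
    attached = []
    for name, pack in sorted(PACKS.items()):  # sorted for deterministic output
        hits = pack["tokens"].intersection(dependencies)
        if len(hits) >= pack["min_signals"]:
            attached.append(name)
    return attached
```

&lt;p&gt;With these thresholds a repo that depends on &lt;code&gt;twilio&lt;/code&gt; alone attaches nothing, while &lt;code&gt;twilio&lt;/code&gt; plus &lt;code&gt;deepgram&lt;/code&gt; attaches the voice pack. A single stray token is no longer enough to trigger a pack, which is the effect the threshold is there for.&lt;/p&gt;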

&lt;h2&gt;
  
  
  The false positives I have logged
&lt;/h2&gt;

&lt;p&gt;Across 4 months and ~340 init runs (instrumented from telemetry), 12 confirmed false positives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;repo type&lt;/th&gt;
&lt;th&gt;wrongly attached pack&lt;/th&gt;
&lt;th&gt;trigger&lt;/th&gt;
&lt;th&gt;fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;static-site generator&lt;/td&gt;
&lt;td&gt;voice-pack&lt;/td&gt;
&lt;td&gt;README explicitly disclaiming Twilio&lt;/td&gt;
&lt;td&gt;exact-match keywords only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;music-recommender ML&lt;/td&gt;
&lt;td&gt;voice-pack&lt;/td&gt;
&lt;td&gt;"audio" in package description&lt;/td&gt;
&lt;td&gt;removed "audio" as solo trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;internal Helm chart&lt;/td&gt;
&lt;td&gt;fintech&lt;/td&gt;
&lt;td&gt;"billing" in operator log section&lt;/td&gt;
&lt;td&gt;minimum 3 signals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;docs-only repo&lt;/td&gt;
&lt;td&gt;clinical&lt;/td&gt;
&lt;td&gt;"patient" in user-research subfolder&lt;/td&gt;
&lt;td&gt;excluded &lt;code&gt;docs/&lt;/code&gt; from path scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;game-server prototype&lt;/td&gt;
&lt;td&gt;mlops&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;torch&lt;/code&gt; in optional dev-dep&lt;/td&gt;
&lt;td&gt;only scan &lt;code&gt;dependencies&lt;/code&gt;, not &lt;code&gt;devDependencies&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 others&lt;/td&gt;
&lt;td&gt;various&lt;/td&gt;
&lt;td&gt;various&lt;/td&gt;
&lt;td&gt;each addressed via test case in &lt;code&gt;tests/detection.test.mjs&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 12 cases are committed as regression tests. If the detector ever re-introduces one of these false positives, CI fails.&lt;/p&gt;
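
&lt;p&gt;To show the shape such a regression test can take, here is a sketch of the dev-dependency case from the table. The real tests live in &lt;code&gt;tests/detection.test.mjs&lt;/code&gt;; this Python version, its &lt;code&gt;ml_signals&lt;/code&gt; helper, and its token list are hypothetical stand-ins, not the project's code.&lt;/p&gt;

```python
import json

# Hypothetical token list for illustration; the real detector's list is not shown here.
ML_TOKENS = {"torch", "tensorflow", "transformers"}


def ml_signals(package_json_text):
    """Collect ML tokens from runtime dependencies only, ignoring devDependencies."""
    manifest = json.loads(package_json_text)
    runtime = manifest.get("dependencies", {})
    return sorted(ML_TOKENS.intersection(runtime))


def test_dev_dependency_does_not_count():
    # Shape of the logged false positive: torch appears only as a dev dependency.
    manifest = json.dumps({
        "name": "game-server-prototype",
        "dependencies": {"ws": "^8.0.0"},
        "devDependencies": {"torch": "^1.0.0"},
    })
    assert ml_signals(manifest) == []


def test_runtime_dependency_still_counts():
    manifest = json.dumps({"dependencies": {"torch": "^1.0.0", "transformers": "^4.0.0"}})
    assert ml_signals(manifest) == ["torch", "transformers"]


test_dev_dependency_does_not_count()
test_runtime_dependency_still_counts()
```

&lt;p&gt;Pairing each negative case with a positive one guards against a fix that silences the false positive by disabling the signal entirely.&lt;/p&gt;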

&lt;h2&gt;
  
  
  The case I worry about: silent false negatives
&lt;/h2&gt;

&lt;p&gt;A false positive is easy to log: the user complains ("why is this thing telling me about TCPA"). A false negative is harder to catch: the user runs init on a repo that &lt;em&gt;should&lt;/em&gt; have the hr-ai pack attached, it doesn't attach, they ship with no bias audit, and they get fined two years later.&lt;/p&gt;

&lt;p&gt;Mitigations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/migrate&lt;/code&gt; command.&lt;/strong&gt; Rerun detection with updated rules. New packs (or new keywords for existing packs) get a second chance to attach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PROJECT.md is editable.&lt;/strong&gt; The &lt;code&gt;packs:&lt;/code&gt; list is plain YAML. Users can add a pack manually if detection missed it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public catalogue.&lt;/strong&gt; &lt;a href="https://greatcto.systems/companies.html" rel="noopener noreferrer"&gt;greatcto.systems/companies.html&lt;/a&gt; lists 200+ companies and the packs that &lt;em&gt;would&lt;/em&gt; auto-attach to each. If a company similar to the user's is in the catalogue, the user gets a sanity check on whether their own detection result is correct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry on no-pack runs.&lt;/strong&gt; When init detects zero packs, we log it (anonymous, opt-in). If a class of project keeps coming through with no pack and the cost of a miss is high (regulated industry), I add detection rules.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I have not had a confirmed regulatory false negative yet. That is partly because the user population is small (~500 active installs as of writing) and partly because the high-stakes archetypes (clinical, fintech, lending) have strong-signal vocabulary that is hard to miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I will not add
&lt;/h2&gt;

&lt;p&gt;People keep asking for two features I have rejected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Pack confidence scores."&lt;/strong&gt; The request: have the detector output a 0-1 confidence per pack so the user can sort. I rejected this because it implies a precision the regex layer does not actually have, and users will treat a 0.6 score as "halfway right" when really it means "one signal matched, probably noise."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Auto-update detection from telemetry."&lt;/strong&gt; If we see 10 users with &lt;code&gt;xyz&lt;/code&gt; in their repo overriding our detection, automatically add &lt;code&gt;xyz&lt;/code&gt; as a fintech signal. Rejected: too easy to poison. One determined attacker registers 10 fake &lt;code&gt;xyz/random-name&lt;/code&gt; repos with manual fintech tags and the global detector starts attaching fintech to everyone using &lt;code&gt;xyz&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these are textbook examples of "the obvious feature that becomes a backdoor."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I might add
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM in the loop, but only for ambiguous cases.&lt;/strong&gt; If two or more packs show some signal but none reaches its threshold, pipe the README into Claude with a strict "pick one or 'unclear'" prompt. The latency penalty then lands only on the 5-10% of repos that are ambiguous, not on all of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-language detection.&lt;/strong&gt; Right now everything assumes Node/Python/JVM-ish patterns. Rust and Go projects sometimes have weak signal even when they are clearly fintech or healthcare. Not urgent — those communities are smaller in the user base.&lt;/li&gt;
&lt;/ul&gt;
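
&lt;p&gt;The LLM-in-the-loop idea hinges on a routing decision that is cheap to make before any model call. A rough sketch, assuming per-pack signal counts are already available; the &lt;code&gt;route&lt;/code&gt; function, its return labels, and the clinical threshold are hypothetical, and the LLM call itself is left out.&lt;/p&gt;

```python
def route(signal_counts, thresholds):
    """Decide between attaching packs, escalating to an LLM, or attaching nothing.

    signal_counts: pack name -> number of matched signals
    thresholds:    pack name -> minimum signals required to attach
    """
    attached = sorted(p for p, n in signal_counts.items() if n >= thresholds[p])
    if attached:
        return ("attach", attached)
    # Packs with some signal, none of which reached its threshold.
    partial = sorted(p for p, n in signal_counts.items() if n >= 1)
    if len(partial) >= 2:
        # Ambiguous: only this branch would pay the LLM latency.
        return ("ask_llm", partial)
    return ("none", [])


THRESHOLDS = {"voice-pack": 2, "fintech": 3, "clinical": 2}  # clinical value is made up

print(route({"voice-pack": 1, "fintech": 2, "clinical": 0}, THRESHOLDS))
print(route({"voice-pack": 2, "fintech": 1, "clinical": 0}, THRESHOLDS))
```

&lt;p&gt;The first call is the ambiguous case (two packs with partial signal), so it escalates. The second attaches voice-pack directly and never reaches the LLM branch.&lt;/p&gt;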

&lt;p&gt;The detection logic is small, boring, and one of the parts of the system I am most defensive of. It is the first thing every user sees, and a wrong first guess loses them.&lt;/p&gt;




&lt;p&gt;About: I build &lt;a href="https://greatcto.systems" rel="noopener noreferrer"&gt;GreatCTO&lt;/a&gt; — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. The detector source is in &lt;a href="https://github.com/avelikiy/great_cto/blob/main/packages/cli/src/main.ts" rel="noopener noreferrer"&gt;packages/cli/src/detect.ts&lt;/a&gt; — read or fork.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>compliance</category>
      <category>detection</category>
    </item>
    <item>
      <title>Why your agent system fails: missing gates, not missing intelligence</title>
      <dc:creator>Alexander Velikiy</dc:creator>
      <pubDate>Sun, 17 May 2026 14:27:33 +0000</pubDate>
      <link>https://dev.to/great_cto/why-your-agent-system-fails-missing-gates-not-missing-intelligence-4591</link>
      <guid>https://dev.to/great_cto/why-your-agent-system-fails-missing-gates-not-missing-intelligence-4591</guid>
      <description>&lt;p&gt;A senior CTO emailed me last month: "We rolled out Devin across two teams. After three weeks the agents had merged 47 PRs. Three of them broke prod. Two contained a credential in the commit. One disabled rate limiting because the test fixtures didn't pass with rate limiting on. We're rolling back."&lt;/p&gt;

&lt;p&gt;Everyone with eyes on agentic coding has heard a version of this story. The most common diagnosis is &lt;em&gt;"the model isn't good enough yet."&lt;/em&gt; Reasonable on the surface. Wrong as a diagnosis.&lt;/p&gt;

&lt;p&gt;I've spent the last 4 months building a multi-agent SDLC layer on top of Claude Code. 34 specialist agents, 25 archetype overlays, two human gates per feature. The clearest finding from this work: &lt;strong&gt;the failures CTOs describe almost never trace to bad code generation. They trace to missing gates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article walks through why, and shows the state machine I think every agentic SDLC needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with how everyone does it
&lt;/h2&gt;

&lt;p&gt;The default architecture for agentic coding is &lt;strong&gt;one autonomous loop&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loop:
  llm.generate(task, context)
  apply(diff)
  run_tests()
  if pass: commit
  else: revise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine for prototypes. It is a disaster for shipped code. Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tests aren't enough.&lt;/strong&gt; Tests verify &lt;em&gt;correctness against assertions you wrote&lt;/em&gt;. They do not verify: "is this PCI-DSS scope appropriate", "does this respect TCPA recording consent", "did we just add a hidden N+1 query", "is this idempotent under retry storms". You need humans, or specialist reviewers that act like humans, for each of those.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. One agent can't review itself.&lt;/strong&gt; Even if you ask GPT-4 or Claude Opus to review its own output, the same biases that wrote the bug are reading the diff. We have decades of evidence from code review at Google, Microsoft, and Apache that &lt;strong&gt;independent reviewers catch ~3× more defects than authors&lt;/strong&gt;. Independence requires separation, and agents are no different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Speed compounds errors.&lt;/strong&gt; When the loop runs unattended, errors accumulate quietly between human checkpoints. By the time a human sees the work, the agent has rebuilt on top of three earlier mistakes. You can't fix the lowest-level mistake without unwinding everything above it.&lt;/p&gt;

&lt;p&gt;The pattern that keeps emerging across teams that ship agentic systems successfully is &lt;strong&gt;explicit gates + specialist reviewers&lt;/strong&gt;, not bigger models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this article will show
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The 8-stage state machine I think every agentic SDLC needs&lt;/li&gt;
&lt;li&gt;Why two human gates per feature is the sweet spot&lt;/li&gt;
&lt;li&gt;The parallel implementer + parallel reviewer pattern&lt;/li&gt;
&lt;li&gt;How memory feedback closes the loop (the "94% MTTR" claim, with caveats)&lt;/li&gt;
&lt;li&gt;What this all costs (~$2 per small feature, with receipts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The state machine
&lt;/h2&gt;

&lt;p&gt;The full pipeline, as a deterministic state machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    Init["$ init"] --&amp;gt; Detect["archetype-detect"]
    Detect --&amp;gt; Architect["architect (ARCH.md)"]
    Architect --&amp;gt; GatePlan{"⚐ gate: plan"}
    GatePlan --&amp;gt;|human approve| PM["pm (decompose)"]
    PM --&amp;gt; Impl["senior-dev × N (parallel)"]
    Impl --&amp;gt; Review["specialist review × 5 (parallel)"]
    Review --&amp;gt; GateShip{"⚐ gate: ship"}
    GateShip --&amp;gt;|human approve| Deploy["devops"]
    Deploy --&amp;gt; Operate["l3-support"]
    Operate -.-&amp;gt;|incident pattern| Learner["continuous-learner"]
    Learner -.-&amp;gt;|inject lesson| Architect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two diamond nodes are human gates. Everything else runs unattended.&lt;/p&gt;

&lt;p&gt;A few things to notice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallelism is structural, not accidental.&lt;/strong&gt; At the implement stage, independent tasks run in isolated git worktrees. At the review stage, 5 reviewers run concurrently because they look at different aspects (QA, security, performance, archetype-specific compliance, 12-angle code review).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The memory loop is dashed.&lt;/strong&gt; It's an out-of-band feedback path. When a P0 incident resolves, the &lt;code&gt;continuous-learner&lt;/code&gt; agent extracts the detection pattern and writes it to &lt;code&gt;~/.great_cto/lessons.md&lt;/code&gt;. Next time a similar incident shape hits, the agent's &lt;em&gt;Step 0&lt;/em&gt; includes the prior detection order. This is where the MTTR savings come from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialists run only when archetype matches.&lt;/strong&gt; The 34 agents in the pool aren't all firing every time. For a typical fintech feature, only 7 run: architect, pm, 2× senior-dev, qa-engineer, security-officer (PCI focus), code-reviewer. The voice-AI reviewer doesn't load because the archetype isn't voice-AI.&lt;/p&gt;
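
&lt;p&gt;The linear path of the diagram can be sketched as a small state machine. This is an illustration under simplifying assumptions, not the project's implementation: it drops the dashed learner loop and the parallel fan-out, takes its stage names from the diagram, and the &lt;code&gt;advance&lt;/code&gt; and &lt;code&gt;run_until_blocked&lt;/code&gt; helpers are hypothetical.&lt;/p&gt;

```python
from enum import Enum


class Stage(Enum):
    INIT = "init"
    DETECT = "archetype-detect"
    ARCHITECT = "architect"
    GATE_PLAN = "gate: plan"
    PM = "pm"
    IMPLEMENT = "senior-dev"
    REVIEW = "specialist review"
    GATE_SHIP = "gate: ship"
    DEPLOY = "devops"
    OPERATE = "l3-support"


ORDER = list(Stage)                      # Enum preserves definition order
HUMAN_GATES = {Stage.GATE_PLAN, Stage.GATE_SHIP}


def advance(stage, human_approved=False):
    """Return the next stage, refusing to pass a human gate without approval."""
    if stage in HUMAN_GATES and not human_approved:
        raise PermissionError(f"{stage.value} needs human approval")
    position = ORDER.index(stage)
    if position == len(ORDER) - 1:
        return stage                     # terminal stage: stay in operate
    return ORDER[position + 1]


def run_until_blocked(stage):
    """Advance through unattended stages and stop at the next gate or the end."""
    while stage not in HUMAN_GATES and stage is not ORDER[-1]:
        stage = advance(stage)
    return stage


stage = run_until_blocked(Stage.INIT)
print(stage.value)                       # stops at the plan gate
stage = run_until_blocked(advance(stage, human_approved=True))
print(stage.value)                       # stops at the ship gate
```

&lt;p&gt;The point of the sketch is that the gates are enforced by the transition function itself, so no unattended stage can step past one.&lt;/p&gt;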

&lt;h2&gt;
  
  
  Two gates, not seven
&lt;/h2&gt;

&lt;p&gt;The hardest design question is: how many human gates?&lt;/p&gt;

&lt;p&gt;I started with seven: plan, design-review, security-review, qa-review, performance-review, compliance-review, ship. The complaint from every early user was: "this is just the human checkpoint problem from waterfall, but worse, because now I'm reviewing AI outputs."&lt;/p&gt;

&lt;p&gt;Down to two. Specifically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 1: plan.&lt;/strong&gt; You approve the ARCH note + cost estimate + task decomposition &lt;em&gt;before&lt;/em&gt; any code is written. This is the cheapest decision in the pipeline — if scope is wrong, fixing it now is free. If you approve it, you've committed to "ship this if implementation passes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 2: ship.&lt;/strong&gt; You see the full review panel — 5 verdicts, with rationale and diff per reviewer. APPROVED chips and BLOCKED chips. You either approve, or push back on a specific reviewer.&lt;/p&gt;

&lt;p&gt;Everything in between is the agents' problem. If they disagree with each other, the gate fails and surfaces with the disagreement explicit.&lt;/p&gt;

&lt;p&gt;Why this specific shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate 1 controls scope.&lt;/strong&gt; You decide what gets built.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate 2 controls quality.&lt;/strong&gt; You decide whether the agents got it right.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't decide &lt;em&gt;how&lt;/em&gt; in between. The agents do. If you're making more than 2 decisions per feature, you're a bottleneck — and the whole pipeline collapses to your reading speed.&lt;/p&gt;

&lt;p&gt;This is the part that most agentic systems get wrong. They either show you everything (and you can't keep up), or they show you nothing (and you wake up to broken prod). Two well-chosen gates is the sweet spot.&lt;/p&gt;

&lt;h2&gt;
  
  
  The memory loop is the real moat
&lt;/h2&gt;

&lt;p&gt;Most agentic coding tools have no memory. They start each session from zero. This is &lt;em&gt;fine&lt;/em&gt; for syntax errors and dead code. It is &lt;em&gt;bad&lt;/em&gt; for the kind of bugs that recur with different surface signatures.&lt;/p&gt;

&lt;p&gt;Real example. In Q1 of this year I hit Postgres connection-pool exhaustion during a burst load. The log said &lt;code&gt;Connection refused&lt;/code&gt;. It looked like a network issue. I spent 4 hours unwinding network config before finally checking &lt;code&gt;pg_stat_activity&lt;/code&gt; and seeing that the pool had hit its size cap. In Q3 the same shape hit in a different project (different framework, different stack). The pattern hash matched, so the agent's &lt;code&gt;Step 0&lt;/code&gt; included the prior detection order. 28 minutes to resolution.&lt;/p&gt;

&lt;p&gt;This is not the agent being smarter. It's the agent skipping hypothesis exploration time.&lt;/p&gt;
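
&lt;p&gt;For the incident above, the saving can be worked out directly. The formula here is an assumption (one minus the ratio of repeat time to first-occurrence time), since the article does not spell out how reduction is defined.&lt;/p&gt;

```python
def mttr_reduction(before_minutes, after_minutes):
    """Fraction of resolution time saved relative to the first occurrence."""
    return 1 - after_minutes / before_minutes


# 4 hours on the first incident, 28 minutes on the repeat.
reduction = mttr_reduction(before_minutes=4 * 60, after_minutes=28)
print(f"{reduction:.1%}")  # prints 88.3%
```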

&lt;p&gt;Across 47 paired P0 incidents in 12 repositories (full methodology and 4 honest memory-miss cases published &lt;a href="https://github.com/avelikiy/great_cto/blob/main/docs/benchmarks/MTTR.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;), the median MTTR reduction was 94.1%. The mean was 92.6%, slightly below the median, which suggests a few weaker cases pull it down. This is observational data, not an RCT. Caveats are listed in the methodology.&lt;/p&gt;

&lt;p&gt;The mechanism is simple. The agent stores: &lt;code&gt;(pattern_hash, detection_order_that_worked, rationale)&lt;/code&gt;. On a match, it tries the winning detection first. If that's wrong (4 of 47 cases were misses), it falls back to systematic exploration. No worse than baseline.&lt;/p&gt;
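
&lt;p&gt;A sketch of that lookup, assuming lessons are already loaded into a dictionary keyed by pattern hash. The &lt;code&gt;Lesson&lt;/code&gt; type, the &lt;code&gt;plan_checks&lt;/code&gt; function, the check names, and the hash string are all hypothetical; how the hash is computed is not covered here.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    pattern_hash: str
    detection_order: list   # checks in the order that worked last time
    rationale: str


def plan_checks(pattern_hash, lessons, default_order):
    """Put a remembered detection order first, then the remaining default checks."""
    lesson = lessons.get(pattern_hash)
    if lesson is None:
        return list(default_order)           # no memory: systematic exploration
    remembered = list(lesson.detection_order)
    rest = [check for check in default_order if check not in remembered]
    return remembered + rest                 # a miss falls through to the baseline checks


lessons = {
    "conn-refused-under-burst": Lesson(
        pattern_hash="conn-refused-under-burst",
        detection_order=["pg_stat_activity", "pool_size"],
        rationale="Connection refused was pool exhaustion, not the network",
    )
}
default = ["network_config", "dns", "pg_stat_activity", "pool_size"]

print(plan_checks("conn-refused-under-burst", lessons, default))
print(plan_checks("unseen-pattern", lessons, default))
```

&lt;p&gt;Because the remembered checks are only moved to the front and nothing is removed, a wrong match costs the time of those first checks and then continues with the full baseline list.&lt;/p&gt;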

&lt;p&gt;What makes the memory layer work is that it's &lt;strong&gt;local, file-backed, and git-trackable&lt;/strong&gt;. Not a vector DB. Not a cloud service. Plain markdown in &lt;code&gt;.great_cto/lessons.md&lt;/code&gt; (per-project) and &lt;code&gt;~/.great_cto/decisions.md&lt;/code&gt; (cross-project). You can read it, edit it, version-control it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge cases worth knowing
&lt;/h2&gt;

&lt;p&gt;A few things that surprised me during the build:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent count doesn't matter much.&lt;/strong&gt; I shipped 12 agents, then 24, then 34. The marginal value of adding the 35th agent is small. What matters is &lt;em&gt;coverage&lt;/em&gt; of distinct review angles. After 12, you mostly add archetype-specific compliance reviewers, and each one is opt-in based on archetype detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disagreement between reviewers is a feature, not a bug.&lt;/strong&gt; When &lt;code&gt;security-officer&lt;/code&gt; blocks a PR that &lt;code&gt;qa-engineer&lt;/code&gt; approves, you want this visible at the gate, not papered over. The state machine surfaces both verdicts.&lt;/p&gt;
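
&lt;p&gt;One way to keep disagreement visible is to have the gate summary carry both sides instead of a single pass/fail bit. A minimal sketch, assuming verdicts arrive as a reviewer-to-verdict mapping and that any BLOCKED verdict holds the gate; the &lt;code&gt;ship_gate&lt;/code&gt; function and its field names are hypothetical.&lt;/p&gt;

```python
def ship_gate(verdicts):
    """Summarise reviewer verdicts for the human at the ship gate.

    verdicts: reviewer name -> "APPROVED" or "BLOCKED"
    """
    blocked = sorted(name for name, v in verdicts.items() if v == "BLOCKED")
    approved = sorted(name for name, v in verdicts.items() if v == "APPROVED")
    return {
        "ready": not blocked,                 # assumption: any block holds the gate
        "disagreement": bool(blocked) and bool(approved),
        "blocked_by": blocked,
        "approved_by": approved,
    }


summary = ship_gate({"security-officer": "BLOCKED", "qa-engineer": "APPROVED"})
print(summary)
```

&lt;p&gt;Here the summary reports the gate as not ready and flags the disagreement, with both reviewer names preserved for the human to act on.&lt;/p&gt;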

&lt;p&gt;&lt;strong&gt;Cost is dominated by output tokens.&lt;/strong&gt; A typical feature: $3.40 in LLM calls. ~80% is in the agents that &lt;em&gt;write&lt;/em&gt; (senior-devs, architect). The reviewers are cheap because they output verdicts, not code. If costs balloon, look at how much code is being generated, not how many agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-approve flag is the slippery slope.&lt;/strong&gt; I considered an &lt;code&gt;--auto-approve&lt;/code&gt; flag for trivial features. Killed it. The minute you have that flag, the cycle that produces broken prod starts. The two gates are load-bearing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this fits
&lt;/h2&gt;

&lt;p&gt;The thesis isn't "you need this specific tool." It's that &lt;strong&gt;any agentic SDLC needs explicit state, explicit gates, and a memory loop&lt;/strong&gt;. Without them, you're shipping a faster version of the agent system that already burned the teams I mentioned at the top.&lt;/p&gt;

&lt;p&gt;If you want to inspect the exact state machine, the live SVG with every node clickable to its source on GitHub is &lt;a href="https://greatcto.systems/architecture" rel="noopener noreferrer"&gt;here&lt;/a&gt;. A real shipped feature, walked stage by stage with artifacts and costs, is &lt;a href="https://greatcto.systems/proof" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Agentic coding failures trace to &lt;strong&gt;missing gates&lt;/strong&gt;, not bad models.&lt;/li&gt;
&lt;li&gt;The pattern that ships safely is &lt;strong&gt;2 human gates + parallel implementers + parallel reviewers + memory loop&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;"Bigger model" is rarely the right answer. "More specialist review angles" usually is.&lt;/li&gt;
&lt;li&gt;Cost per shipped feature on this architecture: $1–4 in LLM, ~45 min wall-clock, 2 human clicks.&lt;/li&gt;
&lt;li&gt;Memory is the difference between "fast at one-off code generation" and "improves over time at &lt;em&gt;your&lt;/em&gt; codebase's recurring bugs".&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;About: I build &lt;a href="https://greatcto.systems" rel="noopener noreferrer"&gt;GreatCTO&lt;/a&gt; — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Twitter: &lt;a href="https://twitter.com/avelikiy" rel="noopener noreferrer"&gt;@avelikiy&lt;/a&gt;. GitHub: &lt;a href="https://github.com/avelikiy/great_cto" rel="noopener noreferrer"&gt;@avelikiy/great_cto&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>sdlc</category>
      <category>compliance</category>
    </item>
  </channel>
</rss>
