DEV Community: Alexander Velikiy

An open-source AI Product Builder now ships compliance-reviewed software for 5 regulated industries

Alexander Velikiy — Fri, 03 Jul 2026 15:09:24 +0000

Cross-posted from greatcto.systems/blog.

GreatCTO is an open-source AI Product Builder that runs on Claude Code: describe a product, approve the spec, and a pipeline of specialist agents ships real software. Until now its catalog covered ten US service industries — 40 products across six build pipelines.

This week it's 15 industries and 60 products — and all five new verticals are regulated, with the compliance review built into the pipeline.

The five new verticals

Industry	Products
🩺 Allied health & clinics	Patient scheduling · Clinical charting · Insurance claims · Patient intake
🦷 Dental practices	Dental scheduling · Treatment planning · Dental claims · Recall & reactivation
🛡️ Insurance agencies	Quote management · Policy management · Commission management · Agency CRM
🧾 Accounting & tax firms	Client books · Tax workflow · Document portal · Engagement billing
⚖️ Law firms & solo practitioners	Matter management · Document automation · Client intake · Trust & billing

The compliance reviewer comes with the build

The hard part of regulated software isn't the CRUD — it's the rules around it. So when you build in one of these verticals, GreatCTO detects the archetype and auto-attaches the matching domain reviewer before anything ships:

Health & dental → HIPAA / PHI (encryption, access control, BAA surface, CDT/ICD coding)
Insurance → NAIC / ACORD (actuarial auditability, anti-discrimination pricing, filing standards)
Accounting & tax → SOX ITGC + IRS (segregation of duties, ASC 606, 1099, Circular 230)
Law firms → a dedicated legal reviewer

The reviewer writes a threat model, flags the domain risks, and signs off before the build proceeds. You approve one spec; the compliance expertise runs inside the pipeline.

A whole `legal` archetype

Law-firm software has failure modes no generic reviewer catches, so it got its own archetype + reviewer covering the profession's real obligations:

UPL — software must inform, not advise; attorney review is a structural gate.
IOLTA / client trust accounting — no commingling, per-client ledgers, three-way reconciliation. Get it wrong and it's a bar complaint, not a bug.
Attorney-client privilege — Model Rule 1.6: encryption, matter-level access, metadata scrubbing.
Conflict screening — adverse-party checks that block intake before a conflict exists.
E-filing — PACER / CM-ECF, FRCP 5.2 redaction.
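To make the trust-accounting item concrete, here is a minimal sketch of a three-way reconciliation check. It assumes amounts are held as integer cents and that the bank figure is already adjusted for outstanding items; the function and argument names are hypothetical, not GreatCTO's reviewer code.

```python
def three_way_reconcile(adjusted_bank_cents: int, book_cents: int,
                        client_ledgers: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the trust account reconciles."""
    problems = []
    ledger_total = sum(client_ledgers.values())
    if adjusted_bank_cents != book_cents:
        problems.append("bank balance differs from trust book balance")
    if book_cents != ledger_total:
        problems.append("trust book balance differs from sum of client ledgers")
    for client, balance in client_ledgers.items():
        # A negative client ledger means one client's money covered another's.
        if balance < 0:
            problems.append(f"negative balance for client {client}")
    return problems
```

Note that the totals can agree while a single client ledger is negative, which is why the per-client check is separate from the three-way comparison.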

Also new this cycle

Measured product quality — every generated product gets an automated 0–100 score (a clean build benchmarks ~89/100).
Cross-model review — high-stakes diffs are red-teamed by a different model family, because a model reviewing its own family's code is blind to its own mistakes.
great-cto upgrade --self — one-command self-upgrade that detects how the CLI was installed (npm / pnpm / volta / npx).
Leaner local board — bundles on a fresh install, runs fully offline.

Try it

MIT-licensed, runs locally:

npx great-cto@latest

All 60 products at greatcto.systems/build. v2.82.2 is live.

June under the hood: the board becomes a control panel, prompts evolve behind a holdout gate, logs shrink 99.5%

Alexander Velikiy — Thu, 11 Jun 2026 08:39:57 +0000

The last two posts were about the pivot — autopilots, live connectors, the operator console. This one is about the engine room: four upgrades that shipped in the same June sprint and that you'd otherwise only discover by reading the changelog. Users keep telling us they don't read the changelog. Fair.

1. The board is now a control panel, not a mirror

Until v2.64 the dev board showed you the pipeline: tasks, gates, costs. To act on anything you went back to the terminal.

Now approving a gate (or pressing Run) spawns a Claude Code agent headlessly in the project and streams its output into the board — assistant text, tool calls, result, parsed from stream-json and pushed over SSE. There's a Run-agent panel with a prompt field and a live stream, and an Approve + ▶run button right on the gate card. Approve the plan, watch the implementation start, without touching a terminal.

Running an autonomous agent that edits files from a web page is exactly as dangerous as it sounds, so the guardrails came first:

same-origin only, and the project must live under $HOME
one run per project — a second Run gets a 409
hard timeout (SIGTERM → SIGKILL), 2000-line ring buffer, child stdin closed
permission mode defaults to acceptEdits — full autonomy is an explicit opt-in env var, never the default

Verified end-to-end with a stub binary (all four guardrails, Stop button) and a real claude run.
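Two of those guardrails are easy to picture in code. The sketch below is an illustration only, assuming an in-memory registry: one active run per project (a second start returns 409) and a bounded ring buffer for output. The class name and the 202 success status are assumptions; the post only names the 409.

```python
from collections import deque

class RunRegistry:
    """Allow at most one active run per project; keep only the last N output lines."""

    def __init__(self, max_lines: int = 2000):
        self.max_lines = max_lines
        self.active: dict[str, deque] = {}

    def start(self, project: str) -> int:
        """Return an HTTP-style status: 202 if started, 409 if already running."""
        if project in self.active:
            return 409
        self.active[project] = deque(maxlen=self.max_lines)  # ring buffer
        return 202

    def append(self, project: str, line: str) -> None:
        # Once the buffer is full, the oldest line is dropped automatically.
        self.active[project].append(line)

    def stop(self, project: str) -> list[str]:
        """End the run and return whatever output is still buffered."""
        return list(self.active.pop(project))
```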

2. Prompts now have to prove they got better

Every agent in GreatCTO learns from lessons. The uncomfortable question: when the system rewrites an agent's prompt based on a lesson, who checks the rewrite didn't make it worse?

v2.37 closed the loop, porting the generate→evaluate→gate cycle from hexo-ai/sia:

Eval cases split into tuning (visible to the prompt-improver) and holdout (gate-only, anti-overfit)
A promotion gate blocks any candidate prompt that regresses on the holdout split — exit codes, not vibes
/prompt-evolve runs lesson → candidate → holdout gate → PROMOTE/REJECT, with a per-agent generation ledger you can audit
Each agent gets a generational changelog: which lesson, what held-out delta, full provenance

A learned improvement can no longer ship until it's re-proven on cases it never saw. The same loop later gated the compression layer below — turtles all the way down, but each turtle is tested.
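The gate itself can be very small. This is a sketch under stated assumptions: agents are modeled as plain functions, scoring is exact-match accuracy, and a tie on the holdout split counts as "no regression". None of these names come from the GreatCTO codebase.

```python
from typing import Callable

Case = tuple[str, str]  # (input, expected output)

def score(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the agent output matches the expectation."""
    return sum(agent(x) == want for x, want in cases) / len(cases)

def promotion_gate(current: Callable[[str], str],
                   candidate: Callable[[str], str],
                   holdout: list[Case]) -> int:
    """Exit code 0 (PROMOTE) if the candidate does not regress on holdout, else 1."""
    return 0 if score(candidate, holdout) >= score(current, holdout) else 1
```

The prompt-improver only ever sees the tuning split, so the holdout score is the one number it cannot overfit to.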

3. Context compression: 31,475 chars of CI log → 155

Agents read logs, test output, JSON dumps. Most of it is repetition. v2.38 added a compression layer — deterministic, $0, no LLM, no native deps, concepts borrowed from chopratejas/headroom:

Input	Result
CI log	31,475 → 155 chars (−99.5%), FATAL/ERROR/stacks kept verbatim
JSON	−43% minified, −98% with array crush
Noisy test run	−86%, the FAIL preserved

The part that makes aggressive compression safe: CCR — Compressed Context with Retrieval. Anything dropped is stored locally, content-addressed, and recoverable on demand; the memory filter appends a recall footer listing what it filtered. Lossless-on-demand. And a fidelity eval (through the v2.37 holdout gate, naturally) ensures a compressor only ships if the key fact survives.
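Here is a rough sketch of the keep-and-recall idea, assuming a plain dictionary as the local store and a substring match on FATAL/ERROR. It ignores stack traces and is far simpler than the real layer; `STORE`, `compress_log`, and `recall` are hypothetical names.

```python
import hashlib

STORE: dict[str, str] = {}  # stand-in for the local content-addressed store

def compress_log(text: str, keep=("FATAL", "ERROR")) -> str:
    """Keep lines containing a marker verbatim; park the rest under a hash."""
    kept, dropped = [], []
    for line in text.splitlines():
        (kept if any(m in line for m in keep) else dropped).append(line)
    if not dropped:
        return text
    blob = "\n".join(dropped)
    key = hashlib.sha256(blob.encode()).hexdigest()[:12]
    STORE[key] = blob  # dropped content stays recoverable on demand
    kept.append(f"[{len(dropped)} lines filtered, recall with key {key}]")
    return "\n".join(kept)

def recall(key: str) -> str:
    """Return the exact text that was filtered out under this key."""
    return STORE[key]
```

Because the key is derived from the dropped content, the same noise always maps to the same entry, and the footer tells the agent what to ask for if it needs the detail back.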

l3-support compresses logs and qa-engineer compresses test output before reasoning — fewer tokens spent re-reading the same stack trace twelve times.

4. Scope creep is now caught mechanically

The classic agent failure: asked to fix the webhook, also "improved" the auth module. v2.39 added governance inspired by NaCl, all machine-checkable at $0:

impl-brief per task — files-to-modify allowlist, files-NOT-to-modify denylist, API contract, test spec. senior-dev refuses to commit out of scope; a denylist hit is a hard fail, override only via a signed exception
/trace — requirement → use-case → task → test traceability for impact analysis and coverage gaps
gap-closure waves — adopt strict gates on a legacy repo incrementally: criticals never deferred, every deferred gap held by a signed, expiring exception. Never a silent bypass.
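The allowlist/denylist part of an impl-brief is the piece that is cheapest to check mechanically. A minimal sketch, assuming glob patterns and a list of changed paths; the function name and the violation labels are made up for illustration.

```python
from fnmatch import fnmatch

def check_scope(changed: list[str], allow: list[str], deny: list[str]) -> list[str]:
    """Return violations; a denylist hit takes precedence over an allowlist miss."""
    violations = []
    for path in changed:
        if any(fnmatch(path, pat) for pat in deny):
            violations.append(f"DENY {path}")
        elif not any(fnmatch(path, pat) for pat in allow):
            violations.append(f"OUT-OF-SCOPE {path}")
    return violations
```

For the webhook example above, a commit touching `src/auth/session.ts` would fail with a DENY entry, while a commit limited to `src/webhooks/*` would pass with an empty list.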

Also in June

Fable 5 support — agent-model: fable pins every managed agent to Claude Fable 5; the board's agent runner passes the model through verbatim.
CSRF guard on the board — cross-origin mutations now 403. A malicious page can no longer POST to your localhost and approve a gate. (Found by our own /audit, fixed the same day.)
The pre-push hook can no longer hang a push, and gate-approve survives GUI-launched shells with a minimal PATH.
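For the CSRF guard, the core check is a comparison of the request's Origin against the board's own host. The sketch below is one plausible shape, not the board's actual middleware; rejecting mutations that carry no Origin header at all is an assumption, and port 3000 in the usage is hypothetical.

```python
from urllib.parse import urlsplit

MUTATING = {"POST", "PUT", "PATCH", "DELETE"}

def csrf_status(method: str, headers: dict[str, str]) -> int:
    """Return 403 for a mutating request whose Origin does not match Host."""
    if method.upper() not in MUTATING:
        return 200
    origin = headers.get("Origin")
    if origin is None:
        return 403  # assumption: a mutation without an Origin header is refused
    return 200 if urlsplit(origin).netloc == headers.get("Host") else 403
```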

All of it: open source, MIT, zero telemetry, github.com/avelikiy/great_cto. The full gory detail lives in the CHANGELOG — but now you don't have to read it.

The operator console: where the autopilot's work waits for a signature

Alexander Velikiy — Thu, 11 Jun 2026 08:21:51 +0000

Last post ended with the autopilot pausing at a human checkpoint. Pausing is easy — any workflow engine can stop. The hard questions are operational: where does the case wait, who is allowed to sign it, what do they see before signing, and what happens when the write fails at 2am?

That's what we built through v2.46–v2.63: the operator console. great-cto board → /autopilot.html. It's the Operate-mode surface — the app for the licensed humans the flow escalates to, not for the engineer who wired it.

Durable runs: the signature crosses a process boundary

A run persists to disk and survives restarts. startRun advances the flow to the gate and parks it as awaiting-approval; approve(id, who) resumes it and executes the irreversible write; reject ends it without running anything irreversible. Every transition appends to an immutable audit trail.

The v2.43 safety invariant now holds end to end: the 837 claim is submitted only because a coder signed its protecting gate — provable across a process boundary, because the approve happens in a different process than the start.
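A compressed sketch of that lifecycle, assuming one JSON file per run. The real run store is not shown in this post, so the file layout, state names other than awaiting-approval, and the function names are illustrative; the audit list here is rewritten with the file rather than being truly append-only.

```python
import json
import tempfile
from pathlib import Path

def _load_parked(path: Path) -> dict:
    """Load a run from disk and insist that it is waiting at its gate."""
    run = json.loads(path.read_text())
    if run["state"] != "awaiting-approval":
        raise RuntimeError(f"run is {run['state']}, not awaiting-approval")
    return run

def start_run(path: Path, run_id: str) -> None:
    """Advance to the gate and park the run on disk."""
    run = {"id": run_id, "state": "awaiting-approval",
           "audit": ["started", "parked at gate"]}
    path.write_text(json.dumps(run))

def approve(path: Path, who: str) -> dict:
    """Resume from disk (possibly in another process) and run the gated write."""
    run = _load_parked(path)
    run["audit"] += [f"approved by {who}", "irreversible write executed"]
    run["state"] = "completed"
    path.write_text(json.dumps(run))
    return run

def reject(path: Path, who: str) -> dict:
    """End the run; the irreversible write is never reached."""
    run = _load_parked(path)
    run["audit"].append(f"rejected by {who}")
    run["state"] = "rejected"
    path.write_text(json.dumps(run))
    return run
```

Since `approve` only needs the file, nothing ties it to the process that called `start_run`, which is the property the invariant relies on.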

We demonstrated it on medical coding live: intake → code → NCCI edits (three live connectors) → pause → the coder signs in the inbox → the claim goes out → completed. The reject path submits nothing.

Flows can require several signatures in sequence. Tax needs two: the preparer signs with their PTIN, then the taxpayer signs Form 8879 — the IRS e-file fires only after both. The board pushes a notification to the signer the moment a gate opens.

What the signer actually sees

A queue, then a case drawer. The drawer carries everything a decision needs in one panel:

The decision criteria — the SOP this case is judged against
Evidence, connector by connector — exactly what each integration found, with its live/stub flag and per-call latency
An AI-drafted determination — a templated rationale composed from the evidence, reviewed before signing
The audit trail — tamper-evident, with a "✓ verified" badge
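One common way to make a trail tamper-evident is a hash chain, where every entry commits to the one before it. The post does not say which mechanism the console uses, so treat this as an assumed technique with hypothetical names, shown only to illustrate what a "verified" badge could be checking.

```python
import hashlib
import json

GENESIS = "0" * 64

def _digest(prev: str, event: str) -> str:
    return hashlib.sha256(json.dumps([prev, event]).encode()).hexdigest()

def append_entry(trail: list[dict], event: str) -> None:
    """Append an entry whose hash covers the event and the previous hash."""
    prev = trail[-1]["hash"] if trail else GENESIS
    trail.append({"event": event, "prev": prev, "hash": _digest(prev, event)})

def verify(trail: list[dict]) -> bool:
    """Recompute every link; an edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in trail:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```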

Signing an irreversible write opens a signature ceremony: an alert dialog that names exactly what will execute — the gated step, its blast radius, the gate protecting it — and requires explicit confirmation. No one "accidentally approves" a wire transfer because the button was where their cursor happened to be.

And because humans override machines (that's the point), overrides are logged: sign against the AI recommendation and the divergence is recorded — case, recommendation, decision, who. Your regulator will ask. Now there's an answer.

The routing dial

Not every case deserves a human minute. Admin Settings sets a per-tenant confidence floor: a low-confidence approve is downgraded to escalate, and clean high-confidence cases are flagged auto-eligible. The dial moves as your trust does — start with everything escalated, widen straight-through as the override rate stays flat.
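The dial reduces to a single comparison. A minimal sketch, assuming confidence and floor are both on a 0 to 1 scale and that non-approve recommendations pass through untouched; the function name is hypothetical.

```python
def route(recommendation: str, confidence: float, floor: float) -> str:
    """Downgrade low-confidence approvals; flag clean ones as auto-eligible."""
    if recommendation != "approve":
        return recommendation  # reject / escalate pass through unchanged
    return "auto-eligible" if confidence >= floor else "escalate"
```

Setting the floor above 1.0 escalates every approve, which matches the "start with everything escalated" posture; lowering it widens straight-through processing.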

Around the queue, the things an operation actually needs:

Roles — operators sign; admins and compliance-leads see QA and Ops; invite links are scoped, with email invites and an impersonation banner when acting via a token
Smart views — All · Auto-eligible · Escalated · SLA at-risk · High blast, with SLA-aware sort and regulatory-deadline clocks on each case
QA sampling — a deterministic ~20% of closed cases lands in a QA queue to be scored 1–5; results land on the run, the audit, and Analytics
Bulk actions — multi-select (or "select auto-eligible") → approve / reject / escalate with a reason, RBAC-checked per case
Keyboard-first — ⌘K palette, j/k queue cursor, a/r/e/b decisions, ? cheatsheet
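Deterministic sampling, as in the QA item above, is usually done by hashing a stable identifier rather than rolling a random number. The sketch below assumes the case id is the hash input and a 100-bucket split; both are illustrative choices, not the console's documented method.

```python
import hashlib

def in_qa_sample(case_id: str, rate_percent: int = 20) -> bool:
    """Deterministically pick about rate_percent of cases by hashing the id."""
    bucket = int(hashlib.sha256(case_id.encode()).hexdigest(), 16) % 100
    return bucket < rate_percent
```

The same case always gets the same answer, so the sample can be reproduced for an auditor without storing a random seed.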

The Ops tab: because writes fail

The least glamorous tab is the one that earns the trust. For admins and compliance-leads:

KPI tiles — runs, connector calls, estimated cost, average latency, retries, over-budget, dead-letters
Dead-letter queue — every failed post-gate write with its connectors and error, and a one-click ↻ Requeue that re-runs the write and recovers the run to completed. An off-tab badge makes a stuck write visible without clicking.
Connector health — per-connector 🟢/🔴, call count, failure rate, p95 latency, last error
Metering by industry — per-vertical runs / calls / latency / cost, sorted by spend

Retries never double-submit: an idempotency key, stable per run, is threaded into every write.
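A small sketch of why a stable key makes a requeue safe. It assumes the receiving system dedupes on the key, which is how idempotency keys are typically honored; `StubProvider` and `idempotency_key` are hypothetical stand-ins.

```python
import hashlib

class StubProvider:
    """Stand-in for an external system that dedupes on the idempotency key."""

    def __init__(self):
        self.writes: dict[str, dict] = {}

    def submit(self, key: str, payload: dict) -> str:
        if key in self.writes:
            return "duplicate-ignored"
        self.writes[key] = payload
        return "accepted"

def idempotency_key(run_id: str) -> str:
    """The same run always yields the same key, so a retry cannot double-submit."""
    return hashlib.sha256(run_id.encode()).hexdigest()[:16]
```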

Enterprise polish, measured

v2.63 was a full UI/UX pass, and we held it to numbers rather than adjectives:


Area	Result
Accessibility	WCAG 2.2 AA — axe-core: 0 violations, all tabs, both themes
Themes	light/dark (`prefers-color-scheme` + persist), white-label accent per tenant
Realtime	SSE pushes a change the instant any run mutates — console, CLI, or webhook
Scale	render cap keeps 500+ case queues smooth
Reliability	durable-runtime e2e across all 25 verticals (start → gate → sign → write), 348/348 lib tests

Multi-tenant scoping means an operator sees only their tenant's queue. Cases export to CSV, because the auditor's tooling is Excel and pretending otherwise helps no one.

Why this matters

"Human in the loop" is usually a checkbox in a pitch deck. Operationally it's a product: an inbox with SLA clocks, a drawer with evidence, a ceremony for the point of no return, override logs, QA sampling, and a dead-letter queue for the night the provider's API was down.

That product is what makes it safe to let the autopilot run the volume. Try it: npx great-cto init, then great-cto board. Screenshots are on the landing page; the run store, runtime, and console are all in the repo.

We pivoted: GreatCTO is now AI autopilots for business

Alexander Velikiy — Thu, 11 Jun 2026 08:20:49 +0000

For a year GreatCTO was an engineering-process engine: agents, gates, reviewers, compliance packs. Good product. Wrong headline.

Here's the thing we kept observing: the people who got the most value weren't buying "a better SDLC." They were buying the outcome of a business function — claims coded, contracts reviewed, invoices matched, taxes filed. The pipeline was the means.

So in v2.40 we said it out loud: GreatCTO ships AI autopilots for business. Products that sell the outcome of a service, not a tool to a specialist. Packs, reviewers and gates didn't go anywhere — they became the under-the-hood trust layer instead of the headline.

What an autopilot actually is

A flow. One file per vertical — flows/<vertical>.flow.json — the single source of truth that renders the CLI behavior, the runtime, and the landing page from the same data:

steps — intake → process → decide → deliver, each tagged with the agent, the tools, and whether a human signs it
connectors — the real-world integrations the steps call
gates — where a named, licensed human signs before the flow continues
owner — one accountable person who answers for what the autopilot does

The four autopilot invariants are machine-checkable (autopilot-gate.mjs): judgment boundary (confidence → escalation), accuracy-as-SLA, per-decision audit trail, per-outcome unit economics. Not a manifesto — a validator that exits 1.

6 → 16 → 25 verticals

We started with six (legal docs, medical coding, procurement, accounting, managed IT, tax). Then the expansion criterion clicked: a vertical is a fit when it pairs a large displaceable-labor pool with a legally-required named human who signs the risky call. That's the exact shape the safety engine is built for.

Ten more landed in v2.44: prior-auth ($35–56B), KYC/AML ($61B), managed SOC, insurance claims (~$36–38B), mortgage underwriting, title & escrow, provider credentialing, collections, freight brokerage, clinical-trial ops. Nine more followed, among them immigration, appraisal, payroll, workers-comp, estate planning, and patent prosecution. Twenty-five total, every one shipping green on --validate.

Each carries its own compliance reviewer: False Claims Act + NCCI for coding, OFAC + BSA for AML, FDCPA + Reg F for collections, Circular 230 + §7216 for tax, FMCSA for freight. The regulation is a step in the flow, not a PDF you read later.

"Live" means live

A flow that calls mocked connectors is a demo. By v2.45, all verticals exercise at least one live connector — 17 live in the catalog, keyless by default (deterministic real logic or a curated public slice), switching to the real provider the moment you add a credential.

A few favorites:

um-criteria (prior-auth) — CMS NCD/LCD-style medical-necessity matching that never auto-denies. Missing criteria escalates to the medical director. By design, not by prompt.
sar-filing (AML) — generates a FinCEN SAR, and the filing is blocked without the BSA Officer's signature.
comms-outreach (collections) — FDCPA/Reg F 7-in-7, TCPA, and the 8am–9pm window enforced as ALLOW/BLOCK per contact.
primary-source (credentialing) — OIG LEIE / SAM exclusion screening as a hard block, plus a real NPI Luhn check.
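The NPI check mentioned in the last item is a standard Luhn computation over the 10-digit NPI prefixed with 80840. A self-contained sketch follows; the function name is hypothetical and this is not the connector's own code. The value 1234567893 in the usage note is the commonly cited dummy NPI.

```python
def npi_is_valid(npi: str) -> bool:
    """Luhn check for a 10-digit NPI, computed over the digits prefixed with 80840."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    digits = [int(d) for d in "80840" + npi]
    total = 0
    # Walk right to left; double every second digit, starting left of the check digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

`npi_is_valid("1234567893")` returns True, while changing the final digit to 0 makes it fail, so a mistyped identifier is caught before any exclusion lookup runs.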

The permission is never the wound

The scariest failure mode of an agent isn't going rogue. It's doing exactly what it's permitted to do, irreversibly, at machine speed, with no human hesitation. (Hat tip to Oleksandr Torlo's essay "The Permission Was the Wound.")

v2.43 made the boundary a runtime invariant, not a convention:

Every flow step is tagged reversible or not, with a blast radius. Money moves, claim submission, e-signing, tax filing — irreversible.
The runtime refuses to execute an irreversible step autonomously. No prior human gate → blocked-unsafe. Gate present → the step runs only after it's signed.
validateFlow() enforces it statically: irreversible ⟹ preceded by a human checkpoint, and every autopilot names an accountable owner. All 25 verticals ship green.
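The static rule is simple enough to sketch in a few lines. The field names below (`owner`, `steps`, `reversible`, `human_gate`) are assumptions made for this example; the actual flow.json schema is not reproduced in this post.

```python
def validate_flow(flow: dict) -> list[str]:
    """Static check: an owner is named, and every irreversible step follows a human gate."""
    errors = []
    if not flow.get("owner"):
        errors.append("flow has no accountable owner")
    gate_seen = False
    for step in flow.get("steps", []):
        if step.get("human_gate"):
            gate_seen = True
        # Steps default to reversible; only explicit reversible=False is checked.
        if not step.get("reversible", True) and not gate_seen:
            errors.append(f"irreversible step {step['name']!r} has no prior human gate")
    return errors
```

A validator like this would fail a flow that submits a claim with no sign-off ahead of it, before any runtime ever sees the flow.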

The autopilot does the volume. The point of no return always waits for a person.

Quality is earned, not declared

Every vertical gets a 0–100 scorecard: seven weighted dimensions, golden + adversarial cases run through the reviewer with an LLM judge, and a regression gate so a score can't silently decay. Two measure→improve→re-measure cycles took legaltech from 85 to 94.75 and msp from 78 to 98.5.

If we're going to claim an autopilot can hold a function, the claim should be a number someone measured — and a gate that fails CI when it stops being true.

Where this leaves you

npx great-cto init, name the function, and you get the flow — agents, connectors, human checkpoints, the compliance pack for your domain. The pipeline that built features for a year now runs business functions, with the same receipts: all 25 autopilots, each with its flow, gates, and live-connector badges.

Next post: what happens after the flow pauses — the operator console where a human actually signs.

great_cto: what's new — three features and the move to Opus 4.8

Alexander Velikiy — Fri, 29 May 2026 14:26:10 +0000

While you were sleeping (or heroically fixing prod), great_cto — the engineering-process engine for solo founders and teams up to 50 people — picked up some new tricks. No fluff: three features that actually change your daily grind, plus a model upgrade that didn't require re-mortgaging the apartment.

1. Discovery pipeline: think before you code

A timeless genre: write first, find out what you should have written later. Until now the pipeline started with the architect, and everything "before" — problem research, prioritization, the PRD — lived in your head, your notes, and three browser tabs you were too scared to close. That gap is now filled by two commands:

/discover — a full product-discovery cycle. Builds an Opportunity-Solution Tree (Teresa Torres' framework): desired outcome → opportunities → solutions → experiments. Ranks opportunities by Opportunity Score = Importance × (1 − Satisfaction) and tosses in ≥3 solutions for each. Output lands in docs/discovery/OST-<slug>.md.
/prd — a structured 8-section PRD, from Executive Summary to success criteria. Asks at most 4 clarifying questions (not 40, like that one meticulous stakeholder) and hands you a finished doc in docs/requirements/.
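The scoring formula from /discover is easy to try by hand. This sketch assumes Importance and Satisfaction are both normalized to the 0 to 1 range, which the post does not state; the opportunity names in the usage note are invented.

```python
def opportunity_score(importance: float, satisfaction: float) -> float:
    """Importance x (1 - Satisfaction), both assumed to be normalized to 0..1."""
    return importance * (1 - satisfaction)

def rank(opportunities: dict[str, tuple[float, float]]) -> list[str]:
    """Names sorted by descending opportunity score."""
    return sorted(opportunities,
                  key=lambda name: opportunity_score(*opportunities[name]),
                  reverse=True)
```

With "faster onboarding" at (0.9, 0.2), "export to CSV" at (0.8, 0.7), and "dark mode" at (0.4, 0.5), the scores are 0.72, 0.24, and 0.20, so an important but poorly served need outranks an important one that is already well covered.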

The PM agent also finally learned to prioritize features when there's more than one and they're all "urgent": pick from Opportunity Score / ICE / RICE / MoSCoW. The full new route: /discover → /prd → /architect → /pm → senior-dev.

2. Quota warning at session start

There's a special genre of pain: hitting the rate limit right in the middle of a heavy pipeline, when the result was just around the corner. The new quota-check.mjs hook checks your Claude Code quota at the start of every session and tactfully clears its throat ahead of time:

⚡ 70%+ — prefer the fast-path for big features
🔴 85%+ — fast-path only (skip the ARCH doc)
🛑 95%+ — friend, not today. Don't start the heavy pipeline

As a bonus it shows your burn rate per window (on track, or living large), tracks Sonnet's 7-day sub-quota separately, and watches pay-as-you-go spend. Parallel agents share a single request via a 5-minute cache — no DDoS-ing your own API. API-key users aren't touched at all — it quietly steps aside.
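The three thresholds map to a plain ladder. A sketch with a hypothetical function name and message strings paraphrased from the list above; the hook's real output format is not shown here.

```python
def quota_advice(used_percent: float) -> str:
    """Map quota usage to the advice tiers, checking the highest tier first."""
    if used_percent >= 95:
        return "stop: do not start the heavy pipeline"
    if used_percent >= 85:
        return "fast-path only (skip the ARCH doc)"
    if used_percent >= 70:
        return "prefer the fast-path for big features"
    return "ok"
```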

3. digital-health-pack: an overlay for wearable and mental-health products

A new domain overlay (Wave 4) attaches itself the moment your project starts cozying up to wearables and digital health — Apple HealthKit, Google Health Connect, Garmin, Fitbit, Oura, Whoop, biometrics (HRV, SpO2, sleep), mental-health AI, nutrition/supplement recommendations, or physician-in-the-loop (HITL) flows.

What's in the box:

a chain of three reviewers (digital-health-reviewer + ai-clinical-reviewer + healthcare-reviewer);
five human gates: wellness vs SaMD classification, HITL design, wearable API access, supplement safety (drug-interaction check + NIH dose limits), and a crisis-escalation protocol for mental health;
a ready-made threat-model template and EVAL suites — refuse-to-diagnose, supplement safety, and crisis escalation per AFSP Safe Messaging guidelines.

In short: a built-in regulatory checklist (FDA General Wellness vs SaMD, HIPAA, GDPR Art. 9, EU AI Act Annex III) — so your health startup attracts an investor, not a regulator's notice.

Bonus: moving to Claude Opus 4.8

great_cto upgraded its flagship: claude-opus-4-7 → claude-opus-4-8 (Anthropic shipped it on 2026-05-28). The kind of move that needs no boxes and no movers:

Where it works: architect (deep cross-cutting reasoning and ADR generation) plus 41 reviewers/specialists and commands/review.md via advisor-model.
What you gain: better coding at the default effort level for comparable token spend, and a 1M-token context window (yes, even that legacy module fits).
Same price: $5 / $25 per MTok (in/out) — just like 4.7. Accounting can exhale.
Tier aliases untouched: agents on model: sonnet / model: haiku stay as they were — only explicit Opus pins moved.

Upgrade: npx great-cto@latest init. The full list of changes is in the CHANGELOG.

Everyone is squeezing context. We stopped putting everything in one context.

Alexander Velikiy — Sat, 23 May 2026 06:06:06 +0000

The standard advice for reducing LLM costs: truncate your prompts, use a cheaper model, compress your system prompt, enable caching, add `Be concise.` to every instruction and hope for the best.

All valid. All treating the symptom.

We did something different.

The real problem isn't prompt size. It's context architecture.

When great_cto runs a feature pipeline — architect, PM, senior-dev, QA, security officer — each agent starts by reading the same stack of documents:

ARCH-*.md — full architecture decisions, 3–8k tokens each
PLAN-*.md — implementation plans, 4–10k tokens
decisions.md — every architectural decision made since the project started
lessons.md — every lesson learned, including that one time someone forgot to add an index

Six agents. Each reads all of it. Most of it irrelevant to the task at hand.

A senior-dev implementing a Stripe webhook doesn't need the 200-line deep-dive into the auth system. They need two sentences: "We use Stripe. Card data never touches our infra."

The information was right. The delivery unit was wrong. We were running a library where everyone gets every book, every time.

Phase 1: Stop sending full documents. Send summaries.

Every artifact in great_cto now has a paired .summary.md — auto-generated, ≤250 tokens, structured for the consuming agent:

# ARCH — Multi-tenant auth system · summary
- **Decision:** SAML over OIDC for enterprise; JWT internally
- **Stack:** Node 20, Passport.js, PostgreSQL row-level security
- **Risks:** SAML metadata rotation, session fixation on tenant switch
- **Full doc:** docs/architecture/ARCH-auth.md

Agents read the summary first. If they need depth — the path to the full doc is right there. In practice, 80% of reads stop at the summary. The other 20% at least know exactly what they're looking for.

The numbers:

	Before	v2.19.0
13 artifacts, per agent read	21,459 tokens	2,216 tokens
Reduction		–89.7%

The summary generates automatically via a PostToolUse hook the moment any agent writes an artifact. Anthropic Haiku if you have an API key (~$0.0005/call). OpenRouter Kimi K2 as fallback. Deterministic keyword heuristic if neither — zero cost, works offline, mildly embarrassed about the quality but gets the job done.

No config. No manual steps. Write artifact, get summary.

Phase 2: Stop injecting the entire memory. Filter it to the task.

decisions.md is an append-only log. It grows. A typical project after three months: 40–80 entries — database choices, API decisions, security tradeoffs, that one auth approach you tried and abandoned at 2am.

Before v2.19.0, the architect agent received the full file every time. 3–5k tokens, of which maybe 200 were actually relevant to the task. The model read all of it, politely, and quietly ignored most of it.

Now: one call to scripts/memory-filter.mjs "add Stripe webhook integration" decisions.md --k=5

The filter scores each entry against the task title. For "add Stripe webhook integration" — you get the PCI decision, the webhook signature lesson, the relevant security pattern. Not the database choice from six months ago that has nothing to do with anything.
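To give a feel for the heuristic path, here is a toy top-k filter that scores entries by word overlap with the task title. The real scoring inside memory-filter.mjs is not described in this post, so the overlap metric and the function name are assumptions.

```python
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k(task: str, entries: list[str], k: int = 5) -> list[str]:
    """Score each entry by words shared with the task; return the best k with a hit."""
    task_words = _words(task)

    def score(entry: str) -> int:
        return len(task_words & _words(entry))

    # sorted() is stable, so entries with equal scores keep their log order.
    ranked = sorted(entries, key=score, reverse=True)
    return [e for e in ranked[:k] if score(e) > 0]
```

Entries with no overlap at all are dropped even when k would allow them, which keeps an unrelated database decision out of a Stripe task.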

The numbers:

	Before	v2.19.0
decisions.md inject per agent pair	946 tokens	544 tokens
Reduction		–42.5%

Latency: ~50ms heuristic, ~200ms Haiku. Cost: <$0.0001 per call. Opt-out: GREAT_CTO_DISABLE_MEMORY_FILTER=1 (for when you miss the old noise).

The combined pipeline: before vs. after

Six agents per feature. Each reads artifacts and memory.

	Before	v2.19.0
Total tokens per feature	134,430	16,560
Reduction		–87.7%
Cost saved (Sonnet $3/1M)		$0.35 per feature
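The combined figures can be reproduced from the two earlier tables: six agents, each reading the artifact set plus the decisions.md injection. A quick check, using only numbers already quoted in this post:

```python
AGENTS = 6

before = AGENTS * (21_459 + 946)   # full artifacts + full decisions.md, per agent
after = AGENTS * (2_216 + 544)     # summaries + filtered memory, per agent
saved = before - after
reduction = saved / before
cost_saved = saved * 3 / 1_000_000  # Sonnet at $3 per 1M tokens

print(before, after, f"{reduction:.1%}", f"${cost_saved:.2f}")
# prints: 134430 16560 87.7% $0.35
```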

This is with a small project — 13 artifacts, 7 decisions. The savings compound with scale: at 50 artifacts and 50 decisions (a project six months in), the legacy number climbs past 600k tokens per feature run. The filtered number stays roughly flat.

That's the interesting property of this architecture: the noise grows with the project, the signal doesn't.

What this isn't

This is not prompt compression. We're not removing information — we're delivering it at the right granularity, to the right agent, at the right moment.

The full docs are still there. The full decisions.md is still there. Any agent that needs depth can read it — the summary tells them exactly where to look. The filter acknowledges it might miss something ("if you suspect a relevant lesson is missing, read the full file directly"). It's a hint, not a wall.

We're not betting on the model being smart enough to ignore irrelevant noise. We're not hoping a `Be concise.` instruction somewhere will solve a structural problem. We're betting on information architecture: the same principle that makes an indexed database faster than a full table scan.

The index doesn't know less than the table. It knows where to look.

Getting it

Everything shipped in v2.19.0:

scripts/generate-summary.mjs — --all, --check, --force
scripts/memory-filter.mjs — --k=N, --heuristic, --stats
agents/_shared/artifact-summary-contract.md — the producer/consumer contract
31 tests, all green

npx great-cto upgrade

Summaries generate on first --all run, then stay fresh automatically. Memory filter activates in architect and senior-dev agents — no config needed.

What's next

Phase 3: session-scoped read cache. When five agents in one pipeline all read PROJECT.md, only the first actually reads the file. The rest get a cache stub with a hash. Target: additional –15% on multi-agent runs.
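Since Phase 3 is still a plan, the following is only a guess at its shape: a per-session cache that hands out the full text once and a short hash stub afterwards. The class name and stub format are invented, and this version still touches the file on every call in order to notice edits, so it saves tokens rather than disk reads.

```python
import hashlib
import tempfile
from pathlib import Path

class SessionReadCache:
    """First read returns the content; repeat reads return a short hash stub."""

    def __init__(self):
        self.seen: dict[str, str] = {}

    def read(self, path: Path) -> str:
        text = path.read_text()
        digest = hashlib.sha256(text.encode()).hexdigest()[:12]
        key = str(path.resolve())
        if self.seen.get(key) == digest:
            return f"[cached {path.name} sha256:{digest}, unchanged since first read]"
        self.seen[key] = digest  # new or modified file: hand over the full content
        return text
```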

Phase 4: system prompt audit across all 30+ agent files. Removing filler. Enforcing token budgets. Finding the seven places we wrote "carefully" when the model was going to be careful anyway.

The full plan is public: docs/plans/PLAN-token-economy-2026-q2.md

great_cto v2.17 - no more setup rituals

Alexander Velikiy — Fri, 22 May 2026 15:03:35 +0000

If you've ever spent 20 minutes setting up Claude Code plugins before you could even start working - this update is for you.

One install, everything works.

Previously: install great_cto, then figure out that Superpowers and Beads are also needed, find the repos, clone them, enable them in settings, restart. Classic.

Now - one command:

npx great-cto install

Done. Superpowers and Beads install automatically as companion plugins. They land in ~/.claude/plugins/cache/local/, get enabled in settings.json, and are ready to work. If git is missing - great_cto gives a friendly hint instead of silently failing.

Jurisdiction-aware agents.

The new jurisdictions module detects the context of your project - EU, US, Canada, UK, Australia, and others - and automatically activates the right regulatory reviewer agents.

Working on a fintech product for European users? The EU reviewer turns on automatically. Building for the Canadian market? PIPEDA gets covered. No manual configuration, no trying to remember what applies where.

Eight jurisdictions are currently supported, and the list keeps growing.

Critics before the plan.

The most expensive bugs aren't in the code - they're in the decisions made before coding starts. Three new critic agents now run at the earliest stages of the pipeline, before a single line is written.

Architecture critic catches structural problems that make future work impossible. Coupling that rules out multi-tenancy. An "obvious" O(n²) loop that works fine in dev and falls apart at scale. These aren't bugs - they're constraints that quietly close off entire solution spaces.

Spec critic catches "we solved the wrong problem" - the worst class of bug, because there's no way to unit-test for it. By the time the code works correctly, it may be doing entirely the wrong thing.

Schema critic catches the migration that will deadlock a 50M-row table 10 minutes after deploy. A NOT NULL column without a default. An index added without CONCURRENTLY. The kind of change that looks clean in a code review and becomes an incident.
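Two of those schema findings can be approximated with a few regular expressions. This is a deliberately naive sketch for PostgreSQL-style migrations, not the schema critic's actual logic: it splits on semicolons, does not parse SQL, and the function name is hypothetical.

```python
import re

def lint_migration(sql: str) -> list[str]:
    """Flag two risky patterns named above; a regex sketch, not a SQL parser."""
    findings = []
    for stmt in (s.strip() for s in sql.split(";") if s.strip()):
        upper = " ".join(stmt.upper().split())  # normalize case and whitespace
        if re.search(r"\bADD COLUMN\b.*\bNOT NULL\b", upper) and "DEFAULT" not in upper:
            findings.append(f"NOT NULL column without default: {stmt}")
        if re.match(r"CREATE (UNIQUE )?INDEX\b", upper) and "CONCURRENTLY" not in upper:
            findings.append(f"index built without CONCURRENTLY: {stmt}")
    return findings
```

Run against a migration that adds `region text NOT NULL` and then a plain `CREATE INDEX`, it reports both; adding a DEFAULT and CONCURRENTLY clears the list.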

Previously, critics only appeared starting from Plan stage. Now they cover the three positions where a mistake is most expensive.

llm-leash UI: 16 new features.

llm-leash is the great_cto admin board - a local web UI that shows what your AI agents are doing, what they've spent, what passed review, and what needs your attention. Think of it as a control panel for the agent pipeline.

This release adds 16 new features to the board. The most useful ones:

Cmd-K - global command palette for navigation.
Issues subtab - all security and compliance findings in one place.
Session timeline - visual history of what happened and when.
Topology graph - shows agent dependencies. Useful when you have 5+ parallel agents running.
HITL diff - human-in-the-loop review of agent changes before they're applied.
OPA config - Open Policy Agent integration for compliance rules.
SOC2 export - one-click audit trail for compliance officers.
Rule comparison - compare policy versions side by side.

Companion plugins out of the box.

A bit more detail on how the Superpowers + Beads bundle works, since the architecture is non-obvious.

Superpowers - a methodology plugin. It gives Claude Code skills: /brainstorm, /write-plan, /execute-plan, code review workflow, TDD cycle, parallel agent execution. Without it, Claude acts on vibes. With it - on a structured plan.

Beads - a git-native task tracker. Tasks live as commits, survive session restarts, have dependencies and blockers. Claude creates and closes them autonomously as it works.

great_cto - the orchestration layer. It routes requests to the right agents, enforces reviewers based on archetype and jurisdiction, manages the agent pipeline.

Together: you describe what needs to be done, great_cto breaks it into a plan, Beads tracks it, Superpowers enforces methodology, the right reviewer agents plug in automatically.

TL;DR

npx great-cto install

npm: https://www.npmjs.com/package/great-cto
GitHub: https://github.com/avelikiy/great_cto

Feedback and PRs welcome.

AI Agents Work While You Sleep — Now They Can Wake You Up

Alexander Velikiy — Mon, 18 May 2026 08:22:47 +0000

Let me describe a Tuesday evening.

I fire off /start "refactor billing module", the pipeline kicks in, six AI agents start doing their thing, and I think: great, I've got an hour. I'll cook pasta.

I cook pasta. I eat pasta. I do the dishes. I put on an episode of something. I come back.

The pipeline has been waiting for my approval for 54 minutes. The senior-dev agent is sitting there, doing absolutely nothing, blocked on a gate:plan that needed one click from me. Fifty-four minutes of human absence. Zero pasta to show for it.

This is the core tension of running an AI pipeline: the whole point is that it works while you're not watching. But the moment it needs you, it needs you immediately — and it has no way to tell you that.

Until now.

What we added

Two things, both live in the board's Notifications settings:

Email alerts — you enter your email, click a verification link, done. From that point on, five specific events send you an email:

A P0 incident opens (the pipeline found a production fire)
A gate has been waiting for your approval for more than 30 minutes
A gate is actively blocking the pipeline right now
Your monthly AI spend crosses the limit you set
Monday morning weekly digest — what got done, what it cost, how many gates passed

Browser push notifications — desktop notifications, the same kind you get from Slack or email. You enable them once in the board settings, the browser asks for permission, and that's it. No app to install. No Firebase. No account anywhere.

Why exactly these five triggers

The first version of this feature had fifteen triggers. It was immediately annoying. Every time an agent sneezed, my phone buzzed.

The honest answer to "what do I actually need to know about right now" is surprisingly short:

Something broke. Not "an agent is running" or "a review started" — the actual situation where production is on fire and the pipeline found it before I did.

I'm blocking progress. The pipeline stopped and is waiting for me. The longer I wait, the more time I've wasted running all those agents. If it's been 30 minutes, I definitely didn't see the gate notification in the terminal.

I'm about to overspend. LLM costs are real. A runaway pipeline on a big refactor can quietly rack up $20–30 if nobody's watching. A cost alert at $15 is much better than discovering $40 on the invoice.

Weekly summary. Not urgent, but useful. Monday morning coffee + what your AI team accomplished this week = a surprisingly good way to start the day.

That's the whole list. No alerts for "agent started", "agent completed", "reviewer disagreed with another reviewer", or any of the other events that feel important but mostly produce noise.
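
The five triggers can be written down as one small decision function. This is a sketch under assumptions, not the board's code: `PipelineState`, its field names, and the alert names are all hypothetical; only the 30-minute threshold and the five events come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

GATE_WAIT_LIMIT_MIN = 30  # from the post: alert after 30 minutes of waiting


@dataclass
class PipelineState:
    # Hypothetical snapshot of the pipeline; field names are illustrative.
    p0_open: bool
    gate_waiting_minutes: Optional[float]  # None when no gate is pending
    gate_blocking: bool
    month_spend_usd: float
    spend_limit_usd: float
    is_monday_morning: bool


def due_alerts(state: PipelineState) -> list[str]:
    """Return the names of the triggers that should fire for this snapshot."""
    alerts = []
    if state.p0_open:
        alerts.append("p0_incident")
    waited = state.gate_waiting_minutes
    if waited is not None and waited > GATE_WAIT_LIMIT_MIN:
        alerts.append("gate_waiting")
    if state.gate_blocking:
        alerts.append("gate_blocking")
    if state.month_spend_usd > state.spend_limit_usd:
        alerts.append("spend_limit")
    if state.is_monday_morning:
        alerts.append("weekly_digest")
    return alerts


# The pasta incident: a gate has waited 54 minutes and blocks the pipeline.
pasta = PipelineState(False, 54, True, 4.20, 15.00, False)
print(due_alerts(pasta))  # ['gate_waiting', 'gate_blocking']
```

Anything not in this function ("agent started", "agent completed") simply has no branch, which is the point of keeping the list short.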

The email setup

Open the board → Settings → Notifications. Enter your email. You get one verification email — click the link, and you're verified forever. No repeated sign-ins, no token rotation, no dashboard at some third-party service.

Under the hood the board sends emails through a relay we run at greatcto.systems. We're using Resend on the free tier (100 emails/day), which is more than enough for a solo developer who isn't actively burning the place down.

Why a relay and not direct SMTP? Because storing an SMTP password inside a local tool that lives on your laptop is a disaster waiting to happen. The relay holds the credentials; your board just sends an HTTPS request. If the relay is down, you miss a notification. That's fine. You're not building a hospital.

The push notifications

This one was more fun to build.

Browser push notifications sound simple — they're everywhere, every website pesters you with them — but implementing them correctly from scratch is genuinely involved. There's a spec called VAPID that requires signing cryptographic tokens with elliptic curve keys, and basically every tutorial says "just use the web-push npm package."

We couldn't. The board server is intentionally zero-dependency — no npm packages, no node_modules, nothing. It's a single file that you can read start to finish in an afternoon. Adding a library for notifications would mean adding a library for notifications and everything that library depends on, which is how you end up with 47 packages installed to send one HTTP request.

So we implemented it ourselves using only what Node.js ships with.

The fun gotcha: the signature format that Node produces by default is not the format that the push service expects. Node's default is DER-encoded (an ASN.1 structure wrapping the two signature integers); the VAPID token needs the same two integers as raw bytes, concatenated. Our first test push hit the browser's push service and got a 401 back. Fifteen minutes of reading specs later, we found the conversion, fixed it, and every subsequent push worked perfectly.
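
For readers who have not met the two formats, here is a minimal sketch of the conversion. It is written in Python purely to show the byte layout and is not the board's implementation; the function name `der_to_raw` is an assumption. A DER ECDSA signature is an ASN.1 SEQUENCE of two INTEGERs (r and s), while the raw form is r and s as fixed-width big-endian values, 32 bytes each for the P-256 curve.

```python
def der_to_raw(der: bytes, size: int = 32) -> bytes:
    """Convert a DER-encoded ECDSA signature to raw r||s (size bytes each)."""

    def read_int(buf: bytes, pos: int) -> tuple[int, int]:
        if buf[pos] != 0x02:  # ASN.1 INTEGER tag
            raise ValueError("expected INTEGER")
        length = buf[pos + 1]  # short-form length; enough for P-256 integers
        start = pos + 2
        value = int.from_bytes(buf[start:start + length], "big")
        return value, start + length

    if der[0] != 0x30:  # ASN.1 SEQUENCE tag
        raise ValueError("expected SEQUENCE")
    # Skip the sequence length (short form, or long form with N length bytes).
    pos = 2 if der[1] < 0x80 else 2 + (der[1] & 0x7F)
    r, pos = read_int(der, pos)
    s, pos = read_int(der, pos)
    # DER drops leading zeros and may add a 0x00 sign byte; the raw form
    # is always exactly `size` bytes per integer.
    return r.to_bytes(size, "big") + s.to_bytes(size, "big")


# Toy signature with r = 1 and s = 2 (not a real signature).
raw = der_to_raw(bytes.fromhex("3006020101020102"))
print(len(raw))  # 64
```

In Node itself, `crypto.sign` accepts a `dsaEncoding: 'ieee-p1363'` key option that produces the raw form directly; whether the board uses that option or converts by hand is not stated here.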

The end result: you toggle the switch in the board, your browser asks "Allow notifications?", you click Allow, and from that point on your desktop shows a native notification whenever any of the five triggers fire. Same notification you'd get from a Slack message. No third-party service involved.

The notification drawer

Push and email are for when you're away from the keyboard. The in-app drawer is for when you're at the keyboard but not watching the terminal.

Click the bell icon in the top nav. A panel slides out with the last 20 notifications — what fired, when, and whether you've seen it. Unread count shows as a badge on the bell. "Mark all read" lives right there.

The history persists across restarts. Close the board, reopen it tomorrow morning, your Monday digest is still there.

How to try it

npx great-cto init
npx great-cto board

Go to Settings → Notifications. Add an email or enable push (or both). Then just work normally. If a gate waits too long, you'll hear about it.

The whole thing is open source: github.com/avelikiy/great_cto (MIT, free, you pay your own LLM provider).

The pasta incident has not repeated since.

Real cost breakdown: 10 packs, $0.60 LLM bill, $42K saved per regulated feature

Alexander Velikiy — Sun, 17 May 2026 14:29:04 +0000

This is the numbers post. If you read the ten-packs deep-dive and walked away wanting the spreadsheet, here it is.

All numbers below are from real client engagements (anonymized aggregates) plus telemetry from the GreatCTO install base. Not projections. Not vendor-pitch math.

Per-feature: the $42K → $0.60 + 50 hours of human review

A single regulated feature in a single industry. Pre-pipeline:

Identify which regs apply          ~8h    × $200      = $1,600
Read primary regulation text      ~14h    × $200      = $2,800
Map regulation → stack            ~20h    × $250      = $5,000
Draft threat model                ~32h    × $250      = $8,000
Consent flow + UX                 ~20h    × $180      = $3,600
Implementation                    ~40h    × $180      = $7,200
Internal legal review              ~8h    × $400      = $3,200
External auditor pre-meeting      ~10h    × $350      = $3,500
Revisions                         ~16h    × mixed     = $3,500
Final signoff                      ~4h    × $400      = $1,600
                                  ─────                ─────
                                  ~172h               ~$40K
                                                      (rounded $42K with overhead)

With pipeline:

LLM compute (architect+reviewers)  ~$0.60-$1.40 per feature
Human review of LLM output         ~14-18h × mixed     ~$3,800
External auditor pre-meeting       ~6-8h   (lower because tighter document)
Internal legal                     ~8h     (unchanged)
                                   ─────                ─────
                                   ~28-34h              ~$11-14K

Net saved per feature: ~$28-30K and ~140 hours of human time. The LLM bill is a rounding error.

The $0.60 number is per feature, not per MVP. Some readers conflated these. A small fintech feature on Claude Sonnet costs ~$0.60-$1.40 in LLM calls. A full MVP run with all 10 packs activated and ~30 features costs ~$500-$1,500 in LLM compute. Both numbers are honest; they describe different scopes.

Per-MVP: $287K → $128K (~55% reduction)

A voice-AI MVP, three months of work, traditional team composition:

1 Product Manager × 3 months × $180/h × 120h/mo  = $64,800
4 Engineers × 3 months × $180/h × 140h/mo        = $302,400
Architecture work (internal or fractional CTO)   = ~$20,000
Security review (external)                       = ~$15,000
Compliance setup (consultant + internal time)    = ~$28,000
Misc (PM tools, hosting trial, design)           = ~$8,000
                                                   ─────────
                                                   ~$438K nominal
                                                   ~$287K after overlap & efficient teaming


With pipeline + agentic SDLC, same MVP, 6-8 weeks:

1 Product Manager × 2 months × $180/h × 120h/mo  = $43,200
2 Engineers × 2 months × $180/h × 140h/mo        = $100,800
LLM compute across the whole run                 = ~$1,200
Architecture review (1 sr human, 3 sessions)     = ~$3,000
Security review (external, same)                 = ~$15,000 (unchanged; see "What does NOT compress")
Compliance setup (pipeline output + ~12h review) = ~$5,500
Misc                                             = ~$8,000
                                                   ─────────
                                                   ~$176K nominal
                                                   ~$128K after similar overlap savings

Net: ~$159K saved per MVP, ~45% time saved. Most of the saving is not the LLM bill; it is fewer engineer-months, because senior-dev parallelism plus auto-review compress the build phase.
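
The arithmetic behind both tables can be re-derived in a few lines. This sketch only recomputes the line items quoted above; the `staffing` helper and the dictionary keys are illustrative names, and the "after overlap" figures are taken from the post as given rather than derived.

```python
HOURLY = 180  # $/h for the PM and the engineers, from the tables above


def staffing(people: int, months: int, hours_per_month: int, rate: int = HOURLY) -> int:
    """Cost of a role: headcount x months x hours per month x hourly rate."""
    return people * months * hours_per_month * rate


traditional = {
    "pm": staffing(1, 3, 120),
    "engineers": staffing(4, 3, 140),
    "architecture": 20_000,
    "security_review": 15_000,
    "compliance": 28_000,
    "misc": 8_000,
}
pipeline = {
    "pm": staffing(1, 2, 120),
    "engineers": staffing(2, 2, 140),
    "llm_compute": 1_200,
    "architecture_review": 3_000,
    "security_review": 15_000,
    "compliance": 5_500,
    "misc": 8_000,
}

nominal_before = sum(traditional.values())  # 438200
nominal_after = sum(pipeline.values())      # 176700

# Share of the nominal saving that comes from engineer time alone.
engineer_share = (traditional["engineers"] - pipeline["engineers"]) / (
    nominal_before - nominal_after
)

# Adjusted figures as quoted in the post (after overlap savings).
adjusted_before, adjusted_after = 287_000, 128_000
saved = adjusted_before - adjusted_after
reduction = saved / adjusted_before

print(nominal_before, nominal_after)                  # 438200 176700
print(saved, f"{reduction:.0%}", f"{engineer_share:.0%}")  # 159000 55% 77%
```

On the nominal numbers, roughly three quarters of the saving is engineer time, and the LLM line is about $1,200 of a six-figure total.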

Per-quarter / per-runway: the bet that changes

For a founder shipping into one regulated industry (most realistic scenario):

	Traditional	Pipeline	Saved
MVP time	3 months	6-8 weeks	~1.5 months
MVP cost	$287K	$128K	$159K
Compliance setup (4 features, year 1)	$168K	$48K	$120K
Year 1 total	$455K	$176K	$279K
Equivalent runway months @$50K burn	9.1 mo	3.5 mo	5.6 months recovered

For a founder shipping into 10 industries (hypothetical "compliance-heavy AI products" portfolio):

	Traditional	Pipeline	Saved
Year 1 (10 MVPs × overlap)	$1.45M	$580K	$870K
Wall-clock (sequential)	30 months	10 months	20 months
Wall-clock (with parallelism)	21 months	7 months	14 months

The 10-industry case is hypothetical — no real founder ships into all 10 simultaneously. But it shows the structural ratio: roughly 60% cost reduction, roughly 67% wall-clock reduction.

LLM compute: where the money goes

Per-MVP LLM compute, ~$500-$1,500 total, breaks down roughly:

senior-dev × 4-8 features            ~70%     (code-writing is expensive)
architect (per-feature ARCH.md)      ~12%
specialist reviewers (5 per feature) ~10%     (verdicts are cheap)
pm (decomposition)                   ~3%
qa-engineer (test scaffolds)         ~3%
detection + memory + misc            ~2%

The reviewers are roughly 10% of cost despite being 5 of the 8 agents that run. They output verdicts, not code. If your LLM cost is exploding, look at how much code is being generated, not how many agents are running.
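
To turn the percentages into dollars, here is a small sketch that splits a run's LLM budget by the shares listed above. The $1,200 input is the per-MVP compute figure from the cost table; the function name `split_budget` is an assumption.

```python
# Share of per-MVP LLM compute by role, in percent, as listed above.
SHARES_PCT = {
    "senior-dev": 70,
    "architect": 12,
    "specialist reviewers": 10,
    "pm": 3,
    "qa-engineer": 3,
    "detection + memory + misc": 2,
}


def split_budget(total_usd: float) -> dict[str, float]:
    """Split a total LLM bill across roles according to SHARES_PCT."""
    return {role: total_usd * pct / 100 for role, pct in SHARES_PCT.items()}


budget = split_budget(1_200)
for role, usd in budget.items():
    print(f"{role:<28} ${usd:,.0f}")

# Five reviewers share the reviewer line.
per_reviewer = budget["specialist reviewers"] / 5
print(per_reviewer)  # 24.0
```

At $1,200 per run, the five reviewers together cost about $120, against about $840 for code generation.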

Hardware / model-choice ratios

We tested Sonnet 4.6 vs Haiku 4.5 vs Opus 4.5 on the same 23-feature batch:

Model	LLM cost ratio	Wall-clock ratio	architect output quality (human eval, blind)
Haiku 4.5	0.31×	0.74×	"noticeably worse" — 4 of 23 ARCH docs unusable
Sonnet 4.6	1.0× (baseline)	1.0×	acceptable, default
Opus 4.5	5.1×	1.27×	"marginally better" — 1 ARCH doc clearly superior

Conclusion: Sonnet for everything except deep-reasoning architecture decisions. Use Opus only for architect on greenfield features in unfamiliar territory. Haiku for high-volume worker agents (pair programming, code generation) where the ARCH note is not on the critical path.

What does NOT compress

I have called this out before, but in numbers terms:

Item	Compressible?
External audit cycle (NYC bias auditor, 2-4 weeks)	No
FDA pre-submission meeting (60-90 days)	No
IRB approval (clinical trials, 8-12 weeks)	No
Wet-lab validation (drug discovery)	No
HARA signature (functional safety, 1 calendar moment)	No
Lawyer reading the threat model	Compresses (LLM-written threat model is faster to read than human-written long-form)
Regulator phone calls	No

Anything that requires another organization's calendar runs at human speed. Internal work compresses 5-25×. External-dependency work does not move.

For an early-stage AI startup on 18-24 month runway, the bet that changes is the internal portion. You can now run 3 external compliance cycles per year instead of 1.5, because the internal prep for each one compressed from six weeks to ten days.

The thing I underbet

When I started building the packs, I assumed the ROI claim would be "30-40% on compliance cost." The number ended up larger and the shape surprised me — most of the saving is not the LLM compute (it is rounding error) but the fewer engineering-months the parallelism enables, plus the fewer consulting hours the LLM-drafted threat model enables.

If you take one number from this post: the LLM compute is not the moat. The pipeline that runs the agents in parallel, gates the right humans at the right scope, and persists memory across incidents is the moat. The LLM is the substrate.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code with 10 compliance packs. MIT, runs locally. Pay your own LLM API. Per-pack numbers (which 10 industries, what each pack does, real consulting-rate comparisons) are in the W21 deep-dive.

The MTTR -94% claim, with receipts

Alexander Velikiy — Sun, 17 May 2026 14:28:21 +0000

Earlier posts cite a "median MTTR drop of 94.1% across 47 paired P0 incidents." This post is the receipts. The full methodology is also in docs/benchmarks/MTTR.md — this post explains why the number is what it is, and the four cases it does not capture.

What got measured

Setup:

12 production repositories (mix of fintech, voice-AI, clinical, dev-tools).
P0 incident defined as: user-facing, paged a human, took ≥15 minutes to resolve.
Window: rolling 6 months. Pre-treatment + post-treatment.
Treatment: the project installed GreatCTO and started persisting (pattern_hash, detection_order_that_worked, rationale) after each P0 resolved.
Outcome: time from page to root-cause identified.

I measured detection time, not full resolution time. Resolution depends on rollout speed, blast radius, customer comms — too many confounds. Detection time is the part where memory could conceivably help, and it is the part where humans burn the most calendar hours on recurring bugs.

The number

47 paired incidents. "Paired" means: same shape (same pattern_hash) seen at least twice across the 6-month window, once before persistence, once after.

Stat	Pre	Post	Delta
Median detection time	178 min	11 min	−94.1%
Mean detection time	224 min	17 min	−92.6%
90th percentile	412 min	41 min	−90.0%
Worst case (post)	n/a	89 min	n/a
Best case (post)	n/a	4 min	n/a

The distribution is skewed by a couple of near-100% cases (postgres pool exhaustion and a connection-string typo that the agent matched to a prior incident's commit diff and flagged in under 5 minutes). I report the median because it is less misleading than the mean for skewed distributions. The 90th percentile is probably the number you should care about: at 412 min versus 41 min, it is the "still roughly 10× faster on the bad cases" claim.
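
A note on reading the Delta column: a "median drop" over paired incidents can mean the drop between the two column medians, or the median of the per-pair drops, and the two need not agree. The ratio of the published medians (178 → 11) works out to about 93.8%, so the 94.1% figure is presumably the median of per-pair drops; that reading is an assumption, since the table does not say. The sketch below shows the difference on synthetic pairs, which are not the benchmark data.

```python
from statistics import median


def pct_drop(pre: float, post: float) -> float:
    """Percentage reduction from pre to post."""
    return (pre - post) * 100 / pre


# Drop between the two published column medians.
print(round(pct_drop(178, 11), 1))  # 93.8

# Synthetic (pre, post) detection times in minutes, for illustration only.
pairs = [(100, 10), (200, 10), (300, 60)]

drop_of_medians = pct_drop(
    median(pre for pre, _ in pairs),
    median(post for _, post in pairs),
)
median_of_drops = median(pct_drop(pre, post) for pre, post in pairs)

print(drop_of_medians)  # 95.0
print(median_of_drops)  # 90.0
```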

How the mechanism works

The agent stores, for each resolved incident:

pattern_hash:   sha256(normalized_log_signature + topology_hint)
detection_order: ["check_pg_pool_size", "check_connections", "check_query_count"]
rationale:      "connection_refused logs + pool > 80% utilization → pool exhaustion, not network"

On a new incident, the agent's Step 0 is: hash the current incident's signature, look up in ~/.great_cto/incident_memory.jsonl, if pattern hits, try the prior detection_order first. If it identifies the root cause: log "memory hit." If it does not: fall back to systematic exploration.
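
Step 0 can be sketched as a plain lookup. This is a minimal illustration, not the code in packages/cli/src/memory.ts: the JSONL record layout (one JSON object per line with `pattern_hash`, `detection_order`, `rationale`) and the `run_check` callback are assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_memory(path: Path) -> dict[str, dict]:
    """Read a JSONL memory file into {pattern_hash: record}; later lines win."""
    memory: dict[str, dict] = {}
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                record = json.loads(line)
                memory[record["pattern_hash"]] = record
    return memory


def step_zero(
    pattern_hash: str,
    memory: dict[str, dict],
    run_check: Callable[[str], bool],
) -> Optional[str]:
    """Try the remembered detection order first.

    Returns the check that identified the root cause ("memory hit"), or
    None, meaning the caller falls back to systematic exploration.
    """
    record = memory.get(pattern_hash)
    if record is None:
        return None  # no pattern hit
    for check in record["detection_order"]:
        if run_check(check):
            return check
    return None  # pattern hit, but the prior path did not find the cause


# In-memory example using the record shown above ("abc123" is a placeholder hash).
memory = {
    "abc123": {
        "pattern_hash": "abc123",
        "detection_order": ["check_pg_pool_size", "check_connections", "check_query_count"],
        "rationale": "connection_refused logs + pool > 80% utilization",
    }
}
hit = step_zero("abc123", memory, run_check=lambda name: name == "check_connections")
print(hit)  # check_connections
```

The four misses described later correspond to the second `return None`: the pattern matched, but the remembered order pointed at the wrong cause.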

There is no inference. The agent is not "smarter" — it is just skipping hypothesis exploration time because someone (you, last time) already paid for that exploration.

⚠ The 4 honest misses

Memory-based detection is not magic. Four of the 47 cases had pattern matches that pointed in the wrong direction and burned 10-30 minutes before the agent gave up and fell back to systematic exploration.

Miss #1. Pattern matched on log signature "OOMKilled in worker pool." Prior detection order was "check worker memory limits." Reality: this time, the OOM was a memory leak in a different worker that pushed the wrong worker over its limit. Agent spent 18 minutes confirming the wrong worker's limits before noticing the leak. Total detection time: 34 minutes vs ~80 minutes baseline. Net positive but ugly.

Miss #2. Pattern matched "5xx spike from API gateway." Prior cause was upstream DB lag. Reality: this time it was a misconfigured rate-limiter that started rejecting requests after a deploy. Agent ran "check DB lag" for 12 minutes before pivoting. 28 minutes total vs ~140 baseline. Still a win, but called a "miss" because the prior path was wrong.

Miss #3. Pattern matched "auth failures after deploy." Prior cause was OAuth client secret rotation. Reality: a clock skew on one node caused JWT signature validation to fail. Agent's prior detection order led it through token store inspection first. 41 minutes total vs ~200 baseline.

Miss #4. Worst case. Pattern matched "DNS resolution failures." Prior detection order was "check Route 53 health checks." Reality: a third-party CDN had an outage. The agent's path was completely wrong, did not give up early enough, and a human had to manually override at minute 22. 89 minutes total vs ~150 baseline. Win on absolute time, but I would not call this a "memory worked" case.

If I report the 47 cases as "94.1% median drop," I owe the audience the 4 cases where the mechanism worked badly. They are 8.5% of the sample. The remaining 91.5% of cases saw memory either help significantly (74%) or be irrelevant (no pattern hit, fell straight to systematic exploration — 17%).

How to replicate in your own repo

Three steps, no GreatCTO required:

Persist incident memory. After each P0 resolves, write (pattern_hash, detection_order, rationale) to a markdown file in your repo. Plain text. Git-trackable.
At incident start, ask your agent to read that file before doing anything else. Even Claude Code with no plugins will use the file if you point it at one.
Track detection time. Page-to-RC-identified, in minutes. Spreadsheet is fine.

Run for one quarter. If you see a consistent reduction in detection time on recurring patterns, you have your own version of this mechanism. If you do not see reduction, your incidents are too unique or your pattern hash is too coarse.

The hash I use is sha256(top_3_log_lines_normalized + topology_hint) where topology_hint is the service name. This gets ~70% recall on similar incidents and very few false hits. You can tune for your domain.
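
A minimal sketch of such a hash follows. The post does not say what "normalized" means, so the normalization rules here (lowercasing, masking timestamps, hex IDs and numbers) are assumptions, as are the function names and the sample service names.

```python
import hashlib
import re


def normalize(line: str) -> str:
    """Mask volatile tokens so that recurrences of an incident hash alike."""
    line = line.lower()
    line = re.sub(r"\d{4}-\d{2}-\d{2}[t ]\d{2}:\d{2}:\d{2}\S*", "<ts>", line)  # timestamps
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<hex>", line)  # request IDs, commit hashes
    line = re.sub(r"\d+", "<n>", line)  # ports, counts, percentages
    return " ".join(line.split())


def pattern_hash(log_lines: list[str], topology_hint: str) -> str:
    """sha256 over the first three normalized log lines plus the service name."""
    top3 = "\n".join(normalize(line) for line in log_lines[:3])
    return hashlib.sha256(f"{top3}\n{topology_hint}".encode()).hexdigest()


may = pattern_hash(
    ["2026-05-01T10:00:01Z connection refused to 10.0.0.5:5432", "pool utilization 87%"],
    "billing-api",
)
june = pattern_hash(
    ["2026-06-12T03:14:15Z connection refused to 10.0.0.9:5432", "pool utilization 91%"],
    "billing-api",
)
print(may == june)  # True
```

Masking more tokens makes the hash coarser (more hits, more false hits); masking fewer makes it finer. That is the tuning knob mentioned above.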

What I will not do

Some readers ask for the raw data (anonymized incidents). I will not publish it — even anonymized, customers can be re-identified from incident shapes and timing. I will share the synthetic test cases in tests/incident_memory.test.mjs and the aggregate statistics in docs/benchmarks/MTTR.md. That is enough to verify the mechanism without leaking client incident data.

What this is not

Not an RCT. Observational. Twelve repos is small. The selection bias is real: the repos that adopted GreatCTO early were also the ones with the best L3 culture. A worse team might see a 30% drop instead of 94%.

The number I would defend to your board: on recurring incident patterns, memory-driven detection compresses detection time by 5-10× median, with a long tail of near-zero-improvement cases. That is more honest than "94%." But "94%" is what shows up in the data.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Memory layer source is in packages/cli/src/memory.ts. The full benchmark methodology is at docs/benchmarks/MTTR.md.

Three days of code, six weeks of compliance — the math behind why

Alexander Velikiy — Sun, 17 May 2026 14:28:15 +0000

If you have shipped into a regulated industry, you know this ratio. Engineering ships a feature in three days. Compliance setup around the feature takes six weeks. Some founders get used to it. The right reaction is: the ratio is the bug.

This post is for the CEO / CTO who reads "What $1.4M of compliance work looks like in 14 hours" and wants to understand the mechanism — why six weeks specifically, and where in those weeks an LLM can save time without anyone getting sued.

Where the six weeks actually go

I priced this out properly the last three times I lived it as a CTO-for-hire. Numbers below are typical for a voice-AI or fintech feature shipping in 2025-2026.

Phase	Median hours	Hourly rate	Subtotal
Identify which regulations apply	8	$200 (senior legal)	$1,600
Read primary regulation text	12-16	$200	~$2,800
Map regulation → your stack	16-24	$250 (compliance consultant)	~$5,000
Draft threat model	32	$250	$8,000
Draft consent flow + UX changes	16-24	$180 (senior PM + senior frontend)	$3,600
Implement consent + audit log	40	$180	$7,200
Internal legal review of threat model	8	$400 (general counsel)	$3,200
External auditor pre-meeting + Q&A	10	$350 (specialist)	$3,500
Revisions, second pass	16	mixed	~$3,500
Final sign-off	4	$400	$1,600
Total	~190 hours	mixed	~$42,000

This is a single regulated feature. Multi-jurisdictional (US + EU + India + state-level US) doubles or triples it. Multi-feature (a startup shipping into a regulated industry has 8-15 such features in the first six months) makes the aggregate $300K-$500K of consulting before the product exists in production.

Where an LLM helps

Not all of those 190 hours are equal. Some are mechanical, some require judgment, some require relationships.

Mechanical (can be 80-90% automated):

Reading primary regulation text. The CFR is plain text. The EU AI Act Annex III is plain text. LLMs read 200 pages faster than any human can think. Replaces ~12-16 hours.
Mapping regulation to stack. "Does our PCI-DSS scope include the webhook signature verifier?" is a deterministic question with a regex-and-citation answer. Replaces ~12-18 hours of the 16-24.
Drafting threat model. Each pack has a 200-word template (down from my first 800-word version — auditors politely asked for shorter). LLM fills it in using regulation text + your ARCH.md. Replaces ~24-28 hours of the 32.
Generating evidence artifacts (decision logs, gate signoffs, audit trail). The pipeline emits these as side effects, not as a separate phase. Replaces ~6-8 hours.

Judgment (human time stays roughly constant):

Identify which regulations apply. Mostly mechanical, but the "is this an edge case" call is human. Reduces from 8h to ~2-3h of review.
Drafting consent flow UX. Pure product judgment. The LLM writes a first pass you can react to in 15 minutes instead of authoring from scratch in 4 hours. Reduces from 16-24h to ~4-6h.
Implementation. Coding is faster with LLM assistance, but the gates are real. Reduces from 40h to ~10-15h.

Relationship (cannot be automated, and pretending otherwise is malpractice):

Internal legal review. Your GC has to sign. Their time is your time. Unchanged at 8h.
External auditor pre-meeting. The auditor wants a human on the other end of the phone who can defend the threat model under questioning. The LLM-generated threat model is the document the auditor reads. The conversation about it is yours. Unchanged at 10h, but the auditor reads a tighter document faster, so call it 6-8h net.

New math:

Phase	Old	New	Saved
Identify regs	8h	2-3h	~6h
Read regs	12-16h	1-2h	~13h
Map to stack	16-24h	3-4h	~17h
Threat model	32h	4-6h	~27h
Consent UX	16-24h	4-6h	~15h
Implementation	40h	10-15h	~28h
Internal legal	8h	8h	0
External auditor	10h	6-8h	~3h
Revisions	16h	6-8h	~9h
Final signoff	4h	4h	0
Total	~190h	~50-65h	~125-140h
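
As a cross-check, the itemized rows can be summed directly. The sketch below uses only the hour ranges from the table. The new column sums to 48-64h, in line with the ~50-65h total. The old column sums to 162-182h (midpoint 172h, which matches the ~172h quoted in the cost-breakdown post), so the ~190h total presumably includes time that is not broken out per phase; that reading is an assumption.

```python
# (old_low, old_high, new_low, new_high) hours per phase, from the table above.
PHASES = {
    "Identify regs": (8, 8, 2, 3),
    "Read regs": (12, 16, 1, 2),
    "Map to stack": (16, 24, 3, 4),
    "Threat model": (32, 32, 4, 6),
    "Consent UX": (16, 24, 4, 6),
    "Implementation": (40, 40, 10, 15),
    "Internal legal": (8, 8, 8, 8),
    "External auditor": (10, 10, 6, 8),
    "Revisions": (16, 16, 6, 8),
    "Final signoff": (4, 4, 4, 4),
}

old_low = sum(p[0] for p in PHASES.values())
old_high = sum(p[1] for p in PHASES.values())
new_low = sum(p[2] for p in PHASES.values())
new_high = sum(p[3] for p in PHASES.values())

print(old_low, old_high)  # 162 182
print(new_low, new_high)  # 48 64

# Hours saved, comparing midpoints of the itemized ranges.
saved_mid = (old_low + old_high) / 2 - (new_low + new_high) / 2
print(saved_mid)  # 116.0
```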

Wall-clock compresses from six weeks to about ten working days, partly because work is removed and partly because the work that remains can run in parallel (the LLM drafts while the auditor pre-meeting is scheduled).

Cost compresses from ~$42K to ~$15-18K (LLM bill ~$50-150, human time the rest). Median compression I have measured: ~60% on cost, ~67% on wall-clock.

Why this is not "AI replaces compliance consultants"

The compliance specialist of 2027 is someone who knows which regulation applies in which jurisdiction and can operate a pipeline to do the reading and templating for them. Same depth of judgment. Five times the productivity.

That person is going to win market share against the consultant still billing by the hour to read 200 pages of regulation. Not because their judgment is better — it is the same. Because their cost-per-judgment is one-fifth.

The judgment is the moat. The reading and templating around the judgment has been commoditized. This is the same transition that happened to junior associates in law firms when document-review tools landed in 2010-2015. Senior partners did not disappear; they got faster.

What does not compress

External calendar time. The auditor still books two weeks out. The FDA pre-submission meeting is still 60-90 days. IRB approval is still 8-12 weeks. Internal work compresses 5-25×; external-dependency work does not move.

If your runway is 18 months and you ship into a regulated industry, the realistic plan is:

Compress internal compliance work from 6 weeks to 10 days.
Use the recovered 4 weeks to run the external cycles in parallel with the next feature.
End up with one external cycle per quarter, not one every two quarters.

That math doubles the number of features that ship through compliance per year for the same runway. For an early-stage AI startup, that is the difference between catching the wave and missing it.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code with 10 compliance packs. MIT, runs locally. The cost-by-pack breakdown is in the W21 deep-dive.

How GreatCTO chooses which compliance pack to attach

Alexander Velikiy — Sun, 17 May 2026 14:27:39 +0000

Every time someone runs npx great-cto init, the CLI has to decide:

What kind of project is this? (one of ~25 archetypes)
Which compliance packs apply on top? (voice / clinical / fintech / lending / 6 more)
Are any of those guesses wrong enough that the user will get a useless threat model and abandon the tool?

That last question is what makes the detection logic interesting. Get it wrong and the first impression is "this is producing nonsense about regulations I don't care about." Get it too conservative and the user has to manually configure packs that should have auto-attached, defeating the point.

After four months in production, here is what works.

What I tried first: LLM-based detection

Original design (rejected after 2 weeks): pipe the repo's README, package.json, and top-level directory listing into Claude and ask it to classify.

Problems, in order of severity:

Latency. First run of init took 12-18 seconds instead of <1s. Users perceived this as broken.
Cost. Roughly $0.04 per init. Negligible per user, real money at scale.
Hallucinations. Claude classified a Helm chart for an internal Kubernetes operator as "fintech, because the README mentions billing in the Operator's logging section." The README has nothing to do with billing as a feature: the word "billing" appeared once, describing log volume.
Variance. Same repo, same prompt, two runs: voice-AI then mlops. Probably temperature noise. Not acceptable for a decision that shapes the rest of the pipeline.

Killed it. Went to a regex-based detector. Latency dropped from 15s to 180ms. Cost dropped to $0. Variance dropped to zero.

The trade-off: regex cannot read intent. It reads tokens. A repo that says it does voice AI in its README but actually contains a music-recommender model will get the voice pack. That is a false positive I accept because the alternative (LLM in the loop) had its own false positives and was 80× slower.

The current detector

Three signal layers:

Layer 1 — package.json dependencies. twilio / livekit / deepgram / elevenlabs → voice pack. stripe / plaid / dwolla → fintech. tensorflow / pytorch + transformers → ml-pack (different from voice-pack). And so on for ~80 strong signal tokens.

Layer 2 — file paths. clinical/, fda/, phi/, hipaa/ in directory names → clinical pack. webhook/ + signature-related code → api-platform-pack.

Layer 3 — README + top-level docs grep. Exact-match keywords only, not fuzzy. "AEDT", "automated employment decision", "NYC Local Law 144" → hr-ai pack. "21 CFR Part 11", "SaMD", "FDA pre-submission" → clinical pack.

Each pack has a minimum signal count. voice-pack needs ≥2 of its 11 tokens. fintech needs ≥3 of 14. This is what cut false positives roughly in half.
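
The threshold idea is easy to show for Layer 1. This is a sketch under assumptions, not the code in packages/cli/src/detect.ts: the token sets below contain only the tokens named in this post (the real lists are longer), exact matching on dependency names is a simplification, and the function names are hypothetical.

```python
# Abbreviated, hypothetical signal table. The post names only some tokens per
# pack (voice has 11, fintech has 14); the remaining ones are not listed here.
PACK_SIGNALS = {
    "voice": {"tokens": {"twilio", "livekit", "deepgram", "elevenlabs"}, "min": 2},
    "fintech": {"tokens": {"stripe", "plaid", "dwolla"}, "min": 3},
}


def dependency_names(package_json: dict) -> set[str]:
    """Collect names from `dependencies` only, ignoring `devDependencies`."""
    return {name.lower() for name in package_json.get("dependencies", {})}


def detect_packs(dependencies: set[str]) -> list[str]:
    """Attach a pack only when enough of its signal tokens are present."""
    attached = []
    for pack, rule in PACK_SIGNALS.items():
        hits = rule["tokens"] & dependencies
        if len(hits) >= rule["min"]:
            attached.append(pack)
    return sorted(attached)


# Illustrative package.json content; versions are made up.
pkg = {
    "dependencies": {"twilio": "^5.0.0", "livekit": "^2.0.0", "stripe": "^14.0.0"},
    "devDependencies": {"plaid": "^1.0.0", "dwolla": "^1.0.0"},
}
print(detect_packs(dependency_names(pkg)))  # ['voice']
```

Here `stripe` alone is one fintech signal out of a required three, so fintech stays detached, while two voice tokens are enough to attach the voice pack.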

The false positives I have logged

Across 4 months and ~340 init runs (instrumented from telemetry), 12 confirmed false positives:

repo type	wrongly attached pack	trigger	fix
static-site generator	voice-pack	README explicitly disclaiming Twilio	exact-match keywords only
music-recommender ML	voice-pack	"audio" in package description	removed "audio" as solo trigger
internal Helm chart	fintech	"billing" in operator log section	minimum 3 signals
docs-only repo	clinical	"patient" in user-research subfolder	excluded `docs/` from path scan
game-server prototype	mlops	`torch` in optional dev-dep	only scan `dependencies`, not `devDependencies`
7 others	various	various	each addressed via test case in `tests/detection.test.mjs`

The 12 cases are committed as regression tests. If the detector ever re-introduces one of these false positives, CI fails.

The case I worry about: silent false negatives

Easier to log a false positive (user complains "why is this thing telling me about TCPA"). Harder to catch a false negative (user runs init on a repo that should have hr-ai pack attached, doesn't, ships with no bias audit, gets fined two years later).

Mitigations:

/migrate command. Rerun detection with updated rules. New packs (or new keywords for existing packs) get a second chance to attach.
PROJECT.md is editable. The packs: list is plain YAML. User can add manually if detection missed.
Public catalogue. greatcto.systems/companies.html lists 200+ companies and the packs that would auto-attach to each. If a user's similar competitor is in the catalogue, they get a sanity check on whether their detection is correct.
Telemetry on no-pack runs. When init detects zero packs, we log it (anon, opt-in). If a class of project keeps coming through with no pack and the cost-of-miss is high (regulated industry), I add detection rules.

I have not had a confirmed regulatory false negative yet. That is partly because the user population is small (~500 active installs as of writing) and partly because the high-stakes archetypes (clinical, fintech, lending) have strong-signal vocabulary that is hard to miss.

What I will not add

People keep asking for two features I have rejected:

"Pack confidence scores." The detector should output 0-1 confidence per pack so the user can sort. I rejected this: it implies a precision the regex layer does not actually have, and users will treat a 0.6 score as "halfway right" when really it means "one signal matched, probably noise."
"Auto-update detection from telemetry." If we see 10 users with xyz in their repo overriding our detection, automatically add xyz as a fintech signal. Rejected: too easy to poison. One determined attacker registers 10 fake xyz/random-name repos with manual fintech tags and the global detector starts attaching fintech to everyone using xyz.

Both of these are textbook examples of "the obvious feature that becomes a backdoor."

What I might add

LLM in the loop, but only for ambiguous cases. If 2+ packs have signal but below threshold for any one, pipe the README into Claude with a strict "pick one or 'unclear'" prompt. Latency penalty only on the 5-10% of repos that are ambiguous, not all of them.
Per-language detection. Right now everything assumes Node/Python/JVM-ish patterns. Rust and Go projects sometimes have weak signal even when they are clearly fintech or healthcare. Not urgent — those communities are smaller in the user base.
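
The "ambiguous case" condition in the first item can be stated precisely. The sketch below takes one reading of it (two or more packs show some signal, none reaches its own threshold); the function name and input shape are assumptions, and nothing like this is claimed to exist in the detector today.

```python
def is_ambiguous(signal_counts: dict[str, int], thresholds: dict[str, int]) -> bool:
    """True when 2+ packs show signal but none reaches its own threshold."""
    with_signal = [pack for pack, count in signal_counts.items() if count > 0]
    any_attached = any(signal_counts[p] >= thresholds[p] for p in with_signal)
    return len(with_signal) >= 2 and not any_attached


thresholds = {"voice": 2, "fintech": 3}
print(is_ambiguous({"voice": 1, "fintech": 2}, thresholds))  # True
print(is_ambiguous({"voice": 2, "fintech": 1}, thresholds))  # False
```

Only repos where this returns True would pay the LLM latency; every other repo keeps the regex-only path.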

The detection logic is small, boring, and one of the parts of the system I am most defensive of. It is the first thing every user sees, and a wrong first guess loses them.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. The detector source is in packages/cli/src/detect.ts — read or fork.