Ryosuke Tsuji

Posted on • Edited on • Originally published at ryantsuji.dev

Building a Real AI Harness: Auto-Reviewed PRs, Self-Healing Ops, and Non-Engineer Contributors (Series Intro)

Hi, I'm Ryan, CTO at airCloset.

In my previous posts I've introduced the full picture of our 17 internal MCP servers, an MCP server that searches 991 internal tables in natural language, a custom Graph RAG for measuring initiative impact, and the Sandbox MCP that lets non-engineers publish AI-built apps safely.

All of those run on top of an internal AI development platform we call cortex. This post is the first in a series about cortex itself — the platform, the design choices, and the operational experience.

Series Index

| # | Theme | Key scene | Article |
|---|-------|-----------|---------|
| 1 | Series intro: cortex's harness | PRs auto-merge / incidents self-heal before you notice | this post ← you are here |
| 2 | Product Graph (cpg) | Code, docs, DB, infra unified into one graph | cortex-product-graph |
| 3 | AI PR review | webhook → AI review → auto-fix → squash merge | coming |
| 4 | Self-Healing + observability + auto-added guardrails | Alert → AI investigates → fix PR + new lint/type gate → auto redeploy + same-pattern writes get auto-rejected | coming |
| 5 | Scaling the harness from cortex to toC services | Non-engineer contributions in practice + scaling cortex's harness to the whole product org | coming |
| 6 | Series wrap-up | The underlying philosophy (what was given up, what was kept, why this design) plus a retrospective on the failures and lessons | coming |

Two Scenes, Up Front

Scene 1: PRs merge themselves

Monday morning. An engineer implements a feature locally, pushes a branch, opens a PR.

  • A few minutes later, the AI reviewer comes back with REQUEST_CHANGES. Multiple comments:
    • "This data formatting duplicates formatRow() in the shared package. Please consolidate."
    • "You changed an API response type, but the related docs (docs/api/...) still describe the old shape."
  • A separate AI agent spawns a worktree, applies the fixes, pushes a follow-up commit
  • Re-review comes back as APPROVE
  • Auto squash-merge
  • GitHub Actions detects only the changed stacks and deploys them to Cloud Run / Cloudflare Pages

No human touched any of this. The engineer refreshes the PR tab and notices it's already merged.

Scene 2: Incidents fix themselves before you notice

7 AM. A Grafana alert fires: "BQ pipeline failed 3 times in a row."

  • An AI receives the webhook, fetches the error logs from Loki via the Grafana MCP
  • Walks the Product Graph (implementation name: cortex-product-graph — a unified knowledge graph of the codebase, docs, DB schemas, and infrastructure definitions; covered later in this post and in Part 2) to trace the pipeline's code, dependent tables, and related docs, identifying the root cause
  • Opens a fix PR
  • AI reviewer APPROVE → auto squash-merge → automatic redeploy

By the time the engineer logs in at 9 AM, Slack already shows: "pipeline patched." The only incidents engineers personally handle are the ones AI genuinely can't crack.

[Figure: Two automation loops]

What's behind both scenes is the dev environment described in the rest of this post.

Industry Context — "Harness Engineering"

Before I get to cortex, one paragraph of context. Over the past six months, the practice of building proper foundations for AI agents in production has crystallized into a recognized industry trend.

"Harness" itself isn't a new word. In AI specifically, it traces back to EleutherAI's lm-evaluation-harness (2020) — the LLM evaluation framework that put the term in active use. What changed in the past six months is its elevation into an engineering discipline for LLM agents in production:

  • Feb 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world", describing how a small internal team led by Codex shipped 1 million lines in 5 months
  • A few days later, Mitchell Hashimoto (HashiCorp co-founder, Terraform creator) distilled it into the formula Agent = Model + Harness
  • April 2026: Martin Fowler (author of Refactoring, ThoughtWorks Chief Scientist) published "Harness engineering for coding agent users", establishing the Guides (proactive controls) / Sensors (reactive controls) framing
  • Same month: Anthropic and Cursor each published their own harness write-ups

The catchphrase that's gone viral: "2025 was the year of agents. 2026 is the year of harnesses."

The framing is: the model itself is rapidly commoditizing (the gap between Claude / GPT / Gemini is narrowing from the user side). Where you actually get differentiation is how you design the harness — the foundation that lets AI run in production.

cortex is most cleanly read as a real attempt to build that "harness" inside a real company. In this post I'll organize cortex using Fowler's Guides / Sensors framing.

From here, I'll show how the "harness beats model" thesis takes concrete shape on cortex.

Who Builds the Code

For the first few months, I built 100% of cortex by myself. The accurate framing isn't "without a harness, others can't safely PR" but rather "without a harness, no one — including me with extra hands — could ride this thing."

Even back then, between our Google Meet recording pipeline (Japanese), about half of the 17 MCP servers, and a long tail of unpublished features, roughly 50 loosely-coupled applications were already running. Each one had its purpose, background, and data flow documented carefully. But the volume was such that even with AI in the loop, you couldn't realistically have it read all the relevant docs and absorb the whole picture for any given change. The codebase had outgrown what a person — or an AI given pieces — could hold in their head at once.

Recently, with the harness in place, non-engineers (business-side managers, PMOs, etc.) have started shipping PRs to cortex too. As of writing, the cumulative commit ratio is ~91% me, ~9% other recent contributors.

If you imagine non-engineers opening PRs against a production repo, "can quality really hold?" is the obvious question. In cortex, the answer is yes, because AI review and automation own the quality gates:

  • PRs missing annotations, tests, or lint cleanliness get REQUEST_CHANGES from the AI reviewer
  • A separate AI agent applies the fixes
  • Until everything is satisfied, nothing merges

So whoever writes a PR, engineer or not, the same quality bar is met at the moment it merges. The key point: it's not "you can write freely," it's "you can write inside rails that don't let you derail." The author's job stops at "communicating the intent precisely"; the harness owns code correctness.

The shift is from "X could write that because they're X" to "X can write that because of cortex." That property only emerges once the harness is built — and it's the core of cortex's design.

What's Running

cortex consists of microservices, jobs, MCP servers, web frontends, Cloudflare Workers, and so on. As of writing, there are 123 apps. The features I've already covered in past posts are each composed of multiple apps — but even adding them up by feature, only about 10% of cortex has been written about. The remaining 90% hasn't appeared in a post yet. A few examples:

  • A unified product UX measurement web app — UX metrics, screen analysis, funnels, and error analysis in one place
  • A dev-org portal web app — KPIs (bug rate, etc.), per-member GitHub Activity, QA evaluation results, plus an AI chat that answers natural-language questions about KPIs via Agentic RAG
  • A family of Slack bots for operational support:
    • A config bot that lets you manage job configurations (DBs, attendance SaaS, Google Drive, etc.) directly from Slack
    • An accounting-assist bot that takes invoice OCR and drafts payment requests / expense filings in our accounting SaaS
    • In-channel knowledge search, issue/request management, meeting creation; a BigQuery cross-table RAG bot; a Google Drive cross-corpus RAG bot
    • A marketing bot that returns insights (trend, creative analysis) from BigQuery marketing data
  • An APM auto-analysis agent that runs daily on monitoring-SaaS APM data, detects performance issues, and opens tickets in our issue-tracking SaaS
  • An AI-bot auditor bot that runs E2E tests against the Slack bots above and detects spec drift

…and so on. Each will get its own dedicated post later in the series.

Scale at a glance:

| Item | Count |
|------|-------|
| apps (microservices, jobs, MCP servers, web, etc.) | 123 |
| packages (shared libraries) | 66 |
| MCP servers | 19 |
| Pulumi stacks | 110 |
| TypeScript (implementation) | ~630K lines |
| Tests | ~560K lines |
| Markdown documentation | ~110K lines / 389 files |
| Duration | ~5 months (intensive development: ~4 months) |
| Merged PRs | ~790 |

The 4-Element Flywheel — cortex's Harness

What lets "~4 months of intensive dev, mostly solo" coexist with "non-engineers shipping into the same repo" is a harness design that delegates quality to AI and automation across every layer.

cortex's harness is structured as a flywheel of 4 elements, mapped to Fowler's Guides (proactive) / Sensors (reactive) split, that mutually reinforce one another.

[Figure: cortex AI Harness Flywheel]

① Product Graph (Guides — supplying the right context)

All of cortex (code, documentation, DB schemas, infrastructure definitions) is indexed as a single unified graph, which is rebuilt incrementally on every push to main so it stays within minutes of the codebase. It's queryable via MCP through semantic search.

"Where is the code that calculates this KPI?" → "Which BQ tables does that code touch?" → "What are those tables' column definitions?" → "What docs are related?" — all of these can be answered from a single query traversal. That graph becomes the context source for everything the AI does.

This is the foundation that "structurally reduces how often the AI gets confused." Where grep tells you "where the string appears," the Product Graph tells you "what is connected, why, and how." Implementation details come in Part 2.
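
To make the traversal idea concrete, here is a minimal in-memory sketch of that chain of questions. It is an illustration under stated assumptions, not the Product Graph implementation (Part 2 covers that): the node kinds, edge labels, file paths, and table names below are all hypothetical.

```typescript
// Hypothetical node and edge shapes; the real Product Graph schema is not shown in this post.
type NodeKind = "code" | "table" | "column" | "doc";
interface GraphNode { id: string; kind: NodeKind }
interface GraphEdge { from: string; to: string; label: "reads" | "hasColumn" | "documentedBy" }

// Made-up example data: one KPI calculation, the table it reads, a column, and a doc.
const nodes: GraphNode[] = [
  { id: "apps/kpi/calcChurn.ts", kind: "code" },
  { id: "bq.analytics.subscriptions", kind: "table" },
  { id: "bq.analytics.subscriptions.cancelled_at", kind: "column" },
  { id: "docs/kpi/churn.md", kind: "doc" },
];

const edges: GraphEdge[] = [
  { from: "apps/kpi/calcChurn.ts", to: "bq.analytics.subscriptions", label: "reads" },
  { from: "bq.analytics.subscriptions", to: "bq.analytics.subscriptions.cancelled_at", label: "hasColumn" },
  { from: "apps/kpi/calcChurn.ts", to: "docs/kpi/churn.md", label: "documentedBy" },
];

// Breadth-first walk from a start node, visiting each reachable node once.
function reachable(startId: string, allEdges: GraphEdge[]): string[] {
  const seen = new Set<string>([startId]);
  const queue: string[] = [startId];
  const order: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === undefined) break;
    for (const edge of allEdges) {
      if (edge.from === current && !seen.has(edge.to)) {
        seen.add(edge.to);
        order.push(edge.to);
        queue.push(edge.to);
      }
    }
  }
  return order;
}

// Keeps only the ids whose node has the requested kind.
function idsOfKind(ids: string[], kind: NodeKind, allNodes: GraphNode[]): string[] {
  const wanted = new Set(allNodes.filter((n) => n.kind === kind).map((n) => n.id));
  return ids.filter((id) => wanted.has(id));
}
```

Starting from the KPI code node, one walk yields the table, its column, and the related doc, which is the "single query traversal" property in miniature. grep could find each string, but not the typed connections between them.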

② Lint / Quality Gates (Guides — physically blocking deviations)

eslint-disable / oxlint-disable are forbidden anywhere in the repo. In hand-written code, occurrences of : any / as any / TODO / FIXME are 0 (excluding generated files and unavoidable external-library cases). Type checking (using tsgo — Microsoft's Go port of the TypeScript compiler, ~10× faster than tsc; we use it to keep CI time down) runs on the entire codebase in CI.

On top of that, test coverage is enforced at ≥90% for statements / branches / functions / lines. Lowering the threshold to pass is forbidden — you write tests instead.

With every escape hatch sealed, even when the AI writes wrong code, it doesn't merge. This is also what stabilizes AI review judgments downstream.
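
As a rough picture of what "every escape hatch sealed" means mechanically, the sketch below scans source text for the patterns named above. It is a naive line-based illustration: cortex enforces these through its lint plugins, whose rules are not shown in this post, and the function name, pattern labels, and regular expressions are assumptions.

```typescript
interface Violation { line: number; pattern: string; text: string }

// Patterns named in the text: lint-disable comments, `: any`, `as any`, TODO, FIXME.
// A real lint rule would work on the AST; regexes like these also match inside strings.
const FORBIDDEN: { pattern: string; regex: RegExp }[] = [
  { pattern: "lint-disable", regex: /\b(eslint|oxlint)-disable\b/ },
  { pattern: "any-annotation", regex: /:\s*any\b/ },
  { pattern: "as-any", regex: /\bas\s+any\b/ },
  { pattern: "todo-fixme", regex: /\b(TODO|FIXME)\b/ },
];

// Returns one violation per (line, pattern) match, with 1-based line numbers.
function findEscapeHatches(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { pattern, regex } of FORBIDDEN) {
      if (regex.test(text)) {
        violations.push({ line: index + 1, pattern, text: text.trim() });
      }
    }
  });
  return violations;
}
```

A CI step built on a check like this fails the build when the returned list is non-empty, so the only way forward is to fix the code rather than silence the rule.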

③ Auto Review (Sensors — auto-fixing until the bar is met)

Scene 1 above is exactly this. The implementation-side note: AI review here isn't "lint with extra steps" — every comment is grounded in Product-Graph traversal of the actual impact. That's where it earns its keep. To give you a feel, comments that actually fire fall into categories like:

  • [Graph] Critical: missing annotation that breaks an edge in the graph
  • [Impact] Critical: a BQ MERGE statement referencing a column not present in the existing target table; would fail in production
  • [Doc] Critical: code change that left related docs stale
  • [Security] Minor: execSync doing string interpolation on an env var, opening a command injection vector

This is not the surface-level commentary you might associate with "AI review." Comments here are produced with the entire codebase carried as context, which is what the Product Graph integration buys you.

The only PRs that actually need a human are the ones where AI review hits a hard case. Day-to-day PRs go from push to merge without anyone touching them.
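
The control flow of that loop can be sketched in a few lines. This is a simplified model, not the pipeline itself (Part 3 covers that): `review` and `fix` are hypothetical hooks standing in for the AI reviewer and the fixing agent, they are synchronous only to keep the sketch short, and the iteration cap with human escalation is an assumption.

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";
interface ReviewResult { verdict: Verdict; comments: string[] }
interface LoopOutcome { merged: boolean; iterations: number; needsHuman: boolean }

// Runs review → fix → re-review until the reviewer approves or the budget runs out.
function reviewUntilApproved(
  review: () => ReviewResult,
  fix: (comments: string[]) => void,
  maxIterations: number,
): LoopOutcome {
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    const result = review();
    if (result.verdict === "APPROVE") {
      // Only an approval leads to the squash merge.
      return { merged: true, iterations: iteration, needsHuman: false };
    }
    fix(result.comments);
  }
  // Budget exhausted: hand the PR to a human instead of merging.
  return { merged: false, iterations: maxIterations, needsHuman: true };
}
```

The property that matters is that there is no code path to `merged: true` other than an approval, which is what lets the author's identity drop out of the quality question.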

④ Self-Healing (Sensors — re-injecting production anomalies into the loop)

Scene 2 above is exactly this. Starting from a Grafana alert, the AI traces the root cause through Product Graph + Loki + git blame, opens a fix PR, and pushes it through ③ Auto Review until it's auto-merged. Re-injecting anomalies into the loop is the essence of Sensors. Details in a later post.

What Makes It a Flywheel

These 4 elements mutually reinforce one another:

  • ① Product Graph exists, so ③ Auto Review can comment with real impact awareness
  • ② Lint enforces the ground rules, so ③ Auto Review can assume "everything in the codebase meets the bar"
  • ③ Auto Review exists, so new code lands in ① Product Graph with correct semantic annotations
  • ④ Self-Healing's incidents loop back through ③, maintaining the quality bar all the way back to ①

The harness's effectiveness scales with the size of the codebase, not against it.

Supporting Foundations

Three foundations make the 4 elements possible (covered in detail in Part 4):

  • Tests and coverage: ~630K lines of implementation, ~560K lines of tests (impl : test ≈ 1.13 : 1)
  • Documentation: ~110K lines / 389 files, written for both humans and AI, also ingested as Document nodes in the Product Graph
  • Observability: Frontend = Faro, backend = OTel, infrastructure and CI logs all consolidated in Grafana. The AI sees the same data humans see. Gemini API token usage and cost are tracked separately in Prometheus.

Technical Foundation

cortex is a full-TypeScript monorepo.

| Layer | Stack |
|-------|-------|
| Applications (apps/) | TypeScript (Hono, TanStack Router, Vite, etc.) |
| Shared packages (packages/) | TypeScript |
| Infrastructure (infra/) | TypeScript (Pulumi) |
| Edge (worker/) | TypeScript (Cloudflare Workers) |
| Lint plugins | TypeScript |
| Doc scripts | TypeScript (tsx) |

Having everything in one language is a much bigger win when viewed from the AI's side than from a human's. Specifically:

  • You can feed the AI ASTs and type definitions directly as context — no language boundary fragments the picture
  • Refactors don't cross language boundaries — one ESLint plugin can inspect and auto-fix apps/, packages/, and infra/ together
  • Edges don't break in the Product Graph — for example, a Cloud Run service definition (infra/, TS) connects in a single graph to the Hono route (apps/, TS) it actually invokes

When you ask the AI "what does this change affect?", the reason it can hop infra → apps → packages and answer in one round-trip is that all of this is one language.

Build is parallelized via Turborepo and pnpm workspaces. Deploys go through GitHub Actions, which detects only changed stacks and applies them in parallel via Pulumi.
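
One simple way to picture changed-stack detection is a mapping from each stack to the paths it is built from. The sketch below assumes exactly that; the stack names and path prefixes are hypothetical, and the actual GitHub Actions logic is not shown in this post.

```typescript
// Hypothetical mapping from a Pulumi stack name to the path prefixes it depends on.
interface StackSpec { stack: string; paths: string[] }

// Returns the stacks touched by a set of changed files, sorted for stable output.
function changedStacks(changedFiles: string[], specs: StackSpec[]): string[] {
  return specs
    .filter((spec) =>
      spec.paths.some((prefix) => changedFiles.some((file) => file.startsWith(prefix))),
    )
    .map((spec) => spec.stack)
    .sort();
}

// Example data (made up): two stacks share a package, one does not.
const exampleSpecs: StackSpec[] = [
  { stack: "db-graph-mcp", paths: ["apps/db-graph-mcp/", "packages/bq-client/"] },
  { stack: "portal-web", paths: ["apps/portal-web/", "packages/ui/"] },
  { stack: "sandbox-mcp", paths: ["apps/sandbox-mcp/", "packages/bq-client/"] },
];
```

With this data, a change under `packages/bq-client/` selects both stacks that depend on it and leaves `portal-web` alone. A real setup would more likely derive the dependency closure from the workspace graph than from hand-written prefixes.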

Numbers (snapshot at time of writing)

Scale

| Metric | Value |
|--------|-------|
| Duration | ~5 months (intensive development: ~4 months) |
| Commits | ~4,000 |
| Merged PRs | ~790 |
| % of commits authored by me | ~91% |
| apps | 123 |
| packages | 66 |
| MCP servers | 19 |
| Pulumi stacks | 110 |
| TypeScript (implementation) | ~630K lines |
| TypeScript (tests) | ~560K lines |
| Markdown documentation | ~110K lines / 389 files |
| as any / TODO / unjustified lint-disable in hand-written code | 0 (excluding generated files / unavoidable external-library cases) |
| Coverage gate | 90% (statements / branches / functions / lines) |

The PR-flow Switch That Multiplied Throughput

Up until April, I reviewed every change carefully on my own machine, with AI assistance, and then committed directly to main. The review bar was unchanged, but throughput was bottlenecked on my hands.

In April, switching to fine-grained, PR-based operation (auto review → auto fix → auto merge) dramatically changed the per-month merged-PR count:

| Month | Merged PRs |
|-------|------------|
| 2026-02 | 10 |
| 2026-03 | 23 |
| 2026-04 | 518 |
| 2026-05 (through the 10th) | 235 |

A ~22× jump between March and April. Total commits actually went down (because committing directly to main was replaced by going through PRs), so this isn't "I wrote more code." This is "the manual review step got replaced by the harness, and the throughput ceiling moved." The 22× is exactly the moment a human reviewer was swapped for Auto Review — clean evidence of the flywheel property where the harness's effectiveness scales with codebase size.
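
As a quick arithmetic check on the ~22× figure, the month-over-month factors can be computed directly from the table. The partial May row is left out because it covers only ten days.

```typescript
// Merged-PR counts copied from the table above (full months only).
const mergedPrs: [string, number][] = [
  ["2026-02", 10],
  ["2026-03", 23],
  ["2026-04", 518],
];

// Month-over-month growth factor, rounded to one decimal place.
function growthFactors(series: [string, number][]): { month: string; factor: number }[] {
  const factors: { month: string; factor: number }[] = [];
  let previous: number | undefined;
  for (const [month, count] of series) {
    if (previous !== undefined) {
      factors.push({ month, factor: Math.round((count / previous) * 10) / 10 });
    }
    previous = count;
  }
  return factors;
}

// February → March is 2.3×; March → April is 518 / 23 ≈ 22.5×.
const monthlyFactors = growthFactors(mergedPrs);
```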

What's Required for These Numbers to Hold

These numbers are not explained by "we use AI" alone. The prerequisites:

  • Full TypeScript monorepo — code, tests, infrastructure, scripts all under one static-analysis system
  • Composable Architecture, in which packages/ holds reusable parts and apps/ compose them. Direct imports between apps/ are forbidden; everything routes through packages/. This is what guarantees components don't interfere with each other.
  • Strict quality gates — lint / coverage / annotations are run "no lowering, no working around"
  • Unified graph — code, docs, DB, infrastructure on a single graph as the foundation that lets the AI act with context
  • Auto PR review / auto fix / auto merge / auto self-healing — the harness that swaps the rate-limiting manual step for AI
  • Unified observability — humans and AI see the same data (OTel + Faro + Prometheus)
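
The "no direct imports between apps/" rule from the list above is easy to state as a predicate. The sketch below is a hedged illustration of the check, not the lint rule cortex uses: the function names are hypothetical, and it assumes the import specifier has already been resolved to a repo-relative path.

```typescript
// Returns the app name for a repo-relative path under apps/, or null otherwise.
function appOf(path: string): string | null {
  const match = /^apps\/([^/]+)\//.exec(path);
  return match?.[1] ?? null;
}

// True when a file in one app imports a file that belongs to a different app.
// Imports from packages/ (or anything outside apps/) are allowed by this check.
function isForbiddenCrossAppImport(importerPath: string, resolvedImportPath: string): boolean {
  const importerApp = appOf(importerPath);
  const importedApp = appOf(resolvedImportPath);
  return importerApp !== null && importedApp !== null && importerApp !== importedApp;
}
```

Because every shared dependency has to live in packages/, two agents working in two different apps cannot break each other through a hidden import, which is what makes parallel sessions practical.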

The design has to be in place first, and AI runs on top of it. That's what makes both volume and quality possible at the same time.

Composable Architecture in particular is what drives the headcount-of-one production. Because components don't interfere, multiple Claude Code sessions can run in parallel on different parts of the codebase. In practice, I've run up to ~10 sessions in parallel at peak — this multiplies with the harness's effectiveness.

It's system design, not magic. Each piece will get its own deep-dive in this series.

Some Honest Caveats

If you've read this far, it might sound like everything runs perfectly on autopilot. It doesn't. Three things I want to be upfront about:

1. High code quality doesn't prevent bugs.

What the harness protects is "correctness of the code" — not "correctness of the spec." Even when implementation is clean, getting the spec interpretation wrong still ships bugs. AI review can catch "code contradicts the documented spec," but if the spec itself is wrong, the issue sails right through. That part is still a human responsibility.

2. The work is split deliberately.

New pipelines that connect to external APIs, and anything touching secure data, are handled by engineers. Non-engineers mostly work on modifications to features that already exist (peeking at our business-side members' PRs makes it concrete pretty quickly). "Non-engineers can develop too" means "the harness provides rails they can't derail from, so they can safely modify in maintenance mode" — not "anyone can build anything from scratch."

3. This level of automation works because it's an internal platform.

Yes, cortex's full-auto deploy works partly because Composable Architecture cleanly separates apps and infrastructure. But honestly, a big part of it is that this is an internal-only platform. If something breaks, only employees are affected, and we can roll back fast. The same approach can't be applied directly to consumer products or systems where downtime is immediately critical (warehouse management, for example). We've started moves to close that gap on the consumer side too, but that's a separate post.

Series Roadmap

The series is planned as 6 parts.

Part 1: Series Intro (this post)
The big picture of what cortex is and why it works in "harness" form. The map to the rest of the series.

Part 2: Product Graph — code, docs, DB, infrastructure as one unified graph ★ recommended next
The implementation side: how the unified graph is built and maintained. What happens when you take the design principles from the Agentic Graph RAG MCP post and apply them to the entire cortex codebase.

Part 3: AI reviews, fixes, merges, and deploys PRs
GitHub webhook → AI review → on REQUEST_CHANGES, AI fixes via worktree → auto squash merge → changed-stack detection → parallel deploy: the full pipeline.

Part 4: Incidents self-heal, guardrails self-strengthen
Grafana alert → AI investigation (Loki + Product Graph + git blame) → fix PR + new lint/type gate → auto merge → automatic redeploy: the auto self-healing system. Also covers the full OTel + Loki + Mimir + Tempo + Faro stack, Gemini cost tracking, and how the quality gates are designed to be "non-lowerable, non-bypassable, and self-growing."

Part 5: Scaling the harness from cortex to toC services
The first half covers how business members can already open PRs directly to cortex -- and where that breaks (additions to existing pipelines work; new pipelines and architectural changes still need humans in the loop). The second half is the roadmap and the thinking behind scaling cortex's harness across the whole product org (multiple services, multiple infra stacks, multiple teams).

Part 6: Series wrap-up
The underlying philosophy (what was given up, what was kept, why this design), plus a retrospective on the failures and lessons.

Each post stands on its own, but Part 2 (Product Graph) is the foundation for the others, so the recommended reading order is Part 1 → Part 2 → any.

Cadence: Tuesdays or Thursdays, 8–10 AM JST.

Closing

Building cortex, what's struck me is that in an AI-era dev environment, "absorbing everything that comes after the writing" wins over "reducing the burden on the writer". Tests, lint, types, coverage, code review, incident response — instead of "these get in the way, let's reduce them," the choice that worked was "have the AI do all of them, without compromise." The counterintuitive result is that quality and dev speed both go up at the same time.

And it expands two things — how much one engineer can ship, and how much non-engineers can participate — well beyond what was possible before. That's the texture of the "harness" we've built on top of cortex.

In subsequent parts, I'll walk through the individual mechanisms that make this work.

→ Part 2: Product Graph — code, docs, DB, infrastructure as one unified graph

Top comments (15)

Vadym Arnaut

On the Product Graph -- what's the freshness model? Does it ingest code changes async (via webhooks/CI) or pull on-query? Asking because the auto-review quality probably correlates strongly with how stale the graph is at moment-of-comment; if a PR touches a file the graph hasn't reindexed yet, the impact analysis is partial. Curious how you handle that at 4,000 commits / 5 months scale.

Ryosuke Tsuji

Great question — this is exactly the right place to push on. The freshness model is async via CI on push-to-main. GitHub Actions triggers a Cloud Run Job that rebuilds the graph with differential embedding (only nodes whose textForEmbedding changed get re-embedded via Vertex AI), so a typical push costs ~$0.001 and finishes in a few minutes.
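
For readers who want the shape of differential embedding, a minimal sketch: store a fingerprint of each node's textForEmbedding and re-embed only the nodes whose fingerprint is new or different. The type and function names are hypothetical, and the tiny FNV-1a-style hash over UTF-16 code units is a stand-in to keep the sketch dependency-free; a real build would use a stronger hash.

```typescript
interface GraphNodeText { id: string; textForEmbedding: string }

// Stand-in fingerprint: 32-bit FNV-1a-style hash over UTF-16 code units, as hex.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Returns ids of nodes that are new, or whose textForEmbedding changed since the last build.
function nodesToReembed(
  current: GraphNodeText[],
  previousFingerprints: Map<string, string>,
): string[] {
  return current
    .filter((node) => previousFingerprints.get(node.id) !== fingerprint(node.textForEmbedding))
    .map((node) => node.id);
}
```

Only the returned ids need an embedding call, so the per-push cost tracks the size of the change and not the size of the graph.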

You're right that this means PR review can hit a stale-graph window. Two things soften it more than the raw freshness suggests:

  1. PR review reads diff + graph, not just graph. The PR diff carries the change itself; the graph provides the surrounding context. So even a function the graph hasn't seen yet has its definition in the diff. The AI reviewer reasons about both.
  2. MCP tools have graceful fallback. When the graph misses (e.g., a function freshly added in a sibling PR), the AI falls through to grep_code / read_file against the live git tree via our git-server MCP. So "graph miss → silent failure" doesn't happen — it's "graph miss → fall through to raw code search".

Worst case — a large refactor landing in main while a related PR is open against the old base — the answer is rebase, not graph freshness. For larger scale than ours, the next obvious step would be PR-branch-aware indexing (transient graph for the PR head ref). We haven't needed it yet at 4K commits.

Ryosuke Tsuji

Good point to add on top of what I said: even if a review runs against a stale-ish graph, the merge itself is gated by (a) conflict resolution and (b) re-review on push. So if main moved enough to matter, the PR can't merge cleanly, and once rebased the re-review runs against the now-updated graph. The "stale graph + merged change" pathological combination is structurally hard to hit.

Vadym Arnaut

The diff-as-context plus MCP-fallback composition is the right shape. One thing I'd ask at larger scale: when the fallback kicks in (grep_code / read_file against the live tree), what's the fan-out cap per review? Watching agents work in our codebase, the cost of an unbounded grep walk -- function ref → usage → containing test → unrelated helper — sometimes dwarfs the embedding miss it's filling in. Does cortex's reviewer prompt enforce a depth/budget heuristic, or does differential embedding keep fall-through rare enough that it's not worth bounding?

Ryosuke Tsuji

The honest answer here is closer to your second framing — differential embedding keeps fall-through rare enough that we never bothered with an explicit fan-out cap. Two softer things do the bounding:

  1. Per-tool response size limits. grep_code returns capped output, and files in this repo are small (a 500-line max-code-lines lint rule enforces it). A single fallback call can't return enough for the reviewer to lose itself in unrelated context.
  2. The reviewer is anchored on a documented guideline, not a free-form prompt. We maintain a review guidelines doc that spells out review criteria and severities — composable architecture, impact analysis, security boundaries, test coverage, doc/spec alignment, etc. The reviewer prompt requires reading and applying these for every review, so the work is shaped as "verify the PR against these criteria" rather than "walk around the codebase and tell us what you see." That framing pulls the AI's attention back to the diff and the listed criteria, instead of an open investigation.

We also have an MCP-layer mechanism that scopes AI review more deliberately. I'll cover it in the next post in this series, going up around Tuesday next week.

At larger codebase size or higher cross-cutting churn, both bounds would probably fray, and that's where a hard depth budget on trace or a token-based circuit breaker becomes the right call.

Vadym Arnaut

The "reviewer anchored on documented guidelines, not free-form exploration" is the part that makes the fan-out question almost moot: bounded scope shrinks the surface area before any token math kicks in.

We hit a smaller version of this with cursor rule files in equip. When the rules are specific (one file per area, banned patterns, naming conventions), the agent stays inside lanes even on large diffs. When they drift toward generic "follow best practices", token usage balloons because the model fills the gap by reading everything.

Token-based circuit breaker as v2 makes sense, but the guideline anchor is the real circuit breaker. Same playbook as the invariants-in-PR pattern you described upthread.

Mykola Kondratiuk

accountability gap on non-engineer push is the part that gets messy fast - if a sandbox app fails in prod, who owns the fix? not obvious when the contributor can't debug the stacktrace.

Ryosuke Tsuji • Edited

Great question.
The ownership question lands on me, the CTO. The whole point of opening this work up to non-engineers is that the surrounding environment is supposed to hold even when the contributor can't debug a stacktrace — designing that environment is on me, and so is everything that breaks because the environment didn't hold yet. If a problem surfaces, the fix and the work to prevent it from happening again are the same delivery.

In practice it hasn't gotten messy yet, partly because self-healing catches most issues before anyone has to debug anything. I'll cover that in detail in Part 4 of the series.

The bigger qualifier: this is an internal platform, not customer-facing product code. We're starting to extend similar guardrails to our user-facing product, but I'm not opening non-engineer pushes to that side anytime soon — for that surface, human verification is still a hard requirement.

Mykola Kondratiuk

most teams try to split it - builder vs. environment designer. you end up with two people to blame and neither one fixing it. what you're describing sidesteps that cleanly.

HARD IN SOFT OUT

Big concern from experience: self‑healing can create a “heal and hide” pattern where the same issue recurs and is silently fixed without anyone noticing the root cause is still rotting. Do you pair auto‑healing with mandatory root‑cause logging, or does the system just patch and move on? This is where AI goes from toy to force multiplier. Auto‑reviewed PRs and self‑healing ops aren’t sci‑fi anymore; they’re team‑extenders. I like that you even thought about non‑engineer contributors — that’s where real org‑wide impact lies.

What if the self‑healing steps were always proposed as a “suggested action” with a 5‑minute delay, and only auto‑execute if no human overrides? That gives you a semi‑autonomous system that teaches rather than hides.

Ryosuke Tsuji

Heal-and-hide is the right anti-pattern to name. The way we avoid it: every auto-fix PR carries 3 artifacts beyond the code change.

  1. The fix itself
  2. A pattern doc entry (problem, solution, code example, checklist)
  3. A 1-line invariant added to the agent's persistent rules file, so the next agent run already has the pattern in context

Concrete example from a recent fix: a Cloud Run deploy failed because secretKeyRef version:latest was pointing at a Secret with no versions. The auto-fix PR added the placeholder SecretVersion to unblock the deploy, and the same PR added a 40-line entry to our cloud-run-deploy guideline (problem, solution, code, checklist) plus a 1-line invariant to the agent's rules file. The next agent that touches a Cloud Run service with secretKeyRef sees the pattern before it codes.

Patch-and-move-on is a real risk, agreed. Our answer is making the docs and invariant update part of the PR scope itself. If the agent doesn't generate the learning artifact, the PR isn't ready for review.
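
A rough sketch of how "the learning artifact is part of the PR scope" could be checked from the list of changed files. The path conventions here (`docs/patterns/` and `AGENT_RULES.md`) are hypothetical placeholders, since the actual locations of the pattern docs and the rules file aren't given above.

```typescript
// Hypothetical locations; substitute the real pattern-doc directory and rules file.
const PATTERN_DOC_PREFIX = "docs/patterns/";
const RULES_FILE = "AGENT_RULES.md";

interface ArtifactCheck {
  hasFix: boolean;
  hasPatternDoc: boolean;
  hasInvariant: boolean;
  ready: boolean;
}

// An auto-fix PR is ready for review only if it carries all three artifacts.
function checkLearningArtifacts(changedFiles: string[]): ArtifactCheck {
  const hasPatternDoc = changedFiles.some((file) => file.startsWith(PATTERN_DOC_PREFIX));
  const hasInvariant = changedFiles.includes(RULES_FILE);
  const hasFix = changedFiles.some(
    (file) => !file.startsWith(PATTERN_DOC_PREFIX) && file !== RULES_FILE,
  );
  return { hasFix, hasPatternDoc, hasInvariant, ready: hasFix && hasPatternDoc && hasInvariant };
}
```

A file-level check like this only proves the artifacts exist; whether the doc entry and invariant actually describe the root cause is still a review question.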

On your 5-min delay suggestion: we use the code review window (auto-review agent + human approval) as the override surface instead of a time window. Different trade on the gate, same goal.

HARD IN SOFT OUT

Solid pattern. Baking the learning artifact into the same PR scope closes the loop cleanly — fix, doc, invariant, all in one atomic commit. That's not self-healing anymore, that's self-documenting TDD for ops.

One edge I'd watch: invariant collision. As the rules file grows, two invariants can silently contradict under a narrow condition. Do you have a reconciliation mechanism, or does the review surface catch that before it commits? Curious if you've considered a periodic "rule set lint" pass — treat invariants like a test suite that can be checked for internal consistency.

Ryosuke Tsuji

Nice catch. That check was missing.

Implementing it is cheap in our setup: every doc (including the review guidelines) is ingested into the knowledge graph and the agent reaches it via MCP at any time.
Just opened a PR to add rule set lint to the review guideline. Thanks!

Harjot Singh

The "non-engineer contributors" piece is the most interesting and most dangerous part of this. A real harness (auto-reviewed PRs + self-healing ops) is exactly what makes it safe to let non-engineers contribute - the guardrails do the gatekeeping a senior used to do manually. Without that harness, non-engineer contributions are how you get the 3am incident.

So the harness isn't just productivity, it's the trust boundary. Auto-review catches the "looks right, is wrong" PRs; self-healing absorbs the operational mistakes; and only then can you safely widen the contributor pool. Looking forward to the series - the part I'd most want detailed is how strict the auto-review gate is, because that single threshold decides whether "non-engineer contributors" is empowering or terrifying. Strong intro, subscribed.

Ryosuke Tsuji

Thanks — and "harness as trust boundary" is a sharper framing than I had in my head. The 3am incident line in particular: yes, exactly the scenario the harness has to absorb.

On the strictness question — Part 3 (just out) goes deep on this. Short version: PRs go through an average of 10.8 review-fix iterations before merge (max 56). 9 dimensions are reviewed sequentially under a strict no-downgrade rule, and the usual excuses ("existing code has the same issue", "will fix later", "leave a TODO") are explicitly closed off. At the meta-layer, quality-bar relaxation itself (lowering a lint rule, coverage threshold, or guideline binding) is classified Critical and the AI is forbidden from approving it — a human reviewer's approve is required (severity.md).
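
To illustrate the meta-layer rule for one of those cases, the coverage threshold, here is a small sketch that detects a lowered threshold and routes the approval to a human. The type and function names are hypothetical, and severity.md itself is not reproduced here.

```typescript
interface CoverageThresholds {
  statements: number;
  branches: number;
  functions: number;
  lines: number;
}

// Returns the metrics whose threshold is lower in `after` than in `before`.
function loweredThresholds(before: CoverageThresholds, after: CoverageThresholds): string[] {
  const metrics: (keyof CoverageThresholds)[] = ["statements", "branches", "functions", "lines"];
  return metrics.filter((metric) => after[metric] < before[metric]);
}

type Approver = "ai" | "human";

// Any relaxation of the quality bar requires a human approval; everything else can stay with the AI.
function requiredApprover(before: CoverageThresholds, after: CoverageThresholds): Approver {
  return loweredThresholds(before, after).length > 0 ? "human" : "ai";
}
```

Raising a threshold passes through untouched; only a decrease changes who is allowed to approve.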

What this means for non-engineer contributors specifically: they can't land something that violates the contract. The strictness isn't relaxed for them; they just get more iterations, and the iteration cost is paid by the author-side AI, not by the contributor's morale. So "empowering vs terrifying" resolves toward empowering — the contributor never personally wrestles with the lint or test details. The AI does that work.

And the gate doesn't stay static — Part 4 (coming soon) covers how bug-fix PRs are required to add a prevention layer in the same PR, with a strict priority order: code/logic > lint > guideline. Every review-time catch gets promoted toward generation-time enforcement, so the gate strengthens autonomously over time — not just self-healing, but self-strengthening.

Part 5 will dig into where the actual boundary sits in practice (existing-pipeline extension: yes; new architectural patterns: not yet). Thanks for the framing.