DEV Community

Building a Real AI Harness: Auto-Reviewed PRs, Self-Healing Ops, and Non-Engineer Contributors

Ryosuke Tsuji on May 12, 2026

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screensh...

Vadym Arnaut • May 13

On the Product Graph: what's the freshness model? Does it ingest code changes async (via webhooks/CI) or pull on-query? Asking because the auto-review quality probably correlates strongly with how stale the graph is at moment-of-comment; if a PR touches a file the graph hasn't reindexed yet, the impact analysis is partial. Curious how you handle that at 4,000 commits / 5 months scale.

Ryosuke Tsuji • May 13

Great question — this is exactly the right place to push on. The freshness model is async via CI on push-to-main. GitHub Actions triggers a Cloud Run Job that rebuilds the graph with differential embedding (only nodes whose textForEmbedding changed get re-embedded via Vertex AI), so a typical push costs ~$0.001 and finishes in a few minutes.
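
A minimal sketch of how the differential step described above could work, assuming a content hash of each node's textForEmbedding is stored next to the graph. GraphNode, text_hash, and nodes_to_reembed are hypothetical names for illustration, not the actual implementation, and the Vertex AI embedding call is left out entirely.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphNode:
    node_id: str
    text_for_embedding: str  # stands in for the textForEmbedding field


def text_hash(text: str) -> str:
    """Stable content hash used to detect changed embedding text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def nodes_to_reembed(
    nodes: list[GraphNode], stored_hashes: dict[str, str]
) -> list[GraphNode]:
    """Return only the nodes whose embedding text is new or has changed."""
    return [
        node
        for node in nodes
        if stored_hashes.get(node.node_id) != text_hash(node.text_for_embedding)
    ]


# Hypothetical data: "a" is unchanged, "b" was edited, "c" is new.
stored = {"a": text_hash("def a(): ..."), "b": text_hash("def b(): old")}
nodes = [
    GraphNode("a", "def a(): ..."),
    GraphNode("b", "def b(): new"),
    GraphNode("c", "def c(): ..."),
]
changed = nodes_to_reembed(nodes, stored)
```

Under this assumption the embedding cost tracks the number of changed nodes in a push rather than the size of the graph, which is consistent with a small per-push cost.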

You're right that this means PR review can hit a stale-graph window. Two things soften it more than the raw freshness suggests:

1. PR review reads diff + graph, not just graph. The PR diff carries the change itself; the graph provides the surrounding context. So even a function the graph hasn't seen yet has its definition in the diff. The AI reviewer reasons about both.
2. MCP tools have graceful fallback. When the graph misses (e.g., a function freshly added in a sibling PR), the AI falls through to grep_code / read_file against the live git tree via our git-server MCP. So "graph miss → silent failure" doesn't happen; it becomes "graph miss → fall through to raw code search".
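
The fall-through behaviour in the second point can be sketched roughly as follows. This is only an illustration under assumptions: find_definition and the two callables are hypothetical stand-ins, and the real grep_code tool is reached through MCP rather than called as a Python function.

```python
from typing import Callable, Optional


def find_definition(
    symbol: str,
    graph_lookup: Callable[[str], Optional[str]],
    grep_code: Callable[[str], list[str]],
) -> dict:
    """Consult the graph first; on a miss, fall through to raw code search."""
    hit = graph_lookup(symbol)
    if hit is not None:
        return {"source": "graph", "matches": [hit]}
    # Graph miss: search the live tree instead of failing silently.
    return {"source": "grep_code", "matches": grep_code(symbol)}


# Stubs standing in for the indexed graph and the live git tree.
graph = {"old_fn": "src/old.ts:10"}
live_tree = {"new_fn": ["src/new.ts:3"]}


def search_live(symbol: str) -> list[str]:
    return live_tree.get(symbol, [])


graph_hit = find_definition("old_fn", graph.get, search_live)
fallback_hit = find_definition("new_fn", graph.get, search_live)
```

Returning the source alongside the matches keeps a graph miss visible to whoever reads the review trace, instead of hiding which path answered.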

Worst case — a large refactor landing in main while a related PR is open against the old base — the answer is rebase, not graph freshness. For larger scale than ours, the next obvious step would be PR-branch-aware indexing (transient graph for the PR head ref). We haven't needed it yet at 4K commits.

Ryosuke Tsuji • May 13

Good point to add on top of what I said: even if a review runs against a stale-ish graph, the merge itself is gated by (a) conflict resolution and (b) re-review on push. So if main moved enough to matter, the PR can't merge cleanly, and once rebased the re-review runs against the now-updated graph. The "stale graph + merged change" pathological combination is structurally hard to hit.

Vadym Arnaut • May 14

The diff-as-context plus MCP-fallback composition is the right shape. One thing I'd ask at larger scale: when the fallback kicks in (grep_code / read_file against the live tree), what's the fan-out cap per review? Watching agents work in our codebase, the cost of an unbounded grep walk -- function ref → usage → containing test → unrelated helper — sometimes dwarfs the embedding miss it's filling in. Does cortex's reviewer prompt enforce a depth/budget heuristic, or does differential embedding keep fall-through rare enough that it's not worth bounding?

Ryosuke Tsuji • May 14

The honest answer here is closer to your second framing — differential embedding keeps fall-through rare enough that we never bothered with an explicit fan-out cap. Two softer things do the bounding:

1. Per-tool response size limits. grep_code returns capped output, and files in this repo are small (a 500-line max-code-lines lint rule enforces it). A single fallback call can't return enough for the reviewer to lose itself in unrelated context.
2. The reviewer is anchored on a documented guideline, not a free-form prompt. We maintain a review guidelines doc that spells out review criteria and severities (composable architecture, impact analysis, security boundaries, test coverage, doc/spec alignment, etc.). The reviewer prompt requires reading and applying these for every review, so the work is shaped as "verify the PR against these criteria" rather than "walk around the codebase and tell us what you see." That framing pulls the AI's attention back to the diff and the listed criteria, instead of an open investigation.

We also have an MCP-layer mechanism that scopes AI review more deliberately. I'll cover it in the next post in this series, going up around Tuesday next week.

At larger codebase size or higher cross-cutting churn, both bounds would probably fray, and that's where a hard depth budget on trace or a token-based circuit breaker becomes the right call.
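
For readers wondering what such a hard bound might look like, here is a rough sketch that combines a depth budget with a token-based circuit breaker. Nothing like this exists in the setup described above; ReviewBudget, BudgetExceeded, and the limit values are all hypothetical, and the token counts are assumed to be supplied by the caller.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a review's exploration goes past its budget."""


@dataclass
class ReviewBudget:
    max_depth: int
    max_tokens: int
    tokens_used: int = 0

    def charge(self, depth: int, tokens: int) -> None:
        """Record one fallback call, or raise if it would break a limit."""
        if depth > self.max_depth:
            raise BudgetExceeded(f"trace depth {depth} > {self.max_depth}")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exhausted")
        self.tokens_used += tokens


budget = ReviewBudget(max_depth=2, max_tokens=1000)
budget.charge(depth=1, tokens=400)  # function ref
budget.charge(depth=2, tokens=400)  # usage
try:
    budget.charge(depth=3, tokens=100)  # containing test: one hop too deep
    tripped = False
except BudgetExceeded:
    tripped = True
```

A rejected call is not charged, so the reviewer can still spend what is left of the budget on shallower lookups.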

Vadym Arnaut • May 14

The "reviewer anchored on documented guidelines, not free-form exploration" is the part that makes the fan-out question almost moot: bounded scope shrinks the surface area before any token math kicks in.

We hit a smaller version of this with cursor rule files in equip. When the rules are specific (one file per area, banned patterns, naming conventions), the agent stays inside lanes even on large diffs. When they drift toward generic "follow best practices", token usage balloons because the model fills the gap by reading everything.

Token-based circuit breaker as v2 makes sense, but the guideline anchor is the real circuit breaker. Same playbook as the invariants-in-PR pattern you described upthread.

Mykola Kondratiuk • May 22

accountability gap on non-engineer push is the part that gets messy fast - if a sandbox app fails in prod, who owns the fix? not obvious when the contributor can't debug the stacktrace.

Ryosuke Tsuji • May 22 • Edited

Great question.
The ownership question lands on me, the CTO. The whole point of opening this work up to non-engineers is that the surrounding environment is supposed to hold even when the contributor can't debug a stacktrace — designing that environment is on me, and so is everything that breaks because the environment didn't hold yet. If a problem surfaces, the fix and the work to prevent it from happening again are the same delivery.

In practice it hasn't gotten messy yet, partly because self-healing catches most issues before anyone has to debug anything. I'll cover that in detail in Part 4 of the series.

The bigger qualifier: this is an internal platform, not customer-facing product code. We're starting to extend similar guardrails to our user-facing product, but I'm not opening non-engineer pushes to that side anytime soon — for that surface, human verification is still a hard requirement.

Mykola Kondratiuk • May 24

most teams try to split it - builder vs. environment designer. you end up with two people to blame and neither one fixing it. what you're describing sidesteps that cleanly.

HARD IN SOFT OUT • May 13

Big concern from experience: self‑healing can create a “heal and hide” pattern where the same issue recurs and is silently fixed without anyone noticing the root cause is still rotting. Do you pair auto‑healing with mandatory root‑cause logging, or does the system just patch and move on? This is where AI goes from toy to force multiplier. Auto‑reviewed PRs and self‑healing ops aren’t sci‑fi anymore; they’re team‑extenders. I like that you even thought about non‑engineer contributors — that’s where real org‑wide impact lies.

What if the self‑healing steps were always proposed as a “suggested action” with a 5‑minute delay, and only auto‑execute if no human overrides? That gives you a semi‑autonomous system that teaches rather than hides.

Ryosuke Tsuji • May 13

Heal-and-hide is the right anti-pattern to name. The way we avoid it: every auto-fix PR carries 3 artifacts, not just the code change.

1. The fix itself
2. A pattern doc entry (problem, solution, code example, checklist)
3. A 1-line invariant added to the agent's persistent rules file, so the next agent run already has the pattern in context

Concrete example from a recent fix: a Cloud Run deploy failed because secretKeyRef version:latest was pointing at a Secret with no versions. The auto-fix PR added the placeholder SecretVersion to unblock the deploy, and the same PR added a 40-line entry to our cloud-run-deploy guideline (problem, solution, code, checklist) plus a 1-line invariant to the agent's rules file. The next agent that touches a Cloud Run service with secretKeyRef sees the pattern before it codes.

Patch-and-move-on is a real risk, agreed. Our answer is making the docs and invariant update part of the PR scope itself. If the agent doesn't generate the learning artifact, the PR isn't ready for review.
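
One way to picture the "not ready for review" rule is a check over the PR's changed files. This is a sketch under assumptions, not the actual gate: missing_artifacts, the docs/guidelines/ prefix, and the AGENT_RULES.md filename are hypothetical placeholders for wherever the pattern docs and the rules file really live.

```python
def missing_artifacts(
    changed_files: list[str],
    docs_prefix: str = "docs/guidelines/",  # hypothetical location
    rules_file: str = "AGENT_RULES.md",  # hypothetical filename
) -> list[str]:
    """List which of the three artifacts an auto-fix PR still lacks."""
    has_doc = any(f.startswith(docs_prefix) for f in changed_files)
    has_rule = rules_file in changed_files
    has_fix = any(
        not f.startswith(docs_prefix) and f != rules_file for f in changed_files
    )
    missing = []
    if not has_fix:
        missing.append("fix")
    if not has_doc:
        missing.append("pattern doc entry")
    if not has_rule:
        missing.append("rules-file invariant")
    return missing


complete = missing_artifacts(
    ["infra/service.yaml", "docs/guidelines/cloud-run-deploy.md", "AGENT_RULES.md"]
)
fix_only = missing_artifacts(["infra/service.yaml"])
```

A file-level check like this only confirms that the artifacts were touched; whether the doc entry and invariant are actually good is still a question for the review itself.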

On your 5-min delay suggestion: we use the code review window (auto-review agent + human approval) as the override surface instead of a time window. Different trade on the gate, same goal.

HARD IN SOFT OUT • May 13

Solid pattern. Baking the learning artifact into the same PR scope closes the loop cleanly — fix, doc, invariant, all in one atomic commit. That's not self-healing anymore, that's self-documenting TDD for ops.

One edge I'd watch: invariant collision. As the rules file grows, two invariants can silently contradict under a narrow condition. Do you have a reconciliation mechanism, or does the review surface catch that before it commits? Curious if you've considered a periodic "rule set lint" pass — treat invariants like a test suite that can be checked for internal consistency.

Ryosuke Tsuji • May 13

Nice catch. That's a gap we hadn't covered.

Implementing it is cheap in our setup: every doc (including the review guidelines) is ingested into the knowledge graph and the agent reaches it via MCP at any time.
Just opened a PR to add rule set lint to the review guideline. Thanks!

Harjot Singh • May 30

The "non-engineer contributors" piece is the most interesting and most dangerous part of this. A real harness (auto-reviewed PRs + self-healing ops) is exactly what makes it safe to let non-engineers contribute - the guardrails do the gatekeeping a senior used to do manually. Without that harness, non-engineer contributions are how you get the 3am incident.

So the harness isn't just productivity, it's the trust boundary. Auto-review catches the "looks right, is wrong" PRs; self-healing absorbs the operational mistakes; and only then can you safely widen the contributor pool. Looking forward to the series - the part I'd most want detailed is how strict the auto-review gate is, because that single threshold decides whether "non-engineer contributors" is empowering or terrifying. Strong intro, subscribed.

Ryosuke Tsuji • May 31

Thanks — and "harness as trust boundary" is a sharper framing than I had in my head. The 3am incident line in particular: yes, exactly the scenario the harness has to absorb.

On the strictness question: Part 3 (just out) goes deep on this. Short version: PRs go through an average of 10.8 review-fix iterations before merge (max 56). Nine dimensions are reviewed sequentially under a strict no-downgrade rule, and the usual excuses ("existing code has the same issue", "will fix later", "leave a TODO") are explicitly closed off. At the meta-layer, quality-bar relaxation itself (lowering a lint rule, coverage threshold, or guideline binding) is classified Critical and the AI is forbidden from approving it; a human reviewer's approval is required (see severity.md).
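
The meta-layer rule can be sketched as a small gate. This is an illustration only: the labels in QUALITY_BAR_RELAXATIONS and the function names are hypothetical, and detecting that a change relaxes the bar is assumed to happen upstream of this check.

```python
from typing import Optional

# Hypothetical labels for the three relaxations named above.
QUALITY_BAR_RELAXATIONS = {
    "lint_rule_lowered",
    "coverage_threshold_lowered",
    "guideline_binding_loosened",
}


def severity_floor(kind: str) -> Optional[str]:
    """Relaxations are always Critical; other changes get no forced floor."""
    return "Critical" if kind in QUALITY_BAR_RELAXATIONS else None


def ai_may_approve(change_kinds: list[str]) -> bool:
    """The AI reviewer must not approve a PR that relaxes the quality bar."""
    return all(severity_floor(kind) != "Critical" for kind in change_kinds)


ordinary_pr = ai_may_approve(["feature_added", "test_added"])
relaxing_pr = ai_may_approve(["feature_added", "coverage_threshold_lowered"])
```

Keeping the relaxation check separate from the ordinary review findings means the AI cannot talk itself out of it by grading the rest of the PR favourably.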

What this means for non-engineer contributors specifically: they can't land something that violates the contract. The strictness isn't relaxed for them; they just get more iterations, and the iteration cost is paid by the author-side AI, not by the contributor's morale. So "empowering vs terrifying" resolves toward empowering — the contributor never personally wrestles with the lint or test details. The AI does that work.

And the gate doesn't stay static — Part 4 (coming soon) covers how bug-fix PRs are required to add a prevention layer in the same PR, with a strict priority order: code/logic > lint > guideline. Every review-time catch gets promoted toward generation-time enforcement, so the gate strengthens autonomously over time — not just self-healing, but self-strengthening.
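
As a rough reading of that priority order, choosing the prevention layer could look like the sketch below. It assumes "strict priority" means picking the strongest layer that is feasible for a given bug; pick_prevention_layer and the idea of a feasible set are hypothetical, and Part 4 may define this differently.

```python
# Strongest first, per the stated order: code/logic > lint > guideline.
PREVENTION_PRIORITY = ["code/logic", "lint", "guideline"]


def pick_prevention_layer(feasible: set[str]) -> str:
    """Return the highest-priority prevention layer that is feasible."""
    for layer in PREVENTION_PRIORITY:
        if layer in feasible:
            return layer
    raise ValueError("a bug-fix PR must add at least one prevention layer")


strongest = pick_prevention_layer({"guideline", "lint", "code/logic"})
no_code_fix = pick_prevention_layer({"guideline", "lint"})
```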

Part 5 will dig into where the actual boundary sits in practice (existing-pipeline extension: yes; new architectural patterns: not yet). Thanks for the framing.