<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ryosuke Tsuji</title>
    <description>The latest articles on DEV Community by Ryosuke Tsuji (@ryantsuji).</description>
    <link>https://dev.to/ryantsuji</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843591%2F8b126f91-f561-4e6b-8492-814b18d680ec.jpg</url>
      <title>DEV Community: Ryosuke Tsuji</title>
      <link>https://dev.to/ryantsuji</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryantsuji"/>
    <language>en</language>
    <item>
      <title>AI Isn't Something to Trust — It's Something to Design (Series Final)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Tue, 16 Jun 2026 00:02:03 +0000</pubDate>
      <link>https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa</link>
      <guid>https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Across the five posts of this series I've worked through how cortex's harness is put together, one piece at a time: the overall picture, the knowledge graph, Auto Review, Self-Healing + Recurrence Prevention, and non-engineer PRs. Having walked through all of them, I want to step one level down for the wrap-up. &lt;strong&gt;Why am I building this thing in the first place?&lt;/strong&gt; That's what this post is about.&lt;/p&gt;

&lt;p&gt;The five posts might look independent, but the root is one thing, and the series doesn't close cleanly without that one thing being put into words. Together with the philosophy, I want to look back at the failures that don't show up when you only write about what worked — what I threw away, where I tripped — as a reference point for anyone trying something similar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Index
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex's harness&lt;/td&gt;
&lt;td&gt;PRs auto-merge / incidents self-heal before you notice&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;ai-harness-intro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code, docs, DB, infra unified into one graph&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;AI PR review&lt;/td&gt;
&lt;td&gt;webhook → AI review → auto-fix → squash merge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert → AI investigates → fix PR + new lint/type gate → auto redeploy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series wrap-up&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;This post ← you are here&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Origin — What I Was Thinking About in 2025
&lt;/h2&gt;

&lt;p&gt;When I started building cortex, there was one question I wanted to answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do I get AI to understand the system accurately?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If AI could understand the system accurately, then PR review, bug investigation, and fixes could all be delegated, and even non-engineers could do their own development. Conversely, as long as "understand it accurately" remained unsolved, everything downstream was sitting on unstable ground. So I spent a lot of time on &lt;strong&gt;the prerequisite layer&lt;/strong&gt; before any of the individual mechanisms.&lt;/p&gt;

&lt;p&gt;The two obvious approaches both hit walls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 1: The Context Window Limit
&lt;/h3&gt;

&lt;p&gt;The first reflex is "just give it all the information it might need." Stuff the codebase, docs, DB schema, infra definitions all into the prompt, and AI gets the whole picture.&lt;/p&gt;

&lt;p&gt;That fails on size. Codebase + docs + schemas + infra at our company doesn't come close to fitting into any realistic context window.&lt;/p&gt;

&lt;p&gt;"Surely context windows will keep growing, and this'll work eventually?" — the more I thought about it, the less of a future I saw in that direction.&lt;/p&gt;

&lt;p&gt;Even a model with a very large context window, like Gemini, gets unstable when you push it close to the limit. Information in the middle gets dropped, and irrelevant tokens pull the conclusion off course. This isn't a model-selection problem; it's a structural attention problem. The more unrelated tokens you mix in, the smaller the share of attention that lands on the relevant ones. This is the documented &lt;strong&gt;"lost in the middle"&lt;/strong&gt; phenomenon (information placed at the start and end of long inputs gets used; &lt;strong&gt;information placed in the middle is used far less reliably&lt;/strong&gt;). &lt;strong&gt;Stuff the context window full and you routinely end up in a state where the information you thought you handed over isn't actually visible to the model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;"Lost in the middle" itself may get mitigated as long-context models improve, so I treat it as empirical supporting evidence rather than the core argument. The real wall is the &lt;strong&gt;recursive&lt;/strong&gt; one beneath it: even if "size" is solved, you immediately need &lt;strong&gt;a higher-level context to judge which tokens are necessary and which aren't&lt;/strong&gt;. That problem is recursive and &lt;strong&gt;can't be resolved by context window size, in principle&lt;/strong&gt;. Information has to be structured, or AI doesn't make correct judgments. That's true of humans too — but humans are a notch better off, because &lt;strong&gt;LLMs don't notice they don't know, and they answer with confidence anyway&lt;/strong&gt;. Silently wrong is worse than visibly stuck.&lt;/p&gt;

&lt;p&gt;The keep-growing-context-windows path didn't have a real resolution in sight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall 2: Don't Lean on Learning Either
&lt;/h3&gt;

&lt;p&gt;The other obvious move is to make AI itself learn. Fine-tune per organization, teach it our codebase, our docs, our business. I considered it. Currently not doing it.&lt;/p&gt;

&lt;p&gt;Two reasons. One: putting per-organization learning into actual production was still research-phase in 2025, and it still is in 2026 as I write this; the road to real deployment is long. The other is thornier: &lt;strong&gt;even if you could learn it, "forgetting" is extremely hard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A business system has to reflect "the current truth." When the design changes, the DB schema changes, the business rules change, &lt;strong&gt;you want to actively erase old knowledge&lt;/strong&gt;. But "delete just this piece of what's baked into the LLM weights" is unsolved at the research level — there's even a field name for it, &lt;strong&gt;machine unlearning&lt;/strong&gt;, which tells you how hard it is. And on top of that, teaching the model new things also &lt;strong&gt;destroys unrelated existing knowledge&lt;/strong&gt; (called &lt;strong&gt;destructive interference&lt;/strong&gt; / catastrophic forgetting). Lean on learning and both hit at once: the cost of keeping things consistent explodes.&lt;/p&gt;

&lt;p&gt;Rather than treating "doesn't learn" as a downside, I came around to: &lt;strong&gt;because it doesn't learn, swapping out the external knowledge is enough to reflect the current state, and the consistency story is much simpler&lt;/strong&gt;. That was the call at the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Way Out — GraphRAG + MCP
&lt;/h3&gt;

&lt;p&gt;With no future in the context-window direction or the learning direction, I came across the &lt;strong&gt;GraphRAG&lt;/strong&gt; concept.&lt;/p&gt;

&lt;p&gt;GraphRAG itself is widely discussed elsewhere; what mattered to me was the framing: "&lt;strong&gt;supply only the context that's needed, at the moment it's needed&lt;/strong&gt;." Combined with &lt;strong&gt;MCP&lt;/strong&gt; (the Model Context Protocol, an open protocol introduced by Anthropic for connecting LLMs to external tools and data), AI can go fetch what it needs on its own.&lt;/p&gt;

&lt;p&gt;What was decisive was that this structure lets AI traverse the graph agentically. Rather than "read everything and find related parts by inference," AI &lt;strong&gt;gets to the node it needs and pulls the fact out&lt;/strong&gt;. Which leads to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Instead of making AI infer, supply facts as context.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one sentence became the core of cortex's entire design philosophy.&lt;/p&gt;

&lt;p&gt;The first thing I built was a static-analysis-based &lt;strong&gt;code-graph&lt;/strong&gt;, which I then threw away after trial and error, and arrived at the annotation-based &lt;strong&gt;product-graph (cpg)&lt;/strong&gt; — details in the trial-and-error section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F416pzfhy6z28i4wa2g72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F416pzfhy6z28i4wa2g72.png" alt="2025 origin. Neither growing the context window nor relying on learning had a future; GraphRAG + MCP became the way through." width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I Don't Trust AI to Begin With
&lt;/h2&gt;

&lt;p&gt;The origin section in one line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I don't trust AI to fill in the blanks for me.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Don't trust" here is not the same as "have no faith in." This isn't doubting Claude / GPT / Gemini's generation quality. What I mean is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;It doesn't know context it wasn't handed.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It doesn't, on its own and without being told, produce the ideal state.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first one is a truth no amount of model progress will change. Architecturally, LLMs can't know things that weren't in the training data and aren't in this session's context. "Surely smarter models will pick up on it" — I don't think that future is coming. Smarter is a real direction; smart alone doesn't compensate for not knowing.&lt;/p&gt;

&lt;p&gt;The second one is about responsibility, and humans owning it. AI can't decide on its own what "ideal" means. When it tries, it lands on a generic best-practice answer slightly off from the actual situation. Ideal depends on the business, the organization, the moment in time — none of which is visible to AI unless a human verbalizes it and hands it over.&lt;/p&gt;

&lt;p&gt;So that conviction is &lt;strong&gt;not underestimating AI's capability; it's a design decision to not let AI auto-complete the prerequisites&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mastering AI is not about giving it freedom — it's about confining its output to a predictable range.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mechanism for confining it is the harness this series has been describing.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I Build Harnesses to Hold AI to Determinism
&lt;/h2&gt;

&lt;p&gt;Reading each post through the lens of "don't make AI infer; lean on determinism" shows that the five of them are all the same conviction showing up in different layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 2 — Knowledge Graph&lt;/strong&gt;: Instead of making AI search the codebase, this mechanism tilts toward making the codebase legible. With &lt;code&gt;@graph-*&lt;/code&gt; annotations, code / docs / DB / infra are unified into one graph, so AI doesn't have to grep + infer to find related parts. This is the direct implementation of "supply facts as context" from the origin section. → &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3 — Auto Review Dimensions&lt;/strong&gt;: Nine review dimensions (responsibility / severity / type SSoT / etc.) are fixed in advance. When AI does the review, what to check isn't something it gets to infer. "Looking at the PR as a whole" gives AI too much room for inference, so dimensions are split and &lt;strong&gt;each is judged as its own question&lt;/strong&gt;. &lt;strong&gt;Dimensions = locked by the harness, evaluation = AI's job.&lt;/strong&gt; → &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 4 — Self-Healing + Recurrence Prevention&lt;/strong&gt;: Alert → investigation → fix PR → redeploy. The flow itself is fixed. AI doesn't get to think through "how should we respond to incidents" each time. And Recurrence Prevention — adding lint / CI gates so the same trap can't be stepped on twice — is &lt;strong&gt;mechanical refusal at the gate, not trust-AI-not-to-do-it-again&lt;/strong&gt;. Or put differently: I don't expect AI never to repeat a mistake. → &lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 5 — Non-Engineer PRs&lt;/strong&gt;: If the harness weren't holding quality, business-side folks opening PRs directly to production wouldn't survive a single day. Conversely, with the three mechanisms above stacked up (context locked, dimensions locked, traps locked out mechanically), the person closest to the requirements can ship the change directly. The translation layer and the engineering priority queue disappeared as a downstream consequence of the determinism push. → &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what's covered across the five posts is "don't make AI infer; lean on determinism" implemented at different layers. The root is one conviction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Don't Make AI Infer, Lean on Determinism" Actually Means
&lt;/h2&gt;

&lt;p&gt;Let me sharpen this phrase that's come up a few times.&lt;/p&gt;

&lt;p&gt;"Lean on determinism" does &lt;strong&gt;not&lt;/strong&gt; mean "give AI zero room to infer." Code generation, judging review findings, hypothesizing root causes from error logs — these are domains where AI not inferring is the same as no work getting done.&lt;/p&gt;

&lt;p&gt;Where I want to lean on determinism is in domains where &lt;strong&gt;variance isn't allowed&lt;/strong&gt;. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Which part of the codebase to look at&lt;/strong&gt; — don't have AI guess by analogy; pull it deterministically from the knowledge graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which review dimensions to apply&lt;/strong&gt; — don't let AI pick "the important-looking dimensions"; lock the dimension list in advance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to respond to incidents&lt;/strong&gt; — don't make AI think through the workflow each time; fix the alert → fix PR path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not stepping on the same trap twice&lt;/strong&gt; — don't ask AI to "try to be careful"; let lint / CI mechanically refuse it&lt;/li&gt;
&lt;/ul&gt;
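
&lt;p&gt;To make the third bullet concrete, here is a minimal sketch of a fixed incident path. It is not cortex's actual code: &lt;code&gt;INCIDENT_PATH&lt;/code&gt;, &lt;code&gt;StageWorker&lt;/code&gt;, and &lt;code&gt;runIncident&lt;/code&gt; are hypothetical names, and the stage list simply mirrors the alert → investigation → fix PR → redeploy flow described in Part 4. The point it tries to show is that the order of stages lives in the harness, while the work inside a stage is where AI gets to infer.&lt;/p&gt;

```typescript
// Hypothetical sketch: stage names follow the article's flow; everything else is illustrative.

// The path is a fixed list owned by the harness, not something the model decides per incident.
const INCIDENT_PATH = ["alert", "investigation", "fix-pr", "redeploy"] as const;

type Stage = (typeof INCIDENT_PATH)[number];

// Stand-in for the model's work inside one stage (the inference-allowed zone).
type StageWorker = (stage: Stage) => string;

// The harness walks the stages in order; the worker cannot skip or reorder them.
function runIncident(work: StageWorker): string[] {
  return INCIDENT_PATH.map((stage) => stage + ": " + work(stage));
}

// Toy worker so the sketch runs without a model.
const incidentLog = runIncident((stage) => "handled " + stage);
console.log(incidentLog);
```

&lt;p&gt;The worker can return anything it likes for a stage, but it has no way to change which stage comes next. That is the "rails" idea in its smallest form.&lt;/p&gt;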

&lt;p&gt;What implements this line — where inference is allowed vs. where it isn't — is the harness. To borrow the metaphor from Part 5, the harness lays down &lt;strong&gt;rails you can't fall off&lt;/strong&gt;. On top of the rails, AI runs free (inference works as inference); but it can't fall off the rails sideways.&lt;/p&gt;

&lt;p&gt;Put differently, this is equivalent to the framing in &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2 (cortex-product-graph)&lt;/a&gt;: "where hallucination gets confined." Saying "no inference allowed" isn't quite right — the harness isn't a thing that makes hallucination go to zero. It's a thing that confines hallucination to places where hallucination is OK (i.e., the inference-allowed zone). The structure and facts about the codebase are pulled deterministically, so &lt;strong&gt;the retrieval process itself has no opening for hallucination&lt;/strong&gt;; hallucinations on the judgment side get filtered downstream by tests / lint / dimension-by-dimension reviews. The places where hallucination is allowed and the places where it isn't are &lt;strong&gt;physically split by the harness&lt;/strong&gt;. That's the continuation of the Part 2 framing.&lt;/p&gt;

&lt;p&gt;Step back one more level and what the harness is really doing is &lt;strong&gt;shifting when inference happens&lt;/strong&gt;. The annotations and descriptions on the graph were also written by AI originally — there is inference baked into them. But that inference is &lt;strong&gt;write-time&lt;/strong&gt; — happens once, reviewed, then frozen — not &lt;strong&gt;read-time&lt;/strong&gt; (happens every query, &lt;strong&gt;unverified at the point of use&lt;/strong&gt;). The graph is &lt;strong&gt;frozen, reviewed inference&lt;/strong&gt;, which is exactly why the read side can treat it as fact. "Leaning on determinism" can be rephrased as &lt;strong&gt;not letting unverified inference run on every query&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiffzn4u10c1qk6t6dgze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiffzn4u10c1qk6t6dgze.png" alt="Inference-allowed zone (top, green) and inference-forbidden zone (bottom, orange). The harness implements this boundary — i.e., decides where hallucination gets confined." width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is also the underlying basis for &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1 (Series Intro)&lt;/a&gt;'s claim "models commoditize; harnesses differentiate." Model-side quality is converging across Claude / GPT / Gemini, but the harness is &lt;strong&gt;codebase-specific and business-specific&lt;/strong&gt;, so this is where org-level differentiation actually comes from.&lt;/p&gt;

&lt;p&gt;Worth flagging: &lt;strong&gt;the position of this boundary moves with model capability&lt;/strong&gt;. As agentic search and reasoning get stronger, today's "must be deterministic" zone might be tomorrow's "inference is good enough" zone — and in fact cortex itself depends on AI's ability to traverse the graph agentically. But &lt;strong&gt;the boundary itself never disappears&lt;/strong&gt;. How information is structured, where the line gets drawn between fact and inference — &lt;strong&gt;whether you hold that line explicitly as a design decision&lt;/strong&gt; is what differentiates organizations, across every model generation.&lt;/p&gt;

&lt;p&gt;A note: this framing isn't confined to cortex's harness. The same stance shapes &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;db-graph MCP&lt;/a&gt;, the natural-language interface over internal DB schemas, and &lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;Sandbox MCP&lt;/a&gt;, which lets non-engineers safely publish AI-built apps. &lt;strong&gt;It's the through-line in any platform we build that's based on AI doing meaningful work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One level more abstract: &lt;strong&gt;the individual features aren't where the value is.&lt;/strong&gt; The value sits in the conviction itself. cortex / db-graph / Sandbox MCP are all that one conviction translated into our own use cases.&lt;/p&gt;

&lt;p&gt;The way I think about "design":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design is translating an abstract principle into a concrete implementation that fits your own use cases.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's not drawing class diagrams, and it's not laying out architecture diagrams — it's the translation work of &lt;strong&gt;"how does this principle take shape under our business / codebase / constraints?"&lt;/strong&gt; That's where each organization's distinctiveness lives, and that's the value that can't be copied.&lt;/p&gt;

&lt;p&gt;Said the other way: another organization copying cortex's surface doesn't reproduce the substance. What gets asked of every org is &lt;strong&gt;how it translates this principle into its own use cases&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trial and Error That Got Me Here
&lt;/h2&gt;

&lt;p&gt;Everything I've written about above is the form that ended up working. Getting to that form involved &lt;strong&gt;a lot of throwing away&lt;/strong&gt;. Two representative examples worth keeping on record, plus one shorter one.&lt;/p&gt;

&lt;h3&gt;
  
  
  I Spent Two Months on Static-Analysis code-graph, Then Threw It Out
&lt;/h3&gt;

&lt;p&gt;The first thing I built was static-analysis-based &lt;strong&gt;code-graph&lt;/strong&gt;: extracting AST data — imports, call graphs, type dependencies — and putting that into a graph DB. At a glance, the obvious implementation of "make AI understand the codebase."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why two months?&lt;/strong&gt; code-graph wasn't just cortex; it spanned our consumer-facing services and internal-system repositories too — &lt;strong&gt;over 40 repos in total&lt;/strong&gt; (cortex being one of them). The mechanically-extractable AST data (imports / call graphs / type dependencies) was usable as-is via tree-sitter, but each repo had its own API endpoints / DB schema / event definitions / Pub/Sub topology, and &lt;strong&gt;extracting those boundary nodes (where an app meets the outside) goes beyond mechanical AST analysis and had to be implemented per-repo-type&lt;/strong&gt; — that's where the time went.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I spent two months out of the first three on this, and got something that worked end-to-end.&lt;/p&gt;

&lt;p&gt;And then I threw it away.&lt;/p&gt;

&lt;p&gt;Why: static analysis is great at capturing &lt;strong&gt;structure&lt;/strong&gt;, but it can't traverse on &lt;strong&gt;intent or business context&lt;/strong&gt;. Concretely, three things broke:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No semantic entry point for search&lt;/strong&gt; — if I want to query the codebase with "show me the function calculating member subscription billing," I can't get there unless I already know the function name or file. A graph built only from static analysis has no semantic-tag entry pointing to "what is this code for?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The graph contains only code&lt;/strong&gt; — internal helpers / utilities / types / arguments all become nodes, so traversal from any function &lt;strong&gt;blows up within a few hops&lt;/strong&gt;, dragging in helpers and primitives. There's no axis to filter on semantic relatedness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What I actually wanted was code + DB schema + docs + infra on one graph&lt;/strong&gt; — given a function, I want to pull, in one query, the DB tables it touches, the docs where the design lives, and the linked business requirement. A code-only graph just can't do that&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ I switched to the annotation-based approach (&lt;code&gt;@graph-*&lt;/code&gt; JSDoc tags write the business intent into the code, and that gets unified with DB schema / docs / infra into one graph). Searchable semantically, and when you traverse, only related stuff comes back. That's the current &lt;strong&gt;product-graph (cpg)&lt;/strong&gt;. &lt;strong&gt;Don't drag sunk cost forward and you'll get to the final form&lt;/strong&gt; — discarding two months of investment instead of trying to recoup it was the foundation for everything that came after.&lt;/p&gt;
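
&lt;p&gt;A rough sketch of what the annotation-based shape buys you, under heavy assumptions: the node kinds, edge names, tags, and ids below are all invented for illustration and are not the real &lt;code&gt;@graph-*&lt;/code&gt; schema. It only tries to show the two properties named above, a semantic entry point and a traversal that returns related nodes across code, DB, and docs.&lt;/p&gt;

```typescript
// Hypothetical sketch: node kinds, edge names, tags, and ids are illustrative assumptions.
type NodeKind = "function" | "table" | "doc" | "requirement";
type EdgeKind = "reads" | "documented-in" | "implements";

type GraphNode = { id: string; kind: NodeKind; tags: string[] };
type GraphEdge = { from: string; to: string; kind: EdgeKind };

const nodes: GraphNode[] = [
  { id: "calcSubscriptionBilling", kind: "function", tags: ["billing", "subscription"] },
  { id: "formatYen", kind: "function", tags: [] }, // unannotated helper
  { id: "member_subscriptions", kind: "table", tags: ["subscription"] },
  { id: "docs/billing.md", kind: "doc", tags: ["billing"] },
  { id: "REQ-billing-cycle", kind: "requirement", tags: ["billing"] },
];

const edges: GraphEdge[] = [
  { from: "calcSubscriptionBilling", to: "member_subscriptions", kind: "reads" },
  { from: "calcSubscriptionBilling", to: "docs/billing.md", kind: "documented-in" },
  { from: "calcSubscriptionBilling", to: "REQ-billing-cycle", kind: "implements" },
];

// Semantic entry point: find nodes by tag rather than by function name or file.
function findByTag(tag: string, kind: NodeKind): GraphNode[] {
  return nodes.filter((n) => n.kind === kind).filter((n) => n.tags.includes(tag));
}

// One hop over annotated edges only; nodes without edges never enter the result.
function related(id: string): GraphEdge[] {
  return edges.filter((e) => e.from === id);
}

const entry = findByTag("billing", "function");
const first = entry[0];
const neighbours = first ? related(first.id) : [];
console.log(entry.map((n) => n.id), neighbours.map((e) => e.kind + " " + e.to));
```

&lt;p&gt;In this toy data the query starts from the tag "billing" instead of a known function name, and the single hop returns a table, a doc, and a requirement without pulling in the helper.&lt;/p&gt;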

&lt;h3&gt;
  
  
  Treating 90% Coverage as a Standalone Target Broke the Implementations
&lt;/h3&gt;

&lt;p&gt;Test coverage is still gated at 90%+ (as covered in &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt;). That part hasn't changed. But there was a period when &lt;strong&gt;Coverage was treated as a standalone target&lt;/strong&gt;, and during that period the implementation visibly got worse.&lt;/p&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy default-value use that hides branches&lt;/strong&gt;: &lt;code&gt;function(input = {})&lt;/code&gt; style writes the missing-input branch out of the test path. Coverage goes up, protection against unexpected input is gone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catch-and-swallow over throw&lt;/strong&gt;: try / catch returning &lt;code&gt;null&lt;/code&gt;. Nothing throws, so there is no throw path left to test, and Coverage is satisfied. Invalid state silently propagates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early returns that flatten too much&lt;/strong&gt;: dump complex conditions through an "early return" escape. Tests pass; what should have been validation just isn't there anymore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;Coverage 90%, quality lower than before&lt;/strong&gt;. When you look at Coverage alone, the shortest path to "satisfy it" is &lt;strong&gt;a weaker implementation that passes the tests&lt;/strong&gt;.&lt;/p&gt;
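
&lt;p&gt;As an illustration of the first two patterns, here is a small made-up example. &lt;code&gt;Order&lt;/code&gt;, &lt;code&gt;totalWeak&lt;/code&gt;, and &lt;code&gt;totalStrict&lt;/code&gt; are hypothetical and not taken from cortex; the sketch only contrasts a version that is easy to cover with one that keeps the failure branches explicit.&lt;/p&gt;

```typescript
// Hypothetical example: the type and function names are illustrative.
type Order = { items: number[] };

// Weakened version: the default value and the swallowed exception remove
// the branches a test would otherwise have to cover.
function totalWeak(order: Order = { items: [] }): number | null {
  try {
    return order.items.reduce((sum, price) => sum + price, 0);
  } catch {
    return null; // any failure is hidden from the caller
  }
}

// Strict version: missing or invalid input is an explicit, testable branch.
function totalStrict(order: Order | undefined): number {
  if (order === undefined) {
    throw new Error("order is required");
  }
  if (order.items.some((price) => Number.isNaN(price))) {
    throw new Error("order contains an invalid price");
  }
  return order.items.reduce((sum, price) => sum + price, 0);
}

console.log(totalWeak()); // a missing order quietly becomes 0
```

&lt;p&gt;A single happy-path test covers most of &lt;code&gt;totalWeak&lt;/code&gt;, while &lt;code&gt;totalStrict&lt;/code&gt; needs a test per failure branch to reach the same number. That asymmetry is the pull toward the weaker implementation.&lt;/p&gt;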

&lt;p&gt;Two lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set a number as a target, and the number becomes the goal&lt;/strong&gt;. Coverage is a "minimum floor" — not "a goal to hit"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't evaluate any metric alone&lt;/strong&gt;. Coverage has to be evaluated alongside responsibility separation / exception design / boundary value coverage / etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, as the follow-up: I added linting that &lt;strong&gt;mechanically closes off the routes that let you weaken implementations to satisfy Coverage&lt;/strong&gt;. Two specific examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;no-silent-catch&lt;/code&gt;&lt;/strong&gt;: AST-level ban on empty catch and silent-handler patterns like &lt;code&gt;.catch(() =&amp;gt; null)&lt;/code&gt;. Catch bodies have to have a &lt;strong&gt;function call (logger included) / re-throw / new / await&lt;/strong&gt; — otherwise it's an error. Catches the "weaken throws to satisfy Coverage but lose observability in production" pattern structurally. The violation message routes you to &lt;code&gt;@cortex/otel/logger&lt;/code&gt; for structured logging, so the chain through Cloud Run OTel → Loki / Grafana stays intact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;vitest-strong-matchers&lt;/code&gt;&lt;/strong&gt;: bans weak matchers like &lt;code&gt;toBeTruthy&lt;/code&gt; / &lt;code&gt;toBeDefined&lt;/code&gt; / &lt;code&gt;toContain&lt;/code&gt; / &lt;code&gt;toBe(true|false)&lt;/code&gt; / &lt;code&gt;expect.any&lt;/code&gt; / &lt;code&gt;expect.objectContaining&lt;/code&gt;. Catches "any assertion that passes" patterns at the AST level, and points you instead toward &lt;code&gt;toStrictEqual&lt;/code&gt; / &lt;code&gt;toMatchInlineSnapshot&lt;/code&gt; that pin down the full output. This is one notch above Coverage — a &lt;strong&gt;test quality&lt;/strong&gt; concern — but it lines up because the same reflection applies: &lt;strong&gt;don't let a number become the goal&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
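
&lt;p&gt;To give a feel for the kind of check &lt;code&gt;no-silent-catch&lt;/code&gt; performs, here is a simplified sketch over hand-rolled node shapes. This is an assumption-laden toy, not the real rule: the &lt;code&gt;Expr&lt;/code&gt; and &lt;code&gt;Stmt&lt;/code&gt; types are invented stand-ins for a parser's AST, &lt;code&gt;logger.error&lt;/code&gt; is a hypothetical callee name, and a real rule would plug into the linter's own visitor API.&lt;/p&gt;

```typescript
// Hypothetical, simplified node shapes; a real rule would walk the parser's actual AST.
type Expr =
  | { type: "Literal"; value: null | string | number }
  | { type: "CallExpression"; callee: string }
  | { type: "NewExpression"; callee: string }
  | { type: "AwaitExpression"; argument: Expr };

type Stmt =
  | { type: "ReturnStatement"; argument: Expr }
  | { type: "ExpressionStatement"; expression: Expr }
  | { type: "ThrowStatement"; argument: Expr };

// In this toy model, anything that is not a bare literal is a call, a construction, or an await.
function isObservable(expr: Expr): boolean {
  return expr.type !== "Literal";
}

// A catch body is silent unless some statement re-throws or contains an observable expression.
function isSilentCatch(body: Stmt[]): boolean {
  const handled = body.some((stmt) => {
    if (stmt.type === "ThrowStatement") return true;
    if (stmt.type === "ReturnStatement") return isObservable(stmt.argument);
    return isObservable(stmt.expression);
  });
  return !handled;
}

const swallow: Stmt[] = [{ type: "ReturnStatement", argument: { type: "Literal", value: null } }];
const logged: Stmt[] = [
  { type: "ExpressionStatement", expression: { type: "CallExpression", callee: "logger.error" } },
  { type: "ReturnStatement", argument: { type: "Literal", value: null } },
];

console.log(isSilentCatch([]), isSilentCatch(swallow), isSilentCatch(logged));
```

&lt;p&gt;The empty body and the bare "return null" body are both flagged, and the body that logs before returning passes, which matches the rule's intent as described above.&lt;/p&gt;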

&lt;p&gt;On top of that, cortex's &lt;a href="https://github.com/air-closet/cortex/blob/main/docs/guidelines/testing.md" rel="noopener noreferrer"&gt;testing guideline&lt;/a&gt; opens with "&lt;strong&gt;Coverage is not the goal, just a supporting indicator&lt;/strong&gt;," and threshold-lowering / &lt;code&gt;istanbul ignore&lt;/code&gt; workarounds get bounced as Critical in Auto Review. So even when Coverage is satisfied, "this is intentionally deleting a branch" / "this is swallowing the exception" comes back as a Major finding.&lt;/p&gt;

&lt;p&gt;From the lesson "a single metric warps implementation," we descended through &lt;strong&gt;guideline that states the principle → lint that mechanically rejects → Auto Review that evaluates as a dimension&lt;/strong&gt; before Coverage 90% finally functioned as the "minimum floor" it should have been all along. This too sits in the lineage of the &lt;strong&gt;Recurrence Prevention&lt;/strong&gt; mechanism from Part 4 (so the same trap can't be stepped on twice).&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Sub-Agent → Sequential Evaluation
&lt;/h3&gt;

&lt;p&gt;Third: an internal-structure call about Auto Review. &lt;strong&gt;Distribute the 9 dimensions to parallel sub-agents and evaluate concurrently&lt;/strong&gt; was the plausible-looking design ("parallel = faster, parallel should also hold quality"). I tried it first and ended up throwing it out.&lt;/p&gt;

&lt;p&gt;What actually happened: &lt;strong&gt;time, cost, and accuracy all got worse&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time got worse&lt;/strong&gt;: each sub-agent has its own startup, its own context load, its own result aggregation overhead. "9-way parallel = 9x faster" didn't hold; there were even cases where sequential evaluation in one session ended up faster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost got worse&lt;/strong&gt;: each sub-agent loads PR diff + guidelines + related code independently — common context loads ran 9 times. Token consumption measured at &lt;strong&gt;just under 4x — not the naive 9x&lt;/strong&gt; (the context other than diff is shared across many dimensions, which is what kept it from blowing up to a full 9x)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy didn't hold&lt;/strong&gt;: parallel sub-agents don't see each other's verdicts, so the same problem comes back as "APPROVE" from one and "REQUEST_CHANGES" from another. Duplicate findings show up too. Without a "what kind of PR is this as a whole?" pass to anchor on, dimensional findings drift toward local optima and the overall picture gets worse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Switching to sequential evaluation: the same session goes through the 9 dimensions in order, so the context loads once, and each dimension's call has the previous dimension's verdict in front of it. &lt;strong&gt;All three (time, cost, accuracy) improve simultaneously.&lt;/strong&gt;&lt;/p&gt;
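
&lt;p&gt;The structural difference can be sketched in a few lines of Python. This is a toy model under stated assumptions, not cortex's implementation: the dimension names are placeholders, &lt;code&gt;evaluate&lt;/code&gt; stands in for a model call, and the only things counted are context loads and how many earlier verdicts each dimension can see. It does not reproduce the measured token figures.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Nine placeholder dimension names; the real dimensions are not listed here.
DIMENSIONS = [f"dimension_{i}" for i in range(1, 10)]


@dataclass
class Session:
    """One agent session: tracks context loads and the verdicts it has produced."""
    context_loads: int = 0
    verdicts: dict = field(default_factory=dict)

    def load_context(self):
        # Stand-in for loading PR diff + guidelines + related code.
        self.context_loads += 1


def evaluate(dimension, prior_verdicts):
    # Stand-in for a model call; records how many earlier verdicts were visible.
    return {"dimension": dimension, "seen": len(prior_verdicts)}


def review_parallel(dimensions):
    """Each dimension gets its own isolated session, so context loads once per dimension."""
    total_loads = 0
    results = {}
    for dim in dimensions:
        session = Session()
        session.load_context()
        total_loads += session.context_loads
        results[dim] = evaluate(dim, {})  # no sibling verdicts are visible
    return total_loads, results


def review_sequential(dimensions):
    """One session walks every dimension, so context loads once in total."""
    session = Session()
    session.load_context()
    for dim in dimensions:
        session.verdicts[dim] = evaluate(dim, session.verdicts)
    return session.context_loads, session.verdicts


parallel_loads, parallel_results = review_parallel(DIMENSIONS)
sequential_loads, sequential_results = review_sequential(DIMENSIONS)
print(parallel_loads, sequential_loads)
```

&lt;p&gt;In this model the parallel path loads context nine times and every dimension sees zero sibling verdicts, while the sequential path loads once and the last dimension sees the eight verdicts before it.&lt;/p&gt;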

&lt;p&gt;Of course, sequential evaluation introduces &lt;strong&gt;order dependence between dimensions&lt;/strong&gt; — earlier verdicts can shape later ones. That's a real trade-off, and I accepted it knowingly. &lt;strong&gt;Inter-dimension consistency at the cost of some order sensitivity&lt;/strong&gt; is more useful as a 9-dimension review than fully independent dimensions that contradict each other.&lt;/p&gt;

&lt;p&gt;The takeaway: the distributed-systems intuition that "&lt;strong&gt;parallel = faster, parallel = quality holds&lt;/strong&gt;" &lt;strong&gt;breaks its own assumptions in an AI harness&lt;/strong&gt;. Unlike parallelizing across CPU cores on your machine, with AI &lt;strong&gt;the context isn't shared memory; it's per-process state&lt;/strong&gt;. Sequential evaluation in one session ends up better on speed, token efficiency, and inter-dimension consistency at the same time — a structural property that's easy to miss at design time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Section Is Really Saying
&lt;/h3&gt;

&lt;p&gt;The form I've described across the series is &lt;strong&gt;the result of a lot of trial and error&lt;/strong&gt;. I didn't start with the right answer and lay it out from there. Deciding to throw things away despite the sunk cost, falling into the trap of letting a metric I chose turn into the goal, building the distributed setup that looked natural but worked against me: those are the things I walked through before landing at the current shape.&lt;/p&gt;

&lt;p&gt;Not easy. I don't pretend it was. But &lt;strong&gt;if you do walk through it, real results follow&lt;/strong&gt; — that's the honest read on it now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Series
&lt;/h2&gt;

&lt;p&gt;What I most wanted to communicate across these six posts comes down to one thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI coding is not about "how to use AI" — it's about designing the environment AI runs in.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or, put another way:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI isn't something to trust. It's something to design.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Assuming a large codebase&lt;/strong&gt;: prompt engineering, model selection, and tool selection each matter individually, but polishing them alone doesn't get you to auto-merging PRs, auto-healing incidents, or non-engineer development. Getting there requires building &lt;strong&gt;a codebase / business flow / observability / repair cycle where AI doesn't need to infer&lt;/strong&gt;. That's not an individual AI skill; it's &lt;strong&gt;an environment-design problem&lt;/strong&gt;. (Conversely, for a small project of a few dozen files, today's AI models work fine standalone. &lt;strong&gt;Harnesses become essential when scale exceeds what one person can hold in their head.&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;And the conviction at the root of environment design is, to repeat myself, "&lt;strong&gt;I don't trust AI to fill in the blanks for me&lt;/strong&gt;". That means facing the reality that context that wasn't handed over isn't known, and that the ideal state doesn't happen unless it's spelled out. Once you accept that premise, &lt;strong&gt;what to build clarifies naturally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Looking back, four decisions ended up being the ones that mattered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locked the conviction first&lt;/strong&gt;: putting words to the root ("AI isn't something to trust") gave priority order to every mechanism. If I'd started from technique, I don't think I'd have made it to the current form&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invested with throwing-out as the default&lt;/strong&gt;: like I did with code-graph at the two-month mark, I went into things with "throwing this out is OK." Drag sunk cost forward and you can't move forward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refused standalone numerical targets&lt;/strong&gt;: the moment a metric like Coverage 90% becomes the goal on its own, implementations warp. Designed the system so it gets evaluated alongside other dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designed for "no inference," not around AI's capability&lt;/strong&gt;: I prioritized building structure where AI doesn't have to infer, instead of relying on what AI can do. That's what made the system stable end-to-end, I think&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If even one of these is useful to someone starting on something similar, that would be great.&lt;/p&gt;




&lt;h2&gt;
  
  
  Afterword — Where Engineering Careers Are Heading
&lt;/h2&gt;

&lt;p&gt;Slipping off the wrap-up topic — this is something I've been turning over recently, written here in a "loosely held thought" tone, so feel free to skim.&lt;/p&gt;

&lt;p&gt;As harnesses mature, I think &lt;strong&gt;engineering work splits along two directions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One direction is &lt;strong&gt;value creation from problem identification and business design&lt;/strong&gt;. In the world of Part 5 — where non-engineer PRs work — "writing code" stops being scarce, and the actual scarce thing becomes &lt;strong&gt;the ability to define what to build&lt;/strong&gt;. The person closest to the requirements (a PMO, a business manager, a domain-deep engineer) ends up driving Claude Code through to the merged PR themselves. This direction looks less like "engineer" and more like a &lt;strong&gt;business designer who moves between domain and implementation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The other direction is &lt;strong&gt;building the foundation that lets all of that happen safely and quickly&lt;/strong&gt;. Non-engineers can open PRs to the production repo only because the harness underneath holds quality: knowledge graph, Auto Review, Self-Healing, Recurrence Prevention, lint, CI, tests, observability stack, all interlocked. Designing / maintaining / evolving that gets &lt;em&gt;harder&lt;/em&gt;, not easier. This is the &lt;strong&gt;house-builder, rail-layer side&lt;/strong&gt;, and it demands deep infra understanding, security instinct, observability design, and a feel for AI's architectural quirks.&lt;/p&gt;

&lt;p&gt;I'm building cortex, so I'm spending more time on the latter; building "a foundation where the business can run its own changes" is genuinely fun for me. &lt;strong&gt;That said, I'm not the type who fully commits to one side&lt;/strong&gt; — I move between listening to business questions and assembling the foundation, and the satisfaction from each is its own kind. This isn't a "which is more important?" question — the harness exists precisely so the former is possible, and the former being alive is what gives the latter meaning. They're mutually dependent.&lt;/p&gt;

&lt;p&gt;Maybe the era of polishing &lt;strong&gt;just&lt;/strong&gt; "coding ability" is shifting slightly. Where to put your value, or whether to move between both directions, becomes a choice more engineers will need to make intentionally.&lt;/p&gt;




&lt;p&gt;Six posts in, thanks for sticking with me to the end.&lt;/p&gt;




&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex's harness&lt;/td&gt;
&lt;td&gt;PRs auto-merge / incidents self-heal before you notice&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;ai-harness-intro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code, docs, DB, infra unified into one graph&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;AI PR review&lt;/td&gt;
&lt;td&gt;webhook → AI review → auto-fix → squash merge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert → AI investigates → fix PR + new lint/type gate → auto redeploy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series Final&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;This post&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devops</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Part 5 ("The Author Doesn't Have to Be an Engineer") has been generating sharp comments.
Worth a read for the thread alone.</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Thu, 11 Jun 2026 01:06:48 +0000</pubDate>
      <link>https://dev.to/ryantsuji/part-5-the-author-doesnt-have-to-be-an-engineer-has-been-generating-sharp-comments-worth-a-1pao</link>
      <guid>https://dev.to/ryantsuji/part-5-the-author-doesnt-have-to-be-an-engineer-has-been-generating-sharp-comments-worth-a-1pao</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4" class="crayons-story__hidden-navigation-link"&gt;The Author Doesn't Have to Be an Engineer: How the Harness Holds Quality (Series Part 5)&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
      &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4" class="crayons-article__context-note crayons-article__context-note__feed"&gt;&lt;p&gt;Self-healing guardrails for business-side PRs&lt;/p&gt;

&lt;/a&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/ryantsuji" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843591%2F8b126f91-f561-4e6b-8492-814b18d680ec.jpg" alt="ryantsuji profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/ryantsuji" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Ryosuke Tsuji
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Ryosuke Tsuji
                
              
              &lt;div id="story-author-preview-content-3849367" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/ryantsuji" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843591%2F8b126f91-f561-4e6b-8492-814b18d680ec.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Ryosuke Tsuji&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 8&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4" id="article-link-3849367"&gt;
          The Author Doesn't Have to Be an Engineer: How the Harness Holds Quality (Series Part 5)
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/engineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;engineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/github"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;github&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;19&lt;span class="hidden s:inline"&gt;&amp;nbsp;reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              

              30&lt;span class="hidden s:inline"&gt;&amp;nbsp;comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            16 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>career</category>
      <category>discuss</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>The Author Doesn't Have to Be an Engineer: How the Harness Holds Quality (Series Part 5)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Mon, 08 Jun 2026 23:32:30 +0000</pubDate>
      <link>https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4</link>
      <guid>https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1 (Series Intro)&lt;/a&gt;, I wrote about how cortex's harness has matured to the point where &lt;strong&gt;non-engineers (business-side managers, PMOs, and the like) can open PRs to the production repository&lt;/strong&gt;. The harness here is the runtime foundation for AI in production -- the combination of the knowledge graph, Auto Review, Self-Healing, and Recurrence Prevention covered across Parts 1 through 4.&lt;/p&gt;

&lt;p&gt;Part 5 is what comes next: &lt;strong&gt;that harness has now reached the layer of who actually writes the code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;"Surely an engineer is still checking afterward, right?" -- I expect a lot of readers will land here with that question. So this post leads with &lt;strong&gt;one concrete example&lt;/strong&gt; before anything else.&lt;/p&gt;

&lt;p&gt;Part 5 covers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What kinds of PRs are actually shipping&lt;/strong&gt; -- two recent ones in detail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What works and what doesn't&lt;/strong&gt; -- the boundary between adding on top of an existing stack and standing up new infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why this holds for non-engineers&lt;/strong&gt; -- how the four mechanisms from Parts 1-4 carry it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's next -- into toC services&lt;/strong&gt; -- the direction of travel for consumer-facing scale&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The deeper toC implementation story will live in a separate post; here you'll get the framing and the direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex harness&lt;/td&gt;
&lt;td&gt;PRs merging unattended / incidents fixed before anyone notices&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;ai-harness-intro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code / docs / DB / infra unified into one graph&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Auto PR review&lt;/td&gt;
&lt;td&gt;webhook -&amp;gt; AI review -&amp;gt; auto-fix -&amp;gt; squash merge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert -&amp;gt; AI investigates -&amp;gt; fix PR + new lint/type gate -&amp;gt; auto redeploy + same pattern auto-rejected from then on&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;This article ← you are here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series Final&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa"&gt;cortex-philosophy&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Start with one scene
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;+1,742 line / 41 file&lt;/strong&gt; PR lands on the internal dashboard web app. Title: "PL dashboard ver.2". The change opens up project visibility to managers and team leads across multiple business units, scoping what each person sees to their own division or team. It adds an SSoT in the shared types package, new routes on the API server with SQL involving &lt;code&gt;INNER JOIN&lt;/code&gt; and &lt;code&gt;LEFT JOIN&lt;/code&gt;, new pages and view-state on the web app, and a personal-settings surface -- the whole stack of things you'd expect for a real feature.&lt;/p&gt;

&lt;p&gt;The point is, this isn't a typo fix or a string swap. Entities, repositories, API routes, screens, filters, personal settings -- every layer you'd normally touch for a feature got touched. &lt;strong&gt;A few days of work for an experienced engineer, in scale terms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The review-fix cycle ran like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PR open&lt;/strong&gt; (+1,742 / 41 files)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auto-review pass 1&lt;/strong&gt;: Major finding (a permission-scope fall-through -- data from other divisions leaking into the view that shouldn't be there) plus a handful of Minor items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;author bot push&lt;/strong&gt;: closes the scope fall-through, addresses the Minor items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auto-review pass 2&lt;/strong&gt;: Nit items remaining, plus a lint catch (&lt;code&gt;no-empty-function&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;author bot push&lt;/strong&gt;: lint clean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auto-review pass 3&lt;/strong&gt;: still some COMMENTED nits, not yet APPROVE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;author bot push (iteration 2)&lt;/strong&gt;: hardens loading skeleton, reverts an unnecessary JSDoc tweak&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auto-review pass 4: APPROVED&lt;/strong&gt; → CI green + APPROVE both met → auto-merge → production&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From PR open to merge: &lt;strong&gt;four review-fix rounds, three author-bot pushes, zero human reviewers in the loop.&lt;/strong&gt; The reviews come from the auto-review bot, the fixes come from an author bot (an automated review-response agent that the PR author has running on their machine), the final APPROVE is submitted by the AI, and an auto-merge script picks it up the instant CI is green. Production lands with &lt;strong&gt;56/56 shared type checks (SSoT), 2,284/2,284 API tests, 1,113/1,113 web specs, and 0 lint errors.&lt;/strong&gt; (cortex splits the lint job between &lt;a href="https://oxc.rs/docs/guide/usage/linter" rel="noopener noreferrer"&gt;oxlint&lt;/a&gt; for general checks and a custom eslint plugin for the &lt;code&gt;@graph-*&lt;/code&gt; rules.)&lt;/p&gt;
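
&lt;p&gt;The merge condition described above ("CI green + APPROVE both met") reduces to a two-input gate. A minimal sketch in Python, assuming hypothetical function names and an illustrative pass history: the post only states that pass 3 was COMMENTED and pass 4 was APPROVED, so the other review states and the CI values here are assumptions.&lt;/p&gt;

```python
def can_auto_merge(ci_green, review_state):
    """Merge only when CI is green and the latest review is an approval."""
    return bool(ci_green) and review_state == "APPROVED"


def first_mergeable_pass(passes):
    """Return the 1-based number of the first pass that satisfies the gate, or None."""
    for number, (ci_green, review_state) in enumerate(passes, start=1):
        if can_auto_merge(ci_green, review_state):
            return number
    return None


# Illustrative history of (ci_green, review_state) per review pass.
# Only "COMMENTED" on pass 3 and "APPROVED" on pass 4 come from the post;
# everything else is assumed for the example.
HISTORY = [
    (True, "CHANGES_REQUESTED"),  # pass 1: Major finding (state assumed)
    (False, "COMMENTED"),         # pass 2: lint catch (CI red assumed)
    (True, "COMMENTED"),          # pass 3: nits remain, not yet APPROVE
    (True, "APPROVED"),           # pass 4: both conditions met
]

print(first_mergeable_pass(HISTORY))
```

&lt;p&gt;Neither condition is sufficient alone: a green CI with a COMMENTED review stays open, and so would an approval on a red build.&lt;/p&gt;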

&lt;p&gt;The first review pass is worth noting. "Scope fall-through" is a somewhat technical finding -- a hole in the permission filter meant data from divisions other than your own could leak into the view. This is an internal dashboard, so it's not an external-leak incident, but &lt;strong&gt;"only see what's relevant to you" is the whole point of a dashboard like this&lt;/strong&gt; -- losing it doesn't just risk an information slip, it drowns the user in noise that they shouldn't have to filter through in the first place. That's the kind of issue that's easy to merge by mistake and painful to notice in production. &lt;strong&gt;The fact that auto-review caught it on pass one and bounced it back for the author side to fix is what makes this whole flow viable for non-engineers.&lt;/strong&gt; Without that loop, a PR of this size from a non-engineer would be a bad bet.&lt;/p&gt;

&lt;p&gt;And: &lt;strong&gt;the author of this PR is not an engineer&lt;/strong&gt;. A business-side teammate handed a feature description to Claude Code, leaned on the knowledge graph (covered in &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt;) to pull in the relevant existing code, and the +1,742 line PR is what came back. The four review-fix rounds above are what happened next.&lt;/p&gt;

&lt;p&gt;That setup lines up directly with the central claim of this post: &lt;strong&gt;the person who knows the business requirements best, instead of organizing them and handing them to an engineer, runs them through Claude Code to production themselves.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quick clarification on "write." When I say "write" in this article, I don't mean &lt;strong&gt;typing line by line in an editor&lt;/strong&gt;. I mean &lt;strong&gt;handing the business requirements to Claude Code, judging the resulting diffs and AI review comments with domain knowledge, and seeing it through to a production merge&lt;/strong&gt; -- the whole arc. Most of the actual diff is written by Claude Code; review feedback is handled by the author bot. What the human does is three things: put what they want into words, make the judgment calls along the way ("does this fit, is this off"), and sign off when it's ready to merge. None of that is implementation work in the technical sense. That's what "write" means here.&lt;/p&gt;

&lt;p&gt;There's still a learning curve, of course -- the prompts you give Claude Code, where to point it for context. But &lt;strong&gt;none of that is learning to program.&lt;/strong&gt; What you need is the ability to articulate what you want clearly, not syntax or framework knowledge.&lt;/p&gt;

&lt;p&gt;The harness covers quality, so even at +1,742 lines / 41 files, this works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Out of scope&lt;/strong&gt;: this post does &lt;em&gt;not&lt;/em&gt; cover the path where non-engineers freely ship apps to a sandbox environment instead of opening PRs against the production repo. That's a different mechanism, covered in an earlier post: &lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;Bridging "I Want to Build" and "I Want to Publish Safely" for Non-Engineers with a Custom Sandbox MCP&lt;/a&gt;. This post is specifically about &lt;strong&gt;opening PRs against the production repo&lt;/strong&gt; -- the front door that's traditionally been engineer-only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When you need a change, you can make it yourself
&lt;/h2&gt;

&lt;p&gt;The point of the previous section is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When you need a change, you make it yourself, without flagging an engineer.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When that holds, work like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I want a new metric on the dashboard"&lt;/li&gt;
&lt;li&gt;"The aggregation filter doesn't match how the business actually operates"&lt;/li&gt;
&lt;li&gt;"I want a small business-support feature embedded in the production app"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;stops queueing behind whatever an engineer is in the middle of. The fix lands when the need lands.&lt;/p&gt;

&lt;p&gt;Think about the old flow. Someone on the business side notices a small thing that needs to change. They write the requirements up. They open a ticket or a Slack thread for an engineer. The engineer is in the middle of something else, so it queues. When they finally get to it, the interpretation drifts from what the business actually meant, there's a back-and-forth, a review pass, and only then does it ship. Even a small change takes days to a week in wall-clock time.&lt;/p&gt;

&lt;p&gt;That's the cost of a &lt;strong&gt;translation layer between business understanding and code&lt;/strong&gt;, and it gets worse the busier the engineer is. The business's improvement cycle ends up paced by engineering's backlog.&lt;/p&gt;

&lt;p&gt;When the person who knows the requirements writes the change themselves, that translation layer and that queue both disappear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qelj5u7og19uc7gogaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qelj5u7og19uc7gogaw.png" alt="Business request to production -- the translation layer and queue go away, taking the cycle from days to hours" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are two recent examples of that working.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two non-engineer PRs that recently shipped
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;th&gt;Kind&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1573&lt;/td&gt;
&lt;td&gt;Deep bug fix&lt;/td&gt;
&lt;td&gt;+348 -177 / 7 files&lt;/td&gt;
&lt;td&gt;The dashboard's actuals number was unfairly exceeding the target. Root cause: the "which teams to aggregate" definition was asymmetric between target side and actuals side. Fix lifts the shared "teams to include" list into its own file and points both sides at it. &lt;strong&gt;Tests added too.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#1557&lt;/td&gt;
&lt;td&gt;Feature build on top of existing stack&lt;/td&gt;
&lt;td&gt;+1,742 -227 / 41 files&lt;/td&gt;
&lt;td&gt;The PL dashboard v2 from the opening scene. &lt;strong&gt;Entities, repositories, API, UI -- all touched&lt;/strong&gt;, but the web app itself (the stack) was already standing; this rides on top.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Different shapes, but both are non-engineer PRs that made it all the way to merge.&lt;/p&gt;

&lt;h4&gt;
  
  
  #1573 -- a deep root-cause fix
&lt;/h4&gt;

&lt;p&gt;This started from "the number looks wrong" on the business side, and the PR went all the way down to a data-integrity issue. The surface symptom: "the actuals number on the dashboard exceeds the monthly target, with the achievement reading 101% even though the team knows that's not real." The lazy fix would be a fudge factor or a clamp on the display. That's not what happened.&lt;/p&gt;

&lt;p&gt;The author dug into the aggregation queries and pinned the real cause: &lt;strong&gt;the actuals side and the target side were reading from different tables, and the definition of "which teams count" wasn't symmetric between them.&lt;/strong&gt; Teams that don't carry a target value (designers, PMOs, and so on) didn't show up on the target side but were getting counted on the actuals side, so the numerator was inflated against the denominator.&lt;/p&gt;

&lt;p&gt;The fix is structural, not cosmetic. A single file defines "the teams in scope for this aggregation" as a shared list, and both sides reference it. &lt;strong&gt;No future drift between target-side and actuals-side definitions&lt;/strong&gt; -- it's locked in by the shared constant.&lt;/p&gt;
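&lt;p&gt;As a minimal sketch of that shape: the team names, amounts, and function names below are invented for illustration and are not taken from the PR. The only point is that both sums pass through the same list, so a team without a target cannot inflate the numerator.&lt;/p&gt;

```typescript
// Hypothetical names and numbers throughout; the real file is not shown in the article.
type TeamAmount = { team: string; amount: number };

// Single source of truth: only teams that carry a target are in scope.
const TEAMS_IN_SCOPE: readonly string[] = ["sales", "cs"];

// Both the target side and the actuals side go through this one filter.
function sumInScope(rows: readonly TeamAmount[]): number {
  return rows
    .filter((row) => TEAMS_IN_SCOPE.includes(row.team))
    .reduce((total, row) => total + row.amount, 0);
}

function achievementRate(
  targets: readonly TeamAmount[],
  actuals: readonly TeamAmount[],
): number {
  const target = sumInScope(targets);
  // Avoid dividing by zero when no in-scope team has a target.
  return target === 0 ? 0 : sumInScope(actuals) / target;
}

const targets: TeamAmount[] = [
  { team: "sales", amount: 600 },
  { team: "cs", amount: 400 },
];
const actuals: TeamAmount[] = [
  { team: "sales", amount: 550 },
  { team: "cs", amount: 400 },
  { team: "design", amount: 60 }, // carries no target, so it must not count
];

// Asymmetric version: every actuals row counted against in-scope targets only.
const asymmetricRate =
  actuals.reduce((total, row) => total + row.amount, 0) / sumInScope(targets);
// Symmetric version: both sides restricted to the shared list.
const symmetricRate = achievementRate(targets, actuals);

console.log(asymmetricRate.toFixed(2), symmetricRate.toFixed(2));
```

&lt;p&gt;With these made-up numbers the asymmetric calculation lands just above 100% while the symmetric one stays below it, which is the same kind of gap the dashboard was showing.&lt;/p&gt;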

&lt;p&gt;The handling of "what data falls out of an aggregation" and "are the target and actuals sides really symmetric" is the kind of thing engineers miss too. &lt;strong&gt;A non-engineer working through it down to the structural level and fixing it there&lt;/strong&gt; is what stands out about this PR.&lt;/p&gt;

&lt;h4&gt;
  
  
  #1557 -- a big feature build on top of an existing stack
&lt;/h4&gt;

&lt;p&gt;This is the PR the opening scene walked through. +1,742 -227 across 41 files, spanning entity, repository, API, and UI -- &lt;strong&gt;a scale of change that's well past what people usually mean when they say "modification."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What lets a non-engineer ship this size of change is that &lt;strong&gt;the web app itself (the stack) is already standing.&lt;/strong&gt; Nobody had to stand up a new app, define a new Cloud Run service, add new dependency packages, or lay out a new directory structure. The change adds a route, a page, and a repository entry inside the existing structure. It rides on what's been built.&lt;/p&gt;

&lt;p&gt;This is the "on top of an existing stack" range. That's where the boundary is, and the next section spells it out.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on terminology&lt;/strong&gt;: "modification" in this article is broader than "small tweaks to existing logic." It includes adding new entities, new endpoints, and new pages on top of an existing stack. The line I'm drawing is between &lt;strong&gt;building on top of a stack&lt;/strong&gt; vs. &lt;strong&gt;standing the stack up in the first place.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What works, what doesn't
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The principle: standing up a stack is hard, building on top of one isn't
&lt;/h3&gt;

&lt;p&gt;The cleanest dividing line for non-engineer development isn't "modification vs. new development." It's &lt;strong&gt;"on top of an existing stack" vs. "stand up a new stack."&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standing up a new stack&lt;/strong&gt; (work that starts from infrastructure: a new web app from scratch, a new Cloud Run service defined from a Dockerfile, a brand-new BigQuery pipeline) → engineering work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adding to an existing stack&lt;/strong&gt; (a new page in an app that already exists, a new endpoint on an existing API, a new data source on an existing pipeline) → non-engineers can do this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of the example PRs above sit on the second side. The stack itself was already built (by me, for the most part), so they get to work inside it. "Stand up a new app from scratch" or "define infrastructure (Dockerfile / IaC) from zero" are still engineer territory.&lt;/p&gt;

&lt;p&gt;Put another way: &lt;strong&gt;renovations and new rooms inside an existing house are open to anyone. Building the house itself is engineering.&lt;/strong&gt; Get the structure wrong -- the load-bearing parts, the wiring, the plumbing -- and the cost of recovery is high. That's the part of stack design where there's still too much "if this is wrong, everything downstream breaks" risk to hand to AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's left for engineers: laying the rails -- the stack and the harness itself
&lt;/h3&gt;

&lt;p&gt;The flip side: &lt;strong&gt;the rail-laying work&lt;/strong&gt; -- standing up a stack, and &lt;strong&gt;extending the harness itself&lt;/strong&gt; -- is what non-engineers don't touch yet. Both require a different kind of knowledge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: containers, IaC, the operational characteristics of cloud services. Cloud Run resource ceilings, cold starts, Pub/Sub at-least-once semantics, BigQuery partition / cluster design, how Pulumi stacks split. Get this wrong and a thing that compiles can still fall over in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication for external integrations&lt;/strong&gt;: OAuth, webhooks, how you handle API keys and where they sit in Secret Manager. One small slip leaks credentials into the repo or lets a webhook fire something you didn't intend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security fundamentals&lt;/strong&gt;: what to never expose, where to sanitize, where the privilege boundary cuts. SQL injection, XSS, SSRF, broken authorization -- "it works" isn't enough here&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harness design and extension&lt;/strong&gt;: adding a new Auto Review dimension, changing Self-Healing logic, writing a new lint rule (e.g. in &lt;code&gt;eslint-plugin-graph&lt;/code&gt;), structuring guidelines. &lt;strong&gt;Decisions that require understanding how the whole flywheel hangs together&lt;/strong&gt; -- the most meta layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last bullet -- harness extension -- has an important implication: &lt;strong&gt;for non-engineers to keep being able to ship to production, someone has to keep the harness evolving.&lt;/strong&gt; Recurrence Prevention (Part 4) is the automatic loop that adds lint / CI guards / guidelines per trap. But the architecture of the harness itself -- the structure of dimensions, the calibration of judgment, the design of the Self-Healing flow, the shape of the knowledge graph -- is a meta layer that still requires engineering judgment.&lt;/p&gt;

&lt;p&gt;Concrete case: the current nine Auto Review dimensions (&lt;code&gt;[Graph]&lt;/code&gt; / &lt;code&gt;[Architecture]&lt;/code&gt; / &lt;code&gt;[Security]&lt;/code&gt; / &lt;code&gt;[Test]&lt;/code&gt; / &lt;code&gt;[Doc]&lt;/code&gt; / &lt;code&gt;[Impact]&lt;/code&gt; / &lt;code&gt;[Observability]&lt;/code&gt; / &lt;code&gt;[AI-Antipattern]&lt;/code&gt; / &lt;code&gt;[Recurrence]&lt;/code&gt;) were designed by observing past incidents and fix patterns. When a tenth dimension becomes necessary -- say, a "breaking change check on dependency upgrades" axis -- decisions about responsibility splits with existing dimensions and where to set thresholds are made by looking at the whole structure. That's the kind of engineering work that stays.&lt;/p&gt;

&lt;p&gt;The harness provides "rails you can't derail from." Laying those rails -- and laying the foundation those rails sit on -- is a different job, and it's still on engineering. &lt;strong&gt;Engineers lay the rails; anyone can run on them.&lt;/strong&gt; That's the boundary today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0pvl864tgh1o4vpzlru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0pvl864tgh1o4vpzlru.png" alt="Three layers -- the upper layer is the non-engineer surface; the lower two (harness and stack) are engineering work" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works for non-engineers
&lt;/h2&gt;

&lt;p&gt;This is a short recap, because everything that makes it work was already covered in Parts 1 through 4. &lt;strong&gt;Four mechanisms reinforcing each other&lt;/strong&gt; -- that's what lets non-engineers operate safely on top of an existing stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  ① The knowledge graph pulls relevant code from "what you want to do"
&lt;/h3&gt;

&lt;p&gt;cortex-product-graph from &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt; -- the unified graph fusing code, docs, DB schema, and infrastructure into one knowledge base (implementation name: cpg) -- carries this layer.&lt;/p&gt;

&lt;p&gt;Non-engineers don't need to know function names or repo structure. A natural-language question like "I want to add a metric column to the dashboard" goes to Claude Code, which hits the knowledge graph with a semantic search and gets back the relevant nodes -- the screen, the API, the DB, the docs -- in one or two hops. &lt;strong&gt;You can get started without knowing the technical vocabulary.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For #1557: the author told Claude Code "I want PL dashboard v2 with division/team scoping for non-PI-Div PMOs and team leads," and the knowledge graph pulled the existing &lt;code&gt;/projects&lt;/code&gt; route, &lt;code&gt;project-repository.ts&lt;/code&gt;, &lt;code&gt;FilterHeaders.tsx&lt;/code&gt;, and &lt;code&gt;ProjectTable.tsx&lt;/code&gt; as the relevant nodes. The author never needed to know what file to edit. &lt;strong&gt;That's how the translation layer drops out.&lt;/strong&gt;&lt;/p&gt;
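&lt;p&gt;To make "one or two hops" concrete, here is a toy sketch of the expansion step. It assumes the semantic search has already returned a seed node. The edge list, the &lt;code&gt;page:&lt;/code&gt; / &lt;code&gt;api:&lt;/code&gt; / &lt;code&gt;repo:&lt;/code&gt; prefixes, and the &lt;code&gt;table:projects&lt;/code&gt; node are invented for illustration. This is not how cpg stores or queries its graph.&lt;/p&gt;

```typescript
// Toy adjacency list; node ids and edges are hypothetical.
type Edges = { [node: string]: string[] };

const edges: Edges = {
  "page:/projects": [
    "api:/projects",
    "component:FilterHeaders.tsx",
    "component:ProjectTable.tsx",
  ],
  "api:/projects": ["repo:project-repository.ts"],
  "repo:project-repository.ts": ["table:projects"],
};

// Breadth-first expansion: collect every node within maxHops of the seeds.
function neighborhood(graph: Edges, seeds: string[], maxHops: number): string[] {
  const seen = new Set(seeds);
  let frontier = seeds;
  let hops = 0;
  while (maxHops > hops) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of graph[node] ?? []) {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
    hops += 1;
  }
  return Array.from(seen);
}

// Pretend the semantic search matched the dashboard page as the seed.
const oneHop = neighborhood(edges, ["page:/projects"], 1);
const twoHops = neighborhood(edges, ["page:/projects"], 2);
console.log(oneHop.length, twoHops.length);
```

&lt;p&gt;In this toy graph, one hop from the page reaches the API route and the two components, and a second hop pulls in the repository file. That is roughly the set the author would otherwise have had to find by hand.&lt;/p&gt;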

&lt;h3&gt;
  
  
  ② Auto Review enforces quality at the gate
&lt;/h3&gt;

&lt;p&gt;The 9-dimension automated review from &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt; is the next layer. It covers &lt;code&gt;[Graph]&lt;/code&gt; / &lt;code&gt;[Architecture]&lt;/code&gt; / &lt;code&gt;[Security]&lt;/code&gt; / &lt;code&gt;[Test]&lt;/code&gt; / &lt;code&gt;[Doc]&lt;/code&gt; / &lt;code&gt;[Impact]&lt;/code&gt; / &lt;code&gt;[Observability]&lt;/code&gt; / &lt;code&gt;[AI-Antipattern]&lt;/code&gt; / &lt;code&gt;[Recurrence]&lt;/code&gt;. The AI returns REQUEST_CHANGES on what's missing and loops with the author bot until APPROVE. The four-round example from the opening scene is exactly this in motion.&lt;/p&gt;

&lt;p&gt;The point is this: &lt;strong&gt;the first PR doesn't have to be perfect.&lt;/strong&gt; The author doesn't need to ship a completed, security-hole-free version on the first try. Push the initial PR and the rest gets sorted by the auto-review and the author bot bouncing off each other. The reason &lt;strong&gt;the author bot doesn't spin off into a loop of confused fixes&lt;/strong&gt; is that the knowledge graph holds the full codebase context: changes are made with structural awareness of what they touch, so misreadings of the review feedback don't compound.&lt;/p&gt;
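&lt;p&gt;The control flow of that loop can be sketched in a few lines. Everything here is a stand-in: the reviewer and the author bot are stub functions, the diff is a plain string, and the &lt;code&gt;maxRounds&lt;/code&gt; cap is an assumption of this sketch rather than a documented cortex setting.&lt;/p&gt;

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";
type Review = { verdict: Verdict; findings: string[] };
type Reviewer = (diff: string) => Review;
type AuthorBot = (diff: string, findings: string[]) => string;
type LoopResult = { approved: boolean; rounds: number; diff: string };

// Review, fix, and review again until APPROVE or the round budget runs out.
function reviewLoop(
  initialDiff: string,
  review: Reviewer,
  fix: AuthorBot,
  maxRounds: number,
): LoopResult {
  let diff = initialDiff;
  let rounds = 0;
  while (maxRounds > rounds) {
    rounds += 1;
    const result = review(diff);
    if (result.verdict === "APPROVE") {
      return { approved: true, rounds, diff };
    }
    diff = fix(diff, result.findings);
  }
  // Budget exhausted without approval: hand over to a human.
  return { approved: false, rounds, diff };
}

// Stub reviewer: reports the first missing dimension tag, one per round.
const requiredTags = ["[Test]", "[Doc]"];
const stubReview: Reviewer = (diff) => {
  const missing = requiredTags.filter((tag) => !diff.includes(tag));
  return missing.length === 0
    ? { verdict: "APPROVE", findings: [] }
    : { verdict: "REQUEST_CHANGES", findings: [missing[0]] };
};

// Stub author bot: "fixes" a finding by appending its tag to the diff.
const stubFix: AuthorBot = (diff, findings) => diff + " " + findings.join(" ");

const outcome = reviewLoop("add metric column", stubReview, stubFix, 5);
const capped = reviewLoop("add metric column", stubReview, stubFix, 1);
console.log(outcome.approved, outcome.rounds, capped.approved);
```

&lt;p&gt;The stub run needs three rounds: two rejections, each fixed by the author bot, then an approval. With a budget of one round it stops unapproved, which is the point where a real setup would escalate instead of looping forever.&lt;/p&gt;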

&lt;h3&gt;
  
  
  ③ Self-Healing catches what slips through to production
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;Part 4&lt;/a&gt; covered Self-Healing. If something does break in production, the AI starts from the alert, investigates root cause, opens a fix PR, and gets it auto-redeployed -- the entire loop runs without humans. &lt;strong&gt;Incidents triggered by a non-engineer's change recover on their own, hands-off.&lt;/strong&gt; That's what makes the bar to opening a PR feel survivable.&lt;/p&gt;

&lt;p&gt;This isn't "non-engineers are safe because nothing can go wrong." It's "even if something goes wrong, the harness has it covered." The system is designed to &lt;strong&gt;minimize damage&lt;/strong&gt;, not eliminate failure. The three-layer construction (Observation → Repair → Strengthening) from Part 4 is what makes that net real.&lt;/p&gt;

&lt;h3&gt;
  
  
  ④ Recurrence Prevention keeps the trap count from growing
&lt;/h3&gt;

&lt;p&gt;The Recurrence Prevention loop from the back half of Part 4. &lt;strong&gt;Every trap that gets stepped on gets nailed down in the same PR&lt;/strong&gt;, so the next attempt at the same pattern gets caught. The form depends: mechanizable traps become lint or CI guards; less-mechanizable ones become entries in the guideline docs (&lt;code&gt;docs/gotchas&lt;/code&gt;, severity docs) that the AI reviewer reads. Either way, the catch happens before merge. Non-engineers contribute to this loop too -- when they hit a trap, the doc entry that prevents the next person from hitting it can come from them.&lt;/p&gt;

&lt;p&gt;As this compounds, &lt;strong&gt;the rails get denser.&lt;/strong&gt; Where there was once a loose "don't go that way" guideline, every incident adds another small rail saying "or this way, or this way, or this way," and the lane that's safe to walk gets clearer. The denser the rails, the safer non-engineers are in the lane.&lt;/p&gt;

&lt;p&gt;→ The four pieces aren't independent components. &lt;strong&gt;Each one's output feeds the next one's input.&lt;/strong&gt; This is the Guides + Sensors flywheel from Part 1 in action. I won't re-explain the details since they're in the prior posts, but &lt;strong&gt;non-engineers shipping to production is the result of all four wheels turning together.&lt;/strong&gt; Take any one out and the level of upfront knowledge required to write to production jumps, and the whole thing collapses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next -- carrying the pattern to consumer-facing services
&lt;/h2&gt;

&lt;p&gt;cortex is an internal AI platform, so the system as it stands can't be lifted into a toC production service as-is. &lt;strong&gt;The biggest issue is the difference in quality bar.&lt;/strong&gt; For toC, "detect after user impact → Self-Healing fix" is too late. The requirement becomes: incidents don't happen in the first place, and nothing ships without review, testing, and a human sign-off on top.&lt;/p&gt;

&lt;p&gt;That said, the &lt;strong&gt;shape&lt;/strong&gt; of the harness -- a knowledge graph for context, 9-dimension AI review, an author bot responding to feedback -- carries over directly. The thing that changes is &lt;strong&gt;the final step&lt;/strong&gt;: cortex's auto-merge becomes "&lt;strong&gt;AI does the prep, a human signs off&lt;/strong&gt;." That doesn't mean shrinking the AI's range. The AI still handles the heavy lifting (test writing, environment setup, test runs, the 9-dimension review), and only the final APPROVE is left to a human. The obvious objection is "if the human sign-off stays, engineer time doesn't really decrease, does it?" But historically engineers spent the bulk of their time on the implementation, the test writing, the environment setup, the self-review, and the back-and-forth on review. Sign-off itself is the smallest piece of that pie. With AI doing the prep work, what an engineer spends time on shifts from implementation labor to &lt;strong&gt;quality judgment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A caveat on the knowledge graph: &lt;strong&gt;it only earns its keep at large codebase scale.&lt;/strong&gt; If the codebase fits in one AI context window, a cross-repo graph is unnecessary. cortex (100+ apps) and the toC side (40+ repos) need one because the scale forces it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59dzskllpk4kp8jirits.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59dzskllpk4kp8jirits.png" alt="cortex's shape carried into toC services -- internal knowledge graph → service-side knowledge graph / auto-merge → AI-prep + human sign-off / autonomous Self-Healing → human final call -- three things shift, the rest holds" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The concrete plan is real (extending the knowledge graph across the toC side's 40+ repositories, designing the AI-prep flow, etc.), and the full version goes in a separate post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The person who knows the business requirements best, instead of writing them up for an engineer, runs them through to production directly.&lt;/strong&gt; Quality is held by the harness, so what's required from the writer is domain knowledge and the ability to direct an AI well. Business asks stop queuing behind engineering, and the cycle speeds up&lt;/li&gt;
&lt;li&gt;The four mechanisms from Parts 1-4 (knowledge graph / Auto Review / Self-Healing / Recurrence Prevention) form a reinforcing flywheel. &lt;strong&gt;The first PR doesn't have to be perfect, and what does break is repaired automatically.&lt;/strong&gt; That's the design&lt;/li&gt;
&lt;li&gt;The boundary: &lt;strong&gt;engineers lay the rails, anyone can run on them.&lt;/strong&gt; Standing up the stack (infrastructure, authentication, security) and extending the harness itself (new lint rules, new review dimensions, Self-Healing flow design) stay on engineering&lt;/li&gt;
&lt;li&gt;Carrying this to consumer-facing toC services, &lt;strong&gt;the knowledge graph (a 40+ repo cross-repo graph on the service side) covers the context layer, but the quality bar shifts, so auto-merge becomes "AI prep + human sign-off."&lt;/strong&gt; Details in a separate post&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;In &lt;strong&gt;Part 6&lt;/strong&gt; I'll wrap the series with the philosophy at the foundation -- &lt;strong&gt;why this design, what got given up, what got kept&lt;/strong&gt;. The series so far has been about "the parts that are working"; Part 6 puts the failures and the dead ends on the table too, including the gap between the philosophy and the actual implementation. A retrospective for myself, and -- I hope -- a reference for anyone heading down a similar road.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>engineering</category>
      <category>github</category>
    </item>
    <item>
      <title>Fixed Before Anyone Notices, Stronger After Every Fix: Self-Healing + Recurrence Prevention (Series Part 4)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Mon, 01 Jun 2026 23:57:25 +0000</pubDate>
      <link>https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86</link>
      <guid>https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt; I covered &lt;strong&gt;AI reviewing AI PRs&lt;/strong&gt; -- the auto-review pipeline that defends quality &lt;strong&gt;at the PR stage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post is the other side: &lt;strong&gt;defending quality in production&lt;/strong&gt;, via &lt;strong&gt;Self-Healing&lt;/strong&gt;. A production alert fires, an AI investigates it, opens a fix PR, the PR goes through the same auto-review pipeline from Part 3, gets auto-merged and auto-redeployed. And the same fix PR is &lt;strong&gt;required to add a new Guide -- whether that's a lint rule, CI guard, type constraint, or guideline update&lt;/strong&gt; -- so the same anti-pattern gets auto-rejected from then on. The guardrails grow every time.&lt;/p&gt;

&lt;p&gt;"Incidents get fixed automatically" is catchy, but by itself it's probably not enough in the long run. You have to &lt;strong&gt;close the recurrence class while you fix the incident&lt;/strong&gt; -- self-healing &lt;strong&gt;plus&lt;/strong&gt; self-strengthening -- before the quality gates start to compound over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with last month's numbers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;115 Self-Healing PRs merged in the past 30 days.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effectively all of them merged and deployed without human involvement.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humans only step in when the AI judges "this is not something code can fix."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the current state of "incident response" at cortex.&lt;/p&gt;

&lt;p&gt;Don't read "115 = 115 user-impacting incidents" though. Roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;About half (54) are Deploy Failed-style alerts&lt;/strong&gt; -- CI / Pulumi deploy step caught a failure, the AI absorbed it &lt;strong&gt;before it shipped to production&lt;/strong&gt;. Recently the &lt;code&gt;[Recurrence]&lt;/code&gt; loop (covered later) has been piling up countermeasures here, so this bucket is trending down anecdotally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The remaining 61 are production-runtime alerts&lt;/strong&gt; (Service Error Log Detected / Pipeline Failure / Generator Failure etc.) -- the service is running in production, but an error-log threshold or consecutive-failure threshold tripped. The AI absorbed them before they propagated to user impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it's less "incident response" than "&lt;strong&gt;production anomalies that monitoring caught, fixed 115 times by AI before anyone woke up&lt;/strong&gt;." The number of incidents humans actually have to acknowledge is in the low single digits per month.&lt;/p&gt;
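&lt;p&gt;For readers who haven't built this kind of alert, a "consecutive-failure threshold" boils down to a streak counter. The sketch below is a generic illustration with made-up run results. The actual alert rules and thresholds cortex uses are not shown in this post.&lt;/p&gt;

```typescript
// Returns true once `threshold` failures occur back to back.
// `results` is a run history in time order: true = success, false = failure.
function tripsConsecutiveFailures(
  results: readonly boolean[],
  threshold: number,
): boolean {
  let streak = 0;
  for (const ok of results) {
    // A success resets the streak; a failure extends it.
    streak = ok ? 0 : streak + 1;
    if (streak >= threshold) {
      return true;
    }
  }
  return false;
}

// Hypothetical pipeline histories.
const flaky = [true, false, false, true, false];
const broken = [true, false, false, false];

const flakyTrips = tripsConsecutiveFailures(flaky, 3);
const brokenTrips = tripsConsecutiveFailures(broken, 3);
console.log(flakyTrips, brokenTrips);
```

&lt;p&gt;Counting streaks rather than total failures is what keeps a single flaky run from paging anyone while a genuinely broken pipeline still trips quickly.&lt;/p&gt;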

&lt;p&gt;There's also a clear pattern of &lt;strong&gt;the same service firing repeatedly&lt;/strong&gt; (e.g. &lt;code&gt;gcs-transformer&lt;/code&gt; is 25 of the 61) -- which is exactly what the &lt;code&gt;[Recurrence]&lt;/code&gt; loop covered later is supposed to &lt;strong&gt;eliminate by turning into lint or type gates&lt;/strong&gt;. That's the back half of this post.&lt;/p&gt;

&lt;p&gt;One more honest note: &lt;strong&gt;the recent month's number is slightly inflated&lt;/strong&gt;. The codebase had a fair number of "silent catch" patterns -- catch blocks that swallow exceptions without logging anything. We added the &lt;code&gt;no-silent-catch&lt;/code&gt; lint rule and &lt;strong&gt;swept the existing silent catches in batches&lt;/strong&gt;, which exposed previously hidden production errors as alerts. So part of the spike is "monitoring caught up to reality." Once the &lt;code&gt;[Recurrence]&lt;/code&gt; loop converts these into lint over time, the number should converge. &lt;strong&gt;"Things we couldn't see, we can see now" is a quality improvement&lt;/strong&gt; -- what we're seeing is the catch-up phase.&lt;/p&gt;
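&lt;p&gt;To give a feel for what such a rule looks like, here is a deliberately small sketch in the standard ESLint rule shape (&lt;code&gt;meta&lt;/code&gt; plus &lt;code&gt;create&lt;/code&gt;, with a &lt;code&gt;CatchClause&lt;/code&gt; visitor). It is not the cortex &lt;code&gt;no-silent-catch&lt;/code&gt; implementation. It only flags completely empty catch blocks, and the local types are simplified stand-ins for ESLint's own so the snippet runs without the package installed.&lt;/p&gt;

```typescript
// Simplified stand-ins for ESLint's node and context types (assumptions of this sketch).
type CatchClauseNode = { type: "CatchClause"; body: { body: unknown[] } };
type ReportDescriptor = { node: CatchClauseNode; message: string };
type RuleContext = { report: (descriptor: ReportDescriptor) => void };

const noSilentCatchSketch = {
  meta: {
    type: "problem",
    docs: { description: "disallow catch blocks that swallow the exception" },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      // Called for every `catch (...) { ... }` in the file being linted.
      CatchClause(node: CatchClauseNode) {
        // node.body is the block statement; an empty statement list means
        // the error is neither logged nor rethrown.
        if (node.body.body.length === 0) {
          context.report({
            node,
            message: "Empty catch block swallows the error; log or rethrow it.",
          });
        }
      },
    };
  },
};

// Exercise the visitor with a fake context instead of a full ESLint run.
const messages: string[] = [];
const visitor = noSilentCatchSketch.create({
  report: (descriptor) => {
    messages.push(descriptor.message);
  },
});
visitor.CatchClause({ type: "CatchClause", body: { body: [] } });
visitor.CatchClause({
  type: "CatchClause",
  body: { body: [{ type: "ExpressionStatement" }] },
});
console.log(messages.length);
```

&lt;p&gt;A production rule would also need to handle catch blocks that contain statements but still drop the error, which is the harder part and is left out of this sketch.&lt;/p&gt;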

&lt;p&gt;One more thing worth saying: doing this by hand is utterly unsustainable. Running 115 manual cycles of "ack alert -&amp;gt; read logs -&amp;gt; context switch -&amp;gt; understand the code -&amp;gt; fix -&amp;gt; open PR -&amp;gt; review -&amp;gt; deploy" would bankrupt any team's engineering bandwidth. &lt;strong&gt;The system absorbs them without anyone noticing, and converts the fix into a new Guide (lint / CI guard / type constraint / guideline) at the same time&lt;/strong&gt; -- that's the actual subject of this post.&lt;/p&gt;

&lt;p&gt;The moment an alert fires, the AI starts an investigation, traces Loki / Product Graph / git blame to root cause, opens a fix PR, runs it through the auto-review from &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt;, APPROVE -&amp;gt; auto-merge -&amp;gt; auto-redeploy. One full loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex harness&lt;/td&gt;
&lt;td&gt;PRs merging unattended / incidents fixed before anyone notices&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;ai-harness-intro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code / docs / DB / infra unified into one graph&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Auto PR review&lt;/td&gt;
&lt;td&gt;webhook -&amp;gt; AI review -&amp;gt; auto-fix -&amp;gt; squash merge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert -&amp;gt; AI investigates -&amp;gt; fix PR + new lint/type gate -&amp;gt; auto redeploy + same pattern auto-rejected from then on&lt;/td&gt;
&lt;td&gt;This article ← you are here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series Final&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa"&gt;cortex-philosophy&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Big picture -- the three layers: Observation, Repair, Strengthening
&lt;/h2&gt;

&lt;p&gt;For Self-Healing to work, you need an &lt;strong&gt;Observation layer&lt;/strong&gt; in front and a &lt;strong&gt;Strengthening layer&lt;/strong&gt; (recurrence prevention) behind it. Self-Healing itself is the middle &lt;strong&gt;Repair layer&lt;/strong&gt;. The "self-healing + self-strengthening" loop only spins up when all three are in place.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;: The three layers only stand up on top of two prior pieces: &lt;strong&gt;cpg&lt;/strong&gt; (the unified code / docs / DB / infra knowledge graph from &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt;) and the &lt;strong&gt;Observability stack&lt;/strong&gt; covered in this post.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Observability&lt;/strong&gt; -&amp;gt; the observation layer is empty, nothing gets detected -&amp;gt; the repair layer never even fires&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cpg&lt;/strong&gt; -&amp;gt; the AI cannot see "where else does this trap exist" -&amp;gt; the repair layer does symptom-level patching at best, and the strengthening layer's horizontal expansion stops working&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Put differently: &lt;strong&gt;trying to copy this setup without those two will just multiply incidents&lt;/strong&gt;. An AI that blindly looks at error logs and rewrites production code is just speeding up the rate at which &lt;code&gt;gh pr create&lt;/code&gt; ships accidents. cpg and Observability are the &lt;strong&gt;minimum bar&lt;/strong&gt; for being able to delegate auto-repair to AI.&lt;/p&gt;

&lt;p&gt;Note also that cortex is a &lt;strong&gt;several-hundred-thousand-line codebase&lt;/strong&gt;, and at that scale loading the whole codebase as AI context is &lt;strong&gt;impossible for the AI as well&lt;/strong&gt; (let alone for a human). Tell the AI to trace impact with just grep and file reads, and it'll run out of context window before it finds anything. cpg is what lets it ask "which other code does this function's change ripple into" and get the answer in one hop. Small repos may not need this. Past a certain scale, cpg is not optional, it's &lt;strong&gt;required&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In Fowler's Guides / Sensors terms from Part 1, cpg and Observability are &lt;strong&gt;the substrate that supports both Guides (pre-execution controls like lint) and Sensors (post-execution gates like auto-review and Self-Healing)&lt;/strong&gt;. Observability feeds Sensors via firing alerts; cpg feeds the Guides side by supplying the auto-review with impact-scoping context. &lt;strong&gt;Neither belongs on one side only&lt;/strong&gt; -- they're foundational to both, and Self-Healing and auto-review only function on top of this substrate. That's the structural claim this post is built around.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oqc8k3k9f6ljwddm50b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oqc8k3k9f6ljwddm50b.png" alt="Three layers -- Observation -&gt; Repair -&gt; Strengthening loop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Key components&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time detection of production anomalies&lt;/td&gt;
&lt;td&gt;OTel SDK / Loki / Mimir / Tempo / Faro / Grafana / Pino logs with trace_id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repair&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI receives the alert, investigates the root cause, and opens a fix PR, which then goes through auto-review, auto-merge, and auto-redeploy&lt;/td&gt;
&lt;td&gt;Event Relay -&amp;gt; SSE -&amp;gt; &lt;code&gt;self-healing&lt;/code&gt; mode script -&amp;gt; claude -p (worktree) -&amp;gt; gh pr create&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strengthening&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The fix PR is required to add a new Guide (lint / CI guard / type constraint / guideline). The same anti-pattern can't reach production again&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@cortex/eslint-plugin-graph&lt;/code&gt; (26 rules), &lt;code&gt;scripts/check-*.ts&lt;/code&gt; (13 guards), &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/recurrence-prevention.md" rel="noopener noreferrer"&gt;&lt;code&gt;recurrence-prevention.md&lt;/code&gt;&lt;/a&gt;, the &lt;code&gt;[Recurrence]&lt;/code&gt; lens of auto-review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I'll walk through them in order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observation -- where do the alerts come from?
&lt;/h2&gt;

&lt;p&gt;cortex's production observability is built on &lt;strong&gt;Grafana Cloud + OpenTelemetry&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OTel SDK&lt;/strong&gt; (the shared &lt;code&gt;@cortex/otel&lt;/code&gt; package) -- every service calls &lt;code&gt;initOtel({ serviceName })&lt;/code&gt; at its entry point. Traces, metrics, and logs all go out via OTLP to Grafana Cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki&lt;/strong&gt; (logs) -- Pino structured logs get &lt;code&gt;trace_id&lt;/code&gt; automatically, so traces and logs can be cross-referenced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mimir&lt;/strong&gt; (metrics) -- Cloud Run / pipeline / Gemini API token usage, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tempo&lt;/strong&gt; (traces) -- distributed tracing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faro&lt;/strong&gt; (frontend) -- captures browser JS errors / performance / network failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; -- dashboards + Alert Rules + Notification Policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also have &lt;strong&gt;a strict definition of log levels, anchored on business impact&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;warn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Business-foreseeable, &lt;strong&gt;does not need immediate action&lt;/strong&gt; (retryable / self-recovers).&lt;/td&gt;
&lt;td&gt;Search query returned 0 results, optional field unset, short retry due to rate limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Data recovery / re-run will definitely be needed afterward&lt;/strong&gt;. Impact expected to be under 20%.&lt;/td&gt;
&lt;td&gt;"User record that should exist isn't there," BigQuery insert failure, per-record enrichment failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fatal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The feature as a whole &lt;strong&gt;fails for 20%+ of requests&lt;/strong&gt;. Service-continuity broken, fatal config missing, full upstream outage.&lt;/td&gt;
&lt;td&gt;OTel init failure, required secret missing at startup, full input data source outage for a pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key point is to &lt;strong&gt;not pick the level mechanically based on the exception class name&lt;/strong&gt; like &lt;code&gt;NotFoundError&lt;/code&gt;. Same "record not found" situation: "this record must exist and doesn't" is &lt;code&gt;error&lt;/code&gt; / &lt;code&gt;fatal&lt;/code&gt;; "user search returned 0 hits" is &lt;code&gt;warn&lt;/code&gt;. &lt;strong&gt;The level is decided by business impact&lt;/strong&gt; -- "does this require data recovery later," "is the whole feature down" -- not by the type. Without this discipline you simultaneously get monitoring fatigue and missed critical incidents. Self-Healing reacts mainly to &lt;code&gt;error&lt;/code&gt;-threshold trips; &lt;code&gt;fatal&lt;/code&gt; is the human-escalation side.&lt;/p&gt;
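
&lt;p&gt;A minimal sketch of that decision, assuming the caller has already assessed the business impact. The &lt;code&gt;ImpactAssessment&lt;/code&gt; shape and the &lt;code&gt;chooseLogLevel&lt;/code&gt; name are hypothetical, not cortex's actual API; only the 20% boundary and the "needs data recovery" criterion come from the table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch: type and function names are illustrative only.
type LogLevel = "warn" | "error" | "fatal";

interface ImpactAssessment {
  // Will data recovery or a re-run definitely be needed afterward?
  needsRecovery: boolean;
  // Estimated share of requests for which the whole feature fails (0 to 1).
  failedRequestRatio: number;
}

const FATAL_RATIO = 0.2; // "fails for 20%+ of requests"

function chooseLogLevel(impact: ImpactAssessment): LogLevel {
  // The input is business impact, never the exception class name.
  if (impact.failedRequestRatio >= FATAL_RATIO) {
    return "fatal";
  }
  if (impact.needsRecovery) {
    return "error";
  }
  return "warn";
}

// The same "record not found" situation lands on two different levels:
const searchMiss = chooseLogLevel({ needsRecovery: false, failedRequestRatio: 0 });
const missingUser = chooseLogLevel({ needsRecovery: true, failedRequestRatio: 0.01 });
console.log(searchMiss, missingUser); // warn error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The useful property is that nothing like &lt;code&gt;NotFoundError&lt;/code&gt; appears in the function's inputs, so the class name cannot leak into the level by accident.&lt;/p&gt;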

&lt;p&gt;Alert Rules are &lt;strong&gt;managed declaratively in Pulumi&lt;/strong&gt;, grouped by service into categories like &lt;code&gt;BOT / Pipeline / Transformer / Generator / Gemini / CI / Deploy / Service Catch-All&lt;/code&gt;. When we add a new service, one line in infra code spins up the dashboards and alerts automatically.&lt;/p&gt;

&lt;p&gt;This is "the infrastructure that lets &lt;strong&gt;the AI see the same things humans see&lt;/strong&gt;." Self-Healing picks up alerts coming off this stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Observability can't catch, Self-Healing can't fix either
&lt;/h3&gt;

&lt;p&gt;Honest disclaimer: Self-Healing can only react to &lt;strong&gt;what the observation layer can detect as an anomaly&lt;/strong&gt;. "Observability is everything" is literally true here.&lt;/p&gt;

&lt;p&gt;What the current stack catches is roughly &lt;strong&gt;logic-level errors&lt;/strong&gt; -- exceptions, error logs, deploy failures, external-API call failures, threshold-based metric anomalies.&lt;/p&gt;

&lt;p&gt;What it doesn't catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UI errors&lt;/strong&gt; -- the logic ran, no error logs, but the screen &lt;strong&gt;shows something different from intent / shows the wrong value&lt;/strong&gt;. Faro catches client-side JS exceptions and network failures, but "the logic ran and the output is just wrong" never fires an alert&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent data corruption&lt;/strong&gt; -- aggregated values slowly drift, bad values get into a table. Unless it crosses a threshold or schema check, nothing detects it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perceived UX degradation&lt;/strong&gt; -- requests feel slow, the UX feels off. Only catchable once SLO / latency thresholds trip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So Self-Healing is "&lt;strong&gt;AI replacing the human in the loop for incidents the observation layer can catch&lt;/strong&gt;." &lt;strong&gt;The coverage of the observation layer itself is the prerequisite.&lt;/strong&gt; Holes in observation stay as blind spots that neither auto-review nor Self-Healing reaches.&lt;/p&gt;

&lt;p&gt;This isn't really a limitation of Self-Healing -- it's a reminder of the &lt;strong&gt;importance of growing the observation stack&lt;/strong&gt;, which cortex keeps investing in. (As covered in &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1&lt;/a&gt;, Observability is one of the "supporting foundations" beneath the flywheel.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Repair -- the Self-Healing flow
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;MODE=self-healing&lt;/code&gt; runs the same &lt;code&gt;webhook-server&lt;/code&gt; script as the auto-review setup from &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt;, but listens for Grafana firing alerts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoa2w71p3guljzdg6sou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoa2w71p3guljzdg6sou.png" alt="Self-Healing full flow -- median 30 min to 1 hr from firing alert to production recovery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In text form, the flow looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Grafana Alert Rule firing]
   ↓ POST /webhook/grafana
[Event Relay (in-house)] -- persisted in Firestore
   ↓ SSE push (event: grafana-alert)
[self-healing mode script]
   ↓ throttle check (same fingerprint skipped for 4h)
   ↓ 👀 reaction in Slack to signal "I'm on it"
   ↓ git worktree add -b hotfix/auto-alert-{service}-{ts} origin/main
   ↓ run claude -p inside the worktree
     - search related code via Product Graph MCP
     - pull error logs from Loki via Grafana MCP
     - identify root cause and fix
     - update tests as needed
     - conventional commit
   ↓ git push + gh pr create
[fix PR]
   ↓ auto-review (the Part 3 pipeline)
   ↓ APPROVE -&amp;gt; auto-merge -&amp;gt; auto-redeploy
[recovered]
   ↓ ✅ in the Slack thread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What happens when the AI judges "this is not fixable in code"
&lt;/h3&gt;

&lt;p&gt;Not every alert is fixable by code. The implementation has a rule: "if you judge it unfixable, exit without changing anything." In that case Slack gets a notification of the form "&lt;strong&gt;This alert cannot be addressed in code. Investigation: ...&lt;/strong&gt;" -- &lt;strong&gt;including what the AI investigated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Worth clarifying on the numbers side: the headline &lt;strong&gt;115&lt;/strong&gt; is "Self-Healing runs that reached PR-created -&amp;gt; merged -&amp;gt; deployed." This "unfixable, exit clean" case is a &lt;strong&gt;separate bucket&lt;/strong&gt;, happening several times a month (external transient outages, infra / config issues that aren't code, cases too complex for the AI to judge confidently). &lt;strong&gt;The "humans show up" bucket is this separate one&lt;/strong&gt; -- it isn't a "some of the 115 failed" failure rate.&lt;/p&gt;

&lt;p&gt;Here's what a real Slack message looks like (a &lt;code&gt;styling-pattern-watch-transformer&lt;/code&gt; case where the GitHub PAT had expired):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ This alert cannot be addressed in code. Investigation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Checked the error logs in Loki for the past hour. The single error source for &lt;code&gt;styling-pattern-watch-transformer&lt;/code&gt; right now is &lt;strong&gt;expiration / revocation of the GitHub PAT (Personal Access Token)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I found&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The BQ snapshot pipeline is healthy: 7 consecutive &lt;code&gt;[BQ] Snapshot rows inserted&lt;/code&gt; successes. The earlier autodetect type-drift problem was resolved in an earlier PR that put &lt;code&gt;SAFE_CAST(... AS INT64)&lt;/code&gt; in place.&lt;/li&gt;
&lt;li&gt;The single ongoing error (multiple occurrences in the past hour): &lt;code&gt;GithubUnauthorizedError: GitHub API /commits/master unauthorized (401): PAT likely expired/revoked&lt;/code&gt; -- the &lt;code&gt;/check-drift&lt;/code&gt; handler returns 401 every time Cloud Scheduler invokes it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Root cause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The GitHub PAT held in Secret Manager for this service has expired or been revoked. The code already classifies the 401 with a dedicated &lt;code&gt;GithubUnauthorizedError&lt;/code&gt; and is already sending the "PAT needs rotation" notification to Slack. &lt;strong&gt;The detection / notification design on the code side is working as intended.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution: exit without changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is an external-credential operational issue (PAT expiry), and what needs to happen is &lt;strong&gt;a human rotates the PAT in Secret Manager&lt;/strong&gt;. Not the kind of incident code can resolve. Therefore: no code change, no commit, no PR.&lt;/p&gt;

&lt;p&gt;Operator actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a new version of the target secret in GCP Console / Secret Manager with a valid PAT (&lt;code&gt;repo&lt;/code&gt; scope, read access to the target repository)&lt;/li&gt;
&lt;li&gt;No Cloud Run revision redeploy needed (&lt;code&gt;secretKeyRef version:latest&lt;/code&gt; is referenced)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this level of detail, "what was investigated, why code can't fix it, what the human should do" all come out in one Slack message. Open the thread and the operator can act immediately. The productivity gap vs. "alerts just forwarded blindly" is significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deduplication
&lt;/h3&gt;

&lt;p&gt;A throttle ensures the same &lt;code&gt;fingerprint&lt;/code&gt; (Grafana's unique alert identifier) is &lt;strong&gt;not re-processed for 4 hours&lt;/strong&gt;. Without this, alerts that fire again before the fix PR has merged would spawn another worktree, another fix PR, and so on -- an easy infinite loop.&lt;/p&gt;

&lt;p&gt;We also &lt;strong&gt;permanently skip&lt;/strong&gt; any &lt;code&gt;alertname&lt;/code&gt; containing &lt;code&gt;credential&lt;/code&gt;. Credential incidents carry leakage risk if the AI touches them, so they're explicitly escalated to humans.&lt;/p&gt;
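
&lt;p&gt;The two skip rules can be sketched as a single gate in front of the worktree step. This is an illustration under assumptions, not the actual script: the class name, the in-memory store, and the case-insensitive match on &lt;code&gt;credential&lt;/code&gt; are all hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch: names and the in-memory store are assumptions.
const THROTTLE_WINDOW_MS = 4 * 60 * 60 * 1000; // 4 hours

interface FiringAlert {
  fingerprint: string; // Grafana's unique alert identifier
  alertname: string;
}

type AlertDecision = "process" | "skip-throttled" | "skip-credential";

class AlertThrottle {
  // Maps a fingerprint to the time (ms) the last run was started for it.
  private lastStartedAt: { [fingerprint: string]: number } = {};

  decide(alert: FiringAlert, nowMs: number): AlertDecision {
    // Credential alerts are never handed to the AI (case-insensitive by assumption).
    if (alert.alertname.toLowerCase().indexOf("credential") !== -1) {
      return "skip-credential";
    }
    const seen = Object.prototype.hasOwnProperty.call(this.lastStartedAt, alert.fingerprint);
    if (seen) {
      const elapsedMs = nowMs - this.lastStartedAt[alert.fingerprint];
      const windowOver = elapsedMs >= THROTTLE_WINDOW_MS;
      if (!windowOver) {
        return "skip-throttled";
      }
    }
    this.lastStartedAt[alert.fingerprint] = nowMs;
    return "process";
  }
}

const throttle = new AlertThrottle();
const sampleAlert = { fingerprint: "abc123", alertname: "Service Error Log Detected" };
console.log(throttle.decide(sampleAlert, 0));              // process
console.log(throttle.decide(sampleAlert, 60 * 60 * 1000)); // skip-throttled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The timestamp is recorded when a run starts, not when it finishes, which is what keeps a re-firing alert from spawning a second worktree while the first fix PR is still open.&lt;/p&gt;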

&lt;h3&gt;
  
  
  Self-Healing and Part 3 auto-review -- "the fixer AI" and "the reviewer AI" are independent
&lt;/h3&gt;

&lt;p&gt;This is the most consequential design choice of the agent setup, so I'm calling it out explicitly.&lt;/p&gt;

&lt;p&gt;PRs opened by Self-Healing are &lt;strong&gt;not special PRs, just fix PRs&lt;/strong&gt;. They go through the Part 3 auto-review pipeline &lt;strong&gt;under exactly the same conditions&lt;/strong&gt; -- the 9 lenses (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) get checked in order. Critical / Major findings -&amp;gt; &lt;code&gt;REQUEST_CHANGES&lt;/code&gt;; Nit-only / no findings + CI green -&amp;gt; &lt;code&gt;APPROVE&lt;/code&gt; -&amp;gt; auto-merge.&lt;/p&gt;
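
&lt;p&gt;As a rough sketch, the verdict rule in the paragraph above reduces to a small pure function. The names are hypothetical, and the &lt;code&gt;HOLD&lt;/code&gt; outcome for a non-green CI is an assumption, since this post only states what happens when CI is green:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch: "HOLD" for a non-green CI is an assumption.
type FindingSeverity = "Critical" | "Major" | "Nit";
type ReviewVerdict = "REQUEST_CHANGES" | "APPROVE" | "HOLD";

function decideVerdict(findings: FindingSeverity[], ciGreen: boolean): ReviewVerdict {
  const hasBlocking = findings.some(function (severity) {
    return severity === "Critical" || severity === "Major";
  });
  if (hasBlocking) {
    return "REQUEST_CHANGES";
  }
  // Nit-only or no findings: approval still requires a green CI.
  if (ciGreen) {
    return "APPROVE";
  }
  return "HOLD";
}

console.log(decideVerdict(["Nit"], true));          // APPROVE
console.log(decideVerdict(["Nit", "Major"], true)); // REQUEST_CHANGES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the function takes no "who opened this PR" input, which is the point: a Self-Healing PR gets no special path.&lt;/p&gt;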

&lt;p&gt;The important bit: &lt;strong&gt;this is not a monolithic "AI fixing AI" loop&lt;/strong&gt;. The fixer-side AI and the reviewer-side AI are fully independent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Different process, different session&lt;/strong&gt;: the self-healing-mode AI and the reviewer-mode AI are launched as separate &lt;code&gt;claude -p&lt;/code&gt; processes. They do not share context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different input sources&lt;/strong&gt;: the fixer builds the problem from Grafana alert + Loki + cpg. The reviewer judges from the PR diff + cpg + review guidelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different objectives&lt;/strong&gt;: the fixer is optimizing for "stop the incident." The reviewer is judging "does this violate the 9 lenses or the severity contract?" This is a deliberate separation of concerns: the two roles' incentives are misaligned on purpose&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, &lt;strong&gt;PRs the fixer dashed off get blocked by the reviewer&lt;/strong&gt; (REQUEST_CHANGES -&amp;gt; back to the fixer). The AI does not approve its own output. "Just-make-it-work" fixes don't get through.&lt;/p&gt;

&lt;p&gt;This is the often-debated &lt;strong&gt;review-independence&lt;/strong&gt; problem in LLM-agent operation, solved here in the obvious way: split the work across separate agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  A concrete example: meet subscription's 409 ALREADY_EXISTS
&lt;/h3&gt;

&lt;p&gt;Take the alert from the Google Meet recording auto-fetch service I covered in &lt;a href="https://dev.to/ryantsuji/how-we-built-an-automated-meeting-intelligence-system-with-google-meet-slack-and-rag-42ln"&gt;the Meeting Intelligence post&lt;/a&gt;. On 2026-05-21, Self-Healing opened a fix PR titled &lt;code&gt;fix(meet-subscription-renewal): auto-fix for Service Error Log Detected&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The trigger error from Loki:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workspace Events API request failed: 409 Conflict
"Subscription associated with the resource already exists."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How the AI investigated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pinned the error in Loki&lt;/strong&gt; -- ran &lt;code&gt;{service_name="meet-subscription-renewal"} | json | level=~"ERROR|error|Error"&lt;/code&gt; via Grafana MCP, picked up the &lt;code&gt;Failed to renew Meet subscription&lt;/code&gt; stack trace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traced the call path in Product Graph&lt;/strong&gt; -- identified &lt;code&gt;renewSubscriptions&lt;/code&gt; -&amp;gt; &lt;code&gt;createMeetSubscription&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-referenced past PRs&lt;/strong&gt; -- the "opposite-direction inconsistency" (name in Firestore but missing from Google = 404) had already been self-healed in another PR with &lt;code&gt;patchMeetSubscriptionTtl&lt;/code&gt; -&amp;gt; null fallback. &lt;strong&gt;The current direction (still on Google's side but missing from Firestore = 409) was the gap&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verdict&lt;/strong&gt;: "the same pattern may exist elsewhere" -- a [Recurrence] decision matrix "&lt;strong&gt;horizontal expansion required&lt;/strong&gt;" case&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of a quick patch, &lt;strong&gt;it implemented self-healing for this direction (409), symmetric to the opposite-direction (404) fallback that was already there&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Made &lt;code&gt;createMeetSubscription&lt;/code&gt; idempotent&lt;/li&gt;
&lt;li&gt;If POST returns 409, extract the existing Subscription name from the response and call &lt;code&gt;patchMeetSubscriptionTtl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The caller writes the return value back into Firestore, so the next renewal converges to the normal PATCH path (&lt;strong&gt;self-healing&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Per the existing &lt;code&gt;graph/no-silent-catch&lt;/code&gt; lint, &lt;code&gt;JSON.parse&lt;/code&gt; failures are also logged via &lt;code&gt;logger.warn&lt;/code&gt; + &lt;code&gt;serializeError&lt;/code&gt;, so they stay visible as structured logs&lt;/li&gt;
&lt;li&gt;Three tests added&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what "Self-Healing pushing all the way to root cause and rolling the fix out horizontally" looks like in practice. &lt;strong&gt;"Close the recurrence class, don't just suppress the symptom"&lt;/strong&gt; (the spirit of &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/recurrence-prevention.md" rel="noopener noreferrer"&gt;&lt;code&gt;recurrence-prevention.md&lt;/code&gt;&lt;/a&gt;) executed autonomously by the AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strengthening -- Guides (lint + guidelines) grow automatically
&lt;/h2&gt;

&lt;p&gt;This is the layer that &lt;strong&gt;keeps Self-Healing from being just "auto-repair."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Fowler's Guides / Sensors terms from &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1&lt;/a&gt;, the Strengthening layer is &lt;strong&gt;the place where Guides grow&lt;/strong&gt; -- i.e. the pre-execution controls that prevent AI from deviating in the first place. cortex's Guides come in two flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine-read Guides&lt;/strong&gt;: lint / type / CI guard / coverage thresholds / Prettier -- enforced at commit / CI time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-and-AI-read Guides&lt;/strong&gt;: guidelines like &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/recurrence-prevention.md" rel="noopener noreferrer"&gt;&lt;code&gt;recurrence-prevention.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/severity.md" rel="noopener noreferrer"&gt;&lt;code&gt;severity.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/ai-antipattern.md" rel="noopener noreferrer"&gt;&lt;code&gt;ai-antipattern.md&lt;/code&gt;&lt;/a&gt;, etc. -- used as decision criteria by auto-review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 9 lenses, severity contract, and no-downgrade rules from &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt; are the latter; the auto-added lints in Part 4 are the former. &lt;strong&gt;Together they form the Guides surface&lt;/strong&gt;. Lints are "formalized guidelines," guidelines are "lints that haven't been formalized yet."&lt;/p&gt;

&lt;p&gt;The Sensors side -- Self-Healing and auto-review -- &lt;strong&gt;grows these Guides every time it runs&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-Healing's root-cause investigation finds "the same pattern exists elsewhere" -&amp;gt; demands horizontal expansion + a new lint (= new Guide)&lt;/li&gt;
&lt;li&gt;Auto-review's &lt;code&gt;[Recurrence]&lt;/code&gt; lens blocks PRs that fix without adding lint&lt;/li&gt;
&lt;li&gt;Both depend on &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cpg&lt;/a&gt; to see impact scope across the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;cpg is what lets the AI ask "where else does this trap exist." Self-Healing and auto-review (= the Sensors side) &lt;strong&gt;share cpg as a substrate, and each run thickens Guides by one notch&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5m28vtcko417bg44sr3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5m28vtcko417bg44sr3e.png" alt="cpg as the shared substrate; Self-Healing and auto-review (Sensors) grow Guides"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens every time Self-Healing runs (the recurrence-prevention-first flow)
&lt;/h3&gt;

&lt;p&gt;Every fix PR Self-Healing opens is checked for &lt;code&gt;[Recurrence]&lt;/code&gt; by auto-review. The decision matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Required action&lt;/th&gt;
&lt;th&gt;Form&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same trap stepped on 2+ times&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Lint required&lt;/strong&gt; (custom ESLint rule / type constraint / CI guard)&lt;/td&gt;
&lt;td&gt;Machine (new Guide)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pattern may exist elsewhere&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Horizontal expansion required&lt;/strong&gt; (cpg traversal for similar nodes, fix all of them in this PR)&lt;/td&gt;
&lt;td&gt;Investigation + fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cannot be machine-checked but worth formalizing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Add to an existing guideline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Guideline entry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-off, no value in formalization&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Nothing&lt;/strong&gt; (bug fix only)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
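
&lt;p&gt;One way to read the matrix above is as a function from observed signals to required actions. The sketch below is illustrative only: the field names are hypothetical, and it assumes rows can apply together (for example, lint plus horizontal expansion in one PR), which the matrix itself does not spell out.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of the decision matrix; names are illustrative.
interface RecurrenceSignals {
  timesSteppedOn: number;     // how often this trap has been hit, including now
  mayExistElsewhere: boolean; // similar nodes could carry the same pattern
  guidelineWorthy: boolean;   // cannot be machine-checked but worth formalizing
}

type RequiredAction =
  | "lint-required"
  | "horizontal-expansion"
  | "guideline-entry"
  | "nothing";

function requiredActions(signals: RecurrenceSignals): RequiredAction[] {
  const actions: RequiredAction[] = [];
  if (signals.timesSteppedOn >= 2) {
    actions.push("lint-required"); // custom ESLint rule / type constraint / CI guard
  } else if (signals.guidelineWorthy) {
    actions.push("guideline-entry");
  }
  if (signals.mayExistElsewhere) {
    actions.push("horizontal-expansion"); // cpg traversal, fix all hits in this PR
  }
  if (actions.length === 0) {
    actions.push("nothing"); // one-off: bug fix only
  }
  return actions;
}

const repeatedTrap = requiredActions({
  timesSteppedOn: 2,
  mayExistElsewhere: true,
  guidelineWorthy: false,
});
console.log(repeatedTrap.join(", ")); // lint-required, horizontal-expansion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;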

&lt;p&gt;When the "stepped on 2+ times" situation applies, &lt;strong&gt;the fix PR can't merge without a new lint included&lt;/strong&gt;. So every Self-Healing run produces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal expansion via cpg&lt;/strong&gt; -- not just the immediate fix target, every similar node enumerated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A new Guide added in the same PR&lt;/strong&gt; -- ESLint custom rule / type constraint / CI guard / guideline entry, one of the four&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All existing violations cleared in the same PR&lt;/strong&gt; -- no &lt;code&gt;warn&lt;/code&gt;-as-deferral, &lt;code&gt;error&lt;/code&gt; on first introduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-review -&amp;gt; auto-merge -&amp;gt; auto-redeploy&lt;/strong&gt; -- the regular Part 3 pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Going forward, writing the same pattern gets mechanically rejected by CI / lint&lt;/strong&gt; -- the recurrence class is structurally closed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3imbzr9glo3hdw7wqwps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3imbzr9glo3hdw7wqwps.png" alt="5 steps every Self-Healing run produces -- recurrence-prevention-first flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Add the guard while you fix the bug" runs as a self-sustaining loop driven by Self-Healing.&lt;/p&gt;

&lt;h3&gt;
  
  
  "We'll do it later" and "introduce as &lt;code&gt;warn&lt;/code&gt;" are banned
&lt;/h3&gt;

&lt;p&gt;A couple of important contract clauses from the guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Plan to lint later," "lint when we refactor," "another PR will handle this" -- &lt;strong&gt;all banned&lt;/strong&gt;. If it can be addressed in this PR, it must be&lt;/li&gt;
&lt;li&gt;"Existing violations remain, so introduce as &lt;code&gt;warn&lt;/code&gt; and promote to &lt;code&gt;error&lt;/code&gt; later" -- &lt;strong&gt;not accepted&lt;/strong&gt;. This is deferral in disguise. The responsibility for the &lt;code&gt;warn&lt;/code&gt;-&amp;gt;&lt;code&gt;error&lt;/code&gt; promotion goes nowhere and the rule rots&lt;/li&gt;
&lt;li&gt;If you add a lint rule, &lt;strong&gt;fix all existing violations in the same PR and ship at &lt;code&gt;error&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These extend the &lt;strong&gt;no-downgrade rules&lt;/strong&gt; from &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt; -- preempting the typical escape hatches.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "step on it, mechanize it" lineage
&lt;/h3&gt;

&lt;p&gt;Custom Guides currently piled up in cortex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;graph/no-silent-catch&lt;/code&gt;&lt;/strong&gt; (ESLint) -- the source of the "inflated number" mentioned in the intro. Bans catch blocks that swallow exceptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stacktrace-preservation guideline&lt;/strong&gt; (codified as a Major violation in &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/observability.md" rel="noopener noreferrer"&gt;&lt;code&gt;observability.md&lt;/code&gt;&lt;/a&gt;, caught by auto-review) -- forbids &lt;code&gt;logger.error(err.message)&lt;/code&gt; style logs that drop the stack and keep only the message string. Forces the &lt;code&gt;err&lt;/code&gt; field to hold &lt;code&gt;serializeError(error)&lt;/code&gt; so &lt;code&gt;name&lt;/code&gt; / &lt;code&gt;message&lt;/code&gt; / &lt;code&gt;stack&lt;/code&gt; are preserved as structured fields. &lt;strong&gt;Observability is everything&lt;/strong&gt; here, so logs that drop stack info are treated as inherently broken&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cortex-quality/require-fetch-timeout&lt;/code&gt;&lt;/strong&gt; (oxlint) -- mandates &lt;code&gt;signal: AbortSignal.timeout(...)&lt;/code&gt; on external &lt;code&gt;fetch&lt;/code&gt; calls. Born from a case where a no-timeout &lt;code&gt;fetch&lt;/code&gt; hung indefinitely and triggered a Cloud Tasks redelivery storm. (oxlint is a Rust-implemented JS/TS linter that runs ESLint-compatible rule sets, dozens of times faster than ESLint. cortex uses oxlint for the standard ruleset and ESLint for custom rules that need AST-level work.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;graph/no-bq-string-timestamp-param&lt;/code&gt;&lt;/strong&gt; (ESLint) -- from a case where passing TIMESTAMP as a string to a BigQuery query parameter NULLed the value out through a serializer bug and silently failed every INSERT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;graph/require-firestore-ignore-undefined&lt;/code&gt;&lt;/strong&gt; (ESLint) -- forces &lt;code&gt;ignoreUndefinedProperties: true&lt;/code&gt; on &lt;code&gt;new Firestore()&lt;/code&gt;. From a case where a single NULL row caused a 100% failure rate in a sync batch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;check-otel-env-injection&lt;/code&gt;&lt;/strong&gt; (CI guard) -- the recurrence prevention for the Cloud Run OTel env injection case below&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript type tightening&lt;/strong&gt; (type level) -- tighter function signatures, branded types for ID disambiguation, exhaustive discriminated unions, etc. Patterns that can't be lint-caught but are catchable at the type level get closed from the type side&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't textbook-learnable rules -- they're "&lt;strong&gt;stepped on once, then mechanized&lt;/strong&gt;." The number of traps the organization has stepped on translates directly into the number of Guides piled up (across ESLint / oxlint / CI guard / types).&lt;/p&gt;
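
&lt;p&gt;To make the stacktrace-preservation item concrete, here is a small sketch of what "keep &lt;code&gt;name&lt;/code&gt; / &lt;code&gt;message&lt;/code&gt; / &lt;code&gt;stack&lt;/code&gt; as structured fields" can look like. &lt;code&gt;toStructuredError&lt;/code&gt; is a hypothetical stand-in written for this example; cortex's real &lt;code&gt;serializeError&lt;/code&gt; is not shown in this post and may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical stand-in for serializeError; the real helper may differ.
interface StructuredError {
  name: string;
  message: string;
  stack: string;
}

function toStructuredError(error: unknown): StructuredError {
  if (error instanceof Error) {
    return { name: error.name, message: error.message, stack: error.stack ?? "" };
  }
  // Non-Error throwables (strings, plain objects) have no stack to preserve.
  return { name: "NonError", message: String(error), stack: "" };
}

function parseConfig(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch (error) {
    // Logging only error.message would drop the stack.
    // Instead, the whole error goes under "err" as structured fields.
    const entry = { level: "warn", msg: "config parse failed", err: toStructuredError(error) };
    console.log(JSON.stringify(entry));
    return null;
  }
}

parseConfig("{ not json");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;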

&lt;h3&gt;
  
  
  How does the AI write a lint rule without breaking it?
&lt;/h3&gt;

&lt;p&gt;Three structural things keep this sane:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing rules are the template&lt;/strong&gt;: &lt;code&gt;packages/eslint-plugin-graph/src/rules/&lt;/code&gt; already holds 26 custom rules, each as &lt;code&gt;.ts&lt;/code&gt; + &lt;code&gt;.test.ts&lt;/code&gt; pairs. New rules follow the same shape, so the AI never has to write the AST-walking boilerplate from scratch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests first&lt;/strong&gt;: violation / pass fixtures go into &lt;code&gt;.test.ts&lt;/code&gt; first, implementation fills in TDD-style. Coverage threshold (90% statements + branches) is gated by the &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt; auto-review, so a lint without tests cannot merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lint / type / CI guard sit in the same "mechanize" bucket&lt;/strong&gt;: the decision matrix in &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/recurrence-prevention.md" rel="noopener noreferrer"&gt;&lt;code&gt;recurrence-prevention.md&lt;/code&gt;&lt;/a&gt; groups lint / type constraint / CI guard together as the "lint-required" row, and leaves the choice within that bucket (write it as a lint? express it at the type level? add a separate CI guard?) to the AI based on how much AST work is involved and whether runtime semantics matter. Traps that need AST inspection but actually hinge on runtime behavior usually end up as a type constraint (branded type / discriminated union / signature tightening) rather than a custom lint&lt;/li&gt;
&lt;/ul&gt;
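
&lt;p&gt;To make the "tests first" step concrete, here is a minimal sketch of the fixture-then-implementation order. It is an illustration only: the rule is modeled as a plain function over source text with a regex, whereas real ESLint rules walk the AST, and &lt;code&gt;Fixture&lt;/code&gt;, &lt;code&gt;noSilentCatch&lt;/code&gt;, and &lt;code&gt;failingFixtures&lt;/code&gt; are hypothetical names rather than the code in &lt;code&gt;eslint-plugin-graph&lt;/code&gt;.&lt;/p&gt;

```typescript
// A "rule" here is modeled as a plain function from source text to messages.
// Real ESLint rules walk the AST; the regex below is a deliberate simplification.
type Fixture = { name: string; code: string; expectedErrors: number };

// Step 1 (written first): pass and violation fixtures.
const FIXTURES: Fixture[] = [
  { name: "handled catch passes", code: "try { f(); } catch (e) { log(e); }", expectedErrors: 0 },
  { name: "empty catch is a violation", code: "try { f(); } catch (e) {}", expectedErrors: 1 },
  { name: "empty catch without binding", code: "try { f(); } catch { }", expectedErrors: 1 },
];

// Step 2 (filled in afterwards): the check itself.
// Flags catch blocks whose body contains nothing but whitespace.
function noSilentCatch(code: string): string[] {
  const emptyCatch = /catch\s*(\([^)]*\))?\s*\{\s*\}/g;
  const matches: string[] = code.match(emptyCatch) ?? [];
  return matches.map(() => "catch block must not swallow the error silently");
}

// Returns the names of fixtures whose error count differs from the expectation.
// An empty result means every fixture behaves as specified.
function failingFixtures(fixtures: Fixture[]): string[] {
  return fixtures
    .filter((f) => noSilentCatch(f.code).length !== f.expectedErrors)
    .map((f) => f.name);
}
```

&lt;p&gt;Because the fixtures exist before the check does, a rule that never fires (or fires on valid code) shows up immediately as a non-empty list of failing fixtures.&lt;/p&gt;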

&lt;p&gt;So "AI writes a lint rule" is supported by &lt;strong&gt;existing rule corpus + the test harness + the mechanize-bucket selection criteria&lt;/strong&gt; -- three together. The path where the AI hand-rolls raw ESLint API and bricks something is structurally closed.&lt;/p&gt;

&lt;h3&gt;
  
  
  A concrete example: Cloud Run OTel env injection -&amp;gt; promoted to CI guard
&lt;/h3&gt;

&lt;p&gt;Multiple services hit this trap: when a Cloud Run Service / Job is defined in Pulumi, forgetting to inject &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; and &lt;code&gt;GRAFANA_CLOUD_API_KEY&lt;/code&gt; via &lt;code&gt;secretKeyRef&lt;/code&gt; causes OTel init to be skipped in production, no trace/log reaches Grafana, and incidents become silently invisible.&lt;/p&gt;

&lt;p&gt;The normal response would be "we'll be more careful next time." At cortex:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Incident surfaces -&amp;gt; Self-Healing opens a fix PR (adds the env injection to the affected service)&lt;/li&gt;
&lt;li&gt;Auto-review's &lt;code&gt;[Recurrence]&lt;/code&gt; decides "same trap stepped on -&amp;gt; lint required"&lt;/li&gt;
&lt;li&gt;The same PR adds &lt;code&gt;scripts/check-otel-env-injection.ts&lt;/code&gt; (CI guard) -- mechanically asserts OTel env injection across all Cloud Run resource definitions under &lt;code&gt;infra/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;All other existing services get their env injection added in the same PR&lt;/li&gt;
&lt;li&gt;Merge -&amp;gt; deploy -&amp;gt; any future write of the same kind gets rejected by CI&lt;/li&gt;
&lt;/ol&gt;
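
&lt;p&gt;A guard of this kind can be quite small. The sketch below assumes the Cloud Run definitions have already been extracted into plain objects; the &lt;code&gt;CloudRunDefinition&lt;/code&gt; shape and &lt;code&gt;findMissingOtelEnv&lt;/code&gt; are hypothetical names for illustration, and the real &lt;code&gt;scripts/check-otel-env-injection.ts&lt;/code&gt; is not shown here and may work differently.&lt;/p&gt;

```typescript
// Hypothetical, simplified view of one Cloud Run Service / Job definition.
// How definitions are extracted from infra/ is out of scope for this sketch.
type EnvEntry = { name: string; fromSecret: boolean };
type CloudRunDefinition = { resource: string; env: EnvEntry[] };

// The two env vars named above that must be injected via secretKeyRef.
const REQUIRED_OTEL_ENV = ["OTEL_EXPORTER_OTLP_ENDPOINT", "GRAFANA_CLOUD_API_KEY"];

// Returns one message per env var that is missing or not sourced from a secret.
// An empty result means the guard passes.
function findMissingOtelEnv(definitions: CloudRunDefinition[]): string[] {
  const violations: string[] = [];
  for (const def of definitions) {
    for (const required of REQUIRED_OTEL_ENV) {
      const ok = def.env.some((e) => (e.name === required ? e.fromSecret : false));
      if (!ok) {
        violations.push(`${def.resource}: ${required} is not injected via secretKeyRef`);
      }
    }
  }
  return violations;
}
```

&lt;p&gt;In CI, a non-empty result would fail the job, so a new Cloud Run definition without the injection cannot merge.&lt;/p&gt;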

&lt;p&gt;That's what "the guardrails grow every time Self-Healing runs" looks like in practice. The trap is "stepped on -&amp;gt; mechanically checked from then on."&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Guides stand right now (in numbers)
&lt;/h3&gt;

&lt;p&gt;Snapshot of cortex's Guide inventory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Custom ESLint rules&lt;/strong&gt; (&lt;code&gt;@cortex/eslint-plugin-graph&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;no-silent-catch&lt;/code&gt; / &lt;code&gt;require-firestore-ignore-undefined&lt;/code&gt; / &lt;code&gt;no-bq-string-timestamp-param&lt;/code&gt; etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;CI guards&lt;/strong&gt; (&lt;code&gt;scripts/check-*.ts&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;check-otel-env-injection&lt;/code&gt; / &lt;code&gt;check-cloudscheduler-oidctoken-audience&lt;/code&gt; etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Standard oxlint rules&lt;/strong&gt; (set to &lt;code&gt;error&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;183&lt;/td&gt;
&lt;td&gt;Base config ships everything at error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;TypeScript strict gates&lt;/strong&gt; (baseline)&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;strict&lt;/code&gt; / &lt;code&gt;noImplicitAny&lt;/code&gt; / &lt;code&gt;strictNullChecks&lt;/code&gt; / &lt;code&gt;noUncheckedIndexedAccess&lt;/code&gt; etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;TypeScript type tightening&lt;/strong&gt; (per-recurrence)&lt;/td&gt;
&lt;td&gt;grows over time&lt;/td&gt;
&lt;td&gt;branded type / discriminated union / function-signature tightening etc. Patterns that can't be lint-caught but can be type-caught are closed from the type side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test coverage thresholds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;statements + branches 90%&lt;/td&gt;
&lt;td&gt;Uniform across all packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prettier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 config&lt;/td&gt;
&lt;td&gt;Format auto-fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Guidelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the entire review-guidelines repo&lt;/td&gt;
&lt;td&gt;Used as the decision basis by auto-review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first two categories plus the type-tightening row -- &lt;strong&gt;Custom ESLint, CI guard, type tightening&lt;/strong&gt; -- are the part that &lt;strong&gt;compounds over time&lt;/strong&gt; through the &lt;code&gt;[Recurrence]&lt;/code&gt; lens every time Self-Healing or auto-review runs. &lt;strong&gt;The guardrails grow with time.&lt;/strong&gt; That's the substance of the Strengthening layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The whole loop, from the top
&lt;/h2&gt;

&lt;p&gt;When you compose the three layers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[production anomaly] -&amp;gt; Observation layer (OTel/Loki/Grafana) -&amp;gt; Alert firing
                                              ↓
                                       Event Relay -&amp;gt; SSE
                                              ↓
[Self-Healing mode script]
   - claude -p in worktree
   - root cause via cpg + Loki + git blame
   - commit fix
   - (if applicable) add new lint / type gate too
   - gh pr create
                                              ↓
[Auto-review (Part 3)] -- 9 lenses in order, especially [Recurrence] forces
                         recurrence-prevention action (lint / horizontal expansion / guideline entry)
                                              ↓
                          APPROVE + CI green
                                              ↓
[auto-merge -&amp;gt; Turborepo build -&amp;gt; Pulumi parallel deploy]
                                              ↓
[production recovered + same anti-pattern mechanically rejected from now on]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop &lt;strong&gt;completes without human intervention&lt;/strong&gt;. Not just repair, but the quality gates that grow with every repair -- that's the "auto-recovery + auto-strengthening" substance at cortex.&lt;/p&gt;

&lt;p&gt;That said, as the top of this post spelled out, &lt;strong&gt;the loop is only viable because cpg and Observability exist&lt;/strong&gt;. cpg makes horizontal expansion possible; Observability turns production anomalies into structured data. With those two in place at the foundation, AI can stand on the side that does Repair and Strengthening. &lt;strong&gt;Self-Healing is not a standalone mechanism. It's a Sensor riding on top of cortex's Guides (cpg + Observability + lint + guidelines).&lt;/strong&gt; That's the single most important framing in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Healing by the numbers
&lt;/h2&gt;

&lt;p&gt;Breaking the headline down further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Main firing categories
&lt;/h3&gt;

&lt;p&gt;What kicked off Self-Healing in the past 30 days, mapped back to the two buckets from the top of the post:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Service Error Log Detected&lt;/strong&gt; (most frequent)&lt;/td&gt;
&lt;td&gt;Production-runtime (61 side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Pipeline Failure&lt;/strong&gt; -- data pipeline failing a configured number of times in a row&lt;/td&gt;
&lt;td&gt;Production-runtime (61 side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Generator Failure&lt;/strong&gt; -- AI generation jobs (embedding / annotation etc.) failing&lt;/td&gt;
&lt;td&gt;Production-runtime (61 side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Deploy Failed&lt;/strong&gt; -- deploy step failures (Pulumi up / Cloud Run revision failed)&lt;/td&gt;
&lt;td&gt;Deploy step (54 side)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Alert-firing to production-recovery time
&lt;/h3&gt;

&lt;p&gt;The median falls between &lt;strong&gt;30 minutes and 1 hour&lt;/strong&gt;. Roughly, by stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert firing -&amp;gt; AI investigation start: under 1 minute (Event Relay + SSE)&lt;/li&gt;
&lt;li&gt;AI investigation + fix + PR open: 3-8 minutes&lt;/li&gt;
&lt;li&gt;Auto-review (including the review-fix loop, which averages &lt;strong&gt;10.8 iterations per PR&lt;/strong&gt; as reported in &lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt;): 20-45 minutes&lt;/li&gt;
&lt;li&gt;Auto-merge + deploy: 3-10 minutes&lt;/li&gt;
&lt;/ul&gt;
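
&lt;p&gt;As a quick sanity check, summing the stage ranges above gives the end-to-end window. The sketch treats "under 1 minute" as 0 to 1 minute, which is an assumption, and &lt;code&gt;Stage&lt;/code&gt; and &lt;code&gt;totalRange&lt;/code&gt; are illustrative names only.&lt;/p&gt;

```typescript
// Stage durations in minutes, taken from the list above.
// "under 1 minute" is modeled as [0, 1], which is an assumption.
type Stage = { name: string; minMinutes: number; maxMinutes: number };

const STAGES: Stage[] = [
  { name: "alert firing to investigation start", minMinutes: 0, maxMinutes: 1 },
  { name: "investigation + fix + PR open", minMinutes: 3, maxMinutes: 8 },
  { name: "auto-review loop", minMinutes: 20, maxMinutes: 45 },
  { name: "auto-merge + deploy", minMinutes: 3, maxMinutes: 10 },
];

// Sums the lower and upper bounds independently.
function totalRange(stages: Stage[]): { min: number; max: number } {
  let min = 0;
  let max = 0;
  for (const s of stages) {
    min += s.minMinutes;
    max += s.maxMinutes;
  }
  return { min, max };
}

// totalRange(STAGES) returns { min: 26, max: 64 }
```

&lt;p&gt;The bounds come out to roughly 26 to 64 minutes, which is consistent with the stated median, and the auto-review loop accounts for most of that time.&lt;/p&gt;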

&lt;p&gt;Many of these finish before anyone wakes up (alert fires early morning -&amp;gt; by the time people come in, there's just a ✅ in Slack).&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed / Bridge to Part 5
&lt;/h2&gt;

&lt;p&gt;We've now covered &lt;strong&gt;the cortex picture across Parts 1-4&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1&lt;/a&gt;: the cortex big picture and harness-engineering framing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt;: Product Graph (cpg) -- the AI's "brain"&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3&lt;/a&gt;: auto-review -- defending quality at the PR stage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 4 (this post): Self-Healing + Observability + auto-added guardrails -- defending quality in production while growing the quality gates themselves&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineering role has shifted, over the last half-year, from "&lt;strong&gt;write&lt;/strong&gt;, &lt;strong&gt;review&lt;/strong&gt;, &lt;strong&gt;fix&lt;/strong&gt;, &lt;strong&gt;merge&lt;/strong&gt;, &lt;strong&gt;deploy&lt;/strong&gt;, &lt;strong&gt;incident-respond&lt;/strong&gt;" -- all of that -- toward &lt;strong&gt;looking at the whole system from above and tuning it&lt;/strong&gt;. &lt;code&gt;human-on-the-loop&lt;/code&gt;, working at the Policy layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 5&lt;/strong&gt; covers the harness reaching the "who writes the code" layer. The center of it is &lt;strong&gt;domain experts (business-side managers, PMOs — non-engineers) opening PRs to production&lt;/strong&gt;, with a concrete walk-through of a +1,742 line / 41 file feature PR that landed with zero human reviewers in the loop. What guarantees the quality is the harness stack built across this series — "whoever writes, the harness owns the quality gate" is the Part 5 framing.&lt;/p&gt;

&lt;p&gt;The consumer-facing (toC) service expansion gets a brief mention at the end for direction, but the full implementation discussion lives in a separate post.&lt;/p&gt;

&lt;p&gt;The actual series wrap-up is &lt;strong&gt;Part 6&lt;/strong&gt;. The center of it is &lt;strong&gt;the underlying philosophy&lt;/strong&gt; -- why I picked this design, what I gave up, what I kept. Alongside that, since the series so far has been mostly "what's working," I want to look back at the failures and dead ends behind that surface, and the gap between the philosophy and the implementation. A retrospective for myself, and -- hopefully -- a reference for anyone starting down a similar path.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>github</category>
      <category>observability</category>
    </item>
    <item>
      <title>Human-on-the-Loop: AI Reviewing AI PRs at cortex -- 769 PRs/month while raising the quality bar (Series Part 3)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Tue, 26 May 2026 14:35:43 +0000</pubDate>
      <link>https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5</link>
      <guid>https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1 (intro)&lt;/a&gt; I covered the high level -- &lt;strong&gt;AI driving both PR reviews and incident response on top of cortex&lt;/strong&gt;. In &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2 (Product Graph)&lt;/a&gt; I went deep on &lt;strong&gt;cpg&lt;/strong&gt;, the unified knowledge graph that fuses code, docs, DB schemas and infra into a single business-aware index.&lt;/p&gt;

&lt;p&gt;This post is about &lt;strong&gt;the automated PR review pipeline&lt;/strong&gt; -- AI reviews the PR, a separate AI applies the fixes, and the system merges automatically once policy gates pass. The usual critiques of AI-assisted development ("&lt;strong&gt;the reviewer becomes the bottleneck&lt;/strong&gt;" and "&lt;strong&gt;AI code drops the quality bar&lt;/strong&gt;") don't really apply here. The rest of this post unpacks why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex harness&lt;/td&gt;
&lt;td&gt;PRs merging unattended / incidents fixed before anyone notices&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;ai-harness-intro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code / docs / DB / infra unified into one graph&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Auto PR review&lt;/td&gt;
&lt;td&gt;webhook -&amp;gt; AI review -&amp;gt; auto-fix -&amp;gt; squash merge&lt;/td&gt;
&lt;td&gt;This article ← you are here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert -&amp;gt; AI investigates -&amp;gt; fix PR + new lint/type gate -&amp;gt; auto redeploy + same-pattern writes get auto-rejected&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series Final&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa"&gt;cortex-philosophy&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Start with last month's numbers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;769 PRs merged.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Median time to merge: 31 minutes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human review involvement per PR: near-zero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a typical 30 days on cortex (Apr 21 -- May 21).&lt;/p&gt;

&lt;p&gt;Every one of those 769 PRs had an AI reviewer as the first reviewer, with &lt;strong&gt;an average of 10.8 review-fix loop iterations per PR (max 56)&lt;/strong&gt;. 1 in 5 merged within 10 minutes, roughly half within 30 minutes. What humans do now is look at review outcomes and &lt;strong&gt;tune the review prompt and the guidelines themselves&lt;/strong&gt; -- this is &lt;strong&gt;human-on-the-loop, not human-in-the-loop&lt;/strong&gt;. &lt;strong&gt;Humans operate on the policy layer, not the execution layer.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Past 30 days&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRs merged&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;769&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI reviewer coverage&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg review iterations / PR&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max review iterations&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-PR human review&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median time-to-merge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged within 10 min&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged within 30 min&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is a typical month on cortex now.&lt;/p&gt;

&lt;p&gt;The common refrain -- "&lt;strong&gt;AI speeds up writing but reviews still bottleneck&lt;/strong&gt;" and "&lt;strong&gt;AI-written code lowers quality&lt;/strong&gt;" -- is something cortex absorbs through &lt;strong&gt;a pipeline where neither failure mode can take hold&lt;/strong&gt;. Let me break it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the review bottleneck stops forming
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The conventional wisdom: the reviewer becomes the bottleneck
&lt;/h3&gt;

&lt;p&gt;As AI writes faster, the load on whoever reviews the output grows proportionally. Anthropic's write-up of its internal usage (&lt;a href="https://www.anthropic.com/news/how-anthropic-teams-use-claude-code" rel="noopener noreferrer"&gt;How Anthropic teams use Claude Code&lt;/a&gt;) reports the same pattern -- &lt;strong&gt;the bottleneck has shifted from writing to reviewing&lt;/strong&gt;, and senior engineers' work has moved from writing code toward integrating and reviewing AI output.&lt;/p&gt;

&lt;p&gt;cortex hit exactly this. The moment we ran Claude Code at full throttle, &lt;strong&gt;writing speed jumped by an order of magnitude or more&lt;/strong&gt;. Meanwhile the human time available to read and approve PRs only grew linearly. If the reviewer (=me) took a day off, the whole org stalled -- a classic single point of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  cortex's answer: move the reviewer role to AI as well
&lt;/h3&gt;

&lt;p&gt;Part 1 and Part 2 kept returning to the same question: "&lt;strong&gt;how far do you push the harness?&lt;/strong&gt;" cortex went all-in: &lt;strong&gt;the AI writes the code, the AI reviews the code&lt;/strong&gt;. What humans keep their hands on is "&lt;strong&gt;tuning the prompts and guidelines themselves&lt;/strong&gt;" -- not making decisions inside each individual PR, but watching the system from above and adjusting.&lt;/p&gt;

&lt;p&gt;Three conditions had to hold for this to work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The AI reviewer has enough context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A generic AI reviewer &lt;strong&gt;only sees the PR diff&lt;/strong&gt;. The diff alone hides business meaning, upstream/downstream dependencies, and prior incident history. cortex feeds the &lt;strong&gt;Product Graph (cpg)&lt;/strong&gt; from &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt; -- &lt;strong&gt;a knowledge graph that fuses code, docs, DB schemas, and infra into one structure, with each node carrying business role and upstream/downstream dependencies&lt;/strong&gt; -- into the AI reviewer, so it can &lt;strong&gt;trace impact into code that the PR didn't even touch&lt;/strong&gt;. It catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missed upstream/downstream fixes&lt;/li&gt;
&lt;li&gt;Missed doc updates&lt;/li&gt;
&lt;li&gt;Tests that should have been updated but weren't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Diff-only AI review can never reach this territory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reviews are not improvisational&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If reviews shift day to day, the team gets confused, and the AI can't be told what "correct" looks like. We enforce this by passing &lt;strong&gt;an explicit review-guideline document&lt;/strong&gt; as the mandatory citation source for every review (we open-sourced a snapshot, see below).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;False positives don't blanket-block merges&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Treating every false positive as Critical breaks the workflow. We control this with &lt;strong&gt;a severity hierarchy (Critical / Major / Minor / Nit) plus strict no-downgrade rules&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So: the cpg from &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt; solves "&lt;strong&gt;what context the AI sees&lt;/strong&gt;," the review guidelines solve "&lt;strong&gt;what the AI should do&lt;/strong&gt;" as &lt;strong&gt;Guides (pre-execution control)&lt;/strong&gt;, and the severity ladder + no-downgrade rules solve "&lt;strong&gt;what the AI must not do&lt;/strong&gt;" as &lt;strong&gt;Sensors (post-execution control)&lt;/strong&gt;. This maps cleanly onto Martin Fowler's Guides / Sensors taxonomy (introduced back in &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;One more upstream layer: before any of those three kicks in, &lt;strong&gt;a 500-lines-per-file lint&lt;/strong&gt; keeps every file in any PR small enough to fit in a single AI session. That alone keeps AI review from breaking down, and unlike a human reviewer, the AI doesn't lose focus. There are plenty of other lints in front of the AI reviewer too, but the full picture belongs to &lt;strong&gt;Part 4 (Self-Healing + observability + auto-added guardrails)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the auto-review system is wired
&lt;/h2&gt;

&lt;p&gt;The implementation is &lt;strong&gt;a script running on each developer's machine&lt;/strong&gt;. GitHub webhooks land on an in-house &lt;strong&gt;Event Relay server&lt;/strong&gt;, get persisted to Firestore, and each developer's machine subscribes as an SSE client. On reconnect, Last-Event-ID replays anything missed -- zero event loss, single webhook registration. &lt;strong&gt;Reviewer-mode machines stay always-on&lt;/strong&gt;, so any incoming review fires immediately. &lt;strong&gt;Author mode runs in the background on the PR author's own machine&lt;/strong&gt;, alongside their normal dev work.&lt;/p&gt;

&lt;h3&gt;
  
  
  How we ended up with Event Relay
&lt;/h3&gt;

&lt;p&gt;The current setup wasn't the original design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First&lt;/strong&gt;: GitHub webhook → &lt;a href="https://smee.io/" rel="noopener noreferrer"&gt;smee.io&lt;/a&gt; → each machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Then&lt;/strong&gt;: GitHub webhook → Cloudflare Tunnel → each machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Now&lt;/strong&gt;: GitHub webhook → in-house &lt;strong&gt;Event Relay&lt;/strong&gt; with Firestore persistence → SSE to each machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both smee.io and Cloudflare Tunnel suffered from &lt;strong&gt;connection drops and missed deliveries&lt;/strong&gt;, and we lost real events as a result. Switching to the in-house Event Relay brought event loss to zero (&lt;strong&gt;Firestore persistence + Last-Event-ID replay&lt;/strong&gt;), and the relay turned into a general-purpose layer we could reuse.&lt;/p&gt;
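
&lt;p&gt;The persistence-plus-replay idea can be sketched in a few lines. This is an assumption-laden illustration, not the relay's actual code: an in-memory array stands in for Firestore, IDs are simple increasing integers, payloads are assumed to be single-line strings, and &lt;code&gt;EventLog&lt;/code&gt;, &lt;code&gt;replayAfter&lt;/code&gt;, and &lt;code&gt;toSseFrame&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```typescript
// In-memory stand-in for the persisted event log (the post uses Firestore).
type RelayEvent = { id: number; source: string; payload: string };

class EventLog {
  private events: RelayEvent[] = [];
  private nextId = 1;

  // Stores the event and assigns it a strictly increasing id.
  append(source: string, payload: string): RelayEvent {
    const event = { id: this.nextId, source, payload };
    this.nextId += 1;
    this.events.push(event);
    return event;
  }

  // Everything a client missed: events with an id greater than the last one it saw.
  // A client connecting for the first time passes 0.
  replayAfter(lastEventId: number): RelayEvent[] {
    return this.events.filter((e) => e.id > lastEventId);
  }
}

// Formats one event as an SSE frame. The "id:" field is what the client
// sends back in the Last-Event-ID header when it reconnects.
// Assumes the payload contains no newlines.
function toSseFrame(event: RelayEvent): string {
  return `id: ${event.id}\nevent: ${event.source}\ndata: ${event.payload}\n\n`;
}
```

&lt;p&gt;On reconnect, the server would call &lt;code&gt;replayAfter&lt;/code&gt; with the ID from the client's Last-Event-ID header and stream those frames before resuming live delivery, which is why a dropped connection no longer means a dropped event.&lt;/p&gt;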

&lt;p&gt;The webhook ingestion for &lt;strong&gt;Self-Healing&lt;/strong&gt; (covered in Part 4) goes through &lt;strong&gt;the exact same Event Relay&lt;/strong&gt;. GitHub, Grafana, and other webhook sources get consolidated through one relay, and each machine's SSE client subscribes to whichever events it cares about. &lt;strong&gt;A single general-purpose webhook relay is a piece of infra that keeps paying off in unexpected ways&lt;/strong&gt; -- worth investing in early.&lt;/p&gt;

&lt;p&gt;When the reviewer's machine receives an event, the script spawns &lt;code&gt;claude -p&lt;/code&gt; and walks through 9 dimensions (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) sequentially, then reads the verdict marker the AI emitted at the end and posts &lt;code&gt;APPROVE&lt;/code&gt; or &lt;code&gt;REQUEST_CHANGES&lt;/code&gt; via &lt;code&gt;gh pr review&lt;/code&gt;.&lt;/p&gt;
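
&lt;p&gt;The last step, turning the AI's output into a review action, might look like the sketch below. The marker format (&lt;code&gt;VERDICT: APPROVE&lt;/code&gt; on its own line) is purely hypothetical, since the real marker spec is not shown in this post; &lt;code&gt;parseVerdict&lt;/code&gt; and &lt;code&gt;reviewFlag&lt;/code&gt; are illustrative names. The &lt;code&gt;--approve&lt;/code&gt; and &lt;code&gt;--request-changes&lt;/code&gt; flags are the standard ones for &lt;code&gt;gh pr review&lt;/code&gt;.&lt;/p&gt;

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";

// Hypothetical marker format: a line such as "VERDICT: APPROVE".
// The post does not show the real marker spec.
const MARKER = /^VERDICT:\s*(APPROVE|REQUEST_CHANGES)\s*$/;

// Scans from the end so that only the last marker counts.
// Returns null when no marker is found, so the caller can decline to post a review.
function parseVerdict(output: string): Verdict | null {
  const lines = output.trim().split("\n");
  for (let i = lines.length - 1; i >= 0; i -= 1) {
    const match = MARKER.exec((lines[i] ?? "").trim());
    if (match !== null) {
      return match[1] === "APPROVE" ? "APPROVE" : "REQUEST_CHANGES";
    }
  }
  return null;
}

// Maps a verdict onto the corresponding `gh pr review` flag.
function reviewFlag(verdict: Verdict): string {
  return verdict === "APPROVE" ? "--approve" : "--request-changes";
}
```

&lt;p&gt;Treating a missing marker as "no verdict" rather than defaulting to approval keeps a truncated or malformed session from merging anything by accident.&lt;/p&gt;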

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79easc3e030ab4tyrcdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79easc3e030ab4tyrcdf.png" alt="Auto review pipeline — distributed webhook architecture running on every developer's machine" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modes split the role&lt;/strong&gt; -- the same script started with &lt;code&gt;--mode reviewer&lt;/code&gt; becomes the reviewer process; with &lt;code&gt;--mode author&lt;/code&gt; it becomes the PR-author response process. The machine of whoever is assigned as reviewer runs reviewer mode; the machine of whoever opened the PR runs author mode. Event Relay multicasts the events, and &lt;strong&gt;each machine reacts in a distributed way&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-PR worktree isolation&lt;/strong&gt; -- author mode merges &lt;code&gt;origin/main&lt;/code&gt; into a fresh worktree before spawning the AI. Multiple PRs can be handled in parallel without file state contaminating across them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 dimensions checked sequentially in one session&lt;/strong&gt; -- not parallel sub-agents. A single &lt;code&gt;claude -p&lt;/code&gt; session walks the 9 dimensions while keeping context shared, which also catches cross-dimension contradictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review guidelines: public snapshot&lt;/strong&gt; -- &lt;a href="https://github.com/air-closet/cortex-review-guidelines" rel="noopener noreferrer"&gt;air-closet/cortex-review-guidelines&lt;/a&gt; (JP/EN). The live guidelines are inside cortex (private repo) and evolve daily; the public repo is a snapshot extracted for reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Guidelines alone scale only to projects in the tens-of-thousands-of-lines range.&lt;/strong&gt; At cortex's scale (&lt;strong&gt;over 1M lines of code&lt;/strong&gt;), the &lt;strong&gt;knowledge graph from &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2&lt;/a&gt; (cpg) is a hard prerequisite&lt;/strong&gt;. Porting the guidelines without cpg won't reproduce the same review quality -- the AI reviewer simply can't navigate the codebase fast enough to reason about impact.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Why sequential single-session review, not parallel sub-agents
&lt;/h3&gt;

&lt;p&gt;We initially tried splitting the 9 dimensions across parallel sub-agents. Three problems emerged: cpg / guidelines / PR diff got injected 9 times (token cost balloons), cross-dimension findings couldn't reference each other (a &lt;code&gt;[Test]&lt;/code&gt; issue rooted in a &lt;code&gt;[Graph]&lt;/code&gt; violation gets dropped in isolation), and aggregating 9 outputs into a single verdict required its own machinery.&lt;/p&gt;

&lt;p&gt;A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context for later dimensions (cross-dimension consistency comes for free), and one verdict marker at the end is the entire aggregation step.&lt;/p&gt;

&lt;p&gt;We also &lt;strong&gt;swap &lt;code&gt;CLAUDE.md&lt;/code&gt; to a review-specific version&lt;/strong&gt; at startup. The default &lt;code&gt;CLAUDE.md&lt;/code&gt; is dense with development-time context (Product Graph ops, prod-data safety, MCP ordering) -- noise for a reviewer. The review-specific version centers on severity, no-downgrade, and the verdict marker spec, keeping AI attention on the review task.&lt;/p&gt;

&lt;p&gt;Cutting wasted context improves judgment precision and lowers token cost at the same time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Operational knobs
&lt;/h3&gt;

&lt;p&gt;A few filters and toggles we apply in actual use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Draft (WIP) PRs are excluded.&lt;/strong&gt; Events for PRs in GitHub's Draft state are received but skipped; review starts firing once the author flips the PR to Ready for Review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specific PRs can be targeted manually.&lt;/strong&gt; The webhook is the normal trigger, but you can also kick off a review against a specific PR number from the CLI -- useful after a CI failure or for re-checking a single PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-merge is the PR author's call.&lt;/strong&gt; Whether the pipeline runs through to auto-merge after APPROVE + CI green is set by the PR author. Default is on; for changes that go directly to prod, the author can flip it off and hit merge themselves.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Output structure: tags and severity
&lt;/h2&gt;

&lt;p&gt;Every auto-review comment is structured as &lt;strong&gt;tag + severity + concrete example&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Tags (dimensions)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tag&lt;/th&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Primary target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Graph]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Product Graph integrity&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@graph-*&lt;/code&gt; JSDoc, node dependencies, doc consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Doc]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Doc consistency&lt;/td&gt;
&lt;td&gt;Doc updates that should follow code changes, doc placement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Impact]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Impact analysis&lt;/td&gt;
&lt;td&gt;Missed upstream/downstream fixes, &lt;code&gt;via:&lt;/code&gt; field inconsistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Security]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Auth, input validation, secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Architecture]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Composable Architecture&lt;/td&gt;
&lt;td&gt;app/package boundaries, dependency direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Test]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Test quality&lt;/td&gt;
&lt;td&gt;Coverage, matchers, naming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Observability]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Structured logging, no-truncate rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[AI-Antipattern]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AI-generated code traps&lt;/td&gt;
&lt;td&gt;Hallucinated APIs, fallback overuse, dead code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[Recurrence]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Recurrence prevention&lt;/td&gt;
&lt;td&gt;Bug-fix triage (lint / horizontal rollout / new guideline)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Severity
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security, data corruption, prod-risk, doc inconsistency, missing &lt;code&gt;@graph-*&lt;/code&gt;, quality-bar relaxation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;REQUEST_CHANGES&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Major&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spec violation, Composable Architecture violation, missing tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;REQUEST_CHANGES&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Minor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Naming, maintainability, light refactor&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;REQUEST_CHANGES&lt;/code&gt; (must be resolved)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style preference, minor inconsistency&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;APPROVE&lt;/code&gt; (comment only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The single most important rule is &lt;strong&gt;"no downgrade"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"&lt;strong&gt;Following existing patterns&lt;/strong&gt;" is not a valid reason to downgrade (existing violations are a separate problem to fix, not justification for new violations).&lt;/li&gt;
&lt;li&gt;"&lt;strong&gt;Will be addressed in a separate PR&lt;/strong&gt;" or "&lt;strong&gt;incrementally&lt;/strong&gt;" is not a valid reason to drop Critical/Major to Nit.&lt;/li&gt;
&lt;li&gt;"&lt;strong&gt;Leave a TODO/FIXME&lt;/strong&gt;" is not a valid deferral path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is spelled out in &lt;a href="https://github.com/air-closet/cortex-review-guidelines/blob/main/en/guidelines/severity.md" rel="noopener noreferrer"&gt;&lt;code&gt;severity.md&lt;/code&gt;&lt;/a&gt;, and the AI literally cites it back when it issues &lt;code&gt;REQUEST_CHANGES&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example: PR migrating the embedding model from gemini-embedding-001 to gemini-embedding-2
&lt;/h3&gt;

&lt;p&gt;A real review comment is more useful than an abstract description. Here's a typical one. This is a feature PR (&lt;code&gt;feat(meet): dual-write embeddings to new 'embedding' column (v2)&lt;/code&gt;) that was merged on 2026-05-19, after &lt;strong&gt;6 review-fix iterations across 1.5 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First review (07:35:25)&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The dual-write approach, the idempotent migration script, and the &lt;code&gt;Promise.all&lt;/code&gt; parallelization all look sound. Please address the 3 Critical and 2 Minor findings below before merging.&lt;/p&gt;


&lt;h2&gt;
  
  
  Critical
&lt;/h2&gt;
&lt;h3&gt;
  
  
  [Graph] Missing &lt;code&gt;@graph-business&lt;/code&gt; tag (x3)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;graph-integrity.md&lt;/code&gt; requires &lt;code&gt;@graph-business&lt;/code&gt; on every top-level declaration under the app layer (severity: Critical).&lt;br&gt;
All 3 new declarations below are missing it (detailed comments in separate threads).&lt;br&gt;
&lt;strong&gt;Existing code is also missing it, but "following existing patterns" is not a valid downgrade reason (severity.md).&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;generateEmbeddingV2&lt;/code&gt; (v2 embedding generation function)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EMBEDDING_MODEL_V2&lt;/code&gt; (model name config)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EMBEDDING_LOCATION_V2&lt;/code&gt; (region config)&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  [Graph] &lt;code&gt;embedMeetContent&lt;/code&gt;'s &lt;code&gt;@graph-connects&lt;/code&gt; doesn't reflect &lt;code&gt;generateEmbeddingV2&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The JSDoc on &lt;code&gt;embedMeetContent&lt;/code&gt; has &lt;code&gt;@graph-connects generateEmbedding [calls] Generate embedding&lt;/code&gt;, but no corresponding &lt;code&gt;@graph-connects&lt;/code&gt; line has been added for the newly introduced &lt;code&gt;generateEmbeddingV2&lt;/code&gt; call.&lt;br&gt;
The graph will be missing an edge to &lt;code&gt;generateEmbeddingV2&lt;/code&gt;.&lt;/p&gt;


&lt;pre class="highlight diff"&gt;&lt;code&gt;   * @graph-connects generateEmbedding [calls] Generate embedding
&lt;span class="gi"&gt;+  * @graph-connects generateEmbeddingV2 [calls] v2 embedding generation (dual-write)
&lt;/span&gt;   * @graph-connects insertMeetChunks [calls] Insert chunks into BQ
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  [Doc] Corresponding BigQuery schema doc is not updated
&lt;/h3&gt;

&lt;p&gt;The "BigQuery schema" section in the related doc is missing the new &lt;code&gt;embedding&lt;/code&gt; column.&lt;br&gt;
Both &lt;code&gt;graph-integrity.md&lt;/code&gt; and &lt;code&gt;severity.md&lt;/code&gt; define doc inconsistency as Critical.&lt;/p&gt;


&lt;pre class="highlight diff"&gt;&lt;code&gt; | `created_at`  | TIMESTAMP   | Created at                              |
&lt;span class="gi"&gt;+| `embedding`   | FLOAT64[]   | Embedding vector (v2: gemini-embedding-2) |
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;



&lt;h2&gt;
  
  
  Minor
&lt;/h2&gt;
&lt;h3&gt;
  
  
  [Test] &lt;code&gt;textEmbeddingV2&lt;/code&gt; value is not asserted
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;objectContaining&lt;/code&gt; allows extra fields, so the test still passes even when the v2 value is never set.&lt;/p&gt;


&lt;pre class="highlight diff"&gt;&lt;code&gt;         textEmbedding: [0.1, 0.2, 0.3],
&lt;span class="gi"&gt;+        textEmbeddingV2: [0.1, 0.2, 0.3],
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;
  
  
  [Test] No isolated scenario for "v2 returns null"
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;generateEmbeddingV2: mockGenerateEmbedding&lt;/code&gt; reuses the v1 mock, so the case "v2 returns null while v1 succeeds" is not independently verified.&lt;/p&gt;



&lt;p&gt;&lt;code&gt;&amp;lt;!-- VERDICT:REQUEST_CHANGES --&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The takeaway is the precision of the details.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File + line numbers&lt;/strong&gt; are concrete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suggested fixes are in diff format&lt;/strong&gt; (copy-paste ready).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source guideline&lt;/strong&gt; (&lt;code&gt;graph-integrity.md&lt;/code&gt; / &lt;code&gt;severity.md&lt;/code&gt;) is cited explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The typical excuse&lt;/strong&gt; ("existing code has the same problem") is &lt;strong&gt;pre-emptively closed&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The trailing &lt;code&gt;&amp;lt;!-- VERDICT:REQUEST_CHANGES --&amp;gt;&lt;/code&gt; is a &lt;strong&gt;machine-readable verdict marker&lt;/strong&gt; -- the trigger that moves the PR into &lt;code&gt;REQUEST_CHANGES&lt;/code&gt; state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this, the PR author (usually another AI running on the author's machine) pushes a fix and the reviewer re-reviews. The next review confirms that all 3 Criticals are actually resolved, raises the next Major / Critical, and so on. After &lt;strong&gt;6 iterations in 1.5 hours&lt;/strong&gt;, the PR finally gets APPROVE and auto-merges.&lt;/p&gt;

&lt;p&gt;Plotted on a timeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgqoasedyujmet592wk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgqoasedyujmet592wk2.png" alt="Real example of the review-fix loop — embedding model migration PR, 6 iterations in 1.5 hours" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With a human reviewer, this is "Critical x3 -&amp;gt; wait until tomorrow for the fix -&amp;gt; re-review the day after" -- 2 to 3 days per PR. cortex closes it in &lt;strong&gt;90 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The difference between human review and auto review is not just speed. A single AI session walks all 9 dimensions in order and cites the guideline each time, which makes it &lt;strong&gt;much harder to miss the "deep" findings humans drop because their attention drifted&lt;/strong&gt; -- doc consistency, recurrence-prevention judgments, weak matchers. Side-by-side comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr8gzw2bernzvgz01iwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr8gzw2bernzvgz01iwn.png" alt="Before / After — human review era vs. cortex's auto-review era" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why the review bottleneck never forms here.&lt;/p&gt;
&lt;h2&gt;
  
  
  Evolving the guidelines: catching the moments AI gets it wrong, then fixing the rules
&lt;/h2&gt;

&lt;p&gt;The review guidelines I've been referring to are &lt;strong&gt;not a static document&lt;/strong&gt;. Running this in production surfaces recurring patterns where &lt;strong&gt;the AI mis-judges a specific class of issue&lt;/strong&gt;. Each time that happens, we don't add a comment to the individual PR; we &lt;strong&gt;rewrite the guideline so the AI behaves correctly next time&lt;/strong&gt; -- this is the meta-layer humans actually operate on.&lt;/p&gt;

&lt;p&gt;A few concrete failures we hit on cortex, and how we closed each one by changing the rule, not the PR.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. AI was downgrading because "existing code has the same issue"
&lt;/h3&gt;

&lt;p&gt;Early on, immediately after flagging a violation the AI would add "&lt;strong&gt;however, since existing code has the same violation, I'm downgrading this to Nit&lt;/strong&gt;" and self-downgrade. The result: violations on newly added code kept dropping to Nit, and the system kept emitting Approve.&lt;/p&gt;

&lt;p&gt;We closed this by adding &lt;strong&gt;the no-downgrade rule&lt;/strong&gt; to &lt;code&gt;severity.md&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Following existing patterns" is not a valid downgrade reason: if existing code violates a guideline, new code following that pattern still gets flagged at the same severity. Deferral language like "consider during the next refactor" is not accepted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That wasn't enough on its own. Over time other excuse patterns surfaced -- "&lt;strong&gt;will be addressed in a separate PR&lt;/strong&gt;," "&lt;strong&gt;will be addressed in the next session&lt;/strong&gt;," "&lt;strong&gt;out of scope&lt;/strong&gt;," "&lt;strong&gt;incrementally&lt;/strong&gt;" -- so we added those as forbidden downgrade categories too. We also explicitly forbade &lt;strong&gt;deferring via TODO/FIXME comments in code&lt;/strong&gt;. The mindset is: &lt;strong&gt;close every typical excuse path preemptively&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. The final verdict had 3 options, and "comment-only" left PRs in limbo
&lt;/h3&gt;

&lt;p&gt;The final verdict at the end of every review was originally &lt;code&gt;APPROVE&lt;/code&gt; / &lt;code&gt;REQUEST_CHANGES&lt;/code&gt; / &lt;code&gt;COMMENT&lt;/code&gt; (approve / request changes / comment-only). When the AI picked &lt;code&gt;COMMENT&lt;/code&gt; -- for example when only Minor issues existed -- the script took no action, the PR sat in review-pending forever, and ultimately someone had to manually pick it up. Classic anti-pattern, and it kept happening.&lt;/p&gt;

&lt;p&gt;We &lt;strong&gt;collapsed the verdict to 2 options&lt;/strong&gt;. Anything Minor or above is &lt;code&gt;REQUEST_CHANGES&lt;/code&gt;, a missing verdict marker defaults to &lt;code&gt;REQUEST_CHANGES&lt;/code&gt; (safe side), and only Nit-only or no findings (with CI passing) yields &lt;code&gt;APPROVE&lt;/code&gt;. The principle: "&lt;strong&gt;if the judgment is ambiguous, fail-safe by defaulting to the blocking side (&lt;code&gt;REQUEST_CHANGES&lt;/code&gt;)&lt;/strong&gt;." Going all-in on that design eliminated the stuck-PR class entirely.&lt;/p&gt;
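&lt;p&gt;The fail-safe rule can be sketched in a few lines. This is an illustration under stated assumptions, not the actual cortex script: the function names and the idea of passing the findings' severities as a list are hypothetical. The regex matches the inner part of the HTML-comment marker (&lt;code&gt;!-- VERDICT:... --&lt;/code&gt;).&lt;/p&gt;

```python
import re

# Severities that block a merge, per the severity table in this post.
BLOCKING = {"Critical", "Major", "Minor"}
VALID_VERDICTS = {"APPROVE", "REQUEST_CHANGES"}

# Matches the inside of the HTML-comment marker, e.g. "!-- VERDICT:APPROVE --".
MARKER = re.compile(r"!--\s*VERDICT:([A-Z_]+)\s*--")


def parse_marker(comment_body):
    """Return the verdict named by the last marker in the comment, or None."""
    found = MARKER.findall(comment_body)
    if not found:
        return None
    return found[-1]


def final_verdict(comment_body, severities, ci_green):
    """Collapse every ambiguous case to REQUEST_CHANGES (the blocking side)."""
    marker = parse_marker(comment_body)
    if marker not in VALID_VERDICTS:
        # Missing marker, or a retired value such as COMMENT.
        return "REQUEST_CHANGES"
    if BLOCKING.intersection(severities):
        # Anything Minor or above blocks, whatever the marker says.
        return "REQUEST_CHANGES"
    if not ci_green:
        return "REQUEST_CHANGES"
    return marker
```

&lt;p&gt;The property that matters is that &lt;code&gt;APPROVE&lt;/code&gt; is reachable through exactly one path; every other branch, including a malformed or missing marker, lands on the blocking side.&lt;/p&gt;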
&lt;h3&gt;
  
  
  3. Checklist items had no severity, so the AI's judgment kept drifting
&lt;/h3&gt;

&lt;p&gt;Originally, each guideline (&lt;code&gt;graph-integrity.md&lt;/code&gt;, &lt;code&gt;testing.md&lt;/code&gt;, etc.) was just a &lt;strong&gt;bulleted checklist&lt;/strong&gt;. Items like "Is the test name descriptive?" or "Are mocks minimized?" were listed, but &lt;strong&gt;without per-item severity&lt;/strong&gt;. As a result, the same violation could land as Major in one PR and Nit in another, depending on the session.&lt;/p&gt;

&lt;p&gt;We &lt;strong&gt;converted every guideline's checklist into a &lt;code&gt;severity&lt;/code&gt; / &lt;code&gt;scope&lt;/code&gt; / &lt;code&gt;criterion&lt;/code&gt; table&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;All PRs&lt;/td&gt;
&lt;td&gt;Missing &lt;code&gt;@graph-business&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Major&lt;/td&gt;
&lt;td&gt;App layer only&lt;/td&gt;
&lt;td&gt;Missing tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minor&lt;/td&gt;
&lt;td&gt;Shared packages only&lt;/td&gt;
&lt;td&gt;More than 3 function args&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nit&lt;/td&gt;
&lt;td&gt;All PRs&lt;/td&gt;
&lt;td&gt;Naming inconsistency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;scope&lt;/code&gt; column is &lt;strong&gt;a machine-decidable filter&lt;/strong&gt; for which paths a check applies to, so the AI reviewer doesn't trigger irrelevant items on PRs outside that scope. Simply moving the checks into a table made the AI's judgments significantly more reproducible.&lt;/p&gt;
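&lt;p&gt;To make "machine-decidable" concrete, here is a minimal sketch of a scope filter. The rows mirror the example table above; the path patterns (&lt;code&gt;apps/*&lt;/code&gt;, &lt;code&gt;packages/*&lt;/code&gt;) and the function name are assumptions for illustration, since the real monorepo layout and guideline format are not shown here.&lt;/p&gt;

```python
from fnmatch import fnmatch

# Rows mirroring the severity / scope / criterion example table.
CHECKS = [
    {"severity": "Critical", "scope": "all", "criterion": "Missing @graph-business"},
    {"severity": "Major", "scope": "app", "criterion": "Missing tests"},
    {"severity": "Minor", "scope": "shared", "criterion": "More than 3 function args"},
    {"severity": "Nit", "scope": "all", "criterion": "Naming inconsistency"},
]

# Assumed monorepo layout; the real path patterns are not given in this post.
SCOPE_PATTERNS = {
    "all": ["*"],
    "app": ["apps/*"],
    "shared": ["packages/*"],
}


def applicable_checks(changed_paths):
    """Keep only the checks whose scope matches at least one changed path."""
    selected = []
    for check in CHECKS:
        patterns = SCOPE_PATTERNS[check["scope"]]
        if any(fnmatch(path, pattern)
               for path in changed_paths
               for pattern in patterns):
            selected.append(check)
    return selected
```

&lt;p&gt;With this layout, a PR that only touches &lt;code&gt;apps/&lt;/code&gt; never sees the "More than 3 function args" item, so the reviewer has no opportunity to apply it inconsistently.&lt;/p&gt;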
&lt;h3&gt;
  
  
  4. The existing guidelines didn't catch AI-specific traps
&lt;/h3&gt;

&lt;p&gt;After running this for a while we noticed AI-generated code has its own cluster of antipatterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calling APIs that don't exist&lt;/strong&gt; (hallucinated APIs: something like &lt;code&gt;user.findOrCreate()&lt;/code&gt; that looks plausible but isn't actually defined).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swallowing errors and returning fallback values&lt;/strong&gt; (e.g., silently returning an empty array when an upstream API fails).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leaving unused functions&lt;/strong&gt; (a refactor adds the new function but doesn't delete the old one, leaving dead code).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanding the modification scope beyond what was asked&lt;/strong&gt; (you ask it to change one function and it reformats the whole file).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adding unnecessary backward-compatibility code&lt;/strong&gt; (creating a deprecated alias for an internal-only function).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;security.md&lt;/code&gt; and &lt;code&gt;testing.md&lt;/code&gt; couldn't catch these. There's a &lt;strong&gt;distinct class of "mistakes only AIs make."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We added a dedicated &lt;strong&gt;&lt;code&gt;ai-antipattern.md&lt;/code&gt;&lt;/strong&gt; for this. Reviews now pick these up explicitly under the &lt;code&gt;[AI-Antipattern]&lt;/code&gt; tag. &lt;strong&gt;Reviewing AI output requires designing around AI-specific traps&lt;/strong&gt; -- you don't get there just by porting human review heuristics onto an AI.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. The AI tries to relax "the standard itself"
&lt;/h3&gt;

&lt;p&gt;The last and most important pattern. When the AI was writing fix PRs, occasionally instead of fixing the guideline violation it would write &lt;strong&gt;a PR that relaxes the guideline&lt;/strong&gt;. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower the test coverage threshold to avoid writing more tests&lt;/li&gt;
&lt;li&gt;Narrow the in-house lint rule's scope to make the violation go away&lt;/li&gt;
&lt;li&gt;Soften the guideline doc language from "recommended" to "preferred" to weaken the binding constraint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the AI builds a formally coherent justification: "&lt;strong&gt;existing code already violates this, so let's adjust the standard to match the implementation.&lt;/strong&gt;" Left unchecked, &lt;strong&gt;the AI gradually walks the quality bar down&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We closed this by adding &lt;strong&gt;"quality-bar relaxation" as a Critical&lt;/strong&gt; in &lt;code&gt;severity.md&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A PR that relaxes the quality bar -- guideline doc, lint rule, coverage threshold -- must not be Approved by the AI reviewer. It is sent back with &lt;code&gt;REQUEST_CHANGES&lt;/code&gt;. &lt;strong&gt;A human reviewer's approval is required&lt;/strong&gt;. "Existing code already violates this" is not a valid justification for relaxation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the one explicit boundary where &lt;strong&gt;we deliberately do not give the AI autonomous Approve authority&lt;/strong&gt;. Whether the standard itself moves is a human decision. It's the &lt;strong&gt;meta-level safety valve&lt;/strong&gt; for the "AI reviewing AI" architecture.&lt;/p&gt;
&lt;h3&gt;
  
  
  Evolving the guidelines is the meta-layer humans actually operate on
&lt;/h3&gt;

&lt;p&gt;The common thread: "&lt;strong&gt;when the AI gets it wrong, don't override the individual PR -- rewrite the guideline so the fix propagates forward.&lt;/strong&gt;"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI escapes via "existing code has the same issue" -&amp;gt; add no-downgrade rule&lt;/li&gt;
&lt;li&gt;AI picks "comment-only" and PR stalls -&amp;gt; collapse to 2-option verdict&lt;/li&gt;
&lt;li&gt;AI's judgment drifts -&amp;gt; add severity / scope columns to every item&lt;/li&gt;
&lt;li&gt;AI falls into its own traps -&amp;gt; add the AI-Antipattern category&lt;/li&gt;
&lt;li&gt;AI tries to relax the standard -&amp;gt; classify standard-relaxation as Critical, require human Approve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As long as this loop turns, the guideline is &lt;strong&gt;a living document that absorbs the failure patterns AI produces in production&lt;/strong&gt;. &lt;strong&gt;Don't try to write the perfect guideline up front. Catch the moment AI gets it wrong, and write the rule for that moment.&lt;/strong&gt; That's the actual mechanism behind "quality doesn't drop even when humans aren't inside the loop."&lt;/p&gt;

&lt;p&gt;And one more thread. Right now, the trigger for "AI got it wrong, time to rewrite the guideline" is still mostly a human judgment, but &lt;strong&gt;parts of that maintenance are gradually becoming automatable too&lt;/strong&gt;. &lt;strong&gt;Self-Healing&lt;/strong&gt; (covered in Part 4) -- where AI investigates production incidents, opens a fix PR, runs it through auto-review, and auto-redeploys -- requires every fix PR to declare one of {add lint, add guideline, horizontal rollout} under the &lt;code&gt;[Recurrence]&lt;/code&gt; lens. So the &lt;strong&gt;AI is increasingly participating in the maintenance of its own review criteria&lt;/strong&gt;, with humans still in the loop on adoption. I'll come back to this in Part 4.&lt;/p&gt;
&lt;h2&gt;
  
  
  Auto-fix: a separate AI applies the changes and pushes
&lt;/h2&gt;

&lt;p&gt;Once &lt;code&gt;REQUEST_CHANGES&lt;/code&gt; lands, &lt;strong&gt;the same script running on the PR author's machine, but in author mode&lt;/strong&gt;, picks up the event and starts working.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[REQUEST_CHANGES detected]
   | SSE push via Event Relay
[Author mode boots on PR author's machine]
   | Merge origin/main into a worktree
   |  (lockfile resolved up front, remaining conflicts handled by AI)
   | Read the auto-review comment as context
   | Run claude -p inside the worktree
   | Commit + push the changes
   | New SHA is delivered back to the reviewer's machine via Event Relay -&amp;gt; re-review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design choices matter here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reviewer and author run on different machines in different sessions&lt;/strong&gt; -- reviewer mode and author mode are the same script, but each runs as its own process on its own machine, so "is the original critique correct?" is judged independently. Unlike a single AI fixing its own complaints, the judgment passes between two separate sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All iteration stays inside the same PR&lt;/strong&gt; -- we don't spawn a new PR. The "&lt;strong&gt;fix the root cause, no deferrals&lt;/strong&gt;" rule from Part 2 and the review guidelines kicks in here: if the AI tries to escape via &lt;code&gt;TODO/FIXME&lt;/code&gt; or by splitting work out into a separate PR, the next review rejects it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Auto-merge + parallel deploy
&lt;/h2&gt;

&lt;p&gt;Once auto-review returns APPROVE and CI is fully green, the &lt;code&gt;auto-merge&lt;/code&gt; script runs and squash-merges the PR.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Auto review APPROVE + CI green]
   |
auto-merge script
   | squash merge to main
   |
[main updated]
   |
Turborepo build (affected packages only)
   |
Pulumi up (multiple stacks in parallel)
   |- API services
   |- pipeline services
   |- MCP servers
   `- infra
   |
[Deploy complete]
   |
cpg index rebuilt (only changed nodes regenerate embeddings -- see Part 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pulumi up &amp;lt;stack1&amp;gt; &amp;lt;stack2&amp;gt; ...&lt;/code&gt; runs in parallel, so deploying 9 stacks at once finishes in about 8-12 minutes. End to end, merge-to-production is averaging 10-15 minutes.&lt;/p&gt;

&lt;p&gt;This compounds nicely with Self-Healing PRs. &lt;strong&gt;Incident alert -&amp;gt; Self-Healing identifies root cause -&amp;gt; opens a fix PR -&amp;gt; auto review pass -&amp;gt; auto merge -&amp;gt; auto deploy&lt;/strong&gt; runs as a single closed loop without human involvement (covered in Part 4).&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers, in more detail
&lt;/h2&gt;

&lt;p&gt;Unpacking the headline numbers a bit further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Depth of the review-fix loop
&lt;/h3&gt;

&lt;p&gt;Across 769 PRs in 30 days, the &lt;strong&gt;average per PR was 10.8 review iterations, with a maximum of 56&lt;/strong&gt;. An average above 10 suggests that &lt;strong&gt;the first review almost always surfaces at least one finding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The embedding-model migration PR shown earlier needed 6 iterations to merge, which is below that average. &lt;strong&gt;What would take a human reviewer days, cortex resolved in 90 minutes.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the auto reviewer typically flags
&lt;/h3&gt;

&lt;p&gt;The most common findings from the first review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;[Graph] Missing &lt;code&gt;@graph-business&lt;/code&gt;&lt;/strong&gt; -- a prerequisite cpg leans on (from Part 2). The classic finding on newly added declarations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Doc] Doc inconsistency&lt;/strong&gt; -- code changed but the corresponding &lt;code&gt;docs/&lt;/code&gt; section was not updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Test] Weak matchers&lt;/strong&gt; -- &lt;code&gt;objectContaining&lt;/code&gt; weakening value assertions, single-property checks via &lt;code&gt;toBe&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Observability] Unstructured error logs&lt;/strong&gt; -- &lt;code&gt;event&lt;/code&gt; field or required keys deviating from the structured-log spec.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Recurrence] No recurrence-prevention action&lt;/strong&gt; -- a bug-fix PR description not declaring which of {lint / horizontal rollout / add guideline / nothing} applies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are categories &lt;strong&gt;human reviewers frequently miss in practice&lt;/strong&gt;, especially doc consistency and recurrence-prevention checks. The AI reviewer applies them mechanically on every PR.&lt;/p&gt;

&lt;h3&gt;
  
  
  Actual false-positive rate
&lt;/h3&gt;

&lt;p&gt;It's not zero. A few times a month we get "this is Nit, not Major" type misjudgments. The fix path is the one described above -- not a comment on the individual PR, but a guideline edit that corrects the judgment for all subsequent reviews.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed / Bridge to Part 4
&lt;/h2&gt;

&lt;p&gt;Over the past six months, the engineer's role on cortex shifted from "&lt;strong&gt;writer&lt;/strong&gt;" and "&lt;strong&gt;reviewer&lt;/strong&gt;" to "&lt;strong&gt;operator&lt;/strong&gt;" -- the human running the system, not acting inside each individual decision.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI writes the code (Claude Code)&lt;/li&gt;
&lt;li&gt;AI reviews the code (auto review)&lt;/li&gt;
&lt;li&gt;A different AI applies the fixes (author mode running on the PR author's machine)&lt;/li&gt;
&lt;li&gt;AI decides when to merge (auto-merge script)&lt;/li&gt;
&lt;li&gt;Deploys go in parallel (Turborepo + Pulumi)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What stays in human hands: "&lt;strong&gt;what to build at all&lt;/strong&gt; (product / requirements)," "&lt;strong&gt;is this direction actually right&lt;/strong&gt; (architectural judgment)," "&lt;strong&gt;which guideline to add and where&lt;/strong&gt;," and "&lt;strong&gt;look at the reviews and adjust prompts and guidelines accordingly&lt;/strong&gt;." High-abstraction work -- &lt;strong&gt;not individual decisions, but watching the whole system from above and steering&lt;/strong&gt;. &lt;strong&gt;From human-in-the-loop to human-on-the-loop&lt;/strong&gt;, you could say.&lt;/p&gt;

&lt;p&gt;The widely-reported phenomena -- "AI lowers quality," "the reviewer becomes the bottleneck" -- happen when &lt;strong&gt;the harness is extended on the writer side only, and the reviewer side is left to humans&lt;/strong&gt;. If writing speeds up and reviewing doesn't, of course it bottlenecks. Of course things get missed.&lt;/p&gt;

&lt;p&gt;cortex is the opposite. &lt;strong&gt;We extended the harness on the reviewer side first, before fully extending it on the writer side&lt;/strong&gt;. Anthropic's observation that the bottleneck shifts from writing to reviewing is exactly right -- which is precisely why "&lt;strong&gt;move the reviewer role to AI as well&lt;/strong&gt;" is the answer cortex chose.&lt;/p&gt;

&lt;p&gt;"The AI writes the code, the AI reviews the code." That's the core of cortex's auto-review pipeline. &lt;strong&gt;Quality drop and review bottleneck are functions of how far you extend the harness&lt;/strong&gt; -- they are not inherent to AI-assisted development.&lt;/p&gt;




&lt;p&gt;Up next in &lt;strong&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86?lang=en"&gt;Part 4 — Self-Healing + Recurrence Prevention&lt;/a&gt;&lt;/strong&gt;: a pipeline where a production alert (observed via OTel/Loki/Mimir/Tempo/Faro) triggers AI investigation, an AI-authored fix PR plus a new lint/type gate, auto-review, auto-merge, and auto-redeploy. The fix and a recurrence-prevention guardrail land together, so the same class of incident structurally can't fire again. If auto review protects quality at PR time, Part 4 protects it &lt;strong&gt;at production time, while growing the quality gates themselves&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The headline number above includes Self-Healing PRs (production alerts that AI investigates, fixes, and auto-deploys). For certain classes of incidents, the fix is already merged before anyone has time to react — that's where cortex sits today.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Heart of the AI Harness: A Knowledge Graph of the AI, by the AI, for the AI (Series Part 2)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Tue, 19 May 2026 14:16:20 +0000</pubDate>
      <link>https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm</link>
      <guid>https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: "cortex" and "cortex-product-graph" referenced in this article are internal code names for an AI platform developed in-house at airCloset. They are unrelated to existing commercial services such as Snowflake Cortex or Palo Alto Networks Cortex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;Part 1 (Series Intro)&lt;/a&gt;, I wrote about how &lt;strong&gt;AI handles PR reviews and incident response&lt;/strong&gt; on top of a platform we call cortex. At the center of that flywheel is the &lt;strong&gt;Product Graph&lt;/strong&gt; (implementation name: &lt;code&gt;cortex-product-graph&lt;/code&gt;, or cpg) — a unified knowledge graph of code, docs, DB schemas, and infrastructure definitions, queryable through semantic search.&lt;/p&gt;

&lt;p&gt;In Part 1, I described cpg at a high level: "all of cortex is indexed in one graph." This post goes deeper — &lt;strong&gt;how it's built, why we landed on this design, and what actually changed&lt;/strong&gt; once it was in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Index
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex's harness&lt;/td&gt;
&lt;td&gt;PRs auto-merge / incidents self-heal before you notice&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa"&gt;ai-harness-intro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code, docs, DB, infra unified into one graph&lt;/td&gt;
&lt;td&gt;this post ← you are here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;AI PR review&lt;/td&gt;
&lt;td&gt;webhook → AI review → auto-fix → squash merge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert → AI investigates → fix PR + new lint/type gate → auto redeploy + same-pattern writes get auto-rejected&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series Final&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa"&gt;cortex-philosophy&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Start with One Scene
&lt;/h2&gt;

&lt;p&gt;"I want to change the calculation logic behind the 'bug rate' KPI on the dashboard. &lt;strong&gt;Where is it, and what might break?&lt;/strong&gt;" — imagine that question comes up before you touch any code.&lt;/p&gt;

&lt;p&gt;When you ask an AI this directly, with no function name and no file path given, it hits cpg with a semantic search and pulls the relevant nodes in one shot. What comes back isn't just functions — it includes &lt;strong&gt;BigQuery tables&lt;/strong&gt; and &lt;strong&gt;API endpoints&lt;/strong&gt; alongside the code. And at the end of the response, there's a &lt;strong&gt;"next action candidates (Runbook)"&lt;/strong&gt; block that tells the AI to re-probe starting from the BQ table with the most reads and writes flowing through it.&lt;/p&gt;

&lt;p&gt;The final answer looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calculation site&lt;/strong&gt;: &lt;code&gt;calculateRatePer100pt&lt;/code&gt; / &lt;code&gt;calculateBugCount&lt;/code&gt; — both pure functions with no I/O side effects; safe to change in isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writers (upstream)&lt;/strong&gt;: &lt;code&gt;syncKpiMetrics&lt;/code&gt; / &lt;code&gt;writeKpiMetrics&lt;/code&gt; / &lt;code&gt;backfillKpiMetrics&lt;/code&gt; all write to the &lt;code&gt;kpi_bug_rate_per_100pt&lt;/code&gt; table; these are the real aggregation batch jobs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readers (downstream)&lt;/strong&gt;: &lt;code&gt;BigQueryKpiRepository.getSummaryByDate&lt;/code&gt; reads via BigQuery → &lt;code&gt;/kpi/bugs&lt;/code&gt; API → KPI dashboard page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related docs&lt;/strong&gt;: &lt;code&gt;docs/generator/kpi.md&lt;/code&gt; defines bug rate; updating the code without updating docs would leave them stale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"Update the docs together, and schedule the deploy when the aggregation batch isn't running" — that's a decision you can make with confidence.&lt;/p&gt;

&lt;p&gt;I personally know all this — I wrote it. But that's exactly the problem: &lt;strong&gt;anyone else who wanted to touch this had to track me down&lt;/strong&gt;. Three months ago, "finding out where something lives and what would break" meant finding me. Now, this same investigation is done by &lt;strong&gt;PMO members (non-engineers) using cpg on their own&lt;/strong&gt;. grep didn't get them there; documentation didn't get them there. One natural-language question did.&lt;/p&gt;

&lt;p&gt;What makes that possible is cpg — a graph where you can follow "&lt;strong&gt;what you want to do&lt;/strong&gt;" in plain language to the relevant nodes in one or two hops, even when you don't know the function name. The &lt;strong&gt;Runbook structure&lt;/strong&gt; — where the tool's return value itself contains the next tool call to make — is what lets the AI re-select its starting point and drill deeper on its own.&lt;/p&gt;
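
&lt;p&gt;To make the Runbook idea concrete, here is a minimal TypeScript sketch. The type names (&lt;code&gt;SearchHit&lt;/code&gt;, &lt;code&gt;RunbookStep&lt;/code&gt;), the tool name &lt;code&gt;trace_neighbors&lt;/code&gt;, and the I/O counts are all hypothetical, invented for illustration; the actual cpg response format isn't shown in this post. The only point is the shape: the tool result itself carries the suggested next tool call.&lt;/p&gt;

```typescript
// Hypothetical shapes: none of these names come from cpg itself.
type SearchHit = {
  id: string;
  nodeType: "Function" | "Table" | "API";
  ioEdgeCount: number; // reads + writes flowing through this node
};

type RunbookStep = {
  tool: string; // name of the next tool the agent should call
  args: { startNodeId: string };
  reason: string;
};

type SearchResponse = { hits: SearchHit[]; runbook: RunbookStep[] };

// Build a response whose runbook points at the busiest Table node,
// so the agent can re-probe from there without guessing.
function withRunbook(hits: SearchHit[]): SearchResponse {
  const tables = hits.filter((h) => h.nodeType === "Table");
  if (tables.length === 0) {
    return { hits, runbook: [] };
  }
  const busiest = tables.reduce((a, b) => (b.ioEdgeCount > a.ioEdgeCount ? b : a));
  return {
    hits,
    runbook: [
      {
        tool: "trace_neighbors", // assumed tool name
        args: { startNodeId: busiest.id },
        reason: `most reads/writes (${busiest.ioEdgeCount}) flow through this table`,
      },
    ],
  };
}

const response = withRunbook([
  { id: "calculateBugCount", nodeType: "Function", ioEdgeCount: 0 },
  { id: "kpi_bug_rate_per_100pt", nodeType: "Table", ioEdgeCount: 4 },
]);
console.log(JSON.stringify(response.runbook, null, 2));
```

&lt;p&gt;Because the next step arrives as data rather than as prose in a prompt, the agent can follow it mechanically or ignore it, and the suggestion stays tied to the nodes that were actually returned.&lt;/p&gt;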

&lt;p&gt;That's the setup. Now let me explain how it's built.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Static Analysis Alone Couldn't Do
&lt;/h2&gt;

&lt;p&gt;cortex has a separate system that &lt;strong&gt;graph-analyzes the production codebase using static analysis&lt;/strong&gt; (I'll write about this in its own post — just touching it here). It parses JS/TS code with AST analysis across our external-facing production repos, automatically extracting function call graphs, API endpoints, DB access patterns, and event pub/sub relationships.&lt;/p&gt;

&lt;p&gt;This works well for what it does, and &lt;strong&gt;we still use it actively in the production repos&lt;/strong&gt;. But when we tried applying the same approach to cortex itself, it didn't get us where we wanted to go.&lt;/p&gt;

&lt;p&gt;Three specific gaps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No context&lt;/strong&gt;: nodes exist but carry no &lt;em&gt;meaning&lt;/em&gt;. The answers to "What is this API for?" or "Why does this column exist?" aren't in the graph. Ask "where is the code that calculates the KPI bug rate?" and the search misses unless the function name happens to contain those words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No entry point&lt;/strong&gt; — you &lt;strong&gt;already have to know&lt;/strong&gt; the file path or function name before search can start. "Let me go find it" doesn't work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explosion after 1–2 hops&lt;/strong&gt; — starting from any node, related nodes multiply exponentially within a couple of hops, far exceeding what an AI can process in one context window. Trace results become too long to use.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The summary: &lt;strong&gt;mechanically accurate, but no semantic weighting&lt;/strong&gt;. To be genuinely useful to AI, you need one more layer: "&lt;strong&gt;what matters, and why things are connected.&lt;/strong&gt;"&lt;/p&gt;

&lt;h2&gt;
  
  
  Meanwhile, DB Graph Was Working
&lt;/h2&gt;

&lt;p&gt;Around the same time, a different approach — the &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;DB Graph MCP&lt;/a&gt; we'd built — was working exactly as intended.&lt;/p&gt;

&lt;p&gt;DB Graph is an MCP server with access to &lt;strong&gt;15 schemas and 991 tables&lt;/strong&gt; inside cortex, supporting semantic search over tables and columns with &lt;strong&gt;AI-generated descriptions&lt;/strong&gt;. A natural-language query like "tables related to return processing confirmation" would find semantically connected nodes even when the table name doesn't contain those words.&lt;/p&gt;

&lt;p&gt;After thinking about why this worked, the answer became clear: &lt;strong&gt;DB Graph has a business-context description attached to every node, and that description is what feeds into the embeddings&lt;/strong&gt;. That semantic weight is what "finding by meaning" actually runs on.&lt;/p&gt;

&lt;p&gt;The static-analysis code graph had none of that. Type relationships and call graphs were there, but "&lt;strong&gt;why this function exists&lt;/strong&gt;" was never written down anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hypothesis — Bring DB Graph's Essence into the Code Graph
&lt;/h2&gt;

&lt;p&gt;The hypothesis was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"A business-context description on every node, loaded into embeddings" — if that's the core of why DB Graph works, then doing the same thing for the code graph should structurally overcome the limits of static analysis.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem was: &lt;strong&gt;where do you put the "business context"?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The options we considered:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;External docs&lt;/td&gt;
&lt;td&gt;Design docs / wiki / Notion&lt;/td&gt;
&lt;td&gt;Separate from code. Drifts instantly. Nobody maintains it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External metadata&lt;/td&gt;
&lt;td&gt;Sidecar YAML / &lt;code&gt;*.meta.json&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Dual-management. Breaks on rename.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated graph DB&lt;/td&gt;
&lt;td&gt;Write annotations directly into Neo4j / Neptune&lt;/td&gt;
&lt;td&gt;Dual-management again. Doesn't show up in PR diffs — unreviewable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript decorator&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@GraphNode({...})&lt;/code&gt; in code&lt;/td&gt;
&lt;td&gt;Lives in the transpiled output = runtime dependency. Can't be extracted by AST alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DSL file&lt;/td&gt;
&lt;td&gt;Custom &lt;code&gt;.graph&lt;/code&gt; file format&lt;/td&gt;
&lt;td&gt;High learning cost. No editor support out of the box.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSDoc comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@graph-business&lt;/code&gt; / &lt;code&gt;@graph-connects&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Physically co-located with the code. Extractable by AST alone. Zero runtime dependency.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The choice of &lt;strong&gt;JSDoc over decorators&lt;/strong&gt; was intentional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero runtime dependency&lt;/strong&gt;: decorators survive into the transpiled output and can affect runtime behavior. JSDoc has no executable runtime semantics; with production builds that strip comments, it leaves no runtime artifact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalizes beyond TypeScript&lt;/strong&gt;: the same &lt;code&gt;@graph-*&lt;/code&gt; syntax can extend to Pulumi definitions in &lt;code&gt;infra/&lt;/code&gt; and Markdown frontmatter in &lt;code&gt;docs/&lt;/code&gt;. Decorators are locked to TypeScript syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single AST pass&lt;/strong&gt;: ts-morph can walk declarations and extract JSDoc in one scan. Decorators sometimes require type resolution, which slows builds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shows up naturally in PR diffs&lt;/strong&gt;: JSDoc sits directly above the code it annotates, so when code changes, the JSDoc diff appears in the same file. Reviewers can't miss it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doubles as documentation for both humans and AI&lt;/strong&gt;: JSDoc already serves as IDE hover text and AI-readable context. Putting &lt;code&gt;@graph-business&lt;/code&gt; there means it simultaneously explains the declaration to a human reading the code, and gives a coding AI semantic context about the surrounding functions. Graph metadata that also functions as inline documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that the essence of this design is &lt;strong&gt;using parseable annotations co-located with code as the SSoT&lt;/strong&gt; — TypeScript / JSDoc is just one implementation. The same pattern works in any language with comparable comment + AST primitives: Python docstrings + &lt;code&gt;ast&lt;/code&gt;, Go comments + &lt;code&gt;go/ast&lt;/code&gt;, Rust &lt;code&gt;///&lt;/code&gt; + &lt;code&gt;syn&lt;/code&gt;. &lt;strong&gt;What matters isn't &lt;em&gt;where&lt;/em&gt; you write the annotations, but the invariant: "physically co-located with the code, extractable by AST alone."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same goes for the monorepo: &lt;strong&gt;this pattern doesn't depend on cortex being a monorepo&lt;/strong&gt;. If anything, &lt;strong&gt;its real value shows when repositories are split and AI can't easily follow code across them&lt;/strong&gt;. In a monorepo, the AI can still grep / read files across the whole tree; in a multi-repo, the cross-repo calls and data flows are the hard part to follow. Run the same build per repo, emit nodes / edges, aggregate into a central graph, and those cross-repo connections become reachable in one hop. We actually run a parallel knowledge graph over our external-facing production repos (multi-repo) using the same pattern — more on that in a separate post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach — Abandon Code Inference, Make JSDoc the SSoT
&lt;/h2&gt;

&lt;p&gt;The code graph's problem was &lt;strong&gt;no meaning&lt;/strong&gt;. The answer is simple: &lt;strong&gt;embed the meaning directly in the code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For cortex's own code graph, we &lt;strong&gt;completely abandoned the approach of inferring graph structure from code&lt;/strong&gt;. Instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Every declaration — function / class / method / API / Page / Cron / etc. — gets a dedicated JSDoc tag. The graph is assembled from those.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means the &lt;strong&gt;SSoT (Single Source of Truth) for business context becomes the code itself&lt;/strong&gt;. There's no gap between docs and code, because &lt;strong&gt;the JSDoc in the code is the authoritative source&lt;/strong&gt;. The structural problem of "AI makes mistakes because docs are stale" is resolved at the level of where the data lives.&lt;/p&gt;

&lt;p&gt;Placing the two side by side — "a graph from code inference alone" versus "a knowledge graph with JSDoc as SSoT" — makes the difference in what's carried on each node immediately visible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxsr5j73ufyjias48w9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxsr5j73ufyjias48w9q.png" alt="Before / After — graph from code inference alone vs. knowledge graph with JSDoc as SSoT" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a concrete example of the tags (from cpg's own source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Set embeddings on nodes in place.
 * Compares textForEmbedding against existing BQ data; only re-generates
 * for nodes where the text has changed.
 *
 * @graph-stack product-graph
 * @graph-domain Engineering
 * @graph-business Compares hash of textForEmbedding against existing BQ nodes; re-generates
 *   embedding only for nodes where text has changed. Unchanged nodes reuse BQ embeddings.
 * @graph-connects cortex.product_graph_nodes [queries, via:id] read existing embeddings
 * @graph-connects vertex-ai-embedding [calls] generate embeddings for changed nodes
 */&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ProductGraphNode&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;force&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each tag does:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tag&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@graph-node&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Explicitly declares node type (defaults to Function)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@graph-stack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The infra stack this declaration belongs to&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@graph-domain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Business domain (comma-separated, multiple allowed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@graph-business&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;What this declaration specifically does&lt;/strong&gt; — the body of the embedding input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@graph-connects&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Connection targets (multiple allowed; &lt;code&gt;via:&lt;/code&gt; for parameter-level tracking; &lt;code&gt;none&lt;/code&gt; to explicitly declare no connections)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key is that &lt;code&gt;@graph-business&lt;/code&gt; &lt;strong&gt;feeds directly into the embedding input&lt;/strong&gt;. It's not the node name — it's a &lt;strong&gt;natural-language sentence&lt;/strong&gt; that carries semantic weight into search. In practice, almost all of these sentences are written by AI: during the normal flow of writing code in cortex, the AI writes the JSDoc alongside the code (and thanks to the ESLint enforcement below, it doesn't forget).&lt;/p&gt;

&lt;h3&gt;
  
  
  Making Omissions Physically Impossible
&lt;/h3&gt;

&lt;p&gt;This design collapses the moment someone leaves a tag out. One function without &lt;code&gt;@graph-business&lt;/code&gt; = that function is invisible to semantic search. One without &lt;code&gt;@graph-connects&lt;/code&gt; = the data flow through that function is absent from the graph.&lt;/p&gt;

&lt;p&gt;So we built &lt;strong&gt;enforcement that makes omissions physically impossible&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5 ESLint plugins&lt;/strong&gt; — tag presence validation, syntax validation, naming convention enforcement (stack / domain allowlists), &lt;code&gt;@graph-connects&lt;/code&gt; required, &lt;code&gt;@graph-connects none&lt;/code&gt; misuse detection (flags when &lt;code&gt;none&lt;/code&gt; appears on code that calls external services)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated PR review&lt;/strong&gt; (Part 1 ③): missing tags are flagged as &lt;code&gt;[Graph] Critical&lt;/code&gt;; docs inconsistencies are flagged as &lt;code&gt;[Doc] Critical&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
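
&lt;p&gt;The simplest of those checks, tag presence, can be sketched as a plain function. This is not one of the actual ESLint plugins (their source isn't shown here); &lt;code&gt;missingGraphTags&lt;/code&gt; is a hypothetical helper that only illustrates the rule "all four required tags must appear on the declaration's JSDoc".&lt;/p&gt;

```typescript
// The four required tags named in this post; @graph-node is optional.
const REQUIRED_TAGS = ["@graph-stack", "@graph-domain", "@graph-business", "@graph-connects"];

// Return the required tags that do not appear in a JSDoc comment.
// A tag counts as present only if it starts a comment line (after the leading " * ").
function missingGraphTags(jsdoc: string): string[] {
  const tagsPresent = jsdoc
    .split("\n")
    .map((line) => line.replace(/^\s*\/?\*+\s?/, "").trim())
    .filter((line) => line.startsWith("@graph-"))
    .map((line) => line.split(/\s+/)[0]);
  return REQUIRED_TAGS.filter((tag) => !tagsPresent.includes(tag));
}

const doc = [
  "/**",
  " * Set embeddings on nodes in place.",
  " * @graph-stack product-graph",
  " * @graph-domain Engineering",
  " * @graph-business Re-generates embeddings only for changed nodes.",
  " */",
].join("\n");

// Only @graph-connects is absent from this comment.
console.log(missingGraphTags(doc));
```

&lt;p&gt;A real lint rule would report the missing tags against the declaration's source location; the check itself stays this small.&lt;/p&gt;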

&lt;p&gt;The result: &lt;strong&gt;"write a declaration → business context is always written with it"&lt;/strong&gt; holds as an invariant. Add a function → its meaning and connections are necessarily in its JSDoc.&lt;/p&gt;

&lt;p&gt;One honest note: &lt;strong&gt;forcing "5 JSDoc tags on every declaration" on humans would blow up in code review within three days&lt;/strong&gt;. Writing a &lt;code&gt;@graph-business&lt;/code&gt; sentence per function, enumerating &lt;code&gt;@graph-connects&lt;/code&gt; exhaustively, checking the naming allowlists — that's genuinely tedious at scale.&lt;/p&gt;

&lt;p&gt;This works because &lt;strong&gt;AI writes the code&lt;/strong&gt;. Writing four required JSDoc tags (plus optional &lt;code&gt;@graph-node&lt;/code&gt; when the default &lt;code&gt;Function&lt;/code&gt; type isn't enough) is a rounding error on top of writing the code itself. With ESLint and automated review in the feedback loop, the AI doesn't miss tags, and human reviewers only need to check "is this tag factually correct?" rather than "is it there?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This design is one that &lt;strong&gt;can't realistically be maintained when humans write code&lt;/strong&gt;, but &lt;strong&gt;becomes viable the moment AI does&lt;/strong&gt;. It's an AI-first design. The premise of AI-first development is what lets business context be fixed in code as the SSoT.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Where Hallucination Happens Shifts
&lt;/h3&gt;

&lt;p&gt;Viewed from another angle, what's going on here is that &lt;strong&gt;the location of hallucination shifts&lt;/strong&gt;. &lt;strong&gt;Where you contain hallucination is, I think, fundamental to AI harness design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As I &lt;a href="https://dev.to/ryantsuji/graph-rag-isnt-a-one-shot-anymore-the-case-for-agentic-graph-rag-mcps-1dj5"&gt;wrote elsewhere&lt;/a&gt;, when you combine AI with a graph system, "&lt;strong&gt;hallucination doesn't disappear — it just changes location.&lt;/strong&gt;" For cpg, here's where it lands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Graph build / query phase&lt;/strong&gt;: &lt;strong&gt;No fresh LLM generation.&lt;/strong&gt; Once reviewed metadata lands in the graph, the ts-morph AST pass, the BigQuery MERGE, and the MCP query responses are all deterministic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSDoc writing phase&lt;/strong&gt;: This is the entry point for hallucination. Whether &lt;code&gt;@graph-business&lt;/code&gt; is factually accurate, or whether &lt;code&gt;@graph-connects&lt;/code&gt; is exhaustively listed — these can go wrong since the AI is writing them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But &lt;strong&gt;the entry point is locked down by automated PR review&lt;/strong&gt;. Missing tags get &lt;code&gt;[Graph] Critical&lt;/code&gt;; factual drift gets &lt;code&gt;[Doc] Critical&lt;/code&gt;. When something's wrong, either the AI that wrote the code or another reviewer AI catches it and fixes it.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;once data lands in the graph, it can be treated as deterministically sourced from reviewed code, not as a fresh generated answer that might hallucinate on every query&lt;/strong&gt;. AI agents calling cpg don't have to guard against "this might be a generated lie" on every returned node or edge. The tools can be designed as "return facts only" without compromise.&lt;/p&gt;
&lt;h2&gt;
  
  
  Build — AST to Graph via ts-morph
&lt;/h2&gt;

&lt;p&gt;Once JSDoc is established as the SSoT, the rest is mechanics: extract it and assemble the graph. The implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AST-analyze JS/TS with ts-morph&lt;/strong&gt; — walk every declaration (function / class / method / type / enum / variable / expression statement / &lt;code&gt;export default&lt;/code&gt; / etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract &lt;code&gt;@graph-*&lt;/code&gt; tags from JSDoc&lt;/strong&gt; — collect the four required tags plus optional &lt;code&gt;@graph-node&lt;/code&gt; and normalize into a &lt;code&gt;ParsedGraphTags&lt;/code&gt; structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate nodes&lt;/strong&gt; — use &lt;code&gt;qualifiedName = "&amp;lt;filePath&amp;gt;:&amp;lt;name&amp;gt;"&lt;/code&gt; as the node ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate edges&lt;/strong&gt; — one edge per &lt;code&gt;@graph-connects&lt;/code&gt; entry, with &lt;code&gt;via:&lt;/code&gt; / &lt;code&gt;cardinality&lt;/code&gt; and other metadata preserved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate embeddings&lt;/strong&gt; — send &lt;code&gt;@graph-business&lt;/code&gt; text to Vertex AI Embedding (&lt;code&gt;gemini-embedding-2&lt;/code&gt;) and vectorize it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load into BigQuery&lt;/strong&gt; — MERGE all nodes / edges into &lt;code&gt;cortex.product_graph_nodes&lt;/code&gt; / &lt;code&gt;cortex.product_graph_edges&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
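
&lt;p&gt;Steps 3 and 4 can be sketched without any AST library once the tag lines are in hand. The version below is an illustration under stated assumptions, not cpg's parser: the &lt;code&gt;GraphEdge&lt;/code&gt; field names other than &lt;code&gt;kind&lt;/code&gt; and &lt;code&gt;targetId&lt;/code&gt;, the fallback relation label, and the file path in the usage example are made up, and the tag grammar is inferred from the two &lt;code&gt;@graph-connects&lt;/code&gt; lines quoted earlier.&lt;/p&gt;

```typescript
// Assumed edge shape; only "kind" and "targetId" are mentioned in the post.
type GraphEdge = {
  kind: "edge";
  sourceId: string;
  targetId: string;
  relation: string;
  via: string | null;
  description: string;
};

// Matches: @graph-connects TARGET [RELATION, via:PARAM] free-text description
const CONNECTS = /^@graph-connects\s+(\S+)(?:\s+\[([^\]]*)\])?\s*(.*)$/;

function parseConnects(filePath: string, name: string, tagLines: string[]): GraphEdge[] {
  const sourceId = `${filePath}:${name}`; // qualifiedName used as the node ID
  const edges: GraphEdge[] = [];
  for (const line of tagLines) {
    const m = CONNECTS.exec(line.trim());
    const target = m === null ? undefined : m[1];
    // Skip non-matching lines and the explicit "no connections" marker.
    if (m === null || target === undefined || target === "none") continue;
    const attrs = (m[2] ?? "").split(",").map((s) => s.trim()).filter((s) => s !== "");
    const viaAttr = attrs.find((a) => a.startsWith("via:"));
    edges.push({
      kind: "edge",
      sourceId,
      targetId: target,
      relation: attrs.find((a) => !a.startsWith("via:")) ?? "connects", // fallback label is an assumption
      via: viaAttr ? viaAttr.slice(4) : null,
      description: m[3] ?? "",
    });
  }
  return edges;
}

// The file path is hypothetical; the tag lines are the ones quoted in this post.
const edges = parseConnects("src/embeddings.ts", "generateEmbeddings", [
  "@graph-connects cortex.product_graph_nodes [queries, via:id] read existing embeddings",
  "@graph-connects vertex-ai-embedding [calls] generate embeddings for changed nodes",
  "@graph-connects none",
]);
for (const edge of edges) console.log(JSON.stringify(edge));
```

&lt;p&gt;In the real pipeline the tag lines would come from ts-morph's JSDoc nodes rather than from a string array; the mapping from one tag line to one edge is the part this sketch is meant to show.&lt;/p&gt;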

&lt;p&gt;Because &lt;code&gt;@graph-business&lt;/code&gt; goes directly into the embedding input, querying "&lt;strong&gt;code that calculates the KPI bug rate&lt;/strong&gt;" in natural language returns a hit based on semantic proximity of the description — even when the function name contains neither "bug" nor "rate."&lt;/p&gt;

&lt;p&gt;The overall flow: the three tracks (&lt;code&gt;apps/&lt;/code&gt; / &lt;code&gt;infra/&lt;/code&gt; / &lt;code&gt;docs/&lt;/code&gt;) each go through their own parser, are merged into a single node set by the generator, and only nodes whose text has changed are sent to Vertex AI before being stored in BigQuery:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcox0uq3d52e058rdgug8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcox0uq3d52e058rdgug8.png" alt="Build pipeline — assembling one knowledge graph from JSDoc, Pulumi, and docs" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Build Cost Is Effectively Zero
&lt;/h3&gt;

&lt;p&gt;The build runs automatically on push to main via GitHub Actions, using a differential embedding approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare each node's new &lt;code&gt;textForEmbedding&lt;/code&gt; against the value already stored in BQ&lt;/li&gt;
&lt;li&gt;Unchanged nodes reuse their existing BQ embeddings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Only changed nodes go to Vertex AI&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
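
&lt;p&gt;As a rough sketch of that differential step: the function below splits freshly built nodes into "reuse" and "re-embed" sets. The names (&lt;code&gt;planEmbeddingWork&lt;/code&gt;, &lt;code&gt;TextById&lt;/code&gt;) are hypothetical, and it compares the text directly, whereas the JSDoc sample earlier mentions comparing a hash; the idea is the same with the hashing step left out.&lt;/p&gt;

```typescript
// Map from node ID to its textForEmbedding.
type TextById = { [id: string]: string };

// Decide, per node, whether the stored embedding can be reused.
// "force" mirrors the force option on generateEmbeddings shown earlier.
function planEmbeddingWork(stored: TextById, built: TextById, force = false) {
  const reuse: string[] = [];
  const reembed: string[] = [];
  for (const id of Object.keys(built)) {
    const unchanged = force ? false : stored[id] === built[id];
    if (unchanged) reuse.push(id);
    else reembed.push(id); // new node or changed text
  }
  return { reuse, reembed };
}

// Example data is invented for illustration.
const plan = planEmbeddingWork(
  { a: "sync KPI metrics", b: "read KPI summary" },
  { a: "sync KPI metrics", b: "read KPI summary by date", c: "backfill KPI metrics" },
);
console.log(plan);
```

&lt;p&gt;Only the IDs in &lt;code&gt;reembed&lt;/code&gt; would be sent to the embedding API; everything in &lt;code&gt;reuse&lt;/code&gt; keeps its stored vector.&lt;/p&gt;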

&lt;p&gt;A typical push changes a few dozen nodes, so cost is &lt;strong&gt;under $0.001&lt;/strong&gt;. Full regeneration (for recovery, triggered via &lt;code&gt;workflow_dispatch&lt;/code&gt;) is ~$0.075 for 8,000+ nodes.&lt;/p&gt;
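
&lt;p&gt;Those two figures are consistent with each other, which a quick back-of-the-envelope calculation shows. The sketch assumes cost scales linearly with node count (in practice it depends on text length), and the 36-node push is just an example of "a few dozen".&lt;/p&gt;

```typescript
// Figures quoted in the post: roughly $0.075 to re-embed roughly 8,000 nodes.
const FULL_REBUILD_USD = 0.075;
const FULL_REBUILD_NODES = 8000;

// Assumption: cost is linear in node count.
const usdPerNode = FULL_REBUILD_USD / FULL_REBUILD_NODES;

function estimateUsd(changedNodes: number): number {
  return changedNodes * usdPerNode;
}

// How many changed nodes fit under the $0.001 mark at this rate?
const nodesUnderBudget = Math.floor(0.001 / usdPerNode);

console.log(estimateUsd(36).toFixed(6));
console.log(nodesUnderBudget);
```

&lt;p&gt;At that rate a push would have to change more than a hundred nodes before it crossed $0.001.&lt;/p&gt;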
&lt;h3&gt;
  
  
  Why BigQuery, Not a Graph Database
&lt;/h3&gt;

&lt;p&gt;When people hear "knowledge graph," they often imagine a dedicated graph DB (Neo4j, Neptune, Memgraph, etc.). cortex runs on &lt;strong&gt;just two BigQuery tables&lt;/strong&gt; (&lt;code&gt;product_graph_nodes&lt;/code&gt; / &lt;code&gt;product_graph_edges&lt;/code&gt;). Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Different cost structure&lt;/strong&gt; — dedicated graph DBs set a floor of "always-on cluster cost"; for the current implementation, BQ is &lt;strong&gt;storage + on-demand queries only&lt;/strong&gt;. Even with continuous AI traffic, it's clearly cheaper than running a server 24/7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search / cosine similarity / SQL in the same place&lt;/strong&gt;: BQ has &lt;a href="https://cloud.google.com/bigquery/docs/vector-search" rel="noopener noreferrer"&gt;&lt;code&gt;VECTOR_SEARCH&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-distance" rel="noopener noreferrer"&gt;&lt;code&gt;ML.DISTANCE&lt;/code&gt;&lt;/a&gt;, so semantic search over &lt;code&gt;@graph-business&lt;/code&gt; embeddings, filtering by node properties, and adjacent-node JOINs can all live in &lt;strong&gt;one query&lt;/strong&gt;. That matters when "semantic search + property filter + neighbor JOIN" is the standard access pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration-ready for GQL once BQ Graph goes GA&lt;/strong&gt; — BQ already has &lt;a href="https://cloud.google.com/bigquery/docs/graph-overview" rel="noopener noreferrer"&gt;Graph in BigQuery&lt;/a&gt; in Preview; once it ships GA, you can put a graph view over the existing tables and likely shift to &lt;code&gt;MATCH (n)-[e]-&amp;gt;(m)&lt;/code&gt; queries in GQL. &lt;strong&gt;The current table design is already migration-ready.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
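
&lt;p&gt;To illustrate the "one query" point, here is roughly what such a query could look like, built as a string in TypeScript. The two table names come from this post; every column name (&lt;code&gt;embedding&lt;/code&gt;, &lt;code&gt;node_type&lt;/code&gt;, &lt;code&gt;source_id&lt;/code&gt;, &lt;code&gt;target_id&lt;/code&gt;, &lt;code&gt;relation&lt;/code&gt;) and both query parameters are assumptions, since the actual schema and queries aren't shown here.&lt;/p&gt;

```typescript
// Column names and parameters below are assumptions, not cpg's actual schema.
function buildSemanticNeighborQuery(topK: number): string {
  return `
WITH hits AS (
  -- Semantic search over the node embeddings
  SELECT base.id AS node_id, base.node_type AS node_type, distance
  FROM VECTOR_SEARCH(
    TABLE cortex.product_graph_nodes,
    'embedding',
    (SELECT @query_embedding AS embedding),
    top_k => ${Math.floor(topK)},
    distance_type => 'COSINE'
  )
)
SELECT h.node_id, h.distance, e.relation, e.target_id AS neighbor_id
FROM hits AS h
-- Neighbor JOIN: one hop along outgoing edges
LEFT JOIN cortex.product_graph_edges AS e
  ON e.source_id = h.node_id
-- Property filter on the matched nodes
WHERE h.node_type = @node_type
ORDER BY h.distance`;
}

console.log(buildSemanticNeighborQuery(10));
```

&lt;p&gt;One thing to keep in mind with this shape: the property filter runs after the nearest-neighbor step, so fewer than &lt;code&gt;top_k&lt;/code&gt; nodes can come back.&lt;/p&gt;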

&lt;p&gt;In short: &lt;strong&gt;get the graph DB's future strength (GQL) while running on plain BQ tables today&lt;/strong&gt;. Compared to adding a graph DB on top of a generic RAG stack (pgvector / Pinecone / etc.), this leaves fewer systems to operate and a gentler learning curve.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Core Part Is Available as an Open-Source Sample
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;"parse JSDoc annotations with AST analysis and output a graph"&lt;/strong&gt; part is small enough to reproduce cleanly, so I published it as a working sample:&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/thujikun/graph-jsdoc-extractor" rel="noopener noreferrer"&gt;graph-jsdoc-extractor&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's a ~500-line library that extracts &lt;code&gt;@graph-*&lt;/code&gt; and outputs ndjson of &lt;code&gt;{ kind: "node", ... }&lt;/code&gt; / &lt;code&gt;{ kind: "edge", ... }&lt;/code&gt; objects. Comes with a &lt;code&gt;pnpm run example&lt;/code&gt; that runs end-to-end. For those who just want to see the output format without cloning, the built ndjson is checked in: &lt;strong&gt;&lt;a href="https://github.com/thujikun/graph-jsdoc-extractor/blob/main/examples/sample/output.ndjson" rel="noopener noreferrer"&gt;examples/sample/output.ndjson&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is intentionally just the "turn code into a graph" part. The real value in cortex starts when &lt;strong&gt;docs and DB schemas land on the same graph&lt;/strong&gt; — that's the next section.&lt;/p&gt;
&lt;h2&gt;
  
  
  Connections — Landing Docs and DB on the Same Graph
&lt;/h2&gt;

&lt;p&gt;Looking at the sample ndjson, a &lt;code&gt;@graph-connects users [reads_from, via:id]&lt;/code&gt; entry has &lt;code&gt;users&lt;/code&gt; stored as a &lt;strong&gt;raw string&lt;/strong&gt; in &lt;code&gt;targetId&lt;/code&gt;. Leaving that as-is means it's just a string. Resolving &lt;code&gt;users&lt;/code&gt; into a &lt;strong&gt;rich node carrying column definitions, partition info, and per-column descriptions&lt;/strong&gt; — that's where the resolution power of search takes a real step forward.&lt;/p&gt;

&lt;p&gt;cortex does this in three directions.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. DB Schemas as Nodes in the Same Graph
&lt;/h3&gt;

&lt;p&gt;cpg ingests not just code but cortex's DB schemas in the same build. A &lt;code&gt;@graph-connects users [queries, via:id]&lt;/code&gt; on the code side gets resolved at build time into a &lt;strong&gt;rich Table node&lt;/strong&gt; carrying column definitions, partition metadata, and descriptions (if the same-named stub exists, its internals are replaced while its ID and all inbound edges survive).&lt;/p&gt;

&lt;p&gt;The key point: &lt;strong&gt;table and column descriptions aren't AI-generated annotations attached after the fact; they're pulled directly from the &lt;code&gt;description&lt;/code&gt; fields in the Pulumi schema definitions&lt;/strong&gt;. Here's what that looks like (excerpt from cpg's own table definition):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productGraphNodesTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;gcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cortex-prod-product-graph-nodes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;datasetId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cortex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tableId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;product_graph_nodes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Product Graph nodes — unified knowledge graph of code + DB + docs. &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Auto-generated from JSDoc @graph-* tags&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;STRING&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;REQUIRED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unique node ID (graphId:nodeType:filePath:name format)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nodeType&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;STRING&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;REQUIRED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Node type — ApiEndpoint, BigQueryTable, Function, Module, Document, etc.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;qualifiedName&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;STRING&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fully qualified name — filePath:exportName format&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;]),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the table-level and column-level descriptions &lt;strong&gt;become the embedding input for semantic search directly from the Pulumi definition&lt;/strong&gt;. The same philosophy as cpg's JSDoc — "write the description at the place the thing is defined" — runs all the way through the DB layer. Fix a Pulumi &lt;code&gt;description&lt;/code&gt; → semantic search improves. Same mechanics as fixing a JSDoc.&lt;/p&gt;
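&lt;p&gt;To make "descriptions become the embedding input" concrete, here is a minimal sketch of flattening a table definition into one text blob. The &lt;code&gt;TableDef&lt;/code&gt; / &lt;code&gt;ColumnDef&lt;/code&gt; shapes, the &lt;code&gt;buildEmbeddingInput&lt;/code&gt; name, and the exact text layout are assumptions for illustration; cortex's real format isn't shown in this article.&lt;/p&gt;

```typescript
// Hypothetical shapes mirroring the fields visible in the Pulumi excerpt above.
interface ColumnDef {
  name: string;
  type: string;
  description?: string;
}

interface TableDef {
  tableId: string;
  description: string;
  columns: ColumnDef[];
}

// Flatten table-level and column-level descriptions into a single string
// that could be handed to an embedding model.
function buildEmbeddingInput(table: TableDef): string {
  const lines = [`${table.tableId}: ${table.description}`];
  for (const col of table.columns) {
    lines.push(`- ${col.name} (${col.type}): ${col.description ?? "no description"}`);
  }
  return lines.join("\n");
}

const embeddingText = buildEmbeddingInput({
  tableId: "product_graph_nodes",
  description: "Product Graph nodes",
  columns: [
    { name: "id", type: "STRING", description: "Unique node ID" },
    { name: "nodeType", type: "STRING", description: "Node type" },
  ],
});
console.log(embeddingText);
```

&lt;p&gt;Because the text is derived purely from the definition, editing a &lt;code&gt;description&lt;/code&gt; and rebuilding changes the embedding input with no separate annotation step.&lt;/p&gt;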

&lt;h3&gt;
  
  
  2. Docs Auto-Promoted to Nodes via Directory Convention
&lt;/h3&gt;

&lt;p&gt;Markdown files under &lt;code&gt;docs/&lt;/code&gt; also land in the graph. The mechanism is simple: &lt;strong&gt;the directory structure follows a fixed convention&lt;/strong&gt;, so the stack and domain each doc belongs to can be resolved deterministically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/{category}/{name}.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples from cpg itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docs/product-graph/README.md&lt;/code&gt; → stack: &lt;code&gt;product-graph&lt;/code&gt;, domain: &lt;code&gt;Engineering&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docs/code-graph/README.md&lt;/code&gt; → stack: &lt;code&gt;code-graph&lt;/code&gt;, domain: &lt;code&gt;Engineering&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docs/mcp/db-graph/README.md&lt;/code&gt; → stack: &lt;code&gt;mcp-db-graph-server&lt;/code&gt;, domain: &lt;code&gt;Engineering&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each file is ingested as a &lt;strong&gt;Document node&lt;/strong&gt; in the graph, and a &lt;code&gt;documented_by&lt;/code&gt; edge is auto-generated from code nodes whose &lt;code&gt;@graph-stack&lt;/code&gt; matches the doc's stack. Code under &lt;code&gt;apps/graph/product/&lt;/code&gt; all carries &lt;code&gt;@graph-stack product-graph&lt;/code&gt;, so it's automatically linked to &lt;code&gt;docs/product-graph/README.md&lt;/code&gt;. Change code → related docs are already linked.&lt;/p&gt;

&lt;p&gt;This means an AI reviewer can answer "did this code change leave related docs stale?" &lt;strong&gt;in one graph hop&lt;/strong&gt; (that's the source of the &lt;code&gt;[Doc] Critical&lt;/code&gt; comments from Part 1).&lt;/p&gt;
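&lt;p&gt;A minimal sketch of the stack-match linking described above, not cortex's actual parser. It assumes the simple two-segment case &lt;code&gt;docs/{category}/{name}.md&lt;/code&gt; and that the stack equals the category segment. The &lt;code&gt;DocNode&lt;/code&gt; / &lt;code&gt;CodeNode&lt;/code&gt; shapes, the &lt;code&gt;doc:&lt;/code&gt; ID prefix, and both function names are hypothetical.&lt;/p&gt;

```typescript
interface DocNode {
  id: string;
  nodeType: "Document";
  path: string;
  stack: string;
}

interface CodeNode {
  id: string;
  stack: string; // value of the node's @graph-stack tag
}

interface Edge {
  sourceId: string;
  targetId: string;
  edgeType: string;
}

// Resolve a docs path to a Document node, or null if it does not fit the
// two-segment convention assumed by this sketch.
function parseDocPath(path: string): DocNode | null {
  const match = /^docs\/([^/]+)\/([^/]+)\.md$/.exec(path);
  if (match === null) return null;
  return { id: `doc:${path}`, nodeType: "Document", path, stack: match[1] };
}

// Emit one documented_by edge per (code node, doc) pair that shares a stack.
function linkDocs(codeNodes: CodeNode[], docs: DocNode[]): Edge[] {
  const edges: Edge[] = [];
  for (const code of codeNodes) {
    for (const doc of docs) {
      if (code.stack === doc.stack) {
        edges.push({ sourceId: code.id, targetId: doc.id, edgeType: "documented_by" });
      }
    }
  }
  return edges;
}

const readme = parseDocPath("docs/product-graph/README.md");
const docEdges = linkDocs(
  [
    { id: "apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports", stack: "product-graph" },
    { id: "some/other/file.ts:helper", stack: "code-graph" },
  ],
  readme === null ? [] : [readme],
);
console.log(docEdges);
```

&lt;p&gt;A nested path such as &lt;code&gt;docs/mcp/db-graph/README.md&lt;/code&gt; returns &lt;code&gt;null&lt;/code&gt; here; mapping it to a stack like &lt;code&gt;mcp-db-graph-server&lt;/code&gt; needs project-specific rules that this sketch leaves out.&lt;/p&gt;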

&lt;h3&gt;
  
  
  3. Infrastructure Definitions as Nodes
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;@graph-*&lt;/code&gt; tags go on Pulumi code in &lt;code&gt;infra/&lt;/code&gt; too. An example from cortex's own graph infrastructure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * @graph-node {CronSchedule}
 * @graph-stack code-graph
 * @graph-domain Engineering
 * @graph-business graph-boundary-daily: runs cross-repository boundary analysis at 7:00 AM JST
 *   daily (auto-detecting API, DB, and Event connections across repos)
 * @graph-connects graph-index-job [triggers] trigger Cloud Run Job
 */&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;gcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudscheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-graph-boundary-schedule`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This becomes a &lt;strong&gt;CronSchedule node&lt;/strong&gt; in the graph, connected to the target CloudRunJob node by a &lt;code&gt;triggers&lt;/code&gt; edge. The Pulumi definition is itself a graph entry point — "&lt;strong&gt;what code runs in this cron?&lt;/strong&gt;" is now answerable by graph traversal.&lt;/p&gt;
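&lt;p&gt;For illustration, a &lt;code&gt;@graph-connects&lt;/code&gt; value could be parsed roughly as follows. The grammar here (target, then a bracketed edge type with an optional &lt;code&gt;via:&lt;/code&gt; entry, then free text) is inferred only from the examples in this article, so treat it as an assumption; &lt;code&gt;ConnectsTag&lt;/code&gt; and &lt;code&gt;parseConnects&lt;/code&gt; are hypothetical names, not the extractor's API.&lt;/p&gt;

```typescript
interface ConnectsTag {
  targetId: string; // raw target string, not yet resolved to a node ID
  edgeType: string;
  via: string | null;
  note: string;
}

// Parse the text that follows "@graph-connects" on a JSDoc line.
// Returns null when the value does not match the assumed grammar.
function parseConnects(value: string): ConnectsTag | null {
  const match = /^(\S+)\s+\[([^\]]+)\]\s*(.*)$/.exec(value.trim());
  if (match === null) return null;
  const parts = match[2].split(",").map((p) => p.trim());
  const viaPart = parts.find((p) => p.startsWith("via:"));
  return {
    targetId: match[1],
    edgeType: parts[0],
    via: viaPart === undefined ? null : viaPart.slice("via:".length),
    note: match[3],
  };
}

const cronTag = parseConnects("graph-index-job [triggers] trigger Cloud Run Job");
const tableTag = parseConnects("users [reads_from, via:id]");
console.log(cronTag, tableTag);
```

&lt;p&gt;The output still carries &lt;code&gt;targetId&lt;/code&gt; as a raw string; resolving it to a real node is a separate step.&lt;/p&gt;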

&lt;h3&gt;
  
  
  Result: Four Layers on One Graph
&lt;/h3&gt;

&lt;p&gt;Adding the three together, the node types in the graph look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node type&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function / Class / Method&lt;/td&gt;
&lt;td&gt;Code (JSDoc)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ApiEndpoint / Page&lt;/td&gt;
&lt;td&gt;Code (JSDoc &lt;code&gt;@graph-node&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigQueryTable / FirestoreCollection (stub)&lt;/td&gt;
&lt;td&gt;Code &lt;code&gt;@graph-connects&lt;/code&gt; targets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Table / Column / Schema&lt;/strong&gt; (rich)&lt;/td&gt;
&lt;td&gt;Schema files defined in Pulumi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Directory parser over &lt;code&gt;docs/&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CronSchedule / PubSubTopic / CloudRunService&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;infra/&lt;/code&gt; JSDoc&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Edge types correspondingly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Edge type&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;calls / queries / reads_from / writes_to / publishes / triggers&lt;/td&gt;
&lt;td&gt;code → other nodes (&lt;code&gt;@graph-connects&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;documented_by&lt;/td&gt;
&lt;td&gt;code → Document (auto-generated on stack match)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HAS_TABLE / HAS_COLUMN&lt;/td&gt;
&lt;td&gt;Schema → Table → Column (DB side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shares_topic&lt;/td&gt;
&lt;td&gt;Between boundary nodes sharing a topic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Code ↔ DB ↔ docs ↔ infra&lt;/strong&gt; — all reachable in one hop on the same graph. This is what "Product Graph" means: cortex's unified knowledge graph.&lt;/p&gt;

&lt;p&gt;Here's an actual visualization of a slice of cpg itself. Starting from &lt;code&gt;generateEmbeddings&lt;/code&gt; (code), you can see &lt;code&gt;cortex.product_graph_nodes&lt;/code&gt; (BigQueryTable) with its columns, the Pulumi table definition resource, &lt;code&gt;docs/product-graph/README.md&lt;/code&gt;, external services like Vertex AI, and a separate layer's &lt;code&gt;graph-boundary-daily&lt;/code&gt; (CronSchedule) — &lt;strong&gt;all connected by edges on the same node set&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa75qtqgt7043xrl8w5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa75qtqgt7043xrl8w5f.png" alt="Product Graph — a knowledge graph with four layers on the same node set" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the Sample Stops
&lt;/h3&gt;

&lt;p&gt;graph-jsdoc-extractor &lt;strong&gt;intentionally leaves out&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resolving &lt;code&gt;@graph-connects&lt;/code&gt; targets to real node IDs&lt;/strong&gt; (cortex uses a seven-stage resolver; the rules are project-specific)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same-name merging&lt;/strong&gt; (cortex promotes DB-schema-side rich nodes to replace stubs; the merge source is project-specific)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The docs directory convention parser&lt;/strong&gt; (cortex's &lt;code&gt;docs/{category}/{name}.md&lt;/code&gt; convention is cortex-specific)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding generation&lt;/strong&gt; (Vertex AI setup is up to you)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are parts where &lt;strong&gt;the right answer differs per project&lt;/strong&gt; — naming conventions, where docs live, which embedding model to use, when to promote a stub to a rich node. Baking one answer into the sample library would make it harder to use, not easier. The sample draws the line at JSDoc → graph structure, and this article's job is "here's how we did it in cortex — translate it to your project's context."&lt;/p&gt;
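&lt;p&gt;As one illustration of what a same-name merge might look like in your own project, here is a sketch that promotes a stub to a rich node while keeping the stub's ID, so inbound edges keep resolving. The node and edge shapes, the name-equality matching rule, and the function names are all assumptions; cortex's actual resolver is not shown here.&lt;/p&gt;

```typescript
// Hypothetical node/edge shapes for the sketch.
interface GraphNode {
  id: string;
  nodeType: string;
  name: string;
  isStub: boolean;
  properties: { [key: string]: unknown };
}

interface GraphEdge {
  sourceId: string;
  targetId: string;
  edgeType: string;
}

// Take the rich node's internals but keep the stub's ID.
function promoteStub(stub: GraphNode, rich: GraphNode): GraphNode {
  return { ...rich, id: stub.id, isStub: false };
}

// Merge schema-derived nodes into code-derived nodes by matching on name.
function mergeSchemaNodes(codeNodes: GraphNode[], schemaNodes: GraphNode[]): GraphNode[] {
  const richByName = new Map(schemaNodes.map((n) => [n.name, n] as [string, GraphNode]));
  const merged: GraphNode[] = [];
  for (const node of codeNodes) {
    const rich = node.isStub ? richByName.get(node.name) : undefined;
    if (rich === undefined) {
      merged.push(node);
      continue;
    }
    merged.push(promoteStub(node, rich));
    richByName.delete(node.name);
  }
  // Schema nodes that no code references yet are kept as they are.
  return [...merged, ...Array.from(richByName.values())];
}

const inboundEdge: GraphEdge = { sourceId: "fn:getUser", targetId: "stub:users", edgeType: "reads_from" };
const mergedNodes = mergeSchemaNodes(
  [{ id: "stub:users", nodeType: "BigQueryTable", name: "users", isStub: true, properties: {} }],
  [{ id: "schema:users", nodeType: "Table", name: "users", isStub: false, properties: { columns: ["id", "email"] } }],
);
// The edge written against the stub still finds its target after the merge.
const resolvedTarget = mergedNodes.find((n) => n.id === inboundEdge.targetId);
console.log(resolvedTarget);
```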

&lt;h2&gt;
  
  
  MCP Tool Design and the Runbook Pattern
&lt;/h2&gt;

&lt;p&gt;The graph is now assembled. Next: &lt;strong&gt;how AI uses it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;cpg runs as an MCP server (&lt;code&gt;cortex-product-graph&lt;/code&gt;). From the AI's side, three tools are visible, applying the &lt;strong&gt;three-layer tool design&lt;/strong&gt; (search / detail / traverse) from &lt;a href="https://dev.to/ryantsuji/graph-rag-isnt-a-one-shot-anymore-the-case-for-agentic-graph-rag-mcps-1dj5"&gt;the Agentic Graph RAG MCP post&lt;/a&gt; directly to cpg:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_product_graph_nodes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find entry points (vector search + name search)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_product_graph_node_detail&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deterministically fetch detail by ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;trace_product_graph_connections&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BFS subgraph traversal (&lt;code&gt;via_filter&lt;/code&gt; for parameter-level tracking)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
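&lt;p&gt;The traverse layer is a depth-limited breadth-first search over edges. Below is a minimal in-memory sketch of that idea; the &lt;code&gt;Edge&lt;/code&gt; shape and the &lt;code&gt;traceConnections&lt;/code&gt; signature are assumptions, the sample node IDs are made up, and &lt;code&gt;via_filter&lt;/code&gt; is omitted. The real tool runs against the stored graph, not an array.&lt;/p&gt;

```typescript
interface Edge {
  sourceId: string;
  targetId: string;
  edgeType: string;
}

type Direction = "forward" | "backward" | "both";

// Breadth-first traversal from startId, following edges in the requested
// direction for at most maxDepth hops. Returns the edges that were walked.
function traceConnections(edges: Edge[], startId: string, direction: Direction, maxDepth: number): Edge[] {
  const visited = new Set([startId]);
  const found: Edge[] = [];
  let frontier = [startId];
  for (let depth = 1; maxDepth >= depth; depth++) {
    if (frontier.length === 0) break;
    const next: string[] = [];
    for (const edge of edges) {
      const forwardHit = direction !== "backward" ? frontier.includes(edge.sourceId) : false;
      const backwardHit = direction !== "forward" ? frontier.includes(edge.targetId) : false;
      if (!(forwardHit || backwardHit)) continue;
      if (!found.includes(edge)) found.push(edge);
      const neighbor = forwardHit ? edge.targetId : edge.sourceId;
      if (!visited.has(neighbor)) {
        visited.add(neighbor);
        next.push(neighbor);
      }
    }
    frontier = next;
  }
  return found;
}

// Made-up sample graph: cron -> job -> fn -> table
const sampleEdges: Edge[] = [
  { sourceId: "cron", targetId: "job", edgeType: "triggers" },
  { sourceId: "job", targetId: "fn", edgeType: "calls" },
  { sourceId: "fn", targetId: "table", edgeType: "writes_to" },
];
const downstream = traceConnections(sampleEdges, "cron", "forward", 2);
const upstream = traceConnections(sampleEdges, "table", "backward", 5);
console.log(downstream.length, upstream.length);
```

&lt;p&gt;The depth cap is what keeps a trace bounded: with a limit of 2 the forward walk from &lt;code&gt;cron&lt;/code&gt; stops before reaching &lt;code&gt;table&lt;/code&gt;.&lt;/p&gt;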

&lt;p&gt;The three layers only show you what's &lt;em&gt;in&lt;/em&gt; the graph. For jumping from graph nodes to the actual data they point to, &lt;strong&gt;supplementary tools live in the same MCP&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Supplementary tool&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;read_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pass a node's &lt;code&gt;path&lt;/code&gt; property directly to fetch source (Function / Class / Method / ApiEndpoint / Document — any code-origin node carries &lt;code&gt;path&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pattern search across the repository&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git_blame&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Last author, commit, and timestamp per line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query_product_graph_bq&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Direct SQL against BigQuery. Find a BigQueryTable node in the graph, then jump to its live data (executed via user OAuth, so BQ IAM applies as-is)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;read_firestore&lt;/code&gt; / &lt;code&gt;write_firestore&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Read/write Firestore collections. Find a FirestoreCollection node in the graph, then go to the live documents (Firestore access follows the same user / environment permission boundary; cpg provides the entry point, not a bypass around IAM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;list_product_graph_stacks&lt;/code&gt; / &lt;code&gt;list_product_graph_domains&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Lists all stack / domain names present in the graph; useful for orienting before a search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In other words, cpg's MCP is &lt;strong&gt;a two-tier design: the three-layer structure for graph traversal + supplementary tools for descending into live data (source code / BQ / Firestore)&lt;/strong&gt;. The AI can do "search by meaning → traverse by structure → pull live data" &lt;strong&gt;entirely within one MCP server&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runbook Pattern — Return Values Contain the Next Action
&lt;/h3&gt;

&lt;p&gt;Every MCP response ends with a &lt;strong&gt;"related nodes (next action candidates)" block&lt;/strong&gt;. For example, after a search returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3 nodes found:
- apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount (Function)
- backlog_no_embedding.kpi_bug_rate_per_100pt (BigQueryTable)
- /kpi/bugs (ApiEndpoint)

## Related nodes (next action candidates)

### 🛠 Code (1)
- apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount
  → `get_product_graph_node_detail("apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount")`

### 🗄 DB tables (1)
- backlog_no_embedding.kpi_bug_rate_per_100pt
  → `trace_product_graph_connections(start_node: "backlog_no_embedding.kpi_bug_rate_per_100pt", direction: "backward")`

### 🌐 API (1)
- /kpi/bugs
  → `get_product_graph_node_detail("/kpi/bugs")`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Copy-pasteable tool calls are lined up by node type, showing exactly what to call next.&lt;/strong&gt; The AI gets new options on every call, so it never has to figure out "what should I do now?"&lt;/p&gt;
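&lt;p&gt;A rough sketch of how a server could append such a block to its response. It is simplified (no grouping by node type, no emoji headers), and the rule "tables get a backward trace, everything else gets a detail fetch" is an assumption read off the sample output above; &lt;code&gt;FoundNode&lt;/code&gt;, &lt;code&gt;suggestNextCall&lt;/code&gt;, and &lt;code&gt;renderRunbook&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```typescript
interface FoundNode {
  id: string;
  nodeType: string;
}

// Pick the follow-up tool call to suggest for a node.
function suggestNextCall(node: FoundNode): string {
  if (node.nodeType === "BigQueryTable") {
    return `trace_product_graph_connections(start_node: "${node.id}", direction: "backward")`;
  }
  return `get_product_graph_node_detail("${node.id}")`;
}

// Render the search hits followed by copy-pasteable next actions.
function renderRunbook(nodes: FoundNode[]): string {
  const lines = [`${nodes.length} nodes found:`];
  for (const node of nodes) {
    lines.push(`- ${node.id} (${node.nodeType})`);
  }
  lines.push("", "## Related nodes (next action candidates)");
  for (const node of nodes) {
    lines.push("", `- ${node.id}`, "  → `" + suggestNextCall(node) + "`");
  }
  return lines.join("\n");
}

const runbookText = renderRunbook([
  { id: "/kpi/bugs", nodeType: "ApiEndpoint" },
  { id: "backlog_no_embedding.kpi_bug_rate_per_100pt", nodeType: "BigQueryTable" },
]);
console.log(runbookText);
```

&lt;p&gt;The design point is that the suggestion is computed server-side from the node type, so the caller never has to know which tool fits which kind of node.&lt;/p&gt;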

&lt;p&gt;Here's the AI ↔ MCP loop in diagram form. The MCP bundles next action candidates into every search response; the AI picks one and makes the next call, repeating:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe0siod62k9ohnj4pz9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe0siod62k9ohnj4pz9j.png" alt="Runbook pattern — tool return values contain the next tool call to make" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;usecase&lt;/code&gt; Parameter — Switching the Runbook
&lt;/h3&gt;

&lt;p&gt;Every tool accepts a &lt;strong&gt;&lt;code&gt;usecase&lt;/code&gt; parameter&lt;/strong&gt; where the AI declares what kind of investigation it's doing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;usecase&lt;/th&gt;
&lt;th&gt;Strategy (summary of what cpg optimizes for)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;general&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Basic investigation with unknown entry point. Default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Understanding existing feature structure. Read business / connections via &lt;code&gt;get_product_graph_node_detail&lt;/code&gt;. Deep trace is unnecessary; Document nodes take priority.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;impact&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Trace upstream and downstream impact deeply. Hit &lt;code&gt;trace_product_graph_connections&lt;/code&gt; with &lt;code&gt;direction=both&lt;/code&gt; / &lt;code&gt;max_depth=5&lt;/code&gt;. Code + DB + infra + schedules are all on the same graph, so one traversal covers a wide area.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test-create&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Test design. Fetch detail to read parameters and connected DB / called functions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compare existing tests against implementation coverage. Cross-check branch structure of target Function / Method against test case count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check impact of changes and detect &lt;code&gt;@graph-business&lt;/code&gt; violations. Trace impact → detail to check business / source.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deep trace from error origin. &lt;code&gt;direction=both&lt;/code&gt; / &lt;code&gt;max_depth=5&lt;/code&gt; for upstream callers + downstream data flow.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The same &lt;code&gt;search_product_graph_nodes&lt;/code&gt; call with &lt;code&gt;usecase: "code-review"&lt;/code&gt; returns next action candidates optimized for "verify the change's impact first." With &lt;code&gt;usecase: "bug"&lt;/code&gt; it returns candidates optimized for "trace deep from error origin + fetch logs." The Runbook switches to match the declared intent.&lt;/p&gt;

&lt;p&gt;This matters because &lt;strong&gt;having the AI declare "what kind of investigation I'm doing"&lt;/strong&gt; yields different angles from the same graph. Auto Review internally fires with &lt;code&gt;code-review&lt;/code&gt;; Self-Healing fires with &lt;code&gt;bug&lt;/code&gt; — the flywheel elements from Part 1 each run a different Runbook.&lt;/p&gt;
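&lt;p&gt;One way to picture the switch is a lookup from the declared &lt;code&gt;usecase&lt;/code&gt; to traversal defaults. In this sketch only the &lt;code&gt;impact&lt;/code&gt; / &lt;code&gt;bug&lt;/code&gt; values (&lt;code&gt;both&lt;/code&gt;, depth 5) and the Document preference for &lt;code&gt;design&lt;/code&gt; come from the table above; the other numbers are placeholders, and &lt;code&gt;TraceDefaults&lt;/code&gt; / &lt;code&gt;traceDefaultsFor&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```typescript
type Usecase =
  | "general"
  | "design"
  | "impact"
  | "test-create"
  | "test-review"
  | "code-review"
  | "bug";

interface TraceDefaults {
  direction: "forward" | "backward" | "both";
  maxDepth: number;
  preferDocuments: boolean;
}

// Map a declared usecase to the traversal defaults a runbook would suggest.
function traceDefaultsFor(usecase: Usecase): TraceDefaults {
  switch (usecase) {
    case "impact":
    case "bug":
      // From the table: deep trace in both directions.
      return { direction: "both", maxDepth: 5, preferDocuments: false };
    case "design":
      // From the table: deep trace unnecessary, Document nodes first.
      // The depth value itself is a placeholder.
      return { direction: "forward", maxDepth: 1, preferDocuments: true };
    default:
      // Placeholder defaults for the remaining usecases.
      return { direction: "forward", maxDepth: 2, preferDocuments: false };
  }
}

const bugDefaults = traceDefaultsFor("bug");
const designDefaults = traceDefaultsFor("design");
console.log(bugDefaults, designDefaults);
```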

&lt;h3&gt;
  
  
  CLAUDE.md Convention — Forcing AI to Always Hit cpg First
&lt;/h3&gt;

&lt;p&gt;Throughout this post I've said "the AI uses cpg," but AI doesn't &lt;strong&gt;spontaneously choose&lt;/strong&gt; cpg. Claude Code defaults to grep / glob / file read as its first instinct. To flip that, the root CLAUDE.md in cortex opens with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Product Graph MCP (cortex-product-graph)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is the single most important asset in this repository.&lt;/strong&gt; cortex-product-graph MCP indexes all code, DB schemas, docs, and infra into a unified knowledge graph with business context. It knows everything about this repository.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always query Product Graph MCP first&lt;/strong&gt; before grep/glob/file reads. It returns richer, contextualized results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If Product Graph MCP is unavailable&lt;/strong&gt; (auth expired, server down) and you are NOT in autonomous/auto mode, &lt;strong&gt;stop all work immediately&lt;/strong&gt; and ask the user to authenticate. Do not proceed with degraded grep-only investigation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two things matter here. First, the explicit ordering: cpg comes first, and grep / glob / file reads come only after it. Second, &lt;strong&gt;falling back to grep is explicitly forbidden when cpg is unavailable&lt;/strong&gt; (outside autonomous mode, per the clause above). Without that second clause, the AI happily degrades to "cpg seems down, I'll just grep" and proceeds with stale context and wrong assumptions. With it, cpg unavailability is a hard stop, not a graceful degradation.&lt;/p&gt;

&lt;p&gt;One clause in CLAUDE.md, and Claude Code's first move on any code investigation is pinned to cpg. Article writing, Auto Review, Self-Healing — all follow the same convention, so the entry point is always unified.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Live Example — Investigating cpg with cpg
&lt;/h2&gt;

&lt;p&gt;Enough abstraction. Let me walk through a real cpg query: &lt;strong&gt;using cpg to investigate cpg's own builder core&lt;/strong&gt; — the meta-example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Semantic search for "the code that extracts graph source data from code annotations"
&lt;/h3&gt;

&lt;p&gt;No function name assumed. Just the intent in plain language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;search_product_graph_nodes(
  query: "code that extracts graph source data from annotations written in code",
  search_mode: "semantic",
  usecase: "design"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Top 5 results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- apps/graph/product/src/parsers/jsdoc-parser.ts:applyGraphTag (Function)
- apps/graph/product/src/parsers/jsdoc-parser.ts:extractTagsFromNode (Function)
- packages/eslint-plugin-graph/src/utils/jsdoc-utils.ts:extractGraphTags (Function)
- apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports (Function)
- packages/eslint-plugin-graph/src/utils/jsdoc-utils.ts:getGraphTagValue (Function)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query contained neither "JSDoc" nor "&lt;code&gt;@graph-*&lt;/code&gt;" nor "parser" — yet the intent found the right nodes &lt;strong&gt;via the &lt;code&gt;@graph-business&lt;/code&gt; embedding&lt;/strong&gt;. grep cannot do this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Trace downstream from that node (&lt;code&gt;usecase: "design"&lt;/code&gt; prioritizes Documents)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trace_product_graph_connections(
  start_node: "apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports",
  direction: "forward",
  usecase: "design"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edges returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- parseJSDocExports --calls--&amp;gt; extractDeclarationsFromFile
- parseJSDocExports --calls--&amp;gt; extractTagsFromNode
- parseJSDocExports --reads_from[via:filePath]--&amp;gt; filesystem
- parseJSDocExports --documented_by--&amp;gt; docs/product-graph/README.md (Document)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last one — &lt;code&gt;documented_by&lt;/code&gt; — is the point: &lt;strong&gt;the edge from code to the Document node was auto-generated&lt;/strong&gt;. Following it with &lt;code&gt;read_file&lt;/code&gt; retrieves &lt;code&gt;docs/product-graph/README.md&lt;/code&gt; — and with it, &lt;strong&gt;the background, design rationale, and tag specification for this implementation&lt;/strong&gt;, all in one hop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The meta-structure — this article itself is written with cpg
&lt;/h3&gt;

&lt;p&gt;This article was drafted by Claude Code, not by me; I provided direction and review. That Claude Code session has the cpg MCP connected, so every time I said "show a real example from cpg's own code" or "use a cpg-related infra example," Claude queried cpg to pull actual function names, JSDoc, Pulumi definitions, and docs structure, then embedded them in the text.&lt;/p&gt;

&lt;p&gt;In other words: the &lt;strong&gt;&lt;code&gt;generateEmbeddings&lt;/code&gt; JSDoc, the Pulumi &lt;code&gt;productGraphNodesTable&lt;/code&gt; description, the &lt;code&gt;graph-boundary-daily&lt;/code&gt; cron annotation, the auto-link to &lt;code&gt;docs/product-graph/README.md&lt;/code&gt;&lt;/strong&gt; — none of these came from my memory. &lt;strong&gt;Claude queried cpg and found the real artifacts&lt;/strong&gt;. My role is only the review judgment: "this is right / this is wrong."&lt;/p&gt;

&lt;p&gt;This is the pattern repeating across all of cortex. &lt;strong&gt;Humans set the direction; AI uses cpg to verify and generate implementations / text / reviews&lt;/strong&gt;. Part 1's ③ Auto Review and ④ Self-Healing run on the same structure. Article writing isn't a special case — as long as cpg exists, AI-driven work always takes this shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed / Bridge to Part 3
&lt;/h2&gt;

&lt;p&gt;That covers the inside of cpg. A closing summary of how it affects cortex as a whole:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. I stopped running grep&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without knowing file names or symbol names, I can get the relevant code back by just describing what I want to do. The combination of 120+ apps and a team of one works because of this, more than anything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Auto Review produces context-grounded comments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;[Graph]&lt;/code&gt; / &lt;code&gt;[Impact]&lt;/code&gt; / &lt;code&gt;[Doc]&lt;/code&gt; / &lt;code&gt;[Security]&lt;/code&gt; comments that Part 1's ③ Auto Review produces all build on cpg. The substance is &lt;strong&gt;review carried out with the entire codebase as context&lt;/strong&gt;, and that's the real benefit of the cpg integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Self-Healing can trace from error origin to root cause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Part 1's ④ Self-Healing can hop from a Grafana alert → code → dependent tables → related docs in one graph traversal because cpg exists. It fires with &lt;code&gt;usecase: "bug"&lt;/code&gt; and takes the shortest path from error to root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The static-analysis code graph is working somewhere else&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I said "we abandoned code inference" at the top, but that was specifically for cortex itself. For the external-facing production repositories (the core of the business), a different approach supplies context, and static analysis continues to run there. More on that in a separate post.&lt;/p&gt;

&lt;p&gt;Most AI coding setups try to make the AI better at reading an &lt;em&gt;unchanged&lt;/em&gt; repository. cpg takes the opposite approach: &lt;strong&gt;change the repository's information structure so AI has a first-class semantic map to read&lt;/strong&gt;. That's the line between "another GraphRAG" and what cpg actually is.&lt;/p&gt;

&lt;p&gt;In that sense, Product Graph is literally a knowledge graph of the AI, by the AI, for the AI: generated alongside AI-written code, maintained through AI review, and consumed by AI agents as their primary map of the product.&lt;/p&gt;

&lt;p&gt;Coming up in &lt;strong&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;Part 3 — automated PR review&lt;/a&gt;&lt;/strong&gt;: the full pipeline of automated PR review built on top of cpg — from GitHub webhook ingestion through AI review / automated fix / automated merge / parallel deploy. What happens when Auto Review fires with &lt;code&gt;usecase: "code-review"&lt;/code&gt;, how &lt;code&gt;[Graph] Critical&lt;/code&gt; comments are generated, and the worktree mechanism that lets AI apply fixes and push back.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>graphrag</category>
      <category>jsdoc</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Minimal post (test fixture)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Sun, 17 May 2026 16:48:43 +0000</pubDate>
      <link>https://dev.to/ryantsuji/minimal-post-test-fixture-1pk2</link>
      <guid>https://dev.to/ryantsuji/minimal-post-test-fixture-1pk2</guid>
      <description>&lt;p&gt;Minimal test fixture used by &lt;code&gt;$slug.test.tsx&lt;/code&gt;. No headings, no tags — covers the&lt;br&gt;
null branches of TOC rendering and tag-list rendering in &lt;code&gt;routes/posts/$slug.tsx&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Slugs prefixed with &lt;code&gt;_&lt;/code&gt; are excluded from &lt;code&gt;/posts&lt;/code&gt; listing (production publishing&lt;br&gt;
surface) but remain reachable via direct &lt;code&gt;getRenderedPost(slug, lang)&lt;/code&gt; (the&lt;br&gt;
&lt;code&gt;virtual:rendered-posts&lt;/code&gt; lookup that backs &lt;code&gt;/posts/$slug&lt;/code&gt;) so test fixtures can&lt;br&gt;
be SSR'd without polluting the index.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Real AI Harness: Auto-Reviewed PRs, Self-Healing Ops, and Non-Engineer Contributors (Series Intro)</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Tue, 12 May 2026 16:34:39 +0000</pubDate>
      <link>https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa</link>
      <guid>https://dev.to/ryantsuji/building-a-real-ai-harness-auto-reviewed-prs-self-healing-ops-and-non-engineer-contributors-3lfa</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;p&gt;In my previous posts I've introduced &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;the full picture of our 17 internal MCP servers&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;an MCP server that searches 991 internal tables in natural language&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/we-built-a-custom-graph-rag-to-let-ai-answer-did-that-initiative-actually-work-3oda"&gt;a custom Graph RAG for measuring initiative impact&lt;/a&gt;, and &lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;the Sandbox MCP that lets non-engineers publish AI-built apps safely&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All of those run on top of an internal AI development platform we call &lt;strong&gt;cortex&lt;/strong&gt;. This post is the first in a series about cortex itself — the platform, the design choices, and the operational experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Index
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Theme&lt;/th&gt;
&lt;th&gt;Key scene&lt;/th&gt;
&lt;th&gt;Article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Series intro: cortex's harness&lt;/td&gt;
&lt;td&gt;PRs auto-merge / incidents self-heal before you notice&lt;/td&gt;
&lt;td&gt;this post ← you are here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product Graph (cpg)&lt;/td&gt;
&lt;td&gt;Code, docs, DB, infra unified into one graph&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;cortex-product-graph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;AI PR review&lt;/td&gt;
&lt;td&gt;webhook → AI review → auto-fix → squash merge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/human-on-the-loop-ai-reviewing-ai-prs-at-cortex-769-prsmonth-while-raising-the-quality-bar-4lh5"&gt;cortex-auto-review&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Self-Healing + observability + auto-added guardrails&lt;/td&gt;
&lt;td&gt;Alert → AI investigates → fix PR + new lint/type gate → auto redeploy + same-pattern writes get auto-rejected&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/fixed-before-anyone-notices-stronger-after-every-fix-self-healing-recurrence-prevention-series-1e86"&gt;cortex-self-healing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Democratizing the maintenance phase&lt;/td&gt;
&lt;td&gt;Domain experts open PRs to production; the harness owns the quality gate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4"&gt;cortex-non-engineer-prs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Series Final&lt;/td&gt;
&lt;td&gt;The underlying philosophy plus a retrospective on the failures and lessons&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa"&gt;cortex-philosophy&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Two Scenes, Up Front
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scene 1: PRs merge themselves
&lt;/h3&gt;

&lt;p&gt;Monday morning. An engineer implements a feature locally, pushes a branch, opens a PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A few minutes later, the AI reviewer comes back with REQUEST_CHANGES. Multiple comments:

&lt;ul&gt;
&lt;li&gt;"This data formatting duplicates &lt;code&gt;formatRow()&lt;/code&gt; in the shared package. Please consolidate."&lt;/li&gt;
&lt;li&gt;"You changed an API response type, but the related docs (&lt;code&gt;docs/api/...&lt;/code&gt;) still describe the old shape."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A separate AI agent spawns a worktree, applies the fixes, pushes a follow-up commit&lt;/li&gt;
&lt;li&gt;Re-review comes back as APPROVE&lt;/li&gt;
&lt;li&gt;Auto squash-merge&lt;/li&gt;
&lt;li&gt;GitHub Actions detects only the changed stacks and deploys them to Cloud Run / Cloudflare Pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No human touched any of this&lt;/strong&gt;. The engineer refreshes the PR tab and notices it's already merged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scene 2: Incidents fix themselves before you notice
&lt;/h3&gt;

&lt;p&gt;7 AM. A Grafana alert fires: "BQ pipeline failed 3 times in a row."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI receives the webhook, fetches the error logs from Loki via the &lt;strong&gt;Grafana MCP&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Walks the &lt;strong&gt;Product Graph&lt;/strong&gt; (implementation name: &lt;code&gt;cortex-product-graph&lt;/code&gt; — a unified knowledge graph of the codebase, docs, DB schemas, and infrastructure definitions; covered later in this post and in Part 2) to trace the pipeline's code, dependent tables, and related docs, identifying the root cause&lt;/li&gt;
&lt;li&gt;Opens a fix PR&lt;/li&gt;
&lt;li&gt;AI reviewer APPROVE → auto squash-merge → automatic redeploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the time the engineer logs in at 9 AM, Slack already shows: "pipeline patched." The only incidents engineers personally handle are the ones AI genuinely can't crack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froni34pik4wrqw9156g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froni34pik4wrqw9156g5.png" alt="Two automation loops" width="799" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's behind both scenes is the dev environment described in the rest of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Industry Context — "Harness Engineering"
&lt;/h2&gt;

&lt;p&gt;Before I get to cortex, some context. Over the past six months, &lt;strong&gt;the practice of building proper foundations for AI agents in production&lt;/strong&gt; has crystallized into a recognized industry trend.&lt;/p&gt;

&lt;p&gt;"Harness" itself isn't a new word. In AI specifically, it traces back to &lt;strong&gt;EleutherAI's &lt;a href="https://github.com/EleutherAI/lm-evaluation-harness" rel="noopener noreferrer"&gt;lm-evaluation-harness&lt;/a&gt; (2020)&lt;/strong&gt; — the LLM evaluation framework that put the term in active use. What changed in the past six months is its elevation into an engineering discipline for &lt;strong&gt;LLM agents in production&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feb 2026&lt;/strong&gt;: OpenAI published &lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;"Harness engineering: leveraging Codex in an agent-first world"&lt;/a&gt;, describing how a small internal team led by Codex shipped &lt;strong&gt;1 million lines in 5 months&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A few days later, Mitchell Hashimoto (HashiCorp co-founder, Terraform creator) distilled it into the formula &lt;code&gt;Agent = Model + Harness&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;April 2026&lt;/strong&gt;: Martin Fowler (author of &lt;em&gt;Refactoring&lt;/em&gt;, ThoughtWorks Chief Scientist) published &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;"Harness engineering for coding agent users"&lt;/a&gt;, establishing the &lt;strong&gt;Guides (proactive controls) / Sensors (reactive controls)&lt;/strong&gt; framing&lt;/li&gt;
&lt;li&gt;Same month: Anthropic and Cursor each published their own harness write-ups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catchphrase that's gone viral: &lt;strong&gt;"2025 was the year of agents. 2026 is the year of harnesses."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The framing is: &lt;strong&gt;the model itself is rapidly commoditizing&lt;/strong&gt; (the gap between Claude / GPT / Gemini is narrowing from the user side). Where you actually get differentiation is &lt;strong&gt;how you design the harness — the foundation that lets AI run in production&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;cortex is most cleanly read as &lt;strong&gt;a real attempt to build that "harness" inside a real company&lt;/strong&gt;. In this post I'll organize cortex using Fowler's Guides / Sensors framing.&lt;/p&gt;

&lt;p&gt;From here, I'll show &lt;strong&gt;how the "harness beats model" thesis takes concrete shape on cortex&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Builds the Code
&lt;/h2&gt;

&lt;p&gt;For the first few months, &lt;strong&gt;I built 100% of cortex by myself&lt;/strong&gt;. The accurate framing isn't "without a harness, others can't safely open PRs" but rather "&lt;strong&gt;without a harness, no one, including me with extra hands, could steer this thing&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;Even back then, between &lt;a href="https://dev.to/ryantsuji/how-we-built-an-automated-meeting-intelligence-system-with-google-meet-slack-and-rag-42ln"&gt;our Google Meet recording pipeline&lt;/a&gt; (Japanese), about half of the &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;17 MCP servers&lt;/a&gt;, and a long tail of unpublished features, &lt;strong&gt;roughly 50 loosely-coupled applications were already running&lt;/strong&gt;. Each one had its purpose, background, and data flow documented carefully. But the volume was such that &lt;strong&gt;even with AI in the loop, you couldn't realistically have it read all the relevant docs and absorb the whole picture for any given change&lt;/strong&gt;. The codebase had outgrown what a person — or an AI given pieces — could hold in their head at once.&lt;/p&gt;

&lt;p&gt;Recently, with the harness in place, &lt;strong&gt;non-engineers&lt;/strong&gt; (business-side managers, PMOs, etc.) have started shipping PRs to cortex too. As of writing, the cumulative commit ratio is &lt;strong&gt;~91% me, ~9% other recent contributors&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you imagine non-engineers opening PRs against a production repo, "can quality really hold?" is the obvious question. In cortex, the answer is yes, because &lt;strong&gt;AI review and automation own the quality gates&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PRs missing annotations, tests, or lint cleanliness get REQUEST_CHANGES from the AI reviewer&lt;/li&gt;
&lt;li&gt;A separate AI agent applies the fixes&lt;/li&gt;
&lt;li&gt;Until everything is satisfied, nothing merges&lt;/li&gt;
&lt;/ul&gt;
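
&lt;p&gt;As a rough sketch of that gate (not cortex's actual implementation), the loop can be pictured as below. The &lt;code&gt;PullRequest&lt;/code&gt; shape, the &lt;code&gt;review&lt;/code&gt; and &lt;code&gt;fix&lt;/code&gt; stand-ins, and the &lt;code&gt;maxRounds&lt;/code&gt; limit are all hypothetical names introduced for illustration; whether and how cortex caps review rounds is an assumption here.&lt;/p&gt;

```typescript
// Hypothetical sketch of a review-until-approved gate. None of these names
// come from cortex; they only illustrate the loop described above.
type Verdict = "APPROVE" | "REQUEST_CHANGES";

interface PullRequest {
  hasAnnotations: boolean;
  hasTests: boolean;
  lintClean: boolean;
}

type Outcome = { merged: boolean; rounds: number };

// Reviewer stand-in: requests changes while any gate is unmet.
function review(pr: PullRequest): Verdict {
  const ok = [pr.hasAnnotations, pr.hasTests, pr.lintClean].every((v) => v);
  return ok ? "APPROVE" : "REQUEST_CHANGES";
}

// Fixer stand-in: resolves one unmet gate per round.
function fix(pr: PullRequest): PullRequest {
  if (!pr.hasAnnotations) return { ...pr, hasAnnotations: true };
  if (!pr.hasTests) return { ...pr, hasTests: true };
  return { ...pr, lintClean: true };
}

// Nothing merges until the reviewer approves; after maxRounds (an assumed
// safety limit) the PR is left unmerged for a human to look at.
function runGate(initial: PullRequest, maxRounds: number): Outcome {
  let pr = initial;
  for (let round = 1; maxRounds >= round; round++) {
    if (review(pr) === "APPROVE") return { merged: true, rounds: round };
    pr = fix(pr);
  }
  return { merged: false, rounds: maxRounds };
}

const result = runGate({ hasAnnotations: false, hasTests: true, lintClean: false }, 5);
console.log(result);
```

&lt;p&gt;The point of the shape is that the merge decision depends only on the reviewer's verdict, never on who authored the PR.&lt;/p&gt;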

&lt;p&gt;So whoever writes a PR, engineer or not, &lt;strong&gt;the same quality bar has been met by the time it merges&lt;/strong&gt;. The key point: it's not "you can write freely," it's "&lt;strong&gt;you can write inside rails that don't let you derail&lt;/strong&gt;." The author's job stops at "communicating the intent precisely"; the harness owns code correctness.&lt;/p&gt;

&lt;p&gt;The shift is from "&lt;strong&gt;X could write that because they're X&lt;/strong&gt;" to "&lt;strong&gt;X can write that because of cortex&lt;/strong&gt;." That property only emerges once the harness is built — and it's the core of cortex's design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Running
&lt;/h2&gt;

&lt;p&gt;cortex consists of microservices, jobs, MCP servers, web frontends, Cloudflare Workers, and so on. As of writing, there are &lt;strong&gt;123 apps&lt;/strong&gt;. The features I've already covered in past posts are each composed of multiple apps — but even adding them up by feature, &lt;strong&gt;only about 10% of cortex has been written about&lt;/strong&gt;. The remaining 90% hasn't appeared in a post yet. A few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A unified product UX measurement web app&lt;/strong&gt; — UX metrics, screen analysis, funnels, and error analysis in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A dev-org portal web app&lt;/strong&gt; — KPIs (bug rate, etc.), per-member GitHub Activity, QA evaluation results, plus an AI chat that answers natural-language questions about KPIs via Agentic RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A family of Slack bots&lt;/strong&gt; for operational support:

&lt;ul&gt;
&lt;li&gt;A config bot that lets you manage job configurations (DBs, attendance SaaS, Google Drive, etc.) directly from Slack&lt;/li&gt;
&lt;li&gt;An accounting-assist bot that takes invoice OCR and drafts payment requests / expense filings in our accounting SaaS&lt;/li&gt;
&lt;li&gt;In-channel knowledge search, issue/request management, meeting creation; a BigQuery cross-table RAG bot; a Google Drive cross-corpus RAG bot&lt;/li&gt;
&lt;li&gt;A marketing bot that returns insights (trend, creative analysis) from BigQuery marketing data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An APM auto-analysis agent&lt;/strong&gt; that runs daily on monitoring-SaaS APM data, detects performance issues, and opens tickets in our issue-tracking SaaS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AI-bot auditor bot&lt;/strong&gt; that runs E2E tests against the Slack bots above and detects spec drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and so on. &lt;strong&gt;Each will get its own dedicated post later in the series.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scale at a glance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;apps (microservices, jobs, MCP servers, web, etc.)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;123&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;packages (shared libraries)&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP servers&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pulumi stacks&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript (implementation)&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;630K lines&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;560K lines&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown documentation&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;110K lines / 389 files&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;5 months&lt;/strong&gt; (intensive development: ~4 months)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged PRs&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;790&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The 4-Element Flywheel — cortex's Harness
&lt;/h2&gt;

&lt;p&gt;What lets "&lt;strong&gt;~4 months of intensive dev, mostly solo&lt;/strong&gt;" coexist with "&lt;strong&gt;non-engineers shipping into the same repo&lt;/strong&gt;" is a harness design that &lt;strong&gt;delegates quality to AI and automation across every layer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;cortex's harness is structured as a &lt;strong&gt;flywheel&lt;/strong&gt; of 4 elements, mapped to Fowler's &lt;strong&gt;Guides (proactive) / Sensors (reactive)&lt;/strong&gt; split, that &lt;strong&gt;mutually reinforce one another&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh17ehxbeab84i6jfxtnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh17ehxbeab84i6jfxtnw.png" alt="cortex AI Harness Flywheel" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ① Product Graph (Guides — supplying the right context)
&lt;/h3&gt;

&lt;p&gt;All of cortex — &lt;strong&gt;code, documentation, DB schemas, infrastructure definitions&lt;/strong&gt; — is indexed in real time as a single unified graph. It's queryable via MCP through &lt;strong&gt;semantic search&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;"Where is the code that calculates this KPI?" → "Which BQ tables does that code touch?" → "What are those tables' column definitions?" → "What docs are related?" — all of these can be answered from a single query traversal. That graph becomes the context source for everything the AI does.&lt;/p&gt;

&lt;p&gt;This is the foundation that &lt;strong&gt;"structurally reduces how often the AI gets confused."&lt;/strong&gt; Where grep tells you "where the string appears," the Product Graph tells you "&lt;strong&gt;what is connected, why, and how&lt;/strong&gt;." Implementation details come in Part 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  ② Lint / Quality Gates (Guides — physically blocking deviations)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;eslint-disable&lt;/code&gt; / &lt;code&gt;oxlint-disable&lt;/code&gt; are forbidden anywhere in the repo. In hand-written code, occurrences of &lt;code&gt;: any&lt;/code&gt; / &lt;code&gt;as any&lt;/code&gt; / TODO / FIXME are &lt;strong&gt;0&lt;/strong&gt; (excluding generated files and unavoidable external-library cases). &lt;strong&gt;Type checking&lt;/strong&gt; (using &lt;strong&gt;tsgo&lt;/strong&gt; — Microsoft's Go port of the TypeScript compiler, ~10× faster than &lt;code&gt;tsc&lt;/code&gt;; we use it to keep CI time down) runs on the entire codebase in CI.&lt;/p&gt;

&lt;p&gt;On top of that, test coverage is enforced at &lt;strong&gt;≥90% for statements / branches / functions / lines&lt;/strong&gt;. &lt;strong&gt;Lowering the threshold to pass is forbidden&lt;/strong&gt; — you write tests instead.&lt;/p&gt;

&lt;p&gt;With every escape hatch sealed, &lt;strong&gt;even when the AI writes wrong code, it doesn't merge&lt;/strong&gt;. This is also what stabilizes AI review judgments downstream.&lt;/p&gt;
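
&lt;p&gt;A coverage gate of this kind can be sketched in a few lines. The 90% figures come from the text above; the &lt;code&gt;CoverageSummary&lt;/code&gt; shape and the &lt;code&gt;failingMetrics&lt;/code&gt; helper are assumptions for illustration, not the tool or config cortex actually uses.&lt;/p&gt;

```typescript
// Hypothetical coverage gate. The 90% figures come from the post; the
// shape of the summary object and all names here are assumptions.
interface CoverageSummary {
  statements: number;
  branches: number;
  functions: number;
  lines: number;
}

const THRESHOLDS: CoverageSummary = {
  statements: 90,
  branches: 90,
  functions: 90,
  lines: 90,
};

// Returns the metrics that fall below their threshold (empty means pass).
function failingMetrics(actual: CoverageSummary): string[] {
  const metrics: (keyof CoverageSummary)[] = ["statements", "branches", "functions", "lines"];
  return metrics.filter((m) => THRESHOLDS[m] > actual[m]);
}

const failures = failingMetrics({ statements: 95, branches: 88.4, functions: 92, lines: 95 });
console.log(failures.length === 0 ? "gate passed" : "gate failed: " + failures.join(", "));
```

&lt;p&gt;Keeping the thresholds in one constant means the only way to get a failing metric through is to write more tests, which matches the "no lowering" rule.&lt;/p&gt;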

&lt;h3&gt;
  
  
  ③ Auto Review (Sensors — auto-fixing until the bar is met)
&lt;/h3&gt;

&lt;p&gt;Scene 1 above is exactly this. The implementation-side note: &lt;strong&gt;AI review here isn't "lint with extra steps" — every comment is grounded in Product-Graph traversal of the actual impact&lt;/strong&gt;. That's where it earns its keep. To give you a feel, comments that actually fire fall into categories like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;[Graph] Critical&lt;/strong&gt; — missing annotation that breaks an edge in the graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Impact] Critical&lt;/strong&gt; — a BQ MERGE statement referencing a column not present in the existing target table; would fail in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Doc] Critical&lt;/strong&gt; — code change that left related docs stale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Security] Minor&lt;/strong&gt; — &lt;code&gt;execSync&lt;/code&gt; doing string interpolation on an env var, opening a command injection vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't the surface-level pass you might picture when you hear "AI review." &lt;strong&gt;Comments here are produced with the entire codebase carried as context&lt;/strong&gt;, which is what the Product Graph integration buys you.&lt;/p&gt;

&lt;p&gt;The only PRs that actually need a human are the ones where AI review hits a hard case. Day-to-day PRs go from push to merge without anyone touching them.&lt;/p&gt;
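
&lt;p&gt;One way to picture the tagged comments listed above is as a small data model. This is a sketch under assumptions: the type names are invented, and the policy that a single Critical comment is enough to request changes is my illustration, not a documented cortex rule.&lt;/p&gt;

```typescript
// Hypothetical model of tagged review comments. The tag format mirrors the
// examples above; the rule "any Critical blocks the merge" is an assumption.
type Category = "Graph" | "Impact" | "Doc" | "Security";
type Severity = "Critical" | "Minor";

interface ReviewComment {
  category: Category;
  severity: Severity;
  message: string;
}

// Renders a comment the way the examples are labeled, e.g. "[Doc] Critical".
function label(comment: ReviewComment): string {
  return "[" + comment.category + "] " + comment.severity;
}

// Assumed policy: one Critical comment is enough to request changes.
function verdict(comments: ReviewComment[]): "APPROVE" | "REQUEST_CHANGES" {
  return comments.some((c) => c.severity === "Critical") ? "REQUEST_CHANGES" : "APPROVE";
}

const comments: ReviewComment[] = [
  { category: "Doc", severity: "Critical", message: "Related docs describe the old response shape." },
  { category: "Security", severity: "Minor", message: "Env var interpolated into a shell command." },
];

for (const c of comments) console.log(label(c) + ": " + c.message);
console.log(verdict(comments));
```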

&lt;h3&gt;
  
  
  ④ Self-Healing (Sensors — re-injecting production anomalies into the loop)
&lt;/h3&gt;

&lt;p&gt;Scene 2 above is exactly this. Starting from a Grafana alert, the AI traces the root cause through Product Graph + Loki + git blame, opens a fix PR, and pushes it through ③ Auto Review until it's auto-merged. &lt;strong&gt;Re-injecting anomalies into the loop&lt;/strong&gt; is the essence of Sensors. Details in a later post.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Makes It a Flywheel
&lt;/h3&gt;

&lt;p&gt;These 4 elements &lt;strong&gt;mutually reinforce one another&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;① Product Graph exists, so ③ Auto Review can comment with real impact awareness&lt;/li&gt;
&lt;li&gt;② Lint enforces the ground rules, so ③ Auto Review can assume "everything in the codebase meets the bar"&lt;/li&gt;
&lt;li&gt;③ Auto Review exists, so new code lands in ① Product Graph with correct semantic annotations&lt;/li&gt;
&lt;li&gt;④ Self-Healing's incidents loop back through ③, maintaining the quality bar all the way back to ①&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The harness's effectiveness scales with the size of the codebase&lt;/strong&gt;, not against it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supporting Foundations
&lt;/h3&gt;

&lt;p&gt;Three foundations make the 4 elements possible (covered in detail in Part 4):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tests and coverage&lt;/strong&gt;: ~630K lines of implementation, ~560K lines of tests (&lt;strong&gt;impl : test ≒ 1.13 : 1&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: ~110K lines / 389 files, written &lt;strong&gt;for both humans and AI&lt;/strong&gt;, also ingested as Document nodes in the Product Graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Frontend = Faro, backend = OTel, infrastructure and CI logs all consolidated in Grafana. &lt;strong&gt;The AI sees the same data humans see.&lt;/strong&gt; Gemini API token usage and cost are tracked separately in Prometheus.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Foundation
&lt;/h2&gt;

&lt;p&gt;cortex is a &lt;strong&gt;full-TypeScript monorepo&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Applications (&lt;code&gt;apps/&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TypeScript (Hono, TanStack Router, Vite, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared packages (&lt;code&gt;packages/&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure (&lt;code&gt;infra/&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TypeScript (&lt;strong&gt;Pulumi&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge (&lt;code&gt;worker/&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TypeScript (Cloudflare Workers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lint plugins&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doc scripts&lt;/td&gt;
&lt;td&gt;TypeScript (tsx)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Having everything in one language is &lt;strong&gt;a much bigger win when viewed from the AI's side&lt;/strong&gt; than from a human's. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You can feed the AI ASTs and type definitions directly as context&lt;/strong&gt; — no language boundary fragments the picture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactors don't cross language boundaries&lt;/strong&gt; — one ESLint plugin can inspect and auto-fix &lt;code&gt;apps/&lt;/code&gt;, &lt;code&gt;packages/&lt;/code&gt;, and &lt;code&gt;infra/&lt;/code&gt; together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges don't break in the Product Graph&lt;/strong&gt; — for example, a Cloud Run service definition (&lt;code&gt;infra/&lt;/code&gt;, TS) connects in a single graph to the Hono route (&lt;code&gt;apps/&lt;/code&gt;, TS) it actually invokes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you ask the AI "what does this change affect?", the reason it can hop &lt;code&gt;infra → apps → packages&lt;/code&gt; and answer in one round-trip is that all of this is one language.&lt;/p&gt;

&lt;p&gt;Build is parallelized via &lt;a href="https://turbo.build/" rel="noopener noreferrer"&gt;Turborepo&lt;/a&gt; and &lt;a href="https://pnpm.io/" rel="noopener noreferrer"&gt;pnpm workspaces&lt;/a&gt;. Deploys go through GitHub Actions, which &lt;strong&gt;detects only changed stacks&lt;/strong&gt; and applies them in parallel via Pulumi.&lt;/p&gt;
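
&lt;p&gt;The "detect only changed stacks" step can be approximated by matching changed file paths against the paths each stack depends on. The sketch below assumes a hand-written &lt;code&gt;StackDef&lt;/code&gt; list with invented stack names and prefixes; how cortex actually derives this mapping isn't covered here.&lt;/p&gt;

```typescript
// Hypothetical mapping from stack name to the path prefixes it depends on.
// Stack names and paths are invented; cortex's real detection isn't described.
interface StackDef {
  name: string;
  watches: string[];
}

const stacks: StackDef[] = [
  { name: "meeting-bot", watches: ["apps/meeting-bot/", "packages/slack-client/"] },
  { name: "kpi-portal", watches: ["apps/kpi-portal/", "packages/bq-client/"] },
  { name: "edge-router", watches: ["worker/router/"] },
];

// A stack is affected when any changed file sits under one of its prefixes.
function affectedStacks(changedFiles: string[], defs: StackDef[]): string[] {
  return defs
    .filter((def) => def.watches.some((prefix) => changedFiles.some((file) => file.startsWith(prefix))))
    .map((def) => def.name);
}

// In CI the file list would typically come from a git diff against the base.
const changed = ["packages/bq-client/src/query.ts", "docs/README.md"];
console.log(affectedStacks(changed, stacks));
```

&lt;p&gt;Only the stacks returned here would then be handed to the deploy step, so a docs-only change deploys nothing.&lt;/p&gt;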

&lt;h2&gt;
  
  
  Numbers (snapshot at time of writing)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcdo80auzk2j147fb9qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcdo80auzk2j147fb9qr.png" alt="Scale" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;5 months&lt;/strong&gt; (intensive development: ~4 months)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commits&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;4,000&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged PRs&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;790&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;% of commits authored by me&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;91%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apps&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;123&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;packages&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP servers&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pulumi stacks&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript (implementation)&lt;/td&gt;
&lt;td&gt;~630K lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript (tests)&lt;/td&gt;
&lt;td&gt;~560K lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown documentation&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;110K lines / 389 files&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;as any&lt;/code&gt; / TODO / unjustified lint-disable in hand-written code&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (excluding generated files / unavoidable external-library cases)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage gate&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;90%&lt;/strong&gt; (statements / branches / functions / lines)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The PR-flow Switch That Multiplied Throughput
&lt;/h3&gt;

&lt;p&gt;Up until April, &lt;strong&gt;I reviewed every change carefully on my own machine, with AI assistance, and then committed directly to main&lt;/strong&gt;. The review bar was the same, but throughput was bottlenecked on my hands.&lt;/p&gt;

&lt;p&gt;In April, switching to &lt;strong&gt;fine-grained, PR-based operation&lt;/strong&gt; (auto review → auto fix → auto merge) dramatically changed the per-month merged-PR count:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Month&lt;/th&gt;
&lt;th&gt;Merged PRs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-02&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-03&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2026-04&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;518&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05 (through the 10th)&lt;/td&gt;
&lt;td&gt;235&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A &lt;strong&gt;~22× jump&lt;/strong&gt; between March and April. Total commits actually went down (because committing directly to main was replaced by going through PRs), so this isn't "I wrote more code." This is "&lt;strong&gt;the manual review step got replaced by the harness, and the throughput ceiling moved&lt;/strong&gt;." &lt;strong&gt;The 22× is exactly the moment a human reviewer was swapped for Auto Review&lt;/strong&gt; — clean evidence of the flywheel property where the harness's effectiveness scales with codebase size.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Required for These Numbers to Hold
&lt;/h3&gt;

&lt;p&gt;These numbers are &lt;strong&gt;not explained by "we use AI" alone&lt;/strong&gt;. The prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full TypeScript monorepo&lt;/strong&gt; — code, tests, infrastructure, scripts all under one static-analysis system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable Architecture&lt;/strong&gt; — &lt;code&gt;packages/&lt;/code&gt; holds reusable parts; &lt;code&gt;apps/&lt;/code&gt; compose them. Direct imports between &lt;code&gt;apps/&lt;/code&gt; are forbidden — everything routes through &lt;code&gt;packages/&lt;/code&gt;. This is what guarantees components don't interfere with each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict quality gates&lt;/strong&gt;: lint, coverage, and annotation checks are enforced with no lowering of thresholds and no workarounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified graph&lt;/strong&gt; — code, docs, DB, infrastructure on a single graph as the foundation that lets the AI act with context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto PR review / auto fix / auto merge / auto self-healing&lt;/strong&gt; — the harness that swaps the rate-limiting manual step for AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified observability&lt;/strong&gt; — humans and AI see the same data (OTel + Faro + Prometheus)&lt;/li&gt;
&lt;/ul&gt;
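
&lt;p&gt;The "no direct imports between &lt;code&gt;apps/&lt;/code&gt;" rule from the list above is the kind of constraint that is easy to check mechanically. The following is a minimal sketch, assuming repo-relative, already-resolved paths; &lt;code&gt;appOf&lt;/code&gt; and &lt;code&gt;crossesAppBoundary&lt;/code&gt; are hypothetical helpers, and in practice such a check would more likely live inside a lint rule.&lt;/p&gt;

```typescript
// Hypothetical boundary check; a real setup would live in a lint rule.
// Paths are repo-relative and already resolved.

// Returns the app name for a path under apps/, or null otherwise.
function appOf(path: string): string | null {
  const parts = path.split("/");
  if (parts[0] !== "apps") return null;
  const name = parts[1];
  return name === undefined ? null : name;
}

// An import is a violation when it crosses from one app into another.
// Imports into packages/ (or within the same app) are allowed.
function crossesAppBoundary(importer: string, imported: string): boolean {
  const from = appOf(importer);
  const to = appOf(imported);
  if (from === null) return false;
  if (to === null) return false;
  return from !== to;
}

console.log(crossesAppBoundary("apps/kpi-portal/src/index.ts", "apps/meeting-bot/src/util.ts"));
console.log(crossesAppBoundary("apps/kpi-portal/src/index.ts", "packages/bq-client/src/index.ts"));
```

&lt;p&gt;The first call reports a violation and the second does not, since shared code is expected to be reached through &lt;code&gt;packages/&lt;/code&gt;.&lt;/p&gt;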

&lt;p&gt;The design has to be in place first, and AI runs on top of it. That's what makes both volume and quality possible at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Composable Architecture&lt;/strong&gt; in particular is what makes this output possible with a headcount of one. Because components don't interfere, &lt;strong&gt;multiple Claude Code sessions can run in parallel on different parts of the codebase&lt;/strong&gt;. In practice, I've run up to ~10 sessions in parallel at peak, and that parallelism multiplies with the harness's effectiveness.&lt;/p&gt;

&lt;p&gt;It's &lt;strong&gt;system design, not magic&lt;/strong&gt;. Each piece will get its own deep-dive in this series.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Honest Caveats
&lt;/h2&gt;

&lt;p&gt;If you've read this far, it might sound like everything runs perfectly on autopilot. It doesn't. Three things I want to be upfront about:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. High code quality doesn't prevent bugs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What the harness protects is &lt;strong&gt;"correctness of the code"&lt;/strong&gt; — not &lt;strong&gt;"correctness of the spec."&lt;/strong&gt; Even when implementation is clean, getting the spec interpretation wrong still ships bugs. AI review can catch "code contradicts the documented spec," but if the spec itself is wrong, the issue sails right through. That part is still a human responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The work is split deliberately.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;New pipelines that connect to external APIs, and anything touching sensitive data, are &lt;strong&gt;handled by engineers&lt;/strong&gt;. Non-engineers mostly work on &lt;strong&gt;modifications to features that already exist&lt;/strong&gt; (a look at our business-side members' PRs makes this concrete pretty quickly). &lt;strong&gt;"Non-engineers can develop too"&lt;/strong&gt; means &lt;strong&gt;"the harness provides rails they can't derail from, so they can safely make maintenance-style changes"&lt;/strong&gt;, not "anyone can build anything from scratch."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. This level of automation works because it's an internal platform.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, cortex's full-auto deploy works partly because Composable Architecture cleanly separates apps and infrastructure. But honestly, &lt;strong&gt;a big part of it is that this is an internal-only platform&lt;/strong&gt;. If something breaks, only employees are affected, and we can roll back fast. The same approach can't be applied directly to consumer products or to systems where downtime is immediately critical (warehouse management, for example). We've started work on closing that gap on the consumer side too, but that's a separate post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Roadmap
&lt;/h2&gt;

&lt;p&gt;The series is planned as 6 parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 1: Series Intro&lt;/strong&gt; (this post)&lt;br&gt;
   The big picture of what cortex is and why it works in "harness" form. The map to the rest of the series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Part 2: Product Graph — code, docs, DB, infrastructure as one unified graph&lt;/a&gt;&lt;/strong&gt; ★ recommended next&lt;br&gt;
   The implementation side: how the unified graph is built and maintained. What happens when you take the design principles from &lt;a href="https://dev.to/ryantsuji/graph-rag-isnt-a-one-shot-anymore-the-case-for-agentic-graph-rag-mcps-1dj5"&gt;the Agentic Graph RAG MCP post&lt;/a&gt; and apply them to the entire cortex codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3: AI reviews, fixes, merges, and deploys PRs&lt;/strong&gt;&lt;br&gt;
   GitHub webhook → AI review → on REQUEST_CHANGES, AI fixes via worktree → auto squash merge → changed-stack detection → parallel deploy: the full pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 4: Incidents self-heal, guardrails self-strengthen&lt;/strong&gt;&lt;br&gt;
   Grafana alert → AI investigation (Loki + Product Graph + git blame) → fix PR + new lint/type gate → auto merge → automatic redeploy: the auto self-healing system. Also covers the full OTel + Loki + Mimir + Tempo + Faro stack, Gemini cost tracking, and how the quality gates are designed to be "non-lowerable, non-bypassable, and self-growing."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 5: Scaling the harness from cortex to toC services&lt;/strong&gt;&lt;br&gt;
   The first half covers how business members can already open PRs directly to cortex, and where that breaks (additions to existing pipelines work; new pipelines and architectural changes still need humans in the loop). The second half is the roadmap and the thinking behind scaling cortex's harness across the whole product org (multiple services, multiple infra stacks, multiple teams).&lt;/p&gt;

&lt;p&gt;Each post stands on its own, but &lt;strong&gt;Part 2 (Product Graph) is the foundation for the others&lt;/strong&gt;, so the recommended reading order is Part 1 → Part 2 → any.&lt;/p&gt;

&lt;p&gt;Cadence: Tuesdays or Thursdays, 8–10 AM JST.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Building cortex, what's struck me is that &lt;strong&gt;in an AI-era dev environment, "absorbing everything that comes after the writing" wins over "reducing the burden on the writer"&lt;/strong&gt;. Tests, lint, types, coverage, code review, incident response — instead of "these get in the way, let's reduce them," the choice that worked was "&lt;strong&gt;have the AI do all of them, without compromise&lt;/strong&gt;." The counterintuitive result is that quality and dev speed both go up at the same time.&lt;/p&gt;

&lt;p&gt;And it expands two things — &lt;strong&gt;how much one engineer can ship&lt;/strong&gt;, and &lt;strong&gt;how much non-engineers can participate&lt;/strong&gt; — well beyond what was possible before. That's the texture of the "harness" we've built on top of cortex.&lt;/p&gt;

&lt;p&gt;In subsequent parts, I'll walk through the individual mechanisms that make this work.&lt;/p&gt;

&lt;p&gt;→ Part 2: &lt;a href="https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm"&gt;Product Graph — code, docs, DB, infrastructure as one unified graph&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>graphrag</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Graph RAG Isn't a One-Shot Anymore — The Case for Agentic Graph RAG MCPs</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Thu, 07 May 2026 09:57:32 +0000</pubDate>
      <link>https://dev.to/ryantsuji/graph-rag-isnt-a-one-shot-anymore-the-case-for-agentic-graph-rag-mcps-1dj5</link>
      <guid>https://dev.to/ryantsuji/graph-rag-isnt-a-one-shot-anymore-the-case-for-agentic-graph-rag-mcps-1dj5</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;p&gt;Over my last few posts, I've introduced internal MCP servers we've been building: &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;DB Graph MCP&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;the full picture of our 17 internal MCP servers&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/we-built-a-custom-graph-rag-to-let-ai-answer-did-that-initiative-actually-work-3oda"&gt;Biz Graph&lt;/a&gt;, and &lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;Sandbox MCP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;DB Graph is built from ORM parsing. Biz Graph extracts initiatives from meeting slides and uses a hand-designed Week node structure. Sandbox MCP is an app deployment platform. The purposes and implementations are completely different — but as I was writing each piece, I noticed that &lt;strong&gt;the design ideas at the root are the same&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post is about that root. &lt;strong&gt;Agentic Graph RAG&lt;/strong&gt; — a design frame we keep coming back to whenever we build graphs across different domains.&lt;/p&gt;

&lt;p&gt;If you've heard "Graph RAG" before — maybe Microsoft's open-source project — wait a moment. The same words mean different things in &lt;strong&gt;the era when retrieval was assumed to be a single shot&lt;/strong&gt; versus &lt;strong&gt;the era when AI agents are everywhere&lt;/strong&gt;. The optimal design changes completely. This post is about the latter — a new way to think about Graph RAG in a world where Claude Code, Codex, and friends are doing the orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is RAG, Really?
&lt;/h2&gt;

&lt;p&gt;Quick refresher. Skip if this is familiar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; is the umbrella term for any technique that &lt;strong&gt;retrieves&lt;/strong&gt; related information from external data and mixes it into the prompt before the LLM generates an answer.&lt;/p&gt;

&lt;p&gt;Why was this needed? In the early days of generative AI — late 2022 and through 2023 — we ran into three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tiny context windows&lt;/strong&gt;: GPT-3.5 had 4K tokens, early GPT-4 had 8K. You couldn't fit your internal docs in there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stale model knowledge&lt;/strong&gt;: The model didn't know anything past its training cutoff. It certainly didn't know your internal data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination&lt;/strong&gt;: It would confidently fabricate answers when it didn't know.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The RAG idea was: &lt;strong&gt;every time&lt;/strong&gt; the user asks something, fetch the relevant chunks from external data and feed them in before generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector RAG — The First Practical Answer
&lt;/h2&gt;

&lt;p&gt;The earliest RAG implementation that actually caught on was &lt;strong&gt;Vector RAG&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The recipe is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split documents into small chunks (say, 500 tokens each)&lt;/li&gt;
&lt;li&gt;Embed each chunk with a model (e.g., 1536-dim vectors)&lt;/li&gt;
&lt;li&gt;Store them in a vector DB (Pinecone, Weaviate, pgvector...)&lt;/li&gt;
&lt;li&gt;Embed the user's question with the same model, retrieve the top-k closest by cosine similarity&lt;/li&gt;
&lt;li&gt;Stuff those chunks into the prompt and call the LLM&lt;/li&gt;
&lt;/ol&gt;
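
&lt;p&gt;The five steps above can be sketched in a few lines of Python. This is a toy illustration, not a production setup: &lt;code&gt;embed&lt;/code&gt; is a bag-of-words counter standing in for a real embedding model, a plain list stands in for the vector DB, and the sample documents and function names are made up for the example.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def chunk_words(text, size):
    """Step 1: split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Step 2 (toy): a bag-of-words count vector instead of a dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question, index, k):
    """Step 4: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


docs = [
    "The spring SNS ad campaign ran on two social platforms",
    "New member signups are reported weekly by the growth team",
    "Warehouse receipt records track returned packages",
]
question = "How did the SNS ad campaign go?"

# Steps 1-3: chunk, embed, and store (the list plays the role of the vector DB).
index = [(c, embed(c)) for d in docs for c in chunk_words(d, 50)]
# Steps 4-5: retrieve once, then build the prompt that would go to the LLM.
hits = top_k(question, index, k=1)
prompt = "Context:\n" + "\n".join(hits) + "\n\nQuestion: " + question
```

&lt;p&gt;With these three sample documents, the campaign sentence ranks first because it shares the most tokens with the question. Note that nothing here knows when the campaign ran or what the signup numbers were; retrieval is purely "which text looks like the question."&lt;/p&gt;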

&lt;p&gt;For its time, this was a great invention. Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search is fast&lt;/strong&gt;: tens to hundreds of milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No training needed&lt;/strong&gt;: feed it docs, it's instantly searchable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-agnostic&lt;/strong&gt;: works for legal documents, medical charts, internal wikis — the same machinery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rides model improvements&lt;/strong&gt;: better embedding models, better recall&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And critically, agent technology was still immature. OpenAI's Function Calling shipped in June 2023, was unstable for a while, and running a meaningful &lt;strong&gt;agentic loop&lt;/strong&gt; of multiple tool calls was both slow and expensive. So RAG was designed around the assumption: &lt;strong&gt;one retrieval has to fetch everything you need&lt;/strong&gt;. Vector RAG was perfectly tuned for this constraint.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Limits of Vector RAG
&lt;/h3&gt;

&lt;p&gt;But anyone who runs Vector RAG in production discovers the same thing fast: &lt;strong&gt;it can't follow relationships&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Take a question like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How did last month's SNS ad campaign affect new member signups?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vector search returns chunks that are &lt;strong&gt;textually similar&lt;/strong&gt; to the question. The campaign description might come up. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When&lt;/strong&gt; was the campaign actually running?&lt;/li&gt;
&lt;li&gt;What were the new-member numbers during &lt;strong&gt;that same period&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What happened with &lt;strong&gt;previous similar campaigns&lt;/strong&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't textual similarity — they're structural traversals across data. Embedding maps "spring SNS ads" and "spring promotion initiative" close together, but it cannot &lt;strong&gt;start from "ran from March 1 to March 31" and reach "new member counts in that same period"&lt;/strong&gt;. That's not a similarity problem; that's a join problem.&lt;/p&gt;

&lt;p&gt;On top of that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunk boundaries kill context&lt;/strong&gt;: related info gets split across chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-k cliff&lt;/strong&gt;: critical info at rank 11 is invisible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granularity mismatch&lt;/strong&gt;: questions like "summarize the whole thing" can't be answered by collecting chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vector RAG nailed "fetch text similar to the question in one step." It's weak at "follow data through structural relationships." That's the gap that Graph RAG was born to address.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph RAG — Search That Follows Relationships
&lt;/h2&gt;

&lt;p&gt;The basic idea of Graph RAG: extract &lt;strong&gt;entities&lt;/strong&gt; (people, organizations, concepts) and &lt;strong&gt;relationships&lt;/strong&gt; (belongs-to, affects, references) from your documents, store them as a graph, and at query time traverse the graph to gather information across multiple hops.&lt;/p&gt;

&lt;p&gt;This handles questions like our SNS-ads-and-new-members example — anything that requires &lt;strong&gt;multi-hop reasoning&lt;/strong&gt;.&lt;/p&gt;
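
&lt;p&gt;To make "traverse the graph across multiple hops" concrete, here is a minimal sketch. The node IDs, edge types, and the &lt;code&gt;traverse&lt;/code&gt; helper are all hypothetical; they only illustrate how a question about a campaign turns into a chain of edge lookups instead of a similarity search.&lt;/p&gt;

```python
# Hypothetical graph: every ID and edge type below is invented for illustration.
EDGES = {
    ("campaign:spring_sns", "ran_during"): ["period:march"],
    ("period:march", "has_metric"): ["metric:new_members:march"],
    ("campaign:spring_sns", "similar_to"): ["campaign:winter_sns"],
    ("campaign:winter_sns", "ran_during"): ["period:december"],
    ("period:december", "has_metric"): ["metric:new_members:december"],
}


def follow(node_id, edge_type):
    """One hop: return the nodes reachable from node_id via edge_type."""
    return EDGES.get((node_id, edge_type), [])


def traverse(start, path):
    """Multi-hop: apply each edge type in `path` in order, starting from `start`."""
    frontier = [start]
    for edge_type in path:
        frontier = [nxt for node in frontier for nxt in follow(node, edge_type)]
    return frontier


# "What were the new-member numbers during that same period?"
same_period = traverse("campaign:spring_sns", ["ran_during", "has_metric"])

# "What happened with previous similar campaigns?"
previous = traverse("campaign:spring_sns", ["similar_to", "ran_during", "has_metric"])
```

&lt;p&gt;Each question from the earlier list becomes a path of edge types. The lookup is exact at every hop, which is the property vector similarity cannot give you.&lt;/p&gt;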

&lt;h3&gt;
  
  
  Classical Graph RAG — Built for the One-Shot Era
&lt;/h3&gt;

&lt;p&gt;The best-known implementation right now is Microsoft's &lt;a href="https://github.com/microsoft/graphrag" rel="noopener noreferrer"&gt;GraphRAG&lt;/a&gt;, released in 2024. The papers are well written, and I have a lot of respect for the work. But the design philosophy is squarely &lt;strong&gt;from the one-shot retrieval era&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Roughly, Microsoft GraphRAG does this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Entity extraction&lt;/strong&gt;: feed the entire corpus through an LLM to extract entities and relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community detection&lt;/strong&gt;: find graph clusters (communities) using the &lt;a href="https://en.wikipedia.org/wiki/Leiden_algorithm" rel="noopener noreferrer"&gt;Leiden algorithm&lt;/a&gt; (a community detection method)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical summarization&lt;/strong&gt;: have the LLM summarize each community. Then summarize groups of communities into higher-level summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query time&lt;/strong&gt;: pick the relevant community for the user's question, dump its summary into the prompt, answer in a single shot&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why is the preprocessing this heavy? Because of the assumption underneath: &lt;strong&gt;"calling tools many times at query time isn't realistic"&lt;/strong&gt;. Function calling loops were slow, expensive, and unstable. So you preprocess the entire corpus with an LLM, build community summaries, and &lt;strong&gt;front-load the work to make query-time retrieval a single hop or two&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This wasn't a design failure — it was the &lt;strong&gt;rational answer for that era&lt;/strong&gt;. LangChain's RetrievalQA, LlamaIndex's query engines — all of them were built on the same premise: "retrieval is single-shot, generation is one-turn."&lt;/p&gt;

&lt;h3&gt;
  
  
  What Classical Graph RAG Solved, and Didn't
&lt;/h3&gt;

&lt;p&gt;What it solved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relationship-aware search (community summaries even cover "the big picture")&lt;/li&gt;
&lt;li&gt;Multi-hop questions like "the relationship between Sam Altman, OpenAI, and Microsoft"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it didn't solve cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Construction is expensive&lt;/strong&gt;: extracting entities from a large corpus via LLM costs real money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema is at the LLM's mercy&lt;/strong&gt;: the entities and relationships extracted are whatever the LLM thinks. This works fine for public-knowledge corpora (papers, news, etc.), but for domains that lean on internal tacit knowledge, the extracted units don't always match what's meaningful for the business&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updates are heavy&lt;/strong&gt;: every new document means recomputing communities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sometimes off-target&lt;/strong&gt;: community summaries get over-abstracted, and the specific information you actually need falls out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honest disclaimer: I haven't seriously run classical Graph RAG in production myself. By the time I started building graph-based MCPs in our company, Claude Code was already running on my laptop, and I started from a world where &lt;strong&gt;agents calling tools many times was the default&lt;/strong&gt;. As a result, I never actually needed the heavy "compress the answer ahead of time" preprocessing of community summaries. If AI can re-fetch as many times as needed, the graph just has to hold the facts accurately.&lt;/p&gt;

&lt;p&gt;The flip side: if I had been doing this in 2023, I likely would have ended up on the same path as community summaries. The problems classical Graph RAG was solving are real — &lt;strong&gt;the underlying assumptions just changed faster than the design&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things Changed — The Agentic Era
&lt;/h2&gt;

&lt;p&gt;From late 2024 through 2025, the landscape shifted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade agents arrived&lt;/strong&gt;: Claude Code, OpenAI Codex — agents that can run long tasks while orchestrating their own tool calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP (Model Context Protocol) landed&lt;/strong&gt;: tool descriptions became a standardized contract the model can read&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-use accuracy from Sonnet/Opus-class models&lt;/strong&gt;: "pick the right tool from 20" became reliable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long context windows + prompt caching&lt;/strong&gt;: stacking many tool calls in a session is now economically reasonable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;stop_reason: tool_use&lt;/code&gt; as a natural loop&lt;/strong&gt;: the model itself decides "I have enough info" or "I need to look more"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When all of these line up, &lt;strong&gt;the assumption "we can't afford retrieval as a loop" no longer holds&lt;/strong&gt;. Five tool calls per session, ten, twenty — that's now the norm.&lt;/p&gt;

&lt;p&gt;The constraint Microsoft GraphRAG was designed against — "loops are expensive at query time" — has dissolved.&lt;/p&gt;

&lt;p&gt;This isn't to say Microsoft GraphRAG is "outdated." It was the right answer for its constraints. The constraints just changed, and &lt;strong&gt;so does the optimal answer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic Graph RAG — Deterministic Retrieval, AI-Driven Orchestration
&lt;/h2&gt;

&lt;p&gt;Here's the thesis. In one line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Each retrieval step is deterministic. Only the orchestration is AI.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw13x9j8mxqbd2hecdcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw13x9j8mxqbd2hecdcz.png" alt="The three eras of RAG" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For context: "Agentic Graph RAG" isn't a term I coined. Neo4j's &lt;a href="https://neo4j.com/videos/nodes-ai-2026-agentic-graphrag-autonomous-knowledge-graph-construction-and-adaptive-retrieval-2/" rel="noopener noreferrer"&gt;NODES AI 2026&lt;/a&gt; featured a session titled "Agentic GraphRAG," and O'Reilly is publishing &lt;a href="https://www.oreilly.com/library/view/agentic-graph-rag/9798341623163/" rel="noopener noreferrer"&gt;Agentic GraphRAG&lt;/a&gt; by Anthony Alcaraz and Sam Julien in November 2026. The industry as a whole is pivoting from "one-shot Graph RAG" toward "agent-driven Graph RAG." This article is my attempt to put words around the design we'd been arriving at independently inside our company.&lt;/p&gt;

&lt;p&gt;That said, when "Agentic GraphRAG" is used in public contexts, the dominant framing centers on &lt;strong&gt;agents automating the graph construction itself&lt;/strong&gt; (Neo4j's talk above is in that lineage). What this article takes from that broader idea is specifically &lt;strong&gt;the query-side agentic pattern&lt;/strong&gt;. We still hand-design the graphs because the domains we target (internal DB schemas, initiatives × KPIs, codebases) lean heavily on internal tacit knowledge — for now, hand-designing produces better results in practice. We aren't rejecting auto-construction in principle; we're applying the query-side concept to graphs we still build by hand.&lt;/p&gt;

&lt;p&gt;Vector RAG had &lt;strong&gt;probabilistic retrieval&lt;/strong&gt;. Embedding cosine is an approximation, and it sometimes misses. Hallucination starts at the retrieval layer.&lt;/p&gt;

&lt;p&gt;Classical Graph RAG &lt;strong&gt;runs retrieval once at query time&lt;/strong&gt;. Heavy preprocessing prepares "the answer itself" in advance, and at query time you just look it up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic Graph RAG sits between these two.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The graph is &lt;strong&gt;designed by humans&lt;/strong&gt;. Our domains lean on internal tacit knowledge, so humans deciding "this is the granularity I want to slice the data with" produces better results.&lt;/li&gt;
&lt;li&gt;Each tool call is &lt;strong&gt;deterministic&lt;/strong&gt;. Pass an ID and you get the connected nodes and edges. There's no embedding wiggle.&lt;/li&gt;
&lt;li&gt;The AI only judges &lt;strong&gt;which tool to call next, what ID to pass in, and when to stop&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: &lt;strong&gt;errors get localized&lt;/strong&gt;. Retrieval itself is deterministic, so the only places to be wrong are "AI picked the wrong starting point" or "AI stopped too early." The data in the response is the truth.&lt;/p&gt;
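
&lt;p&gt;A minimal sketch of this split, under loud assumptions: &lt;code&gt;get_neighbors&lt;/code&gt; is a hypothetical deterministic tool over a made-up graph, and &lt;code&gt;stub_policy&lt;/code&gt; is a hard-coded stand-in for the model's judgment (which ID to pass next, when to stop). In a real setup the policy is the LLM; the tool stays a plain lookup.&lt;/p&gt;

```python
# Hypothetical graph; node names are invented for the example.
GRAPH = {
    "delivery": ["linkage"],
    "linkage": ["shipping_order"],
    "shipping_order": ["receipt_record"],
    "receipt_record": [],
}


def get_neighbors(node_id):
    """Deterministic tool: the same ID always returns the same connected nodes."""
    return list(GRAPH[node_id])


def stub_policy(goal, observations):
    """Stand-in for the model: pick the next ID to pass in, or None to stop."""
    last_node, neighbors = observations[-1]
    if last_node == goal or not neighbors:
        return None
    return neighbors[0]


def run_loop(start, goal, max_steps=10):
    """Orchestration loop: the policy decides, the tool only returns facts."""
    observations = [(start, get_neighbors(start))]
    for _ in range(max_steps):
        next_id = stub_policy(goal, observations)
        if next_id is None:
            break
        observations.append((next_id, get_neighbors(next_id)))
    return [node for node, _ in observations]


visited = run_loop("delivery", "receipt_record")
```

&lt;p&gt;The two failure modes named above map directly onto this sketch: a bad &lt;code&gt;start&lt;/code&gt; argument, or a policy that returns &lt;code&gt;None&lt;/code&gt; too soon. Nothing inside &lt;code&gt;get_neighbors&lt;/code&gt; can be wrong in a probabilistic sense.&lt;/p&gt;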

&lt;h2&gt;
  
  
  Tool Return Values Become a Runbook
&lt;/h2&gt;

&lt;p&gt;The most important design move in Agentic Graph RAG: &lt;strong&gt;the tool's return value tells the AI what to do next&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fofdk33xhoetnpswdeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fofdk33xhoetnpswdeb.png" alt="Tool return values become the next instruction" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is different from a regular API. Regular APIs answer the question they were asked. MCP tools are &lt;strong&gt;in conversation with an AI&lt;/strong&gt;. The other side of the conversation needs not just an "answer" but &lt;strong&gt;candidates for the next move&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Concrete example.&lt;/p&gt;

&lt;p&gt;When the AI calls DB Graph MCP's &lt;code&gt;search_tables&lt;/code&gt; tool, it gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 tables matched (vector similarity ranked):

warehouse.return_package_table (postgresql) (distance: 0.2557)
warehouse.receipt_record_table (postgresql) (distance: 0.2720)
inventory.receipt_confirmation_table (mysql) (distance: 0.2921)
warehouse.receipt_record_detail_table (postgresql) (distance: 0.2951)
app.return_status_change_history_table (mysql) (distance: 0.3170)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;※ Schema and table names are anonymized — they map to internal system names.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice that &lt;strong&gt;the response itself contains the next tool's argument&lt;/strong&gt;. The qualified name &lt;code&gt;warehouse.receipt_record_table&lt;/code&gt; is exactly what &lt;code&gt;get_table_detail(table_name: "warehouse.receipt_record_table")&lt;/code&gt; expects. If the AI decides "let me look at the details," it just copy-pastes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;get_table_detail&lt;/code&gt; response is even more direct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# warehouse.receipt_record_table&lt;/span&gt;
DB: POSTGRESQL / ORM: typeorm / Repo: warehouse-api

&lt;span class="gu"&gt;## Columns (9)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; id: int [PK, AI, NOT NULL]
&lt;span class="p"&gt;-&lt;/span&gt; shipping_order_id: varchar [NOT NULL]
&lt;span class="p"&gt;-&lt;/span&gt; status: enum [NOT NULL, default=IN_PROGRESS]
&lt;span class="p"&gt;-&lt;/span&gt; ...

&lt;span class="gu"&gt;## References (2)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; shipping_order_id → warehouse.shipping_order_table.id (explicit)
&lt;span class="p"&gt;-&lt;/span&gt; operator_id → warehouse.user_table.id (explicit)

&lt;span class="gu"&gt;## Enum / Status Definitions (2)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Status: COMPLETE = received, IN_PROGRESS = in progress
&lt;span class="p"&gt;-&lt;/span&gt; Type: RENTAL_RETURN = rental return, ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This response implicitly tells the AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"The meaning of &lt;code&gt;status&lt;/code&gt; is in the Enum definition"&lt;/strong&gt; → don't guess, read it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"There are FK references"&lt;/strong&gt; → if needed, you can follow them with &lt;code&gt;trace_relationships&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"There's no direct FK to the &lt;code&gt;app&lt;/code&gt; schema"&lt;/strong&gt; → you'll need a different path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, &lt;strong&gt;the tool's response is a runbook for the AI&lt;/strong&gt;. The AI reads it and assembles the next move on its own.&lt;/p&gt;

&lt;p&gt;Now look at the response from &lt;code&gt;sql_query_database&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**app**&lt;/span&gt; (staging) — 1 row

| id     | status   | warehouse_order_code |
|--------|----------|----------------------|
| 98765  | RETURNED | SO-2026-00012345     |
&lt;span class="gt"&gt;
&amp;gt; **Table**: Manages the full lifecycle of delivery orders...&lt;/span&gt;

&lt;span class="gu"&gt;### Column descriptions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**status**&lt;/span&gt;: Delivery status (1=awaiting shipment, 2=ready, 3=delivered, 4=returned, ...)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**warehouse_order_code**&lt;/span&gt;: Link code to the warehouse-side shipping order

&lt;span class="gu"&gt;### Related tables&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; → &lt;span class="gs"&gt;**app.member_table**&lt;/span&gt; (user_id → id)
&lt;span class="p"&gt;-&lt;/span&gt; → &lt;span class="gs"&gt;**app.plan_master**&lt;/span&gt; (plan_id → id)
&lt;span class="p"&gt;-&lt;/span&gt; ← &lt;span class="gs"&gt;**app.order_history_table**&lt;/span&gt; (delivery_id → id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Column descriptions and related tables are auto-attached below the query result.&lt;/strong&gt; This is composed dynamically from the graph data we cached in BQ. Reading that "warehouse_order_code links to the warehouse side," the AI immediately decides "next, look up the warehouse table by this code."&lt;/p&gt;

&lt;p&gt;Nobody had to tell the AI "now look at warehouse." &lt;strong&gt;The response itself is the instruction.&lt;/strong&gt;&lt;/p&gt;
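
&lt;p&gt;One way to get this behavior is to have the tool append graph metadata to every result. The sketch below is an assumption about how such a formatter could look, not the actual DB Graph MCP implementation: &lt;code&gt;TABLE_META&lt;/code&gt;, &lt;code&gt;render_response&lt;/code&gt;, and the table name &lt;code&gt;app.delivery_table&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Hypothetical metadata, standing in for graph data cached ahead of time.
TABLE_META = {
    "app.delivery_table": {
        "columns": {
            "status": "Delivery status",
            "warehouse_order_code": "Link code to the warehouse-side shipping order",
        },
        "related": ["app.member_table (user_id → id)"],
    }
}


def render_response(table, rows):
    """Return query rows followed by descriptions and links for the columns shown."""
    meta = TABLE_META[table]
    lines = [f"{table}: {len(rows)} row(s)"]
    for row in rows:
        lines.append(" | ".join(f"{key}={value}" for key, value in row.items()))
    # Only describe columns that actually appear in the result.
    shown = rows[0].keys() if rows else []
    described = [c for c in shown if c in meta["columns"]]
    if described:
        lines.append("Column descriptions:")
        lines.extend(f"- {c}: {meta['columns'][c]}" for c in described)
    if meta["related"]:
        lines.append("Related tables:")
        lines.extend(f"- {r}" for r in meta["related"])
    return "\n".join(lines)


rows = [{"id": 98765, "warehouse_order_code": "SO-2026-00012345"}]
response = render_response("app.delivery_table", rows)
```

&lt;p&gt;The design choice is that the hint travels with the data. The caller never asks for column descriptions; they arrive because the tool assumes its reader is an agent deciding what to do next.&lt;/p&gt;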

&lt;h2&gt;
  
  
  DB Graph in Action — A Production Investigation in 4 Steps
&lt;/h2&gt;

&lt;p&gt;Here's the full flow (also shown in the &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;DB Graph MCP article&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The scenario: a CS agent asks, "This member shows 'returned' in the app, but did the warehouse actually confirm receipt?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Find tables in natural language (vector-similarity entry-point search)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;search_tables(query: "return processing confirmation", search_type: "semantic")
→ warehouse.receipt_record_table, warehouse.return_package_table, ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Look at the details (deterministic detail retrieval)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get_table_detail(table_name: "warehouse.receipt_record_table")
→ status=COMPLETE means "warehouse received it"
→ shipping_order_id connects to warehouse.shipping_order_table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Find the path to the other schema (deterministic graph traversal)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trace_relationships(table_name: "warehouse.shipping_order_table", direction: "both")
→ from the app side, connection goes through an intermediate table
search_tables(query: "warehouse linkage")
→ app.warehouse_linkage_table (warehouse_order_code maps to warehouse.shipping_order_table.code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Verify against real data (deterministic query execution)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sql_query_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"SELECT ... WHERE user_id=12345 AND status='RETURNED'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;warehouse_order_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"SO-2026-00012345"&lt;/span&gt;

&lt;span class="n"&gt;sql_query_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"warehouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"SELECT ... WHERE code='SO-2026-00012345'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;receive_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;COMPLETE&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;confirmed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;warehouse&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The crucial part: &lt;strong&gt;the AI built this 4-step flow autonomously&lt;/strong&gt;. The human only asked the original question. Each step's response carried "look here next" inside it, so the AI could keep composing the next call correctly.&lt;/p&gt;

&lt;p&gt;And &lt;strong&gt;each step's retrieval is deterministic&lt;/strong&gt;. The enum definitions for &lt;code&gt;status&lt;/code&gt; in &lt;code&gt;warehouse.receipt_record_table&lt;/code&gt; are facts pulled from the graph — not values the AI invented. &lt;code&gt;warehouse_order_code = SO-2026-00012345&lt;/code&gt; is real data — not an ID the AI fabricated.&lt;/p&gt;

&lt;p&gt;This is a different experience from both Vector RAG and classical Graph RAG. Vector RAG is "return all the text in one shot," but hallucinations slip in. Classical Graph RAG is "return the community summary in one shot," but specifics get lost in summarization. Agentic Graph RAG is "&lt;strong&gt;fetch as many times as you need, but every fetch returns nothing but facts&lt;/strong&gt;."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Same Pattern, Across Many Graphs
&lt;/h2&gt;

&lt;p&gt;The pattern we adopt, &lt;strong&gt;human-designed graph + deterministic retrieval tools + responses that double as AI runbooks&lt;/strong&gt;, isn't limited to DB Graph and Biz Graph. We use it across many MCP servers internally.&lt;/p&gt;

&lt;p&gt;Including the ones I mentioned by name in &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;the 17 internal MCP servers post&lt;/a&gt;, the lineup looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Graph&lt;/th&gt;
&lt;th&gt;What it covers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DB Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;991 tables × 15 schemas across the company&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Biz Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5,000+ initiatives × 4,000+ KPIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Functions, APIs, events across all repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cortex Product Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code + DB + docs + infra unified for the cortex repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Product Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API → DB dependencies per service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The structures are all different. DB Graph from ORM parsing. Biz Graph from meeting-slide extraction plus hand-designed MetricDomain. Code Graph from static analysis. Product Graph from JSDoc annotations on top of everything else. Different sources, different assembly.&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;the shape from the MCP-tool side is identical&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Entry-point search&lt;/strong&gt;: vector or substring to find "around here" (the only place fuzziness is allowed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detail retrieval&lt;/strong&gt;: pass an ID, get facts (deterministic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationship traversal&lt;/strong&gt;: jump from ID to ID along edges (deterministic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embed next-step hints in responses&lt;/strong&gt;: related IDs, enum definitions, annotations, links&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This &lt;strong&gt;3+1&lt;/strong&gt; template is the universal Agentic Graph RAG shape. Different graph internally, identical surface. From the AI side, &lt;strong&gt;they all feel the same&lt;/strong&gt; — Claude Code uses DB Graph and Code Graph and Product Graph with the same "search → drill down → traverse" rhythm.&lt;/p&gt;
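
&lt;p&gt;As a rough illustration of the 3+1 template, here is a minimal in-memory sketch. The table names echo the earlier walkthrough, but the function names (&lt;code&gt;searchEntries&lt;/code&gt;, &lt;code&gt;getEntryDetail&lt;/code&gt;, &lt;code&gt;traverseEdges&lt;/code&gt;), the data, and the response fields are hypothetical and not cortex's actual API. A real server would sit on a graph store and expose these as MCP tools.&lt;/p&gt;

```typescript
// Hypothetical toy graph: two table nodes and one edge. Not cortex's real schema.
type GraphEntry = { id: string; summary: string; enums: { [column: string]: string[] } };
type GraphEdge = { from: string; to: string; via: string };

const graphEntries: GraphEntry[] = [
  { id: "app.warehouse_linkage_table", summary: "warehouse linkage between app and warehouse", enums: {} },
  // The enum values below are made up for illustration.
  { id: "warehouse.shipping_order_table", summary: "shipping orders", enums: { receive_status: ["PENDING", "COMPLETE"] } },
];
const graphEdges: GraphEdge[] = [
  { from: "app.warehouse_linkage_table", to: "warehouse.shipping_order_table", via: "warehouse_order_code" },
];

// 1. Entry-point search: the only fuzzy step (substring match here; could be vector search).
function searchEntries(query: string): string[] {
  const needle = query.toLowerCase();
  return graphEntries.filter((e) => e.summary.toLowerCase().includes(needle)).map((e) => e.id);
}

// 2 (+1). Detail retrieval: exact ID in, stored facts out, plus next-step hints.
function getEntryDetail(id: string) {
  const entry = graphEntries.find((e) => e.id === id);
  if (entry === undefined) {
    return { found: false, hint: "Unknown ID. Call searchEntries first." };
  }
  return { found: true, entry, relatedIds: traverseEdges(id) };
}

// 3. Relationship traversal: exact ID in, neighbouring IDs out. No similarity involved.
function traverseEdges(id: string): string[] {
  const outgoing = graphEdges.filter((e) => e.from === id).map((e) => e.to);
  const incoming = graphEdges.filter((e) => e.to === id).map((e) => e.from);
  return outgoing.concat(incoming);
}

const hits = searchEntries("warehouse linkage");
const detail = getEntryDetail(hits[0]);
console.log(hits, JSON.stringify(detail));
```

&lt;p&gt;Only the first function tolerates fuzziness. The other two either return what is stored for an exact ID or say that the ID is unknown, which is the property the template depends on.&lt;/p&gt;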

&lt;p&gt;Of the graphs above, only DB Graph and Biz Graph have dedicated deep-dive posts so far. Code Graph and the Product Graph family will get their own writeups; for this post, they're listed as fellow examples of the pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Designer's Checklist
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For implementers.&lt;/strong&gt; Below are the six things I always keep top of mind when adapting Agentic Graph RAG to a new domain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Things I keep top of mind when building an Agentic Graph RAG:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Choose the graph-construction method based on the domain
&lt;/h3&gt;

&lt;p&gt;If the domain leans on internal tacit knowledge, &lt;strong&gt;humans deciding the nodes and edges&lt;/strong&gt; produces better results. Sometimes you intentionally design a structure that doesn't exist naturally — Biz Graph's "Week node" and "MetricDomain" are examples. &lt;strong&gt;The design is what determines quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Conversely, when the domain is mostly public knowledge (papers, news, public docs), having agents automate construction is a strong option (the Neo4j talk lineage). This article assumes the former.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Make retrieval deterministic
&lt;/h3&gt;

&lt;p&gt;The entry-point search may use vector similarity (to accept natural-language queries). After that, "get details by ID" and "follow relationships from this ID" must always return &lt;strong&gt;definite values via graph traversal&lt;/strong&gt;. Using similarity here lets hallucination back into the retrieval layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Tool granularity: search → detail → traverse
&lt;/h3&gt;

&lt;p&gt;Don't pile everything into one giant tool. Split into search-style entry points, detail lookups, and traversal/data tools. The AI understands the difference and uses them appropriately.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Tool descriptions are AI runbooks
&lt;/h3&gt;

&lt;p&gt;Write tool descriptions as &lt;strong&gt;execution guides for the AI&lt;/strong&gt;, not human documentation. "If you see this kind of response, call this tool next." "In this situation, format the argument like this." As I mentioned in &lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;the Sandbox MCP post&lt;/a&gt;, this directly determines how smart the agent appears.&lt;/p&gt;
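
&lt;p&gt;To make "description as runbook" concrete, here is a sketch of what such a tool definition might look like. The &lt;code&gt;name&lt;/code&gt; / &lt;code&gt;description&lt;/code&gt; / &lt;code&gt;inputSchema&lt;/code&gt; layout follows how MCP describes tools, and the tool names come from the walkthrough earlier in this post. The description text itself is invented for illustration and is not the wording we actually ship.&lt;/p&gt;

```typescript
// Hypothetical tool definition. The description is written for the agent, not for a human reader.
const traceRelationshipsTool = {
  name: "trace_relationships",
  description: [
    "Return the tables directly connected to table_name.",
    "WHEN TO CALL: after search_tables has given you an exact table name.",
    "ARGUMENT FORMAT: table_name must be schema-qualified, e.g. warehouse.shipping_order_table.",
    "NEXT STEP: if the response mentions an intermediate table, call search_tables for it, then trace again.",
    "IF EMPTY: do not guess a join. Call search_tables with a different keyword.",
  ].join("\n"),
  inputSchema: {
    type: "object",
    properties: {
      table_name: { type: "string", description: "Schema-qualified table name." },
      direction: { type: "string", description: "Which edges to follow, e.g. both." },
    },
    required: ["table_name"],
  },
};

console.log(traceRelationshipsTool.description);
```

&lt;p&gt;Each line answers a question the agent would otherwise have to guess at: when to call the tool, how to format the argument, and what to do with the result.&lt;/p&gt;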

&lt;h3&gt;
  
  
  5. Embed "next move candidates" in responses
&lt;/h3&gt;

&lt;p&gt;Don't just return data. Return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Related IDs&lt;/strong&gt;: where to traverse next (FK targets, similar initiatives, parent commits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enums and definitions&lt;/strong&gt;: so the AI can interpret values without guessing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Annotations and warnings&lt;/strong&gt;: DEAD flags, deprecation marks, PII (personally identifiable information) redaction notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Return these at a granularity where the AI can read "this is what I should do next" straight out of the response.&lt;/p&gt;
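
&lt;p&gt;A response carrying these hints could be shaped roughly like the sketch below. The field names (&lt;code&gt;relatedIds&lt;/code&gt;, &lt;code&gt;enums&lt;/code&gt;, &lt;code&gt;warnings&lt;/code&gt;), the helper &lt;code&gt;buildHintedResponse&lt;/code&gt;, and the sample values are assumptions made for this example, not the actual wire format of our servers.&lt;/p&gt;

```typescript
// Hypothetical response envelope: the facts, plus hints for the next call.
type ColumnFact = { column: string; value: string };
type HintedResponse = {
  facts: ColumnFact[];
  relatedIds: string[];                   // where to traverse next
  enums: { [column: string]: string[] };  // allowed values, so nothing has to be guessed
  warnings: string[];                     // deprecation marks, redaction notes, and so on
};

function buildHintedResponse(
  facts: ColumnFact[],
  relatedIds: string[],
  enums: { [column: string]: string[] },
  deprecatedColumns: string[],
): HintedResponse {
  // Attach a warning to every returned column that is flagged as deprecated.
  const warnings = facts
    .filter((fact) => deprecatedColumns.indexOf(fact.column) !== -1)
    .map((fact) => "Column " + fact.column + " is deprecated; do not rely on it.");
  return { facts, relatedIds, enums, warnings };
}

// Sample data is made up for illustration.
const shippingOrderResponse = buildHintedResponse(
  [{ column: "receive_status", value: "COMPLETE" }, { column: "legacy_flag", value: "1" }],
  ["app.warehouse_linkage_table"],
  { receive_status: ["PENDING", "COMPLETE"] },
  ["legacy_flag"],
);
console.log(JSON.stringify(shippingOrderResponse, null, 2));
```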

&lt;h3&gt;
  
  
  6. Let the AI do the summarization
&lt;/h3&gt;

&lt;p&gt;Don't pre-bake "community summaries" or similar on the server. The AI assembles facts case by case at the right granularity. &lt;strong&gt;Return facts. Let the AI interpret.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Limits and Caveats
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads up.&lt;/strong&gt; This approach has clear weak spots. If you're considering adopting it, read this section before you start designing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Agentic Graph RAG is not a silver bullet. To be honest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality depends entirely on graph design&lt;/strong&gt;. If the schema doesn't carve up the domain correctly, no number of tool calls will reach what you want. And in tacit-knowledge-heavy domains, the call about which nodes/edges to include is one only someone deeply familiar with the domain can make.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the agent picks the wrong entry, it falls into a deep hole&lt;/strong&gt;. Miss at the first &lt;code&gt;search_*&lt;/code&gt; and the rest of the graph traversal goes sideways. Entry-point quality matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost is tool-call-count × context length&lt;/strong&gt;. With 10–20 tool calls per session, every response stays in the context for the calls that follow, so tokens add up quickly. Prompt caching and progress reporting via MCP help, but you have to keep an eye on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination doesn't disappear — it relocates&lt;/strong&gt;. From the retrieval layer to "entry point selection" and "stop judgment." But it's much narrower territory, so debugging and evals get easier.&lt;/li&gt;
&lt;/ul&gt;
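
&lt;p&gt;To put a rough number on the cost item, here is a back-of-the-envelope model. It assumes every call re-reads a fixed base context plus all earlier tool responses, and that each response has the same size. Both are simplifications, and the figures are invented rather than measured. Under those assumptions the total input for n calls is n × base + response × n(n − 1) / 2.&lt;/p&gt;

```typescript
// Rough cost model (an assumption, not a measurement): call number i re-reads
// the base context plus the i tool responses that came before it.
function cumulativeInputTokens(calls: number, baseTokens: number, tokensPerResponse: number): number {
  let total = 0;
  for (let i = 0; calls > i; i++) {
    total += baseTokens + i * tokensPerResponse;
  }
  return total;
}

const tenCallTotal = cumulativeInputTokens(10, 5000, 2000);
const twentyCallTotal = cumulativeInputTokens(20, 5000, 2000);
console.log(tenCallTotal, twentyCallTotal); // 140000 480000
```

&lt;p&gt;In this model, doubling the number of calls more than triples the input tokens, which is why the re-read prefix is the part worth caching.&lt;/p&gt;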

&lt;p&gt;The first item is the one designers should worry about most. &lt;strong&gt;In tacit-knowledge domains specifically, graphs aren't found — they're designed.&lt;/strong&gt; I wrote this in the Biz Graph post too, and for these domains I don't think it can be overstated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The three eras of RAG, in one table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Era&lt;/th&gt;
&lt;th&gt;Representative&lt;/th&gt;
&lt;th&gt;Retrieval&lt;/th&gt;
&lt;th&gt;Orchestration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Early days&lt;/td&gt;
&lt;td&gt;Vector RAG&lt;/td&gt;
&lt;td&gt;Probabilistic (cosine)&lt;/td&gt;
&lt;td&gt;None (one-shot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function-calling era&lt;/td&gt;
&lt;td&gt;Classical Graph RAG&lt;/td&gt;
&lt;td&gt;Pre-summarized&lt;/td&gt;
&lt;td&gt;Light, mostly one-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent era&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Agentic Graph RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Deterministic (graph traversal)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AI assembles in many steps&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Vector RAG made "search and dump some context" work. Classical Graph RAG packaged "follow relationships" into a single-shot lookup. Agentic Graph RAG &lt;strong&gt;separates "tools that return only facts, accurately" from "AI agents that orchestrate them in multiple steps."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The graphs we've built internally — DB Graph, Biz Graph, Code Graph, Product Graph family — they're all from the same lineage. The contents and construction differ, but in our domains they all share the same shape: &lt;strong&gt;"give Claude Code a human-designed graph through deterministic tools."&lt;/strong&gt; Which is why, from the AI side, they all feel the same.&lt;/p&gt;

&lt;p&gt;If you're building AI-native internal infrastructure, give this perspective a try. &lt;strong&gt;Don't hand the AI an answer. Hand it a map.&lt;/strong&gt; It walks much further than you think.&lt;/p&gt;

&lt;p&gt;And the quality of that map comes down to how deeply you understand the domain — at least for the domains where the relevant knowledge sits as tacit understanding inside people's heads. &lt;strong&gt;In those domains, the best AI systems are still built by the people who know the problem space best.&lt;/strong&gt; Domain expertise hasn't lost value in the AI era — it's gained it. That's been my strongest takeaway from two years of building graphs across our company.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>graphrag</category>
      <category>mcp</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Cutting Self-Built MCP Server Token Usage by 90% — The Parking Pattern</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Fri, 01 May 2026 01:10:27 +0000</pubDate>
      <link>https://dev.to/ryantsuji/cutting-self-built-mcp-server-token-usage-by-90-the-parking-pattern-3e7o</link>
      <guid>https://dev.to/ryantsuji/cutting-self-built-mcp-server-token-usage-by-90-the-parking-pattern-3e7o</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;p&gt;In my previous posts I introduced &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;the full picture of our 17 internal MCP servers&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;an MCP server that lets you search 991 internal tables in natural language&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/we-built-a-custom-graph-rag-to-let-ai-answer-did-that-initiative-actually-work-3oda"&gt;a Graph RAG MCP for measuring initiative impact&lt;/a&gt;, and &lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;the Sandbox MCP that lets non-engineers publish AI-built apps safely&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This time I want to share something that came out of running those in production — &lt;strong&gt;a small trick we use to cut token consumption on self-built MCP servers&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Annoyance: MCPs Eat More Tokens Than You'd Think
&lt;/h2&gt;

&lt;p&gt;The first surprise when extending an AI agent with MCP is that &lt;strong&gt;token consumption is higher than expected&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An MCP tool call is, at the end of the day, a JSON-RPC message sent over stdio or HTTP. Both the arguments the AI sends and the result the tool returns &lt;strong&gt;land directly in the conversation context&lt;/strong&gt;. If you implement things naively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sending whole files as arguments → thousands of lines of source code stick to the context&lt;/li&gt;
&lt;li&gt;Returning all DB query rows → a multi-thousand-row × multi-column table sticks to the context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single tool call can easily consume tens of thousands of tokens, putting the Claude Code session straight into compaction.&lt;/p&gt;

&lt;p&gt;It's worse than just inefficiency: above a certain row count, &lt;strong&gt;the response simply fails to come back at all&lt;/strong&gt; because it exceeds the payload size limit of the client or transport carrying the MCP call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furet1i8zmntgb72kdkn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furet1i8zmntgb72kdkn8.png" alt="Naive implementation bloats the context" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we were ramping up our internal MCP fleet, this little mismatch was reliably making the tool experience worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Park the Big Stuff Elsewhere, Pass Only a Key
&lt;/h2&gt;

&lt;p&gt;The fix is embarrassingly simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Take the parts that tend to grow and move them off the MCP wire. Pass only a reference key (or URL) through MCP itself.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both the request side and the response side benefit from the same idea.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;What to remove&lt;/th&gt;
&lt;th&gt;Where to park it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request&lt;/td&gt;
&lt;td&gt;Large files / source code&lt;/td&gt;
&lt;td&gt;GitHub, Drive, or any object store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Large list data / query results&lt;/td&gt;
&lt;td&gt;Spreadsheet / GCS / BigQuery&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffet7i38yhfketsu8x8yx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffet7i38yhfketsu8x8yx.png" alt="The parking pattern" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two examples from airCloset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1: Lighter Requests — Sandbox MCP × Self-Hosted Git Server
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a"&gt;Last time&lt;/a&gt; I wrote about &lt;strong&gt;Sandbox MCP&lt;/strong&gt;, the platform that lets non-engineers publish AI-built apps internally. The first iteration was fully &lt;strong&gt;MCP tool-driven file uploads&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;sandbox_write_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;app_name: &lt;/span&gt;&lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;path: &lt;/span&gt;&lt;span class="s2"&gt;"index.html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;content: &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;html&amp;gt;..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sandbox_write_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;app_name: &lt;/span&gt;&lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;path: &lt;/span&gt;&lt;span class="s2"&gt;"app.js"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;content: &lt;/span&gt;&lt;span class="s2"&gt;"import ..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sandbox_publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;app_name: &lt;/span&gt;&lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment apps got slightly bigger, this collapsed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Constant chunking&lt;/strong&gt;: hitting the payload size limit, the AI looped through "first half of file A → second half → first half of file B → ..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens going up in flames&lt;/strong&gt;: full source code landed in the conversation context — a single deploy of a few-thousand-line app could burn tens of thousands of tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries made it worse&lt;/strong&gt;: the AI would "verify after sending" by re-reading the same file with &lt;code&gt;sandbox_read_file&lt;/code&gt;. Write → read → write loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we changed the contract: &lt;strong&gt;MCP only returns a URL; the actual content moves over git push&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. MCP returns a git URL — no payload involved&lt;/span&gt;
sandbox_init_repo&lt;span class="o"&gt;(&lt;/span&gt;app_name: &lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# → https://mcp-sandbox.example.com/git/sandbox/ryan/todo-app.git&lt;/span&gt;

&lt;span class="c"&gt;# 2. AI runs git in the background — MCP isn't involved&lt;/span&gt;
git init &lt;span class="nt"&gt;-b&lt;/span&gt; main &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git add &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"init"&lt;/span&gt;
git remote add sandbox &amp;lt;returned URL&amp;gt;
git push sandbox main

&lt;span class="c"&gt;# 3. Only the deploy command goes through MCP&lt;/span&gt;
sandbox_publish&lt;span class="o"&gt;(&lt;/span&gt;app_name: &lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;git push gives us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No file size limit&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Differential transfer, so subsequent pushes are fast&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source code never lands in the MCP conversation context&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the AI's point of view, it's just "I got handed a git URL; I push to it." The token economics are fundamentally different.&lt;/p&gt;

&lt;p&gt;By the way, we &lt;strong&gt;don't use GitHub Organizations&lt;/strong&gt; here. Issuing GitHub seats for every employee wasn't worth the cost or operational overhead, and we already had a self-hosted Git Server on GCE for a different purpose, so we just added one repo (&lt;code&gt;sandbox-apps&lt;/code&gt;). The "park" doesn't have to be something you build from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 2: Lighter Responses — DB Graph MCP × Spreadsheet
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;DB Graph MCP&lt;/a&gt; is the MCP that lets us search and query 991 internal tables in natural language.&lt;/p&gt;

&lt;p&gt;The annoying-but-common case here is &lt;strong&gt;"give me everything"-style queries&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;service_main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the result is several thousand to tens of thousands of rows, you get either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A multi-million-token response that triggers immediate session compaction&lt;/li&gt;
&lt;li&gt;An MCP error because the payload exceeds the size limit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or both. The "right" AI behavior is to do &lt;code&gt;LIMIT 100&lt;/code&gt; and analyze a sample — but if the user actually wanted &lt;strong&gt;the full list as a CSV&lt;/strong&gt;, that doesn't help them.&lt;/p&gt;

&lt;p&gt;So we built a &lt;strong&gt;"export to spreadsheet, return only the URL"&lt;/strong&gt; mode into DB Graph MCP. You can opt in explicitly, but the MCP &lt;strong&gt;also auto-falls back to this mode whenever the result exceeds a row-count threshold&lt;/strong&gt;. Even if the AI forgets to add a &lt;code&gt;LIMIT&lt;/code&gt; and the query is about to return 10,000 rows, the server decides "this is too big to return inline," exports to a spreadsheet, and hands back the URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Conceptual call (the real shape is documented in the tool description)&lt;/span&gt;
&lt;span class="nf"&gt;sql_query_database&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT * FROM ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;spreadsheet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;// ← explicit export mode&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Without `output`, the server still auto-falls back over a threshold (e.g. 500 rows)&lt;/span&gt;
&lt;span class="nf"&gt;sql_query_database&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT * FROM ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// → server detects row count → spreadsheet export + URL response&lt;/span&gt;

&lt;span class="c1"&gt;// Either way, the response shape is the same&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://docs.google.com/spreadsheets/d/{...}/edit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12483&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;created_at&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
  &lt;span class="nx"&gt;exported_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;row_count_exceeded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;// set on auto-fallback&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response is just a URL plus metadata. The real data never enters the context. &lt;strong&gt;"Light if you're careful" becomes "light even when you're not"&lt;/strong&gt; — and that's what makes it feel safe in day-to-day operation.&lt;/p&gt;
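
&lt;p&gt;The server-side decision can be sketched in a few lines. This is a simplified illustration rather than our actual implementation: &lt;code&gt;respondToQuery&lt;/code&gt;, the &lt;code&gt;mode&lt;/code&gt; field, and the injected &lt;code&gt;exportToSheet&lt;/code&gt; callback are hypothetical names, the exporter below is a stub that only fabricates a placeholder URL, and the 500-row limit is just the example threshold mentioned above.&lt;/p&gt;

```typescript
// Hypothetical shapes; the real tool's response is documented in its description.
type QueryRow = { [column: string]: string | number };
type InlineResult = { mode: "inline"; rows: QueryRow[] };
type ExportedResult = {
  mode: "exported";
  url: string;
  rows: number;
  columns: string[];
  exported_reason: string | undefined; // set only on auto-fallback
};

const INLINE_ROW_LIMIT = 500; // example threshold

function respondToQuery(
  rows: QueryRow[],
  output: "inline" | "spreadsheet",
  exportToSheet: (rows: QueryRow[]) => string,
): InlineResult | ExportedResult {
  const explicit = output === "spreadsheet";
  const tooBig = rows.length > INLINE_ROW_LIMIT;
  if (!(explicit || tooBig)) {
    return { mode: "inline", rows };
  }
  // Explicit request or oversized result: park the rows, return only URL + metadata.
  return {
    mode: "exported",
    url: exportToSheet(rows),
    rows: rows.length,
    columns: rows.length > 0 ? Object.keys(rows[0]) : [],
    exported_reason: explicit ? undefined : "row_count_exceeded",
  };
}

// Stub exporter for the example; a real one would write to a spreadsheet.
const fakeExport = (rows: QueryRow[]) => "https://example.invalid/sheet-with-" + rows.length + "-rows";
const manyRows: QueryRow[] = Array.from({ length: 501 }, (_, i) => ({ id: i }));

const smallResult = respondToQuery([{ id: 1 }], "inline", fakeExport);
const bigResult = respondToQuery(manyRows, "inline", fakeExport);
console.log(smallResult.mode, bigResult.mode); // inline exported
```

&lt;p&gt;The caller never has to remember the limit: the fallback branch is taken on the server regardless of what the AI asked for.&lt;/p&gt;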

&lt;p&gt;This pattern works because &lt;strong&gt;a surprisingly large fraction of real use cases are just "I want this data somewhere I can use it later"&lt;/strong&gt; — not "let's analyze this in chat with AI." Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Save it to a spreadsheet I can stare at later&lt;/li&gt;
&lt;li&gt;Share it with another team&lt;/li&gt;
&lt;li&gt;VLOOKUP it against another sheet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those, MCP's job ends at "write the query, drop the result somewhere." That's enough.&lt;/p&gt;

&lt;p&gt;If the user genuinely does want AI-side analysis, you do still need the data in context. The standard workflow becomes a two-step: &lt;code&gt;LIMIT 100&lt;/code&gt; for sample analysis, then &lt;code&gt;output: spreadsheet&lt;/code&gt; for the full export once the conclusion is clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Much Did It Save?
&lt;/h2&gt;

&lt;p&gt;Every MCP we run logs every tool call. After rolling these patterns out, &lt;strong&gt;total token consumption across all tools dropped 70–90%&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: Google Workspace OAuth Pairs Beautifully With This
&lt;/h2&gt;

&lt;p&gt;A note on choosing where to "park" data: &lt;strong&gt;if your MCP authenticates via Google Workspace OAuth, this whole design becomes much easier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The reason is that you get two things from a single OAuth flow — &lt;strong&gt;two birds with one stone&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authentication for MCP itself&lt;/strong&gt; — figuring out who's using the tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization for Workspace apps&lt;/strong&gt; — scoped access to Spreadsheet / Drive / Gmail / Calendar&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclyhrmb0vk8c4atojbpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclyhrmb0vk8c4atojbpy.png" alt="Two birds with one stone" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the user has logged into the MCP, and provided the login flow requested the relevant Drive/Sheets scopes, you don't have to ask for any additional permissions to write to the park location. Which means you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;the operating user's own permissions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;To save files to &lt;strong&gt;that user's My Drive&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Without the MCP itself owning a write-anywhere service account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Files end up in the user's drive, not on a shared service account. "Accidentally world-readable" or "visible to people who shouldn't see it" stops being a realistic accident — it's structurally prevented.&lt;/p&gt;

&lt;p&gt;You also dodge the operational cost of issuing a separate GCP service account, storing its key safely, and managing its IAM policy out of band. The safety property genuinely comes for free.&lt;/p&gt;
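
&lt;p&gt;For readers who haven't wired this up before, the "single OAuth flow" amounts to requesting identity scopes and Workspace scopes in the same consent request. The sketch below builds such a request URL against Google's OAuth 2.0 authorization endpoint. The scope selection, client ID, and redirect URI are placeholders chosen for illustration; this is not our production configuration.&lt;/p&gt;

```typescript
// One consent request covering both sign-in and the "park" location.
// Scope choice is an assumption for this example.
const googleScopes = [
  "openid",
  "email",
  "https://www.googleapis.com/auth/drive.file",   // only files this app creates or opens
  "https://www.googleapis.com/auth/spreadsheets", // read and write Sheets
];

function buildConsentUrl(clientId: string, redirectUri: string): string {
  const params = new URLSearchParams({
    client_id: clientId,
    redirect_uri: redirectUri,
    response_type: "code",
    scope: googleScopes.join(" "),
    access_type: "offline",
  });
  return "https://accounts.google.com/o/oauth2/v2/auth?" + params.toString();
}

const consentUrl = buildConsentUrl("YOUR_CLIENT_ID", "https://mcp.example.com/oauth/callback");
console.log(consentUrl);
```

&lt;p&gt;Whatever token comes out of this flow acts with the signed-in user's own permissions, which is what keeps the exported files inside that user's Drive.&lt;/p&gt;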

&lt;p&gt;There's one catch though:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The AI agent has to be able to read the spreadsheet URL it got back.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Returning a URL alone doesn't help the AI access the underlying data. Stock tooling in Claude Code can't read a Spreadsheet directly, so you need a separate Workspace-operating MCP.&lt;/p&gt;

&lt;p&gt;At airCloset we run &lt;strong&gt;a dedicated MCP that wraps the Google Workspace APIs&lt;/strong&gt; (Drive / Sheets / Gmail / Calendar). Combined with the export pattern above, it gives us a clean flow: "drop results into a spreadsheet → call into the Workspace MCP later if the AI wants to actually read them."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DB Graph MCP → exports to Spreadsheet → returns URL
                                          ↓
              Workspace MCP ← invoked when the AI decides it needs to read the data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the user's side, this naturally produces the rhythm of "dump it into a spreadsheet first, ask AI to analyze only when needed."&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;A few small tricks for keeping self-built MCP server token consumption under control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Move the parts that tend to grow off the MCP wire&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Park them somewhere — Git server, Spreadsheet, GCS — and only pass keys/URLs through MCP&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pick a park that pairs well with Google Workspace OAuth — you get safety almost for free&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If you want the AI to read parked data later, run a Workspace-style MCP alongside&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's an unflashy design move, but &lt;strong&gt;the difference in MCP usability before and after is dramatic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're running self-built MCP servers internally and feeling the token squeeze, give it a try.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>mcp</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Bridging 'I Want to Build' and 'I Want to Publish Safely' for Non-Engineers — Sandbox MCP</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Mon, 27 Apr 2026 23:04:57 +0000</pubDate>
      <link>https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a</link>
      <guid>https://dev.to/ryantsuji/bridging-i-want-to-build-and-i-want-to-publish-safely-for-non-engineers-sandbox-mcp-392a</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;p&gt;In my previous posts, I've introduced our internal MCP servers: &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;an MCP server for natural-language search across all our databases&lt;/a&gt;, &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;the full picture of our 17 internal MCP servers&lt;/a&gt;, and &lt;a href="https://dev.to/ryantsuji/we-built-a-custom-graph-rag-to-let-ai-answer-did-that-initiative-actually-work-3oda"&gt;a custom Graph RAG that lets AI answer "Did that initiative actually work?"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This time I'm covering something a bit different: &lt;strong&gt;Sandbox MCP&lt;/strong&gt; — a platform that lets non-engineer employees deploy apps they built with AI to a safe, internal-only URL &lt;strong&gt;with a single command&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The pitch is simple: "If Claude Code can build an app, why not publish it directly?" The hard part is making "directly" mean &lt;strong&gt;safely&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Building Got Easy. Publishing Safely Did Not.
&lt;/h2&gt;

&lt;p&gt;The arrival of Claude Code and other AI coding agents is reshaping how work happens inside our company.&lt;/p&gt;

&lt;p&gt;"Building an app" used to be an engineer's job. You had to do requirements, design, frontend, backend, database, CI/CD, production deploy — all in one head.&lt;/p&gt;

&lt;p&gt;Now PMs, designers, and customer-success folks are talking to Claude Code with "build me a screen that does X" and getting working mockups on the spot. Inside airCloset we're seeing more and more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mockups for new project proposals&lt;/li&gt;
&lt;li&gt;Interactive reports that visualize research findings&lt;/li&gt;
&lt;li&gt;KPI dashboards used only by a single team&lt;/li&gt;
&lt;li&gt;Small tools for everyday operational improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These &lt;strong&gt;non-engineer outputs&lt;/strong&gt; are growing fast. People are even saying "let's just run with this in production for a bit."&lt;/p&gt;

&lt;p&gt;That's where they hit the wall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy to Build. Hard to Publish Safely.
&lt;/h3&gt;

&lt;p&gt;Anyone can build something that runs locally now. Spin up &lt;code&gt;python -m http.server 8000&lt;/code&gt;, view it on your Mac — five minutes max.&lt;/p&gt;

&lt;p&gt;But the moment it becomes "I want my team to see this" or "I want others to actually use it," the difficulty curve goes vertical.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Where do you run it?&lt;/strong&gt; Cloud means GCP/AWS accounts, IAM, billing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What URL?&lt;/strong&gt; Domain registration, DNS, SSL certificates, Cloudflare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What about auth?&lt;/strong&gt; If it touches confidential info, you need employees-only. OAuth implementation, domain restriction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;And the data?&lt;/strong&gt; Is localStorage enough, or do you need a real DB? If a DB, who manages the password?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do you deploy?&lt;/strong&gt; Can you write a Dockerfile? Cloud Run config, env vars, service accounts, IAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What about security?&lt;/strong&gt; What if the AI-written code has a vulnerability? An auth bypass?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You &lt;em&gt;could&lt;/em&gt; "let the AI write all of it." But the result is &lt;strong&gt;left to the AI&lt;/strong&gt;. Cloudflare misconfigured and exposed to the world. Auth bypassed. A service account with production database write access slipped into the code. The more code AI writes, the higher the risk of these accidents.&lt;/p&gt;

&lt;p&gt;When a non-engineer says "I want to try building this," we need to clearly separate &lt;strong&gt;what the builder is responsible for&lt;/strong&gt; from &lt;strong&gt;what the platform must guarantee by default&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There's also a quieter problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  UI Inconsistency and Data Sprawl
&lt;/h3&gt;

&lt;p&gt;When non-engineers build apps independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One person uses React, another Vue, another raw HTML&lt;/li&gt;
&lt;li&gt;Buttons look and behave differently&lt;/li&gt;
&lt;li&gt;Some store data in localStorage, some in Google Sheets, some in Firebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 10 or 20 such apps, internal tooling becomes &lt;strong&gt;chaos&lt;/strong&gt;. Users wonder "wait, who built this one?" and "why does this button work differently?"&lt;/p&gt;

&lt;p&gt;Even for internal tools, you need &lt;strong&gt;a baseline of consistency&lt;/strong&gt; — both in design and in where data lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandbox MCP — Standing Between "Build" and "Publish"
&lt;/h2&gt;

&lt;p&gt;That's why we built &lt;strong&gt;Sandbox MCP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A non-engineer just says "build this" to Claude Code, and:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An app is generated using a unified UI Kit&lt;/li&gt;
&lt;li&gt;They can verify it works locally&lt;/li&gt;
&lt;li&gt;A single command deploys it to &lt;code&gt;https://sbx-{nickname}--{app-name}.example.com/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Self-hosted OAuth on the Cloudflare Worker enforces internal-only access&lt;/li&gt;
&lt;li&gt;Data is stored, isolated, in a dedicated Firestore database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;— all of this completes within a single chat session with the AI.&lt;br&gt;
The builder is only responsible for &lt;strong&gt;functionality&lt;/strong&gt;. &lt;strong&gt;Security, data isolation, domain &amp;amp; SSL, authentication&lt;/strong&gt; are all handled by the Sandbox MCP platform by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2ibte5mq2jqws89wiby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2ibte5mq2jqws89wiby.png" alt="System Overview" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scale
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP tools&lt;/td&gt;
&lt;td&gt;10 (publish, status, schedule, list, delete, write_file, read_file, list_files, init_repo, unschedule)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supported runtimes&lt;/td&gt;
&lt;td&gt;Python (Flask + gunicorn), Node.js, static HTML/SPA, custom Dockerfile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;URL&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sbx-{nickname}--{app-name}.example.com&lt;/code&gt; (covered by Universal SSL, no ACM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authentication&lt;/td&gt;
&lt;td&gt;Self-hosted OAuth on a Cloudflare Worker (Google Workspace)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;Firestore named DB &lt;code&gt;sandbox&lt;/code&gt;, namespaced per nickname × app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Self-hosted Git Server (GCE) + Cloud Run + Cloudflare Worker + KV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy time&lt;/td&gt;
&lt;td&gt;Typically 2–5 minutes (git push to public URL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let's walk through the internals.&lt;/p&gt;
&lt;h2&gt;
  
  
  What It Does — Web, API, DB, and Cron
&lt;/h2&gt;

&lt;p&gt;Sandbox MCP supports four app shapes so it can cover almost any "I want to ship something internally" use case.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Detected by&lt;/th&gt;
&lt;th&gt;Use cases&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.py&lt;/code&gt; files present&lt;/td&gt;
&lt;td&gt;Flask + gunicorn for APIs, analysis tools with a UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Node.js&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;package.json&lt;/code&gt; present&lt;/td&gt;
&lt;td&gt;Express APIs + UI; Bun also works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Static HTML/SPA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;only &lt;code&gt;.html&lt;/code&gt; files (no Python/Node)&lt;/td&gt;
&lt;td&gt;nginx-served, React/Vue dist supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;includes a &lt;code&gt;Dockerfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Any runtime — Go, Rust, Bun, anything&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pick any of these and &lt;code&gt;sandbox_publish&lt;/code&gt; deploys it with no extra config.&lt;/p&gt;

&lt;p&gt;There's also &lt;code&gt;sandbox_schedule&lt;/code&gt; for &lt;strong&gt;scheduled batch apps via Cloud Scheduler&lt;/strong&gt;. Things like "post a risk summary to Slack at 9 AM every morning" become a single tool call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;sandbox_schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="ss"&gt;app_name: &lt;/span&gt;&lt;span class="s2"&gt;"risk-alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;schedule: &lt;/span&gt;&lt;span class="s2"&gt;"0 9 * * *"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;path: &lt;/span&gt;&lt;span class="s2"&gt;"/api/cron"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;timezone: &lt;/span&gt;&lt;span class="s2"&gt;"Asia/Tokyo"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud Scheduler now hits the app's &lt;code&gt;/api/cron&lt;/code&gt; every morning at 9 AM Tokyo time. No need to open the scheduler UI or translate cron syntax into IaC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend — Unified Design via sandbox-ui-kit
&lt;/h2&gt;

&lt;p&gt;Even apps built by non-engineers should feel &lt;strong&gt;consistent as a tool family&lt;/strong&gt;. That's the job of the &lt;code&gt;sandbox-ui-kit&lt;/code&gt; repo.&lt;/p&gt;

&lt;p&gt;It lives on &lt;code&gt;mcp-sandbox.example.com/git&lt;/code&gt; and provides:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox-ui.css&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Design tokens + glass-morphism component styles (dark/light)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox-ui.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Theme switcher, modals, toasts, generic JS utilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox-db.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SandboxDB client SDK (more below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;index.html&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Storybook-style component catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full API documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key: it's designed &lt;strong&gt;for AI to read and use&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;sandbox_publish&lt;/code&gt; tool description literally says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When building an app, first read README.md with read_file and use the UI Kit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When Claude Code builds a new app, it &lt;code&gt;read_file&lt;/code&gt;s this README, learns which CSS/JS to load and which component names to use, then generates code accordingly. &lt;strong&gt;Instead of a human walking the AI through UI guidelines, we centralized the "how to use" in one place targeted at the AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The result: apps built by anyone (with AI) end up with consistent buttons, modals, and forms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backend — Auto-Generated Dockerfile + Cloud Run
&lt;/h2&gt;

&lt;p&gt;"I don't want to write Docker." "I don't want to think about runtime configuration." Classic non-engineer requests.&lt;/p&gt;

&lt;p&gt;Sandbox MCP &lt;strong&gt;inspects the source files and generates a Dockerfile automatically&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// apps/mcp/git-server/src/sandbox/tools.ts&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasPy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;dockerfile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generatePythonDockerfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasRequirements&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Auto-create requirements.txt if missing&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasRequirements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;requirements.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flask&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;gunicorn&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasPackageJson&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;dockerfile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generateNodeDockerfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasHtml&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;dockerfile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generateStaticDockerfile&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, a Python app gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.12-slim&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PORT=8080&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "-u", "$(ls *.py | head -1)"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;requirements.txt&lt;/code&gt; is missing, &lt;code&gt;flask&lt;/code&gt; + &lt;code&gt;gunicorn&lt;/code&gt; get added automatically. AI can write &lt;code&gt;from flask import Flask&lt;/code&gt; and the dependencies will resolve — no missing-package surprises.&lt;/p&gt;

&lt;p&gt;Deployment uses &lt;code&gt;gcloud run deploy --source&lt;/code&gt;, with Cloud Build handling the image build. App authors &lt;strong&gt;can&lt;/strong&gt; write a &lt;code&gt;Dockerfile&lt;/code&gt;, but they don't have to. Without one, the app gets the standard generated Dockerfile; with one, the author controls the build. That keeps the platform friendly to both non-engineers and engineers.&lt;/p&gt;
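
&lt;p&gt;To make the detection order concrete, here is a minimal sketch of how the runtime could be picked from a repo's file names. It follows the table and the excerpt above, with one assumption: an explicit &lt;code&gt;Dockerfile&lt;/code&gt; is checked first. &lt;code&gt;detectRuntime&lt;/code&gt; is a hypothetical helper, not the function in &lt;code&gt;tools.ts&lt;/code&gt;.&lt;/p&gt;

```typescript
type Runtime = "custom" | "python" | "node" | "static" | "unknown";

// Decide the runtime from the file names in a repo.
// Assumption: an explicit Dockerfile wins over every auto-detected runtime.
function detectRuntime(files: string[]): Runtime {
  if (files.indexOf("Dockerfile") !== -1) return "custom";
  if (files.some((f) => /\.py$/.test(f))) return "python";
  if (files.indexOf("package.json") !== -1) return "node";
  if (files.some((f) => /\.html$/.test(f))) return "static";
  return "unknown";
}

console.log(detectRuntime(["app.py", "index.html"]));     // "python"
console.log(detectRuntime(["index.html", "style.css"]));  // "static"
```

&lt;p&gt;Because Python is checked before static HTML, a Flask app that ships its own templates is still treated as Python.&lt;/p&gt;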

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn1j5ttav3yf7fw1eqvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn1j5ttav3yf7fw1eqvf.png" alt="Deploy Flow" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Database — Transparent Fallback Between localStorage and Firestore
&lt;/h2&gt;

&lt;p&gt;"I want to save data. I don't want to set up a database."&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;SandboxDB SDK&lt;/strong&gt; handles that. The same code uses &lt;code&gt;localStorage&lt;/code&gt; locally and Firestore once deployed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://mcp-sandbox.example.com/api/db/sdk.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"module"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SandboxDB&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;googleOAuthAccessToken&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Save (storage location auto-detected from hostname)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;items&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// List&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;items&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Get / update / delete&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;items&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;updated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;items&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK internals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_isLocal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
              &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_isLocal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_localAdd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// localStorage&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_req&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                  &lt;span class="c1"&gt;// Firestore REST API&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running on &lt;code&gt;localhost&lt;/code&gt;, it uses localStorage. The moment it's deployed under &lt;code&gt;sbx-*.example.com&lt;/code&gt;, it switches to Firestore. &lt;strong&gt;No code changes required.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dramatically improves the experience of building apps with AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local: no network, no auth, all features work&lt;/li&gt;
&lt;li&gt;Deployed: same code runs, data is properly persisted&lt;/li&gt;
&lt;li&gt;Development data never leaks into systems outside Sandbox (it physically can't reach them)&lt;/li&gt;
&lt;/ul&gt;
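
&lt;p&gt;The local half of that fallback can be sketched roughly as follows. This is an illustration under assumptions, not the SDK's real code: the &lt;code&gt;localAdd&lt;/code&gt; helper, the &lt;code&gt;sandboxdb:&lt;/code&gt; key prefix, and the id format are all made up, and the in-memory store only stands in for the browser's &lt;code&gt;localStorage&lt;/code&gt; so the sketch runs anywhere.&lt;/p&gt;

```typescript
// Minimal subset of the browser Storage interface used below.
type KeyValueStore = {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
};

type Doc = { [field: string]: unknown };

// Same hostname check as the SDK excerpt above.
function isLocalHost(hostname: string): boolean {
  return hostname === "localhost" || hostname === "127.0.0.1";
}

// Hypothetical local "add": keep each collection as one JSON array under one key.
function localAdd(store: KeyValueStore, collection: string, data: Doc): { id: string } {
  const key = "sandboxdb:" + collection;
  const docs: Doc[] = JSON.parse(store.getItem(key) ?? "[]");
  // Timestamp plus position keeps ids unique within a collection.
  const id = String(Date.now()) + "-" + String(docs.length);
  docs.push({ ...data, id, _createdAt: new Date().toISOString() });
  store.setItem(key, JSON.stringify(docs));
  return { id };
}

// In-memory stand-in so the sketch also runs outside a browser.
function memoryStore(): KeyValueStore {
  const items: { [key: string]: string } = {};
  return {
    getItem: (key) => {
      const value = items[key];
      return value === undefined ? null : value;
    },
    setItem: (key, value) => {
      items[key] = value;
    },
  };
}

const demoStore = memoryStore();
const added = localAdd(demoStore, "items", { name: "test" });
console.log(isLocalHost("localhost"), added.id !== "");
```

&lt;p&gt;In a browser you would pass &lt;code&gt;window.localStorage&lt;/code&gt; instead of &lt;code&gt;memoryStore()&lt;/code&gt;; the deployed path would replace &lt;code&gt;localAdd&lt;/code&gt; with an authenticated HTTP request.&lt;/p&gt;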

&lt;h3&gt;
  
  
  Firestore Namespace Isolation
&lt;/h3&gt;

&lt;p&gt;Once deployed, data paths are strictly isolated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sandbox_data/{nickname}--{app}/{collection}/{docId}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nickname&lt;/code&gt;: user identifier resolved via OAuth&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app&lt;/code&gt;: Sandbox app name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_createdAt&lt;/code&gt; / &lt;code&gt;_updatedAt&lt;/code&gt;: timestamp fields the SDK attaches to each document automatically (not part of the path)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data from different apps is physically unreachable from each other. Even apps built by the same person live in different paths.&lt;/p&gt;
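
&lt;p&gt;A small sketch of how such a path could be assembled on the server side. &lt;code&gt;buildDocPath&lt;/code&gt; and its validation rule are assumptions for illustration; the idea is simply that no caller-supplied value may contain a &lt;code&gt;/&lt;/code&gt; that would let it step into another app's namespace.&lt;/p&gt;

```typescript
// Build the document path for one app, following the layout shown above:
// sandbox_data/{nickname}--{app}/{collection}/{docId}
function buildDocPath(nickname: string, app: string, collection: string, docId: string): string {
  const segments = [nickname, app, collection, docId];
  for (const segment of segments) {
    // Assumption: reject empty values and "/" so a caller cannot leave its own namespace.
    if (segment === "" || segment.indexOf("/") !== -1) {
      throw new Error("invalid path segment: " + JSON.stringify(segment));
    }
  }
  return "sandbox_data/" + nickname + "--" + app + "/" + collection + "/" + docId;
}

console.log(buildDocPath("ryan", "todo-app", "items", "abc123"));
// sandbox_data/ryan--todo-app/items/abc123
```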

&lt;p&gt;The most important point: &lt;strong&gt;we use a dedicated &lt;code&gt;sandbox&lt;/code&gt; named database&lt;/strong&gt;. It's a completely separate Firestore database from the &lt;code&gt;(default)&lt;/code&gt; DB used by other internal systems. No matter how badly an app's code misbehaves, it can never touch data outside Sandbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure — Wildcard DNS + Cloudflare Worker + Self-Hosted Git Server
&lt;/h2&gt;

&lt;p&gt;Now for the infrastructure highlights.&lt;/p&gt;

&lt;h3&gt;
  
  
  How URLs Are Determined
&lt;/h3&gt;

&lt;p&gt;The public URL takes the form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://sbx-{nickname}--{app-name}.example.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;nickname&lt;/code&gt; is &lt;strong&gt;automatically pulled from the MCP OAuth session&lt;/strong&gt;. When a user logs into Sandbox MCP via Google, the email is looked up in a Firestore &lt;code&gt;users&lt;/code&gt; collection to resolve the nickname. Users never have to repeat "I am ryan" each time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r.tsuji@air-closet.com → users[r.tsuji@air-closet.com].nickname → "ryan"
                                                       ↓
                                  sbx-ryan--todo-app.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;users&lt;/code&gt; collection is &lt;strong&gt;kept in sync from a separate internal pipeline&lt;/strong&gt; (a daily batch that pulls from our HR system and Google Workspace directory). Sandbox MCP just reads from it — no need to maintain its own employee master.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The benefit: you can tell &lt;strong&gt;whose app it is&lt;/strong&gt; just by reading the URL. When someone says "go look at ryan's todo-app," reading the URL aloud naturally communicates ownership.&lt;/p&gt;
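
&lt;p&gt;The email-to-hostname resolution described above can be sketched like this. The plain object stands in for the Firestore &lt;code&gt;users&lt;/code&gt; collection, and &lt;code&gt;resolveSandboxHost&lt;/code&gt; is a hypothetical name; only the URL format comes from the article.&lt;/p&gt;

```typescript
// Stand-in for the Firestore users collection, keyed by email.
type UserDirectory = { [email: string]: { nickname: string } };

// Resolve the public hostname for an app from the logged-in user's email.
// Returns null when the email is not in the directory.
function resolveSandboxHost(users: UserDirectory, email: string, appName: string): string | null {
  const user = users[email];
  if (user === undefined) return null;
  return "sbx-" + user.nickname + "--" + appName + ".example.com";
}

const directory: UserDirectory = { "r.tsuji@air-closet.com": { nickname: "ryan" } };
console.log(resolveSandboxHost(directory, "r.tsuji@air-closet.com", "todo-app"));
// sbx-ryan--todo-app.example.com
```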

&lt;h3&gt;
  
  
  Instant Publishing via Cloudflare Worker
&lt;/h3&gt;

&lt;p&gt;Normally, publishing a new subdomain requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adding A/CNAME DNS records&lt;/li&gt;
&lt;li&gt;Issuing an SSL certificate (15–30 minute wait with ACM or Let's Encrypt)&lt;/li&gt;
&lt;li&gt;Configuring a load balancer or DomainMapping&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sandbox MCP skips all of this with a &lt;strong&gt;Cloudflare Edge Router Worker&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6yct2iedaklpo8mt07s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6yct2iedaklpo8mt07s.png" alt="URL Routing" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DNS is fixed as &lt;code&gt;*.example.com&lt;/code&gt; &lt;strong&gt;wildcard&lt;/strong&gt; + Cloudflare proxy, with Universal SSL automatically covering every subdomain. The Cloudflare Worker receives all &lt;code&gt;*.example.com/*&lt;/code&gt; traffic and routes by subdomain.&lt;/p&gt;

&lt;p&gt;The logic is three-tier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// apps/worker/edge-router/src/index.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// ① sbx-* prefix → Sandbox routing&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sandboxSub&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractSandboxSubdomain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sandboxSub&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleSandboxRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sandboxSub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// ② KV route:{subdomain} registered → Cloud Run proxy&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;subdomain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractSubdomain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;subdomain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handleCloudRunProxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subdomain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proxyResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;proxyResponse&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// ③ Otherwise → fetch(request) passthrough&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;sandbox_publish&lt;/code&gt; finishes, all it does is &lt;strong&gt;write a &lt;code&gt;route:{nickname}/{app}&lt;/code&gt; key into Cloudflare KV&lt;/strong&gt;. That single write makes the new subdomain routable instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;kvPut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`route:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;nickname&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;serviceUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No DNS setup. No waiting for SSL issuance. No IaC deploy. Everything completes within the MCP tool execution.&lt;/p&gt;
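&lt;p&gt;As a rough illustration of that publish-then-route handshake, here is a minimal sketch. &lt;code&gt;RouteStore&lt;/code&gt;, &lt;code&gt;publishRoute&lt;/code&gt;, &lt;code&gt;resolveRoute&lt;/code&gt;, and the service URL are hypothetical names for illustration, and a &lt;code&gt;Map&lt;/code&gt; stands in for Cloudflare KV (whose real calls are asynchronous). Only the &lt;code&gt;route:{nickname}/{app}&lt;/code&gt; key shape comes from the snippet above.&lt;/p&gt;

```typescript
// Hypothetical minimal store interface; a Map stands in for the KV namespace.
interface RouteStore {
  get(key: string): string | undefined;
  set(key: string, value: string): unknown;
}

// Key shape taken from the article: route:{nickname}/{app}
function routeKey(nickname: string, appName: string): string {
  return `route:${nickname}/${appName}`;
}

// The last step of a publish: one write makes the app routable.
function publishRoute(store: RouteStore, nickname: string, appName: string, serviceUrl: string): void {
  store.set(routeKey(nickname, appName), serviceUrl);
}

// The router side: one read decides between proxying and passthrough (null).
function resolveRoute(store: RouteStore, nickname: string, appName: string): string | null {
  const target = store.get(routeKey(nickname, appName));
  return target === undefined ? null : target;
}

const demoStore: RouteStore = new Map();
const beforePublish = resolveRoute(demoStore, "ryan", "my-app");
// Hypothetical Cloud Run URL, for illustration only.
publishRoute(demoStore, "ryan", "my-app", "https://sandbox-my-app.example.run.app");
const afterPublish = resolveRoute(demoStore, "ryan", "my-app");
console.log(beforePublish, afterPublish);
```

&lt;p&gt;The point of the design is that routing state is plain data: there is nothing to deploy, so "publish" reduces to a single key write.&lt;/p&gt;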

&lt;h3&gt;
  
  
  Self-Hosted Git Server for Larger Apps
&lt;/h3&gt;

&lt;p&gt;This setup actually started out &lt;strong&gt;without git at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since the primary users were going to be PMs and CS folks, we figured "git concepts are too high a bar — let's keep everything inside MCP tools." Write files via &lt;code&gt;sandbox_write_file&lt;/code&gt;, deploy via &lt;code&gt;sandbox_publish&lt;/code&gt;. That should be enough, we thought.&lt;/p&gt;

&lt;p&gt;The approach hit two walls quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 1: Constant chunking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MCP tool calls travel over HTTP, with a payload size limit. React/Vue build bundles, SPAs with images, business tools with dozens of files — they don't fit in a single call. We added an &lt;code&gt;append&lt;/code&gt; mode to &lt;code&gt;sandbox_write_file&lt;/code&gt; for chunking, but every "first half of file A → second half of file A → first half of file B → ..." sequence triggered error recovery and retries. Deployments became flaky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall 2: Massive token consumption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the real killer. When you tell the AI "deploy this app," it sends the entire source as MCP tool arguments. &lt;strong&gt;The file contents land in the conversation context&lt;/strong&gt;, and a few-thousand-line app burns through tokens fast. A single deploy easily consumed tens of thousands of tokens, and Claude Code sessions hit compaction quickly.&lt;/p&gt;

&lt;p&gt;Worse, the AI tends to "verify after sending" — re-reading the same file via &lt;code&gt;sandbox_read_file&lt;/code&gt;. &lt;strong&gt;Write → read → write loops, with tokens going up in flames.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So we pivoted to &lt;strong&gt;using git push as well&lt;/strong&gt;. With git push:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No file size limit&lt;/li&gt;
&lt;li&gt;Differential transfer: after the first push only changed objects are sent, so later pushes are fast&lt;/li&gt;
&lt;li&gt;Source code stays out of the MCP conversation context (no AI tokens consumed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We never expected business-side employees to run &lt;code&gt;git push&lt;/code&gt; by hand. But if &lt;strong&gt;Claude Code runs git commands in the background&lt;/strong&gt;, it's not a barrier. The user just says "build this and publish it" — the AI runs &lt;code&gt;git init &amp;amp;&amp;amp; git push&lt;/code&gt; on its own when needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a Self-Hosted Git Server?
&lt;/h3&gt;

&lt;p&gt;Once we adopted git push, the next question was: where do we host the repos? We considered using GitHub Organizations but ruled it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issuing and managing GitHub accounts for every employee&lt;/strong&gt; — including non-engineers — wasn't worth the cost or the operational overhead. Paying for a GitHub seat just to ship one app is overkill.&lt;/p&gt;

&lt;p&gt;Fortunately, we already operated &lt;strong&gt;a self-hosted Git Server on GCE for a different purpose&lt;/strong&gt;: hosting an internal "read-only Git MCP for code investigation." A VM with repositories cloned under &lt;code&gt;/mnt/repos/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We just added a &lt;strong&gt;Git Smart HTTP Protocol&lt;/strong&gt; endpoint and one new repo (&lt;code&gt;sandbox-apps&lt;/code&gt;) to it. The VM was already running, so the marginal cost was near zero. Authentication piggybacks on the existing Google OAuth setup. Repository management is just OS directory operations. Borrowing space on the existing internal Git Server was vastly simpler than spinning up new infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Actual Usage Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get the git URL from the MCP tool (nickname is automatic)&lt;/span&gt;
sandbox_init_repo&lt;span class="o"&gt;(&lt;/span&gt;app_name: &lt;span class="s2"&gt;"my-app"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# → https://mcp-sandbox.example.com/git/sandbox/ryan/my-app.git&lt;/span&gt;

&lt;span class="c"&gt;# 2. Local commit (the AI does this in the background)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/my-app/
git init &lt;span class="nt"&gt;-b&lt;/span&gt; main &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git add &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"init"&lt;/span&gt;
git remote add sandbox &amp;lt;returned URL&amp;gt;

&lt;span class="c"&gt;# 3. Push&lt;/span&gt;
git push sandbox main
&lt;span class="c"&gt;# Username: oauth2accesstoken&lt;/span&gt;
&lt;span class="c"&gt;# Password: $(gcloud auth print-access-token)&lt;/span&gt;

&lt;span class="c"&gt;# 4. Deploy&lt;/span&gt;
sandbox_publish&lt;span class="o"&gt;(&lt;/span&gt;app_name: &lt;span class="s2"&gt;"my-app"&lt;/span&gt;, description: &lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auth uses a Google OAuth token as the Basic Auth password (same pattern as GCP Source Repos). Only &lt;code&gt;@air-closet.com&lt;/code&gt; accounts pass. No GitHub account required — any employee can push.&lt;/p&gt;

&lt;p&gt;The remote repo is configured with &lt;code&gt;receive.denyCurrentBranch=updateInstead&lt;/code&gt;, so the server-side working tree is updated on every push. The Cloud Run deploy uses that directory as &lt;code&gt;--source&lt;/code&gt;, so there's no extra step between push and publish.&lt;/p&gt;

&lt;p&gt;For small apps (a few files, hundreds of lines each), &lt;code&gt;sandbox_write_file&lt;/code&gt; still works fine. &lt;strong&gt;Switch between MCP-only and git push depending on app size.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Security — Four Independent Gates
&lt;/h2&gt;

&lt;p&gt;That covered the "convenient to build" side. Now the &lt;strong&gt;"safe to publish"&lt;/strong&gt; side.&lt;/p&gt;

&lt;p&gt;As I noted at the start, exposing AI-generated code in front of users is risky. So Sandbox MCP layers four independent safety mechanisms that &lt;strong&gt;don't depend on the app's own implementation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84l7lqwqw5pke0mgquaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84l7lqwqw5pke0mgquaf.png" alt="Security Layers" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ① Public-Facing Gate — Self-Hosted OAuth on the Cloudflare Worker
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;sbx-*.example.com&lt;/code&gt; sits behind a &lt;strong&gt;self-hosted OAuth gate built into the same Cloudflare Worker&lt;/strong&gt; that handles routing. When someone visits, the Worker first checks the &lt;code&gt;cortex_session&lt;/code&gt; cookie; if it's missing or invalid, it redirects to a Google Workspace SSO entry point (&lt;code&gt;auth.example.com/__edge/auth/start&lt;/code&gt;). Without an &lt;code&gt;@air-closet.com&lt;/code&gt; account, requests never reach Cloud Run.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;independent of the app's implementation&lt;/strong&gt;. Even if the AI didn't write a single line of auth code, the Worker stops the request first. "Accidentally public" is physically impossible.&lt;/p&gt;
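&lt;p&gt;The gate logic can be sketched as a pure decision function. This is a simplified illustration, not the actual Worker code: &lt;code&gt;SessionLookup&lt;/code&gt;, &lt;code&gt;readCookie&lt;/code&gt;, and &lt;code&gt;decideGate&lt;/code&gt; are hypothetical names, and a &lt;code&gt;Map&lt;/code&gt; stands in for the real session store. The cookie name, the auth entry point, the domain restriction, and the &lt;code&gt;X-Cortex-User-Email&lt;/code&gt; header are the ones described in this section.&lt;/p&gt;

```typescript
// Hypothetical session lookup; a Map stands in for the real session store.
interface SessionLookup {
  get(sessionId: string): { email: string } | undefined;
}

type GateDecision =
  | { action: "redirect"; location: string }
  | { action: "forward"; headers: { [headerName: string]: string } };

const AUTH_START = "https://auth.example.com/__edge/auth/start";
const ALLOWED_DOMAIN = "@air-closet.com";

// Pull one cookie value out of a Cookie header such as "a=1; b=2".
function readCookie(cookieHeader: string | null, cookieName: string): string | null {
  if (cookieHeader === null) return null;
  for (const part of cookieHeader.split(";")) {
    const [key, ...rest] = part.trim().split("=");
    if (key === cookieName) return rest.join("=");
  }
  return null;
}

// Decide before the request ever reaches Cloud Run.
function decideGate(cookieHeader: string | null, sessions: SessionLookup): GateDecision {
  const sessionId = readCookie(cookieHeader, "cortex_session");
  const session = sessionId === null ? undefined : sessions.get(sessionId);
  if (session === undefined || !session.email.endsWith(ALLOWED_DOMAIN)) {
    return { action: "redirect", location: AUTH_START };
  }
  return { action: "forward", headers: { "X-Cortex-User-Email": session.email } };
}

const demoSessions: SessionLookup = new Map([["abc123", { email: "ryan@air-closet.com" }]]);
const anonymousVisit = decideGate(null, demoSessions);
const signedInVisit = decideGate("theme=dark; cortex_session=abc123", demoSessions);
console.log(anonymousVisit.action, signedInVisit.action);
```

&lt;p&gt;Because the decision depends only on the cookie and the session store, nothing the deployed app does (or forgets to do) can change the outcome.&lt;/p&gt;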

&lt;h4&gt;
  
  
  Why we migrated from ZeroTrust Access to self-hosted OAuth
&lt;/h4&gt;

&lt;p&gt;The first iteration used &lt;strong&gt;Cloudflare ZeroTrust Access&lt;/strong&gt;. You just configure the &lt;code&gt;@air-closet.com&lt;/code&gt; domain restriction in the Cloudflare dashboard and you're done — no auth code at all. As a starting point it was ideal.&lt;/p&gt;

&lt;p&gt;The catch: &lt;strong&gt;ZeroTrust's free tier caps at 50 users&lt;/strong&gt;. As headcount grew and Sandbox MCP usage spread, we approached the cap, and switching to pay-as-you-go (~$7/user/month) wasn't trivially cheap. On top of that we wanted to share the same auth foundation with internal apps in production (KPI dashboards, inventory tools, etc.), so we decided to &lt;strong&gt;consolidate everything into a self-hosted OAuth with no user limit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Conveniently, the Cloudflare Worker already in front of every &lt;code&gt;*.example.com&lt;/code&gt; request — the routing layer Sandbox MCP relies on — was perfectly positioned for this. A small extension gave us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;auth.example.com/__edge/auth/start&lt;/code&gt; to kick off Google OAuth 2.0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;auth.example.com/__edge/auth/callback&lt;/code&gt; to exchange tokens, persist the session in Upstash Redis, and issue a &lt;code&gt;cortex_session&lt;/code&gt; cookie scoped to &lt;code&gt;Domain=.example.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Worker-level gating for sandbox + internal-app subdomains, injecting &lt;code&gt;X-Cortex-User-Email&lt;/code&gt; and friends into the Cloud Run request when authenticated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this fits inside the existing Worker — no extra Cloud Run, no extra VM. Workers do have a CPU-time budget, but &lt;strong&gt;OAuth flows and cookie checks complete in single-digit milliseconds&lt;/strong&gt;, so latency is indistinguishable from ZeroTrust.&lt;/p&gt;

&lt;p&gt;Net result: the user cap is gone, anyone with &lt;code&gt;@air-closet.com&lt;/code&gt; can use Sandbox out of the box, and the auth implementation is fully visible in our own codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  ② Deploy Gate — MCP OAuth
&lt;/h3&gt;

&lt;p&gt;Operations like &lt;code&gt;sandbox_publish&lt;/code&gt; and &lt;code&gt;sandbox_delete&lt;/code&gt; &lt;strong&gt;enforce Google OAuth on the MCP server side&lt;/strong&gt;. Sandbox MCP implements RFC 8414 (&lt;code&gt;/.well-known/oauth-authorization-server&lt;/code&gt;), so Claude Code runs the OAuth flow automatically on first connection.&lt;/p&gt;

&lt;p&gt;The strongest guarantee is &lt;strong&gt;"you can't accidentally update or delete someone else's app."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When multiple people share a Sandbox MCP, an AI accident like "wait, I overwrote a coworker's app while updating mine" would be devastating. To prevent that, &lt;strong&gt;the AI doesn't get to decide whose app is being touched&lt;/strong&gt;. The server injects &lt;code&gt;nickname&lt;/code&gt; automatically from the OAuth session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strip the `nickname` property from the MCP tool schema and have&lt;/span&gt;
&lt;span class="c1"&gt;// the server force-inject the logged-in user's nickname.&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;injectNickname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;McpTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userNickname&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;McpTool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;nickname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;restProperties&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;restProperties&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;nickname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userNickname&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the AI's perspective, the &lt;code&gt;nickname&lt;/code&gt; input doesn't exist. Even with a prompt injection like "delete ryan's app," there's no mechanism to do so. &lt;strong&gt;"You can only touch your own apps" is enforced at the API spec level.&lt;/strong&gt;&lt;/p&gt;
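&lt;p&gt;One way to see the effect is to wrap a fake delete tool and call it with a spoofed &lt;code&gt;nickname&lt;/code&gt;. The simplified &lt;code&gt;SketchTool&lt;/code&gt; shape, &lt;code&gt;withSessionNickname&lt;/code&gt;, and &lt;code&gt;fakeDelete&lt;/code&gt; below are assumptions for illustration, not the real Sandbox MCP types.&lt;/p&gt;

```typescript
// Simplified, hypothetical tool shape (the real McpTool type is not shown here).
type ToolArgs = { [key: string]: unknown };
type SketchTool = {
  properties: { [key: string]: { type: string } };
  execute: (args: ToolArgs) => string;
};

// Same idea as injectNickname above: hide `nickname` from the schema,
// then overwrite whatever the caller sent with the session's nickname.
function withSessionNickname(tool: SketchTool, sessionNickname: string): SketchTool {
  const { nickname: _hidden, ...visibleProperties } = tool.properties;
  return {
    properties: visibleProperties,
    execute: (args) => tool.execute({ ...args, nickname: sessionNickname }),
  };
}

const fakeDelete: SketchTool = {
  properties: { nickname: { type: "string" }, app_name: { type: "string" } },
  execute: (args) => `deleted ${String(args["nickname"])}/${String(args["app_name"])}`,
};

const guardedDelete = withSessionNickname(fakeDelete, "alice");
// A prompt-injected call that targets ryan's app still lands in alice's namespace.
const spoofedOutcome = guardedDelete.execute({ nickname: "ryan", app_name: "my-app" });
console.log(spoofedOutcome); // deleted alice/my-app
```

&lt;p&gt;The spread order matters: the session value is written last, so it always wins over anything in the caller's arguments.&lt;/p&gt;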

&lt;p&gt;On top of that, inputs are validated strictly against &lt;code&gt;/^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/&lt;/code&gt;, rejecting shell-injection and path-traversal patterns (&lt;code&gt;..&lt;/code&gt;, &lt;code&gt;/&lt;/code&gt;).&lt;/p&gt;
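&lt;p&gt;A small sketch of what that pattern accepts and rejects. The helper name &lt;code&gt;isSafeName&lt;/code&gt; and the sample inputs are made up for illustration; the regular expression is the one quoted above.&lt;/p&gt;

```typescript
// Pattern quoted in the article: lowercase DNS-label style, 1 to 63 characters,
// no leading or trailing hyphen.
const SAFE_NAME = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

function isSafeName(candidate: string): boolean {
  return SAFE_NAME.test(candidate);
}

const sampleNames = ["my-app", "a", "../etc/passwd", "app;rm -rf", "-leading", "UPPER", "a".repeat(64)];
for (const sample of sampleNames) {
  console.log(JSON.stringify(sample), isSafeName(sample));
}
```

&lt;p&gt;Since the allowed alphabet contains no dots, slashes, spaces, or shell metacharacters, traversal and injection strings fail the check before they can reach a path or a command line.&lt;/p&gt;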

&lt;h3&gt;
  
  
  ③ Data Gate — SandboxDB Namespace Isolation
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, data lives at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sandbox_data/{nickname}--{app}/...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per request, the SandboxDB API resolves the path &lt;strong&gt;server-side&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser (OAuth): resolve &lt;code&gt;email → users → nickname&lt;/code&gt;, take &lt;code&gt;app&lt;/code&gt; from the &lt;code&gt;Origin&lt;/code&gt; header&lt;/li&gt;
&lt;li&gt;Backend (SA token): take &lt;code&gt;nickname/app&lt;/code&gt; from the &lt;code&gt;X-Sandbox-App&lt;/code&gt; header (required — missing returns 400)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client cannot spoof the path.&lt;/p&gt;
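&lt;p&gt;For the backend branch, the resolution step might look roughly like this. It is a sketch under assumptions: &lt;code&gt;resolveBackendNamespace&lt;/code&gt; and &lt;code&gt;NamespaceResult&lt;/code&gt; are hypothetical names, and rejecting a malformed header with 400 is my assumption (the article only states that a missing header returns 400). The browser branch is omitted.&lt;/p&gt;

```typescript
type NamespaceResult =
  | { ok: true; path: string }
  | { ok: false; status: number; reason: string };

const NAME_PATTERN = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

// Backend (SA token) branch only: the caller must name its app explicitly.
function resolveBackendNamespace(sandboxAppHeader: string | null): NamespaceResult {
  if (sandboxAppHeader === null || sandboxAppHeader === "") {
    return { ok: false, status: 400, reason: "X-Sandbox-App header is required" };
  }
  const parts = sandboxAppHeader.split("/");
  // Assumption: a header that is not exactly nickname/app is also rejected.
  if (parts.length !== 2 || !parts.every((part) => NAME_PATTERN.test(part))) {
    return { ok: false, status: 400, reason: "X-Sandbox-App must look like nickname/app" };
  }
  const [nickname, app] = parts;
  // The storage path is built by the server, never taken from the request body.
  return { ok: true, path: `sandbox_data/${nickname}--${app}` };
}

console.log(resolveBackendNamespace("ryan/my-app"));
console.log(resolveBackendNamespace(null));
console.log(resolveBackendNamespace("ryan/../other"));
```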

&lt;p&gt;We deliberately do &lt;strong&gt;not&lt;/strong&gt; use the &lt;code&gt;K-Service&lt;/code&gt; header (the Cloud Run-injected service name). That's a client-spoofable header, and another implementation that relied on it had a "read another app's data" vulnerability disclosed. Requiring &lt;code&gt;X-Sandbox-App&lt;/code&gt; keeps the only valid route through an explicitly server-validated path.&lt;/p&gt;

&lt;p&gt;The clincher: &lt;strong&gt;a dedicated named database for Sandbox&lt;/strong&gt;. Instead of the &lt;code&gt;(default)&lt;/code&gt; DB (which contains data from other systems), we use an independent Firestore database called &lt;code&gt;sandbox&lt;/code&gt;, and the Cloud Run SA gets an IAM Condition that allows access only to the &lt;code&gt;sandbox&lt;/code&gt; DB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From infra/mcp/git-server/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// IAM Condition on roles/datastore.user:&lt;/span&gt;
&lt;span class="c1"&gt;//   resource.name == "projects/.../databases/sandbox" ||&lt;/span&gt;
&lt;span class="c1"&gt;//   resource.name.startsWith("projects/.../databases/sandbox/")&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No matter how badly the AI-written code goes wrong, it physically cannot reach data outside Sandbox.&lt;/p&gt;
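&lt;p&gt;The condition has two clauses for a reason, which a plain predicate makes easy to see. This sketch mirrors the condition's logic in TypeScript rather than evaluating real IAM; &lt;code&gt;conditionAllows&lt;/code&gt; is a hypothetical helper and &lt;code&gt;example-project&lt;/code&gt; is a placeholder for the project ID elided above.&lt;/p&gt;

```typescript
// Placeholder project id; the real one is elided in the article.
const SANDBOX_DB = "projects/example-project/databases/sandbox";

// Mirrors the two-clause condition: the database itself, or anything under it.
function conditionAllows(resourceName: string): boolean {
  return resourceName === SANDBOX_DB || resourceName.startsWith(SANDBOX_DB + "/");
}

console.log(conditionAllows("projects/example-project/databases/sandbox"));
console.log(conditionAllows("projects/example-project/databases/sandbox/documents/sandbox_data/x"));
console.log(conditionAllows("projects/example-project/databases/(default)/documents/users/1"));
console.log(conditionAllows("projects/example-project/databases/sandbox-archive"));
```

&lt;p&gt;The trailing slash in the prefix check is what keeps a sibling database whose name merely starts with &lt;code&gt;sandbox&lt;/code&gt; out of scope.&lt;/p&gt;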

&lt;h3&gt;
  
  
  ④ Execution Gate — Cloud Run SA + IAM
&lt;/h3&gt;

&lt;p&gt;All &lt;code&gt;sandbox-*&lt;/code&gt; Cloud Run services run under &lt;strong&gt;a single shared SA&lt;/strong&gt; (e.g. &lt;code&gt;sandbox-run&lt;/code&gt;). The permissions on that SA are minimal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;roles/logging.logWriter&lt;/code&gt; (write its own logs)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;roles/bigquery.jobUser&lt;/code&gt;, plus &lt;code&gt;roles/bigquery.dataViewer&lt;/code&gt; scoped to the &lt;code&gt;sandbox_logs&lt;/code&gt; dataset only (its own access logs, nothing else)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;roles/datastore.user&lt;/code&gt; (IAM Condition limiting to &lt;code&gt;sandbox&lt;/code&gt; DB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it does &lt;strong&gt;not&lt;/strong&gt; have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access to the &lt;code&gt;(default)&lt;/code&gt; Firestore that holds data from other systems&lt;/li&gt;
&lt;li&gt;Access to BigQuery datasets used by other internal systems&lt;/li&gt;
&lt;li&gt;Direct access to Secret Manager&lt;/li&gt;
&lt;li&gt;Permission to manage other Cloud Run services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, &lt;strong&gt;even if a Sandbox app goes completely rogue, the blast radius is limited to &lt;code&gt;sandbox_data&lt;/code&gt; and &lt;code&gt;sandbox_logs&lt;/code&gt;&lt;/strong&gt;. Nothing outside Sandbox is affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging — Apps Can Query Their Own Access Logs
&lt;/h2&gt;

&lt;p&gt;Sandbox apps eventually want to look at logs too. "How many views did this page get?" "Who hit that error?"&lt;/p&gt;

&lt;p&gt;We forward Cloud Run request logs to BigQuery via a &lt;strong&gt;Logging Sink&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From infra/mcp/git-server/index.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sandboxLogSink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;gcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProjectSink&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sandbox-logs-sink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`bigquery.googleapis.com/projects/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/datasets/sandbox_logs`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;resource.type="cloud_run_revision"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;resource.labels.service_name:"sandbox-"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;logName:"run.googleapis.com%2Frequests"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; AND &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;bigqueryOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;usePartitionedTables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;sandbox_logs&lt;/code&gt; dataset is locked down with &lt;strong&gt;project-owner-only ACLs&lt;/strong&gt; (it contains PII like remoteIp and User-Agent), and the Sandbox SA is granted &lt;code&gt;roles/bigquery.dataViewer&lt;/code&gt; on that dataset only.&lt;/p&gt;

&lt;p&gt;This lets apps query their own access logs from BigQuery. "Post last week's user count for this app to Slack" can be done entirely inside Sandbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Design — Making AI Use Tools Correctly
&lt;/h2&gt;

&lt;p&gt;Let me close with a note on tool definitions. I personally think this is what makes or breaks an MCP design.&lt;/p&gt;

&lt;p&gt;Sandbox MCP exposes 10 tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_publish&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Start deploy (async)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_deploy_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check deploy status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_init_repo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Initialize git push repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_write_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write file (overwrite/append)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_delete&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Delete app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_schedule&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configure Cloud Scheduler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_unschedule&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Remove Cloud Scheduler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_read_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read source code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sandbox_list_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List files&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Whether the AI picks the right tool at the right moment is almost entirely determined by &lt;strong&gt;what's written in the tool description&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, the description for &lt;code&gt;sandbox_publish&lt;/code&gt; covers not just functionality but also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supported app types and required files (Python / Node.js / static HTML / custom)&lt;/li&gt;
&lt;li&gt;Startup command and PORT requirement per type&lt;/li&gt;
&lt;li&gt;When to use &lt;code&gt;write_file&lt;/code&gt; vs &lt;code&gt;git push&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How to use SandboxDB (with SDK code samples)&lt;/li&gt;
&lt;li&gt;How to use the UI Kit (explicit instruction to fetch README.md via &lt;code&gt;read_file&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this in place, the AI can run a flow like this on its own:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User says "build me a tool that displays Slack emoji scores"&lt;/li&gt;
&lt;li&gt;→ Reads &lt;code&gt;sandbox_publish&lt;/code&gt; description and sees "first read the UI Kit README"&lt;/li&gt;
&lt;li&gt;→ Calls &lt;code&gt;read_file&lt;/code&gt; on &lt;code&gt;sandbox-ui-kit/README.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;→ Generates HTML/CSS/JS following the guidelines&lt;/li&gt;
&lt;li&gt;→ Sees the SandboxDB SDK usage in the description and integrates persistence&lt;/li&gt;
&lt;li&gt;→ Calls &lt;code&gt;sandbox_publish&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole flow runs without asking the user a single follow-up question. &lt;strong&gt;Writing not just "what it does" but "what to do with it" into the tool definition&lt;/strong&gt; is the secret to AI-friendly design.&lt;/p&gt;

&lt;p&gt;If you write tool definitions tersely, the AI keeps coming back asking "what should I do next?" The description is less of a human-facing doc and more of an &lt;strong&gt;AI-facing runbook&lt;/strong&gt;. That framing helps a lot.&lt;/p&gt;
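&lt;p&gt;To make the "runbook" framing concrete, here is an abbreviated, hypothetical tool definition. The wording of the description and the &lt;code&gt;required&lt;/code&gt; list are illustrative assumptions, not the actual &lt;code&gt;sandbox_publish&lt;/code&gt; definition; the tool names and argument names are the ones used in this article.&lt;/p&gt;

```typescript
// Hypothetical, abbreviated definition: the description reads like a runbook
// (what to do next), not just a one-line summary of what the tool does.
const publishToolSketch = {
  name: "sandbox_publish",
  description: [
    "Start an async deploy of the caller's app, then poll sandbox_deploy_status for the result.",
    "Before generating UI, call sandbox_read_file on sandbox-ui-kit/README.md and follow it.",
    "Small apps (a few files): upload with sandbox_write_file.",
    "Larger apps: call sandbox_init_repo, git push to the returned URL, then publish.",
    "The server process must listen on the PORT environment variable.",
  ].join("\n"),
  inputSchema: {
    type: "object",
    properties: {
      app_name: { type: "string" },
      description: { type: "string" },
    },
    required: ["app_name"],
  },
};

console.log(publishToolSketch.description);
```

&lt;p&gt;Each line answers a question the AI would otherwise have to ask the user, which is why the longer description pays for itself.&lt;/p&gt;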

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;Sandbox MCP exists to answer two challenges of building internal tools in the AI era:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building&lt;/strong&gt; is now possible for anyone, thanks to AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publishing safely&lt;/strong&gt; remains hard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To close that gap, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardized every layer&lt;/strong&gt; on the platform side: frontend / backend / DB / infra / auth / domain / SSL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded a runbook into tool descriptions&lt;/strong&gt; so the AI naturally uses things correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layered four access gates&lt;/strong&gt; (Worker-level OAuth / MCP OAuth / namespace isolation / IAM) so safety &lt;strong&gt;doesn't depend on the implementation being correct&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building this, what struck me again is that &lt;strong&gt;the role of platforms in an AI-powered development era is shifting&lt;/strong&gt;. Platforms used to optimize for "easy for humans." Now they also need to optimize for &lt;strong&gt;"used correctly by AI."&lt;/strong&gt; Tool descriptions are AI-facing docs, and safety must be designed assuming AI will write incorrect code.&lt;/p&gt;

&lt;p&gt;At the same time, by &lt;strong&gt;limiting what the builder is responsible for&lt;/strong&gt;, we drastically lower the barrier to "let me just try something." That's the entry point that turns a non-engineer's "I want to build this" into actual operational improvements.&lt;/p&gt;

&lt;p&gt;I hope this is useful for anyone designing internal platforms.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Still Measuring Initiative Impact Manually? How We Used Graph RAG + MCP to Make It Explorable</title>
      <dc:creator>Ryosuke Tsuji</dc:creator>
      <pubDate>Mon, 20 Apr 2026 15:27:35 +0000</pubDate>
      <link>https://dev.to/ryantsuji/we-built-a-custom-graph-rag-to-let-ai-answer-did-that-initiative-actually-work-3oda</link>
      <guid>https://dev.to/ryantsuji/we-built-a-custom-graph-rag-to-let-ai-answer-did-that-initiative-actually-work-3oda</guid>
      <description>&lt;p&gt;Hi, I'm &lt;a href="https://x.com/ryantsuji" rel="noopener noreferrer"&gt;Ryan&lt;/a&gt;, CTO at airCloset.&lt;/p&gt;

&lt;p&gt;In my previous posts, I introduced &lt;a href="https://dev.to/ryantsuji/democratizing-internal-data-building-an-mcp-server-that-lets-you-search-991-tables-in-natural-1da5"&gt;an MCP server that lets you search all company databases in natural language&lt;/a&gt; and showed &lt;a href="https://dev.to/ryantsuji/we-built-17-mcp-servers-to-let-ai-run-our-internal-operations-3lk2"&gt;the full picture of our 17 internal MCP servers&lt;/a&gt;. This time, I'm diving deep into what I briefly mentioned as "Biz Graph."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the story of how we represented the relationship between business initiatives and KPIs as a graph structure, enabling AI to answer "Did that initiative actually work?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Graph RAG?
&lt;/h2&gt;

&lt;p&gt;To get more value from AI, what matters is not just feeding it data — it's conveying &lt;strong&gt;the relationships between data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your data volume is small enough, tools like NotebookLM can deliver great results. But you can't fit all your business data into a context window. Initiative reports, KPI spreadsheets, marketing weekly reports, logistics daily metrics — you simply cannot dump all of that into a prompt.&lt;/p&gt;

&lt;p&gt;That's why I believe the best available option right now is &lt;strong&gt;Graph RAG&lt;/strong&gt;: making the right data searchable at any time, along with its relationships. When AI is asked "What metrics are related to this initiative?", it can traverse the graph and extract only the information it needs — because that structure was built in advance.&lt;/p&gt;

&lt;p&gt;But there's a catch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Non-Graph Data Into a Graph
&lt;/h2&gt;

&lt;p&gt;Many of you have heard of "knowledge graphs" and "Graph RAG." But most people who actually try to build one hit the same wall:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business data doesn't naturally form a graph.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With our DB Graph project, things were different. Tables had foreign keys. ORMs had &lt;code&gt;@JoinColumn&lt;/code&gt; and &lt;code&gt;belongsTo&lt;/code&gt;. &lt;strong&gt;Relationships already existed in the data&lt;/strong&gt; — we just had to parse and convert them.&lt;/p&gt;

&lt;p&gt;But the relationship between "initiatives" and "KPIs" has none of that.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A meeting slide says "SNS ad campaign launched"&lt;/li&gt;
&lt;li&gt;A spreadsheet records "This week's new members: 1,234"&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;There's no FK between these. No join key.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"The SNS campaign affected new member signups" — that relationship &lt;strong&gt;exists only in someone's head&lt;/strong&gt;. It's nowhere in the spreadsheet.&lt;/p&gt;

&lt;p&gt;This is what "business data doesn't form a graph" means. The relationships between entities aren't self-evident — &lt;strong&gt;you have to design the graph structure itself&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: "Did That Initiative Actually Work?"
&lt;/h2&gt;

&lt;p&gt;Every week, our company reports initiative progress in all-hands meetings and group-level standups.&lt;/p&gt;

&lt;p&gt;"We launched the spring SNS ad campaign"&lt;br&gt;
"We improved the recommendation engine"&lt;br&gt;
"We're raising our CS SLA achievement rate"&lt;/p&gt;

&lt;p&gt;— Dozens of initiatives reported weekly. Hundreds per year. &lt;strong&gt;Over 5,000 total&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Meanwhile, a separate spreadsheet tracks 200+ metrics daily and weekly: member count, new signups, retention rate, satisfaction scores, acquisition CPA...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem: these two worlds are completely disconnected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"How much did last month's SNS campaign contribute to new member acquisition?"&lt;/p&gt;

&lt;p&gt;Answering this requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the initiative's execution period (which slide was that again?)&lt;/li&gt;
&lt;li&gt;Find KPI data for that period (which sheet, which tab?)&lt;/li&gt;
&lt;li&gt;Align timeframes and compare numbers (week-over-week? month-over-month? year-over-year?)&lt;/li&gt;
&lt;li&gt;Check if other initiatives were running simultaneously (confounding factors?)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This manual analysis takes 30-60 minutes, and it has to be &lt;strong&gt;repeated every week for multiple initiatives&lt;/strong&gt;. Realistically, most initiative effectiveness reviews end with "it probably worked, I think."&lt;/p&gt;
&lt;h2&gt;
  
  
  Biz Graph: The Big Picture
&lt;/h2&gt;

&lt;p&gt;We built &lt;strong&gt;Biz Graph&lt;/strong&gt; to solve this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tgsnnpkv7f4w4pnnxkv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tgsnnpkv7f4w4pnnxkv.png" alt="System Overview" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scale
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The numbers below differ from actual values but convey the order of magnitude. In any case, this is far too much data to fit in an LLM's context window.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nodes&lt;/td&gt;
&lt;td&gt;~10,000 (14 types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edges&lt;/td&gt;
&lt;td&gt;~71,000 (22 types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Initiatives&lt;/td&gt;
&lt;td&gt;~5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KPI Metrics&lt;/td&gt;
&lt;td&gt;~4,000 (members/signups/retention/satisfaction/UX/marketing/logistics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing Channels&lt;/td&gt;
&lt;td&gt;~100 (SEM/LINE/email/CRM etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Sources&lt;/td&gt;
&lt;td&gt;9 tables/spreadsheets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Three Components
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Biz Graph Transformer&lt;/strong&gt; — Weekly graph rebuild from all data sources (Cloud Run Job, every Friday 22:00)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Biz Graph MCP Server&lt;/strong&gt; — Graph search + time series analysis accessible from AI (Cloud Run)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Biz Data Loader&lt;/strong&gt; — Daily auto-import of marketing/logistics data (Cloud Run Job, every morning 6:00)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  The Core Design: The Week Node
&lt;/h2&gt;

&lt;p&gt;Here's the heart of this article.&lt;/p&gt;

&lt;p&gt;How do you connect "initiatives" and "metrics" in a graph? The obvious first thought is direct edges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initiative("SNS campaign") ──AFFECTS──→ Metric("new_members")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This design breaks down.&lt;/strong&gt; Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Edge explosion&lt;/strong&gt;: 5,000 initiatives × 4,000 metrics = up to 20 million edges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal uncertainty&lt;/strong&gt;: "SNS campaign affected new members" is a hypothesis, not a fact. Direct edges make it look like a confirmed relationship&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing temporal info&lt;/strong&gt;: There's no way to express &lt;em&gt;when&lt;/em&gt; the impact occurred&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead, we designed &lt;strong&gt;Week nodes as shared anchors for indirect connections&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadq8pn3dl2yuqeekx4be.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadq8pn3dl2yuqeekx4be.png" alt="Week Anchor" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initiative("SNS campaign")     ──ACTIVE_DURING_WEEK──→  Week:2026-03-03
Metric("new_members")          ──HAS_DATA_AT──→         Week:2026-03-03
QualityMetric("avg_rating")    ──HAS_QUALITY_DATA_AT──→ Week:2026-03-03
MarketingChannel("SEM brand")  ──HAS_MARKETING_DATA_AT──→ Week:2026-03-03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initiatives and metrics aren't directly connected — they're &lt;strong&gt;indirectly linked through the same week&lt;/strong&gt;.&lt;/p&gt;
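
&lt;p&gt;To put rough numbers on the difference, here is a minimal sketch that compares the worst-case edge count of direct links with the edge count of the Week-anchor layout. The initiative and metric counts are the order-of-magnitude figures from the Scale table; the average number of weeks per node (&lt;code&gt;avgActiveWeeks&lt;/code&gt;, &lt;code&gt;avgDataWeeks&lt;/code&gt;) is an assumption made purely for illustration, not a measured value.&lt;/p&gt;

```typescript
// Order-of-magnitude figures quoted in this article.
const initiativeCount = 5000;
const metricCount = 4000;

// Direct design: in the worst case every initiative links to every metric.
function directEdgeUpperBound(initiatives: number, metrics: number): number {
  return initiatives * metrics;
}

// Week-anchor design: each node links only to the weeks it touches.
// avgActiveWeeks and avgDataWeeks are assumed averages (hypothetical).
function weekAnchorEdgeEstimate(
  initiatives: number,
  metrics: number,
  avgActiveWeeks: number,
  avgDataWeeks: number,
): number {
  return initiatives * avgActiveWeeks + metrics * avgDataWeeks;
}

const direct = directEdgeUpperBound(initiativeCount, metricCount);
const anchored = weekAnchorEdgeEstimate(initiativeCount, metricCount, 4, 10);
console.log(direct, anchored);
```

&lt;p&gt;The first number grows with the product of the two node counts; the second grows with their sum, scaled by how many weeks each node touches.&lt;/p&gt;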

&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Prevents edge explosion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initiatives only connect to "weeks they were active." Metrics only connect to "weeks that have data." Instead of a cross-product, each connects independently to Week nodes — edge count grows linearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Expresses co-occurrence, not causation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Initiatives that were active the same week as metric fluctuations" — this isn't asserting causation, it's a structure for &lt;strong&gt;discovering causal candidates&lt;/strong&gt;. It leaves room for human or AI judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Edge types distinguish data sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same Week node, but &lt;code&gt;HAS_DATA_AT&lt;/code&gt; (business KPIs), &lt;code&gt;HAS_QUALITY_DATA_AT&lt;/code&gt; (service quality), &lt;code&gt;HAS_UX_DATA_AT&lt;/code&gt; (UX metrics), &lt;code&gt;HAS_MARKETING_DATA_AT&lt;/code&gt; (marketing), &lt;code&gt;HAS_LOGI_DATA_AT&lt;/code&gt; (logistics) — "what kind of data" is embedded in the edge type itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Time series traversal is natural&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Week nodes are connected by &lt;code&gt;NEXT_WEEK&lt;/code&gt; edges. "How did metrics change in the 3 weeks before and after initiative start?" can be expressed as graph traversal.&lt;/p&gt;
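
&lt;p&gt;As a rough illustration of that traversal, the sketch below walks &lt;code&gt;NEXT_WEEK&lt;/code&gt; edges in both directions from a start week and returns the surrounding window. It runs over an in-memory edge list rather than the real store, and the &lt;code&gt;Edge&lt;/code&gt; type, the function names, and the sample week IDs are assumptions made for the example.&lt;/p&gt;

```typescript
// Hypothetical edge shape, mirroring the edge_type/source_id/target_id fields used in this article.
type Edge = { edge_type: string; source_id: string; target_id: string };
type Direction = "forward" | "backward";

// Follow NEXT_WEEK edges from a start week for up to `steps` hops.
function walkWeeks(edges: Edge[], startWeekId: string, steps: number, direction: Direction): string[] {
  const nextWeekEdges = edges.filter((e) => e.edge_type === "NEXT_WEEK");
  const visited: string[] = [];
  let current = startWeekId;
  for (let hop = 0; steps > hop; hop++) {
    const here = current;
    const edge = nextWeekEdges.find((e) =>
      direction === "forward" ? e.source_id === here : e.target_id === here,
    );
    if (edge === undefined) break; // ran off the end of the timeline
    current = direction === "forward" ? edge.target_id : edge.source_id;
    visited.push(current);
  }
  return visited;
}

// Chronological window: `radius` weeks before, the start week, `radius` weeks after.
function weekWindow(edges: Edge[], startWeekId: string, radius: number): string[] {
  const before = walkWeeks(edges, startWeekId, radius, "backward").reverse();
  const after = walkWeeks(edges, startWeekId, radius, "forward");
  return [...before, startWeekId, ...after];
}

// Sample timeline with made-up week IDs.
const sampleWeeks = [
  "Week:2026-02-09", "Week:2026-02-16", "Week:2026-02-23", "Week:2026-03-02",
  "Week:2026-03-09", "Week:2026-03-16", "Week:2026-03-23",
];
const sampleEdges: Edge[] = [];
sampleWeeks.forEach((week, i) => {
  const next = sampleWeeks[i + 1];
  if (next !== undefined) {
    sampleEdges.push({ edge_type: "NEXT_WEEK", source_id: week, target_id: next });
  }
});

console.log(weekWindow(sampleEdges, "Week:2026-03-02", 3));
```

&lt;p&gt;Once the window of Week IDs is known, the metrics attached to each week can be read off through the data edges described above.&lt;/p&gt;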

&lt;h2&gt;
  
  
  MetricDomain: Bridging Worlds Without Join Keys
&lt;/h2&gt;

&lt;p&gt;Week nodes tell us "what happened the same week," but not &lt;strong&gt;which metrics are relevant to a given initiative&lt;/strong&gt;. There's no point looking at logistics data when analyzing an SNS ad campaign.&lt;/p&gt;

&lt;p&gt;However, there's &lt;strong&gt;no join key&lt;/strong&gt; between initiative categories ("Marketing (Advertising)") and metric groups ("New Acquisition"). The knowledge that "ad initiatives relate to new acquisition" is tacit — it exists only in people's heads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MetricDomain&lt;/strong&gt; (6 domains) turns this tacit knowledge into explicit graph structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds6dldm4urh8qrp0grln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds6dldm4urh8qrp0grln.png" alt="MetricDomain" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Connected metric types&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;acquisition&lt;/td&gt;
&lt;td&gt;New acquisition&lt;/td&gt;
&lt;td&gt;Marketing channels, new member count, registration CV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;retention&lt;/td&gt;
&lt;td&gt;Retention / churn prevention&lt;/td&gt;
&lt;td&gt;Member count, churn rate, plan transitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;service_quality&lt;/td&gt;
&lt;td&gt;Service quality&lt;/td&gt;
&lt;td&gt;Satisfaction, ratings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operations&lt;/td&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Selection, shipping, returns, logistics KPIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ux&lt;/td&gt;
&lt;td&gt;User experience&lt;/td&gt;
&lt;td&gt;Sessions, funnels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;revenue&lt;/td&gt;
&lt;td&gt;Revenue / purchases&lt;/td&gt;
&lt;td&gt;Purchase CV, upsell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These 6 domains aren't fixed — they can be freely added or split as the business grows and the organization evolves. Domain definitions are just mapping tables in code, so the cost of expansion is nearly zero.&lt;/p&gt;

&lt;p&gt;Because &lt;strong&gt;humans define&lt;/strong&gt; the mapping between initiative categories and MetricDomains, and between metric groups and MetricDomains, the system can "automatically show acquisition-related metrics when viewing a marketing initiative."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Category("Marketing ads") ──CATEGORY_IN_DOMAIN──→ MetricDomain("acquisition")
                                                           ↑ IN_DOMAIN
                                                  MetricGroup("New Acquisition")
                                                  MarketingChannel("SEM brand")
                                                  UxMetric("registration_completed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Pass &lt;code&gt;domain: "acquisition"&lt;/code&gt; to &lt;code&gt;compare_metrics&lt;/code&gt;, and the initiative overlay automatically filters to acquisition-related initiatives only.&lt;/p&gt;
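
&lt;p&gt;One way to picture the lookup behind this is a two-step resolution: category to domains, then domains to metric groups. The sketch below assumes plain mapping tables; only the &lt;code&gt;Marketing (Advertising)&lt;/code&gt; and &lt;code&gt;New Acquisition&lt;/code&gt; names come from this article, while &lt;code&gt;GROUP_TO_DOMAINS&lt;/code&gt;, its other entries, and &lt;code&gt;relevantGroups&lt;/code&gt; are hypothetical.&lt;/p&gt;

```typescript
// Category to domain mapping (human-defined).
const CATEGORY_TO_DOMAINS: { [category: string]: string[] } = {
  "Marketing (Advertising)": ["acquisition"],
};

// Metric group to domain mapping (entries other than "New Acquisition" are made up).
const GROUP_TO_DOMAINS: { [group: string]: string[] } = {
  "New Acquisition": ["acquisition"],
  "Churn": ["retention"],
  "Logistics KPIs": ["operations"],
};

// Metric groups that share at least one domain with the given initiative category.
function relevantGroups(category: string): string[] {
  const domains = CATEGORY_TO_DOMAINS[category] ?? [];
  return Object.keys(GROUP_TO_DOMAINS).filter((group) =>
    (GROUP_TO_DOMAINS[group] ?? []).some((domain) => domains.indexOf(domain) >= 0),
  );
}

console.log(relevantGroups("Marketing (Advertising)"));
```

&lt;p&gt;A category that has no mapping simply resolves to no groups, so an unmapped initiative shows no domain-filtered metrics rather than all of them.&lt;/p&gt;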

&lt;h2&gt;
  
  
  SIMILAR_TO: AI Answers "Have We Done Something Like This Before?"
&lt;/h2&gt;

&lt;p&gt;Another unique design element: &lt;strong&gt;SIMILAR_TO edges&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Initiative text (title + description) is vectorized to 768 dimensions using Vertex AI's gemini-embedding-001, then BigQuery's VECTOR_SEARCH auto-detects similar pairs with cosine similarity &amp;gt;= 0.75.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;VECTOR_SEARCH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;cortex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;biz_graph_nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cortex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;biz_graph_nodes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;node_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Initiative'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;distance_type&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'COSINE'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;  &lt;span class="c1"&gt;-- distance &amp;lt;= 0.25 = similarity &amp;gt;= 0.75&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently &lt;strong&gt;~13,000 SIMILAR_TO edges&lt;/strong&gt; exist. Up to 5 similar initiatives are pre-computed for each one.&lt;/p&gt;
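
&lt;p&gt;The threshold logic can be sketched outside BigQuery as well. The example below computes cosine distance as 1 minus cosine similarity, drops the self-match, and keeps at most five neighbours within distance 0.25. Requesting six neighbours in the query presumably leaves room for that self-match, though the article does not say so explicitly. The &lt;code&gt;EmbeddedNode&lt;/code&gt; type, the function names, and the two-dimensional sample vectors are assumptions; the real embeddings have 768 dimensions.&lt;/p&gt;

```typescript
// Hypothetical node shape; real embeddings are 768-dimensional.
type EmbeddedNode = { id: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, i) => {
    const other = b[i] ?? 0;
    dot += value * other;
    normA += value * value;
    normB += other * other;
  });
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return dot / Math.sqrt(normA * normB);
}

// Cosine distance: 0 means same direction, 1 means orthogonal.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

// IDs of up to `limit` other nodes within `maxDistance` of the query node, nearest first.
function similarNodes(nodes: EmbeddedNode[], query: EmbeddedNode, limit = 5, maxDistance = 0.25): string[] {
  return nodes
    .filter((node) => node.id !== query.id) // drop the self-match
    .map((node) => ({ id: node.id, distance: cosineDistance(query.embedding, node.embedding) }))
    .filter((pair) => maxDistance >= pair.distance)
    .sort((x, y) => x.distance - y.distance)
    .slice(0, limit)
    .map((pair) => pair.id);
}

// Toy 2-dimensional embeddings (made up).
const toyNodes: EmbeddedNode[] = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0.9, 0.1] },
  { id: "c", embedding: [0, 1] },
  { id: "d", embedding: [1, 1] },
  { id: "e", embedding: [2, 1] },
];

console.log(similarNodes(toyNodes, toyNodes[0] ?? { id: "a", embedding: [1, 0] }));
```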

&lt;p&gt;"Didn't we run a similar SNS campaign last summer? How did that one perform?" — traverse similar initiatives on the graph instantly, then compare KPI changes during weeks those initiatives were active.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Usage Examples
&lt;/h2&gt;

&lt;p&gt;Here's how exploration works via MCP tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All tool execution examples below run through MCP from an AI coding agent. The response format matches the real system, but numbers are dummy values and content is simplified.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  "Find marketing initiatives that drove acquisition"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;search_initiatives(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SNS advertising for new acquisition"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acquisition"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dateFrom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-10-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response (excerpt):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 initiatives found (by vector similarity):

1. SNS Ad Spring Collection Campaign (2026-03-09)
   Category: Marketing (Advertising)
   Similarity: 892/1000

2. Instagram Reels Ad Test (2026-02-23)
   Category: Marketing (Advertising)
   Similarity: 845/1000
   ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  "Show me the impact of that initiative"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;get_initiative_context(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"initiative_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Initiative:2026-03-09:SNS Ad Spring Collection Campaign"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metric_window_days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response (excerpt):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Initiative Context&lt;/span&gt;

Title: SNS Ad Spring Collection Campaign
Execution Period: 2026-03-01 to 2026-03-31
Category: Marketing (Advertising)
Target Domain: acquisition

&lt;span class="gu"&gt;## Similar Initiatives (SIMILAR_TO)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Instagram Reels Ad Test (similarity: 0.82)
&lt;span class="p"&gt;-&lt;/span&gt; 1-Month Free Trial Campaign (similarity: 0.78)

&lt;span class="gu"&gt;## KPI Changes During Initiative (30-day window)&lt;/span&gt;
| Metric | Pre-avg | Post-avg | Change |
|--------|---------|----------|--------|
| new_regular | 50 | 60 | +20.0% |
| new_lite | 30 | 35 | +16.7% |
| monthly | 1,000 | 1,050 | +5.0% |

&lt;span class="gu"&gt;## Service Quality Metrics&lt;/span&gt;
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| avg_rating | 3.50 | 3.60 | +2.9% |

&lt;span class="gu"&gt;## UX Metrics&lt;/span&gt;
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| total_sessions | 10,000 | 12,000 | +20.0% |
| registration_completed | 100 | 130 | +30.0% |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is the power of the Week node design.&lt;/strong&gt; Identify the weeks an initiative was active, then automatically pull all metrics (KPIs, quality, UX, marketing, logistics) from those same weeks.&lt;/p&gt;
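
&lt;p&gt;The Change column in the response above is consistent with a plain relative change between the two averages. A minimal sketch of that arithmetic follows, assuming the formula is (after - before) / before; the function names are hypothetical and the inputs are the dummy values from the tables.&lt;/p&gt;

```typescript
// Mean of a window of metric values.
function average(values: number[]): number {
  if (values.length === 0) throw new Error("cannot average an empty window");
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Relative change in percent between a baseline and a later value.
function percentChange(before: number, after: number): number {
  if (before === 0) throw new Error("baseline is zero, relative change is undefined");
  return ((after - before) * 100) / before;
}

// Format with an explicit sign and one decimal place.
function formatChange(before: number, after: number): string {
  const pct = percentChange(before, after);
  const sign = pct >= 0 ? "+" : "";
  return sign + pct.toFixed(1) + "%";
}

console.log(formatChange(50, 60));
console.log(formatChange(30, 35));
console.log(formatChange(3.5, 3.6));
```

&lt;p&gt;Because the calculation only compares averages from two windows, it describes co-occurrence with the initiative, not a proven effect of it.&lt;/p&gt;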

&lt;h3&gt;
  
  
  "Visualize new acquisition YoY with initiative overlay"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;compare_metrics(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"new_regular"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"new_lite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"new_monthly"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dateFrom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-10-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"granularity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weekly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overlay_initiatives"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acquisition"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is time series data with acquisition-domain initiatives overlaid on the same timeframe, so a KPI spike can immediately be lined up against the initiatives that were running at that time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build Pipeline: 9 Phases
&lt;/h2&gt;

&lt;p&gt;The graph is constructed in 9 phases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Initiative nodes + Category/Business/Team&lt;/td&gt;
&lt;td&gt;Initiative, Category, Business, Team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Daily KPIs (50 metrics)&lt;/td&gt;
&lt;td&gt;Metric → MetricGroup (10 groups)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Business KPIs + Departments&lt;/td&gt;
&lt;td&gt;Department → Metric (DEPT_TRACKS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Week nodes (shared anchors)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HAS_DATA_AT + ACTIVE_DURING_WEEK + NEXT_WEEK&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Service quality metrics (~50)&lt;/td&gt;
&lt;td&gt;QualityMetric → Week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;UX metrics (~40)&lt;/td&gt;
&lt;td&gt;UxMetric → Week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Marketing channels (~100)&lt;/td&gt;
&lt;td&gt;MarketingChannel → Week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MetricDomain (semantic bridge)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6 domains + IN_DOMAIN + TARGETS_DOMAIN&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Logistics KPIs (~10 categories)&lt;/td&gt;
&lt;td&gt;LogiMetric → Week&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Phases 4 and 8 are the &lt;strong&gt;key design points&lt;/strong&gt;. The other phases simply "turn data into nodes"; these two "give structure to relationships that don't exist in the source data."&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Week Node Generation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Convert initiative execution period to ISO weeks, generate ACTIVE_DURING_WEEK edges&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;initiative&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;initiatives&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;weeks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getISOWeeksBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;initiative&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;executionStartDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;initiative&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;executionEndDate&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Cap at 52 weeks (guard against long-running initiatives)&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;week&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;weeks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;52&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;edge_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ACTIVE_DURING_WEEK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;initiative&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Week:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;week&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Generate HAS_DATA_AT edges for weeks that have metric data&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;metricWeek&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;metricWeeks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;edge_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HAS_DATA_AT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Metric:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;metricWeek&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Week:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;metricWeek&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;week&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// NEXT_WEEK edges for time series traversal&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sortedWeeks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;allWeeks&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;sortedWeeks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;edge_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;NEXT_WEEK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Week:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;sortedWeeks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Week:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;sortedWeeks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 8: MetricDomain Generation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Category → Domain (semantic mapping defined by humans)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CATEGORY_TO_DOMAINS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Marketing (Advertising)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;acquisition&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CRM / Retention&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;retention&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Quality / Service Improvement&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;service_quality&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Operations Improvement&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;operations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;New Feature&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ux&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;revenue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Initiative → TARGETS_DOMAIN (main business only — limited to where KPI data exists)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;initiative&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;initiatives&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;initiative&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;business&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;MAIN_BUSINESS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;CATEGORY_TO_DOMAINS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;initiative&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;domain&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;domains&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;edge_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TARGETS_DOMAIN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;initiative&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`MetricDomain:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Not a Dedicated Graph DB or OSS Libraries?
&lt;/h2&gt;

&lt;p&gt;We implemented the graph using &lt;strong&gt;BigQuery alone&lt;/strong&gt;, without Neo4j, Amazon Neptune, or OSS like Microsoft's GraphRAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not a dedicated graph DB?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Dedicated Graph DB&lt;/th&gt;
&lt;th&gt;BigQuery&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graph traversal&lt;/td&gt;
&lt;td&gt;Fast (native)&lt;/td&gt;
&lt;td&gt;Fast enough (~10,000 node scale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector search&lt;/td&gt;
&lt;td&gt;Requires separate service&lt;/td&gt;
&lt;td&gt;VECTOR_SEARCH built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time series analysis&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Native (window functions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operating cost&lt;/td&gt;
&lt;td&gt;Always-on instances&lt;/td&gt;
&lt;td&gt;Serverless (pay per query)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joining other data&lt;/td&gt;
&lt;td&gt;ETL required&lt;/td&gt;
&lt;td&gt;Same project, instant JOIN&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For Biz Graph, "graph structure + time series analysis + vector search combined" matters more than "deep graph traversal." BigQuery handles all three in one engine.&lt;/p&gt;

&lt;p&gt;Additionally, BigQuery has announced &lt;a href="https://cloud.google.com/bigquery/docs/graph-overview" rel="noopener noreferrer"&gt;Graph capabilities&lt;/a&gt; — once GA, native graph queries on node/edge tables will be available. Currently we traverse with SQL JOINs, but we expect to migrate to faster, more intuitive queries in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not OSS libraries / SaaS?
&lt;/h3&gt;

&lt;p&gt;OSS like Microsoft GraphRAG and various Graph RAG SaaS products focus on &lt;strong&gt;automatically extracting entities and relationships from text documents&lt;/strong&gt;. Great for research papers or news articles, but not for our use case.&lt;/p&gt;

&lt;p&gt;The reason is simple: &lt;strong&gt;we need to design the graph structure itself&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The concept of Week nodes as "temporal anchors" doesn't exist in generic tools&lt;/li&gt;
&lt;li&gt;MetricDomain "semantic bridging" reflects our specific business structure&lt;/li&gt;
&lt;li&gt;The Initiative → Week → Metric indirect connection pattern won't emerge from LLM entity extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generic tools "auto-generate graphs from text." What we needed was "design the graph schema ourselves and integrate heterogeneous data sources." Fundamentally different problems.&lt;/p&gt;

&lt;p&gt;Internal query example (&lt;code&gt;get_initiative_context&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Get weeks the initiative was active&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;active_weeks&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;week_id&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cortex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;biz_graph_edges&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;initiative_id&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;edge_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ACTIVE_DURING_WEEK'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="c1"&gt;-- Get metrics that have data in those same weeks&lt;/span&gt;
&lt;span class="n"&gt;co_occurring_metrics&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;metric_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;edge_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;week_id&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cortex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;biz_graph_edges&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
  &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;active_weeks&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;week_id&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;edge_type&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'HAS_DATA_AT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'HAS_QUALITY_DATA_AT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'HAS_UX_DATA_AT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'HAS_MARKETING_DATA_AT'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;co_occurring_metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Graph traversal and time series data retrieval complete in a single SQL query. With a dedicated graph DB, you'd need to pass traversal results to another service for time series queries — an extra hop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initiative Data Ingestion: Auto-Extraction from Meeting Slides
&lt;/h2&gt;

&lt;p&gt;Graph quality depends on source data quality. Initiative data comes from all-hands and group meeting slides.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All-hands&lt;/td&gt;
&lt;td&gt;pptx in Drive → Slides conversion → text extraction&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Group standups&lt;/td&gt;
&lt;td&gt;Google Slides (cumulative, latest week appended)&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Text is extracted from meeting slides and structured by AI into the initiative table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;InitiativeRow&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;meetingDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// Meeting date&lt;/span&gt;
  &lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;// Source (all-hands / group standup etc.)&lt;/span&gt;
  &lt;span class="nl"&gt;business&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Business unit&lt;/span&gt;
  &lt;span class="nl"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Marketing (Ads), New Feature, ...&lt;/span&gt;
  &lt;span class="nl"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="c1"&gt;// Initiative title&lt;/span&gt;
  &lt;span class="nl"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// Detailed description&lt;/span&gt;
  &lt;span class="nl"&gt;team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// Executing team&lt;/span&gt;
  &lt;span class="nl"&gt;executionStartDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Execution start date&lt;/span&gt;
  &lt;span class="nl"&gt;executionEndDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// Execution end date&lt;/span&gt;
  &lt;span class="nl"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// JSON format numeric metrics&lt;/span&gt;
  &lt;span class="nl"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;// planned / in_progress / retrospective&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical fields are &lt;code&gt;executionStartDate&lt;/code&gt; and &lt;code&gt;executionEndDate&lt;/code&gt;. The meeting date (&lt;code&gt;meetingDate&lt;/code&gt;) is when an initiative was reported, which often differs from when it actually runs. For example, "We started the SNS campaign last week," reported on 3/9, means &lt;code&gt;executionStartDate&lt;/code&gt; is 3/1, not 3/9. Without this distinction, the initiative would be connected to the wrong Week nodes.&lt;/p&gt;
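
&lt;p&gt;To make the date-to-Week connection concrete, here is a minimal sketch of how an execution date range could be expanded into &lt;code&gt;ACTIVE_DURING_WEEK&lt;/code&gt; edges. It is not the actual cortex code: the function names, the Monday-based week boundary, the &lt;code&gt;Week:YYYY-MM-DD&lt;/code&gt; ID format, and the initiative ID in the usage line are all assumptions made for illustration.&lt;/p&gt;

```typescript
// Hypothetical sketch: expand an execution date range into ACTIVE_DURING_WEEK edges.
// Assumption: a Week ID is the Monday (UTC) of that week, formatted "Week:YYYY-MM-DD".
type Edge = { edge_type: string; source_id: string; target_id: string };

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the Monday (UTC) of the week containing the given date (YYYY-MM-DD).
function weekStart(isoDate: string): Date {
  const d = new Date(`${isoDate}T00:00:00Z`);
  const daysSinceMonday = (d.getUTCDay() + 6) % 7; // Sunday=0 maps to 6, Monday=1 maps to 0
  return new Date(d.getTime() - daysSinceMonday * DAY_MS);
}

// One edge per week touched by the execution period (both ends inclusive).
function activeWeekEdges(initiativeId: string, startDate: string, endDate: string): Edge[] {
  const edges: Edge[] = [];
  const last = weekStart(endDate).getTime();
  let t = weekStart(startDate).getTime();
  while (last >= t) {
    const week = new Date(t).toISOString().slice(0, 10);
    edges.push({ edge_type: 'ACTIVE_DURING_WEEK', source_id: initiativeId, target_id: `Week:${week}` });
    t += 7 * DAY_MS;
  }
  return edges;
}

// An initiative that ran 2026-03-02 to 2026-03-15 touches two Monday-based weeks.
console.log(activeWeekEdges('Initiative:sns-campaign', '2026-03-02', '2026-03-15').map((e) => e.target_id));
```

&lt;p&gt;Because the edges are derived from the execution dates rather than &lt;code&gt;meetingDate&lt;/code&gt;, an initiative reported late still lands on the weeks in which it actually ran.&lt;/p&gt;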

&lt;h2&gt;
  
  
  Operating Cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vertex AI Embedding (weekly)&lt;/td&gt;
&lt;td&gt;~$0.05/run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code (initiative extraction)&lt;/td&gt;
&lt;td&gt;Within monthly plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BQ storage&lt;/td&gt;
&lt;td&gt;A few GB (negligible)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Run Jobs&lt;/td&gt;
&lt;td&gt;Nearly free (1x weekly + 1x daily)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;Nearly free (Cloud Run min-instances=0)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A few dollars per month&lt;/strong&gt; to maintain a 10,000-node, 71,000-edge graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison With Typical Knowledge Graphs
&lt;/h2&gt;

&lt;p&gt;Let's take a step back and see how this design differs from conventional approaches.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Typical Knowledge Graph&lt;/th&gt;
&lt;th&gt;Biz Graph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node design&lt;/td&gt;
&lt;td&gt;Entities mapped directly to nodes&lt;/td&gt;
&lt;td&gt;Deliberately designed temporal anchors ("Week")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge semantics&lt;/td&gt;
&lt;td&gt;Relationships described as-is&lt;/td&gt;
&lt;td&gt;Edge types encode data source classification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermediate nodes&lt;/td&gt;
&lt;td&gt;Taxonomies for classification&lt;/td&gt;
&lt;td&gt;MetricDomain as semantic bridge (tacit knowledge encoded as structure)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graph construction&lt;/td&gt;
&lt;td&gt;Relationships extracted from existing data&lt;/td&gt;
&lt;td&gt;Deliberately designed graph from data with no inherent relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Primarily search and navigation&lt;/td&gt;
&lt;td&gt;Goes further into causal candidate exploration for initiative impact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Similarity search&lt;/td&gt;
&lt;td&gt;Text-based search&lt;/td&gt;
&lt;td&gt;Pre-computed SIMILAR_TO edges via Embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;In one sentence:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our DB Graph "made existing relationships discoverable." Biz Graph "designed and created relationships that didn't exist."&lt;/p&gt;

&lt;p&gt;The former is an analysis problem. The latter is a &lt;strong&gt;design problem&lt;/strong&gt; — designing the graph structure from scratch and integrating heterogeneous data sources (meeting slides, spreadsheets, BQ tables) into a single explorable structure. That's the essence of Biz Graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Graph RAG Over Flat RAG
&lt;/h2&gt;

&lt;p&gt;Let's revisit the "why Graph RAG?" question from the introduction.&lt;/p&gt;

&lt;p&gt;For initiative effectiveness analysis, consider what happens with standard vector search (flat RAG). Ask "What was the SNS campaign's impact?" — flat RAG returns text chunks similar to the initiative description. You get info about the initiative itself.&lt;/p&gt;

&lt;p&gt;But it won't return &lt;strong&gt;concurrent KPI changes&lt;/strong&gt;. It won't return &lt;strong&gt;results from past similar initiatives&lt;/strong&gt;. It won't return &lt;strong&gt;related domain metrics&lt;/strong&gt;.&lt;/p&gt;

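&lt;p&gt;These are pieces of information connected "through the graph," not by "text similarity." You can only reach them by following edges: Week nodes for concurrent KPI changes, SIMILAR_TO for past similar initiatives, and MetricDomain for related domain metrics. This "need to follow relationships" use case is exactly where Graph RAG has a clear advantage over flat RAG.&lt;/p&gt;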
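
&lt;p&gt;As a rough illustration of what "following relationships" means, the Initiative → Week → Metric hop can be sketched in memory. This is a simplified stand-in for the BigQuery query, not the production code: the edge type names come from this article, while the function name and the sample IDs are made up.&lt;/p&gt;

```typescript
// Hypothetical in-memory version of the Week-anchored traversal.
// Edge type names follow the article; the IDs in the sample data are invented.
type Edge = { edge_type: string; source_id: string; target_id: string };

const METRIC_EDGE_TYPES = new Set([
  'HAS_DATA_AT', 'HAS_QUALITY_DATA_AT', 'HAS_UX_DATA_AT', 'HAS_MARKETING_DATA_AT',
]);

// Initiative -> Week -> Metric: collect metrics that have data in the initiative's active weeks.
function coOccurringMetrics(edges: Edge[], initiativeId: string): string[] {
  const activeWeeks = new Set(
    edges
      .filter((e) => e.edge_type === 'ACTIVE_DURING_WEEK')
      .filter((e) => e.source_id === initiativeId)
      .map((e) => e.target_id),
  );
  const metrics = new Set(
    edges
      .filter((e) => METRIC_EDGE_TYPES.has(e.edge_type))
      .filter((e) => activeWeeks.has(e.target_id))
      .map((e) => e.source_id),
  );
  return Array.from(metrics).sort();
}

const sample: Edge[] = [
  { edge_type: 'ACTIVE_DURING_WEEK', source_id: 'Initiative:sns', target_id: 'Week:2026-03-02' },
  { edge_type: 'HAS_MARKETING_DATA_AT', source_id: 'Metric:signups', target_id: 'Week:2026-03-02' },
  { edge_type: 'HAS_DATA_AT', source_id: 'Metric:churn', target_id: 'Week:2026-02-23' },
];
// Only Metric:signups shares a week with the initiative.
console.log(coOccurringMetrics(sample, 'Initiative:sns'));
```

&lt;p&gt;Nothing in the text of "SNS campaign" resembles "signups", so a similarity search has no reason to return it. The shared Week node is the only link.&lt;/p&gt;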

&lt;h2&gt;
  
  
  Design Honesty: Not Asserting Causation
&lt;/h2&gt;

&lt;p&gt;One thing I was conscious of in this design: &lt;strong&gt;not asserting causation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Many BI tools and AI analyses want to declare "this initiative impacted this KPI." But in reality, there's no such certainty. Multiple initiatives may have been running simultaneously, the change could be seasonal, or it could reflect external market shifts.&lt;/p&gt;

&lt;p&gt;Week node indirect connections simply "lay out what happened in the same period." Causal judgment is left to human or AI reasoning. I believe this is a statistically honest approach.&lt;/p&gt;

&lt;p&gt;"A structure for discovering causal candidates" — not "a structure for asserting causation." This distinction matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations: The Designer's Tacit Knowledge Is the Bottleneck
&lt;/h2&gt;

&lt;p&gt;Let me be honest about the weaknesses of this approach.&lt;/p&gt;

&lt;p&gt;MetricDomain mappings (for example, "Marketing (Advertising)" → &lt;code&gt;acquisition&lt;/code&gt;) are hardcoded by humans. If this design is wrong, exploration results across the entire graph are skewed.&lt;/p&gt;

&lt;p&gt;This is also the answer to "why build it yourself." Off-the-shelf graph tools can't reflect your business structure, such as which initiative categories relate to which metric groups. Encoding this tacit knowledge as structure requires someone who knows the business.&lt;/p&gt;

&lt;p&gt;Going forward, we're considering having AI propose these mappings with humans reviewing them. Full automation is hard, but an "AI suggests, humans approve" workflow could reduce the maintenance cost of domain knowledge.&lt;/p&gt;
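
&lt;p&gt;One possible shape for such a workflow, purely as a sketch: proposals carry a review status, and only approved ones are merged into the mapping. The &lt;code&gt;Proposal&lt;/code&gt; type, its status values, and &lt;code&gt;applyApproved&lt;/code&gt; are hypothetical names for illustration, not something that exists in cortex today.&lt;/p&gt;

```typescript
// Hypothetical sketch of an "AI suggests, humans approve" gate for domain mappings.
// The Proposal shape and status values are assumptions, not the actual cortex workflow.
type DomainMap = { [category: string]: string[] };
type Proposal = { category: string; domains: string[]; status: 'pending' | 'approved' | 'rejected' };

// Only approved proposals change the mapping; pending and rejected ones are ignored.
// The input mapping is left untouched so a bad batch can be discarded safely.
function applyApproved(current: DomainMap, proposals: Proposal[]): DomainMap {
  const next: DomainMap = { ...current };
  for (const p of proposals) {
    if (p.status === 'approved') next[p.category] = [...p.domains];
  }
  return next;
}
```

&lt;p&gt;The point of the gate is that an AI proposal never reaches the graph on its own. A human still decides what each category means.&lt;/p&gt;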

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Turning business data into a graph is more of a &lt;strong&gt;design challenge&lt;/strong&gt; than a technical one.&lt;/p&gt;

&lt;p&gt;There's no FK between "initiatives" and "KPIs." No join key. But by deliberately designing two structures — &lt;strong&gt;temporal axis (Week nodes)&lt;/strong&gt; and &lt;strong&gt;semantic domains (MetricDomain)&lt;/strong&gt; — it becomes an explorable graph.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Week nodes&lt;/strong&gt;: Indirect connections via "same week" instead of direct initiative-metric edges. A structure for discovering causal candidates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MetricDomain&lt;/strong&gt;: Semantic bridge between initiative categories and metric groups. Tacit knowledge encoded as structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIMILAR_TO&lt;/strong&gt;: Similar initiatives pre-computed from embeddings. Instant answers to "have we done this before?"&lt;/li&gt;
&lt;/ul&gt;
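
&lt;p&gt;For readers who want to see the SIMILAR_TO idea in code, here is a minimal in-memory sketch based on cosine similarity. The real pipeline uses Vertex AI embeddings and BigQuery, so treat this only as an illustration of the technique: the &lt;code&gt;Embedded&lt;/code&gt; type, the function names, and the 0.8 threshold are assumptions, and vectors are assumed to have equal length.&lt;/p&gt;

```typescript
// Hypothetical sketch: pre-compute SIMILAR_TO edges from initiative embeddings.
// The 0.8 threshold is an illustrative assumption, not a value from the article.
type Edge = { edge_type: string; source_id: string; target_id: string };
type Embedded = { id: string; vector: number[] };

// Dot product; assumes both vectors have the same length.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);

// Cosine similarity, returning 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom;
}

// Compare every pair once per direction and keep the pairs above the threshold.
function similarToEdges(items: Embedded[], threshold = 0.8): Edge[] {
  const edges: Edge[] = [];
  for (const a of items) {
    for (const b of items) {
      if (a.id === b.id) continue;
      if (cosine(a.vector, b.vector) >= threshold) {
        edges.push({ edge_type: 'SIMILAR_TO', source_id: a.id, target_id: b.id });
      }
    }
  }
  return edges;
}
```

&lt;p&gt;The pairwise loop is quadratic, which is tolerable for a batch job at the scale of a few thousand initiatives. Because the edges are computed ahead of time, answering "have we done this before?" at query time is a plain edge lookup.&lt;/p&gt;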

&lt;p&gt;As a result, AI can now autonomously explore the graph to answer questions like "Did that initiative work?", "Find initiatives that drove acquisition", and "Show metrics YoY with initiative overlay."&lt;/p&gt;

&lt;p&gt;Graphs aren't something you "find" — they're something you &lt;strong&gt;design&lt;/strong&gt;. Especially for business data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>bigquery</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
