DEV Community: Ryosuke Tsuji

AI-Native Redesign: The Principles Don't Change — Only the Machinery Does

Ryosuke Tsuji — Mon, 20 Jul 2026 23:39:49 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset (a fashion-rental subscription service based in Japan).

"Everything changes with AI" is the prevailing mood. My experience building and then running an internal AI platform (cortex) points the other way. The principles don't change at all. Only the machinery does. This post is about what I've come to treat as principle, what I've concluded should be broken, and the thinking behind that split.

Disclaimer: "cortex" in this article is the internal codename for the AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

I've written about the individual pieces before: code-graph, product-graph, db-graph, biz-graph, AI-Observability, the auto-review harness, and Self-Healing. This post isn't about any of them. It's about the design principle sitting behind all of them, one abstraction level up, more essay than build log.

The principle, in one sentence: how do we make accurate information accessible? It's an old question. Libraries, legal case books, encyclopedias, search engines — every era has had its own answer using whatever tools that era gave it. Even the technology revolutions people call "paradigm shifts" mostly just changed the means. The underlying question didn't move.

Now AI has arrived, and my read (probably not a controversial one) is that its shift is at least on the scale of the internet, possibly larger. As with every previous paradigm shift, the means of answering "how do we make accurate information accessible?" will get redesigned from the ground up. That's what this post is about: AI-Native Redesign — a view where you rebuild the whole design with AI treated as a given, instead of bolting an "AI tool" or a RAG layer onto a design that was optimized for humans doing everything.

The Underlying Principle

Splitting the Principle into Three Nodes

The frame I use: split the principle into three nodes — creation / maintenance / consumption. Someone (or something) creates it. Someone maintains it. Someone consumes it.

A library: creation = the catalog and classification system, maintenance = adding new titles and updating the shelves, consumption = someone borrowing a book. Legal case books and lawyers: creation = courts writing their rulings, maintenance = new cases getting added to the corpus, consumption = a lawyer looking up a matching case for a client. Engineering docs at a company: creation = writing the spec or README, maintenance = updating it when the code changes, consumption = another engineer reading it while implementing. In every case, each of the three nodes has someone owning it, and the whole thing only works when all three keep turning.

As long as the three of them are held together by a chain of trust and incentive, the system holds itself up. If one of them drops out, the whole thing slides into a negative spiral and quality decays. I'll come back to the specific mechanism later in this chapter.

"Information" here changes shape depending on the domain — knowledge in books, legal precedent, internal documentation at a company, runtime behavior of a service, customer trends, the reasoning behind a decision. What they all share is this: as long as it lives only in one person's head, the moment scale kicks in the whole system stops working. It only starts to work at scale once the information is stored in a form you can look up, and reachable when someone needs it.

What Has Changed, What Hasn't

The means have changed a lot. Clay tablets, oral tradition, manuscripts, dictionaries and indices, the printing press, legal codes and lawyers, library classification, then in the last few decades: wikis and paper docs, APM and structured logs, searchable knowledge bases, distributed tracing, web-scale search engines, BI dashboards, RAG. Each era tried to optimize the three-node balance with whatever tools it had.

Wikis fit the 90s because "humans write, humans update, humans search" was the only shape available. APM appeared in the 2000s because storage got cheap enough to hold the telemetry that machines generate. Each generation had its own subject and its own tooling to answer with.

But the underlying question hasn't moved. Every era in every domain was solving the same problem — "make accurate information accessible" — with that era's tools. And in some sense, each generation was consciously designing the three-node balance. But once humans are in the loop, they cut corners on writing, forget to update, get things wrong, eventually stop maintaining, and the whole thing breaks. Some systems have held up (library systems, legal frameworks, established academic disciplines), but most have fallen into the negative-spiral side because the human limit was the binding constraint.

What makes AI different is that it dramatically widens the range where creation and maintenance can run without much human labor. Automating creation and maintenance is itself old news — deterministic systems have covered a huge amount of it, so widely that we don't even notice anymore. It's not just engineering infrastructure (APM, CI/CD, log collection, schema validation). Media services run on the same shape: articles get created, updated, and deleted through a system, then distribution pushes them to the app, the website, and the print edition. Bank transactions, e-commerce catalogs, routing data in a map app, the timeline in a social network. Deterministic automation is everywhere, quiet enough that it's easy to forget it's there.

But deterministic automation has a ceiling. There's a class of work it can't touch: qualitative judgment, articulating design context, updating documentation to follow a code change, adding annotations — anything that needs interpretation and context. AI is the first thing that reaches into that zone. And it can even sit on the consumption side (in the APM example, this means AI running the "look at the dashboards, close the improvement loop" work that humans used to do). Once the human requirement drops structurally, many domains can start running a positive spiral for the first time. But how to wire this expanded capability into your own organization's loops is still a human design call. That's what this post — AI-Native Redesign — is about.

Where this post lands
AI isn't a replacement for deterministic automation. It's a new capability that brings the zones deterministic tools never reached into the automation envelope — but redesigning your organization's loops around that expanded capability is what turns it into a change on the ground.

A Closer Look at Each Node

Creation. Putting information into a form that can be looked up later. Writing the structure of your code down as docs. Instrumenting the runtime to emit logs and traces. Leaving the reasoning behind a decision as a design doc. Turning a customer trend into a KPI definition. Anything that leaves something behind in a form someone (or something) can come back to.

Maintenance. Keeping what you already wrote from drifting away from its source as that source changes. Code changes, the doc should follow. Service topology changes, the metric definitions should follow. Customer trends shift, KPIs should follow. Decisions get overridden, the record should follow. If creation is a one-shot act, maintenance is the standing chore that never ends.

Consumption. Actually reaching for what's stored and using it to decide or act. Humans reading. Machines querying. Alerts firing. AI agents pulling context. All of it counts.

These aren't sequential phases (creation → maintenance → consumption). They're a single system held together by mutual trust and incentive — you can't evaluate any one of them in isolation.

The Negative Spiral — Why It Collapses in Most Real Places

Concretely, the interdependence looks like this:

If it doesn't get consumed, no one is motivated to create it. Nobody keeps writing docs no one reads. No engineer polishes a dashboard no one looks at.
If it doesn't get maintained, it becomes unfit to consume. A doc from six months ago gets a "probably stale" tag in someone's head and is skipped. A dashboard whose metric definition drifted since the last product change quietly seeds wrong decisions.
If it doesn't get created, there's nothing to maintain in the first place. Information that was never stored in a queryable form can't be tracked as it changes.

As long as all three sides believe "my part is worth doing," the loop keeps turning. But if any one of them gets too expensive, the other two lose their incentive too. "Nobody's going to read it, and it'll rot anyway," "I updated it but no one cared," "when I search it's stale or wrong" — three separate excuses that reinforce each other, and the negative spiral picks up speed.

How this collapses in most places
If any one of the three nodes gets too expensive, the other two lose their reason to invest. The reason documentation cultures don't stick, monitoring stacks go stale, and knowledge bases quietly hollow out isn't a tool problem — it's structural. The balance between the three nodes has to hold, or the whole thing decays in silence.

That's why patching a single node never fixes it. Many people have seen the "our Confluence has 100k pages and no one reads or updates them" version of this. That isn't a Confluence problem. It's what happens when two of the three nodes (creation and maintenance) stay expensive and the only fix applied is a search feature on the consumption side. The loop never closes.

The Rest of This Post

Everything from here on works off the same principle: make information accessible, and run it as a three-node loop.

Chapter 2 lines up concrete cases from around us (internal docs, monitoring, business data, schema management, security) and shows how each one has been carrying its own version of the three-node problem.
Chapter 3 argues why AI is the pivot, from the angle of "the first thing that brings zones deterministic automation couldn't reach into range of automation."
Chapter 4 walks through the AI-native implementations in cortex (code-graph / db-graph / biz-graph / product-graph / Observability) side by side.
Chapter 5 goes into the difference between "adding an AI tool" and AI-Native Redesign, and why keeping pre-AI design in place while adding AI is a losing pattern.
Chapter 6 sketches what widens between organizations that have multi-layered self-sustaining loops and organizations that don't — the evolution-speed gap I think grows non-linearly over time.

Each chapter reads independently, but read in order, they form a single chain: unchanging principle → common applications → what's special about AI → concrete implementations → the redesign argument → what's next.

Where Deterministic Automation Couldn't Reach

I said earlier that deterministic automation has already reached almost everywhere, and that it still has zones it can't touch. This chapter breaks those remaining zones down.

If you look at each domain and split it into the three nodes, two categories show up.

Type 1: Domains Where One Node Is Still Human-Only

Domains where deterministic automation covers some nodes but not others. Laying them out side by side:

Domain	Creation	Maintenance	Consumption
Internal docs	Manual (specs / design docs / runbooks)	Manual (keeping them current)	Manual (reading them)
Observability	Automatic (APM / logs / metrics)	Automatic (metric def changes still manual)	Manual (dashboard triage, improvement cycles)
Business data (BI reports)	Automatic (SQL / ETL)	Automatic (schema changes still manual)	Manual (analysis, interpretation)
Schema management (basic CRUD)	Automatic (ORM generation)	Manual (migration decisions)	Automatic (validation)
Security posture	Automatic (logs, anomaly detection)	Automatic (rule updates still manual)	Manual (threat assessment, alert triage)

Every remaining human dependency is in the parts that need qualitative judgment or contextual interpretation. In most of these domains, the bottleneck settles on the consumption side: "the information is there, but no one uses it" is the classic symptom. A human trying to interpret a room full of dashboards and log streams at once hits a cognitive ceiling, and that ceiling has been the structural bottleneck. Internal docs is a special case where all three nodes are human, which is why it's the most obvious pain point.
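To make the table easier to reason about, here is a minimal sketch that encodes it as data and counts where the human-only nodes sit. It is an illustration, not part of cortex: the `Domain` class and `bottleneck_counts` helper are names I made up, and the parenthetical caveats in the table (for example "metric def changes still manual") are deliberately collapsed into a plain "auto".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Domain:
    """One row of the table: who owns each of the three nodes."""
    name: str
    creation: str      # "auto" or "manual"
    maintenance: str
    consumption: str

    def manual_nodes(self) -> List[str]:
        """Return the nodes that still depend on a human."""
        owners = {
            "creation": self.creation,
            "maintenance": self.maintenance,
            "consumption": self.consumption,
        }
        return [node for node, owner in owners.items() if owner == "manual"]


# Transcribed from the table above (caveats in parentheses are dropped).
DOMAINS = [
    Domain("Internal docs", "manual", "manual", "manual"),
    Domain("Observability", "auto", "auto", "manual"),
    Domain("Business data (BI reports)", "auto", "auto", "manual"),
    Domain("Schema management (basic CRUD)", "auto", "manual", "auto"),
    Domain("Security posture", "auto", "auto", "manual"),
]


def bottleneck_counts(domains: List[Domain]) -> Dict[str, int]:
    """Count how many domains leave each node human-only."""
    counts = {"creation": 0, "maintenance": 0, "consumption": 0}
    for domain in domains:
        for node in domain.manual_nodes():
            counts[node] += 1
    return counts


if __name__ == "__main__":
    for domain in DOMAINS:
        print(f"{domain.name}: {', '.join(domain.manual_nodes())}")
    print(bottleneck_counts(DOMAINS))
```

Run against these five rows, consumption is the human-only node in four domains, maintenance in two, and creation in one, which matches the observation that the bottleneck tends to settle on the consumption side.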

Type 2: Domains Where Building the Artifact Wasn't Worth the Return

Separate from Type 1, there's a deeper category: the zones we never tried in the first place. The individual data and code are there, but the cost of building the "system that structures the relationships between them" never matched the payoff. This covers two subtypes — relationships that already exist but aren't visible to humans (like connections across code, or across tables) and relationships that were never even defined (like the causal link between a marketing initiative and a KPI). I'll come back to Type 2a and Type 2b in a moment.

What's interesting is that the underlying work — static analysis, SQL aggregation, graph-building pipelines — is often perfectly writable in deterministic logic. The reason nothing existed here wasn't "we couldn't build it." It was "the ROI didn't clear the bar."

Break down why no one built it and you get problems on both sides — a double whammy:

Cost side. Even if deterministic logic can build it, standing up the system to do so (the static analyzer, the extraction pipeline, the graph substrate) is real engineering effort.
Payoff side. Even if you build it, humans can't consume it usefully. A human traversing a graph node-by-node isn't a real workflow, and no semantic search layer existed to sit on top.
When cost is high and payoff is thin, no one signs up.
So the thing that would have been useful just... didn't exist.

Type 2a: Making existing relationships visible (an analysis problem).

Relationships that are already there in the code or the data, but that humans can't see or follow. For example:

Boundary connections across a codebase (which API is called from other repos, and where)
Semantic relationships between tables (which set of tables represents the same business entity)
Log correlation across distributed services (the causal chain of logs across services)

These are all "the information already exists, but no human can analyze it" failure modes.

Type 2b: Designing new relationships (a design problem).

Relationships that don't exist anywhere yet — you have to design a conceptual model first, then extract against it. For example:

The causal link between an initiative and a KPI (which campaign moved which number)
The pairing of a decision to its outcome (how the call in a design doc played out after implementation)
The structure connecting customer segments to behavior patterns

These are a deeper failure mode: "the information doesn't exist at all — you have to design it before you can extract it." The classic example is relationships that individuals hold in spreadsheets or in their heads, which don't scale up to an organization.

In my earlier biz-graph post (Making Initiative Impact Analysis Explorable with Graph RAG + MCP), I put the difference this way:

db-graph made existing relationships discoverable. biz-graph designed relationships that didn't exist yet and produced them. The first is an analysis problem, the second is a design problem.

Type 1 vs. Type 2

Even though they're both "zones deterministic automation couldn't reach," they behave differently:

Type 1: "this piece is human, so it's slow or stuck" — a bottleneck in an existing workflow.
Type 2: "this didn't exist to begin with" — opening a domain that was never charted.

Type 2 has the larger ceiling. Removing a bottleneck makes an existing workflow faster. Charting new territory makes decisions and insights possible that weren't possible before.

The Pattern Underneath Both

What Type 1 and Type 2 share: the parts that resisted rule-based encoding.

Type 1's leftover work: "is this dashboard anomaly a real incident or a known false positive?" "when and how do we run this migration?" "what should this doc say to its reader?"
Type 2's leftover work: "which pieces of code represent the same boundary?" "which initiative moved which KPI?" "how do you lay out a conceptual schema like Week × MetricDomain?"

The common thread is qualitative or contextual judgment, or semantic connection. Those can't be written as rules, which is why they've been left standing.

That framing sets up the next chapter — why AI is the pivot. AI:

covers the remaining qualitative-judgment nodes from Type 1, and
shifts both sides of the Type 2 double whammy (it lowers the cost of building deterministic extraction systems, and it becomes the missing consumer that can pick up the resulting artifact semantically).

Not only "the zones deterministic automation couldn't reach," but also "the zones deterministic automation could reach but that weren't worth building": both come into range of automation for the first time. That's what makes AI new.

Why AI Is the Pivot

The previous chapter split "zones deterministic automation couldn't reach" into two types: the leftover qualitative-judgment nodes in Type 1, and the cost-vs.-payoff bind in Type 2. This chapter goes into why AI is the first thing that can address both.

What's Genuinely New About AI

Every automation technology before AI stayed inside the region where you can write deterministic logic. Enumerate rules, put in branches, match patterns. It's a very powerful stack, and as I said earlier, it's threaded through nearly everything in modern life.

But it hit a class of work it couldn't touch — the same class the previous chapter arrived at from a different angle: qualitative judgment, contextual interpretation, semantic connection. A judgment that a human "kind of understands" explodes into an unmanageable condition tree the moment you try to write it as code. What Type 1's leftover human-only nodes and Type 2's un-built structuring systems had in common was that neither survived rule-based encoding.

What AI brings for the first time is making that whole class machine-processable. An LLM handles judgments that don't survive being turned into code by treating them as statistical patterns. This isn't an extension of the deterministic stack. It's a new capability, orthogonal to it.

Three Directions of Change

This new capability shifts systems in three distinct directions. They map onto the Type 1 / Type 2 split from the previous chapter: direction 1 addresses Type 1, and directions 2 and 3 address Type 2.

Direction 1: automate the remaining qualitative-judgment nodes in Type 1.

In Type 1 domains, the human-only node — dashboard interpretation, threat assessment, documentation updates, KPI analysis — is now something AI can carry.

Direction 2: cut the cost of building the structuring system.

This addresses the Type 2 cost side: "deterministic logic could build this, but the effort was too much." AI assists at each step of building the system (spec design, code generation, testing, debugging), which lowers the barrier to standing up these systems. Extraction pipelines that used to take months now regularly land in days. As a concrete data point, the initial version of cortex's biz-graph (the MCP server that handles initiative × KPI causality) went from implementation to Pulumi deployment in one day.

Direction 3: consume structured data semantically.

This one dissolves Type 2's "even if we build it, no one can use it." A graph that a human couldn't traverse by hand, AI walks with a mix of graph traversal and semantic search. A question like "which of last month's marketing initiatives contributed most to new-user acquisition?" gets answered with both mechanisms working together.
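One plausible way to combine the two mechanisms is to let semantic search pick an entry node and let graph traversal collect what is connected to it. The sketch below is a toy under that assumption, not the cortex implementation: the node IDs, the three-dimensional "embeddings", and the function names are all hypothetical, and a real system would use an embedding model and a graph store.

```python
import math
from collections import deque
from typing import Dict, List

# Hypothetical hand-written vectors standing in for real embeddings.
EMBEDDINGS: Dict[str, List[float]] = {
    "initiative:spring-campaign": [0.9, 0.1, 0.0],
    "initiative:checkout-redesign": [0.1, 0.9, 0.0],
    "kpi:new-users": [0.8, 0.0, 0.2],
    "kpi:churn": [0.0, 0.2, 0.9],
}

# Hypothetical directed edges of the graph.
EDGES: Dict[str, List[str]] = {
    "initiative:spring-campaign": ["week:2026-W10"],
    "initiative:checkout-redesign": ["week:2026-W11"],
    "week:2026-W10": ["kpi:new-users"],
    "week:2026-W11": ["kpi:churn"],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_entry_points(query_vec: List[float], top_k: int = 1) -> List[str]:
    """Semantic step: rank embedded nodes by similarity to the query."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda node: cosine(query_vec, EMBEDDINGS[node]),
        reverse=True,
    )
    return ranked[:top_k]


def traverse(start: str, max_hops: int = 2) -> List[str]:
    """Structural step: breadth-first walk, at most max_hops edges deep."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


def answer_context(query_vec: List[float]) -> List[str]:
    """Collect the nodes an agent would receive as context for a question."""
    context: List[str] = []
    for entry in semantic_entry_points(query_vec):
        context.extend(traverse(entry))
    return context
```

With a query vector of `[1.0, 0.0, 0.0]`, the closest node is the spring campaign, and the walk from there reaches its week and the new-user KPI. Neither step alone gets there: similarity search does not know about the week link, and traversal does not know where to start.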

Deterministic-First

Design principle: deterministic-first
If a thing can be written deterministically, write it deterministically. Keep the surface where AI does inference as narrow as necessary — this is what "containing hallucinations" actually means in practice.

Let AI touch parts that deterministic logic could have handled, and you widen the hallucination surface for nothing. The "containing hallucinations" phrase I've used across earlier posts is really this call — where to draw the line between deterministic and AI.

Leaning deterministic also gets you side benefits:

Determinism. Same input, same output, every time. Critical for testing, auditing, and reproducibility.
Cost, often by an order of magnitude or more. Inference calls are token-metered. Deterministic execution is typically 10× or more cheaper.

Where AI goes is a decision, not a default. "AI can do it" isn't the criterion. Deterministic where deterministic works. AI only where deterministic doesn't. That division of labor sits at the core of AI-Native Redesign.

Some concrete pairings:

Static analysis, SQL, pipelines → deterministic (Type 2 creation side).
The structured artifact those produce → AI consumes it semantically (Type 2 consumption side).
Wherever qualitative judgment is what's actually needed → AI takes it (Type 1).
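The deterministic-first rule can be sketched as a small router: try the rule-based path, and call a model only when the rule has no answer. This is a minimal illustration under my own assumptions. The date-extraction task, `extract_date_deterministic`, and `stub_model` are hypothetical stand-ins, and the stub returns a placeholder string instead of calling any real inference API.

```python
import re
from typing import Callable, Optional, Tuple

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_date_deterministic(text: str) -> Optional[str]:
    """Rule-based path: return a YYYY-MM-DD date if one is written out."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


def stub_model(text: str) -> str:
    """Stand-in for a model call (hypothetical; no real inference happens)."""
    return f"<model guess for: {text}>"


def deterministic_first(
    text: str,
    rule: Callable[[str], Optional[str]] = extract_date_deterministic,
    model: Callable[[str], str] = stub_model,
) -> Tuple[str, str]:
    """Return (answer, path). The model runs only when the rule has no answer."""
    answer = rule(text)
    if answer is not None:
        return answer, "deterministic"
    return model(text), "ai"


if __name__ == "__main__":
    print(deterministic_first("Released on 2026-07-20"))
    print(deterministic_first("Released last Tuesday"))
```

Returning the path alongside the answer is a design choice worth keeping in a real system: it lets you measure how wide the inference surface actually is, and it tells downstream review which outputs need a second look.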

Why AI Requires Whole-System Redesign

With earlier automation tools (APM, CI/CD, monitoring, BI), the standard move was to drop them on top of existing operations. They stood alone and didn't disturb the existing workflow, so adding them was enough.

AI is different. "Add an AI tool to the existing system" won't solve what I described above. The reason is that AI's value doesn't come from any single tool feature — it comes from shifting the balance across all three nodes.

Replace just the human node in Type 1 with AI, and if the downstream consumption workflow doesn't line up, no one uses the AI's output.
Stand up new Type 2 structure, and if consumption is still shaped around a human doing the judging, the artifact never actually gets used.
If you don't redesign the boundary between "AI handles this" and "deterministic handles this," ownership gets ambiguous fast.

This is why "add an AI tool" isn't enough. The whole system has to be redesigned with AI as a given. That's the AI-Native Redesign thesis, and I go into it in more detail later on.

Examples from cortex

This chapter walks through the concrete systems running in cortex and shows how the principles I've laid out play out in each of them. There's a dedicated post for each system that I'll link inline. Here I'm keeping the focus on the same three questions: where's the deterministic layer, where's AI, and how was the division decided.

code-graph

code-graph surfaces the code connections across 46 repositories (API boundaries, DB boundaries, event boundaries) into a single knowledge graph. Type 2a — making existing relationships visible.

Creation. Deterministic (tree-sitter static analysis). Function calls, class relationships, and imports get extracted as written.
Maintenance. Static analysis re-runs on code changes. Maintenance stays deterministic.
Consumption. AI over MCP, combining semantic search and graph traversal.
Hallucination containment. Boundary nodes (API endpoint, DB table, event topic) are materialized explicitly during static analysis, so AI's inference range is scoped to "stop at the boundary."
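One way to read "stop at the boundary" is as a traversal rule: boundary nodes are recorded when reached, but never expanded. The sketch below shows that reading on a toy graph. The node names, the `NODE_KINDS` and `CALLS` tables, and the function name are hypothetical, and the real code-graph is built by tree-sitter analysis, not by hand.

```python
from collections import deque
from typing import Dict, List

# Hypothetical node kinds. The last three kinds are the boundary kinds
# named in the text (API endpoint, DB table, event topic).
NODE_KINDS: Dict[str, str] = {
    "orders.createOrder": "function",
    "orders.validateCart": "function",
    "POST /api/payments": "api_endpoint",
    "payments.charge": "function",  # lives on the far side of the API boundary
    "table:orders": "db_table",
}

# Hypothetical call and access edges.
CALLS: Dict[str, List[str]] = {
    "orders.createOrder": [
        "orders.validateCart",
        "POST /api/payments",
        "table:orders",
    ],
    "POST /api/payments": ["payments.charge"],
}

BOUNDARY_KINDS = {"api_endpoint", "db_table", "event_topic"}


def reachable_until_boundary(start: str) -> List[str]:
    """Breadth-first walk that records boundary nodes but never crosses them."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if NODE_KINDS[node] in BOUNDARY_KINDS:
            continue  # the boundary is part of the answer; what lies past it is not
        for neighbor in CALLS.get(node, []):
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited
```

Starting from `orders.createOrder`, the walk returns the function itself, `orders.validateCart`, and the two boundary nodes, but not `payments.charge`. Anything past the boundary has to be asked for explicitly, which keeps the model from guessing about code it was never shown.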

db-graph

db-graph covers relationships between database tables and the business context around them (Type 2a). Both the ORM-level JOIN relationships and the business-entity-level semantic relationships get graphed.

Creation (structure). Deterministic (static analysis of the ORM, schema extraction) + human review as the guarantee.
Creation (business context). AI generation + human review as the guarantee.
Maintenance. ORM and schema changes get detected deterministically; drift in the business context is caught in human review.
Consumption. AI answers natural-language questions like "which tables are involved in the customer purchase cycle" by combining graph traversal and semantic search.
Hallucination containment. Structure stays deterministic. Business context is AI-generated but gated by human review as the final check. The reason db-graph goes through human review while cortex-product-graph runs on AI review comes down to blast radius: wrong DB structure or wrong business context feeds directly into organizational decision-making, so the risk of being wrong is much larger than for a code error. The call isn't just "if AI is accurate enough, let AI handle it" — it's paying the cost of human review whenever the downside of being wrong is large.
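The blast-radius argument can be written as a rough expected-cost comparison. This is a simplification I'm adding for illustration, not a formula cortex uses: `choose_reviewer` and all of its parameters are hypothetical, and in practice none of these numbers are known precisely.

```python
def choose_reviewer(
    ai_error_rate: float,
    cost_if_wrong: float,
    human_review_cost: float,
    human_error_rate: float = 0.0,
) -> str:
    """Pick the review path with the lower expected cost.

    ai_error_rate:      assumed probability that AI review lets an error through
    cost_if_wrong:      assumed downside if an error reaches consumers
    human_review_cost:  assumed labor cost of one human review
    human_error_rate:   assumed probability a human reviewer misses the error
    """
    expected_ai = ai_error_rate * cost_if_wrong
    expected_human = human_review_cost + human_error_rate * cost_if_wrong
    return "human" if expected_human < expected_ai else "ai"


if __name__ == "__main__":
    # Same AI accuracy in both calls; only the downside differs.
    print(choose_reviewer(ai_error_rate=0.02, cost_if_wrong=100, human_review_cost=5))
    print(choose_reviewer(ai_error_rate=0.02, cost_if_wrong=10_000, human_review_cost=5))
```

With these made-up numbers, the first call picks "ai" and the second picks "human". The AI error rate is identical in both; the decision flips purely because the cost of being wrong grew.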

biz-graph

biz-graph covers the causal relationship between initiatives and KPIs (Type 2b — designing new relationships). Unlike db-graph, there's no "JOIN target" sitting between initiatives and KPIs to begin with. The relationship has to be designed by a human first.

Creation. AI (extracting structure from initiative slide decks) + deterministic (KPI data extraction, embedding-based similarity edges) + human schema design (someone defines conceptual anchors like Week and MetricDomain).
Maintenance. New initiative decks and KPI updates keep flowing in to keep the graph current — slide parsing on the AI side, the value pipeline deterministic.
Consumption. An AI agent handling "what's the causal relationship between last month's social campaigns and this week's new-user counts?" traverses Initiative → Week → co-occurring KPIs on the graph.
Hallucination containment. The human-designed schema (conceptual anchors like Week and MetricDomain) is the deterministic frame around AI. AI can only reason inside that frame. KPI extraction, similarity edges, graph construction — all deterministic. AI is confined to the consumption side and to the judgment calls during slide extraction.
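The idea of a human-designed schema acting as a frame can be sketched as a graph that refuses edges outside an allowed set. The node types Initiative, Week, KPI, and MetricDomain come from the text above. Everything else here is my assumption: the `BizGraph` class, the exact edge list (in particular the KPI to MetricDomain edge), and the example node IDs.

```python
from typing import Dict, List, Set, Tuple

# Assumed schema: which (source type, target type) edges are allowed.
ALLOWED_EDGES: Set[Tuple[str, str]] = {
    ("Initiative", "Week"),
    ("Week", "KPI"),
    ("KPI", "MetricDomain"),
}


class SchemaError(ValueError):
    """Raised when an edge falls outside the human-designed schema."""


class BizGraph:
    def __init__(self) -> None:
        self.node_types: Dict[str, str] = {}
        self.edges: Dict[str, List[str]] = {}

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def add_edge(self, src: str, dst: str) -> None:
        """Accept an edge only if its type pair is part of the schema."""
        pair = (self.node_types[src], self.node_types[dst])
        if pair not in ALLOWED_EDGES:
            raise SchemaError(f"edge {pair[0]} -> {pair[1]} is not in the schema")
        self.edges.setdefault(src, []).append(dst)

    def kpis_for_initiative(self, initiative: str) -> List[str]:
        """Walk Initiative -> Week -> KPI and return the co-occurring KPIs."""
        kpis: List[str] = []
        for week in self.edges.get(initiative, []):
            for kpi in self.edges.get(week, []):
                if self.node_types[kpi] == "KPI" and kpi not in kpis:
                    kpis.append(kpi)
        return kpis


def build_example() -> BizGraph:
    """Build a tiny graph with hypothetical node IDs."""
    graph = BizGraph()
    graph.add_node("init:social-campaign", "Initiative")
    graph.add_node("week:2026-W27", "Week")
    graph.add_node("kpi:new-users", "KPI")
    graph.add_node("domain:acquisition", "MetricDomain")
    graph.add_edge("init:social-campaign", "week:2026-W27")
    graph.add_edge("week:2026-W27", "kpi:new-users")
    graph.add_edge("kpi:new-users", "domain:acquisition")
    return graph
```

In this sketch an extractor cannot link an initiative straight to a KPI: the only route goes through a Week node, so what the graph records is co-occurrence in time. Whether that co-occurrence is causal remains a judgment made on the consumption side.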

cortex-product-graph

cortex-product-graph is cortex's main knowledge graph, unifying cortex's own code, DB schema, docs, and Pulumi IaC. AI is used heavily in cortex development itself, and this system is a good example of how the AI-Native Redesign principles land in a working setup.

Creation (structure). Deterministic (ts-morph extracts @graph-* JSDoc from code and Pulumi IaC, then merges with the documentation and with db-graph).
Creation (code + annotation). Developer + AI assist (Claude Code / Codex generate code and the annotations at the same time).
Maintenance. cortex's AI review runs per-PR and checks code logic, doc consistency, and annotation drift together, filing REQUEST_CHANGES when they don't line up.
Consumption. AI over MCP, combining semantic and structural search.
Hallucination containment. cortex's AI review looks at code + annotation + docs together on every PR. Even PRs that get merged without a human reviewer go through AI as a review layer. Because the code and its @graph-* annotations sit next to each other in JSDoc (code and intent as an SSoT inside the same file), AI spots the gap between the code change and its intent immediately, which is why AI review stays accurate. This is also a bet on AI-Readability — writing code in a form that's not only readable by humans but also structurally parseable by AI agents.
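To give a feel for what keeping code and intent in the same JSDoc block buys you, here is a rough sketch that pulls `@graph-*` tags out of a TypeScript snippet. It is heavily hedged: cortex uses ts-morph for this, and I'm substituting plain regular expressions so the example stays self-contained. The specific tag names (`@graph-reads`, `@graph-writes`, `@graph-intent`) and the sample function are invented for illustration, since the text only specifies the `@graph-*` prefix.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical TypeScript source with invented @graph-* tag names.
SOURCE = '''
/**
 * Creates a rental order for a customer.
 * @graph-reads table:customers
 * @graph-writes table:orders
 * @graph-intent Keep order creation atomic with stock reservation
 */
export function createOrder(customerId: string) {}
'''

# A JSDoc block immediately followed by a function declaration.
# Regex stand-in for a real parser; it will misfire on unusual layouts.
JSDOC_BLOCK = re.compile(
    r"/\*\*(.*?)\*/\s*(?:export\s+)?function\s+(\w+)", re.DOTALL
)
# One "@graph-<name> <value>" tag per line inside the block.
GRAPH_TAG = re.compile(r"@graph-([\w-]+)\s+(.+)")


def extract_graph_annotations(source: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each documented function to its (tag name, value) pairs."""
    annotations: Dict[str, List[Tuple[str, str]]] = {}
    for body, function_name in JSDOC_BLOCK.findall(source):
        tags = [(name, value.strip()) for name, value in GRAPH_TAG.findall(body)]
        if tags:
            annotations[function_name] = tags
    return annotations


if __name__ == "__main__":
    for name, tags in extract_graph_annotations(SOURCE).items():
        print(name, tags)
```

Because the tags sit directly above the code they describe, a per-PR check can compare a changed function body against its own annotations in one pass, instead of hunting for a separate document that may or may not exist.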

Observability + Self-Healing

AI-Observability handles the four monitoring axes (Application / Infrastructure / CI / LLM), and the loop from there to Self-Healing is where AI-Native Redesign is at its clearest. Type 1 — removing the consumption-side bottleneck.

Creation + maintenance. Deterministic (OpenTelemetry, metrics, logs, traces, deterministic alerts).
Consumption (detection). Deterministic alert thresholds fire incidents.
Consumption (judgment and response). AI cross-references log and trace context (pulled via Grafana MCP in practice) with the relevant source code / tables / docs (found by traversing cortex-product-graph), produces a root-cause hypothesis, and then a fix PR.
Maintenance loop. The generated fix PR is quality-checked by the AI review flow that runs on top of cortex-product-graph before it merges.
Hallucination containment. AI never touches production directly. AI's output is always a PR, something reviewable. The cortex-product-graph + AI review + auto-merge chain acts as the final gate.

This is the fully-formed loop of AI-Native Redesign: monitoring → detection → AI inference → PR generation → AI review → merge. The whole loop turns, and cortex ends up in the "fixed before we notice" state (the title of the Self-Healing post).
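The shape of that loop can be sketched in a few lines. This is a schematic under stated assumptions, not the Self-Healing implementation: `PullRequest`, `run_loop`, and the stub functions are hypothetical, detection is reduced to one threshold, and the "AI" steps are plain callables you would replace with real inference and review.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PullRequest:
    """The only artifact the AI step is allowed to produce."""
    title: str
    diff: str
    approved: bool = False
    merged: bool = False


def detect(metric_value: float, threshold: float) -> bool:
    """Deterministic detection: a plain threshold check."""
    return metric_value > threshold


def run_loop(
    metric_value: float,
    threshold: float,
    propose_fix: Callable[[float], PullRequest],
    review: Callable[[PullRequest], bool],
) -> Optional[PullRequest]:
    """monitoring -> detection -> AI inference -> PR -> AI review -> merge."""
    if not detect(metric_value, threshold):
        return None  # nothing fired, so no inference is spent
    pr = propose_fix(metric_value)  # inference yields a PR, never a prod change
    pr.approved = review(pr)        # review layer gates the merge
    pr.merged = pr.approved         # auto-merge only after review passes
    return pr


def stub_propose_fix(metric_value: float) -> PullRequest:
    """Stand-in for root-cause analysis plus fix generation (hypothetical)."""
    return PullRequest(
        title=f"Mitigate error rate of {metric_value:.2f}",
        diff="pool_size: 10 -> 20",
    )


def stub_review(pr: PullRequest) -> bool:
    """Stand-in for AI review (hypothetical): approve any non-empty diff."""
    return bool(pr.diff.strip())
```

The structural point is in the types: `run_loop` has no code path that changes production. The only thing it can return is a `PullRequest`, and merging depends on the review result.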

The Pattern Underneath All Five

Across these five systems, the same design calls kept recurring, each time by choice.

Creation side stays deterministic by default. Anything that reads "the data as written" — static analysis, ORM, OTel, SQL, extraction pipelines — leans deterministic.
AI is confined to consumption and to the meaning layer on top of structured artifacts. Graph traversal, semantic search, annotation generation — the judgment work that doesn't survive being written as rules — is where AI sits.
There's always a hallucination-containment mechanism. Boundary nodes, AI review, human review — the specific mechanism differs, but every system has some form of lid on AI output. AI's free-writing zone is kept narrow, and even inside that zone, its output goes through a review layer.
AI review quality depends on the context foundation. AI can do comprehensive PR review in cortex because cortex-product-graph exists as the structured context foundation. Without it, AI would only see local information from the PR diff, and couldn't judge consistency with the rest of the codebase or the docs. Before "where and how do we use AI," the question that comes first is: "what context can we give AI to reason over?"
The review choice has a risk-profile component. Even when AI review would be accurate enough, if the downside of being wrong is large, human review can still be the right call. As I mentioned with db-graph, it's a comparison between the cost of being wrong and the cost of human labor.
Human design judgment sits above the whole thing. Schema design (biz-graph), guideline definitions (auto-review), monitoring target selection (Observability) — these are human calls.

The pattern to notice: the same AI-Native Redesign principles land in different shapes depending on the subject and the goal.

AI-Native Redesign vs. "Adding an AI Tool"

So far the argument has been: AI is the first thing that reaches into the zones deterministic automation couldn't, and making that actually work needs whole-system design — a context foundation like cortex-product-graph, or a closed loop like Observability + Self-Healing.

This chapter goes into why the "just add AI to the existing system" approach — the approach that skips the whole-system redesign — falls short.

What "Adding an AI Tool" Usually Looks Like

Over the past year I've put a range of AI tools into the organization myself. Here are the shapes I've tried at least once:

AI summary on dashboards. AI reads the dashboard and gives you the takeaway.
AI-generated docs. Docs get produced from code.
AI PR review. AI reads the PR and comments.
AI search on the internal knowledge base. Natural-language queries against internal knowledge.
AI-assisted coding. Claude Code, Cursor, etc.

Each one was useful in isolation. But honestly, the gains topped out at around 1.x×: worth the deployment cost, but nowhere near a paradigm shift.

That was the point where I had to step back and ask what it would take for the organization to use AI better. What came out of that were the three failure modes below, and the AI-Native Redesign direction that follows from them.

Failure Mode 1: The Three-Node Balance Stays Optimized for Humans

Existing systems were shaped, within the capabilities of their era, around "humans handle every node." That assumption is baked into things you don't usually see as design choices:

Dashboards: density and count set for what a human can consume.
Documentation: structure and granularity set for what a human can read.
Code: conventions and granularity set for what humans write and review.

Add AI on top and AI has to operate inside "the balance optimized for humans." AI's actual strengths — watching a room full of dashboards at once, traversing docs structurally, verifying code exhaustively — get suppressed because the surrounding two nodes stay shaped for humans.

Failure Mode 2: AI Is Asked to Judge Without a Context Foundation

As the earlier examples showed, AI review works comprehensively only when there's a context foundation like cortex-product-graph. Ask AI to judge without one, and it only sees local information, and its actual value doesn't come out.

PR review AI: seeing only the PR diff, all it can do is comment on coding style.
Dashboard AI summary: summarizes the numbers on that one dashboard; the relationship to the rest of the system is invisible.
Doc AI search: keyword match or semantic search, each a local result.

The feeling "AI is shallow" is usually not about the AI. It's about the missing context foundation the AI was supposed to reason over.

This is adjacent to what's now being called context engineering. I've written about the retrieval-side design in an earlier agentic Graph RAG post, but in the three-node frame, retrieval quality is only the consumption node. The failure here is upstream — the creation side never produced a form that could be pulled as context.

Failure Mode 3: The Creation Side Doesn't Shift into a Form AI Can Consume

Of the three directions I laid out earlier, Direction 1 (automate the leftover qualitative-judgment nodes from Type 1) can be reached by adding AI to an existing system. But Direction 2 (cut the cost of building structuring systems) and Direction 3 (consume structured data semantically) both require the creation side to change into a form AI can consume.

JSDoc @graph-* annotations on code express structure and business intent as an SSoT (Single Source of Truth) → AI can understand structure and intent together.
Logs emitted as structured events, correlated with traces and metrics from other services → AI can follow causal chains across a distributed system.
Docs restructured from standalone files into data tied to code, design, and domain → AI can pull them as context.

These are creation-side design changes. Adding a feature to the existing system doesn't produce them. "Add AI on the consumption side" alone caps AI's ceiling at whatever the input-side constraints are — a rough format, implicit context, purely local data.
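As a rough illustration of the second item (logs emitted as structured events that can be correlated across services), here is a minimal sketch. The event shape, field names, and sample data are assumptions made for this example, not cortex's actual schema.

```typescript
// Hypothetical structured event shape. The field names are illustrative only.
interface StructuredEvent {
  timestamp: string; // ISO 8601, UTC
  service: string;
  traceId: string; // shared by every service that handles the same request
  event: string; // machine-readable event name
  level: "info" | "error";
}

// Reconstructs one request's path across services, in time order.
function causalChain(events: StructuredEvent[], traceId: string): StructuredEvent[] {
  return events
    .filter((e) => e.traceId === traceId)
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}

// Made-up sample data.
const events: StructuredEvent[] = [
  { timestamp: "2026-07-01T10:00:02Z", service: "billing", traceId: "trace-1", event: "charge_failed", level: "error" },
  { timestamp: "2026-07-01T10:00:01Z", service: "subscription", traceId: "trace-1", event: "plan_change_requested", level: "info" },
  { timestamp: "2026-07-01T10:00:01Z", service: "search", traceId: "trace-2", event: "query_executed", level: "info" },
];

const chain = causalChain(events, "trace-1");
console.log(chain.map((e) => `${e.service}:${e.event}`));
```

With free-text log lines, the same question ("what led to this error?") would need pattern matching per service. With a shared `traceId` it is a filter and a sort.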

What AI-Native Redesign Actually Is

Flip the three failure modes and you get what AI-Native Redesign is:

Rebalance all three nodes around "AI + human" as the assumption. Which node gets AI and which stays human is redesigned from scratch. Not "add AI to a human-balanced system" — draw a new balance where AI-carried nodes are first-class parts of it.
Build the context foundation first. The structured context AI can reason over (something like cortex-product-graph) gets built first, and AI review and self-repair go on top of it. The opposite order — "put AI in, then notice context is missing" — is what fails.
Change the creation side. Reshaping the creation side into a form AI can consume — annotations, structured events, fine-grained docs — is part of the redesign. Consumption-side additions alone aren't enough.

The five cortex implementations from earlier all did these three. code-graph shaped structure through static analysis into a form AI could reach (creation-side change). cortex-product-graph became the context foundation for judgment itself. Observability + Self-Healing redesigned all three nodes with AI in the mix, from monitoring through to auto-repair.

This is the underlying reason the evolution-speed gap I sketch in the next chapter widens over time. The gap between "add an AI tool" and "AI-Native Redesign" is bigger than a linear-vs.-exponential ROI gap, from what I've seen.

Life After AI-Native Redesign

What Building cortex Has Actually Felt Like

Some things I couldn't see back when we were just deploying individual AI tools have come into view while building and running cortex.

Fix PRs now ride straight into AI review and auto-merge with no human in the path — 115 of them via Self-Healing alone in the last 30 days (details in the Self-Healing post).
Drift between docs, annotations, and code gets repaired automatically in places that used to sit uncorrected.
The cost of chasing "how was this actually built again?" through past decisions has dropped.
Individual writing speed hasn't changed much. What did change: the quality bar for what actually ships, and the fact that people other than me can now contribute. Monthly merged PRs went from the 10–23 range through March to 518 in April 2026. That jump came from the workflow switch (direct pushes to main → PR + AI review + auto-merge), not from writing more. What the number really shows is that the ceiling of "reviewed manually by me" came off, so this scale and this quality bar can now be sustained by more people than just me (data in the harness intro post).

This isn't "add an AI tool, get 1.x." It's what happens when multiple self-sustaining loops start turning across layers. Qualitatively a different kind of change from a one-off productivity bump.

What Multi-Layered Self-Sustaining Loops Actually Look Like

Each of the cortex systems has its own self-sustaining loop.

code-graph. Every code change updates the graph, AI reviews using the updated graph.
cortex-product-graph. Every PR keeps annotations and code aligned, and AI review accuracy tightens with each pass.
Observability + Self-Healing. Monitoring detects an incident, AI produces a fix PR, AI review checks it, and it merges.
biz-graph. Initiative-to-KPI relationships get extracted continuously and stay in a form usable for decision-making.

These loops run independently, but they're connected through cortex-product-graph as a shared context foundation. The output of one loop becomes the input to another — that shape of connection.

Once this kind of multi-layer loop starts running inside the organization, it changes how time gets spent at a fundamental level. Not "AI helps" — closer to "a substantial share of daily work completes inside the loops."

Where This Structure Could Rot

If the three-node symmetry is the axis, then AI-Native systems' own negative spiral is a question that has to be asked too. The circular dependency I presented as a virtuous cycle (AI review maintains cortex-product-graph, cortex-product-graph supports AI review) turns into a self-amplifying error loop the moment errors get into the foundation. AI reasoning confidently and consistently wrong on top of a contaminated context foundation is a real, symmetric failure mode of this design, not a hypothetical.

The defenses split three ways. The reason db-graph puts human review at the final gate is entry-side containment — narrowing the flow of contamination into the foundation wherever the downside of being wrong is large. Materializing boundary nodes explicitly through static analysis is inference-range containment — narrowing where AI is allowed to reason. And keeping cortex-product-graph in a form that can always be regenerated from deterministic extraction + annotations + docs is recoverability — a way back once contamination is detected. All three are held together by "the foundation is never something AI alone writes into." The signal that this failure mode is starting to show is drift in AI review's own accuracy metrics (REQUEST_CHANGES rate, false positive / negative rate) that no one can explain — which is the symmetric-side extension of what I meant in the AI-Observability post when I argued LLMs should be the fourth monitoring axis.
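To make the "unexplained drift" signal concrete, here is a minimal sketch of one way to watch a single review metric, the REQUEST_CHANGES rate. The data shape, the comparison of two windows, and the 0.1 tolerance are all assumptions for illustration. The post does not describe how cortex actually computes this.

```typescript
// Hypothetical record of one AI review outcome.
interface ReviewOutcome {
  requestedChanges: boolean;
}

// Share of reviews in the window that ended in REQUEST_CHANGES.
function requestChangesRate(window: ReviewOutcome[]): number {
  if (window.length === 0) return 0;
  return window.filter((r) => r.requestedChanges).length / window.length;
}

// Flags drift when the recent rate differs from the baseline rate by more
// than `tolerance` (absolute difference). The default is a placeholder value.
function hasDrifted(baseline: ReviewOutcome[], recent: ReviewOutcome[], tolerance = 0.1): boolean {
  return Math.abs(requestChangesRate(recent) - requestChangesRate(baseline)) > tolerance;
}

// Made-up windows: 3 of 10 versus 6 of 10 reviews requesting changes.
const baselineWindow = Array.from({ length: 10 }, (_, i) => ({ requestedChanges: i < 3 }));
const recentWindow = Array.from({ length: 10 }, (_, i) => ({ requestedChanges: i < 6 }));

console.log(hasDrifted(baselineWindow, recentWindow)); // true
```

A check like this only says that the rate moved. Whether the move is explained (a large refactor, a new guideline) or unexplained is still a human judgment.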

How I Read the Evolution-Speed Gap

The rest is my read. I don't know how much of this generalizes to other organizations.

One objection I want to head off: "this sounds like something you need a dedicated platform team for." airCloset doesn't have one. I built cortex solo, alongside my CTO duties, and now people from the business side, not just engineering, are shipping on top of that harness. The fact that this was reachable at all is itself Direction 2 (AI cutting the cost of building structuring systems) in action.

Between organizations running AI as individual tools and organizations that have assembled multiple self-sustaining loops across layers, my sense is the evolution-speed gap widens over time.

In the first, AI is a useful tool, but decision-and-implementation speed itself doesn't change much.
In the second, the full cycle — decision → implementation → detection → repair — gets an order of magnitude faster.

Compound that gap over time and how much you can get done in the same period starts to diverge — gradually or sharply. "Exponentially" would be overstating it, but at least the way I see the gap widening, "linear" doesn't describe it either.

What This Post Was Trying to Say

Take the timeless question — "how do we make accurate information accessible?" — and redesign for it with AI as a given capability. That's the idea at the center of AI-Native Redesign.

AI isn't a replacement for deterministic automation. It's a new capability that brings zones deterministic automation couldn't reach into the automation envelope for the first time.
Adding AI in isolation doesn't work. The positive spiral only kicks in when all three nodes are rebuilt together.
Doing that requires a context foundation AI can reason over (something like cortex-product-graph), built ahead of the AI layer.
The whole thing is a stack of self-sustaining loops, and the more of them turn together, the more the organization's evolution speed changes.

One last thing. I scoped this post to information access, but the creation / maintenance / consumption loop shows up far more widely than that. Cultural transmission, the survival of a business, the growth of an academic field, a living language itself — all of them turn on the same structure, where creation dries up without consumption and quality decays without maintenance. Dead languages, lost traditions, failed companies, the internal wiki no one reads — push far enough and they hollow out through the same mechanism. Which raises a question I can't answer here: how far past information systems does "AI structurally lowers the cost of creation and maintenance" actually reach? I'll leave that one open.

The cortex build is one instance of trying this out. Different organizations and different subjects will land somewhere else, but the underlying question, "how do we make accurate information accessible?", should be the same. If this post is useful as material for laying an AI-Native answer over other contexts, that's what I was hoping for.

Observability Design for the AI Era — Reconciling PII Protection With AI Searchability, and Driving Self-Healing

Ryosuke Tsuji — Mon, 13 Jul 2026 23:50:57 +0000

Hi, I'm Ryan, CTO at airCloset.

In Part 1, I walked through the four monitoring axes (application / infrastructure / CI / LLM) and the deliberately different shape each one ends up in. That's the write-side of the observability stack, more or less wrapped up.

But shaping the write side isn't the end of the story. The moment production data flows through the stack, you have to block the path PII can take to slip in — and that's true with or without AI. It's the kind of classic observability problem where, if you cut corners, you walk straight into a leak incident.

Historically, the set of people who could read logs mostly overlapped with the set who could read the DB. For engineers with DB access, logs weren't an additional path to personal data, so for most organizations, hardening log-side defenses didn't meaningfully move the overall defense line.

AI breaks that premise. Non-engineers pulling logs over MCP don't have DB access. Logs became, for the first time, a path where someone without DB access can reach personal data. On top of that, log content now flows into AI's input, which introduces new exposure surfaces: transmission to the model, and re-surfacing in the model's output. Log PII protection has shifted from "hygiene worth doing" to "required as a trust-boundary redesign." That's the premise this post starts from.

And on top of that, if the observability stack isn't queryable by AI, the whole "AI-consumable observability" goal from Part 1 falls apart.

Part 2 is about how I reconciled these two — protecting PII while keeping searchability for AI — and how that combination ends up driving Self-Healing from CI failure to PR proposal.

The Observability Stack Is a Natural Path for PII

App emits a log → it lands in Loki → AI queries it through MCP. Stand up this naive flow and you get:

Customer email addresses and phone numbers in error logs
Order response payloads riding inside traces
DB query logs that emit full table rows

Plain-text PII pooling in the observability stack means AI can search it directly. This isn't really an AI problem, it's an observability problem: the stack itself becomes a PII conduit. At the same time, if you scrub PII completely, you lose "I want to investigate Customer A's support ticket" as a query, which is a normal support workflow.

cortex (the internal AI platform) had to reconcile both. The key principle was: don't make "block the PII path" and "search by PII" mutually exclusive.

Note: "cortex" here refers to airCloset's internal AI platform codename. Unrelated to Snowflake Cortex, Palo Alto Networks Cortex, etc.

Multi-Layer PII Design — Six Layers

cortex's PII handling is six layers, each with a different role:

Layer	Purpose	Mechanism
Write: BQ Policy Tag	Column-level access control	`pii_high` / `pii_medium` / `pii_low` three-tier taxonomy. Without fine-grained reader on the column, SELECT errors out with `Access Denied` (pure CLS (Column-Level Security) — no dynamic masking)
Write: ETL DLP	Strip plain-text PII from derived tables	Cloud DLP redacts during transforms (customer support data, etc.). Placeholders like `[EMAIL_ADDRESS]` / `[PHONE_NUMBER]` preserve the structure
Write: log hashing	Plain text never reaches Loki	App-side hash via `hashEmail` (HMAC-SHA256 → 12-char prefix; key lives outside the observability stack) before log emit
Search: same function on both sides	Look up a specific customer's logs without ever touching plain text	Query-side runs the same `hashEmail` before sending to Loki
Output: MCP masking	Mask when AI consumes	Column-name detection masks the local part (e.g. `r***@air-closet.com`), keeping `@domain` so first-response triage can still tell which domain the account belonged to
Identity separation	Internal staff email is handled in a separate track from customer PII	HMAC-signed by Edge Router as auth attribution; not part of the masking pipeline

The fourth row — search with the same function on both sides — is where the security / usability tradeoff gets really tight.

I'll use email as the running example, but the six layers guard more than email. PII spans names (including phonetic readings), phone numbers, addresses, postal codes, dates of birth, card and bank details, external-service IDs, and more. The anonymization technique varies by the nature of the field — same-function hashing to preserve correlation (email, phone), partial masking (names, addresses), full redaction (card numbers, tokens) — and that call is made per field. What stays constant is the structure: which of the six layers guards it, and how. That's the reusable part of the design.

And this anonymization isn't confined to observability logs (Loki) either. An MCP tool that queries a service DB, for instance, pulls customer names, addresses, and phone numbers into its result set, so the same PII anonymization rules run before anything is handed back to the AI. The consistent rule is "anonymize PII on every data path that reaches the AI," applied across data-source types, not just one.
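A minimal sketch of what "per-field technique, applied on the path back to the AI" could look like for a DB result row. The column-name patterns, the `[REDACTED]` placeholder, and the function names are assumptions for this example. Only the email output format (`r***@air-closet.com`) is taken from the table above.

```typescript
type Row = Record<string, unknown>;
type Anonymizer = (value: string) => string;

// Keeps the first character of the local part and the full domain.
const maskEmail: Anonymizer = (v) => {
  const at = v.lastIndexOf("@");
  return at <= 0 ? "***" : `${v[0]}***${v.slice(at)}`;
};
// Partial masking: keeps the first character only.
const maskPartial: Anonymizer = (v) => (v.length === 0 ? v : `${v[0]}***`);
// Full redaction.
const redact: Anonymizer = () => "[REDACTED]";

// Hypothetical column-name rules. A real rule set would be decided per field.
const RULES: Array<{ pattern: RegExp; apply: Anonymizer }> = [
  { pattern: /(^|_)email$/i, apply: maskEmail },
  { pattern: /^(full_name|address)$/i, apply: maskPartial },
  { pattern: /^(card_number|api_token)$/i, apply: redact },
];

// Applies the first matching rule to each string column; other values pass through.
function anonymizeRow(row: Row): Row {
  const out: Row = {};
  for (const [column, value] of Object.entries(row)) {
    const rule = RULES.find((r) => r.pattern.test(column));
    out[column] = rule && typeof value === "string" ? rule.apply(value) : value;
  }
  return out;
}

const masked = anonymizeRow({
  id: 42,
  email: "ryan@air-closet.com",
  full_name: "Ryan T",
  card_number: "4111111111111111",
});
console.log(masked);
```

Name-based detection is only as good as the naming discipline behind it: a PII column with an unexpected name passes through untouched, which is one reason to keep the other layers in place.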

Hash on Both the Write and Search Sides

Naively "remove PII from logs" and you can no longer answer "let me look up Customer A's logs." But if you hash at write time and store that hash in the log, the search side can run the same hash function over the input and find the matching record. Plain-text email never touches either end.

Concretely:

Write side:

// Application code
logger.info("Subscription updated", {
  user: hashEmail(user.email), // → '7a3f9c2e0b1d' (HMAC-SHA256 12-char prefix)
  plan: "monthly",
});
// → Only the hashEmail result ends up in Loki
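The post doesn't show `hashEmail` itself. A minimal sketch of a function matching the description (HMAC-SHA256, 12-character prefix), using Node's built-in `crypto` module, might look like this. The environment variable name, the fallback key, and the trim-and-lowercase normalization are assumptions added for the example.

```typescript
import { createHmac } from "node:crypto";

// Assumption: the key is injected from a secret store outside the observability
// stack. The literal fallback only keeps this sketch runnable.
const HASH_KEY = process.env.LOG_HASH_KEY ?? "dev-only-placeholder-key";

// Assumption: normalize before hashing so "A@Example.com" and "a@example.com"
// correlate to the same value.
function hashEmail(email: string, key: string = HASH_KEY): string {
  const normalized = email.trim().toLowerCase();
  return createHmac("sha256", key)
    .update(normalized, "utf8")
    .digest("hex")
    .slice(0, 12); // 12 hex characters = 48 bits
}

console.log(hashEmail("user@example.com").length); // 12
```

The optional second parameter lets the same function serve both call shapes that appear in this post: `hashEmail(user.email)` on the write side and `hashEmail(email, secret)` on the search side.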

Search side (when you want to pull a specific customer's logs):

Here's the awkward part. "Pull up Customer A's logs" — the naive way to build it hands the raw email to the AI, which then passes it to an MCP tool to search. But that means handing plain-text PII to the AI (the model, and the vendor behind it). Guard the inside of Loki with hashes all you want; it leaks at the search input, one step earlier.

So in cortex the search tool takes a non-PII ID, resolves it to an email inside the MCP server, hashes it there, and returns only the hash. The email exists only inside the MCP server and never reaches the model:

// MCP tool resolve_email_hash (runs server-side)
// Input is an ID (non-PII). The email is never returned to the caller = the AI.
const email = await resolveEmailById(userId); // resolved from the DB, server-side
const hash = hashEmail(email, secret);        // same function, same key as the write side
// → the AI gets back only the hash, never the email

The AI takes that hash and searches Loki via Grafana MCP as {service_name="subscription"} |~ "${hash}". Both the write side and the search side run the same hashEmail with the same key, so logs from the same customer collapse to the same hash. Meanwhile:

Plain-text email never enters Loki
The query string Loki sees doesn't contain plain-text email either (only the hashed value reaches it)
And the AI (the model) never receives plain-text email either. All it touches is a non-PII ID and hashes that already live in Loki. The plain-text email never leaves the trust boundary of the MCP server.
Enumeration resistance comes from keeping the HMAC key outside the stack. Email is a low-entropy, enumerable input space, so a bare one-way hash (plain SHA-256, etc.) is breakable: the hash function is public, so once logs leak, an attacker just hashes a list of likely emails on their own machine and matches against the leaked values, no key required. HMAC folds a secret key into the hash computation itself, so an attacker who doesn't have the key can't even turn a candidate email into "the same shape as the leaked hash." They never get onto the brute-force field. Keep the key only at the write side and the search tool, never in Loki itself, and a log leak alone doesn't expose the plaintext unless the key leaks too. That is one more condition an attacker has to satisfy.
Truncating to a 12-char hex prefix (48 bits) means collisions are possible in theory, but negligible at customer-base scale. By the birthday problem, the 50% collision point sits around 20M records (≈ 1.18 × 2^24, i.e. about 2^24.2), and below that the expected collision count stays tiny. More to the point, a collision wouldn't leak plaintext anyway: this hash is a correlation key for identifying a customer's logs, not a security boundary, so the worst case is "another customer's logs occasionally land on the same hash", a degradation of correlation accuracy, not a disclosure.
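The collision estimate can be checked with the standard birthday-bound approximation. This is a back-of-the-envelope sketch that assumes hash prefixes behave like uniformly random 48-bit values. The 1M-customer figure is a made-up input, not airCloset's customer count.

```typescript
// Birthday-bound approximation for n random values in a space of size N = 2^bits:
//   expected colliding pairs  ≈ n(n-1) / (2N)
//   P(at least one collision) ≈ 1 - exp(-n(n-1) / (2N))
function collisionStats(n: number, bits: number): { probability: number; expectedPairs: number } {
  const space = 2 ** bits;
  const expectedPairs = (n * (n - 1)) / (2 * space);
  return { probability: 1 - Math.exp(-expectedPairs), expectedPairs };
}

// Record count at which the collision probability reaches 50%: n ≈ sqrt(2 ln 2 · N).
function fiftyPercentPoint(bits: number): number {
  return Math.sqrt(2 * Math.LN2 * 2 ** bits);
}

console.log(Math.round(fiftyPercentPoint(48))); // about 19.75 million
console.log(collisionStats(1_000_000, 48)); // probability of roughly 0.18%
```

So at one million distinct emails, the chance that any two share a prefix is under 0.2%, and the 50% point lands just below 20M, in line with the figure above.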

This reuses the property "same input → same hash" of hash functions in the form "the same function on both sides makes search work." The security / debug usability tradeoff compresses cleanly.
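The difference between a bare hash and a keyed hash under a log leak can be shown in a few lines. This is a toy sketch using Node's `crypto` module. The emails, the key strings, and the three-entry candidate list are made up, and a real attacker's list would be far larger.

```typescript
import { createHash, createHmac } from "node:crypto";

const prefix = (hex: string): string => hex.slice(0, 12);
const bareHash = (email: string): string =>
  prefix(createHash("sha256").update(email).digest("hex"));
const keyedHash = (email: string, key: string): string =>
  prefix(createHmac("sha256", key).update(email).digest("hex"));

// Hypothetical leaked log values for the same customer under each scheme.
const victim = "alice@example.com";
const serverKey = "key-held-outside-the-log-store"; // the attacker does not have this
const leakedBare = bareHash(victim);
const leakedKeyed = keyedHash(victim, serverKey);

// The attacker's candidate list (in practice: breach dumps, common name patterns).
const candidates = ["bob@example.com", "alice@example.com", "carol@example.com"];

// Bare SHA-256: hashing candidates locally is enough to recover the email.
const recoveredBare = candidates.find((c) => bareHash(c) === leakedBare);

// HMAC: without the key the attacker can only guess one, so candidates don't match.
const recoveredKeyed = candidates.find((c) => keyedHash(c, "attacker-guess") === leakedKeyed);

console.log(recoveredBare); // "alice@example.com"
console.log(recoveredKeyed); // undefined
```

The same candidate list that breaks the bare hash gets nowhere against the keyed one, which is the "one more condition" the attacker has to satisfy.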

And of course, this is all just the app log layer. The BQ side is protected by Policy Tag-based column-level access control as its own layer (rows 1–2 of the table above). The whole thing is multi-layered.

What makes the "take an ID, resolve and hash inside" shape work is that plain-text email never crosses the trust boundary of the MCP server. The easy implementation (hand the AI a raw email, let the tool search) leaks the plaintext to the model at the search input, no matter how well you guard the inside of Loki. You could argue "the vendor's terms say it won't leave," but that's a dependency on terms, and it's weak under audit. Take an ID and hash inside, and you keep plaintext away from the model structurally, not contractually. When I said up top that PII protection has become "a trust-boundary redesign," this is the kind of design call I meant.

An aside: when I was working this out, I asked an AI for help, and it suggested building an admin screen where a human manually turns emails into hashes. That's one way to keep PII away from the model, sure, but it doesn't fit autonomous operation — a human has to step in before any investigation can start. cortex is built to run all the way through to "fixed before anyone notices" self-healing, so a solution that inserts a human isn't on the table. "Take an ID, hash inside the MCP server" came out of that constraint. What counts as an acceptable solution was, in the end, a design judgment on my side.

Integration Surface — "Humans = Web, AI = MCP" on the Same Backend

Three backends (Prometheus / BigQuery / Loki) now carry the observable data, and PII is handled. The next question is who queries them, and how. The common trap is to build "human dashboard aggregations" and "AI data feeds" separately. The moment you do:

Two implementations chasing the same question
Numbers drift between them
It becomes unclear which is canonical
Aggregations for AI and for humans update on different schedules

cortex's choice: share one observability backend; only the consumer-facing interface differs.

Human side: AI Operations Portal

There's an internal portal (codenamed PI Lab) that aggregates dashboards by monitoring target:

Claude Code usage (the cc-usage screen from Part 1)
MCP tool usage (by server / tool / user / team)
Infrastructure cost (Gemini / GCP / AWS / GitHub on one screen)
Alert state, deploy history, etc.

Here's what the MCP usage dashboard actually looks like:

Over the past 30 days, service-product-graph had 37,946 calls (with 7,106 errors), gws had 19,350, db-graph had 17,297 — and that's just the top. Which MCP is used how much, where the failures are showing up — all visible at a daily glance. (The "high error rate" some servers seem to have is partly typed errors counted in — expected rejections like "permission denied" — so the interpretation needs care.) The "annotation graph MCP, ~50,000 calls / 73 users" figure from the previous series came from this same view.

These pages on the React side pull from BQ / Prometheus / Loki through an internal API. The aggregation logic lives at the API layer.

AI side: MCP

When AI agents need the same data, they go through purpose-specific MCPs:

Grafana MCP — LogQL / PromQL queries against Loki / Mimir / Prometheus / Tempo. Natural-language questions like "What time window had the most errors on Service X last week?" are the agent's job to translate into LogQL / PromQL before they go over MCP
BQ MCP (via cortex-product-graph) — SQL queries against claude_usage.claude_usage / cortex.mcp_tool_calls

The design pivot: the human dashboard and the AI MCP share the same backend. No separate "AI aggregation table" and "human aggregation table." Build the observability backend once, then provide a consumer-specific interface layer (web dashboard / MCP) on top.

In DDD terms, MCP and the web dashboard are both just presentation layers — different I/O channels into the same domain (the observability backend). Treating MCP as "something special" leads to duplicate implementations; treating it as one presentation layer form keeps the design clean.
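A minimal sketch of that layering: one aggregation function, two thin presenters. The in-memory array stands in for the real backend, the call counts are made up, and the function names are hypothetical. The MCP-side return value is loosely modeled on the text content shape of an MCP tool result.

```typescript
// Domain layer: one aggregation, written once.
interface ToolCall {
  server: string;
  ok: boolean;
}
interface ServerUsage {
  server: string;
  calls: number;
  errors: number;
}

function aggregateUsage(calls: ToolCall[]): ServerUsage[] {
  const byServer = new Map<string, ServerUsage>();
  for (const c of calls) {
    const u = byServer.get(c.server) ?? { server: c.server, calls: 0, errors: 0 };
    u.calls += 1;
    if (!c.ok) u.errors += 1;
    byServer.set(c.server, u);
  }
  return Array.from(byServer.values()).sort((a, b) => b.calls - a.calls);
}

// Presentation layer 1: JSON for the web dashboard.
function presentForWeb(usage: ServerUsage[]): string {
  return JSON.stringify({ rows: usage });
}

// Presentation layer 2: a text result shaped for an MCP tool response.
function presentForMcp(usage: ServerUsage[]): { content: { type: "text"; text: string }[] } {
  const text = usage.map((u) => `${u.server}: ${u.calls} calls, ${u.errors} errors`).join("\n");
  return { content: [{ type: "text", text }] };
}

// Made-up sample calls.
const calls: ToolCall[] = [
  { server: "db-graph", ok: true },
  { server: "db-graph", ok: false },
  { server: "gws", ok: true },
];
const usage = aggregateUsage(calls);
console.log(presentForWeb(usage));
console.log(presentForMcp(usage).content[0].text);
```

Because both presenters read the same `usage` value, the dashboard and the agent cannot disagree about the numbers. They can only differ in how they render them.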

That's exactly why "the observability stack is visible to AI" actually holds. Build the backend, but without an AI-facing presentation layer (= MCP), AI can't query it. MCP is the piece that makes "hand it to AI" actually work.

The Real Driver of Self-Healing

The layer that keeps the observability stack from being "just a screen to look at" is Self-Healing. I covered the full picture in AI Harness Series Part 4, so I'll skip the details here, but from the observability side, the start and end of the chain are clear:

The flow:

Detect — Production alert / CI failure fires a Loki LogQL alert
Deliver — POST to event-relay (the internal webhook hub)
Launch — auto-review bot starts up (= an agent backed by Claude Code)
Gather context — The bot pulls full logs via Grafana MCP, traces related PR / commit / code via Product Graph MCP
Propose — File a fix PR
Verify — If CI passes, the bot auto-merges; if not, another bot reviews

So the starting point of Self-Healing is whether the observability stack can hand "what broke" to AI in the right shape. If errors aren't recognized / stacktraces aren't preserved / related code (PR / commit / graph) isn't reachable — any of those missing and the chain stops cold. (The specific failure modes are in the next section.) Put another way:

The quality of observability is the ceiling for AI autonomous operation.

That's the central claim of Part 2. Reframe the observability stack as "input that drives AI," not "monitoring infrastructure," and the priorities of your design decisions shift accordingly.
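One way to read "observability quality is the ceiling" is as a precondition check on the event that starts the chain. The sketch below encodes the three conditions named above (error recognized, stacktrace preserved, related code reachable). The event shape and field names are assumptions for illustration, not the actual event-relay payload.

```typescript
// Hypothetical shape of the event handed to the bot at the start of the chain.
interface FailureEvent {
  level: string; // e.g. "error", "info"
  message: string;
  stack?: string; // preserved stacktrace, if any
  commitSha?: string; // link back to the change that shipped the code
}

type Readiness = { ready: true } | { ready: false; missing: string[] };

// Reports which of the three preconditions the event fails to meet.
function selfHealingReadiness(e: FailureEvent): Readiness {
  const missing: string[] = [];
  if (e.level !== "error") missing.push("not recognized as an error");
  if (!e.stack) missing.push("stacktrace not preserved");
  if (!e.commitSha) missing.push("related code not reachable");
  return missing.length === 0 ? { ready: true } : { ready: false, missing };
}

console.log(selfHealingReadiness({ level: "info", message: "timeout" }));
```

An event logged at info level with only a message fails all three checks, and there is nothing downstream that an agent could do to recover the missing pieces.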

What's Still Open — Defining "What Counts as an Error" and the Stacktrace Design

The biggest remaining issue, honest version.

You can polish the observability stack to a mirror finish, but if the design of what counts as an error and whether the stacktrace survives falls apart, all of it is wasted. I touched on this earlier in AI Harness Series Part 2 in the context of cortex's internal knowledge graph, and it shows up on the observability side too.

Concretely, here are the failure modes:

try / catch swallows the error without logging → nothing reaches the observability stack
catch does log, but at console.log-equivalent info level → not recognized as an error
Error gets emitted, but only error.message is written; stacktrace is dropped → AI can't trace back to the original code
An async error goes unhandled and the process falls over

These are all problems at the code that creates the observability entry point, not at the observability stack itself. No matter how polished the stack is, if the faucet at the entry point is broken, nothing flows out.
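The first three failure modes are easy to reproduce side by side. In this sketch an in-memory array stands in for the log pipeline, and the logger is a hypothetical stand-in, not the one cortex uses. The fourth mode (an unhandled async error) is left out because it would terminate the example.

```typescript
type Entry = { level: "info" | "error"; message: string; stack?: string };
const sink: Entry[] = []; // stands in for the log pipeline
const logger = {
  info: (message: string) => sink.push({ level: "info", message }),
  error: (message: string, stack?: string) => sink.push({ level: "error", message, stack }),
};

function risky(): never {
  throw new Error("payment gateway timeout");
}

// 1. Swallowed: nothing reaches the sink.
try { risky(); } catch { /* ignored */ }

// 2. Logged at info level: present, but not recognized as an error.
try { risky(); } catch (e) { logger.info((e as Error).message); }

// 3. Message only: recognized as an error, but the stacktrace is gone.
try { risky(); } catch (e) { logger.error((e as Error).message); }

// 4. For contrast: level and stack both preserved.
try { risky(); } catch (e) {
  const err = e as Error;
  logger.error(err.message, err.stack);
}

console.log(sink.map((entry) => `${entry.level} stack=${entry.stack !== undefined}`));
```

Four identical failures produce three log entries, and only the last one carries enough for anyone, human or AI, to get back to the throwing line.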

What's in place today is three layers, none of them complete:

lint (static) — The no-silent-catch rule blocks empty catches and .catch(() => null)-style swallows. But once there's any function call inside the catch, lint is satisfied — so patterns like "demote to logger.info(err.message)" or "log only error.message and drop the stacktrace" slip through statically
Guideline document — Rules like "use serializeError(error) to store stacktrace as a structured field" and "dropping stack via logger.error(err.message) is a Major violation" are written down in the internal guidelines. But static checking can't enforce these; they rely on human / AI review
AI auto-review — The PR auto-review bot does look at test coverage including "are error cases being tested," but it has no observability-specific checklist, so it can't systematically catch stacktrace design quality

In other words: "There's a guideline, lint catches some, AI review catches some, but it's not airtight" is the honest description. The real gap is that at the moment new code is being written, there isn't a harness that proactively suggests / completes "this should be treated as an error, this should keep its stacktrace." Auto-review picks things up at PR time, but a proactive harness for the observability entry-point design itself isn't built yet.
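The guideline names `serializeError(error)` but the post doesn't show it. A minimal sketch of what such a helper could look like, keeping the stacktrace and any `cause` as structured fields, is below. The output shape and the handling of non-Error values are assumptions, and cyclic `cause` chains are not guarded against.

```typescript
interface SerializedError {
  name: string;
  message: string;
  stack?: string;
  cause?: SerializedError;
}

// Converts any thrown value into a plain object that can be logged as a structured field.
function serializeError(error: unknown): SerializedError {
  if (error instanceof Error) {
    const cause = (error as Error & { cause?: unknown }).cause;
    return {
      name: error.name,
      message: error.message,
      stack: error.stack,
      ...(cause !== undefined ? { cause: serializeError(cause) } : {}),
    };
  }
  // Non-Error throwables (strings, plain objects) still get a structured shape.
  return { name: "NonError", message: String(error) };
}

// Usage: attach the whole object instead of logging err.message alone.
const outer = new Error("subscription update failed") as Error & { cause?: unknown };
outer.cause = new TypeError("plan is undefined");
const serialized = serializeError(outer);
console.log(serialized.name, serialized.cause?.name);
```

Logging `serializeError(err)` as a field, rather than `err.message` as the line, is the difference between failure mode 3 above and an entry an agent can trace back to code.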

"Observability stack: done. Observability target design: still on humans." That's the honest picture. Closing that gap with a harness is the next step.

Closing — Static Edition + Dynamic Edition Are Lined Up; Merging Them Is the Next Series

The code-graph series was about reshaping a static analysis graph so AI could query it — handing the structure of code as fact. This two-part series was about handing what's happening in production right now, also as fact.

Edition	Shape	What's Handed Over
Static edition (code-graph + db-graph + annotation graph)	3-graph parallel + SAME_ENTITY	Code and meaning
Dynamic edition (Part 1 + this post)	Prometheus / BQ / Loki + MCP	Production behavior and cost

The honest part: these two still sit side by side, not joined. For cortex's stated principle of "don't let AI infer — hand it facts" to truly reach completion, the next step is to pour dynamic data into the static graph and merge them. This is the exact same gap I flagged as the "absence of dynamic analysis" open issue at the end of code-graph Part 2: putting "how often is this edge actually used in production?" on the static graph's nodes. That's when "hand it as fact" reaches its final form.

Layer Self-Healing on top of static + dynamic and you get "AI autonomously operates," which works today. But merging the two editions into one graph is still ahead — that's the next series.

And one more time, observability target design (what counts as an error, whether stacktrace survives) is what really sets the ceiling. Harness-ifying that is the next homework item.

Thanks for reading this far.

Observability Design for the AI Era — Application / Infrastructure / CI / LLM, Each in Its Own Shape

Ryosuke Tsuji — Mon, 06 Jul 2026 23:44:23 +0000

Hi, I'm Ryan, CTO at airCloset.

In the previous series, code-graph deep dive (Part 2), I wrote about making a 46-repo codebase semantically searchable for AI. The final issue I left open in that piece was the absence of dynamic analysis:

What lives on the graph is the fact that "this edge exists statically." How often that edge actually gets used in production isn't recorded.

A graph that gives you static facts is one thing. Telling AI what's actually happening in production right now is a separate problem. So the same shaping discipline I applied to the static graph needs to apply to the observability stack too.

This post is the first half of that story. I split it into two: Part 1 (this post) covers how I shape four different monitoring surfaces (application / infrastructure / CI / LLM). Part 2 covers PII handling, the integration surface, and Self-Healing.

What Does "Observable to AI" Even Mean?

The biggest lesson from the code-graph series was: the data has to be shaped before AI can consume it. Throwing 46 repositories of source at a model blows past the context window and invites hallucination. So we shaped it — static analysis into a graph, boundary nodes given meaning, SAME_ENTITY joins between graphs — and only then handed it over.

The observability stack has the exact same problem. Throw raw production logs at AI and you get:

Sheer log volume that drowns the context window
No way for the model to tell errors from noise
Metrics, logs, and traces that don't link to each other
Questions like "what are we spending right now" that raw logs don't answer at all

In other words, logs have to be reshaped before AI can use them. Same problem, different domain.

The catch is that the right shape depends on what you want AI to answer. At cortex (the internal AI platform), I split the monitoring surface into four axes and let each one settle into its own form:

Note: "cortex" here refers to airCloset's internal AI platform codename. Unrelated to Snowflake Cortex, Palo Alto Networks Cortex, etc.

Monitoring target	What you want AI to answer	Shape
Application	"What's happening in production right now?" (exploration)	log + trace
Infrastructure	"Do we have enough resources? Anything down?" (time series)	metric
CI	"What broke? Since when?" (alert + history)	log + alert
LLM	"How much are we spending? Who's using how much?" (real-time + structured aggregation)	metric + structured records

"Just push everything through OTel and dump it all in Loki" is an option. But the moment you do, you're asking one backend to answer wildly different kinds of questions — real-time "what's spending right now" alongside "monthly cost broken down by team via SQL" — and one of them is going to suffer. Splitting by purpose is the choice I made.

Let me walk through each of the four axes. Application and infrastructure are the foundation, so I'll keep those brief. CI and LLM are where the AI-era design judgments actually surface, so I'll dig into those.

Application — OTel + Loki + Tempo, the Standard Stack

The foundation is unremarkable. Every cortex application is instrumented with OpenTelemetry, with traces going to Tempo, logs to Loki, and metrics to Mimir — the standard Grafana Cloud setup.

There's no special trick here. What matters is the discipline: every app emits logs and traces in the same shape. That uniformity is what lets AI later run something like {service_name="<service>"} |~ "error" through MCP and investigate across services.

I covered the actual instrumentation in AI Harness Series Part 4 (Self-Healing), so I'll leave the details there. The point worth repeating is: a standard OTel stack, properly laid down, is the precondition for everything AI-driven that comes later.

Infrastructure — Cloud Run / BigQuery / Pub/Sub Metrics, All Into Mimir

cortex runs on GCP and stitches together Cloud Run, Cloud Run Jobs, BigQuery, Pub/Sub, Cloud Tasks, and the usual suspects. Each GCP resource's metrics (CPU, memory, execution count, latency, queue dwell time, etc.) flow through Cloud Monitoring into Mimir.

Nothing special here either — just standard GCP metrics, all gathered into one Mimir instance. But that "one place" property pays off later: AI can answer "which service used the most CPU last week?" or "is there a worker with a clogged queue?" naturally, because everything is queryable from a single store. MCP picks it up from there.

That's it for the foundation. Standard observability stacks are well-documented elsewhere; go read Grafana's and OpenTelemetry's docs if you want the details.

The interesting AI-era design judgments are in the next two axes — CI and LLM.

CI — Ship Logs to Loki via Post-Hoc Pull, Not Webhook Push

cortex runs CI on GitHub Actions, and I ship every CI log into Grafana Loki.

"Why? GitHub Actions has a perfectly good UI for that" is a reasonable question. The reasons are concrete:

Having AI hit the GitHub Actions API on every investigation is slow and auth-heavy. Ingesting into Loki once means AI can query it ad-hoc
One Loki instance holds CI logs and application logs together, so you can cross-query them
LogQL alerts turn CI failure into a structured signal
AI can ask "any tests that have been broken since last week?" in natural language

But the shipping mechanism is unusual. The choice cortex made:

Don't push logs from inside the CI run. After the run finishes, pull them from the GitHub API.

Concretely:

When the Test job ends, a workflow_run event fires
A separate workflow dedicated to log shipping triggers
That workflow pulls logs from the GitHub API (/repos/.../actions/jobs/.../logs)
Ships them to Grafana Cloud as structured JSON (job / status / ref / pr / commit / output, etc.) via OTLP /v1/logs

Filter on {service_name="ci", ref="main", status="failure"} and you get just the main-branch CI failures, cleanly.
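As a rough sketch of the last step, here is how one finished job could be wrapped in an OTLP/HTTP JSON logs payload. The envelope (resourceLogs / scopeLogs / logRecords) follows the OTLP JSON encoding; the function name, the shape of the job dict, and the choice of which fields become attributes are assumptions for illustration, not the actual shipping workflow. How attributes end up as queryable labels such as ref or status depends on how the backend is configured.

```python
import json
import time
from typing import Optional


def build_otlp_log_payload(job: dict, output: str, now_ns: Optional[int] = None) -> dict:
    """Wrap one finished CI job in an OTLP/HTTP JSON logs payload."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    # Structured record as listed in the article; the key names of `job` are assumed.
    body = json.dumps({
        "job": job["name"],
        "status": job["conclusion"],
        "ref": job["ref"],
        "pr": job.get("pr"),
        "commit": job["commit"],
        "output": output,
    })
    # Low-cardinality fields are duplicated as attributes so they can be filtered on.
    attributes = [
        {"key": key, "value": {"stringValue": str(job[src])}}
        for key, src in (("status", "conclusion"), ("ref", "ref"), ("job", "name"))
    ]
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "ci"}},
            ]},
            "scopeLogs": [{
                "logRecords": [{
                    "timeUnixNano": ts,
                    "severityText": "ERROR" if job["conclusion"] == "failure" else "INFO",
                    "body": {"stringValue": body},
                    "attributes": attributes,
                }],
            }],
        }],
    }
```

The resulting dict would be serialized and POSTed to the OTLP /v1/logs endpoint with whatever authentication the shipping workflow holds; that part is left out here on purpose.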

Why pull instead of push:

CI execution and observability decouple. If shipping fails, the test run is unaffected. You can also retry / replay shipping independently
No path for PR code to touch the API key. The shipping workflow runs in the default-branch context and uses base-repo secrets, not whatever a fork PR brought. The test workflow itself never touches the Grafana API key — that's a structural guarantee, not a "we trust it won't leak"
Shipping failure becomes observable. If shipping lives inside CI, a shipping bug means the observability stack goes silent — and you don't notice. Split them, and the shipping workflow's success / failure is itself something you can alert on

The moment a main-branch failure shows up, a LogQL alert fires and Slack gets pinged. That's the trigger for Self-Healing, which I cover in Part 2.

LLM — Gemini and Claude Code, Two Different Shapes

The last axis is LLM observability. cortex uses both Gemini API and Claude Code (Anthropic's official CLI) heavily, and since both cost money, I want visibility into how they're used (though the billing models differ: Gemini is pay-per-use, Claude Code is a subscription, and that difference matters later). The reason I shape them differently isn't really about "what kind of question." It's about where you can instrument, the instrumentation locus:

Gemini — I own the calling code, so I can wrap every call with a common helper and emit metrics inline. Prometheus is the natural fit.
Claude Code — It's an external CLI; I can't wrap its calls from the inside. Usage shows up as records after the fact. A structured store (BigQuery) is the natural fit.

The "real-time vs SQL aggregation" framing of the question is a consequence of where you can instrument, not the cause. With that clarified, here's how each one plays out.

Gemini — Prometheus, Cost Visible in Real Time via Client-Side Estimation

cortex uses Gemini everywhere: db-graph table description generation, code-graph field type inference, general context generation. What I want to see is what's expensive right now, with no lag. If a runaway prompt or batch job kicks off, I don't want to wait until tomorrow's billing report.

So every Gemini call goes through a common wrapper (traceGeminiCall) that emits four metrics per call:

gemini.tokens.total — cumulative tokens (labels: model / service / type=prompt|completion)
gemini.requests.total — request count (labels: model / service / status)
gemini.request.duration — latency histogram
gemini.cost.usd — estimated cost (labels: model / service)

The design choice that splits opinions is: who computes the cost? Two options:

A. Pull from Google Cloud Billing API after the fact — accurate, but billing lags by hours to a day, and there's no per-task cost granularity
B. Compute client-side from token counts × a price table — instant, with per-task granularity attached by you, but the price table needs upkeep

I picked B. The price table lives in a constant called GEMINI_PRICING and gets manually bumped whenever Google moves prices. Just gemini-3-flash / gemini-3-pro with input/output unit prices each. Nothing fancy.
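A minimal sketch of option B, assuming a price table keyed by model with per-million-token unit prices. The constant name GEMINI_PRICING comes from the article; the numbers below are placeholders and not real Gemini prices, and estimate_cost_usd is a hypothetical helper name.

```python
# Placeholder USD prices per one million tokens. These are NOT real Gemini prices.
GEMINI_PRICING = {
    "gemini-3-flash": {"input": 0.50, "output": 2.00},
    "gemini-3-pro": {"input": 2.00, "output": 8.00},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one call as token counts times unit prices."""
    price = GEMINI_PRICING[model]  # an unknown model raises KeyError on purpose
    return (
        prompt_tokens * price["input"] + completion_tokens * price["output"]
    ) / 1_000_000
```

Failing loudly on an unknown model is a deliberate choice in this sketch: a silently missing price would make the cost counter under-report, which defeats the purpose of a tripwire.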

The real reason for B is real-time visibility:

Billing lags by hours to a day. A runaway prompt or batch bleeds cost all night before tomorrow's billing surfaces it. Computing client-side, tokens times price right after the call, lets you see "what's expensive right now" at the service level (code-graph / gcs-transformer / db-dictionary and so on, app/pipeline-grained) within minutes. That's a speed billing can never match.
Price table maintenance is light (Google doesn't change prices often), so the upkeep cost is trivial.
Pulling from the Cloud Billing API means authentication, fetching, normalization, and fan-out: a whole extra pipeline you'd have to maintain.

Then I emit gemini_cost_usd_USD_total as a cumulative Prometheus counter (the doubled usd_USD comes from OTel meter name gemini.cost.usd combined with the unit USD during Prometheus exporter conversion) and PromQL can answer "how much did we spend in the last hour" directly: sum(increase(gemini_cost_usd_USD_total[1h])). Alert fires at $1/hour, info severity, into Slack. In practice this is less an aggregation surface I query after the fact and more a tripwire: the threshold-crossing Slack alert is how a runaway gets caught.

One line worth drawing here: the gemini.cost.usd counter carries exactly two labels, model and service, and service is coarse (a bounded set of app/pipeline names). Try to push call-site-level identity onto the label, "what did that one prompt cost," and the label combinations blow up across many repos and inference types until the time-series DB can't absorb them. So the Prometheus side stays a tripwire: coarse service granularity, immediate alerting, nothing finer. The per-prompt attribution question, "which prompt burned the most this week," isn't a time-series question at all, it's a SQL one. That wants the token records in BigQuery with as much call-site context as you care to attach, which is the same reason Claude Code goes to BQ below. "I can instrument this call" and "this should live as a time series" are separate claims, and the fine-grained aggregation is where Gemini and Claude Code converge back onto the same backend.
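To make the label blow-up concrete: the worst-case number of time series for one metric is the product of the cardinalities of its labels. The cardinalities below are invented purely to show the shape of the growth; call_site and prompt_kind are hypothetical label names, not labels cortex actually uses.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, chosen only for illustration.
coarse = {"model": 2, "service": 10}
fine = {**coarse, "call_site": 500, "prompt_kind": 20}

print(series_count(coarse))  # 20
print(series_count(fine))    # 200000
```

Two coarse labels stay in the tens of series; two more fine-grained ones multiply that by four orders of magnitude, which is the point at which the question belongs in a SQL store instead.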

Prometheus is what you want when the question is "right now."

Claude Code — Send to BigQuery, Built for SQL Aggregation

Every developer at the company uses Claude Code. But the economics differ from Gemini: it's a subscription, so token usage doesn't translate straight into a dollar figure. What I'm after here is less the cost itself and more the usage picture (who's using how much, how many tokens per repo, how well the cache is landing), so I can feed it back into better usage habits.

The question that split opinion: "Should Claude Code usage go to Loki too?"

The answer: No, into BigQuery.

Why? Because Claude Code usage is, fundamentally, a structured ledger:

email — the user
repository — which repo it was used in
timestamp — when
input_tokens / output_tokens
cache_creation_input_tokens / cache_read_input_tokens — prompt-cache effectiveness included

And the questions you want to ask look like:

"Last week, what's the cumulative spend for Team A members?"
"How much did edits on Repo X cost over the past month?"
"What's the prompt-cache hit ratio difference between teams?"

All of these are SQL aggregation questions. LogQL aggregation and joins on Loki are painful. In BigQuery, with a DAY partition and email as the primary key, the queries practically write themselves.
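Here is a small sketch of what such a question looks like as SQL, using an in-memory SQLite database as a stand-in for BigQuery so it runs anywhere. The usage columns mirror the ledger fields listed above; the team_members mapping table, the sample rows, and the exact definition of the cache ratio are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage (
    email TEXT, repository TEXT, ts TEXT,
    input_tokens INTEGER, output_tokens INTEGER,
    cache_creation_input_tokens INTEGER, cache_read_input_tokens INTEGER
);
CREATE TABLE team_members (email TEXT, team TEXT);  -- hypothetical mapping table
""")
conn.executemany("INSERT INTO team_members VALUES (?, ?)", [
    ("a@example.com", "team-a"), ("b@example.com", "team-b"),
])
conn.executemany("INSERT INTO usage VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("a@example.com", "repo-x", "2026-07-01", 100, 50, 100, 800),
    ("a@example.com", "repo-y", "2026-07-02", 200, 50, 0, 800),
    ("b@example.com", "repo-x", "2026-07-01", 500, 50, 250, 250),
])

# Share of input-side tokens that were served from the prompt cache, per team.
CACHE_RATIO_SQL = """
SELECT m.team,
       1.0 * SUM(u.cache_read_input_tokens)
           / SUM(u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens)
           AS cache_read_ratio
FROM usage AS u
JOIN team_members AS m ON m.email = u.email
GROUP BY m.team
ORDER BY m.team
"""
rows = conn.execute(CACHE_RATIO_SQL).fetchall()
for team, ratio in rows:
    print(team, round(ratio, 2))
```

The join plus GROUP BY is the part that is awkward to express in LogQL and routine in any SQL engine.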

So the Claude Code → BigQuery pipeline runs in four stages:

Emit — A bundled analyzer in Claude Code POSTs UsageInput (token info only, no email) to an internal endpoint
Auth proxy — A Cloudflare Edge Router worker validates CORTEX_API_KEY and stamps the user's email onto the request as X-Cortex-User-Email
Ingest — A Cloud Run API dedupes and publishes to Pub/Sub
Persist — A Cloud Run worker pulls from Pub/Sub, validates the schema, and streaming-inserts to BigQuery

Two structural points worth calling out:

Identity authority lives at the Edge Router. User identity is resolved exactly once, there. The emit side (Claude Code) never holds the email. This shuts down whole classes of client-side id-spoofing and social-engineering paths structurally
Pub/Sub gives async decoupling. Ingest and worker are separate, so backpressure on the worker doesn't affect ingest response times. On failure, Pub/Sub retries delivery up to five times before handing the message to the DLQ
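The identity-authority point can be sketched as a pure function over request headers: whatever identity header the client sent is discarded, and the proxy sets its own after resolving the API key. The X-Cortex-User-Email header name is from the article; the X-Cortex-Api-Key header, the key-to-owner table, and the function name are hypothetical, and the real worker runs on Cloudflare rather than in Python.

```python
from typing import Dict, Optional

# Hypothetical key-to-user table; a real deployment would back this with a secret store.
API_KEY_OWNERS = {"key-alice": "alice@example.com"}

API_KEY_HEADER = "X-Cortex-Api-Key"       # assumed header name
IDENTITY_HEADER = "X-Cortex-User-Email"   # header name from the article


def stamp_identity(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the headers to forward upstream, or None if the API key is unknown."""
    email = API_KEY_OWNERS.get(headers.get(API_KEY_HEADER, ""))
    if email is None:
        return None
    # Drop the API key and any client-supplied identity header, whatever its casing.
    blocked = {API_KEY_HEADER.lower(), IDENTITY_HEADER.lower()}
    forwarded = {k: v for k, v in headers.items() if k.lower() not in blocked}
    # The proxy is the only place where identity gets written.
    forwarded[IDENTITY_HEADER] = email
    return forwarded
```

Because the client-supplied value is removed before the stamp, a spoofed header never reaches the ingest API even if the key itself is valid.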

What sits in BigQuery is visible day-by-day through the internal portal I'll cover in Part 2. Here's what it actually looks like:

[Screenshot: Claude Code usage dashboard in the internal portal]

The numbers are interesting enough to mention briefly: in the last 30 days, 78.0B tokens / 384K messages / 47 users / 79 repositories. The one to focus on is Cache Read Input at 75.1B (96% of total) — prompt-cache is dramatically effective. On a subscription this doesn't show up as a dollar figure, but cache read tokens carry roughly 1/10 the effective input rate, so if you were paying per-token API pricing for the same usage, this works out to roughly 7× more efficient at the blended input level versus the cache-less counterfactual. Being able to see usage efficiency as a concrete number like this is the point of the visualization; "aggregation-shaped backend matched to the question" is the design choice that makes this kind of metric fall out of SQL naturally and show up daily. Doing the same thing in LogQL would be a battle.
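The back-of-the-envelope arithmetic behind those figures can be sketched as follows. The token totals and the "roughly 1/10" cache-read rate are from the article; treating every non-cache-read token as billed at the full input rate is a simplifying assumption of this sketch, since the article doesn't give the full breakdown.

```python
TOTAL_TOKENS = 78.0e9       # from the article
CACHE_READ_TOKENS = 75.1e9  # from the article
CACHE_READ_RATE = 0.1       # "roughly 1/10" of the normal input rate

# Simplifying assumption: every other token is billed at the full input rate.
other_tokens = TOTAL_TOKENS - CACHE_READ_TOKENS

# Cost in "input-token units" with the cache, and in the cache-less counterfactual.
with_cache = CACHE_READ_TOKENS * CACHE_READ_RATE + other_tokens
without_cache = TOTAL_TOKENS

cache_share = CACHE_READ_TOKENS / TOTAL_TOKENS
efficiency = without_cache / with_cache
print(f"cache reads: {cache_share:.1%} of tokens, efficiency: {efficiency:.1f}x")
```

Under this simplification the ratio lands between 7 and 8, in the same ballpark as the roughly 7× quoted above; pricing cache creation and output tokens above the plain input rate would pull it down somewhat.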

As a side note: MCP tool-call logs end up in BigQuery too (cortex.mcp_tool_calls), but via a simpler path — each MCP server just writes records directly, no OTel in the loop. The "annotation graph MCP used ~50,000 times by ~73 people" figure from the previous series came from this exact table.

The core point of this layer is: don't dogmatically force everything through OTel — match the tool to the qualitative nature of the aggregation.

To Be Continued

That's the four axes (application / infrastructure / CI / LLM) and the design judgments behind each. The write-side of the observability stack is wrapped up.

But shaping the write side isn't the whole story. The moment production data flows through the stack, PII becomes a constraint you have to design around. And the data has to actually be consumable by AI through MCP, with a thoughtful integration surface for both humans (web dashboards) and AI (MCP). Connect all of that, and the real driver of Self-Healing comes into focus from the observability side. That's the Part 2 story.

Thanks for reading. Part 2, "Observability Design for the AI Era — Reconciling PII Protection With AI Searchability, and Driving Self-Healing," is out now. Read on.

Making the Context Across 46 Repositories Semantically Searchable for AI

Ryosuke Tsuji — Mon, 29 Jun 2026 23:50:05 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

In Part 1, I wrote about unifying 46 repositories of production code into a single knowledge graph via static analysis. The graph itself got built, but I closed the post with four open issues: no semantic search, node explosion, having to open the file to actually know what a function does, and the cost of writing a new parser every time a new boundary pattern showed up.

This Part 2 is about how I solved the first one — the entry-point problem (no semantic search). The other three are left exactly as Part 1 described them — I'll come back to them at the end, together with the new issues that surfaced once the entry-point problem was out of the way.

The reason to start with the entry-point problem is simple: if the graph exists but the only way to reach it is grep, the model ends up inferring anyway. The whole point — "give the model verified facts, not inference" — falls apart. So the entry-point problem had to be solved before the others.

The Hint Was in db-graph

Months earlier, I'd already solved the same structural problem in a different domain — the db-graph project.

Internally, we had a large number of DB tables spread across many services, and no single person had the full picture. Different people knew different pieces well, but the whole map didn't fit in anyone's head. So I built db-graph: extract schemas statically from ORM definitions, generate per-table descriptions with Gemini, embed them as 768-dimensional vectors in the graph, and make the whole thing semantically searchable in natural language.

At the time of that article it covered 991 tables. Today it spans 21 schemas / 1,133 tables / 10,815 columns, and finding data in natural language without knowing table names is just how people work now.

The pattern that proved out there:

Static-analysis graph + AI-generated context = natural-language semantic search works.

Bringing the Same Pattern to code-graph

If it worked for db-graph, it should work for code-graph. The moment that thought landed, I noticed something:

code-graph already contains "DB table nodes" as boundary nodes — they're one of the boundary node types I covered in Part 1.

So if I just join code-graph and db-graph, code-graph automatically inherits db-graph's semantic context. Without writing a single annotation, the existing assets alone make the graph meaningfully richer.

That's where the idea of "joining graphs" first came up — not treating each graph as its own island, but designing the joins between them.

But API / Event / Page Still Need Meaning — and Annotating Every Function Is Off the Table

Joining db-graph took care of DB context. But the remaining boundaries (API / Event) and the graph's entry-point type (Page) still need meaning attached. Static analysis alone can't pull intent out of those, so context has to come from somewhere else.

The choice was clear: write the intent directly into the code via annotations (the same approach used by cortex's internal knowledge graph, which I covered in AI Harness Series, Part 2).

The catch: you can't annotate all the functions across 46 repos. There must be tens of thousands of them. Asking established teams running an existing production codebase to retroactively annotate everything is just not realistic.

But here's the second realization:

What matters is just the boundary nodes. So if I only annotate around the boundaries, that's enough.

When an AI agent asks "what breaks if I change this code" or "what other repos call this API," what it needs isn't a per-function logic explanation. It needs boundary intent — what is this screen for, what does this API return, what milestone in the business does this Event mark.

= Minimum annotations, maximum meaning. That became the heart of the design.

Designing the annotation graph

Putting it together (internally we call this annotation graph service-product-graph, or SPG):

Three graphs sit as peers, joined by SAME_ENTITY edges. There's no hierarchy — you can start from any graph and reach the others.

code-graph (structure) — functions / classes / boundary nodes from static analysis (46 repos)
db-graph (DB context) — 1,133 tables, semantically described
annotation graph (intent) — @graph-* tags written only around boundaries

The entry point for AI agents is a single MCP server that traverses all three graphs. AI agents never hit db-graph directly — the annotation graph's MCP server proxies db-graph calls on their behalf.

The annotation graph has 7 node types: Page / Section / Dialog / Field / Action / Api / Task. The early version was screen-focused and called screen-graph, but once it grew to cover backend Api / Task, it was renamed to service-product-graph.

An Annotation Example

Here's what an annotation looks like (fictional, but close in shape to the real ones):

/**
 * @graph-page /home
 * @graph-business Main screen. Members can see what they're currently renting, buy items, and initiate returns.
 * @graph-label Home Screen
 * @graph-has-section banners, wearing-items, wearing-return, delivery-status
 * @graph-has-dialog buying-modal, return-modal
 * @graph-navigates-to /return-procedure, /checkout, /my-karte
 * @graph-calls GET /api/v1/wearing
 * @graph-reads admin_delivery_orders, admin_rental_items
 * @graph-flow styling-loop
 * @graph-status monthly-member
 */

Two things matter here:

@graph-business carries the intent text (in our actual codebase it's written in Japanese). This is exactly what gets vectorized — it's the substance of semantic search.
@graph-flow / @graph-status carry where this sits in the member lifecycle (free signup → monthly subscription → styling loop → cancellation, etc.) and which member segment it's for. They add a second dimension of meaning: "this screen shows up inside the styling loop for monthly members."

There's also @graph-case (the conditional pattern tag that test cases derive from), but that's for another time.
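For a sense of how little machinery the tags need, here is a minimal parser sketch that collects @graph-* tags from a single doc comment. It is not the SPG parser; the function name and the assumption about which tags hold comma-separated lists are mine, inferred from the example above.

```python
import re
from typing import Dict, List

TAG_LINE = re.compile(r"^\s*\*\s*@graph-([a-z-]+)\s+(.*\S)\s*$")
# Tags whose value is treated as a comma-separated list (an assumption from the example).
LIST_TAGS = {"has-section", "has-dialog", "navigates-to", "reads"}


def parse_graph_tags(comment: str) -> Dict[str, List[str]]:
    """Collect @graph-* tags from one doc comment into {tag: [values]}."""
    tags: Dict[str, List[str]] = {}
    for line in comment.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # opening/closing comment lines and untagged text are skipped
        name, value = match.groups()
        if name in LIST_TAGS:
            values = [v.strip() for v in value.split(",")]
        else:
            values = [value]  # free text such as @graph-business is kept whole
        tags.setdefault(name, []).extend(values)
    return tags
```

Keeping @graph-business as one untouched string matters because that text is what gets embedded for semantic search; splitting it on commas would mangle the sentence.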

Running Annotations Without Interfering With the Day-to-Day Dev Workflow

This is where it gets practical.

Once I committed to building annotation graph, here were the constraints:

Engineers run normal product dev with human code review
AI review isn't wired up on every repo yet — cortex's fully automated review (covered in AI Harness Series, Part 6) only works inside the cortex monorepo
Asking humans to review annotations on top of their normal review load is a non-starter
Even a split like "humans review the code, the AI reviews the annotations" inside the same PR mixes two review streams together and just confuses everyone

In other words: don't mix humans and AI inside the same PR.

The solution was to physically separate annotations onto their own branch.

Leave main untouched; engineers' normal flow stays exactly as it was
Stand up a separate annotation branch that's the AI's exclusive territory
When main changes, a webhook fires
The annotation branch handles generating the diff annotations and reviewing them — the AI does both, end-to-end
From the engineer's side, they only touch main and don't even need to know annotations exist

This is the "every line of code passes through an AI gate" ideal from AI Harness Series, Part 6, adapted to the constraints of an existing organization. cortex (the internal AI platform) is a monorepo I'm assembling from scratch, so "every commit passes the AI gate" actually holds there. For the 46-repo production system, that precondition doesn't hold. So instead of giving up on the ideal, I split it: engineers' workflow on one branch, AI's annotation workflow on another, both running in parallel.

Protecting Cross-Graph Consistency With an SLO

Just running the annotation pipeline doesn't guarantee the quality of the joins between the three graphs (code-graph / db-graph / annotation graph). So there's a set of SLOs that automatically check the consistency across the entire graph.

The main rules:

API chain connectivity — at least 95% of HANDLES_API handlers must have downstream function calls (= no handlers that receive an API and then do nothing)
DB access completeness — at least 80% of DB read/write edges must be joined to db-graph column nodes (= code-graph's DB boundaries are connected to db-graph's meaning)
Event field resolution — at least 70% of Event edges must carry field-level information
No ambiguous edges — name-resolution-ambiguous edges must be 0 (severity: error)

These are really just a naive question — "shouldn't the boundaries connect to each other?" — turned into an SLO. If anything drops below threshold, an alert fires, and the trustworthiness of the whole graph gets defended every day.
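The rules above reduce to a small threshold check. In this sketch the thresholds are the ones listed in the article, while the metric names, the rule encoding, and the function name are illustrative assumptions.

```python
from typing import Dict, List

# (metric name, comparison kind, threshold). Names are illustrative; thresholds are
# the ones stated in the article.
SLO_RULES = [
    ("api_chain_connectivity", "min", 0.95),
    ("db_access_completeness", "min", 0.80),
    ("event_field_resolution", "min", 0.70),
    ("ambiguous_edges", "max", 0),
]


def violated_slos(measured: Dict[str, float]) -> List[str]:
    """Return one message per rule that the measured values fail to meet."""
    violations = []
    for name, kind, threshold in SLO_RULES:
        value = measured[name]
        ok = value >= threshold if kind == "min" else value <= threshold
        if not ok:
            violations.append(f"{name}: {value} (threshold {threshold})")
    return violations
```

A daily job would compute the measured ratios from the graph tables and alert whenever the returned list is non-empty.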

The daily boundary-analysis cron from Part 1 (5% connection-rate drop = alert) was code-graph-only. This is a cross-graph SLO — it guards the joins between graphs themselves. Add a parser to one repo, write a new annotation, change a schema — whatever happens, by the next morning a quality drop in any join becomes visible.

Joining the Static Graph and the Annotation Graph via SAME_ENTITY Bridges

I've been writing "join" casually, but the actual joining wasn't that straightforward.

Static-analysis API / Page / Task nodes and annotation graph API / Page / Task nodes are created as separate nodes. They mean the same thing, but their names / paths / identifiers don't match by themselves — there's nothing automatic about lining them up.

To connect them, we generate a separate edge type called SAME_ENTITY. There are three bridges:

API bridge — API path normalization with a 4-stage fallback
1. Per-repo prefix conversion (e.g., normalize console-side /console/api/ to /api/)
2. Version stripping (/v1.x/ → /)
3. Parameter normalization (unify /:id, /{id} to /:dynamic)
4. Exact match → tolerate trailing ? → strip trailing :dynamic? → finally fall back to a dynamic-dispatch boundary :dynamic, loosening progressively
Page bridge — 6 strategies applied in priority order (URL direct match, component path match, itemId match, PascalCase normalization match, parent-directory linking, strip dynamic segments and match parent URL)
Task bridge — 8 per-repo patterns
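As a sketch of the API bridge, the first three normalization stages can be written as plain string rewriting. The prefix table, the regular expressions, and the function names here are assumptions that follow the examples in the list; the progressive loosening in stage 4 is not implemented.

```python
import re

# Hypothetical per-repo prefix rules, modeled on the /console/api/ -> /api/ example.
REPO_PREFIXES = {"console": ("/console/api/", "/api/")}


def normalize_api_path(repo: str, path: str) -> str:
    """Apply stages 1-3 (prefix, version, parameters) to one API path."""
    old, new = REPO_PREFIXES.get(repo, ("", ""))
    if old and path.startswith(old):
        path = new + path[len(old):]
    # Stage 2: drop a version segment such as /v1/ or /v1.2/.
    path = re.sub(r"/v\d+(?:\.\d+)*(?=/)", "", path)
    # Stage 3: unify /:id and /{id} style parameters to /:dynamic.
    path = re.sub(r"/(?::\w+|\{\w+\})", "/:dynamic", path)
    return path


def same_api(repo_a: str, path_a: str, repo_b: str, path_b: str) -> bool:
    """Exact match on normalized paths, i.e. only the strictest form of stage 4."""
    return normalize_api_path(repo_a, path_a) == normalize_api_path(repo_b, path_b)
```

With these rules, /console/api/v1/items/{id} in the console repo and /api/v1.2/items/:id elsewhere both normalize to /api/items/:dynamic, so an exact comparison is enough to propose a SAME_ENTITY edge between them.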

There was also one operational footgun. The first implementation used INSERT ... WHERE NOT EXISTS to avoid duplicates. But BigQuery's streaming-buffer visibility lag let duplicates slip in: in one repo the edges roughly doubled, from 106 to 214, overnight. We fixed it by rewriting to MERGE INTO to make the operation idempotent.

The Result: Entering the Graph from "the subscription-fee calculation"

With all of this in place, the entry-point problem from the end of Part 1 was finally solved:

"the subscription-fee calculation for members seems off"

Throw this natural-language query at annotation graph and vector search returns the related nodes (Page / Api / Function / DB table) as facts. From there, SAME_ENTITY takes you over to code-graph functions, including callers and callees in other repos. From the DB boundaries in code-graph, you can cross into db-graph and pull the relevant columns.

The entry point can be anywhere — "what calls this table?" starts from db-graph, "what's the blast radius of this function?" starts from code-graph, both walk the same connected network. From a single natural-language query, or from a specific node, you can now traverse all three graphs and get every relevant piece of code plus every relevant DB schema.
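The traversal itself is ordinary graph walking once SAME_ENTITY is treated as just another edge. The toy graph below is hand-made for illustration: the node ids, the edge type names other than SAME_ENTITY, and the one-directional edges are all assumptions, not real cortex data.

```python
from collections import deque

# Tiny hand-made graph; node ids and edges are illustrative only.
EDGES = {
    "spg:Page:/home": [("CALLS", "spg:Api:GET /api/v1/wearing")],
    "spg:Api:GET /api/v1/wearing": [("SAME_ENTITY", "code:Api:GET /api/wearing")],
    "code:Api:GET /api/wearing": [("HANDLED_BY", "code:Function:getWearing")],
    "code:Function:getWearing": [("READS", "code:Table:admin_rental_items")],
    "code:Table:admin_rental_items": [("SAME_ENTITY", "db:Table:admin_rental_items")],
}


def reachable(start: str) -> list:
    """Breadth-first walk that follows every edge type, SAME_ENTITY included."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for _edge_type, target in EDGES.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Starting from the annotated page, the walk crosses into code-graph at the first SAME_ENTITY edge and into db-graph at the second. Storing edges in both directions would give the "start anywhere" behavior described above.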

The Part 1 lament — "the graph is there but the entry point is missing" — could finally be put to bed.

Real Usage Numbers

From 2026-04-16 (first production deployment) to the time of writing — about 2.5 months — the annotation graph's MCP server has handled ~50,000 calls from ~73 users. The breakdown:

Engineers (PI Division + QA + relevant engineering teams) — ~47,000 calls / 51 users
Non-engineers (stylists / customer support / mall operations / executives / administration) — ~2,800 calls / 21 users

The interesting line is the second one. "Search the codebase in natural language" is usually an engineer's tool — but once the entry-point problem was solved, people outside engineering started using it too, asking things like "how does this feature actually work?" or "what's in this DB?" in their own words.

This is adjacent to the "non-engineers writing specs with AI" trend I covered in AI Harness Series, Part 5 — a graph that can be queried by meaning starts to matter org-wide. Call volume is overwhelmingly dominated by engineers, of course. The interesting thing is the range of job roles starting to pick it up. That's the real impact of solving the entry-point problem.

MCP as the Single Front Door

The MCP server is the cross-graph entry point. It exposes six tools — service search / service detail / API detail / data-flow tracing / impact-radius tracing / business-rule full-text search — and that's the only entry point AI agents ever touch.

One design choice worth calling out: AI agents never talk to db-graph directly. The annotation graph's MCP proxies db-graph calls. From the agent's side, the mental model stays simple: "ask one MCP and get everything back."

That makes the full chain — "Screen → API → Code → DB → Column" — traversable in a single MCP tool call.

April–May Timeline of Trial and Error

Same approach as Part 1 (pulling commits from Jan–Mar). For Part 2, the key commits are from April–May.

April: Expansion and the First Bridges

2026-04-14 ─ refactor(graph): rename screen-graph to service-product-graph — declaration that the scope expands from screen-only to whole-service
2026-04-15 ─ feat(graph): add Api and Task node types to service-product-graph parser — Api / Task node types added
2026-04-15 ─ feat(mcp): add cross-graph tools to service-product-graph MCP — cross-graph tools land (the single front door across all three graphs)
2026-04-15 ─ feat(graph): add SAME_ENTITY bridge edges between service-product-graph and code-graph — first bridges
2026-04-18 ─ feat(graph): resolve Redis keys to code-graph boundary nodes — boundary resolution through Redis
2026-04-19 ─ feat(service-product-graph): add EventBridge EMITS_TO support + SAME_ENTITY bridge
2026-04-20 ─ feat(code-graph, service-product-graph): improve SAME_ENTITY boundary bridge coverage — 4-stage fallback locked in
2026-04-21 ─ feat(auto-review): SPG annotation auto-maintenance pipeline — AI auto-maintenance pipeline (= what Part 1 hinted at with "humans alone can't, but AI can")
2026-04-22 ─ feat(service-product-graph): add Task SAME_ENTITY bridge to code-graph — all three bridges in place

May: Stabilizing and Expanding

2026-05-01 ─ Annotation generation moves from local execution to a Cloud Run Job; operation stabilizes
2026-05-05 ─ feat(spg): add mall repos to SPG indexing — mall repos indexed
2026-05-06 ─ feat(spg): add Go-aware parser — Go support
2026-05-06 to 08 ─ Page bridge strategies expanded to six, connection rate hits 100%

What This Timeline Says

April 15 was the day "expansion + cross-graph tools + bridges" landed in close succession. Over the following week, "Redis / EventBridge / Task bridges / annotation auto-maintenance" stacked up almost day by day.

In particular, the annotation auto-maintenance pipeline on April 21 is where the "humans alone can't do this, but AI can" promise from Part 1 got cashed in. From that point on, annotation shifted from "humans grind through writing them" to "design the whole operation assuming AI writes them."

What Still Isn't Solved

Solving the entry-point problem didn't make everything clean. A few issues remain.

1. Maintaining Annotation Coverage

The frontend side is annotated heavily. Backend / Go / batch are still thin. Some nodes will always be missing annotations — that's structural, and you can't drive it to zero. It's an ongoing operational issue.

2. Bridge Mis-Joins Aren't Fully Eliminated Structurally

The Page bridge in particular has cases where multiple annotation Pages map to the same boundary. That's structural and unavoidable. Adding more strategies got coverage to 100%, but coverage isn't correctness: guaranteeing that every one of those joins is right is much harder.

3. No Dynamic Analysis

The graph only carries the fact that "this edge exists statically." How often that edge actually gets used in production isn't recorded. Piping production execution counts back into the static graph and surfacing dead-code edges as a separate signal — that's still untouched.

4. Onboarding Cost When a New Repo Joins Production

Every time a new repo enters production, the bridge normalization rules and per-repo patterns need adjusting. This is the annotation-graph-side version of Part 1's fourth issue (the cost of adding a new parser for every new boundary pattern).

Closing: Not "Thrown Away," but "Evolved"

In Part 1's closing note, I touched on the fact that the cortex side (the internal AI platform) bailed out of the code-graph approach early and bet on an annotation-based knowledge graph instead. The bail-out was fast enough that calling it "thrown away" wouldn't be wrong — but looking back across this whole series, the more accurate word is "evolved."

What it evolved into, in the end, is three graphs joined as peers:

code-graph (structure)
db-graph (DB context)
annotation graph (boundary intent)

Joined by SAME_ENTITY, served to the agent through MCP. The thing static analysis alone couldn't deliver — querying by meaning — became workable by reusing the db-graph success pattern and adding minimal annotations only at the boundaries.

And one more framing: paired with the AI Harness Series, Parts 1–6, this series sits as:

AI Harness series — how to live with AI when you're assembling the system from scratch yourself
code-graph-deep-dive series (Part 1 + Part 2) — how to live with AI inside an existing organization's running production system

= the same philosophy (design without trusting AI), implemented under two different sets of constraints.

Thanks for reading this far.

Got the Top 7 Badge — honestly thrilled 🙌

Ryosuke Tsuji — Wed, 24 Jun 2026 00:21:42 +0000

Cyberpunk cat RPGs and robot personalities

Jess Lee for The DEV Team

Jun 23

Building One Knowledge Graph Across 46 Repositories With Static Analysis

Ryosuke Tsuji — Mon, 22 Jun 2026 23:54:01 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

This post is about unifying a production codebase spanning 46 repositories across multiple services into one knowledge graph, using static analysis.

Internally we call it code-graph, and I built it between January and March of this year.

Three things I want to write down:

Why "just letting AI read the code" isn't enough, and why I had to chase down the connections that cross repository boundaries
How I extracted boundaries across 46 repos and a zoo of frameworks (jQuery / AngularJS / Express / NestJS / TypeORM / Redux Axios ...)
What 3 months of trial and error solved, and what it didn't

This is Part 1, covering the construction of code-graph itself, the painful parts, and the issues that remained. Part 2 is about service-product-graph (SPG) — a layer I built on top of code-graph to compensate for what static analysis couldn't do alone.

What Was This For?

A long-running production codebase usually looks something like this:

Multiple services and multiple teams touching it
Each era's framework still alive and mixed in
Dependencies via API, DB, and Event are tangled — not clean 1:1 front-to-back relationships:
- The same API gets called from multiple repositories (= n:1 callers)
- The same DB table is written to and read from across multiple repositories (= n:n)
- For Events, just looking at the emit side doesn't tell you how completely the subscribe side is covered — it's practically untraceable

The starting point was wanting to ask AI: "show me the blast radius," "tell me what breaks if I change this," — across this entire codebase.

The naive answer is: "just hand all 46 repositories worth of code to AI and let it analyze."

But that doesn't work, for two reasons:

Context window: 46 repositories × years of accumulated code is just not a size you can hand to an AI in one shot
Hallucination: even if you could, "read everything and extract the relationships" is an inference task. It misses things, it makes mistakes. That's not usable for impact analysis on a production system

So the first idea I landed on was: build a knowledge graph externally, via static analysis. That's the starting point of code-graph.

Scale: 46 Repositories

The target splits into two graphs:

air-closet graph (37 repos): a graph that spans multiple services like airCloset, Men's, WMS, and more
mall graph (9 repos): airCloset Mall and related

So 46 repositories in total.

The thing to notice is that this isn't "one service with 37 repos." It's a collection of multiple services that adds up to that scale. Making the dependencies that cross service boundaries visible as cross-repo edges is exactly what the boundary nodes discussion below is about.

Why Boundary Nodes Matter

This is the heart of the article.

When you want AI to understand code, getting it to "read what's in front of it, plus what's next to it" is honestly not hard. grep, open the file, hand it to the model — that works fine.

For a small codebase, that's enough. But at scale, you hit the context window and hallucination problems mentioned above. I suspect most readers can relate.

One way to improve this is to statically analyze the codebase, convert it into a knowledge graph, and serve it to AI through MCP. That's the approach.

The first step was static analysis with tree-sitter (an OSS library that parses source code into syntax trees — it supports a lot of languages and is what VS Code and similar editors use for syntax highlighting; I genuinely recommend it if you want to build something in this space). It's a great tool, but on its own it doesn't solve everything.

What it doesn't solve is tracing relationships that cross boundaries — APIs, databases, and so on. tree-sitter can extract the relationships between variables, functions, and other in-language constructs. But it can't extract those boundaries.

The thing that humans and AI alike get stuck on, in practice, is exactly that — code connections that cross boundaries:

The same API is being called from another repo you weren't looking at
- The frontend in repo A and the nightly batch in repo C might both hit /api/v1/users/me
- Looking at just one of the repos, AI has no way of knowing
The same DB table is being read or written by some batch process you don't know about
- When you're modifying service-side code, some batch in a different location might be reading and writing the same table
- Misjudge the blast radius and you get data inconsistency
The subscribers for this event might not be fully accounted for
- With distributed pub/sub, looking only at the emit side doesn't let you cover the subscribe side
- Something runs somewhere you don't know about

In short: getting AI to understand the code on the other side of a boundary, without hallucinating. That's the goal.

If you have boundary nodes, AI can answer "this API is also called from repo X" as a fact. Instead of asking AI to infer, you hand it a fact that's already been resolved.
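
As a minimal sketch of what "handing over a resolved fact" can look like, the Python below stores boundary edges as plain records and answers "who calls this API?" by lookup rather than inference. The Edge shape, the repo names, and the function names are made up for illustration; only the edge types CALLS_API and HANDLES_API come from this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    kind: str      # "CALLS_API" or "HANDLES_API"
    repo: str      # repository the source node lives in (hypothetical names)
    source: str    # function or file node
    boundary: str  # boundary node, here an API path


# Toy edge list; every repo and function name is invented for this example.
EDGES = [
    Edge("HANDLES_API", "repo-b", "UsersController.me", "GET /api/v1/users/me"),
    Edge("CALLS_API", "repo-a", "fetchCurrentUser", "GET /api/v1/users/me"),
    Edge("CALLS_API", "repo-c", "nightlySyncBatch", "GET /api/v1/users/me"),
]


def callers_by_repo(edges, boundary):
    """Return {repo: [caller, ...]} for every CALLS_API edge into `boundary`."""
    result = defaultdict(list)
    for edge in edges:
        if edge.kind == "CALLS_API" and edge.boundary == boundary:
            result[edge.repo].append(edge.source)
    return dict(result)


print(callers_by_repo(EDGES, "GET /api/v1/users/me"))
```

The answer is either in the edge list or it isn't. There is no step where a model has to guess whether repo-c exists.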

Yes, there is inference during the extraction phase — TypeScript Compiler and Gemini both contribute. But the results are persisted as confirmed values in the graph, and a daily boundary-analysis cron (covered below) lets us notice drift the next morning. By the time AI consumes the graph, only verified facts flow to it.

AI has a tendency to answer "with whatever it can see" rather than saying "I don't know." That's where silent hallucinations creep in — wrong answers that neither AI nor the human catches. Boundary nodes are what physically prevents that. They give AI a verified place to stand.

Construction: tree-sitter Base, With TypeScript Compiler and Gemini Where Needed

Normal code structure (function calls, class inheritance, imports) is relatively straightforward to extract with tree-sitter. Walk the AST, turn functions / methods / classes / fields into nodes, connect references with edges. Just grind through it.

The catch is that while tree-sitter is great at building syntax trees, it's weak on type information and scope resolution. To accurately follow a field access chain like user.preferences.theme, you need to resolve what type the variable user is and where it's defined. tree-sitter alone can't reach that.

So for field-access resolution we use TypeScript Compiler API and Gemini in combination. tree-sitter extracts the structure → TypeScript Compiler resolves variables and types → for the dynamic cases that even that can't reach, Gemini infers. Three stages with distinct responsibilities, which is how we push field-access accuracy up.
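
The three stages can be pictured as a fallback chain that also records which stage produced each answer. This is a sketch under assumptions, not the production pipeline: a string split stands in for tree-sitter, a dict lookup for the TypeScript Compiler API, and fake_infer for a Gemini call. All function names and the symbol_types table are placeholders.

```python
def extract_chain(expression):
    """Stage 1 (tree-sitter's role): purely syntactic split of an access chain."""
    return expression.split(".")


def resolve_static(chain, symbol_types):
    """Stage 2 (TypeScript Compiler's role): look up the root variable's declared type."""
    return symbol_types.get(chain[0])


def resolve_field_access(expression, symbol_types, infer):
    """Return (root_type, stage), where stage records which step produced the answer."""
    chain = extract_chain(expression)
    root_type = resolve_static(chain, symbol_types)
    if root_type is not None:
        return root_type, "static"
    inferred = infer(expression)  # Stage 3 (Gemini's role): only for what is left over
    if inferred is not None:
        return inferred, "inferred"
    return None, "unresolved"


def fake_infer(expression):
    """Stand-in for an LLM call; pretends it can guess types for ctx.* accesses."""
    return "Order" if expression.startswith("ctx.") else None


symbol_types = {"user": "User"}  # hypothetical declared types

print(resolve_field_access("user.preferences.theme", symbol_types, fake_infer))
print(resolve_field_access("ctx.order.total", symbol_types, fake_infer))
print(resolve_field_access("tmp.value", symbol_types, fake_infer))
```

Keeping the stage label next to each result is one way to tell compiler-resolved facts apart from model-inferred ones when the graph is reviewed later.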

We define 21 edge types:

CALLS (function call) / EXTENDS (inheritance) / IMPLEMENTS (interface implementation), etc. — the basic structure tree-sitter can give us
CALLS_API (caller) / HANDLES_API (handler) — API boundary
EMITS_TO (emitter) / SUBSCRIBES_TO (subscriber) — Event boundary
WRITES_TO / READS_FROM — DB boundary
and more

The real battle starts when you try to extract the boundary edges (CALLS_API / HANDLES_API / EMITS_TO / SUBSCRIBES_TO / WRITES_TO / READS_FROM).

Extracting and Joining Boundary Nodes: 3 Months of Trial and Error (Jan–Mar)

Unlike normal code, boundaries (API endpoints, DB tables, Event topics) are written in wildly different ways depending on the framework, language, technical area, library, repository, and the person who wrote it.

Take "define an API endpoint": is it Express? NestJS with a @Get() decorator? A Fastify route? Each one produces a completely different AST shape. And the same repo can contain multiple patterns simultaneously.

And it's not just extraction that's hard. Joining the extracted boundaries on the graph is its own headache. For the same API path or DB table name, you get:

Casing variation: camelCase / snake_case / PascalCase
Trailing-slash variation (/users/me vs. /users/me/)
The boundary name itself is a variable (${baseUrl}/users/me)

…all mixed together. Normalizing all of that and correctly joining caller to handler, emitter to subscriber, writer to reader was genuinely the painful part.
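
To make the joining problem concrete, here is a small sketch of the kind of normalization involved. The rules below are assumptions chosen for illustration (strip a leading ${...} prefix, force one leading slash and no trailing slash, map table names to snake_case). They are not the production rules, and real code has to handle many more cases.

```python
import re


def normalize_api_path(raw):
    """Reduce an API path to a canonical join key (assumed rules, for illustration)."""
    path = re.sub(r"^\$\{[^}]+\}", "", raw.strip())  # drop a leading ${baseUrl}-style prefix
    return "/" + path.strip("/")                      # one leading slash, no trailing slash


def normalize_table_name(raw):
    """Map camelCase / PascalCase / snake_case table names onto snake_case."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", raw)  # insert "_" before inner capitals
    return snake.lower()


for raw in ("${baseUrl}/users/me", "users/me", "/users/me/"):
    print(raw, "->", normalize_api_path(raw))

for raw in ("UserOrders", "userOrders", "user_orders"):
    print(raw, "->", normalize_table_name(raw))
```

Once both sides of a boundary reduce to the same key, joining caller to handler or writer to reader becomes an equality check. The hard part is everything these few lines leave out.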

And this had to happen across all 46 repositories × the framework zoo.

Looking back at the actual git history from that period, you see new parsers and detectors being added almost every week, noise filters going in, and concept renames landing. A note on prefixes before the list: the commit prefix starts as graph-rag (the stack was originally named after the "knowledge graph + RAG for LLM consumption" framing) and is renamed to code-graph on February 15; a few late-February commits still carry a short-lived graph prefix from that transition. Here are the main commits from January through March, in order:

January: Starting Out, and Realizing tree-sitter Alone Isn't Enough

2026-01-15 ─ feat(graph-rag): add TypeScript parser with tree-sitter — the starting commit
2026-01-15 ─ feat(graph-rag): add graph builder with BigQuery storage — graph data is written to BigQuery
2026-01-19 ─ feat(graph-rag): add TypeScript Compiler-based variable resolution for field extraction — realized that tree-sitter alone couldn't resolve variable types, brought in the TypeScript Compiler API alongside it

February: Framework Diversity, Fighting Noise

2026-02-02 ─ feat(graph-rag): add frontend parser for jQuery/Vanilla JS codebase — jQuery / Vanilla JS frontend code
2026-02-03 ─ feat(graph-rag): add AngularJS Page detection for frontend BFS — AngularJS page detection (older framework, still very much running)
2026-02-15 ─ refactor(code-graph): consolidate 18 MCP tools into 5 with deep subgraph traversal — the toolset had ballooned to 18, consolidated to 5 (also the moment the stack was unified under the name code-graph)
2026-02-18 ─ fix(code-graph): reduce graph noise by filtering Type nodes, external lib CALLS, and Storybook files — noise reduction: filter out Type nodes, external library CALLS, Storybook files
2026-02-19 ─ fix(code-graph): extract path aliases from tsconfig paths in addition to make-symlink + fix(code-graph): resolve @alias path imports for CommonJS symlink patterns — the path-alias pain: tsconfig paths, make-symlink, and on top of that the CommonJS symlink pattern — three different mechanisms to support
2026-02-19 ─ feat(code-graph): add stop_at=boundary option to trace_connections — option to stop traversal at boundary nodes (explicit traversal scoping / node-explosion mitigation)
2026-02-21 ─ feat(graph): add typeORM JOIN detection, NestJS decorator parsing, Fetcher API detection — TypeORM JOINs / NestJS decorators / Fetcher API support
2026-02-21 ─ fix(graph): pass fullFileCode to Redux Axios variable resolver for scope-based extraction — Redux Axios variable resolver fix

March: Concept Cleanup and Precision

2026-03-08 ─ refactor(code-graph): rename __external__ to __boundary__ — concept cleanup: standardize on "boundary node" rather than "external resource"
2026-03-16 ─ refactor: remove db-dictionary from code-graph stack — split the DB schema dictionary (the layer that lets you look up table / column definitions) off into its own graph to evolve independently
2026-03-24 ─ fix(code-graph): infer table names from dynamic variable names — table-name inference from dynamic variable names
2026-03-24 ─ feat(code-graph): add orphan boundary node cleanup script — cleanup script for orphan boundary nodes

What This Timeline Tells You

Almost every week there's a new framework or pattern being handled. The work of "extracting boundary nodes" is, fundamentally, adding parsers for each new way people write the boundary.

Just listing the frameworks / mechanisms that showed up:

tree-sitter (TypeScript / JavaScript / Go / Dart (Flutter))
TypeScript Compiler (variable resolution)
jQuery / Vanilla JS
AngularJS
Express / Koa / Fastify
NestJS (decorator parsing)
TypeORM (DB JOIN detection)
Fetcher API
Redux Axios (variable resolver)
3 different path-alias schemes (tsconfig paths / make-symlink / CommonJS symlink)

This isn't a "TypeScript / JavaScript / Go / Dart static analysis" story you can wrap up in one sentence. The air-closet codebase is a collection of long-running production systems where every era's framework still coexists. We had to pick up, from the AST, the era-specific meaning of "here's an API endpoint," "here's a DB call," "here's an Event subscription."

Why I Was So Particular About Accuracy

90% recall is completely unusable.

Take "list every piece of code that calls this API." If you recall only 90% of the callers, then 10% of the relevant code is invisible to AI. When you're using code-graph for blast-radius investigation, that invisible 10% is what causes the incident. That's single-hop recall.

And it gets worse the further you walk. For multi-hop graph traversal, every hop multiplies in: at 0.9 per hop you get 0.81 at 2 hops, 0.729 at 3, ~0.59 at 5, ~0.35 at 10; by 7 hops you're already below half. Push it to 0.99 and you get 0.98 at 2 hops, 0.95 at 5, ~0.90 at 10. Whether the system is usable in practice is decided by that single-digit difference between 90% and 99%, and it bites you on both axes: single-hop recall when you're enumerating, multi-hop confidence when you're traversing.
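
The compounding is easy to reproduce. The sketch below assumes each hop is extracted independently with the same per-hop recall, which is a simplification, and simply raises that recall to the number of hops.

```python
def path_recall(per_hop, hops):
    """Chance that every hop on a path was extracted, assuming independent hops."""
    return per_hop ** hops


def hops_until_below(per_hop, threshold):
    """Smallest hop count at which path recall falls below `threshold`."""
    hops = 1
    while path_recall(per_hop, hops) >= threshold:
        hops += 1
    return hops


for per_hop in (0.90, 0.99):
    row = ", ".join(
        f"{hops} hops: {path_recall(per_hop, hops):.3f}" for hops in (1, 2, 3, 5, 10)
    )
    print(f"per-hop {per_hop:.2f} -> {row}")

print("0.90 per hop drops below 0.5 at", hops_until_below(0.90, 0.5), "hops")
```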

So every time a new boundary pattern showed up, we'd add a new custom parser, aiming to keep the boundary connection rate above 99%. We can't measure extraction recall directly — there's no ground-truth "every boundary that should exist" denominator — so the indicator we actually measure daily is "what fraction of callers / handlers are correctly connected on the graph" = the connection rate. The next section is about how that's monitored.

Boundary Analysis Is Running Today

The code-graph we built is still running daily.

Concretely, a boundary-analysis cron runs at 7:00 JST every morning. What it does:

API boundaries: match CALLS_API (caller) with HANDLES_API (handler), and aggregate cross-repo connection rates
Event boundaries: match EMITS_TO (emit) with SUBSCRIBES_TO (subscribe)
DB boundaries: aggregate cases where WRITES_TO and READS_FROM from different repositories touch the same table (= implicit cross-repo DB dependency)
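
The DB-boundary aggregation in the last item can be sketched as a group-by on table name. The tuple layout and the repo and table names below are hypothetical; the idea is only that a table with both writers and readers, spread over more than one repository, is an implicit cross-repo dependency.

```python
from collections import defaultdict


def implicit_db_dependencies(edges):
    """edges: iterable of (kind, repo, table).

    Return {table: (writer_repos, reader_repos)} for tables that are both written
    and read, with more than one repository involved overall.
    """
    writers, readers = defaultdict(set), defaultdict(set)
    for kind, repo, table in edges:
        if kind == "WRITES_TO":
            writers[table].add(repo)
        elif kind == "READS_FROM":
            readers[table].add(repo)
    shared = {}
    for table in writers.keys() & readers.keys():
        if len(writers[table] | readers[table]) > 1:
            shared[table] = (sorted(writers[table]), sorted(readers[table]))
    return shared


# Invented sample data: "orders" crosses repos, "sessions" stays inside one.
DB_EDGES = [
    ("WRITES_TO", "service-api", "orders"),
    ("READS_FROM", "nightly-batch", "orders"),
    ("WRITES_TO", "service-api", "sessions"),
    ("READS_FROM", "service-api", "sessions"),
]

print(implicit_db_dependencies(DB_EDGES))
```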

The day-over-day numbers get compared, and if the connection rate drops by more than 5%, we get a Grafana alert.

This whole thing only makes sense because we have boundary nodes to compare against. We're monitoring the quality of the extracted boundaries themselves on a daily cadence. The kind of drift the connection rate catches by the next morning: "a parser fell behind a new pattern and a class of boundaries went invisible," "the repository layout changed and path aliases stopped resolving." There are failure modes the connection rate alone can't see — a caller-side parser regression that drops callers entirely will leave the surviving handlers still looking "connected" to whatever callers remain, and the missing ones slip out silently. That's a separate axis we cover with day-over-day absolute node counts per repo / pattern.
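
A toy version of the two checks shows why both are needed. Several things here are assumptions: the connection rate is computed from the caller side only, the 5% threshold is read as percentage points (the article does not say whether it is absolute or relative), and the 20% node-count threshold is an arbitrary placeholder.

```python
def connection_rate(caller_keys, handler_keys):
    """Fraction of caller-side boundary keys that have a matching handler."""
    callers = list(caller_keys)
    if not callers:
        return 1.0
    handlers = set(handler_keys)
    return sum(1 for key in callers if key in handlers) / len(callers)


def rate_alert(yesterday, today, max_drop=0.05):
    """True when the rate fell by more than max_drop (read here as percentage points)."""
    return (yesterday - today) > max_drop


def count_alerts(yesterday, today, max_relative_drop=0.2):
    """Flag (repo, pattern) keys whose node count shrank by more than the threshold."""
    flagged = {}
    for key, before in yesterday.items():
        after = today.get(key, 0)
        if before > 0 and (before - after) / before > max_relative_drop:
            flagged[key] = (before, after)
    return flagged


handlers = ["/a", "/b", "/c", "/d"]
callers_yesterday = ["/a", "/b", "/c", "/d"]
callers_today = ["/a", "/b"]  # a caller-side regression silently dropped two callers

rate_yesterday = connection_rate(callers_yesterday, handlers)
rate_today = connection_rate(callers_today, handlers)
print(rate_yesterday, rate_today, rate_alert(rate_yesterday, rate_today))

counts_yesterday = {("repo-a", "CALLS_API"): len(callers_yesterday)}
counts_today = {("repo-a", "CALLS_API"): len(callers_today)}
print(count_alerts(counts_yesterday, counts_today))
```

In this example the connection rate stays at 1.0 on both days, so the rate check stays quiet, while the node-count check flags that half of the callers in repo-a disappeared.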

What Still Doesn't Work

Even after all that, a handful of issues remain that I can't solve at the root.

1. No Semantic Search (an Entry-Point Problem)

The search MCP tool only does LIKE-based substring matching.

If you're in the middle of development and want to follow connections starting from a function you're already looking at, that's fine — you can pull it up by function name or filename directly.

The problem shows up when you're investigating a production bug or a customer support ticket. You have no idea what filenames or function names are involved at the start. When the input is "the subscription-fee calculation for members seems off," and you want to walk to the related code from there, no natural-language query into the graph means you can't find the entry point in the first place.

The intent was: "instead of grepping the whole codebase, navigate relevance via graph RAG." What we ended up with is a structure where you have to grep at the entry point and infer your way in.
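
The entry-point problem is easy to reproduce with any LIKE-based lookup. The sketch below uses an in-memory SQLite table as a stand-in for the graph store (the real graph data is in BigQuery), and the node names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT, file TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        ("calcMonthlyFee", "billing/fee.ts"),         # hypothetical node names
        ("applyPlanDiscount", "billing/discount.ts"),
    ],
)


def search(term):
    """Substring match on node name, mirroring a LIKE-based search tool."""
    rows = conn.execute(
        "SELECT name FROM nodes WHERE name LIKE ? ORDER BY name", (f"%{term}%",)
    )
    return [name for (name,) in rows]


print(search("Fee"))                           # works once you already know an identifier
print(search("subscription-fee calculation"))  # a ticket-style phrase matches nothing
```

The second query is the support-ticket case: the words describe the right code, but no node name contains them as a substring, so the search returns an empty list.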

2. Node Explosion

If you naively turn the AST into a graph, every builtin function, anonymous function, and internal utility becomes a node. The map call you don't care about, the internal helper you don't care about — they're all nodes.

Trigger a traversal starting from one node, and within a few hops you're dragging in helpers, types, and primitives until the node count explodes. There's no axis built into the graph structure for "filter by relevance."

We work around it with explicit controls like stopping traversal at boundary nodes, but that's a workaround, not a root fix.
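
The stop-at-boundary control amounts to a breadth-first traversal that records boundary nodes but does not expand them. This is a sketch under assumptions: the adjacency dict, the "boundary:" id prefix, and the max_hops default are invented here and do not reflect the actual trace_connections implementation.

```python
from collections import deque


def trace(graph, start, is_boundary, stop_at_boundary=True, max_hops=5):
    """Breadth-first traversal; boundary nodes are recorded but optionally not expanded."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget used up on this path
        if stop_at_boundary and node != start and is_boundary(node):
            continue  # keep the boundary node itself, but do not walk through it
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen


# Hypothetical adjacency list; "boundary:" marks boundary nodes in this toy graph.
GRAPH = {
    "saveOrder": ["boundary:orders", "validate"],
    "boundary:orders": ["batchExport", "reportJob"],
    "batchExport": ["formatCsv"],
}


def is_boundary(node):
    return node.startswith("boundary:")


print(sorted(trace(GRAPH, "saveOrder", is_boundary)))
print(sorted(trace(GRAPH, "saveOrder", is_boundary, stop_at_boundary=False)))
```

Stopping at the boundary keeps the result to three nodes. Walking through it pulls in everything on the other side as well, which is where the explosion starts.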

3. To Know What a Function Actually Does, You Still Have to Read the File

The graph tells you "something is here," "this calls out to another repo." But what the function actually does still requires opening the file.

That makes the graph slow on its own. The codebase-investigation tool we built later uses the graph to narrow down candidate files and then hands those to Git Server MCP to actually read — but the underlying graph-only resolution limit doesn't go away.

4. Operational Cost of Adding Parsers for Every New Boundary Pattern

Every time a new framework or library enters the codebase, we have to learn "how do they write boundaries in this thing" and add a new parser.

The parser directory already has 10+ custom detectors / extractors. There's no sign of the maintenance and extension cost going down — every time a new tech stack enters the codebase, the same work repeats.

Side Note: A Different Call Elsewhere — cortex

Note: "cortex" in this section is the internal codename for an AI platform I've been building in-house at airCloset. Unrelated to existing commercial products like Snowflake Cortex or Palo Alto Networks Cortex.

Setting code-graph aside for a moment: I also have a separate project — cortex — where I'm building an in-house AI platform from scratch (currently a single monorepo with 100+ apps).

On that project I did initially try the same approach as code-graph, but bailed out early and went with an annotation-based knowledge graph instead:

It's a monorepo I'm assembling myself, so I can realistically annotate everything at once
Use JSDoc tags to write intent directly into the code, and build the graph from that
Vectorize that intent and store it on the node, so semantic search works
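
As a rough illustration of the "write intent into the code, build the graph from that" step, the sketch below pulls JSDoc tags off a function and turns them into node attributes. The tag names @graph-intent and @graph-entity are hypothetical, the regexes only cover this one declaration style, and the vectorization step is left out entirely.

```python
import re

# A TypeScript snippet with hypothetical @graph-* tags, embedded as a string.
SOURCE = '''
/**
 * @graph-intent Calculate the monthly subscription fee for a member
 * @graph-entity subscription
 */
export function calcMonthlyFee(memberId: string) {}
'''

TAG = re.compile(r"@graph-([\w-]+)[ \t]+(.+)")
BLOCK = re.compile(r"/\*\*(.*?)\*/\s*export function (\w+)", re.S)


def extract_annotations(source):
    """Return {function_name: {tag: value}} for JSDoc blocks directly above a function."""
    nodes = {}
    for body, name in BLOCK.findall(source):
        nodes[name] = {tag: value.strip() for tag, value in TAG.findall(body)}
    return nodes


print(extract_annotations(SOURCE))
```

The intent string is what would later be embedded and stored on the node, so that a natural-language query has something to match against.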

The decision to "write intent into the code and graph it" — and the trial and error that led to it — I covered in detail in a separate series. If interested: AI Harness Series, Part 2 (The Knowledge Graph at the Heart of cortex)

Annotation-Based Won't Work for Production Systems

And no, you can't take the same approach for the production-side codebase that code-graph deals with:

Annotating all 46 repos at once isn't realistic
Long-running production systems, touched by multiple teams, with mixed frameworks
The precondition "put annotations into the code" doesn't hold

So the choice was: keep code-graph (static analysis) as the base, and evolve by layering on additional graph layers to compensate.

How we're trying to solve the issues above, I cover in Part 2.

To Be Continued

That's it for Part 1. Part 2 covers how we try to get past the issues above.

The real story is less "thrown away" and more "evolved."

Thanks for reading this far.

AI Isn't Something to Trust — It's Something to Design

Ryosuke Tsuji — Tue, 16 Jun 2026 00:02:03 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

Across the five posts of this series I've worked through how cortex's harness is put together, one piece at a time: the overall picture, the knowledge graph, Auto Review, Self-Healing + Recurrence Prevention, and non-engineer PRs. Having walked through all of them, I want to step one level down for the wrap-up. Why am I building this thing in the first place? That's what this post is about.

The five posts might look independent, but the root is one thing, and the series doesn't close cleanly without that one thing being put into words. Together with the philosophy, I want to look back at the failures that don't show up when you only write about what worked — what I threw away, where I tripped — as a reference point for anyone trying something similar.

Series Index

#	Theme	Key scene	Article
1	Series intro: cortex's harness	PRs auto-merge / incidents self-heal before you notice	ai-harness-intro
2	Product Graph (cpg)	Code, docs, DB, infra unified into one graph	cortex-product-graph
3	AI PR review	webhook → AI review → auto-fix → squash merge	cortex-auto-review
4	Self-Healing + observability + auto-added guardrails	Alert → AI investigates → fix PR + new lint/type gate → auto redeploy	cortex-self-healing
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	cortex-non-engineer-prs
6	Series wrap-up	The underlying philosophy plus a retrospective on the failures and lessons	This post ← you are here

Origin — What I Was Thinking About in 2025

When I started building cortex, there was one question I wanted to answer:

How do I get AI to understand the system accurately?

If AI could understand the system accurately, then PR review, bug investigation, and fixes could all be delegated, and even non-engineers could open up their own development. Conversely, as long as I was stuck on "understand it accurately," everything downstream was sitting on unstable ground. So I spent a lot of time on the prerequisite layer before any of the individual mechanisms.

The two obvious approaches both hit walls.

Wall 1: The Context Window Limit

The first reflex is "just give it all the information it might need." Stuff the codebase, docs, DB schema, infra definitions all into the prompt, and AI gets the whole picture.

That fails on size. Codebase + docs + schemas + infra at our company doesn't come close to fitting into any realistic context window.

"Surely context windows will keep growing, and this'll work eventually?" — the more I thought about it, the less of a future I saw in that direction.

Even with a model that has a very large context window, like Gemini, behavior gets unstable when you push it close to the limit. Information in the middle gets dropped, and irrelevant tokens skew the conclusion. This isn't a model-selection problem; it's a structural attention problem. The more unrelated tokens you mix in, the smaller the share of attention that lands on the relevant ones. This is the documented "lost in the middle" phenomenon (information placed at the start and end of long inputs gets used; information placed in the middle is effectively ignored). Stuff the context window full and you routinely end up in a state where the information you thought you handed over isn't actually visible to the model.

"Lost in the middle" itself may get mitigated as long-context models improve, so I treat it as empirical supporting evidence rather than the core argument. The real wall is the recursive one beneath it: even if "size" is solved, you immediately need a higher-level context to judge which tokens are necessary and which aren't. That problem is recursive and can't be resolved by context window size, in principle. Information has to be structured, or AI doesn't make correct judgments. That's true of humans too — but humans are a notch better off, because LLMs don't notice they don't know, and they answer with confidence anyway. Silently wrong is worse than visibly stuck.

The keep-growing-context-windows path didn't have a real resolution in sight.

Wall 2: Don't Lean on Learning Either

The other obvious move is to make AI itself learn. Fine-tune per organization, teach it our codebase, our docs, our business. I considered it. Currently not doing it.

Two reasons. One: getting this kind of learning into actual production was still research-phase in 2025, and it still is in 2026 as I write this; the road to real deployment is long. The other is thornier: even if the model could learn it, "forgetting" is extremely hard.

A business system has to reflect "the current truth." When the design changes, the DB schema changes, the business rules change, you want to actively erase old knowledge. But "delete just this piece of what's baked into the LLM weights" is unsolved at the research level. There's even a field name for it, machine unlearning, which tells you how hard it is. On top of that, teaching the model new things can also destroy unrelated existing knowledge (known as catastrophic interference, or catastrophic forgetting). Lean on learning and both hit at once: the cost of keeping things consistent explodes.

Rather than treating "doesn't learn" as a downside, I came around to: because it doesn't learn, swapping out the external knowledge is enough to reflect the current state, and the consistency story is much simpler. That was the call at the time.

The Way Out — GraphRAG + MCP

With no future in the context-window direction or the learning direction, I came across the GraphRAG concept.

GraphRAG itself is widely discussed elsewhere; for me, what it meant was the framing: "supply only the context that's needed, at the moment it's needed." Combined with MCP (Model Context Protocol, the open protocol Anthropic introduced for connecting LLMs to external tools), AI can go fetch what it needs on its own.

What was decisive was that this structure lets AI traverse the graph agentically. Rather than "read everything and find related parts by inference," AI gets to the node it needs and pulls the fact out. Which leads to:

Instead of making AI infer, supply facts as context.

That one sentence became the core of cortex's entire design philosophy.

The first thing I built was a static-analysis-based code-graph. After trial and error I threw it away and arrived at the annotation-based product-graph (cpg); details are in the trial-and-error section.

I Don't Trust AI to Begin With

The origin section in one line:

I don't trust AI to fill in the blanks for me.

"Don't trust" here is not the same as "have no faith in." This isn't doubting Claude / GPT / Gemini's generation quality. What I mean is:

It doesn't know context it wasn't handed.
It doesn't, on its own and without being told, produce the ideal state.

The first one is a truth no amount of model progress will change. Architecturally, LLMs can't know things that weren't in the training data and aren't in this session's context. "Surely smarter models will pick up on it" — I don't think that future is coming. Smarter is a real direction; smart alone doesn't compensate for not knowing.

The second one is about responsibility, and humans owning it. AI can't decide on its own what "ideal" means. When it tries, it lands on a generic best-practice answer slightly off from the actual situation. Ideal depends on the business, the organization, the moment in time — none of which is visible to AI unless a human verbalizes it and hands it over.

So that conviction is not underestimating AI's capability; it's a design decision to not let AI auto-complete the prerequisites.

Mastering AI is not about giving it freedom — it's about confining its output to a predictable range.

The mechanism for confining it is the harness this series has been describing.

So I Build Harnesses to Hold AI to Determinism

Reading each post through the lens of "don't make AI infer; lean on determinism" shows that all five are the same conviction showing up in different layers.

Part 2 — Knowledge Graph: Instead of making AI search the codebase, this mechanism tilts toward making the codebase legible. With @graph-* annotations, code / docs / DB / infra are unified into one graph, so AI doesn't have to grep + infer to find related parts. This is the direct implementation of "supply facts as context" from the origin section. → cortex-product-graph

Part 3 — Auto Review Dimensions: Nine review dimensions (responsibility / severity / type SSoT / etc.) are fixed in advance. When AI does the review, what to check isn't something it gets to infer. "Looking at the PR as a whole" gives AI too much room for inference, so dimensions are split and each is judged as its own question. Dimensions = locked by the harness, evaluation = AI's job. → cortex-auto-review

Part 4 — Self-Healing + Recurrence Prevention: Alert → investigation → fix PR → redeploy. The flow itself is fixed. AI doesn't get to think through "how should we respond to incidents" each time. And Recurrence Prevention — adding lint / CI gates so the same trap can't be stepped on twice — is mechanical refusal at the gate, not trust-AI-not-to-do-it-again. Or put differently: I don't expect AI never to repeat a mistake. → cortex-self-healing

Part 5 — Non-Engineer PRs: If the harness weren't holding quality, business-side folks opening PRs directly to production wouldn't survive a single day. Conversely, with the three mechanisms above stacked up (context locked, dimensions locked, traps locked out mechanically), the person closest to the requirements can ship the change directly. The translation layer and the engineering priority queue disappeared as a downstream consequence of the determinism push. → cortex-non-engineer-prs

So what's covered across the five posts is "don't make AI infer; lean on determinism" implemented at different layers. The root is one conviction.

What "Don't Make AI Infer, Lean on Determinism" Actually Means

Let me sharpen this phrase that's come up a few times.

"Lean on determinism" does not mean "give AI zero room to infer." Code generation, judging review findings, hypothesizing root causes from error logs — these are domains where AI not inferring is the same as no work getting done.

Where I want to lean on determinism is in domains where variance isn't allowed. Specifically:

Which part of the codebase to look at — don't have AI guess by analogy; pull it deterministically from the knowledge graph
Which review dimensions to apply — don't let AI pick "the important-looking dimensions"; lock the dimension list in advance
How to respond to incidents — don't make AI think through the workflow each time; fix the alert → fix PR path
Not stepping on the same trap twice — don't ask AI to "try to be careful"; let lint / CI mechanically refuse it

What implements this line — where inference is allowed vs. where it isn't — is the harness. To borrow the metaphor from Part 5, the harness lays down rails you can't fall off. On top of the rails, AI runs free (inference works as inference); but it can't fall off the rails sideways.
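
For the review-dimension item in the list above, the split between "locked by the harness" and "left to AI" can be sketched in a few lines. Only three of the nine dimension names appear in this post, so the tuple below is deliberately incomplete, and fake_judge is a stand-in for a model call.

```python
# Three of the nine dimensions are named in the text; the others are omitted here.
REVIEW_DIMENSIONS = ("responsibility", "severity", "type SSoT")


def run_review(diff, judge):
    """Ask one narrowly scoped question per fixed dimension.

    The harness owns the list of dimensions; the judge only evaluates each one.
    """
    return {dimension: judge(dimension, diff) for dimension in REVIEW_DIMENSIONS}


def fake_judge(dimension, diff):
    """Stand-in for a model call; flags a redefined type only under 'type SSoT'."""
    if dimension == "type SSoT" and "type User =" in diff:
        return "finding: duplicates an existing type definition"
    return "ok"


result = run_review("type User = { id: string }", fake_judge)
for dimension, verdict in result.items():
    print(f"{dimension}: {verdict}")
```

The judge is free to reason inside each question, but it never gets to decide which questions are asked. That is the rail.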

Put differently, this is equivalent to the framing in Part 2 (cortex-product-graph): "where hallucination gets confined." Saying "no inference allowed" isn't quite right — the harness isn't a thing that makes hallucination go to zero. It's a thing that confines hallucination to places where hallucination is OK (i.e., the inference-allowed zone). The structure and facts about the codebase are pulled deterministically, so the retrieval process itself has no opening for hallucination; hallucinations on the judgment side get filtered downstream by tests / lint / dimension-by-dimension reviews. The places where hallucination is allowed and the places where it isn't are physically split by the harness. That's the continuation of the Part 2 framing.

Step back one more level and what the harness is really doing is shifting when inference happens. The annotations and descriptions on the graph were also written by AI originally — there is inference baked into them. But that inference is write-time — happens once, reviewed, then frozen — not read-time (happens every query, unverified at the point of use). The graph is frozen, reviewed inference, which is exactly why the read side can treat it as fact. "Leaning on determinism" can be rephrased as not letting unverified inference run on every query.
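
One way to picture the write-time versus read-time split is a store where inferred content only becomes readable after a review step. The class and method names below are hypothetical and exist only for this illustration; they are not cortex's implementation.

```python
class FrozenFacts:
    """Inference is written once, reviewed, and only then becomes readable."""

    def __init__(self):
        self._pending = {}
        self._frozen = {}

    def propose(self, key, value):
        """Write-time: store an AI-written description that has not been reviewed yet."""
        self._pending[key] = value

    def approve(self, key):
        """Review step: move the proposed value into the frozen, readable set."""
        self._frozen[key] = self._pending.pop(key)

    def read(self, key):
        """Read-time: return only reviewed values; no inference runs here."""
        return self._frozen.get(key)


store = FrozenFacts()
store.propose("calcMonthlyFee", "Calculates the monthly subscription fee")
print(store.read("calcMonthlyFee"))  # not visible before review
store.approve("calcMonthlyFee")
print(store.read("calcMonthlyFee"))  # visible as a frozen fact afterwards
```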

This is also the underlying basis for Part 1 (Series Intro)'s claim "models commoditize; harnesses differentiate." Model-side quality is converging across Claude / GPT / Gemini, but the harness is codebase-specific and business-specific, so this is where org-level differentiation actually comes from.

Worth flagging: the position of this boundary moves with model capability. As agentic search and reasoning get stronger, today's "must be deterministic" zone might be tomorrow's "inference is good enough" zone — and in fact cortex itself depends on AI's ability to traverse the graph agentically. But the boundary itself never disappears. How information is structured, where the line gets drawn between fact and inference — whether you hold that line explicitly as a design decision is what differentiates organizations, across every model generation.

A note: this framing isn't confined to cortex's harness. The same stance shapes db-graph MCP, the natural-language interface over internal DB schemas, and Sandbox MCP, which lets non-engineers safely publish AI-built apps. It's the through-line in any platform we build that's based on AI doing meaningful work.

One level more abstract: the individual features aren't where the value is. The value sits in the conviction itself. cortex / db-graph / Sandbox MCP are all that one conviction translated into our own use cases.

The way I think about "design":

Design is translating an abstract principle into a concrete implementation that fits your own use cases.

It's not drawing class diagrams, and it's not laying out architecture diagrams — it's the translation work of "how does this principle take shape under our business / codebase / constraints?" That's where each organization's distinctiveness lives, and that's the value that can't be copied.

Said the other way: another organization copying cortex's surface doesn't reproduce the substance. What gets asked of every org is how it translates this principle into its own use cases.

Trial and Error That Got Me Here

Everything I've written about above is the form that ended up working. Getting to that form involved a lot of throwing away. Two representative examples worth keeping on record, plus one shorter one.

I Spent Two Months on Static-Analysis code-graph, Then Threw It Out

The first thing I built was static-analysis-based code-graph: extracting AST data — imports, call graphs, type dependencies — and putting that into a graph DB. At a glance, the obvious implementation of "make AI understand the codebase."

Why two months? code-graph wasn't just cortex; it spanned our consumer-facing services and internal-system repositories too, over 40 repos in total (cortex being one of them). The mechanically extractable AST data (imports / call graphs / type dependencies) was usable as-is via tree-sitter. But each repo had its own API endpoints / DB schema / event definitions / Pub/Sub topology, and extracting those boundary nodes (where an app meets the outside) goes beyond mechanical AST analysis. It had to be implemented per repo type, and that's where the time went.

So I spent two months out of the first three on this, and got something that worked end-to-end.

And then I threw it away.

Why: static analysis is great at capturing structure, but it can't traverse on intent or business context. Concretely, three things broke:

No semantic entry point for search — if I want to query the codebase with "show me the function calculating member subscription billing," I can't get there unless I already know the function name or file. A graph built only from static analysis has no semantic-tag entry pointing to "what is this code for?"
The graph contains only code — internal helpers / utilities / types / arguments all become nodes, so traversal from any function blows up within a few hops, dragging in helpers and primitives. There's no axis to filter on semantic relatedness
What I actually wanted was code + DB schema + docs + infra on one graph — given a function, I want to pull, in one query, the DB tables it touches, the docs where the design lives, and the linked business requirement. A code-only graph just can't do that

→ I switched to the annotation-based approach (@graph-* JSDoc tags write the business intent into the code, and that gets unified with DB schema / docs / infra into one graph). Searchable semantically, and when you traverse, only related stuff comes back. That's the current product-graph (cpg). Don't drag sunk cost forward and you'll get to the final form — discarding two months of investment instead of trying to recoup it was the foundation for everything that came after.
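To make the difference from a code-only graph concrete, here is a small sketch, assuming business-intent tags have already been lifted from annotations into the nodes. The node kinds, the `intents` field, and every ID below are hypothetical and do not reflect cpg's real schema or its actual `@graph-*` tag names.

```typescript
// All names are illustrative assumptions, not cpg's real data model.
type NodeKind = "code" | "table" | "doc" | "infra";

interface GraphNode {
  id: string;
  kind: NodeKind;
  intents: string[]; // business-intent tags, e.g. lifted from JSDoc annotations
}

interface Edge {
  from: string;
  to: string;
}

const nodes: GraphNode[] = [
  { id: "calcSubscriptionBilling", kind: "code", intents: ["subscription-billing"] },
  { id: "formatYen", kind: "code", intents: [] }, // generic helper: no intent tag
  { id: "billing_records", kind: "table", intents: ["subscription-billing"] },
  { id: "docs/billing-design.md", kind: "doc", intents: ["subscription-billing"] },
];

const edges: Edge[] = [
  { from: "calcSubscriptionBilling", to: "formatYen" },
  { from: "calcSubscriptionBilling", to: "billing_records" },
  { from: "calcSubscriptionBilling", to: "docs/billing-design.md" },
];

// Semantic entry point: find nodes by what they are for, not by their name.
function findByIntent(intent: string): GraphNode[] {
  return nodes.filter((n) => n.intents.includes(intent));
}

// One-hop traversal that keeps only neighbours sharing the intent,
// so untagged helpers and primitives do not flood the result.
function relatedByIntent(startId: string, intent: string): GraphNode[] {
  const neighbourIds = edges.filter((e) => e.from === startId).map((e) => e.to);
  return nodes.filter((n) => neighbourIds.includes(n.id) && n.intents.includes(intent));
}
```

In this toy graph, traversing from `calcSubscriptionBilling` returns the table and the design doc but skips `formatYen`, which is the filtering axis a static-analysis-only graph lacks.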

Setting Coverage 90% as a Solo Target Broke the Implementations

Test coverage is still gated at 90%+ (as covered in Part 3). That part hasn't changed. But there was a period when Coverage was treated as a standalone target, and during that period the implementation visibly got worse.

Specifically:

Heavy default-value use that hides branches: function(input = {}) style writes the missing-input branch out of the test path. Coverage goes up, protection against unexpected input is gone
Catch-and-swallow over throw: try / catch returning null. Don't throw → no need to test "doesn't throw," and Coverage is satisfied. Invalid state silently propagates
Early returns that flatten too much: dump complex conditions through an "early return" escape. Tests pass; what should have been validation just isn't there anymore

Result: Coverage 90%, quality lower than before. When you look at Coverage alone, the shortest path to "satisfy it" is a weaker implementation that passes the tests.
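The weakening patterns are easier to see side by side. The sketch below is a made-up example (the function names and the `monthlyFee` field are assumptions, not cortex code): the first version is easy to cover because the default value and the swallowed exception remove the failure branches, while the second keeps those branches as explicit, testable failures.

```typescript
// Weakened: a default value plus catch-and-swallow. Every path returns quietly,
// so coverage is easy to satisfy while invalid input propagates as null.
function parseFeeWeak(raw = "{}"): number | null {
  try {
    const fee = JSON.parse(raw).monthlyFee;
    return typeof fee === "number" ? fee : null;
  } catch {
    return null; // malformed JSON is silently hidden
  }
}

// Stricter: missing or invalid input fails loudly, which forces a test per failure mode.
function parseFeeStrict(raw: string): number {
  const parsed: unknown = JSON.parse(raw); // malformed JSON throws SyntaxError
  const fee = (parsed as { monthlyFee?: unknown } | null)?.monthlyFee;
  if (typeof fee !== "number" || !Number.isFinite(fee) || fee < 0) {
    throw new RangeError("monthlyFee must be a non-negative number");
  }
  return fee;
}
```

Both versions can reach the same coverage number; only the second one tells the caller when the input was wrong.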

Two lessons:

Set a number as a target, and the number becomes the goal. Coverage is a "minimum floor" — not "a goal to hit"
Don't evaluate any metric alone. Coverage has to be evaluated alongside responsibility separation / exception design / boundary value coverage / etc.

Then, as the follow-up: I added linting that mechanically closes off the routes that let you weaken implementations to satisfy Coverage. Two specific examples:

no-silent-catch: AST-level ban on empty catch and silent-handler patterns like .catch(() => null). Catch bodies have to have a function call (logger included) / re-throw / new / await — otherwise it's an error. Catches the "weaken throws to satisfy Coverage but lose observability in production" pattern structurally. The violation message routes you to @cortex/otel/logger for structured logging, so the chain through Cloud Run OTel → Loki / Grafana stays intact
vitest-strong-matchers: bans weak matchers like toBeTruthy / toBeDefined / toContain / toBe(true|false) / expect.any / expect.objectContaining. Catches "any assertion that passes" patterns at the AST level, and points you instead toward toStrictEqual / toMatchInlineSnapshot that pin down the full output. This is one notch above Coverage — a test quality concern — but it lines up because the same reflection applies: don't let a number become the goal
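As a rough illustration of what an AST-level check like no-silent-catch has to decide, here is a dependency-free sketch over a simplified ESTree-like node shape. It is not the actual rule: the `AstNode` type and function names are assumptions, it ignores `.catch(() => null)` promise handlers, and it counts any nested call, throw, `new`, or `await` as sufficient.

```typescript
// Minimal ESTree-like node shape, defined locally so the sketch has no dependencies.
interface AstNode {
  type: string;
  [key: string]: unknown;
}

// Node types that count as "doing something observable" inside a catch body.
const OBSERVABLE_TYPES = new Set([
  "CallExpression",
  "ThrowStatement",
  "NewExpression",
  "AwaitExpression",
]);

function isAstNode(value: unknown): value is AstNode {
  return typeof value === "object" && value !== null && typeof (value as AstNode).type === "string";
}

// True if the subtree contains at least one call, throw, `new`, or `await`.
function containsObservableAction(node: AstNode): boolean {
  if (OBSERVABLE_TYPES.has(node.type)) return true;
  for (const child of Object.values(node)) {
    const candidates = Array.isArray(child) ? child : [child];
    for (const candidate of candidates) {
      if (isAstNode(candidate) && containsObservableAction(candidate)) return true;
    }
  }
  return false;
}

// A catch clause is "silent" when its body does none of the above.
function isSilentCatch(catchClause: AstNode): boolean {
  const body = catchClause.body;
  return !isAstNode(body) || !containsObservableAction(body);
}
```

A real ESLint rule would wrap a check like this in a `CatchClause` visitor and report a message pointing at the structured logger; that wiring is omitted here.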

On top of that, cortex's testing guideline opens with "Coverage is not the goal, just a supporting indicator," and threshold-lowering / istanbul ignore workarounds get bounced as Critical in Auto Review. So even when Coverage is satisfied, "this is intentionally deleting a branch" / "this is swallowing the exception" comes back as a Major finding.

Starting from the lesson "a single metric warps implementation," it took three layers (a guideline that states the principle → lint that mechanically rejects the workarounds → Auto Review that evaluates it as a dimension) before Coverage 90% finally functioned as the "minimum floor" it should have been all along. This too sits in the lineage of the Recurrence Prevention mechanism from Part 4 (so the same trap can't be stepped on twice).

Parallel Sub-Agent → Sequential Evaluation

Third: an internal-structure call about Auto Review. The design I tried first was to distribute the 9 dimensions to parallel sub-agents and evaluate them concurrently. It looked plausible ("parallel = faster, and parallel should also hold quality"), and I ended up throwing it out.

What actually happened: time, cost, and accuracy all got worse.

Time got worse: each sub-agent has its own startup, its own context load, its own result aggregation overhead. "9-way parallel = 9x faster" didn't hold; there were even cases where sequential evaluation in one session ended up faster
Cost got worse: each sub-agent loads PR diff + guidelines + related code independently — common context loads ran 9 times. Token consumption measured at just under 4x — not the naive 9x (the context other than diff is shared across many dimensions, which is what kept it from blowing up to a full 9x)
Accuracy didn't hold: parallel sub-agents don't see each other's verdicts, so the same problem comes back as "APPROVE" from one and "REQUEST_CHANGES" from another. Duplicate findings show up too. Without a "what kind of PR is this as a whole?" pass to anchor on, dimensional findings drift toward local optima and the overall picture gets worse

Switching to sequential evaluation: same session goes through 9 dimensions in sequence, so context loads once, and each dimension's call has the previous dimension's verdict in front of it. All three — time, cost, accuracy — improve simultaneously.

Of course, sequential evaluation introduces order dependence between dimensions — earlier verdicts can shape later ones. That's a real trade-off, and I accepted it knowingly. Inter-dimension consistency at the cost of some order sensitivity is more useful as a 9-dimension review than fully independent dimensions that contradict each other.
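The shape of sequential evaluation, including the order dependence just mentioned, can be sketched as follows. This is a simplified model with synchronous stub evaluators standing in for model calls; the types, the two example dimensions' logic, and the finding text are all hypothetical.

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";

interface DimensionResult {
  dimension: string;
  verdict: Verdict;
  findings: string[];
}

// Shared context is built once per review session (hypothetical fields).
interface ReviewContext {
  diff: string;
  guidelines: string;
}

// Each evaluator sees the shared context plus every earlier result.
type Evaluator = (ctx: ReviewContext, earlier: readonly DimensionResult[]) => DimensionResult;

function reviewSequentially(ctx: ReviewContext, evaluators: readonly Evaluator[]): DimensionResult[] {
  const results: DimensionResult[] = [];
  for (const evaluate of evaluators) {
    results.push(evaluate(ctx, results)); // order matters: earlier verdicts are visible
  }
  return results;
}

// Stub evaluators standing in for model calls.
const security: Evaluator = (ctx) => {
  const unscoped = ctx.diff.includes("SELECT *");
  return {
    dimension: "Security",
    verdict: unscoped ? "REQUEST_CHANGES" : "APPROVE",
    findings: unscoped ? ["unscoped query"] : [],
  };
};

const impact: Evaluator = (ctx, earlier) => {
  // Skip a duplicate finding when an earlier dimension already raised it.
  const alreadyReported = earlier.some((r) => r.findings.includes("unscoped query"));
  const findings = !alreadyReported && ctx.diff.includes("SELECT *") ? ["unscoped query"] : [];
  return {
    dimension: "Impact",
    verdict: findings.length > 0 ? "REQUEST_CHANGES" : "APPROVE",
    findings,
  };
};
```

Running `[security, impact]` yields one finding, while `[impact, security]` yields the same finding twice: the consistency gain and the order sensitivity come from the same mechanism.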

The takeaway: the distributed-systems intuition that "parallel = faster, parallel = quality holds" breaks its own assumptions in an AI harness. Unlike parallelizing across CPU cores on your machine, with AI the context isn't shared memory; it's per-process state. Sequential evaluation in one session ends up better on speed, token efficiency, and inter-dimension consistency at the same time — a structural property that's easy to miss at design time.

What This Section Is Really Saying

The form I've described across the series is the result of a lot of trial and error, not of starting with the right answer and laying it out from there. Throwing things away despite the sunk cost, the trap of letting a metric I chose turn into the goal, the distributed setup that looked natural but worked against me: those are the things I walked through before landing at the current shape.

Not easy. I don't pretend it was. But if you do walk through it, real results follow — that's the honest read on it now.

Closing the Series

What I most wanted to communicate across these six posts comes down to one thing:

AI coding is not about "how to use AI" — it's about designing the environment AI runs in.

Or, put another way:

AI isn't something to trust. It's something to design.

Assuming a large codebase: prompt engineering, model selection, and tool selection each matter individually, but polishing them alone doesn't get you to auto-merging PRs, auto-healing incidents, or non-engineer development. Getting there requires building a codebase / business flow / observability / repair cycle where AI doesn't need to infer. That's not an individual AI skill; it's an environment-design problem. (Conversely, for a small project of a few dozen files, today's AI models work fine standalone. Harnesses become essential when scale exceeds what one person can hold in their head.)

And the conviction at the root of environment design is, to repeat myself, "I don't trust AI to fill in the blanks for me." That means facing the reality that AI doesn't know context it wasn't handed, and that the ideal state doesn't happen unless it is spelled out. Once you accept that premise, what to build clarifies naturally.

Looking back, four decisions ended up being the ones that mattered:

Locked the conviction first: putting words to the root ("AI isn't something to trust") gave priority order to every mechanism. If I'd started from technique, I don't think I'd have made it to the current form
Invested with throwing-out as the default: like I did with code-graph at the two-month mark, I went into things with "throwing this out is OK." Drag sunk cost forward and you can't move forward
Refused standalone numerical targets: the moment a metric like Coverage 90% becomes the goal on its own, implementations warp. Designed the system so it gets evaluated alongside other dimensions
Designed for "no inference," not around AI's capability: I prioritized building structure where AI doesn't have to infer, instead of relying on what AI can do. That's what made the system stable end-to-end, I think

If even one of these is useful to someone starting on something similar, that would be great.

Afterword — Where Engineering Careers Are Heading

Slipping off the wrap-up topic — this is something I've been turning over recently, written here in a "loosely held thought" tone, so feel free to skim.

As harnesses mature, I think engineering work splits along two directions.

One direction is value creation from problem identification and business design. In the world of Part 5 — where non-engineer PRs work — "writing code" stops being scarce, and the actual scarce thing becomes the ability to define what to build. The person closest to the requirements (a PMO, a business manager, a domain-deep engineer) ends up driving Claude Code through to the merged PR themselves. This direction looks less like "engineer" and more like a business designer who moves between domain and implementation.

The other direction is building the foundation that lets all of that happen safely and quickly. Non-engineers can open PRs to the production repo only because the harness underneath holds quality: knowledge graph, Auto Review, Self-Healing, Recurrence Prevention, lint, CI, tests, and the observability stack, all interlocked. Designing, maintaining, and evolving that gets harder, not easier. This is the house-builder, rail-layer side, and it demands deep infra understanding, security instinct, observability design, and a feel for AI's architectural quirks.

I'm building cortex, so I'm spending more time on the latter; building "a foundation where the business can run its own changes" is genuinely fun for me. That said, I'm not the type who fully commits to one side — I move between listening to business questions and assembling the foundation, and the satisfaction from each is its own kind. This isn't a "which is more important?" question — the harness exists precisely so the former is possible, and the former being alive is what gives the latter meaning. They're mutually dependent.

Maybe the era of polishing just "coding ability" is shifting slightly. Where to put your value, or whether to move between both directions, is a choice more engineers will need to make intentionally.

Six posts in, thanks for sticking with me to the end.

#	Theme	Key scene	Article
1	Series intro: cortex's harness	PRs auto-merge / incidents self-heal before you notice	ai-harness-intro
2	Product Graph (cpg)	Code, docs, DB, infra unified into one graph	cortex-product-graph
3	AI PR review	webhook → AI review → auto-fix → squash merge	cortex-auto-review
4	Self-Healing + observability + auto-added guardrails	Alert → AI investigates → fix PR + new lint/type gate → auto redeploy	cortex-self-healing
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	cortex-non-engineer-prs
6	Series Final	The underlying philosophy plus a retrospective on the failures and lessons	This post

Part 5 ("The Author Doesn't Have to Be an Engineer") has been generating sharp comments. Worth a read for the thread alone.

Ryosuke Tsuji — Thu, 11 Jun 2026 01:06:48 +0000

The Author Doesn't Have to Be an Engineer: How the Harness Holds Quality

Ryosuke Tsuji — Mon, 08 Jun 2026 23:32:30 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

In Part 1 (Series Intro), I wrote about how cortex's harness has matured to the point where non-engineers (business-side managers, PMOs, and the like) can open PRs to the production repository. The harness here is the runtime foundation for AI in production -- the combination of the knowledge graph, Auto Review, Self-Healing, and Recurrence Prevention covered across Parts 1 through 4.

Part 5 is what comes next: that harness has now reached the layer of who actually writes the code.

"Surely an engineer is still checking afterward, right?" -- I expect a lot of readers will land here with that question. So this post leads with one concrete example before anything else.

Part 5 covers:

What kinds of PRs are actually shipping -- two recent ones in detail
What works and what doesn't -- the boundary between adding on top of an existing stack and standing up new infrastructure
Why this holds for non-engineers -- how the four mechanisms from Parts 1-4 carry it
What's next -- into toC services -- the direction of travel for consumer-facing scale

The deeper toC implementation story will live in a separate post; here you'll get the framing and the direction.

Series

#	Theme	Key scene	Article
1	Series intro: cortex harness	PRs merging unattended / incidents fixed before anyone notices	ai-harness-intro
2	Product Graph (cpg)	Code / docs / DB / infra unified into one graph	cortex-product-graph
3	Auto PR review	webhook -> AI review -> auto-fix -> squash merge	cortex-auto-review
4	Self-Healing + observability + auto-added guardrails	Alert -> AI investigates -> fix PR + new lint/type gate -> auto redeploy + same pattern auto-rejected from then on	cortex-self-healing
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	This article ← you are here
6	Series Final	The underlying philosophy plus a retrospective on the failures and lessons	cortex-philosophy

Start with one scene

A +1,742 line / 41 file PR lands on the internal dashboard web app. Title: "PL dashboard ver.2". The change opens up project visibility to managers and team leads across multiple business units, scoping what each person sees to their own division or team. It adds an SSoT in the shared types package, new routes on the API server with SQL involving INNER JOIN and LEFT JOIN, new pages and view-state on the web app, and a personal-settings surface -- the whole stack of things you'd expect for a real feature.

The point is, this isn't a typo fix or a string swap. Entities, repositories, API routes, screens, filters, personal settings -- every layer you'd normally touch for a feature got touched. A few days of work for an experienced engineer, in scale terms.

The review-fix cycle ran like this:

PR open (+1,742 / 41 files)
auto-review pass 1: Major finding (a permission-scope fall-through -- data from other divisions leaking into the view that shouldn't be there) plus a handful of Minor items
author bot push: closes the scope fall-through, addresses the Minor items
auto-review pass 2: Nit items remaining, plus a lint catch (no-empty-function)
author bot push: lint clean
auto-review pass 3: still some COMMENTED nits, not yet APPROVE
author bot push (iteration 2): hardens loading skeleton, reverts an unnecessary JSDoc tweak
auto-review pass 4: APPROVED → CI green + APPROVE both met → auto-merge → production

From PR open to merge: four review-fix rounds, three author-bot pushes, zero human reviewers in the loop. The reviews come from the auto-review bot, the fixes come from an author bot (an automated review-response agent that the PR author has running on their machine), the final APPROVE is submitted by the AI, and an auto-merge script picks it up the instant CI is green. Production lands with 56/56 shared type checks (SSoT), 2,284/2,284 API tests, 1,113/1,113 web specs, and 0 lint errors. (cortex splits the lint job between oxlint for general checks and a custom eslint plugin for the @graph-* rules.)
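The control flow of that cycle can be sketched as a bounded loop with a two-condition merge gate. This is a simplified model, not the actual scripts: the `Hooks` interface, the outcome labels, and the `maxPasses` cap are assumptions, and the stubs stand in for the reviewer agent, the author bot, and CI.

```typescript
// Labels mirror the ones used in the walkthrough above; the type itself is illustrative.
type ReviewOutcome = "APPROVED" | "REQUEST_CHANGES" | "COMMENTED";

interface PullRequest {
  revision: number;
}

// Hypothetical hooks for the reviewer agent, the author-side agent, and CI.
interface Hooks {
  review(pr: PullRequest): ReviewOutcome;
  fix(pr: PullRequest): PullRequest;
  ciGreen(pr: PullRequest): boolean;
}

interface LoopResult {
  merged: boolean;
  reviewPasses: number;
  authorPushes: number;
}

function runReviewLoop(pr: PullRequest, hooks: Hooks, maxPasses: number): LoopResult {
  let current = pr;
  let authorPushes = 0;
  for (let pass = 1; pass <= maxPasses; pass++) {
    if (hooks.review(current) === "APPROVED") {
      // Merge only when both conditions hold: APPROVED and green CI.
      return { merged: hooks.ciGreen(current), reviewPasses: pass, authorPushes };
    }
    if (pass < maxPasses) {
      current = hooks.fix(current);
      authorPushes++;
    }
  }
  return { merged: false, reviewPasses: maxPasses, authorPushes };
}

// Stubs shaped like the scene above: approval arrives after the third fix.
const stubHooks: Hooks = {
  review: (pr) =>
    pr.revision >= 3 ? "APPROVED" : pr.revision === 0 ? "REQUEST_CHANGES" : "COMMENTED",
  fix: (pr) => ({ revision: pr.revision + 1 }),
  ciGreen: () => true,
};

const loopResult = runReviewLoop({ revision: 0 }, stubHooks, 10);
// loopResult: { merged: true, reviewPasses: 4, authorPushes: 3 }
```

The cap on passes is a design choice in this sketch: a loop that never converges ends unmerged instead of spinning.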

The first review pass is worth noting. "Scope fall-through" is a somewhat technical finding -- a hole in the permission filter meant data from divisions other than your own could leak into the view. This is an internal dashboard, so it's not an external-leak incident, but "only see what's relevant to you" is the whole point of a dashboard like this -- losing it doesn't just risk an information slip, it drowns the user in noise that they shouldn't be filtering through in the first place. That's the kind of issue that's easy to merge by mistake and painful to notice in production. The fact that auto-review caught it on pass one and bounced it back for the author side to fix is what makes this whole flow viable for non-engineers. Without that loop, a PR of this size from a non-engineer would be a bad bet.
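The post doesn't show the actual filter, so the following is only one hypothetical way a scope fall-through can be closed structurally. The roles, field names, and the fail-closed default are all assumptions for illustration: every role gets an explicit filter, and an unhandled role is an error instead of an unfiltered view.

```typescript
// Hypothetical roles and row shape; not the dashboard's real model.
type Role = "admin" | "divisionManager" | "teamLead";

interface Viewer {
  role: Role;
  divisionId: string;
  teamId: string;
}

interface ProjectRow {
  id: string;
  divisionId: string;
  teamId: string;
}

function visibleProjects(viewer: Viewer, rows: readonly ProjectRow[]): ProjectRow[] {
  switch (viewer.role) {
    case "admin":
      return [...rows];
    case "divisionManager":
      return rows.filter((r) => r.divisionId === viewer.divisionId);
    case "teamLead":
      return rows.filter((r) => r.teamId === viewer.teamId);
    default: {
      // Fail closed: an unhandled role is an error, never an unfiltered view.
      const unhandled: never = viewer.role;
      throw new Error(`unhandled role: ${String(unhandled)}`);
    }
  }
}
```

The `never` assignment also makes the compiler flag the switch when a new role is added without a matching filter.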

And: the author of this PR is not an engineer. A business-side teammate handed a feature description to Claude Code, leaned on the knowledge graph (covered in Part 2) to pull in the relevant existing code, and the +1,742 line PR is what came back. The four review-fix rounds above are what happened next.

That setup lines up directly with the central claim of this post: the person who knows the business requirements best, instead of organizing them and handing them to an engineer, runs them through Claude Code to production themselves.

Quick clarification on "write." When I say "write" in this article, I don't mean typing line by line in an editor. I mean handing the business requirements to Claude Code, judging the resulting diffs and AI review comments with domain knowledge, and seeing it through to a production merge -- the whole arc. Most of the actual diff is written by Claude Code; review feedback is handled by the author bot. What the human does is three things: put what they want into words, make the judgment calls along the way ("does this fit, is this off"), and sign off when it's ready to merge. None of that is implementation work in the technical sense. That's what "write" means here.

There's still a learning curve, of course -- the prompts you give Claude Code, where to point it for context. But none of that is learning to program. What you need is the ability to articulate what you want clearly, not syntax or framework knowledge.

The harness covers quality, so even at +1,742 lines / 41 files, this works.

Out of scope: this post does not cover the path where non-engineers freely ship apps to a sandbox environment instead of opening PRs against the production repo. That's a different mechanism, covered in an earlier post: Bridging "I Want to Build" and "I Want to Publish Safely" for Non-Engineers with a Custom Sandbox MCP. This post is specifically about opening PRs against the production repo -- the front door that's traditionally been engineer-only.

When you need a change, you can make it yourself

The point of the previous section is this:

When you need a change, you make it yourself, without flagging an engineer.

When that holds, work like:

"I want a new metric on the dashboard"
"The aggregation filter doesn't match how the business actually operates"
"I want a small business-support feature embedded in the production app"

stops queueing behind whatever an engineer is in the middle of. The fix lands when the need lands.

Think about the old flow. Someone on the business side notices a small thing that needs to change. They write the requirements up. They open a ticket or a Slack thread for an engineer. The engineer is in the middle of something else, so it queues. When they finally get to it, the interpretation drifts from what the business actually meant, there's a back-and-forth, a review pass, and only then does it ship. Even a small change takes days to a week in wall-clock time.

That's the cost of a translation layer between business understanding and code, and it gets worse the busier the engineer is. The business's improvement cycle ends up paced by engineering's backlog.

When the person who knows the requirements writes the change themselves, that translation layer and that queue both disappear.

Here are two recent examples of that working.

Two non-engineer PRs that recently shipped

PR	Kind	Size	What changed
PR 1	Deep bug fix	+348 -177 / 7 files	The dashboard's actuals number was unfairly exceeding the target. Root cause: the "which teams to aggregate" definition was asymmetric between target side and actuals side. Fix lifts the shared "teams to include" list into its own file and points both sides at it. Tests added too.
PR 2	Feature build on top of existing stack	+1,742 -227 / 41 files	The PL dashboard v2 from the opening scene. Entities, repositories, API, UI -- all touched, but the web app itself (the stack) was already standing; this rides on top.

Different shapes, but both are non-engineer PRs that made it all the way to merge.

PR 1 -- a deep root-cause fix

This started from "the number looks wrong" on the business side, and the PR went all the way down to a data-integrity issue. The surface symptom: "the actuals number on the dashboard exceeds the monthly target, with the achievement reading 101% even though the team knows that's not real." The lazy fix would be a fudge factor or a clamp on the display. That's not what happened.

The author dug into the aggregation queries and pinned the real cause: the actuals side and the target side were reading from different tables, and the definition of "which teams count" wasn't symmetric between them. Teams that don't carry a target value (designers, PMOs, and so on) didn't show up on the target side but were getting counted on the actuals side, so the numerator was inflated against the denominator.

The fix is structural, not cosmetic. A single file defines "the teams in scope for this aggregation" as a shared list, and both sides reference it. No future drift between target-side and actuals-side definitions -- it's locked in by the shared constant.
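A minimal sketch of that shape, assuming made-up team IDs and amounts (chosen so the unscoped sum reproduces a 101% reading); the real file, list, and query code are not shown in the post. Both sides pass through the same `TEAMS_IN_SCOPE` filter, so a team without a target cannot inflate the numerator.

```typescript
// Hypothetical team IDs; the point is that both sides use the same list.
const TEAMS_IN_SCOPE: readonly string[] = ["sales-a", "sales-b"];

interface TeamAmount {
  teamId: string;
  amount: number;
}

// Single aggregation path shared by the target side and the actuals side.
function sumInScope(rows: readonly TeamAmount[]): number {
  return rows
    .filter((r) => TEAMS_IN_SCOPE.includes(r.teamId))
    .reduce((total, r) => total + r.amount, 0);
}

function achievementRate(actuals: readonly TeamAmount[], targets: readonly TeamAmount[]): number {
  const target = sumInScope(targets);
  if (target === 0) throw new RangeError("no target set for the teams in scope");
  return sumInScope(actuals) / target;
}

const targets: TeamAmount[] = [
  { teamId: "sales-a", amount: 100 },
  { teamId: "sales-b", amount: 100 },
];
const actuals: TeamAmount[] = [
  { teamId: "sales-a", amount: 90 },
  { teamId: "sales-b", amount: 100 },
  { teamId: "design", amount: 12 }, // carries no target, so it must not count
];

const rate = achievementRate(actuals, targets); // 190 / 200 = 0.95
```

Summing the actuals without the shared filter would give 202 / 200, the kind of over-100% reading the fix removes.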

The handling of "what data falls out of an aggregation" and "are the target and actuals sides really symmetric" is the kind of thing engineers miss too. A non-engineer working through it down to the structural level and fixing it there is what stands out about this PR.

PR 2 -- a big feature build on top of an existing stack

This is the PR the opening scene walked through. +1,742 / 41 files spanning entity, repository, API, and UI -- a scale of change that's well past what people usually mean when they say "modification."

What lets a non-engineer ship this size of change is that the web app itself (the stack) is already standing. Nobody is standing up a new app, defining a new Cloud Run service, adding new dependency packages, or creating a new directory structure. The change adds a route, a page, and a repository entry inside the existing structure. It rides on what's been built.

This is the "on top of an existing stack" range. That's where the boundary is, and the next section spells it out.

Note on terminology: "modification" in this article is broader than "small tweaks to existing logic." It includes adding new entities, new endpoints, and new pages on top of an existing stack. The line I'm drawing is between building on top of a stack vs. standing the stack up in the first place.

What works, what doesn't

The principle: standing up a stack is hard, building on top of one isn't

The cleanest dividing line for non-engineer development isn't "modification vs. new development." It's "on top of an existing stack" vs. "stand up a new stack."

Standing up a new stack (work that starts from infrastructure: a new web app from scratch, a new Cloud Run service defined from a Dockerfile, a brand-new BigQuery pipeline) → engineering work
Adding to an existing stack (a new page in an app that already exists, a new endpoint on an existing API, a new data source on an existing pipeline) → non-engineers can do this

Both of the example PRs above sit on the second side. The stack itself was already built (by me, for the most part), so they get to work inside it. "Stand up a new app from scratch" or "define infrastructure (Dockerfile / IaC) from zero" are still engineer territory.

Put another way: renovations and new rooms inside an existing house are open to anyone. Building the house itself is engineering. Get the structure wrong -- the load-bearing parts, the wiring, the plumbing -- and the cost of recovery is high. That's the part of stack design where there's still too much "if this is wrong, everything downstream breaks" risk to hand to AI.

What's left for engineers: laying the rails -- the stack and the harness itself

The flip side: the rail-laying work -- standing up a stack, and extending the harness itself -- is what non-engineers don't touch yet. Both require a different kind of knowledge:

Infrastructure: containers, IaC, the operational characteristics of cloud services. Cloud Run resource ceilings, cold starts, Pub/Sub at-least-once semantics, BigQuery partition / cluster design, how Pulumi stacks split. Get this wrong and a thing that compiles can still fall over in production
Authentication for external integrations: OAuth, webhooks, how you handle API keys and where they sit in Secret Manager. One small slip leaks credentials into the repo or lets a webhook fire something you didn't intend
Security fundamentals: what to never expose, where to sanitize, where the privilege boundary cuts. SQL injection, XSS, SSRF, broken authorization -- "it works" isn't enough here
Harness design and extension: adding a new Auto Review dimension, changing Self-Healing logic, writing a new lint rule (e.g. in eslint-plugin-graph), structuring guidelines. Decisions that require understanding how the whole flywheel hangs together -- the most meta layer

That last bullet -- harness extension -- has an important implication: for non-engineers to keep being able to ship to production, someone has to keep the harness evolving. Recurrence Prevention (Part 4) is the automatic loop that adds lint / CI guards / guidelines per trap. But the architecture of the harness itself -- the structure of dimensions, the calibration of judgment, the design of the Self-Healing flow, the shape of the knowledge graph -- those are a meta layer that still requires engineering judgment.

Concrete case: the current nine Auto Review dimensions ([Graph] / [Architecture] / [Security] / [Test] / [Doc] / [Impact] / [Observability] / [AI-Antipattern] / [Recurrence]) were designed by observing past incidents and fix patterns. When a tenth dimension becomes necessary -- say, a "breaking change check on dependency upgrades" axis -- decisions about responsibility splits with existing dimensions and where to set thresholds are made by looking at the whole structure. That's the kind of engineering work that stays.
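One way to hold a dimension list as a locked, explicit design decision is to make it a single constant that both the type system and a runtime check derive from. In the sketch below, the nine labels come from the text above; the `ReviewReport` type and `missingDimensions` helper are hypothetical.

```typescript
// The nine labels are taken from the list above; everything else is illustrative.
const REVIEW_DIMENSIONS = [
  "Graph",
  "Architecture",
  "Security",
  "Test",
  "Doc",
  "Impact",
  "Observability",
  "AI-Antipattern",
  "Recurrence",
] as const;

type ReviewDimension = (typeof REVIEW_DIMENSIONS)[number];

// Compile time: a report that omits a dimension does not type-check.
type ReviewReport = Record<ReviewDimension, "APPROVE" | "REQUEST_CHANGES">;

// Run time: the same list validates untyped input, such as parsed model output.
function missingDimensions(report: Record<string, unknown>): ReviewDimension[] {
  return REVIEW_DIMENSIONS.filter((d) => !(d in report));
}

const fullReport: ReviewReport = {
  Graph: "APPROVE",
  Architecture: "APPROVE",
  Security: "REQUEST_CHANGES",
  Test: "APPROVE",
  Doc: "APPROVE",
  Impact: "APPROVE",
  Observability: "APPROVE",
  "AI-Antipattern": "APPROVE",
  Recurrence: "APPROVE",
};
```

Adding a tenth dimension then becomes a one-line change to the constant that the compiler propagates to every report, which keeps the decision visible in review.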

The harness provides "rails you can't derail from." Laying those rails -- and laying the foundation those rails sit on -- is a different job, and it's still on engineering. Engineers lay the rails; anyone can run on them. That's the boundary today.

Why this works for non-engineers

This is a short recap, because everything that makes it work was already covered in Parts 1 through 4. Four mechanisms reinforcing each other -- that's what lets non-engineers operate safely on top of an existing stack.

① The knowledge graph pulls relevant code from "what you want to do"

cortex-product-graph from Part 2 -- the unified graph fusing code, docs, DB schema, and infrastructure into one knowledge base (implementation name: cpg) -- carries this layer.

Non-engineers don't need to know function names or repo structure. A natural-language question like "I want to add a metric column to the dashboard" goes to Claude Code, which hits the knowledge graph with a semantic search and gets back the relevant nodes -- the screen, the API, the DB, the docs -- in one or two hops. You can get started without knowing the technical vocabulary.

For PR 2: the author told Claude Code "I want PL dashboard v2 with division/team scoping for non-PI-Div PMOs and team leads," and the knowledge graph pulled the existing /projects route, project-repository.ts, FilterHeaders.tsx, and ProjectTable.tsx as the relevant nodes. The author never needed to know what file to edit. That's how the translation layer drops out.
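The "one or two hops" idea can be sketched as a bounded breadth-first expansion from the nodes a semantic search returns. This is a minimal illustration, not the cpg implementation: the `Edge` shape, the `expandWithinHops` name, and the choice to treat edges as undirected are all assumptions made for the example.

```typescript
// Hypothetical edge shape; the real cpg schema is not shown in this post.
type Edge = { from: string; to: string };

// Breadth-first expansion from seed nodes (e.g. semantic-search hits),
// returning every node id reachable within `maxHops` edges.
function expandWithinHops(seeds: string[], edges: Edge[], maxHops: number): Set<string> {
  const neighbors = new Map<string, string[]>();
  for (const { from, to } of edges) {
    // Assumption: edges are walked in both directions, so a screen can be
    // reached from its API and vice versa.
    neighbors.set(from, [...(neighbors.get(from) ?? []), to]);
    neighbors.set(to, [...(neighbors.get(to) ?? []), from]);
  }

  const seen = new Set<string>(seeds);
  let frontier = [...seeds];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const n of neighbors.get(id) ?? []) {
        if (!seen.has(n)) {
          seen.add(n);
          next.push(n);
        }
      }
    }
    frontier = next; // only newly discovered nodes are expanded on the next hop
  }
  return seen;
}
```

Seeding this with the /projects route and a two-hop limit would return the route's direct neighbors plus their neighbors, and nothing further out. That bound is what keeps the retrieved context small enough to hand to the model.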

② Auto Review enforces quality at the gate

The 9-dimension automated review from Part 3 is the next layer: [Graph] / [Architecture] / [Security] / [Test] / [Doc] / [Impact] / [Observability] / [AI-Antipattern] / [Recurrence]. The AI returns REQUEST_CHANGES on what's missing and loops with the author bot until APPROVE. The four-round example from the opening scene is exactly this in motion.

The point is this: the first PR doesn't have to be perfect. The author doesn't need to ship a completed, security-hole-free version on the first try. Push the initial PR and the rest gets sorted by the auto-review and the author bot bouncing off each other. The reason the author bot doesn't spin off into a loop of confused fixes is that the knowledge graph holds the full codebase context: changes are made with structural awareness of what they touch, so misreadings of the review feedback don't compound.

③ Self-Healing catches what slips through to production

Part 4 covered Self-Healing. If something does break in production, the AI starts from the alert, investigates root cause, opens a fix PR, and gets it auto-redeployed -- the entire loop runs without humans. Incidents triggered by a non-engineer's change recover on their own, hands-off. That's what makes the bar to opening a PR feel survivable.

This isn't "non-engineers are safe because nothing can go wrong." It's "even if something goes wrong, the harness has it covered." The system is designed to minimize damage, not eliminate failure. The three-layer construction (Observation → Repair → Strengthening) from Part 4 is what makes that net real.

④ Recurrence Prevention keeps the trap count from growing

This is the Recurrence Prevention loop from the back half of Part 4. Every trap that gets stepped on gets nailed down in the same PR, so the next attempt at the same pattern gets caught. The form depends on the trap: mechanizable traps become lint or CI guards; less-mechanizable ones become entries in the guideline docs (docs/gotchas, severity docs) that the AI reviewer reads. Either way, the catch happens before merge. Non-engineers contribute to this loop too -- when they hit a trap, the doc entry that prevents the next person from hitting it can come from them.

As this compounds, the rails get denser. Where there was once a loose "don't go that way" guideline, every incident adds another small rail saying "or this way, or this way, or this way," and the lane that's safe to walk gets clearer. The denser the rails, the safer non-engineers are in the lane.

→ The four pieces aren't independent components. Each one's output feeds the next one's input. This is the Guides + Sensors flywheel from Part 1 in action. I won't re-explain the details since they're in the prior posts, but non-engineers shipping to production is the result of all four wheels turning together. Take any one out and the level of upfront knowledge required to write to production jumps, and the whole thing collapses.

Next -- carrying the pattern to consumer-facing services

cortex is an internal AI platform, so the system as it stands can't be lifted into a toC production service as-is. The biggest issue is the difference in quality bar. For toC, "detect after user impact → Self-Healing fix" is too late. The requirement becomes: incidents don't happen, and when something is about to ship, there's review and testing on top of human sign-off.

That said, the shape of the harness -- a knowledge graph for context, 9-dimension AI review, an author bot responding to feedback -- carries over directly. The thing that changes is the final step: cortex's auto-merge becomes "AI does the prep, a human signs off." That doesn't mean giving up the AI's range: the AI handles the heavy lifting (test writing, environment setup, test runs, the 9-dimension review), and only the final APPROVE is left to a human. The obvious objection is "if the human sign-off stays, engineer time doesn't really decrease, does it?" But historically engineers were spending the bulk of their time on the implementation, the test writing, the environment setup, the self-review, the back-and-forth on review. Sign-off itself is the smallest piece of that pie. With AI doing the prep work, what an engineer spends time on shifts from implementation labor to quality judgment.

A caveat on the knowledge graph: it only earns its keep at large codebase scale. If the codebase fits in one AI context window, a cross-repo graph is unnecessary. cortex (100+ apps) and the toC side (40+ repos) need one because the scale forces it.

There is a concrete plan (extending the knowledge graph across the toC side's 40+ repositories, designing the AI-prep flow, etc.), and the full version goes in a separate post.

Wrap-up

The person who knows the business requirements best, instead of writing them up for an engineer, runs them through to production directly. Quality is held by the harness, so what's required from the writer is domain knowledge and the ability to direct an AI well. Business asks stop queuing behind engineering, and the cycle speeds up
The four mechanisms from Parts 1-4 (knowledge graph / Auto Review / Self-Healing / Recurrence Prevention) form a reinforcing flywheel. The first PR doesn't have to be perfect, and what does break is repaired automatically. That's the design
The boundary: engineers lay the rails, anyone can run on them. Standing up the stack (infrastructure, authentication, security) and extending the harness itself (new lint rules, new review dimensions, Self-Healing flow design) stay on engineering
Carrying this to consumer-facing toC services, the knowledge graph (a 40+ repo cross-repo graph on the service side) covers the context layer, but the quality bar shifts, so auto-merge becomes "AI prep + human sign-off." Details in a separate post

In Part 6 I'll wrap the series with the philosophy at the foundation -- why this design, what got given up, what got kept. The series so far has been about "the parts that are working"; Part 6 puts the failures and the dead ends on the table too, including the gap between the philosophy and the actual implementation. A retrospective for myself, and -- I hope -- a reference for anyone heading down a similar road.

Fixed Before Anyone Notices, Stronger After Every Fix: Self-Healing + Recurrence Prevention

Ryosuke Tsuji — Mon, 01 Jun 2026 23:57:25 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

In Part 3 I covered AI reviewing AI PRs -- the auto-review pipeline that defends quality at the PR stage.

This post is the other side: defending quality in production, via Self-Healing. A production alert fires, an AI investigates it, opens a fix PR, the PR goes through the same auto-review pipeline from Part 3, gets auto-merged and auto-redeployed. And the same fix PR is required to add a new Guide -- whether that's a lint rule, CI guard, type constraint, or guideline update -- so the same anti-pattern gets auto-rejected from then on. The guardrails grow every time.

"Incidents get fixed automatically" is catchy on its own, but on its own it's probably not enough in the long run. You have to close the recurrence class while you fix the incident -- self-healing plus self-strengthening -- before the quality gates start to compound over time.

Start with last month's numbers

115 Self-Healing PRs merged in the past 30 days.

Effectively all of them merged and deployed without human involvement.

Humans only step in when the AI judges "this is not something code can fix."

That's the current state of "incident response" at cortex.

Don't read "115 = 115 user-impacting incidents" though. Roughly:

About half (54) are Deploy Failed-style alerts -- CI / Pulumi deploy step caught a failure, the AI absorbed it before it shipped to production. Recently the [Recurrence] loop (covered later) has been piling up countermeasures here, so this bucket is trending down anecdotally
The remaining 61 are production-runtime alerts (Service Error Log Detected / Pipeline Failure / Generator Failure etc.) -- the service is running in production, but an error-log threshold or consecutive-failure threshold tripped. The AI absorbed them before they propagated to user impact

So it's less "incident response" than "production anomalies that monitoring caught, fixed 115 times by AI before anyone woke up." The number of incidents humans actually have to acknowledge is in the low single digits per month.

There's also a clear pattern of the same service firing repeatedly (one ETL-ish service alone accounts for 25 of the 61). That is exactly the kind of repetition the [Recurrence] loop covered later is supposed to eliminate, by turning the pattern into lint or type gates. That's the back half of this post.

One more honest note: the recent month's number is slightly inflated. The codebase had a fair number of "silent catch" patterns -- catch blocks that swallow exceptions without logging anything. We added the no-silent-catch lint rule and swept the existing silent catches in batches, which exposed previously hidden production errors as alerts. So part of the spike is "monitoring caught up to reality." Once the [Recurrence] loop converts these into lint over time, the number should converge. "Things we couldn't see, we can see now" is a quality improvement -- what we're seeing is the catch-up phase.

One more thing worth saying: doing this by hand is utterly unsustainable. Running 115 manual cycles of "ack alert -> read logs -> context switch -> understand the code -> fix -> open PR -> review -> deploy" would bankrupt any team's engineering bandwidth. The system absorbs them without anyone noticing, and converts the fix into a new Guide (lint / CI guard / type constraint / guideline) at the same time -- that's the actual subject of this post.

The moment an alert fires, the AI starts an investigation, traces Loki / Product Graph / git blame to root cause, opens a fix PR, and runs it through the auto-review from Part 3. Then: APPROVE -> auto-merge -> auto-redeploy. One full loop.

Series

#	Theme	Key scene	Article
1	Series intro: cortex harness	PRs merging unattended / incidents fixed before anyone notices	ai-harness-intro
2	Product Graph (cpg)	Code / docs / DB / infra unified into one graph	cortex-product-graph
3	Auto PR review	webhook -> AI review -> auto-fix -> squash merge	cortex-auto-review
4	Self-Healing + observability + auto-added guardrails	Alert -> AI investigates -> fix PR + new lint/type gate -> auto redeploy + same pattern auto-rejected from then on	This article ← you are here
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	cortex-non-engineer-prs
6	Series Final	The underlying philosophy plus a retrospective on the failures and lessons	cortex-philosophy

Big picture -- the three layers: Observation, Repair, Strengthening

For Self-Healing to work, you need an Observation layer in front and a Strengthening layer (recurrence prevention) behind it. Self-Healing itself is the middle Repair layer. The "self-healing + self-strengthening" loop only spins up when all three are in place.

Prerequisites: The three layers only stand up on top of two prior pieces: cpg (the unified code / docs / DB / infra knowledge graph from Part 2) and the Observability stack covered in this post.

No Observability -> the observation layer is empty, nothing gets detected -> the repair layer never even fires

No cpg -> the AI cannot see "where else does this trap exist" -> the repair layer does symptom-level patching at best, and the strengthening layer's horizontal expansion stops working

Put differently: trying to copy this setup without those two will just multiply incidents. An AI that blindly looks at error logs and rewrites production code is just speeding up the rate at which gh pr create ships accidents. cpg and Observability are the minimum bar for being able to delegate auto-repair to AI.

Note also that cortex is a several-hundred-thousand-line codebase. At that scale the whole codebase cannot be loaded as AI context, let alone held in a human's head. Tell the AI to trace impact with just grep and file reads, and it'll run out of context window before it finds anything. cpg is what lets it ask "which other code does this function's change ripple into" and get the answer in one hop. Small repos may not need this. Past a certain scale, cpg is not optional, it's required.

In Fowler's Guides / Sensors terms from Part 1, cpg and Observability are the substrate that supports both Guides (pre-execution controls like lint) and Sensors (post-execution gates like auto-review and Self-Healing). Observability feeds Sensors via firing alerts; cpg feeds the Guides side by supplying the auto-review with impact-scoping context. Neither belongs on one side only -- they're foundational to both, and Self-Healing and auto-review only function on top of this substrate. That's the structural claim this post is built around.


Layer	Role	Key components
Observation	Real-time detection of production anomalies	OTel SDK / Loki / Mimir / Tempo / Faro / Grafana / Pino logs with trace_id
Repair	AI receives the alert, investigates root cause, opens a fix PR, auto-review, auto-merge, auto-redeploy	Event Relay -> SSE -> `self-healing` mode script -> claude -p (worktree) -> gh pr create
Strengthening	The fix PR is required to add a new Guide (lint / CI guard / type constraint / guideline). The same anti-pattern can't reach production again	`@cortex/eslint-plugin-graph` (26 rules), `scripts/check-*.ts` (13 guards), `recurrence-prevention.md`, the `[Recurrence]` lens of auto-review

I'll walk through them in order.

Observation -- where do the alerts come from?

cortex's production observability is built on Grafana Cloud + OpenTelemetry:

OTel SDK (the shared @cortex/otel package) -- every service calls initOtel({ serviceName }) at its entry point. Trace / metric / log all go out via OTLP to Grafana Cloud
Loki (logs) -- Pino structured logs get trace_id automatically. trace and log are cross-referenced
Mimir (metrics) -- Cloud Run / pipeline / Gemini API token usage, etc.
Tempo (traces) -- distributed tracing
Faro (frontend) -- captures browser JS errors / performance / network failures
Grafana -- dashboards + Alert Rules + Notification Policy

We also have a strict definition of log levels, anchored on business impact:

Level	Definition	Examples
`warn`	Business-foreseeable, does not need immediate action (retryable / self-recovers).	Search query returned 0 results, optional field unset, short retry due to rate limit
`error`	Data recovery / re-run will definitely be needed afterward. Impact expected to be under 20%.	"User record that should exist isn't there," BigQuery insert failure, per-record enrichment failure
`fatal`	The feature as a whole fails for 20%+ of requests. Service-continuity broken, fatal config missing, full upstream outage.	OTel init failure, required secret missing at startup, full input data source outage for a pipeline

The key point is to not pick the level mechanically based on the exception class name like NotFoundError. Same "record not found" situation: "this record must exist and doesn't" is error / fatal; "user search returned 0 hits" is warn. The level is decided by business impact -- "does this require data recovery later," "is the whole feature down" -- not by the type. Without this discipline you simultaneously get monitoring fatigue and missed critical incidents. Self-Healing reacts mainly to error-threshold trips; fatal is the human-escalation side.
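To make "level by business impact, not by exception class" concrete, here is a small sketch that takes the table above literally. The `Impact` shape and the `chooseLogLevel` name are hypothetical; how the two inputs get estimated at a real call site is left out.

```typescript
type LogLevel = "warn" | "error" | "fatal";

// Hypothetical description of the business impact at a call site.
interface Impact {
  needsRecoveryOrRerun: boolean; // will data recovery / a re-run be needed afterward?
  failingRequestRatio: number; // estimated share of requests for which the feature fails (0..1)
}

// Maps business impact to a level following the table above:
// 20%+ of requests failing -> fatal, recovery needed -> error, otherwise warn.
function chooseLogLevel(impact: Impact): LogLevel {
  if (impact.failingRequestRatio >= 0.2) return "fatal";
  if (impact.needsRecoveryOrRerun) return "error";
  return "warn";
}
```

With this shape, the same "record not found" condition lands on different levels depending on the caller: a user search with zero hits passes `needsRecoveryOrRerun: false` and gets `warn`, while a missing record that must exist passes `true` and gets `error`.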

Alert Rules are managed declaratively in Pulumi, grouped by service into categories like BOT / Pipeline / Transformer / Generator / Gemini / CI / Deploy / Service Catch-All. When we add a new service, one line in infra code spins up the dashboards and alerts automatically.

This is "the infrastructure that lets the AI see the same things humans see." Self-Healing picks up alerts coming off this stack.

What Observability can't catch, Self-Healing can't fix either

Honest disclaimer: Self-Healing can only react to what the observation layer can detect as an anomaly. "Observability is everything" is literally true here.

What the current stack catches is roughly logic-level errors -- exceptions, error logs, deploy failures, external-API call failures, threshold-based metric anomalies.

What it doesn't catch:

UI errors -- the logic ran, no error logs, but the screen shows something different from intent / shows the wrong value. Faro catches client-side JS exceptions and network failures, but "the logic ran and the output is just wrong" never fires an alert
Silent data corruption -- aggregated values slowly drift, bad values get into a table. Unless it crosses a threshold or schema check, nothing detects it
Perceived UX degradation -- requests feel slow, the UX feels off. Only catchable once SLO / latency thresholds trip

So Self-Healing is "AI replacing the human in the loop for incidents the observation layer can catch." The coverage of the observation layer itself is the prerequisite. Holes in observation stay as blind spots that neither auto-review nor Self-Healing reaches.

This isn't really a limitation of Self-Healing -- it's the importance of growing the observation stack, which cortex keeps investing in continuously. (From Part 1, Observability is one of the "supporting foundations" beneath the flywheel.)

Repair -- the Self-Healing flow

MODE=self-healing runs the same webhook-server script as the auto-review setup from Part 3, but listening for Grafana firing alerts.

The textual flow looks like:

[Grafana Alert Rule firing]
   ↓ POST /webhook/grafana
[Event Relay (in-house)] -- persisted in Firestore
   ↓ SSE push (event: grafana-alert)
[self-healing mode script]
   ↓ throttle check (same fingerprint skipped for 4h)
   ↓ 👀 reaction in Slack to signal "I'm on it"
   ↓ git worktree add -b hotfix/auto-alert-{service}-{ts} origin/main
   ↓ run claude -p inside the worktree
     - search related code via Product Graph MCP
     - pull error logs from Loki via Grafana MCP
     - identify root cause and fix
     - update tests as needed
     - conventional commit
   ↓ git push + gh pr create
[fix PR]
   ↓ auto-review (the Part 3 pipeline)
   ↓ APPROVE -> auto-merge -> auto-redeploy
[recovered]
   ↓ ✅ in the Slack thread

What happens when the AI judges "this is not fixable in code"

Not every alert is fixable by code. The implementation has a rule: "if you judge it unfixable, exit without changing anything." In that case Slack gets a notification of the form "This alert cannot be addressed in code. Investigation: ..." -- including what the AI investigated.

Worth clarifying on the numbers side: the headline 115 is "Self-Healing runs that reached PR-created -> merged -> deployed." This "unfixable, exit clean" case is a separate bucket, happening several times a month (external transient outages, infra / config issues that aren't code, cases too complex for the AI to judge confidently). The "humans show up" bucket is this separate one -- it isn't a "some of the 115 failed" failure rate.

Here's what a real Slack message looks like (a case where a transformer service's GitHub PAT had expired):

ℹ️ This alert cannot be addressed in code. Investigation:

Investigation summary

Checked the error logs in Loki for the past hour. The single error source for the target service right now is expiration / revocation of the GitHub PAT (Personal Access Token).

What I found

The BQ snapshot pipeline is healthy: 7 consecutive [BQ] Snapshot rows inserted successes. The earlier autodetect type-drift problem was resolved in an earlier PR that put SAFE_CAST(... AS INT64) in place.

The single ongoing error (multiple occurrences in the past hour): GithubUnauthorizedError: GitHub API /commits/master unauthorized (401): PAT likely expired/revoked -- the /check-drift handler returns 401 every time Cloud Scheduler invokes it.

Root cause

The GitHub PAT held in Secret Manager for this service has expired or been revoked. The code already classifies the 401 with a dedicated GithubUnauthorizedError and is already sending the "PAT needs rotation" notification to Slack. The detection / notification design on the code side is working as intended.

Resolution: exit without changes

This is an external-credential operational issue (PAT expiry), and what needs to happen is a human rotates the PAT in Secret Manager. Not the kind of incident code can resolve. Therefore: no code change, no commit, no PR.

Operator actions:

Add a new version of the target secret in GCP Console / Secret Manager with a valid PAT (repo scope, read access to the target repository)

No Cloud Run revision redeploy needed (secretKeyRef version:latest is referenced)

At this level of detail, "what was investigated, why code can't fix it, what the human should do" all come out in one Slack message. Open the thread and the operator can act immediately. The productivity gap vs. "alerts just forwarded blindly" is significant.

Deduplication

A throttle ensures the same fingerprint (Grafana's unique alert identifier) is not re-processed for 4 hours. Without this, alerts that fire again before the fix PR has merged would spawn another worktree, another fix PR, and so on -- an easy infinite loop.

We also permanently skip any alertname containing credential. Credential incidents carry leakage risk if the AI touches them, so they're explicitly escalated to humans.
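The two rules above (a 4-hour throttle per fingerprint, plus a permanent skip for credential alerts) fit in a few lines. This is a sketch under assumptions, not the production script: the `AlertThrottle` class, the in-memory `Map`, and the case-insensitive match on `alertname` are all choices made for the example.

```typescript
// Only the two alert fields this check needs; the full Grafana payload has more.
interface AlertInfo {
  fingerprint: string;
  alertname: string;
}

const THROTTLE_MS = 4 * 60 * 60 * 1000; // 4 hours, as described above

class AlertThrottle {
  // Assumption: state is kept in memory; a real deployment may persist it.
  private lastHandled = new Map<string, number>();

  // Returns true if the alert should be handed to the fixer, and records it.
  shouldProcess(alert: AlertInfo, nowMs: number): boolean {
    // Credential-related alerts are never auto-handled; they go to humans.
    if (alert.alertname.toLowerCase().includes("credential")) return false;

    const last = this.lastHandled.get(alert.fingerprint);
    if (last !== undefined && nowMs - last < THROTTLE_MS) return false;

    this.lastHandled.set(alert.fingerprint, nowMs);
    return true;
  }
}
```

Passing `nowMs` in explicitly, instead of calling `Date.now()` inside, keeps the throttle window testable without waiting four hours.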

Self-Healing and Part 3 auto-review -- "the fixer AI" and "the reviewer AI" are independent

This is the most consequential design choice of the agent setup, so I'm calling it out explicitly.

PRs opened by Self-Healing are not special PRs, just fix PRs. They go through the Part 3 auto-review pipeline under exactly the same conditions -- the 9 lenses (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) get checked in order. Critical / Major findings -> REQUEST_CHANGES; Nit-only / no findings + CI green -> APPROVE -> auto-merge.
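The merge gate described above reduces to a small decision function. This is a sketch, not the Part 3 implementation: the `decideVerdict` name is hypothetical, and the `WAIT_FOR_CI` outcome for "no blocking findings but CI not green yet" is an assumption, since the text only specifies the other two cases.

```typescript
type Severity = "Critical" | "Major" | "Nit";
type Verdict = "REQUEST_CHANGES" | "APPROVE" | "WAIT_FOR_CI";

// Critical / Major findings block the PR; Nit-only or no findings approve
// it once CI is green.
function decideVerdict(findings: Severity[], ciGreen: boolean): Verdict {
  const blocking = findings.some((s) => s === "Critical" || s === "Major");
  if (blocking) return "REQUEST_CHANGES";
  // Assumption: without a green CI run the reviewer holds off, not approves.
  return ciGreen ? "APPROVE" : "WAIT_FOR_CI";
}
```

A Self-Healing PR and a human-authored PR go through the same function. Nothing in the input says who opened the PR.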

The important bit: this is not a monolithic "AI fixing AI" loop. The fixer-side AI and the reviewer-side AI are fully independent:

Different process, different session: the self-healing-mode AI and the reviewer-mode AI are launched as separate claude -p processes. They do not share context
Different input sources: the fixer builds the problem from Grafana alert + Loki + cpg. The reviewer judges from the PR diff + cpg + review guidelines
Different objectives: the fixer is optimizing for "stop the incident." The reviewer is judging "does this violate the 9 lenses or the severity contract?" A deliberate separation of concerns where the two roles' incentives are intentionally misaligned

As a result, PRs the fixer dashed off get blocked by the reviewer (REQUEST_CHANGES -> back to the fixer). The AI does not approve its own output. "Just-make-it-work" fixes don't get through.

This is the often-debated review-independence problem in LLM-agent operation, solved here in the obvious way: split the work across separate agents.

A concrete example: meet subscription's 409 ALREADY_EXISTS

Take the alert from the Google Meet recording auto-fetch service I covered in the Meeting Intelligence post. On 2026-05-21, Self-Healing opened a fix PR titled fix(meet-xxx): auto-fix for Service Error Log Detected.

The trigger error from Loki:

Workspace Events API request failed: 409 Conflict
"Subscription associated with the resource already exists."

How the AI investigated:

Pinned the error in Loki -- ran {service_name="meet-xxx"} | json | level=~"ERROR|error|Error" via Grafana MCP, picked up the Failed to renew Meet subscription stack trace
Traced the call path in Product Graph -- identified renewSubscriptions -> createMeetSubscription
Cross-referenced past PRs -- the "opposite-direction inconsistency" (name in Firestore but missing from Google = 404) had already been self-healed in another PR with patchMeetSubscriptionTtl -> null fallback. The current direction (still on Google's side but missing from Firestore = 409) was the gap
Verdict: "the same pattern may exist elsewhere" -- a [Recurrence] decision matrix "horizontal expansion required" case

Instead of a quick patch, it implemented the same-direction self-healing symmetrically to the opposite-direction fallback that was already there:

Made createMeetSubscription idempotent
If POST returns 409, extract the existing Subscription name from the response and call patchMeetSubscriptionTtl
The caller writes the return value back into Firestore, so the next renewal converges to the normal PATCH path (self-healing)
Per the existing graph/no-silent-catch lint, JSON.parse failures are also logged via logger.warn + serializeError so they stay structured
Three tests added

This is what "Self-Healing pushing all the way to root cause and rolling the fix out horizontally" looks like in practice. "Close the recurrence class, don't just suppress the symptom" (the spirit of recurrence-prevention.md) executed autonomously by the AI.
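The idempotent-create pattern from that fix can be sketched as follows. Everything here is illustrative: the `SubscriptionApi` interface, its result shape, and the `existingName` field are hypothetical stand-ins, because the post shows neither the real client nor how the subscription name is extracted from the 409 response.

```typescript
// Hypothetical client interface, injected so the sketch runs without a network.
interface SubscriptionApi {
  create(resource: string): Promise<{ status: number; name?: string; existingName?: string }>;
  patchTtl(name: string): Promise<{ name: string }>;
}

// Creates a subscription, treating "already exists" (409) as a recoverable
// state: the existing subscription is renewed instead of reported as a failure.
async function createMeetSubscriptionIdempotent(
  api: SubscriptionApi,
  resource: string,
): Promise<string> {
  const res = await api.create(resource);

  if (res.status === 409) {
    if (!res.existingName) {
      throw new Error("409 without an existing subscription name");
    }
    const patched = await api.patchTtl(res.existingName);
    return patched.name; // the caller writes this back to Firestore
  }

  if (res.status >= 200 && res.status < 300 && res.name) return res.name;
  throw new Error(`create failed with status ${res.status}`);
}
```

Because the function returns a name in both the fresh-create and the already-exists case, the caller's write-back path is the same either way. That is why the next renewal converges on the normal PATCH path.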

Strengthening -- Guides (lint + guidelines) grow automatically

This is the layer that keeps Self-Healing from being just "auto-repair."

In Fowler's Guides / Sensors terms from Part 1, the Strengthening layer is the place where Guides grow -- i.e. the pre-execution controls that prevent AI from deviating in the first place. cortex's Guides come in two flavors:

Machine-read Guides: lint / type / CI guard / coverage thresholds / Prettier -- enforced at commit / CI time
Human-and-AI-read Guides: guidelines like recurrence-prevention.md, severity.md, ai-antipattern.md, etc. -- used as decision criteria by auto-review

The 9 lenses, severity contract, and no-downgrade rules from Part 3 are the latter; the auto-added lints in this post (Part 4) are the former. Together they form the Guides surface. Lints are "formalized guidelines," guidelines are "lints that haven't been formalized yet."

The Sensors side -- Self-Healing and auto-review -- grows these Guides every time it runs:

Self-Healing's root-cause investigation finds "the same pattern exists elsewhere" -> demands horizontal expansion + a new lint (= new Guide)
Auto-review's [Recurrence] lens blocks PRs that fix without adding lint
Both depend on cpg to see impact scope across the codebase

cpg is what lets the AI ask "where else does this trap exist." Self-Healing and auto-review (= the Sensors side) share cpg as a substrate, and each run thickens Guides by one notch.

What happens every time Self-Healing runs (the recurrence-prevention-first flow)

Every fix PR Self-Healing opens is checked for [Recurrence] by auto-review. The decision matrix:

Situation	Required action	Form
Same trap stepped on 2+ times	Lint required (custom ESLint rule / type constraint / CI guard)	Machine (new Guide)
Pattern may exist elsewhere	Horizontal expansion required (cpg traversal for similar nodes, fix all of them in this PR)	Investigation + fix
Cannot be machine-checked but worth formalizing	Add to an existing guideline	Guideline entry
One-off, no value in formalization	Nothing (bug fix only)	--
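Read as code, the matrix looks roughly like the sketch below. The `TrapAssessment` fields and the `requiredActions` name are hypothetical, and treating the rows as combinable (a trap can require both a lint and horizontal expansion) is my reading of the table, not something recurrence-prevention.md is quoted as saying.

```typescript
type RequiredAction = "lint-required" | "horizontal-expansion" | "guideline-entry" | "none";

// Hypothetical summary of what the reviewer concluded about a trap.
interface TrapAssessment {
  timesSteppedOn: number;
  mayExistElsewhere: boolean;
  machineCheckable: boolean;
  worthFormalizing: boolean;
}

// Applies the matrix rows in order; rows are assumed to be combinable.
function requiredActions(t: TrapAssessment): RequiredAction[] {
  const actions: RequiredAction[] = [];
  if (t.timesSteppedOn >= 2) actions.push("lint-required");
  if (t.mayExistElsewhere) actions.push("horizontal-expansion");
  if (!t.machineCheckable && t.worthFormalizing) actions.push("guideline-entry");
  if (actions.length === 0) actions.push("none"); // one-off: bug fix only
  return actions;
}
```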

When the "stepped on 2+ times" situation applies, the fix PR can't merge without a new lint included. So every Self-Healing run produces:

Horizontal expansion via cpg -- not just the immediate fix target, every similar node enumerated
A new Guide added in the same PR -- ESLint custom rule / type constraint / CI guard / guideline entry, one of the four
All existing violations cleared in the same PR -- no warn-as-deferral, error on first introduction
Auto-review -> auto-merge -> auto-redeploy -- the regular Part 3 pipeline
Going forward, writing the same pattern gets mechanically rejected by CI / lint -- the recurrence class is structurally closed

"Add the guard while you fix the bug" runs as a self-sustaining loop driven by Self-Healing.

"We'll do it later" and "introduce as `warn`" are banned

A couple of important contract clauses from the guidelines:

"Plan to lint later," "lint when we refactor," "another PR will handle this" -- all banned. If it can be addressed in this PR, it must be
"Existing violations remain, so introduce as warn and promote to error later" -- not accepted. This is deferral in disguise. The responsibility for the warn->error promotion goes nowhere and the rule rots
If you add a lint rule, fix all existing violations in the same PR and ship at error

These extend the no-downgrade rules from Part 3 -- preempting the typical escape hatches.

The "step on it, mechanize it" lineage

Custom Guides currently piled up in cortex:

graph/no-silent-catch (ESLint) -- the source of the "inflated number" mentioned in the intro. Bans catch blocks that swallow exceptions
Stacktrace-preservation guideline (codified as a Major violation in observability.md, caught by auto-review) -- forbids logger.error(err.message) style logs that drop the stack and keep only the message string. Forces the err field to hold serializeError(error) so name / message / stack are preserved as structured fields. Observability is everything here, so logs that drop stack info are treated as inherently broken
cortex-quality/require-fetch-timeout (oxlint) -- mandates signal: AbortSignal.timeout(...) on external fetch calls. Born from a case where a no-timeout fetch hung indefinitely and triggered a Cloud Tasks redelivery storm. (oxlint is a Rust-implemented JS/TS linter that runs ESLint-compatible rule sets, dozens of times faster than ESLint due to the Rust implementation. cortex uses oxlint for the standard ruleset and ESLint for custom rules that need AST-level work.)
graph/no-bq-string-timestamp-param (ESLint) -- from a case where passing TIMESTAMP as a string to a BigQuery query parameter NULLed the value out through a serializer bug and silently failed every INSERT
graph/require-firestore-ignore-undefined (ESLint) -- forces ignoreUndefinedProperties: true on new Firestore(). From a case where a single NULL row caused a 100% failure rate in a sync batch
check-otel-env-injection (CI guard) -- the recurrence prevention for the Cloud Run OTel env injection case below
TypeScript type tightening (type level) -- tighter function signatures, branded types for ID disambiguation, exhaustive discriminated unions, etc. Patterns that can't be lint-caught but are catchable at the type level get closed from the type side

These aren't textbook-learnable rules -- they're "stepped on once, then mechanized." The number of traps the organization has stepped on translates directly into the number of Guides piled up (across ESLint / oxlint / CI guard / types).
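As an illustration of what one of these Guides pushes code toward, here is the shape that a fetch-timeout rule would accept. `AbortSignal.timeout` is the standard API named above; the `withTimeout` and `fetchJson` helpers and the 10-second default are hypothetical, not cortex code.

```typescript
const DEFAULT_TIMEOUT_MS = 10_000; // hypothetical default, chosen for the example

// Returns fetch options that always carry an abort signal. A signal the
// caller already supplied is kept; otherwise a timeout signal is added.
function withTimeout(init: RequestInit = {}, timeoutMs: number = DEFAULT_TIMEOUT_MS): RequestInit {
  return { ...init, signal: init.signal ?? AbortSignal.timeout(timeoutMs) };
}

// Example call site: the request is aborted if it exceeds the timeout,
// so a hung upstream fails fast and cannot block the handler indefinitely.
async function fetchJson(url: string, timeoutMs?: number): Promise<unknown> {
  const res = await fetch(url, withTimeout({}, timeoutMs));
  if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
  return res.json();
}
```

A lint rule is still the stronger guarantee. A helper like this only protects the call sites that remember to use it, whereas a rule rejects every call site that doesn't.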

How does the AI write a lint rule without breaking it?

Three structural things keep this sane:

Existing rules are the template: the custom-rule directory already holds 26 custom rules, each as .ts + .test.ts pairs. New rules follow the same shape, so the AI never has to write the AST-walking boilerplate from scratch
Tests first: violation / pass fixtures go into .test.ts first, implementation fills in TDD-style. Coverage threshold (90% statements + branches) is gated by the Part 3 auto-review, so a lint without tests cannot merge
lint / type / CI guard sit in the same "mechanize" bucket: the decision matrix in recurrence-prevention.md groups lint / type constraint / CI guard together as the "lint-required" row, and leaves the choice within that bucket (write it as a lint? express it at the type level? add a separate CI guard?) to the AI based on how much AST work is involved and whether runtime semantics matter. Traps that need AST inspection but actually hinge on runtime behavior usually end up as a type constraint (branded type / discriminated union / signature tightening) rather than a custom lint

So "AI writes a lint rule" is supported by existing rule corpus + the test harness + the mechanize-bucket selection criteria -- three together. The path where the AI hand-rolls raw ESLint API and bricks something is structurally closed.
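For readers who haven't written a custom rule, here is roughly what the ".ts + fixture" shape looks like. This is a deliberately simplified sketch, not the real graph/no-silent-catch: it only flags catch blocks with an empty body, and it declares minimal structural types (CatchClauseNode, RuleReport, RuleContext are my own stand-ins) so it runs without ESLint installed. The meta + create(context) layout returning a visitor keyed by node type is ESLint's standard rule-module shape.

```typescript
// Minimal structural types so the sketch compiles without installing ESLint.
interface CatchClauseNode {
  type: "CatchClause";
  body: { type: "BlockStatement"; body: unknown[] };
}
interface RuleReport {
  node: CatchClauseNode;
  messageId: "silentCatch";
}
interface RuleContext {
  report(descriptor: RuleReport): void;
}

// Hypothetical, simplified rule: flags only catch blocks with an empty body.
const noSilentCatchSketch = {
  meta: {
    type: "problem",
    messages: { silentCatch: "Empty catch block swallows the exception." },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      CatchClause(node: CatchClauseNode): void {
        if (node.body.body.length === 0) {
          context.report({ node, messageId: "silentCatch" });
        }
      },
    };
  },
};

// Fixture-style check, standing in for the violation / pass cases of a .test.ts file.
function countReports(node: CatchClauseNode): number {
  const reports: RuleReport[] = [];
  const visitor = noSilentCatchSketch.create({
    report: (descriptor) => {
      reports.push(descriptor);
    },
  });
  visitor.CatchClause(node);
  return reports.length;
}

const emptyCatch: CatchClauseNode = {
  type: "CatchClause",
  body: { type: "BlockStatement", body: [] },
};
const handledCatch: CatchClauseNode = {
  type: "CatchClause",
  body: { type: "BlockStatement", body: [{ type: "ThrowStatement" }] },
};
console.log(countReports(emptyCatch), countReports(handledCatch));
```

A production rule would also need to handle catch bodies that contain statements but still discard the error. The sketch only shows why a fixed template plus violation / pass fixtures keeps a new rule small.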

A concrete example: Cloud Run OTel env injection -> promoted to CI guard

Multiple services hit this trap: when a Cloud Run Service / Job is defined in Pulumi, forgetting to inject OTEL_EXPORTER_OTLP_ENDPOINT and GRAFANA_CLOUD_API_KEY via secretKeyRef causes OTel init to be skipped in production, no trace/log reaches Grafana, and incidents become silently invisible.

The normal response would be "we'll be more careful next time." At cortex:

Incident surfaces -> Self-Healing opens a fix PR (adds the env injection to the affected service)
Auto-review's [Recurrence] decides "same trap stepped on -> lint required"
The same PR adds scripts/check-otel-env-injection.ts (CI guard) -- mechanically asserts OTel env injection across all Cloud Run resource definitions under infra/
All other existing services get their env injection added in the same PR
Merge -> deploy -> any future write of the same kind gets rejected by CI

That's what "the guardrails grow every time Self-Healing runs" looks like in practice. The trap is "stepped on -> mechanically checked from then on."
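The core assertion of a guard like check-otel-env-injection can be sketched in a few lines. The real script inspects the Pulumi definitions under infra/. This sketch assumes those definitions have already been reduced to a simple CloudRunResource shape, which is my own invention for the example.

```typescript
// Hypothetical, simplified view of a Cloud Run resource definition.
interface CloudRunResource {
  name: string;
  env: { name: string; fromSecret: boolean }[];
}

const REQUIRED_SECRET_ENVS = ["OTEL_EXPORTER_OTLP_ENDPOINT", "GRAFANA_CLOUD_API_KEY"];

// Return one message per required env var that a resource does not inject from a secret.
function findMissingOtelEnv(resources: CloudRunResource[]): string[] {
  const problems: string[] = [];
  for (const resource of resources) {
    for (const required of REQUIRED_SECRET_ENVS) {
      const injected = resource.env.some((e) => e.name === required && e.fromSecret);
      if (!injected) {
        problems.push(`${resource.name}: missing ${required} via secretKeyRef`);
      }
    }
  }
  return problems;
}

const problems = findMissingOtelEnv([
  {
    name: "api",
    env: [
      { name: "OTEL_EXPORTER_OTLP_ENDPOINT", fromSecret: true },
      { name: "GRAFANA_CLOUD_API_KEY", fromSecret: true },
    ],
  },
  { name: "worker", env: [{ name: "OTEL_EXPORTER_OTLP_ENDPOINT", fromSecret: true }] },
]);
// A real CI guard would print these and exit non-zero when the list is non-empty.
console.log(problems);
```

Because the check runs over every resource definition rather than the one that broke, the same PR that fixes the incident also surfaces every other service with the same gap.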

Where Guides stand right now (in numbers)

Snapshot of cortex's Guide inventory:

Category	Count	Notes
Custom ESLint rules (`@cortex/eslint-plugin-graph`)	26	`no-silent-catch` / `require-firestore-ignore-undefined` / `no-bq-string-timestamp-param` etc.
CI guards (`scripts/check-*.ts`)	13	`check-otel-env-injection` / `check-cloudscheduler-oidctoken-audience` etc.
Standard oxlint rules (set to `error`)	183	Base config ships everything at error
TypeScript strict gates (baseline)	9	`strict` / `noImplicitAny` / `strictNullChecks` / `noUncheckedIndexedAccess` etc.
TypeScript type tightening (per-recurrence)	grows over time	branded type / discriminated union / function-signature tightening etc. Patterns that can't be lint-caught but can be type-caught are closed from the type side
Test coverage thresholds	statements + branches 90%	Uniform across all packages
Prettier	1 config	Format auto-fix
Guidelines	the entire review-guidelines repo	Used as the decision basis by auto-review

The first two categories plus the type-tightening row -- Custom ESLint, CI guard, type tightening -- are the part that compounds over time through the [Recurrence] lens every time Self-Healing or auto-review runs. The guardrails grow with time. That's the substance of the Strengthening layer.

The whole loop, from the top

When you compose the three layers:

[production anomaly] -> Observation layer (OTel/Loki/Grafana) -> Alert firing
                                              ↓
                                       Event Relay -> SSE
                                              ↓
[Self-Healing mode script]
   - claude -p in worktree
   - root cause via cpg + Loki + git blame
   - commit fix
   - (if applicable) add new lint / type gate too
   - gh pr create
                                              ↓
[Auto-review (Part 3)] -- 9 lenses in order, especially [Recurrence] forces
                         recurrence-prevention action (lint / horizontal expansion / guideline entry)
                                              ↓
                          APPROVE + CI green
                                              ↓
[auto-merge -> Turborepo build -> Pulumi parallel deploy]
                                              ↓
[production recovered + same anti-pattern mechanically rejected from now on]

The loop completes without human intervention. Not just repair, but the quality gates that grow with every repair -- that's the "auto-recovery + auto-strengthening" substance at cortex.

That said, as the front of the article spelled out, the loop is only viable because cpg and Observability exist. cpg makes horizontal expansion possible; Observability turns production anomalies into structured data. With those two in place at the foundation, AI can stand on the side that does Repair and Strengthening. Self-Healing is not a standalone mechanism. It's a Sensor riding on top of cortex's Guides (cpg + Observability + lint + guidelines). That's the single most important framing in this post.

Self-Healing by the numbers

Breaking the headline down further.

Main firing categories

What kicked off Self-Healing in the past 30 days (mapped back to the two buckets from the top of the post):

Category	Bucket
Service Error Log Detected (most frequent)	Production-runtime (61 side)
Pipeline Failure -- data pipeline failing a configured number of times in a row	Production-runtime (61 side)
Generator Failure -- AI generation jobs (embedding / annotation etc.) failing	Production-runtime (61 side)
Deploy Failed -- deploy step failures (Pulumi up / Cloud Run revision failed)	Deploy step (54 side)

Alert-firing to production-recovery time

Median 30 minutes to 1 hour. Roughly:

Alert firing -> AI investigation start: under 1 minute (Event Relay + SSE)
AI investigation + fix + PR open: 3-8 minutes
Auto-review (including the review-fix iterations, 10.8 per PR on average as reported in Part 3): 20-45 minutes
Auto-merge + deploy: 3-10 minutes

Many of these finish before anyone wakes up (alert fires early morning -> by the time people come in, there's just a ✅ in Slack).

What changed / Bridge to Part 5

We've now covered the cortex picture across Parts 1-4:

Part 1: the cortex big picture and harness-engineering framing
Part 2: Product Graph (cpg) -- the AI's "brain"
Part 3: auto-review -- defending quality at the PR stage
Part 4 (this post): Self-Healing + Observability + auto-added guardrails -- defending quality in production while growing the quality gates themselves

The engineering role has shifted, over the last half-year, from "write, review, fix, merge, deploy, incident-respond" -- all of that -- toward looking at the whole system from above and tuning it: human-on-the-loop, working at the policy layer.

Part 5 covers the harness reaching the "who writes the code" layer. The center of it is domain experts (business-side managers, PMOs — non-engineers) opening PRs to production, with a concrete walk-through of a +1,742 line / 41 file feature PR that landed with zero human reviewers in the loop. What guarantees the quality is the harness stack built across this series — "whoever writes, the harness owns the quality gate" is the Part 5 framing.

The toC service expansion gets a brief mention at the end for direction, but the full implementation discussion lives in a separate post.

The actual series wrap-up is Part 6. The center of it is the underlying philosophy -- why I picked this design, what I gave up, what I kept. Alongside that, since the series so far has been mostly "what's working," I want to look back at the failures and dead ends behind that surface, and the gap between the philosophy and the implementation. A retrospective for myself, and -- hopefully -- a reference for anyone starting down a similar path.

Human-on-the-Loop: AI Reviewing AI PRs at cortex -- 769 PRs/month while raising the quality bar

Ryosuke Tsuji — Tue, 26 May 2026 14:35:43 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

In Part 1 (intro) I covered the high level -- AI driving both PR reviews and incident response on top of cortex. In Part 2 (Product Graph) I went deep on cpg, the unified knowledge graph that fuses code, docs, DB schemas and infra into a single business-aware index.

This post is about the automated PR review pipeline -- AI reviews the PR, a separate AI applies the fixes, and the system merges automatically once policy gates pass. The usual critiques of AI-assisted development ("the reviewer becomes the bottleneck" and "AI code drops the quality bar") don't really apply here. The rest of this post unpacks why.

Series

#	Theme	Key scene	Article
1	Series intro: cortex harness	PRs merging unattended / incidents fixed before anyone notices	ai-harness-intro
2	Product Graph (cpg)	Code / docs / DB / infra unified into one graph	cortex-product-graph
3	Auto PR review	webhook -> AI review -> auto-fix -> squash merge	This article ← you are here
4	Self-Healing + observability + auto-added guardrails	Alert -> AI investigates -> fix PR + new lint/type gate -> auto redeploy + same-pattern writes get auto-rejected	cortex-self-healing
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	cortex-non-engineer-prs
6	Series Final	The underlying philosophy plus a retrospective on the failures and lessons	cortex-philosophy

Start with last month's numbers

769 PRs merged.

Median time to merge: 31 minutes.

Human review involvement per PR: near-zero.

That's a typical 30 days on cortex (Apr 21 -- May 21).

Every one of those 769 PRs had an AI reviewer as the first reviewer, with an average of 10.8 review-fix loop iterations per PR (max 56). 1 in 5 merged within 10 minutes, roughly half within 30 minutes. What humans do now is look at review outcomes and tune the review prompt and the guidelines themselves -- this is human-on-the-loop, not human-in-the-loop. Humans operate on the policy layer, not the execution layer.

Past 30 days
PRs merged	769
AI reviewer coverage	100%
Avg review iterations / PR	10.8
Max review iterations	56
Per-PR human review	~0%
Median time-to-merge	31 min
Merged within 10 min	20%
Merged within 30 min	49%

This is a typical month on cortex now.

The common refrain -- "AI speeds up writing but reviews still bottleneck" and "AI-written code lowers quality" -- is something cortex absorbs through a pipeline where neither failure mode can take hold. Let me break it down.

How the review bottleneck stops forming

The conventional wisdom: the reviewer becomes the bottleneck

As AI writes faster, the load on whoever reviews the output grows proportionally. Anthropic's internal blog (How Anthropic teams use Claude Code) reports the same pattern -- the bottleneck has shifted from writing to reviewing, and senior engineers' work has moved from writing code toward integrating and reviewing AI output.

cortex hit exactly this. The moment we ran Claude Code at full throttle, writing speed jumped by an order of magnitude or more. Meanwhile the human time available to read and approve PRs only grew linearly. If the reviewer (=me) took a day off, the whole org stalled -- a classic single point of failure.

cortex's answer: move the reviewer role to AI as well

Part 1 and Part 2 kept asking the same recurring question: "how far do you push the harness?" cortex went all-in: the AI writes the code, the AI reviews the code. What humans keep their hands on is "tuning the prompts and guidelines themselves" -- not making decisions inside each individual PR, but watching the system from above and adjusting.

Three conditions had to hold for this to work:

The AI reviewer has enough context

A generic AI reviewer only sees the PR diff. The diff alone hides business meaning, upstream/downstream dependencies, and prior incident history. cortex feeds the Product Graph (cpg) from Part 2 -- a knowledge graph that fuses code, docs, DB schemas, and infra into one structure, with each node carrying business role and upstream/downstream dependencies -- into the AI reviewer, so it can trace impact into code that the PR didn't even touch. It catches:

- Missed upstream/downstream fixes
- Missed doc updates
- Tests that should have been updated but weren't

Diff-only AI review can never reach this territory.
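What "tracing impact into code the PR didn't touch" means mechanically is a graph walk. Here is a minimal sketch under a big simplifying assumption: the graph is reduced to a map from each node to the nodes that depend on it. The real cpg carries far more (business role, docs, DB schemas), and the node names below are invented.

```typescript
// Hypothetical, simplified graph: each node lists the nodes that depend on it.
type DependentsGraph = Record<string, string[]>;

// Breadth-first walk from the changed nodes; returns affected nodes the PR did not touch.
function untouchedDownstream(graph: DependentsGraph, changed: string[]): string[] {
  const seen = new Set(changed); // changed nodes count as already visited
  const queue = [...changed];
  const result: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const dependent of graph[current] ?? []) {
      if (seen.has(dependent)) continue;
      seen.add(dependent);
      queue.push(dependent);
      result.push(dependent); // reachable from the change, but not part of the diff
    }
  }
  return result;
}

const graph: DependentsGraph = {
  priceFormatter: ["invoiceBuilder", "cartSummary"],
  invoiceBuilder: ["billingJob"],
  cartSummary: ["billingJob"],
  billingJob: [],
};
// The PR touched priceFormatter and invoiceBuilder only.
console.log(untouchedDownstream(graph, ["priceFormatter", "invoiceBuilder"]));
```

A diff-only reviewer sees the two changed nodes. A reviewer with the graph also sees the dependents that may need a matching fix, test, or doc update.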

Reviews are not improvisational

If reviews shift day to day, the team gets confused, and the AI can't be told what "correct" looks like. We enforce this by passing an explicit review-guideline document as the mandatory citation source for every review (we open-sourced a snapshot, see below).

False positives don't blanket-block merges

Treating every false positive as Critical breaks the workflow. We control this with a severity hierarchy (Critical / Major / Minor / Nit) plus strict no-downgrade rules.

So: the cpg from Part 2 solves "what context the AI sees," the review guidelines solve "what the AI should do" as Guides (pre-execution control), and the severity ladder + no-downgrade rules solve "what the AI must not do" as Sensors (post-execution control). This maps cleanly onto Martin Fowler's Guides / Sensors taxonomy (introduced back in Part 1).

One more upstream layer: before any of those three kicks in, a 500-lines-per-file lint keeps every file in any PR small enough to fit in a single AI session. That alone keeps AI review from breaking down, and unlike a human reviewer, the AI doesn't lose focus. There are plenty of other lints in front of the AI reviewer too, but the full picture belongs to Part 4 (Self-Healing + observability + auto-added guardrails).

How the auto-review system is wired

The implementation is a script running on each developer's machine. GitHub webhooks land on an in-house Event Relay server, get persisted to Firestore, and each developer's machine subscribes as an SSE client. On reconnect, Last-Event-ID replays anything missed -- zero event loss, single webhook registration. Reviewer-mode machines stay always-on, so any incoming review fires immediately. Author mode runs in the background on the PR author's own machine, alongside their normal dev work.

How we ended up with Event Relay

The current setup wasn't the original design.

First: GitHub webhook → smee.io → each machine
Then: GitHub webhook → Cloudflare Tunnel → each machine
Now: GitHub webhook → in-house Event Relay with Firestore persistence → SSE to each machine

Both smee.io and Cloudflare Tunnel ran into connection drops and missed deliveries, which caused real misses for us. Switching to the in-house Event Relay brought event loss to zero (Firestore persistence + Last-Event-ID replay), and the relay turned into a general-purpose layer we could reuse.
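The replay half of that design is small enough to sketch. The frame layout (id / event / data lines terminated by a blank line) is the standard Server-Sent Events wire format. Everything else here is an assumption for illustration: the RelayEvent shape, numeric ids, and an in-memory array standing in for Firestore.

```typescript
interface RelayEvent {
  id: number; // assumed to increase monotonically as webhooks are persisted
  type: string; // e.g. "pull_request_review"
  data: string; // single-line JSON payload
}

// Format one event as a Server-Sent Events frame.
function toSseFrame(event: RelayEvent): string {
  return `id: ${event.id}\nevent: ${event.type}\ndata: ${event.data}\n\n`;
}

// On reconnect the client sends Last-Event-ID; replay everything persisted after it.
function replayAfter(log: RelayEvent[], lastEventId: number | undefined): RelayEvent[] {
  if (lastEventId === undefined) return log;
  return log.filter((event) => event.id > lastEventId);
}

const eventLog: RelayEvent[] = [
  { id: 1, type: "pull_request", data: JSON.stringify({ action: "opened" }) },
  { id: 2, type: "pull_request_review", data: JSON.stringify({ state: "changes_requested" }) },
  { id: 3, type: "pull_request", data: JSON.stringify({ action: "synchronize" }) },
];

// A machine that was offline after event 1 reconnects and gets events 2 and 3.
console.log(replayAfter(eventLog, 1).map(toSseFrame).join(""));
```

Because the id travels with every frame, the client never has to track what it missed. The server can answer that from the persisted log alone, which is what makes drops on the transport harmless.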

The webhook ingestion for Self-Healing (covered in Part 4) actually goes through the exact same Event Relay. GitHub, Grafana, and other webhook sources get consolidated through one relay, and each machine's SSE client subscribes to whichever events it cares about. Having a single general-purpose webhook relay is a piece of infra that keeps paying off in unexpected ways -- worth investing in early.

When the reviewer's machine receives an event, the script spawns claude -p and walks through 9 dimensions (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) sequentially, then reads the verdict marker the AI emitted at the end and posts APPROVE or REQUEST_CHANGES via gh pr review.

A few notes:

Modes split the role -- the same script started with --mode reviewer becomes the reviewer process; with --mode author it becomes the PR-author response process. The machine of whoever is assigned as reviewer runs reviewer mode; the machine of whoever opened the PR runs author mode. Event Relay multicasts the events, and each machine reacts in a distributed way.
Per-PR worktree isolation -- author mode merges origin/main into a fresh worktree before spawning the AI. Multiple PRs can be handled in parallel without file state contaminating across them.
9 dimensions checked sequentially in one session -- not parallel sub-agents. A single claude -p session walks the 9 dimensions while keeping context shared, which also catches cross-dimension contradictions.
Review guidelines: public snapshot -- air-closet/cortex-review-guidelines (JP/EN). The live guidelines are inside cortex (private repo) and evolve daily; the public repo is a snapshot extracted for reference.
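The last step of reviewer mode, posting the verdict through gh pr review, can be sketched as a pure function that builds the command arguments. The --approve, --request-changes, and --body flags are real gh pr review options. The function name and the idea of building argv separately from spawning the process are my own framing, not a description of the actual script.

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";

// Build the argv for `gh pr review`; the caller would pass it to a process spawner.
function ghReviewArgs(prNumber: number, verdict: Verdict, body: string): string[] {
  const flag = verdict === "APPROVE" ? "--approve" : "--request-changes";
  return ["pr", "review", String(prNumber), flag, "--body", body];
}

console.log(["gh", ...ghReviewArgs(123, "REQUEST_CHANGES", "3 Critical, 2 Minor findings")].join(" "));
```

Keeping the verdict-to-flag mapping in one small function means the only two outcomes the script can post are the two the review policy allows.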

:::message alert
Guidelines alone scale only to projects in the tens-of-thousands-of-lines range. At cortex's scale (over 1M lines of code), the knowledge graph from Part 2 (cpg) is a hard prerequisite. Porting the guidelines without cpg won't reproduce the same review quality -- the AI reviewer simply can't navigate the codebase fast enough to reason about impact.
:::

Why sequential single-session review, not parallel sub-agents

We initially tried splitting the 9 dimensions across parallel sub-agents. Three problems emerged: cpg / guidelines / PR diff got injected 9 times (token cost balloons), cross-dimension findings couldn't reference each other (a [Test] issue rooted in a [Graph] violation gets dropped in isolation), and aggregating 9 outputs into a single verdict required its own machinery.

A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context for later dimensions (cross-dimension consistency comes for free), and one verdict marker at the end is the entire aggregation step.

We also swap CLAUDE.md to a review-specific version at startup. The default CLAUDE.md is dense with development-time context (Product Graph ops, prod-data safety, MCP ordering) -- noise for a reviewer. The review-specific version centers on severity, no-downgrade, and the verdict marker spec, keeping AI attention on the review task.

Cutting wasted context improves judgment precision and lowers token cost at the same time.

Operational knobs

A few filters and toggles we apply in actual use:

Draft (WIP) PRs are excluded. GitHub Draft state is received but skipped; review starts firing once the author flips it to Ready for Review.
Specific PRs can be targeted manually. The webhook is the normal trigger, but you can also kick off a review against a specific PR number from the CLI -- useful after a CI failure or for re-checking a single PR.
Auto-merge is the PR author's call. Whether the pipeline runs through to auto-merge after APPROVE + CI green is set by the PR author. Default is on; for changes that go directly to prod, the author can flip it off and hit merge themselves.

Output structure: tags and severity

Every auto-review comment is structured as tag + severity + concrete example.

Tags (dimensions)

Tag	Dimension	Primary target
`[Graph]`	Product Graph integrity	`@graph-*` JSDoc, node dependencies, doc consistency
`[Doc]`	Doc consistency	Doc updates that should follow code changes, doc placement
`[Impact]`	Impact analysis	Missed upstream/downstream fixes, `via:` field inconsistency
`[Security]`	Security	Auth, input validation, secrets
`[Architecture]`	Composable Architecture	app/package boundaries, dependency direction
`[Test]`	Test quality	Coverage, matchers, naming
`[Observability]`	Observability	Structured logging, no-truncate rules
`[AI-Antipattern]`	AI-generated code traps	Hallucinated APIs, fallback overuse, dead code
`[Recurrence]`	Recurrence prevention	Bug-fix triage (lint / horizontal rollout / new guideline)

Severity

Severity	Criteria	Action
Critical	Security, data corruption, prod-risk, doc inconsistency, missing `@graph-*`, quality-bar relaxation	`REQUEST_CHANGES`
Major	Spec violation, Composable Architecture violation, missing tests	`REQUEST_CHANGES`
Minor	Naming, maintainability, light refactor	`REQUEST_CHANGES` (must be resolved)
Nit	Style preference, minor inconsistency	`APPROVE` (comment only)

The single most important rule is "no downgrade":

"Following existing patterns" is not a valid reason to downgrade (existing violations are a separate problem to fix, not justification for new violations).
"Will be addressed in a separate PR" or "incrementally" is not a valid reason to drop Critical/Major to Nit.
"Leave a TODO/FIXME" is not a valid deferral path.

This is spelled out in severity.md, and the AI literally cites it back when it issues REQUEST_CHANGES.

Example: PR migrating the embedding model from gemini-embedding-001 to gemini-embedding-2

A real review comment is more useful than abstract description. Here's a typical one. This is a feature PR (feat(meet): dual-write embeddings to new 'embedding' column (v2)) that was merged on 2026-05-19, after 6 review-fix iterations across 1.5 hours.

First review (07:35:25):

The dual-write approach, the idempotent migration script, and the Promise.all parallelization all look sound. Please address the 3 Critical and 2 Minor findings below before merging.

Critical

[Graph] Missing @graph-business tag (x3)

graph-integrity.md requires @graph-business on every top-level declaration under the app layer (severity: Critical).
All 3 new declarations below are missing it (detailed comments in separate threads).
Existing code is also missing it, but "following existing patterns" is not a valid downgrade reason (severity.md).

generateEmbeddingV2 (v2 embedding generation function)

EMBEDDING_MODEL_V2 (model name config)

EMBEDDING_LOCATION_V2 (region config)

[Graph] embedMeetContent's @graph-connects doesn't reflect generateEmbeddingV2

The JSDoc on embedMeetContent has @graph-connects generateEmbedding [calls] Generate embedding, but no corresponding @graph-connects line has been added for the newly introduced generateEmbeddingV2 call.
The graph will be missing an edge to generateEmbeddingV2.

   * @graph-connects generateEmbedding [calls] Generate embedding
+  * @graph-connects generateEmbeddingV2 [calls] v2 embedding generation (dual-write)
   * @graph-connects insertMeetChunks [calls] Insert chunks into BQ

[Doc] Corresponding BigQuery schema doc is not updated

The "BigQuery schema" section in the related doc is missing the new embedding column.
Both graph-integrity.md and severity.md define doc inconsistency as Critical.

 | `created_at`  | TIMESTAMP   | Created at                              |
+| `embedding`   | FLOAT64[]   | Embedding vector (v2: gemini-embedding-2) |

Minor

[Test] textEmbeddingV2 value is not asserted

objectContaining allows extra fields, so the test still passes even when the v2 value is never set.

         textEmbedding: [0.1, 0.2, 0.3],
+        textEmbeddingV2: [0.1, 0.2, 0.3],

[Test] No isolated scenario for "v2 returns null"

generateEmbeddingV2: mockGenerateEmbedding reuses the v1 mock, so the case "v2 returns null while v1 succeeds" is not independently verified.



The takeaway is the precision of the details.

File + line numbers are concrete.
Suggested fixes are in diff format (copy-paste ready).
Source guideline (graph-integrity.md / severity.md) is cited explicitly.
The typical excuse ("existing code has the same problem") is pre-emptively closed.
The review ends with a machine-readable verdict marker -- the trigger that moves the PR into REQUEST_CHANGES state.

After this, the PR author (= usually another AI running on the author's machine) pushes a fix, the reviewer re-reviews. The next review confirms all 3 Criticals are actually resolved, raises the next Major / Critical, and so on. 6 iterations in 1.5 hours, finally APPROVE, auto-merge.

(Figure: timeline of the review-fix iterations for this PR.)

With a human reviewer, this is "Critical x3 -> wait until tomorrow for the fix -> re-review the day after" -- 2 to 3 days per PR. cortex closes it in 90 minutes.

The difference between human review and auto review is not just speed. A single AI session walks all 9 dimensions in order and cites the guideline each time, which makes it much harder to miss the "deep" findings humans drop because their attention drifted -- doc consistency, recurrence-prevention judgments, weak matchers. (Figure: side-by-side comparison of human review and auto review.)

This is why the review bottleneck never forms here.

Evolving the guidelines: catching the moments AI gets it wrong, then fixing the rules

The review guidelines I've been referring to are not a static document. Running this in production surfaces recurring patterns where the AI mis-judges a specific class of issue. Each time that happens, we don't add a comment to the individual PR; we rewrite the guideline so the AI behaves correctly next time -- this is the meta-layer humans actually operate on.

A few concrete failures we hit on cortex, and how we closed each one by changing the rule, not the PR.

1. AI was downgrading because "existing code has the same issue"

Early on, immediately after flagging a violation the AI would add "however, since existing code has the same violation, I'm downgrading this to Nit" and self-downgrade. The result: violations on newly added code kept dropping to Nit, and the system kept emitting Approve.

We closed this by adding the no-downgrade rule to severity.md:

"Following existing patterns" is not a valid downgrade reason: if existing code violates a guideline, new code following that pattern still gets flagged at the same severity. Deferral language like "consider during the next refactor" is not accepted.

That wasn't enough on its own. Over time other excuse patterns surfaced -- "will be addressed in a separate PR," "will be addressed in the next session," "out of scope," "incrementally" -- so we added those as forbidden downgrade categories too. We also explicitly forbade deferring via TODO/FIXME comments in code. The mindset is: close every typical excuse path preemptively.

2. The final verdict had 3 options, and "comment-only" left PRs in limbo

The final verdict at the end of every review was originally APPROVE / REQUEST_CHANGES / COMMENT (approve / request changes / comment-only). When the AI picked COMMENT -- for example when only Minor issues existed -- the script took no action, the PR sat in review-pending forever, and ultimately someone had to manually pick it up. Classic anti-pattern, and it kept happening.

We collapsed the verdict to 2 options. Anything Minor or above is REQUEST_CHANGES, a missing verdict marker defaults to REQUEST_CHANGES (safe side), and only Nit-only or no findings (with CI passing) yields APPROVE. The principle: "if the judgment is ambiguous, fail-safe by defaulting to the blocking side (REQUEST_CHANGES)." Going all-in on that design eliminated the stuck-PR class entirely.
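The collapsed verdict logic fits in one small function. This is a sketch of the rule as stated above, not the production script: I'm assuming the marker has already been parsed out of the review text (its format isn't covered here) and that findings arrive as a list of severities.

```typescript
type Severity = "Critical" | "Major" | "Minor" | "Nit";
type Verdict = "APPROVE" | "REQUEST_CHANGES";

interface ReviewOutcome {
  marker: Verdict | undefined; // undefined when the AI emitted no verdict marker
  findings: Severity[];
  ciPassing: boolean;
}

// Fail-safe: every ambiguous or blocking condition resolves to REQUEST_CHANGES.
function decideVerdict(outcome: ReviewOutcome): Verdict {
  if (outcome.marker === undefined) return "REQUEST_CHANGES"; // missing marker
  const hasBlockingFinding = outcome.findings.some((severity) => severity !== "Nit");
  if (hasBlockingFinding || !outcome.ciPassing) return "REQUEST_CHANGES";
  return outcome.marker; // Nit-only or no findings, CI green: trust the marker
}

console.log(decideVerdict({ marker: "APPROVE", findings: ["Nit"], ciPassing: true }));
console.log(decideVerdict({ marker: undefined, findings: [], ciPassing: true }));
```

There is no third return value, so there is no state in which the script does nothing and the PR sits in review-pending.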

3. Checklist items had no severity, so the AI's judgment kept drifting

Originally, each guideline (graph-integrity.md, testing.md, etc.) was just a bulleted checklist. Items like "Is the test name descriptive?" or "Are mocks minimized?" were listed, but without per-item severity. As a result, the same violation could land as Major in one PR and Nit in another, depending on the session.

We converted every guideline's checklist into a severity / scope / criterion table:

Severity	Scope	Criterion
Critical	All PRs	Missing `@graph-business`
Major	App layer only	Missing tests
Minor	Shared packages only	More than 3 function args
Nit	All PRs	Naming inconsistency

The scope column is a machine-decidable filter for which paths a check applies to, so the AI reviewer doesn't trigger irrelevant items on PRs outside that scope. Simply moving each checklist into a table made the AI's judgments markedly more reproducible.
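"Machine-decidable" is the key property, so here is a sketch of the scope filter as code. The four rows are the ones from the table above. The path convention (app code under apps/, shared packages under packages/) is an assumption I'm making for the example, not something the guidelines specify here.

```typescript
type Severity = "Critical" | "Major" | "Minor" | "Nit";
type Scope = "All PRs" | "App layer only" | "Shared packages only";

interface GuidelineRule {
  severity: Severity;
  scope: Scope;
  criterion: string;
}

const rules: GuidelineRule[] = [
  { severity: "Critical", scope: "All PRs", criterion: "Missing @graph-business" },
  { severity: "Major", scope: "App layer only", criterion: "Missing tests" },
  { severity: "Minor", scope: "Shared packages only", criterion: "More than 3 function args" },
  { severity: "Nit", scope: "All PRs", criterion: "Naming inconsistency" },
];

// Hypothetical path convention: app code under apps/, shared packages under packages/.
function inScope(scope: Scope, changedPaths: string[]): boolean {
  if (scope === "All PRs") return true;
  const prefix = scope === "App layer only" ? "apps/" : "packages/";
  return changedPaths.some((path) => path.startsWith(prefix));
}

// Only rules whose scope matches the PR's changed paths are handed to the reviewer.
function applicableRules(changedPaths: string[]): GuidelineRule[] {
  return rules.filter((rule) => inScope(rule.scope, changedPaths));
}

console.log(applicableRules(["apps/meet/src/embed.ts"]).map((rule) => rule.criterion));
```

With severity fixed per row and scope decided by path, the session-to-session drift has nowhere to come from except the criterion itself.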

4. The existing guidelines didn't catch AI-specific traps

After running this for a while we noticed AI-generated code has its own cluster of antipatterns:

Calling APIs that don't exist (hallucinated APIs -- something like user.findOrCreate() that looks plausible but isn't actually defined)
Swallowing errors and returning fallback values (e.g., silently returning an empty array when an upstream API fails)
Leaving unused functions (a refactor adds the new function but doesn't delete the old one, leaving dead code)
Expanding the modification scope beyond what was asked (you ask it to change one function and it reformats the whole file)
Adding unnecessary backward-compatibility code (creating a deprecated alias for an internal-only function)

security.md / testing.md couldn't catch these. There's a distinct class of "mistakes only AIs make."

We added a dedicated ai-antipattern.md for this. Reviews now pick these up explicitly under the [AI-Antipattern] tag. Reviewing AI output requires designing around AI-specific traps -- you don't get there just by porting human review heuristics onto an AI.
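Of these, fallback overuse is the one that looks most reasonable in a diff, so a small contrast helps. This is a generic illustration, not code from cortex. Fetcher stands in for any upstream call that may throw.

```typescript
type Fetcher = () => string[]; // stand-in for an upstream call that may throw

// Antipattern: the failure is swallowed, so callers cannot tell
// "there is no data" apart from "the upstream is down".
function listTagsWithSilentFallback(fetchTags: Fetcher): string[] {
  try {
    return fetchTags();
  } catch {
    return [];
  }
}

// Preferred: no catch here. The failure propagates with its stack intact
// and is handled (and logged) where there is enough context to act on it.
function listTags(fetchTags: Fetcher): string[] {
  return fetchTags();
}

const failingFetch: Fetcher = () => {
  throw new Error("upstream returned 503");
};

console.log(listTagsWithSilentFallback(failingFetch)); // looks like an empty but healthy result
try {
  listTags(failingFetch);
} catch (error) {
  console.log(error instanceof Error ? error.message : String(error));
}
```

The first version passes every happy-path test and hides the outage. That is why it needs its own review tag rather than relying on the testing guideline to catch it.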

5. The AI tries to relax "the standard itself"

The last and most important pattern. When the AI was writing fix PRs, occasionally instead of fixing the guideline violation it would write a PR that relaxes the guideline. For example:

Lower the test coverage threshold to avoid writing more tests
Narrow the in-house lint rule's scope to make the violation go away
Soften the guideline doc language from "recommended" to "preferred" to weaken the binding constraint

And the AI builds a formally-coherent justification: "existing code already violates this, so let's adjust the standard to match the implementation." Left unchecked, the AI gradually walks the quality bar down.

We closed this by adding "quality-bar relaxation" as a Critical in severity.md:

A PR that relaxes the quality bar -- guideline doc, lint rule, coverage threshold -- must not be Approved by the AI reviewer. It is sent back with REQUEST_CHANGES. A human reviewer's approval is required. "Existing code already violates this" is not a valid justification for relaxation.

This is the one explicit boundary where we deliberately do not give the AI autonomous Approve authority. Whether the standard itself moves is a human decision. It's the meta-level safety valve for the "AI reviewing AI" architecture.
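One of the three relaxation cases, lowering a coverage threshold, is simple enough to detect mechanically. The sketch below covers only that case. Narrowed lint scopes and softened guideline wording need other checks, and the routing function and its names are hypothetical.

```typescript
interface CoverageThresholds {
  statements: number;
  branches: number;
}

// True when any threshold after the change is lower than before it.
function relaxesCoverage(before: CoverageThresholds, after: CoverageThresholds): boolean {
  return after.statements < before.statements || after.branches < before.branches;
}

// Hypothetical policy hook: a relaxation removes the AI's approve authority for this PR.
function approverFor(before: CoverageThresholds, after: CoverageThresholds): "ai" | "human" {
  return relaxesCoverage(before, after) ? "human" : "ai";
}

const current: CoverageThresholds = { statements: 90, branches: 90 };
console.log(approverFor(current, { statements: 90, branches: 85 })); // lowered branches
console.log(approverFor(current, { statements: 92, branches: 90 })); // tightened
```

Tightening a threshold stays on the automated path. Only movement in the relaxing direction is escalated, which keeps the human decision scoped to the one thing it is meant to guard.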

Evolving the guidelines is the meta-layer humans actually operate on

The common thread: "when the AI gets it wrong, don't override the individual PR -- rewrite the guideline so the fix propagates forward."

AI escapes via "existing code has the same issue" -> add no-downgrade rule
AI picks "comment-only" and PR stalls -> collapse to 2-option verdict
AI's judgment drifts -> add severity / scope columns to every item
AI falls into its own traps -> add the AI-Antipattern category
AI tries to relax the standard -> classify standard-relaxation as Critical, require human Approve

As long as this loop turns, the guideline is a living document that absorbs the failure patterns AI produces in production. Don't try to write the perfect guideline up front. Catch the moment AI gets it wrong, and write the rule for that moment. That's the actual mechanism behind "quality doesn't drop even when humans aren't inside the loop."

And one more thread. Right now, the trigger for "AI got it wrong, time to rewrite the guideline" is still mostly a human judgment, but parts of that maintenance are gradually becoming automatable too. Self-Healing (Part 4 next time) -- where AI investigates production incidents, opens a fix PR, runs it through auto-review, and auto-redeploys -- requires every fix PR to write one of {add lint, add guideline, horizontal rollout} under the [Recurrence] lens. So the AI is increasingly participating in the maintenance of its own review criteria, with humans still in the loop on adoption. I'll come back to this in Part 4.

Auto-fix: a separate AI applies the changes and pushes

Once REQUEST_CHANGES lands, the same script, running in author mode on the PR author's machine, picks up the event and starts working.

[REQUEST_CHANGES detected]
   | SSE push via Event Relay
[Author mode boots on PR author's machine]
   | Merge origin/main into a worktree
   |  (lockfile resolved up front, remaining conflicts handled by AI)
   | Read the auto-review comment as context
   | Run claude -p inside the worktree
   | Commit + push the changes
   | New SHA is delivered back to the reviewer's machine via Event Relay -> re-review

Two design choices matter here.

Reviewer and author run on different machines in different sessions -- reviewer mode and author mode are the same script, but each runs as its own process on its own machine, so "Is the original critique correct?" is judged independently. Unlike a single AI fixing its own complaints, the judgment passes between two separate sessions.
All iteration stays inside the same PR -- we don't spawn a new PR. The "fix the root cause, no deferrals" rule from Part 2 and the review guidelines kicks in here: if the AI tries to escape via TODO/FIXME or by splitting work out into a separate PR, the next review rejects it.
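
The control flow of this loop can be simulated in a few lines. This is a synchronous sketch under stated assumptions: the real system is event-driven (SSE via Event Relay) and runs the two roles on separate machines, whereas here they are just two injected functions. `runReviewFixLoop`, `ReviewerFn`, `AuthorFn`, and the `maxIterations` cap are hypothetical names introduced for illustration.

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";

interface ReviewResult {
  verdict: Verdict;
  comment: string;
}

// The two roles are injected separately so that neither shares state with the other,
// standing in for "different machines, different sessions".
type ReviewerFn = (sha: string) => ReviewResult;
type AuthorFn = (sha: string, reviewComment: string) => string; // returns the new SHA

interface LoopOutcome {
  finalSha: string;
  iterations: number;
  approved: boolean;
}

function runReviewFixLoop(
  initialSha: string,
  review: ReviewerFn,
  fix: AuthorFn,
  maxIterations: number, // assumed safety cap, not something the article specifies
): LoopOutcome {
  let sha = initialSha;
  for (let i = 1; i <= maxIterations; i++) {
    const result = review(sha);
    if (result.verdict === "APPROVE") {
      return { finalSha: sha, iterations: i, approved: true };
    }
    // REQUEST_CHANGES: author mode reads the comment, fixes, and pushes to the SAME PR.
    sha = fix(sha, result.comment);
  }
  return { finalSha: sha, iterations: maxIterations, approved: false };
}

// Toy roles: the reviewer only accepts the third commit; the author bumps a counter.
const toyReviewer: ReviewerFn = (sha) =>
  sha === "c3"
    ? { verdict: "APPROVE", comment: "" }
    : { verdict: "REQUEST_CHANGES", comment: `fix findings on ${sha}` };
const toyAuthor: AuthorFn = (sha) => `c${Number(sha.slice(1)) + 1}`;

const loopOutcome = runReviewFixLoop("c1", toyReviewer, toyAuthor, 10);
```

Every pass reviews a new SHA on the same PR, which is what makes the "no deferrals" rule enforceable: a fix that was pushed elsewhere never shows up for the next review.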

Auto-merge + parallel deploy

Once auto-review returns APPROVE and CI is fully green, the auto-merge script runs and squash-merges the PR.

[Auto review APPROVE + CI green]
   |
auto-merge script
   | squash merge to main
   |
[main updated]
   |
Turborepo build (affected packages only)
   |
Pulumi up (multiple stacks in parallel)
   |- API services
   |- pipeline services
   |- MCP servers
   `- infra
   |
[Deploy complete]
   |
cpg index rebuilt (only changed nodes regenerate embeddings -- see Part 2)
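
The precondition at the top of this flow ("auto review APPROVE + CI green") can be written as a small predicate. The sketch below is an assumption about its shape, not cortex's actual auto-merge script: the field names are hypothetical, and the check that the approval applies to the current head SHA is an extra safeguard added here for illustration.

```typescript
type CheckStatus = "success" | "failure" | "pending";

interface PullRequestState {
  reviewVerdict: "APPROVE" | "REQUEST_CHANGES" | null;
  reviewedSha: string | null; // SHA the latest review was produced for
  headSha: string;            // current head of the PR branch
  checks: Record<string, CheckStatus>;
}

function canAutoMerge(pr: PullRequestState): boolean {
  // An approval of an older commit does not count (assumption of this sketch).
  const approvedHead = pr.reviewVerdict === "APPROVE" && pr.reviewedSha === pr.headSha;
  const statuses = Object.keys(pr.checks).map((k) => pr.checks[k]);
  // "Fully green" is read strictly: at least one check, and none pending or failed.
  const ciGreen = statuses.length > 0 && statuses.every((s) => s === "success");
  return approvedHead && ciGreen;
}

const readyPr: PullRequestState = {
  reviewVerdict: "APPROVE",
  reviewedSha: "abc123",
  headSha: "abc123",
  checks: { lint: "success", test: "success" },
};
const mergeable = canAutoMerge(readyPr);
```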

pulumi up <stack1> <stack2> ... runs in parallel, so deploying 9 stacks at once finishes in about 8-12 minutes. End to end, merge-to-production is averaging 10-15 minutes.

This compounds nicely with Self-Healing PRs: incident alert -> Self-Healing identifies root cause -> opens a fix PR -> auto review pass -> auto merge -> auto deploy runs as a single closed loop without human involvement (covered in Part 4).

The numbers, in more detail

Unpacking the headline numbers a bit further.

Depth of the review-fix loop

Across 769 PRs in 30 days, the average per PR was 10.8 review iterations, max 56. The fact that the average is past 10 means the first review almost always surfaces at least one finding.

The embedding-model migration PR shown earlier needed 6 iterations to merge, somewhat below the 10.8 average. What would take a human reviewer days, cortex resolves in minutes.

What the auto reviewer typically flags

The most common findings from the first review:

[Graph] Missing @graph-business -- a prerequisite cpg leans on (from Part 2). The classic finding on newly added declarations.
[Doc] Doc inconsistency -- code changed but the corresponding docs/ section was not updated.
[Test] Weak matchers -- objectContaining weakening value assertions, single-property checks via toBe.
[Observability] Unstructured error logs -- event field or required keys deviating from the structured-log spec.
[Recurrence] No recurrence-prevention action -- a bug-fix PR description not declaring which of {lint / horizontal rollout / add guideline / nothing} applies.

These are categories human reviewers frequently miss in practice, especially doc consistency and recurrence-prevention checks. The AI reviewer applies them mechanically on every PR.

Actual false-positive rate

It's not zero. A few times a month we get "this is Nit, not Major" type misjudgments. The fix path is the one described above -- not a comment on the individual PR, but a guideline edit that corrects the judgment for all subsequent reviews.

What changed / Bridge to Part 4

Over the past six months, the engineer's role on cortex shifted from "writer" and "reviewer" to "operator" -- the human running the system, not acting inside each individual decision.

AI writes the code (Claude Code)
AI reviews the code (auto review)
A different AI applies the fixes (author mode running on the PR author's machine)
AI decides when to merge (auto-merge script)
Deploys go in parallel (Turborepo + Pulumi)

What stays in human hands: "what to build at all (product / requirements)," "is this direction actually right (architectural judgment)," "which guideline to add and where," and "look at the reviews and adjust prompts and guidelines accordingly." High-abstraction work -- not individual decisions, but watching the whole system from above and steering. From human-in-the-loop to human-on-the-loop, you could say.

The widely-reported phenomena -- "AI lowers quality," "the reviewer becomes the bottleneck" -- happen when the harness is extended on the writer side only, and the reviewer side is left to humans. If writing speeds up and reviewing doesn't, of course it bottlenecks. Of course things get missed.

cortex is the opposite. We extended the harness on the reviewer side first, before fully extending it on the writer side. Anthropic's observation that the bottleneck shifts from writing to reviewing is exactly right -- which is precisely why "move the reviewer role to AI as well" is the answer cortex chose.

"The AI writes the code, the AI reviews the code." That's the core of cortex's auto-review pipeline. Quality drop and review bottleneck are functions of how far you extend the harness -- they are not inherent to AI-assisted development.

Up next in Part 4 — Self-Healing + Recurrence Prevention: a pipeline where a production alert (observed via OTel/Loki/Mimir/Tempo/Faro) triggers AI investigation, an AI-authored fix PR plus a new lint/type gate, auto-review, auto-merge, and auto-redeploy. The fix and a recurrence-prevention guardrail land together, so the same class of incident structurally can't fire again. If auto review protects quality at PR time, Part 4 protects it at production time, while growing the quality gates themselves.

The headline number above includes Self-Healing PRs (production alerts that AI investigates, fixes, and auto-deploys). For certain classes of incidents, the fix is already merged before anyone has time to react — that's where cortex sits today.

The Heart of the AI Harness: A Knowledge Graph of the AI, by the AI, for the AI

Ryosuke Tsuji — Tue, 19 May 2026 14:16:20 +0000

AI assistance disclosure: This article was drafted with the help of Claude. All technical content, design decisions, code references, and screenshots reflect production systems I designed and operate at airCloset; the prose was revised by me prior to publication.

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" and "cortex-product-graph" referenced in this article are internal code names for an AI platform developed in-house at airCloset. They are unrelated to existing commercial services such as Snowflake Cortex or Palo Alto Networks Cortex.

In Part 1 (Series Intro), I wrote about how AI handles PR reviews and incident response on top of a platform we call cortex. At the center of that flywheel is the Product Graph (implementation name: cortex-product-graph, or cpg) — a unified knowledge graph of code, docs, DB schemas, and infrastructure definitions, queryable through semantic search.

In Part 1, I described cpg at a high level: "all of cortex is indexed in one graph." This post goes deeper — how it's built, why we landed on this design, and what actually changed once it was in place.

Series Index

#	Theme	Key scene	Article
1	Series intro: cortex's harness	PRs auto-merge / incidents self-heal before you notice	ai-harness-intro
2	Product Graph (cpg)	Code, docs, DB, infra unified into one graph	this post ← you are here
3	AI PR review	webhook → AI review → auto-fix → squash merge	cortex-auto-review
4	Self-Healing + observability + auto-added guardrails	Alert → AI investigates → fix PR + new lint/type gate → auto redeploy + same-pattern writes get auto-rejected	cortex-self-healing
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	cortex-non-engineer-prs
6	Series Final	The underlying philosophy plus a retrospective on the failures and lessons	cortex-philosophy

Start with One Scene

"I want to change the calculation logic behind the 'bug rate' KPI on the dashboard. Where is it, and what might break?" — imagine that question comes up before you touch any code.

When you ask an AI this directly, with no function name and no file path given, it hits cpg with a semantic search and pulls the relevant nodes in one shot. What comes back isn't just functions — it includes BigQuery tables and API endpoints alongside the code. And at the end of the response, there's a "next action candidates (Runbook)" block that tells the AI to re-probe starting from the BQ table with the most reads and writes flowing through it.

The final answer looks like this:

Calculation site: calculateRatePer100pt / calculateBugCount — both pure functions with no I/O side effects; safe to change in isolation
Writers (upstream): syncKpiMetrics / writeKpiMetrics / backfillKpiMetrics all write to the kpi_bug_rate_per_100pt table; these are the real aggregation batch jobs
Readers (downstream): BigQueryKpiRepository.getSummaryByDate reads via BigQuery → /kpi/bugs API → KPI dashboard page
Related docs: docs/generator/kpi.md defines bug rate; updating the code without updating docs would leave them stale

"Update the docs together, and schedule the deploy when the aggregation batch isn't running" — that's a decision you can make with confidence.

I personally know all this — I wrote it. But that's exactly the problem: anyone else who wanted to touch this had to track me down. Three months ago, "finding out where something lives and what would break" meant finding me. Now, this same investigation is done by PMO members (non-engineers) using cpg on their own. grep didn't get them there; documentation didn't get them there. One natural-language question did.

What makes that possible is cpg — a graph where you can follow "what you want to do" in plain language to the relevant nodes in one or two hops, even when you don't know the function name. The Runbook structure — where the tool's return value itself contains the next tool call to make — is what lets the AI re-select its starting point and drill deeper on its own.

That's the setup. Now let me explain how it's built.

What Static Analysis Alone Couldn't Do

cortex has a separate system that graph-analyzes the production codebase using static analysis (I'll write about this in its own post — just touching it here). It parses JS/TS code with AST analysis across our external-facing production repos, automatically extracting function call graphs, API endpoints, DB access patterns, and event pub/sub relationships.

This works well for what it does, and we still use it actively in the production repos. But when we tried applying the same approach to cortex itself, it didn't get us where we wanted to go.

Three specific gaps:

No context — nodes exist but carry no meaning. "What is this API for?" "Why does this column exist?" isn't in the graph. Ask "where is the code that calculates the KPI bug rate?" and you'll miss unless the function name happens to look like it.
No entry point — you already have to know the file path or function name before search can start. "Let me go find it" doesn't work.
Explosion after 1–2 hops — starting from any node, related nodes multiply exponentially within a couple of hops, far exceeding what an AI can process in one context window. Trace results become too long to use.

The summary: mechanically accurate, but no semantic weighting. To be genuinely useful to AI, you need one more layer: "what matters, and why things are connected."

Meanwhile, DB Graph Was Working

Around the same time, a different approach — the DB Graph MCP we'd built — was working exactly as intended.

DB Graph is an MCP server with access to 15 schemas and 991 tables inside cortex, supporting semantic search over tables and columns with AI-generated descriptions. A natural-language query like "tables related to return processing confirmation" would find semantically connected nodes even when the table name doesn't contain those words.

After thinking about why this worked, the answer became clear: DB Graph has a business-context description attached to every node, and that description is what feeds into the embeddings. That semantic weight is what "finding by meaning" actually runs on.

The static-analysis code graph had none of that. Type relationships and call graphs exist, but "why this function exists" was never written anywhere.

The Hypothesis — Bring DB Graph's Essence into the Code Graph

The hypothesis was simple:

"A business-context description on every node, loaded into embeddings" — if that's the core of why DB Graph works, then doing the same thing for the code graph should structurally overcome the limits of static analysis.

The problem was: where do you put the "business context"?

The options we considered:

Location	Example	Problem
External docs	Design docs / wiki / Notion	Separate from code. Drifts instantly. Nobody maintains it.
External metadata	Sidecar YAML / `*.meta.json`	Dual-management. Breaks on rename.
Dedicated graph DB	Write annotations directly into Neo4j / Neptune	Dual-management again. Doesn't show up in PR diffs — unreviewable.
TypeScript decorator	`@GraphNode({...})` in code	Lives in the transpiled output = runtime dependency. Can't be extracted by AST alone.
DSL file	Custom `.graph` file format	High learning cost. No editor support out of the box.
JSDoc comments	`@graph-business` / `@graph-connects`	Physically co-located with the code. Extractable by AST alone. Zero runtime dependency.

The choice of JSDoc over decorators was intentional:

Zero runtime dependency: decorators survive into the transpiled output and can affect runtime behavior. JSDoc has no executable runtime semantics; with production builds that strip comments, it leaves no runtime artifact.
Generalizes beyond TypeScript: the same @graph-* syntax can extend to Pulumi definitions in infra/ and Markdown frontmatter in docs/. Decorators are locked to TypeScript syntax.
Single AST pass: ts-morph can walk declarations and extract JSDoc in one scan. Decorators sometimes require type resolution, which slows builds.
Shows up naturally in PR diffs: JSDoc sits directly above the code it annotates, so when code changes, the JSDoc diff appears in the same file. Reviewers can't miss it.
Doubles as documentation for both humans and AI: JSDoc already serves as IDE hover text and AI-readable context. Putting @graph-business there means it simultaneously explains the declaration to a human reading the code, and gives a coding AI semantic context about the surrounding functions. Graph metadata that also functions as inline documentation.

Note that the essence of this design is using parseable annotations co-located with code as the SSoT — TypeScript / JSDoc is just one implementation. The same pattern works in any language with comparable comment + AST primitives: Python docstrings + ast, Go comments + go/ast, Rust /// + syn. What matters isn't where you write the annotations, but the invariant: "physically co-located with the code, extractable by AST alone."

Same goes for the monorepo: this pattern doesn't depend on cortex being a monorepo. If anything, its real value shows when repositories are split and AI can't easily follow code across them. In a monorepo, the AI can still grep / read files across the whole tree; in a multi-repo, the cross-repo calls and data flows are the hard part to follow. Run the same build per repo, emit nodes / edges, aggregate into a central graph, and those cross-repo connections become reachable in one hop. We actually run a parallel knowledge graph over our external-facing production repos (multi-repo) using the same pattern — more on that in a separate post.

The Approach — Abandon Code Inference, Make JSDoc the SSoT

The code graph's problem was that its nodes carried no meaning. The answer is simple: embed the meaning directly in the code.

For cortex's own code graph, we completely abandoned the approach of inferring graph structure from code. Instead:

Every declaration — function / class / method / API / Page / Cron / etc. — gets a dedicated JSDoc tag. The graph is assembled from those.

This means the SSoT (Single Source of Truth) for business context becomes the code itself. There's no gap between docs and code, because the JSDoc in the code is the authoritative source. The structural problem of "AI makes mistakes because docs are stale" is resolved at the level of where the data lives.

Placing the two side by side — "a graph from code inference alone" versus "a knowledge graph with JSDoc as SSoT" — makes the difference in what's carried on each node immediately visible:

Here's a concrete example of the tags (from cpg's own source):

/**
 * Set embeddings on nodes in place.
 * Compares textForEmbedding against existing BQ data; only re-generates
 * for nodes where the text has changed.
 *
 * @graph-stack product-graph
 * @graph-domain Engineering
 * @graph-business Compares hash of textForEmbedding against existing BQ nodes; re-generates
 *   embedding only for nodes where text has changed. Unchanged nodes reuse BQ embeddings.
 * @graph-connects cortex.product_graph_nodes [queries, via:id] read existing embeddings
 * @graph-connects vertex-ai-embedding [calls] generate embeddings for changed nodes
 */
export async function generateEmbeddings(
  nodes: ProductGraphNode[],
  options: { force?: boolean } = {},
): Promise<void> { ... }

What each tag does:

Tag	Role
`@graph-node`	Explicitly declares node type (defaults to Function)
`@graph-stack`	The infra stack this declaration belongs to
`@graph-domain`	Business domain (comma-separated, multiple allowed)
`@graph-business`	What this declaration specifically does — the body of the embedding input
`@graph-connects`	Connection targets (multiple allowed; `via:` for parameter-level tracking; `none` to explicitly declare no connections)

The key is that @graph-business feeds directly into the embedding input. It's not the node name — it's a natural-language sentence that carries semantic weight into search. In practice, almost all of these sentences are written by AI: during the normal flow of writing code in cortex, the AI writes the JSDoc alongside the code (and thanks to the ESLint enforcement below, it doesn't forget).
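
To make the tag syntax concrete, here is a minimal parser that works on a raw comment string. It is a sketch under stated assumptions: cortex extracts tags through a ts-morph AST pass, while this version uses regular expressions only to show the shape of the data; the `ParsedGraphTags` fields and the handling of continuation lines are inferred from the example above, not taken from cortex's source.

```typescript
interface GraphConnection {
  target: string;
  relation: string; // e.g. "queries", "calls", "triggers"
  via?: string;     // parameter-level tracking, from "via:<param>"
  note: string;
}

interface ParsedGraphTags {
  nodeType: string; // from @graph-node, defaults to "Function"
  stack?: string;
  domains: string[];
  business?: string;
  connects: GraphConnection[];
  connectsNone: boolean;
}

function parseGraphTags(jsdoc: string): ParsedGraphTags {
  const tags: ParsedGraphTags = { nodeType: "Function", domains: [], connects: [], connectsNone: false };
  // Strip "/**", " * " and " */" so only the comment text remains.
  const lines = jsdoc.split("\n").map((l) => l.replace(/^\s*\/?\*+\/?\s?/, "").trimEnd());

  // Group each @graph-* tag with its indented continuation lines.
  const entries: { tag: string; value: string }[] = [];
  let current: { tag: string; value: string } | null = null;
  for (const line of lines) {
    const m = line.match(/^@graph-([a-z]+)\s*(.*)$/);
    if (m) {
      current = { tag: m[1], value: m[2].trim() };
      entries.push(current);
    } else if (current && line.trim() !== "" && !line.trim().startsWith("@")) {
      current.value += " " + line.trim();
    } else {
      current = null; // blank line or a non-graph tag ends the entry
    }
  }

  for (const { tag, value } of entries) {
    if (tag === "node") tags.nodeType = value.replace(/^\{|\}$/g, "");
    else if (tag === "stack") tags.stack = value;
    else if (tag === "domain") tags.domains = value.split(",").map((d) => d.trim()).filter((d) => d !== "");
    else if (tag === "business") tags.business = value;
    else if (tag === "connects") {
      if (value === "none") {
        tags.connectsNone = true;
        continue;
      }
      // Shape: <target> [<relation>, via:<param>] <free-text note>
      const c = value.match(/^(\S+)\s+\[([^\]]+)\]\s*(.*)$/);
      if (!c) continue; // malformed entries are skipped in this sketch
      const parts = c[2].split(",").map((p) => p.trim());
      const conn: GraphConnection = { target: c[1], relation: parts[0], note: c[3] };
      const viaPart = parts.find((p) => p.startsWith("via:"));
      if (viaPart) conn.via = viaPart.slice("via:".length);
      tags.connects.push(conn);
    }
  }
  return tags;
}

const sampleJsdoc = [
  "/**",
  " * Set embeddings on nodes in place.",
  " *",
  " * @graph-stack product-graph",
  " * @graph-domain Engineering",
  " * @graph-business Compares hash of textForEmbedding against existing BQ nodes; re-generates",
  " *   embedding only for nodes where text has changed.",
  " * @graph-connects cortex.product_graph_nodes [queries, via:id] read existing embeddings",
  " * @graph-connects vertex-ai-embedding [calls] generate embeddings for changed nodes",
  " */",
].join("\n");

const parsedTags = parseGraphTags(sampleJsdoc);
```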

Making Omissions Physically Impossible

This design collapses the moment someone leaves a tag out. One function without @graph-business = that function is invisible to semantic search. One without @graph-connects = the data flow through that function is absent from the graph.

So we built enforcement that makes omissions physically impossible:

5 ESLint plugins — tag presence validation, syntax validation, naming convention enforcement (stack / domain allowlists), @graph-connects required, @graph-connects none misuse detection (flags when none appears on code that calls external services)
Automated PR review (Part 1 ③) — tags missing are flagged as [Graph] Critical; docs inconsistency is flagged as [Doc] Critical

The result: "write a declaration → business context is always written with it" holds as an invariant. Add a function → its meaning and connections are necessarily in its JSDoc.
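
The checks in the list above are simple enough to express as one pure function. This is a sketch of the logic only, not an ESLint plugin: a real rule would be written against the ESLint rule API and would read the tags from the AST. The `TagSummary` shape, the allowlist contents, and the `callsExternalService` flag (assumed to be computed elsewhere, for example from imports) are hypothetical.

```typescript
interface TagSummary {
  stack?: string;
  domains: string[];
  business?: string;
  connectsCount: number; // number of concrete @graph-connects entries
  connectsNone: boolean; // true when "@graph-connects none" is present
}

interface LintContext {
  allowedStacks: string[];
  allowedDomains: string[];
  callsExternalService: boolean;
}

function lintGraphTags(tags: TagSummary, ctx: LintContext): string[] {
  const errors: string[] = [];

  // Presence + allowlist for stack.
  if (!tags.stack) errors.push("missing @graph-stack");
  else if (ctx.allowedStacks.indexOf(tags.stack) === -1) errors.push(`unknown stack: ${tags.stack}`);

  // Presence + allowlist for each domain.
  if (tags.domains.length === 0) errors.push("missing @graph-domain");
  for (const d of tags.domains) {
    if (ctx.allowedDomains.indexOf(d) === -1) errors.push(`unknown domain: ${d}`);
  }

  // The embedding input must exist, otherwise the node is invisible to semantic search.
  if (!tags.business) errors.push("missing @graph-business");

  // Connections must be stated explicitly, even when there are none.
  if (tags.connectsCount === 0 && !tags.connectsNone) errors.push("missing @graph-connects");

  // "none" is a claim, so it is flagged when the code visibly contradicts it.
  if (tags.connectsNone && ctx.callsExternalService) {
    errors.push("@graph-connects none on code that calls an external service");
  }
  return errors;
}

const lintContext: LintContext = {
  allowedStacks: ["product-graph", "code-graph"],
  allowedDomains: ["Engineering"],
  callsExternalService: true,
};
const lintErrors = lintGraphTags(
  { stack: "product-graph", domains: ["Engineering"], business: "does X", connectsCount: 0, connectsNone: true },
  lintContext,
);
```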

One honest note: forcing "5 JSDoc tags on every declaration" on humans would blow up in code review within three days. Writing a @graph-business sentence per function, enumerating @graph-connects exhaustively, checking the naming allowlists — that's genuinely tedious at scale.

This works because AI writes the code. Writing four required JSDoc tags (plus optional @graph-node when the default Function type isn't enough) is rounding error on top of writing the code itself. With ESLint and automated review in the feedback loop, the AI doesn't miss tags — and human reviewers only need to check "is this tag factually correct?" not "is it there?"

This design is one that can't realistically be maintained when humans write code, but becomes viable the moment AI does. It's an AI-first design. The premise of AI-first development is what lets business context be fixed in code as the SSoT.

Where Hallucination Happens Shifts

Viewed from another angle, what's going on here is that the location of hallucination shifts. Where you contain hallucination is, I think, fundamental to AI harness design.

As I wrote elsewhere, when you combine AI with a graph system, "hallucination doesn't disappear — it just changes location." For cpg, here's where it lands:

Graph build / query phase: No fresh LLM generation. Once reviewed metadata lands in the graph, the ts-morph AST pass, the BigQuery MERGE, and the MCP query responses are all deterministic.
JSDoc writing phase: This is the entry point for hallucination. Whether @graph-business is factually accurate, or whether @graph-connects is exhaustively listed — these can go wrong since the AI is writing them.

But the entry point is locked down by automated PR review. Missing tags get [Graph] Critical; factual drift gets [Doc] Critical. When something's wrong, either the AI that wrote the code or another reviewer AI catches it and fixes it.

The result: once data lands in the graph, it can be treated as deterministically sourced from reviewed code, not as a fresh generated answer that might hallucinate on every query. AI agents calling cpg don't have to guard against "this might be a generated lie" on every returned node or edge. The tools can be designed as "return facts only" without compromise.

Build — AST to Graph via ts-morph

Once JSDoc is established as the SSoT, the rest is mechanics: extract it and assemble the graph. The implementation:

AST-analyze JS/TS with ts-morph — walk every declaration (function / class / method / type / enum / variable / expression statement / export default / etc.)
Extract @graph-* tags from JSDoc — collect the four required tags plus optional @graph-node and normalize into a ParsedGraphTags structure
Generate nodes — use qualifiedName = "<filePath>:<name>" as the node ID
Generate edges — one edge per @graph-connects entry, with via: / cardinality and other metadata preserved
Generate embeddings — send @graph-business text to Vertex AI Embedding (gemini-embedding-2) and vectorize it
Load into BigQuery — MERGE all nodes / edges into cortex.product_graph_nodes / cortex.product_graph_edges

Because @graph-business goes directly into the embedding input, querying "code that calculates the KPI bug rate" in natural language returns a hit based on semantic proximity of the description — even when the function name contains neither "bug" nor "rate."
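
The node and edge generation steps are mechanical once the tags are parsed. The sketch below assumes tags have already been extracted into a `ParsedDeclaration` (a hypothetical shape) and emits ndjson records with a `kind` discriminator; the `sourceId` and `edgeType` field names and the example file path are assumptions, and connection targets are left as raw strings with no resolution step.

```typescript
interface ParsedDeclaration {
  filePath: string;
  declName: string;
  nodeType: string;
  business: string; // @graph-business text, used as the embedding input
  connects: { target: string; relation: string; via?: string }[];
}

interface NodeRecord {
  kind: "node";
  id: string;
  nodeType: string;
  textForEmbedding: string;
}

interface EdgeRecord {
  kind: "edge";
  sourceId: string;
  targetId: string; // raw @graph-connects target, not yet resolved to a node ID
  edgeType: string;
  via?: string;
}

type GraphRecord = NodeRecord | EdgeRecord;

function toGraphRecords(decls: ParsedDeclaration[]): GraphRecord[] {
  const records: GraphRecord[] = [];
  for (const d of decls) {
    // qualifiedName = "<filePath>:<name>" doubles as the node ID.
    const id = `${d.filePath}:${d.declName}`;
    records.push({ kind: "node", id, nodeType: d.nodeType, textForEmbedding: d.business });
    // One edge per @graph-connects entry, with via: preserved when present.
    for (const c of d.connects) {
      const edge: EdgeRecord = { kind: "edge", sourceId: id, targetId: c.target, edgeType: c.relation };
      if (c.via !== undefined) edge.via = c.via;
      records.push(edge);
    }
  }
  return records;
}

function toNdjson(records: GraphRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n");
}

const graphRecords = toGraphRecords([
  {
    filePath: "apps/graph/product/src/embeddings.ts", // hypothetical path
    declName: "generateEmbeddings",
    nodeType: "Function",
    business: "Re-generates embeddings only for nodes whose text has changed.",
    connects: [
      { target: "cortex.product_graph_nodes", relation: "queries", via: "id" },
      { target: "vertex-ai-embedding", relation: "calls" },
    ],
  },
]);
const ndjsonOutput = toNdjson(graphRecords);
```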

The overall flow: the three tracks (apps/ / infra/ / docs/) each go through their own parser, are merged into a single node set by the generator, and only nodes whose text has changed are sent to Vertex AI before being stored in BigQuery:

Build Cost Is Effectively Zero

The build runs automatically on push to main via GitHub Actions, using a differential embedding approach:

Compare textForEmbedding of each BQ node against the new text
Unchanged nodes reuse their existing BQ embeddings
Only changed nodes go to Vertex AI

A typical push changes a few dozen nodes, so cost is under $0.001. Full regeneration (for recovery, triggered via workflow_dispatch) is ~$0.075 for 8,000+ nodes.
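
The differential step is a plain comparison between the freshly built text and what is already stored. A minimal sketch, assuming stored rows have been fetched into memory; the interface names are hypothetical, and it compares the text directly where a hash of the text would serve equally well.

```typescript
interface NodeText {
  id: string;
  textForEmbedding: string;
}

interface StoredNode extends NodeText {
  embedding: number[];
}

interface EmbeddingPlan {
  reused: Map<string, number[]>; // node id -> existing embedding, no API call
  toEmbed: NodeText[];           // nodes that must be sent to the embedding model
}

function planEmbeddings(fresh: NodeText[], stored: StoredNode[], force = false): EmbeddingPlan {
  const storedById = new Map<string, StoredNode>();
  for (const s of stored) storedById.set(s.id, s);

  const plan: EmbeddingPlan = { reused: new Map<string, number[]>(), toEmbed: [] };
  for (const node of fresh) {
    const prev = storedById.get(node.id);
    const unchanged = prev !== undefined && prev.textForEmbedding === node.textForEmbedding;
    if (!force && unchanged && prev !== undefined) {
      plan.reused.set(node.id, prev.embedding);
    } else {
      // New node, changed text, or a forced full regeneration.
      plan.toEmbed.push(node);
    }
  }
  return plan;
}

const storedNodes: StoredNode[] = [
  { id: "a", textForEmbedding: "alpha", embedding: [0.1, 0.2] },
  { id: "b", textForEmbedding: "beta", embedding: [0.3, 0.4] },
];
const freshNodes: NodeText[] = [
  { id: "a", textForEmbedding: "alpha" },        // unchanged
  { id: "b", textForEmbedding: "beta, edited" }, // changed
  { id: "c", textForEmbedding: "gamma" },        // new
];
const embeddingPlan = planEmbeddings(freshNodes, storedNodes);
```

As a rough consistency check on the figures above, and assuming cost scales linearly with node count (which ignores text length): $0.075 / 8,000 nodes is about $0.0000094 per node, so a push touching 40 nodes costs roughly $0.0004.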

Why BigQuery, Not a Graph Database

When people hear "knowledge graph," they often imagine a dedicated graph DB (Neo4j, Neptune, Memgraph, etc.). cortex runs on just two BigQuery tables (product_graph_nodes / product_graph_edges). Three reasons:

Different cost structure — dedicated graph DBs set a floor of "always-on cluster cost"; for the current implementation, BQ is storage + on-demand queries only. Even with continuous AI traffic, it's clearly cheaper than running a server 24/7.
Vector search / cosine similarity / SQL in the same place — BQ has VECTOR_SEARCH and ML.DISTANCE, so semantic search over @graph-business embeddings, filter by node properties, and adjacent-node JOINs can all live in one query. That matters when "semantic search + property filter + neighbor JOIN" is the standard access pattern.
Migration-ready for GQL once BQ Graph goes GA — BQ already has Graph in BigQuery in Preview; once it ships GA, you can put a graph view over the existing tables and likely shift to MATCH (n)-[e]->(m) queries in GQL. The current table design is already migration-ready.

In short: get the graph DB's future strength (GQL) while running on plain BQ tables today. Compared with bolting a graph DB onto a generic RAG stack (pgvector / Pinecone / etc.), this means fewer systems to operate and a gentler learning curve.
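
The "semantic search + property filter + neighbor JOIN" access pattern is easier to see in a small in-memory analogue. To be clear about what this is: in cortex the equivalent work is done by SQL in BigQuery, and the sketch below only mirrors the three steps on toy data. The record shapes, the `sourceId` field, and the two-dimensional embeddings are assumptions for illustration.

```typescript
interface GraphNode {
  id: string;
  nodeType: string;
  embedding: number[];
}

interface GraphEdge {
  sourceId: string;
  targetId: string;
  edgeType: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface SearchHit {
  id: string;
  score: number;
  neighbors: string[];
}

function searchWithNeighbors(
  queryEmbedding: number[],
  nodes: GraphNode[],
  edges: GraphEdge[],
  options: { nodeType?: string; topK: number },
): SearchHit[] {
  return nodes
    // 1. Property filter (a WHERE clause in SQL).
    .filter((n) => options.nodeType === undefined || n.nodeType === options.nodeType)
    // 2. Semantic ranking (vector search in SQL).
    .map((n) => ({ node: n, score: cosineSimilarity(queryEmbedding, n.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, options.topK)
    // 3. Neighbor lookup (a JOIN against the edges table in SQL).
    .map((hit) => ({
      id: hit.node.id,
      score: hit.score,
      neighbors: edges.filter((e) => e.sourceId === hit.node.id).map((e) => e.targetId),
    }));
}

const toyNodes: GraphNode[] = [
  { id: "fn:calculateBugCount", nodeType: "Function", embedding: [1, 0] },
  { id: "fn:sendEmail", nodeType: "Function", embedding: [0, 1] },
  { id: "tbl:kpi_bug_rate_per_100pt", nodeType: "BigQueryTable", embedding: [1, 0] },
];
const toyEdges: GraphEdge[] = [
  { sourceId: "fn:calculateBugCount", targetId: "tbl:kpi_bug_rate_per_100pt", edgeType: "writes_to" },
];
const searchHits = searchWithNeighbors([1, 0], toyNodes, toyEdges, { nodeType: "Function", topK: 1 });
```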

The Core Part Is Available as an Open-Source Sample

The "parse JSDoc annotations with AST analysis and output a graph" part is small enough to reproduce cleanly, so I published it as a working sample:

🔗 graph-jsdoc-extractor

It's a ~500-line library that extracts @graph-* and outputs ndjson of { kind: "node", ... } / { kind: "edge", ... } objects. Comes with a pnpm run example that runs end-to-end. For those who just want to see the output format without cloning, the built ndjson is checked in: examples/sample/output.ndjson.

This is intentionally just the "turn code into a graph" part. The real value in cortex starts when docs and DB schemas land on the same graph — that's the next section.

Connections — Landing Docs and DB on the Same Graph

Looking at the sample ndjson, a @graph-connects users [reads_from, via:id] entry has users stored as a raw string in targetId. Leaving that as-is means it's just a string. Resolving users into a rich node carrying column definitions, partition info, and per-column descriptions — that's where the resolution power of search takes a real step forward.

cortex does this in three directions.

1. DB Schemas as Nodes in the Same Graph

cpg ingests not just code but cortex's DB schemas in the same build. A @graph-connects users [queries, via:id] on the code side gets resolved at build time into a rich Table node carrying column definitions, partition metadata, and descriptions (if the same-named stub exists, its internals are replaced while its ID and all inbound edges survive).

The key point: table and column descriptions aren't AI-generated annotations attached after the fact — they're pulled directly from the description fields in the Pulumi schema definitions. Here's what that looks like (excerpt from cpg's own table definition):

export const productGraphNodesTable = new gcp.bigquery.Table('cortex-prod-product-graph-nodes', {
  datasetId: 'cortex',
  tableId: 'product_graph_nodes',
  description:
    'Product Graph nodes — unified knowledge graph of code + DB + docs. ' +
    'Auto-generated from JSDoc @graph-* tags',
  schema: JSON.stringify([
    { name: 'id', type: 'STRING', mode: 'REQUIRED',
      description: 'Unique node ID (graphId:nodeType:filePath:name format)' },
    { name: 'nodeType', type: 'STRING', mode: 'REQUIRED',
      description: 'Node type — ApiEndpoint, BigQueryTable, Function, Module, Document, etc.' },
    { name: 'qualifiedName', type: 'STRING',
      description: 'Fully qualified name — filePath:exportName format' },
    // ...
  ]),
});

Both the table-level and column-level descriptions become the embedding input for semantic search directly from the Pulumi definition. The same philosophy as cpg's JSDoc — "write the description at the place the thing is defined" — runs all the way through the DB layer. Fix a Pulumi description → semantic search improves. Same mechanics as fixing a JSDoc.
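
The stub replacement mentioned at the start of this subsection can be sketched as follows. This is an illustration under assumptions, not cortex's merge logic: matching here is by exact ID, the `isStub` flag and `properties` bag are hypothetical, and a rich node is never allowed to overwrite another rich node.

```typescript
interface GraphNode {
  id: string;
  nodeType: string;
  isStub: boolean;
  properties: Record<string, unknown>;
}

interface GraphEdge {
  sourceId: string;
  targetId: string;
  edgeType: string;
}

// Mutates `nodes` in place; edges are untouched because they only reference IDs.
function promoteStubs(nodes: Map<string, GraphNode>, richNodes: GraphNode[]): void {
  for (const rich of richNodes) {
    const existing = nodes.get(rich.id);
    if (existing === undefined || existing.isStub) {
      // Replace the internals, keep the ID: every inbound edge still resolves.
      nodes.set(rich.id, { ...rich, isStub: false });
    }
  }
}

const graphNodes = new Map<string, GraphNode>();
graphNodes.set("users", { id: "users", nodeType: "BigQueryTable", isStub: true, properties: {} });
const inboundEdges: GraphEdge[] = [
  { sourceId: "apps/api/src/user.ts:getUser", targetId: "users", edgeType: "queries" }, // hypothetical code node
];

promoteStubs(graphNodes, [
  {
    id: "users",
    nodeType: "Table",
    isStub: false,
    properties: { description: "Registered users", columns: ["id", "email"] },
  },
]);
const promotedNode = graphNodes.get("users");
```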

2. Docs Auto-Promoted to Nodes via Directory Convention

Markdown files under docs/ also land in the graph. The mechanism is simple: the directory structure is conventionalized so that which stack and domain each doc belongs to is deterministically resolvable:

docs/{category}/{name}.md

Examples from cpg itself:

docs/product-graph/README.md → stack: product-graph, domain: Engineering
docs/code-graph/README.md → stack: code-graph, domain: Engineering
docs/mcp/db-graph/README.md → stack: mcp-db-graph-server, domain: Engineering

Each file is ingested as a Document node in the graph, and a documented_by edge is auto-generated from code nodes whose @graph-stack matches the doc's stack. Code under apps/graph/product/ all carries @graph-stack product-graph, so it's automatically linked to docs/product-graph/README.md. Change code → related docs are already linked.

This means an AI reviewer can answer "did this code change leave related docs stale?" in one graph hop (that's the source of the [Doc] Critical comments from Part 1).
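
Both halves of that claim, edge generation on stack match and the one-hop lookup a reviewer performs, fit in a short sketch. The shapes and function names below are hypothetical, and the doc's stack is taken as already resolved from its path.

```typescript
interface CodeNode {
  id: string;
  stack: string; // from @graph-stack
}

interface DocNode {
  id: string;    // path of the Markdown file
  stack: string; // resolved from the directory convention
}

interface DocEdge {
  sourceId: string;
  targetId: string;
  edgeType: "documented_by";
}

// Emit one documented_by edge for every (code node, doc) pair sharing a stack.
function linkDocs(codeNodes: CodeNode[], docs: DocNode[]): DocEdge[] {
  const docsByStack = new Map<string, DocNode[]>();
  for (const doc of docs) {
    const list = docsByStack.get(doc.stack) ?? [];
    list.push(doc);
    docsByStack.set(doc.stack, list);
  }
  const edges: DocEdge[] = [];
  for (const code of codeNodes) {
    for (const doc of docsByStack.get(code.stack) ?? []) {
      edges.push({ sourceId: code.id, targetId: doc.id, edgeType: "documented_by" });
    }
  }
  return edges;
}

// One hop: which docs might be stale, given the code nodes a PR touched?
function docsToRecheck(changedNodeIds: string[], edges: DocEdge[]): string[] {
  const docIds = new Set<string>();
  for (const e of edges) {
    if (changedNodeIds.indexOf(e.sourceId) !== -1) docIds.add(e.targetId);
  }
  return Array.from(docIds);
}

const docEdges = linkDocs(
  [
    { id: "apps/graph/product/src/a.ts:buildGraph", stack: "product-graph" }, // hypothetical nodes
    { id: "apps/graph/product/src/b.ts:loadNodes", stack: "product-graph" },
    { id: "apps/graph/code/src/c.ts:parseRepo", stack: "code-graph" },
  ],
  [
    { id: "docs/product-graph/README.md", stack: "product-graph" },
    { id: "docs/code-graph/README.md", stack: "code-graph" },
  ],
);
const staleCandidates = docsToRecheck(["apps/graph/product/src/b.ts:loadNodes"], docEdges);
```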

3. Infrastructure Definitions as Nodes

@graph-* tags go on Pulumi code in infra/ too. An example from cortex's own graph infrastructure:

/**
 * @graph-node {CronSchedule}
 * @graph-stack code-graph
 * @graph-domain Engineering
 * @graph-business graph-boundary-daily: runs cross-repository boundary analysis at 7:00 AM JST
 *   daily (auto-detecting API, DB, and Event connections across repos)
 * @graph-connects graph-index-job [triggers] trigger Cloud Run Job
 */
new gcp.cloudscheduler.Job(`${prefix}-graph-boundary-schedule`, { ... });

This becomes a CronSchedule node in the graph, connected to the target CloudRunJob node by a triggers edge. The Pulumi definition is itself a graph entry point — "what code runs in this cron?" is now answerable by graph traversal.

Result: Four Layers on One Graph

Adding the three together, the node types in the graph look like this:

Node type	Source
Function / Class / Method	Code (JSDoc)
ApiEndpoint / Page	Code (JSDoc `@graph-node`)
BigQueryTable / FirestoreCollection (stub)	Code `@graph-connects` targets
Table / Column / Schema (rich)	Schema files defined in Pulumi
Document	Directory parser over `docs/`
CronSchedule / PubSubTopic / CloudRunService	`infra/` JSDoc

Edge types correspondingly:

Edge type	Role
calls / queries / reads_from / writes_to / publishes / triggers	code → other nodes (`@graph-connects`)
documented_by	code → Document (auto-generated on stack match)
HAS_TABLE / HAS_COLUMN	Schema → Table → Column (DB side)
shares_topic	Between boundary nodes sharing a topic

Code ↔ DB ↔ docs ↔ infra — all reachable in one hop on the same graph. This is what "Product Graph" means: cortex's unified knowledge graph.

Here's an actual visualization of a slice of cpg itself. Starting from generateEmbeddings (code), you can see cortex.product_graph_nodes (BigQueryTable) with its columns, the Pulumi table definition resource, docs/product-graph/README.md, external services like Vertex AI, and a separate layer's graph-boundary-daily (CronSchedule) — all connected by edges on the same node set:

Where the Sample Stops

graph-jsdoc-extractor intentionally leaves out:

Resolving @graph-connects targets to real node IDs (cortex uses a seven-stage resolver; the rules are project-specific)
Same-name merging (cortex promotes DB-schema-side rich nodes to replace stubs; the merge source is project-specific)
The docs directory convention parser (cortex's docs/{category}/{name}.md convention is cortex-specific)
Embedding generation (Vertex AI setup is up to you)

These are parts where the right answer differs per project — naming conventions, where docs live, which embedding model to use, when to promote a stub to a rich node. Baking one answer into the sample library would make it harder to use, not easier. The sample draws the line at JSDoc → graph structure, and this article's job is "here's how we did it in cortex — translate it to your project's context."

MCP Tool Design and the Runbook Pattern

The graph is now assembled. Next: how AI uses it.

cpg runs as an MCP server (cortex-product-graph). From the AI's side, three tools are visible, applying the three-layer tool design (search / detail / traverse) from the Agentic Graph RAG MCP post directly to cpg:

Tool	Role
`search_product_graph_nodes`	Find entry points (vector search + name search)
`get_product_graph_node_detail`	Deterministically fetch detail by ID
`trace_product_graph_connections`	BFS subgraph traversal (`via_filter` for parameter-level tracking)

The three layers only show you what's in the graph. For jumping from graph nodes to the actual data they point to, supplementary tools live in the same MCP:

Supplementary tool	Role
`read_file`	Pass a node's `path` property directly to fetch source (Function / Class / Method / ApiEndpoint / Document — any code-origin node carries `path`)
`grep_code`	Pattern search across the repository
`git_blame`	Last author, commit, and timestamp per line
`query_product_graph_bq`	Direct SQL against BigQuery. Find a BQTable node in the graph, then jump to its live data (executed via user OAuth, so BQ IAM applies as-is)
`read_firestore` / `write_firestore`	Read/write Firestore collections. Find a FirestoreCollection node in the graph, then go to the live documents (Firestore access follows the same user / environment permission boundary; cpg provides the entry point, not a bypass around IAM)
`list_product_graph_stacks` / `list_product_graph_domains`	Lists all stack / domain names present in the graph; useful for orienting before a search

In other words, cpg's MCP is a two-tier design: the three-layer structure for graph traversal + supplementary tools for descending into live data (source code / BQ / Firestore). The AI can do "search by meaning → traverse by structure → pull live data" entirely within one MCP server.
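The traverse layer can be pictured as a breadth-first walk over an edge list. The sketch below is a generic BFS under assumptions, not cpg's implementation: `traceConnections`, `Edge`, and `TraceOptions` are hypothetical names, and it assumes `via_filter` keeps only edges whose `via` label matches. The sample edges mirror the trace output shown later in this post.

```typescript
type Direction = "forward" | "backward" | "both";

interface Edge {
  from: string;
  to: string;
  type: string;
  via?: string; // parameter name carried by the edge, e.g. "filePath"
}

interface TraceOptions {
  direction: Direction;
  maxDepth: number;
  viaFilter?: string; // assumed semantics: keep only edges with this `via`
}

// Breadth-first traversal that collects every edge reachable from `start`
// within `maxDepth` hops, following edges in the requested direction.
function traceConnections(edges: Edge[], start: string, opts: TraceOptions): Edge[] {
  const visited = new Set<string>([start]);
  const collected = new Set<Edge>();
  let frontier: string[] = [start];

  for (let depth = 0; depth < opts.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const edge of edges) {
        if (opts.viaFilter !== undefined && edge.via !== opts.viaFilter) continue;
        const forwardHit = opts.direction !== "backward" && edge.from === node;
        const backwardHit = opts.direction !== "forward" && edge.to === node;
        if (!forwardHit && !backwardHit) continue;
        collected.add(edge);
        const neighbor = forwardHit ? edge.to : edge.from;
        if (!visited.has(neighbor)) {
          visited.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return Array.from(collected);
}

const traceEdges: Edge[] = [
  { from: "parseJSDocExports", to: "extractDeclarationsFromFile", type: "calls" },
  { from: "parseJSDocExports", to: "extractTagsFromNode", type: "calls" },
  { from: "parseJSDocExports", to: "filesystem", type: "reads_from", via: "filePath" },
  { from: "parseJSDocExports", to: "docs/product-graph/README.md", type: "documented_by" },
];

const forwardAll = traceConnections(traceEdges, "parseJSDocExports", { direction: "forward", maxDepth: 1 });
const viaOnly = traceConnections(traceEdges, "parseJSDocExports", { direction: "forward", maxDepth: 1, viaFilter: "filePath" });
const upstream = traceConnections(traceEdges, "filesystem", { direction: "backward", maxDepth: 5 });
```

The `visited` set keeps cycles from looping forever, and the depth cap bounds how much of the graph one call can pull back.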

Runbook Pattern — Return Values Contain the Next Action

Every MCP response ends with a "related nodes (next action candidates)" block. For example, after a search returns:

3 nodes found:
- apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount (Function)
- backlog_no_embedding.kpi_bug_rate_per_100pt (BigQueryTable)
- /kpi/bugs (ApiEndpoint)

## Related nodes (next action candidates)

### 🛠 Code (1)
- apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount
  → `get_product_graph_node_detail("apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount")`

### 🗄 DB tables (1)
- backlog_no_embedding.kpi_bug_rate_per_100pt
  → `trace_product_graph_connections(start_node: "backlog_no_embedding.kpi_bug_rate_per_100pt", direction: "backward")`

### 🌐 API (1)
- /kpi/bugs
  → `get_product_graph_node_detail("/kpi/bugs")`

Copy-pasteable tool calls are lined up by node type, showing exactly what to call next. The AI gets new options on every call, so it never has to figure out "what should I do now?"
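As a rough sketch of how such a block could be produced, the function below groups found nodes by type and emits one suggested call per node. The per-type choice (backward trace for tables, detail fetch for everything else) is inferred from the sample response above; `FoundNode`, `nextAction`, and `renderRunbook` are hypothetical names, and only three node types are modeled.

```typescript
type NodeType = "Function" | "BigQueryTable" | "ApiEndpoint";

interface FoundNode {
  id: string;
  type: NodeType;
}

// Section labels copied from the sample response above.
const SECTION_LABEL: Record<NodeType, string> = {
  Function: "🛠 Code",
  BigQueryTable: "🗄 DB tables",
  ApiEndpoint: "🌐 API",
};

// Pick the suggested follow-up call for one node (assumed mapping).
function nextAction(node: FoundNode): string {
  if (node.type === "BigQueryTable") {
    return `trace_product_graph_connections(start_node: "${node.id}", direction: "backward")`;
  }
  return `get_product_graph_node_detail("${node.id}")`;
}

// Render the "next action candidates" block appended to a tool response.
function renderRunbook(nodes: FoundNode[]): string {
  const lines: string[] = ["## Related nodes (next action candidates)"];
  const order: NodeType[] = ["Function", "BigQueryTable", "ApiEndpoint"];
  for (const type of order) {
    const group = nodes.filter((n) => n.type === type);
    if (group.length === 0) continue;
    lines.push("", `### ${SECTION_LABEL[type]} (${group.length})`);
    for (const node of group) {
      lines.push(`- ${node.id}`, `  → \`${nextAction(node)}\``);
    }
  }
  return lines.join("\n");
}

const runbook = renderRunbook([
  { id: "apps/generator/kpi/src/kpi-calculator.ts:calculateBugCount", type: "Function" },
  { id: "backlog_no_embedding.kpi_bug_rate_per_100pt", type: "BigQueryTable" },
  { id: "/kpi/bugs", type: "ApiEndpoint" },
]);
```

Keeping the suggestion logic on the server side means the calling model only has to choose among ready-made calls rather than construct them.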

Here's the AI ↔ MCP loop in diagram form. The MCP bundles next action candidates into every search response; the AI picks one and makes the next call, repeating:

`usecase` Parameter — Switching the Runbook

Every tool accepts a usecase parameter where the AI declares what kind of investigation it's doing:

usecase	Strategy (summary of what cpg optimizes for)
`general`	Basic investigation with unknown entry point. Default.
`design`	Understanding existing feature structure. Read business / connections via `get_product_graph_node_detail`. Deep trace is unnecessary; Document nodes take priority.
`impact`	Trace upstream and downstream impact deeply. Hit `trace_product_graph_connections` with `direction=both` / `max_depth=5`. Code + DB + infra + schedules are all on the same graph, so one traversal covers a wide area.
`test-create`	Test design. Fetch detail to read parameters and connected DB / called functions.
`test-review`	Compare existing tests against implementation coverage. Cross-check branch structure of target Function / Method against test case count.
`code-review`	Check impact of changes and detect `@graph-business` violations. Trace impact → detail to check business / source.
`bug`	Deep trace from error origin. `direction=both` / `max_depth=5` for upstream callers + downstream data flow.

The same search_product_graph_nodes call with usecase: "code-review" returns next action candidates optimized for "verify the change's impact first." With usecase: "bug" it returns candidates optimized for "trace deep from error origin + fetch logs." The Runbook switches to match the declared intent.

This matters because having the AI declare "what kind of investigation I'm doing" yields different angles from the same graph. Auto Review internally fires with code-review; Self-Healing fires with bug — the flywheel elements from Part 1 each run a different Runbook.
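One way to picture the switch is a lookup from the declared `usecase` to traversal defaults. In the sketch below, the `both` / depth 5 settings for `impact` and `bug` and the Document priority for `design` come from the table above; the shallow default (`forward`, depth 1) used for the remaining usecases, and the names `TraceDefaults` and `traceDefaultsFor`, are assumptions for illustration.

```typescript
type Usecase =
  | "general" | "design" | "impact" | "test-create"
  | "test-review" | "code-review" | "bug";

interface TraceDefaults {
  direction: "forward" | "backward" | "both";
  maxDepth: number;
  preferDocuments: boolean;
}

// Assumed shallow default for usecases the table gives no trace settings for.
const SHALLOW: TraceDefaults = { direction: "forward", maxDepth: 1, preferDocuments: false };

// Map the declared usecase to traversal defaults.
function traceDefaultsFor(usecase: Usecase): TraceDefaults {
  switch (usecase) {
    case "impact":
    case "bug":
      // From the table: direction=both / max_depth=5.
      return { direction: "both", maxDepth: 5, preferDocuments: false };
    case "design":
      // From the table: deep trace unnecessary, Document nodes take priority.
      return { ...SHALLOW, preferDocuments: true };
    default:
      return SHALLOW;
  }
}

const bugDefaults = traceDefaultsFor("bug");
const designDefaults = traceDefaultsFor("design");
```

The caller states intent once, and the depth and direction follow from it instead of being re-specified on every call.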

CLAUDE.md Convention — Forcing AI to Always Hit cpg First

Throughout this post I've said "the AI uses cpg," but AI doesn't spontaneously choose cpg. Claude Code defaults to grep / glob / file read as its first instinct. To flip that, the root CLAUDE.md in cortex opens with:

Product Graph MCP (cortex-product-graph)

This is the single most important asset in this repository. cortex-product-graph MCP indexes all code, DB schemas, docs, and infra into a unified knowledge graph with business context. It knows everything about this repository.

Always query Product Graph MCP first before grep/glob/file reads. It returns richer, contextualized results.

If Product Graph MCP is unavailable (auth expired, server down) and you are NOT in autonomous/auto mode, stop all work immediately and ask the user to authenticate. Do not proceed with degraded grep-only investigation.

Two things matter here. First, the explicit ordering: cpg comes before grep / glob / file reads. Second, in interactive (non-autonomous) sessions, degrading to grep-only investigation is explicitly forbidden when cpg is unavailable. Without that second clause, the AI happily degrades to "cpg seems down, I'll just grep" and proceeds with stale context and wrong assumptions. With it, cpg unavailability is a hard stop, not a graceful degradation.

One clause in CLAUDE.md, and Claude Code's first move on any code investigation is pinned to cpg. Article writing, Auto Review, Self-Healing — all follow the same convention, so the entry point is always unified.

A Live Example — Investigating cpg with cpg

Enough abstraction. Let me walk through a real cpg query: using cpg to investigate cpg's own builder core — the meta-example.

Step 1: Semantic search for "the code that extracts graph source data from code annotations"

No function name assumed. Just the intent in plain language:

search_product_graph_nodes(
  query: "code that extracts graph source data from annotations written in code",
  search_mode: "semantic",
  usecase: "design"
)

Top 5 results:

- apps/graph/product/src/parsers/jsdoc-parser.ts:applyGraphTag (Function)
- apps/graph/product/src/parsers/jsdoc-parser.ts:extractTagsFromNode (Function)
- packages/eslint-plugin-graph/src/utils/jsdoc-utils.ts:extractGraphTags (Function)
- apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports (Function)
- packages/eslint-plugin-graph/src/utils/jsdoc-utils.ts:getGraphTagValue (Function)

The query contained neither "JSDoc" nor "@graph-*" nor "parser" — yet the intent found the right nodes via the @graph-business embedding. grep cannot do this.
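The mechanism behind this kind of match is similarity between embedding vectors rather than string overlap. The following is a generic cosine-similarity ranking sketch, not cpg's search code: the three-dimensional vectors are toy values chosen by hand, the node ids are hypothetical, and real embeddings would come from an embedding model.

```typescript
interface EmbeddedNode {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank nodes by similarity to the query vector, best match first.
function rankBySimilarity(query: number[], nodes: EmbeddedNode[], topK: number) {
  return nodes
    .map((node) => ({ id: node.id, score: cosineSimilarity(query, node.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy data: hand-picked vectors standing in for real embeddings.
const toyNodes: EmbeddedNode[] = [
  { id: "kpi-table", embedding: [0, 1, 0] },
  { id: "annotation-parser", embedding: [0.9, 0.1, 0] },
  { id: "lint-helper", embedding: [0.5, 0.5, 0] },
];

const ranked = rankBySimilarity([1, 0, 0], toyNodes, 3);
```

No node id here shares a single token with the query vector's "meaning"; the ranking depends only on direction in the vector space, which is why a plain-language description can land on the right function.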

Step 2: Trace downstream from parseJSDocExports, one of the hits above (`usecase: "design"` prioritizes Documents)

trace_product_graph_connections(
  start_node: "apps/graph/product/src/parsers/jsdoc-parser.ts:parseJSDocExports",
  direction: "forward",
  usecase: "design"
)

Edges returned:

- parseJSDocExports --calls--> extractDeclarationsFromFile
- parseJSDocExports --calls--> extractTagsFromNode
- parseJSDocExports --reads_from[via:filePath]--> filesystem
- parseJSDocExports --documented_by--> docs/product-graph/README.md (Document)

The last one — documented_by — is the point: the edge from code to the Document node was auto-generated. Following it with read_file retrieves docs/product-graph/README.md — and with it, the background, design rationale, and tag specification for this implementation, all in one hop.
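For illustration, picking the `documented_by` targets out of a trace result can be done mechanically. The sketch below assumes the response arrives as text lines in exactly the format shown above; `parseEdgeLine` and `documentPaths` are hypothetical helpers, not part of cpg.

```typescript
interface TraceEdge {
  from: string;
  type: string;
  to: string;
  via?: string;
}

// Parse one line such as "- a --reads_from[via:filePath]--> b".
// Returns null for lines that are not in the assumed edge format.
function parseEdgeLine(line: string): TraceEdge | null {
  const m = /^-\s+(\S+)\s+--(\w+)(?:\[via:([^\]]+)\])?-->\s+(\S+)/.exec(line.trim());
  if (m === null) return null;
  const edge: TraceEdge = { from: m[1], type: m[2], to: m[4] };
  if (m[3] !== undefined) edge.via = m[3];
  return edge;
}

// Collect the document paths that a follow-up read_file call would fetch.
function documentPaths(lines: string[]): string[] {
  const paths: string[] = [];
  for (const line of lines) {
    const edge = parseEdgeLine(line);
    if (edge !== null && edge.type === "documented_by") paths.push(edge.to);
  }
  return paths;
}

const traceOutput = [
  "- parseJSDocExports --calls--> extractDeclarationsFromFile",
  "- parseJSDocExports --calls--> extractTagsFromNode",
  "- parseJSDocExports --reads_from[via:filePath]--> filesystem",
  "- parseJSDocExports --documented_by--> docs/product-graph/README.md (Document)",
];

const docs = documentPaths(traceOutput);
const readsEdge = parseEdgeLine(traceOutput[2]);
```

Each path in `docs` is what would be handed to `read_file` as the next hop.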

Step 3: The meta-structure — this article itself is written with cpg

This article was drafted by Claude Code, not by me; I provided direction and review. That Claude Code session has the cpg MCP connected, so every time I said "show a real example from cpg's own code" or "use a cpg-related infra example," Claude queried cpg to pull actual function names, JSDoc, Pulumi definitions, and docs structure, then embedded them in the text.

In other words: the generateEmbeddings JSDoc, the Pulumi productGraphNodesTable description, the graph-boundary-daily cron annotation, the auto-link to docs/product-graph/README.md — none of these came from my memory. Claude queried cpg and found the real artifacts. My role is only the review judgment: "this is right / this is wrong."

This is the pattern repeating across all of cortex. Humans set the direction; AI uses cpg to verify and generate implementations / text / reviews. Part 1's ③ Auto Review and ④ Self-Healing run on the same structure. Article writing isn't a special case — as long as cpg exists, AI-driven work always takes this shape.

What Changed / Bridge to Part 3

That covers the inside of cpg. A closing summary of how it affects cortex as a whole:

1. I stopped running grep

Without knowing file names or symbol names, I can get the relevant code back by just describing what I want to do. The combination of 120+ apps and a team of one works because of this, more than anything else.

2. Auto Review produces context-grounded comments

The [Graph] / [Impact] / [Doc] / [Security] comments that Part 1's ③ Auto Review produces all rest on cpg. The review is carried out with the entire codebase as context, and that is the real benefit of the cpg integration.

3. Self-Healing can trace from error origin to root cause

Part 1's ④ Self-Healing can hop from a Grafana alert → code → dependent tables → related docs in one graph traversal because cpg exists. It fires with usecase: "bug" and takes the shortest path from error to root cause.

4. The static-analysis code graph is working somewhere else

I said "we abandoned code inference" at the top, but that was specifically for cortex itself. For the external-facing production repositories (the core of the business), a different approach supplies context, and static analysis continues to run there. More on that in a separate post.

Most AI coding setups try to make the AI better at reading an unchanged repository. cpg takes the opposite approach: change the repository's information structure so AI has a first-class semantic map to read. That's the line between "another GraphRAG" and what cpg actually is.

In that sense, Product Graph is literally a knowledge graph of the AI, by the AI, for the AI: generated alongside AI-written code, maintained through AI review, and consumed by AI agents as their primary map of the product.

Coming up in Part 3: the full pipeline of automated PR review built on top of cpg, from GitHub webhook ingestion through AI review, automated fix, automated merge, and parallel deploy. It covers what happens when Auto Review fires with usecase: "code-review", how [Graph] Critical comments are generated, and the worktree mechanism that lets AI apply fixes and push back.
