How autonomous AI agents can generate a complete architecture snapshot of your microservices platform - while you do push-ups - and why that documentation becomes the most powerful input for your AI-driven quality pipeline.
You can listen to a podcast generated from this publication (thanks, NotebookLM).
TL;DR
Architectural documentation is not a chore. When colocated with your source code and fed into an AI-powered quality pipeline, it transforms static analysis from "catching typos" into "catching systemic security failures and costly infrastructure leaks." This article documents a real experiment where an autonomous AI agent generated architecture files across a multi-service Google Cloud platform - with the human engineer largely off-screen - and what happened when that documentation gave our AI Quality Gate an entirely new perspective.
1. The "Self-Documenting Code" Problem
There is a persistent assumption in software engineering that well-structured code is self-explanatory. Clean functions, good variable names, and a Pylint score of 10.0/10 - surely that's enough?
It is not.
Code describes how a system executes. Architecture documentation describes why a system exists and how it interacts with everything around it. Without this context layer, every automated analysis tool is operating in the dark. It sees a function, but not its role in the broader service mesh. It sees an API call, but not the security boundary it is expected to enforce.
This distinction matters enormously when you introduce AI-powered tools into your engineering workflow. An LLM analyzing raw code without architectural context is like asking a senior engineer to perform a security review without access to the system design.
2. Generating Architecture While Doing Push-ups
My platform runs on Google Cloud. It consists of dozens of microservices deployed on Cloud Run, interacting via REST APIs, persisting assets to Google Cloud Storage, and routing all AI operations through a centralized Vertex AI gateway. A rich, well-connected system - but one where the only documentation was spread across scattered README files.
I set out to change that. The goal: a standardized, machine-readable architectural snapshot for every service, committed directly to the repository.
The method: guided autonomous agent execution.
The engineer set a direction, established the documentation standard, and then stepped back. The AI agent - powered by Gemini 3 Flash and Claude Sonnet 4.6 running inside Antigravity, an agentic AI coding assistant - took over. It autonomously inspected each service, read the source code, traced inter-service dependencies, cross-referenced existing implementations against the documentation standard, and iteratively generated structured ARCHITECTURE.md files. The engineer's main activity during most of this process was physical exercise.
The output was not informal notes. It was a disciplined, multi-level documentation hierarchy:
📦 platform-root
 ┣ 📜 ARCHITECTURE.md ← Level 0: Global service mesh, topology, lifecycle status
 ┗ 📂 services
   ┣ 📂 core-ai-gateway
   ┃ ┗ 📜 ARCHITECTURE.md ← Level 1: Security policy engine, FinOps guardrails
   ┣ 📂 orchestration-bot
   ┃ ┗ 📜 ARCHITECTURE.md ← Level 1: Async task flow, Telegram webhook handling
   ┣ 📂 media-transcriber
   ┃ ┗ 📜 ARCHITECTURE.md ← Level 1: Speech-to-Text pipeline, GCS asset management
   ┗ 📂 translation-engine
     ┗ 📜 ARCHITECTURE.md ← Level 1: Structured output, multilingual routing
Each document followed a strict template:
- Intent: The concrete business and technical reason this service exists.
- Design Principles: Key trade-offs - statelessness, latency targets, fallback strategies.
- Interaction Diagram: A Mermaid graph of service-to-service flows, security boundaries, and AI provider integrations. It can be generated by the agent and rendered automatically in GitLab.
- LLM Context Block: A precise summary optimized for consumption by automated agents and AI reviewers.
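
A template like this also lends itself to mechanical enforcement. Below is a minimal sketch of a CI check that fails when a service's ARCHITECTURE.md is missing a required section; the section names and the services/ layout come from the structure above, but the script itself is illustrative and not part of the described platform:

```python
# Minimal sketch of a template-conformance check (illustrative; not the platform's actual tooling).
import pathlib
import sys

# Section names taken from the documentation standard above; adjust to your own template.
REQUIRED_SECTIONS = ("Intent", "Design Principles", "Interaction Diagram", "LLM Context Block")


def missing_sections(doc: pathlib.Path) -> list[str]:
    """Return the required sections that do not appear in the document."""
    text = doc.read_text(encoding="utf-8")
    return [section for section in REQUIRED_SECTIONS if section not in text]


def main() -> int:
    failures = 0
    for doc in sorted(pathlib.Path("services").glob("*/ARCHITECTURE.md")):
        missing = missing_sections(doc)
        if missing:
            print(f"{doc}: missing {', '.join(missing)}")
            failures += 1
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```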
The entire operation resulted in a navigable, cross-linked architecture map - built with minimal human cognitive effort (and with visualizations!).
3. The Quality Gate Awakening
Once the documentation was committed alongside the source code, I ran a standard CI quality review using our AI-powered Quality Gate - a service built on top of Gemini via Vertex AI, designed to perform automated architectural and security reviews on every merge request.
💡 What is the Quality Gate, exactly?
It is not a $100,000 enterprise SaaS platform. It is a lightweight, purpose-built microservice - part of the same platform it reviews - deployed on Google Cloud Run. It exposes a single endpoint, receives the merge request diff from the CI pipeline, constructs an LLM prompt enriched with the repository's architectural documentation, calls Vertex AI (Gemini), and returns a structured JSON review report.
Because it runs on Cloud Run, it starts only when a review is triggered and shuts down immediately after. The total monthly cost for me is a few dollars - a fraction of a single human code review hour. This is a practical demonstration of the Google Cloud serverless model: pay only for the compute you actually use, and use high-intelligence AI only when it adds value.
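
The exact implementation is not published here, but the flow described above is small enough to sketch. The following is a minimal, illustrative version using Flask and the Vertex AI Python SDK; the endpoint name, model ID, project settings, prompt layout, and report shape are assumptions, not the author's actual code:

```python
# Minimal sketch of the Quality Gate flow (illustrative, not the author's actual service).
import pathlib

from flask import Flask, request
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-gcp-project", location="europe-west4")  # hypothetical project/region
model = GenerativeModel("gemini-1.5-pro")  # any Gemini model available in Vertex AI
app = Flask(__name__)


def architecture_context() -> str:
    """Concatenate every ARCHITECTURE.md so the model sees design intent, not just the diff."""
    docs = sorted(pathlib.Path(".").rglob("ARCHITECTURE.md"))
    return "\n\n".join(f"# {doc}\n{doc.read_text(encoding='utf-8')}" for doc in docs)


@app.post("/review")
def review():
    diff = request.get_json()["diff"]  # merge request diff posted by the CI job
    prompt = (
        "You are an architecture and security reviewer for this platform.\n\n"
        f"Architectural documentation:\n{architecture_context()}\n\n"
        f"Merge request diff:\n{diff}\n\n"
        "Respond with JSON containing 'blockers' and 'recommendations'."
    )
    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(response_mime_type="application/json"),
    )
    # The model is asked for JSON, so the body can be returned to the pipeline as-is.
    return response.text, 200, {"Content-Type": "application/json"}
```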
The difference was immediately visible.
Previously, without architectural context, the Quality Gate was limited to code-level analysis: style consistency, common security anti-patterns, dependency versions. Useful, but shallow.
With the ARCHITECTURE.md files available as context, the model could see the architecture and the code simultaneously. The result was a qualitative leap: the Quality Gate shifted from a static analysis tool into a reasoning system operating at the level of system design.
It identified two critical issues within minutes - issues that had existed undetected in the codebase for months.
Finding 1: The Distributed Tracing Blackout
One of our routing services included middleware that explicitly stripped incoming trace headers. On the surface, this looked like a reasonable security measure to prevent external clients from injecting trace identifiers into internal systems.
The Quality Gate identified it as a critical observability violation.
Because the architectural documentation described the distributed tracing standard across the mesh - including the requirement for end-to-end X-Trace-ID propagation compatible with Google Cloud Trace - the model understood that stripping these headers at the boundary did not isolate a threat. It severed the trace chain entirely. In any production incident, engineers would be unable to correlate logs across services in Cloud Logging, turning a routine debugging session into a multi-hour forensic investigation with no Cloud Audit Logs correlation to lean on.
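
To make the contradiction concrete, here is a hedged sketch of both behaviors. The header names follow the X-Trace-ID standard mentioned above plus Google's native X-Cloud-Trace-Context, but the middleware itself is hypothetical rather than the original code:

```python
# Illustrative sketch, not the original middleware.
import re
import uuid

TRACE_HEADERS = ("X-Trace-ID", "X-Cloud-Trace-Context")  # platform standard + GCP-native header


def strip_trace_headers(headers: dict[str, str]) -> dict[str, str]:
    """What the service did: looks like boundary hygiene, but severs the trace chain."""
    return {k: v for k, v in headers.items() if k not in TRACE_HEADERS}


def propagate_or_mint_trace(headers: dict[str, str]) -> dict[str, str]:
    """What the documented standard implies: validate the incoming ID, never drop correlation."""
    trace_id = headers.get("X-Trace-ID", "")
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):  # reject malformed or injected values...
        trace_id = uuid.uuid4().hex                   # ...but mint a fresh ID instead of none
    return {**headers, "X-Trace-ID": trace_id}
```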
Security intention ✅. Systemic consequence ❌. The documentation made this contradiction visible.
Finding 2: The Silent Storage Leak
A media processing service was documented as intentionally skipping cleanup of temporary assets in Google Cloud Storage after each processing job. The rationale was implicit - simplicity, no failure modes from deletion errors.
The Quality Gate cross-referenced this against the documented architectural principle of data minimization and least-privilege access, and flagged it as both a security and FinOps violation.
The impact: user audio files - potentially containing sensitive personal information - accumulating indefinitely in cloud storage. No lifecycle policy. No deletion trigger. Silent, compounding cost growth. An expanding attack surface with each new processing request.
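
For this class of leak, one remediation is a bucket lifecycle rule rather than per-request cleanup code. Here is a short sketch using the google-cloud-storage client; the bucket name and retention period are placeholders, not the platform's actual configuration:

```python
# Sketch: age-based lifecycle rule so temporary assets cannot accumulate forever.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("media-transcriber-temp-assets")  # hypothetical bucket name

# Delete temporary objects 7 days after creation (choose a retention that fits the pipeline).
bucket.add_lifecycle_delete_rule(age=7)
bucket.patch()  # persist the updated lifecycle configuration on the bucket
```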
Neither a linter nor a code reviewer scanning functions in isolation would have flagged either of these. Both findings emerged from the intersection of code behavior and architectural intent - visible only because the documentation existed.
4. The ROI Case
This experiment produced a measurable return on investment across several dimensions:
| Dimension | Without Documentation | With Documentation + AI Agent |
|---|---|---|
| Architecture Capture | Senior Architect hours | Agent cycle, near-zero human effort |
| Review Quality | Code-level findings | System-level and policy findings |
| Issue Discovery Cost | Post-incident or audit | CI/CD pipeline (minutes, pennies) |
| Quality Gate | Generic, rigid enterprise tool | Custom microservice, tunable per team or developer |
Three additional factors are worth noting specifically in the context of Google Cloud platforms:
- Vertex AI Token Efficiency: When the Quality Gate is backed by a Gemini model, providing a structured ARCHITECTURE.md reduces the tokens the model spends reconstructing system intent from raw code. Better context means cheaper, faster, and more accurate generation - directly impacting your AI compute costs.
- Cloud Run Observability: The distributed tracing finding described above is particularly relevant for Cloud Run-based architectures, where services are stateless and ephemeral. Without continuous trace propagation, debugging inter-service failures on Cloud Run becomes significantly harder. The documentation made this risk explicit and catchable.
- Serverless Cost Model: Because the Quality Gate is a Cloud Run service invoked only during CI/CD runs, there is zero idle cost. On a typical team with several merge requests per day, the entire AI-powered review pipeline costs a few dollars per month - less than a single engineering hour. This is the Google Cloud serverless model working exactly as intended: high-intelligence compute, on-demand, at minimal cost.
5. Lessons for Platform Engineers
The key insight from this experiment is not that AI agents write documentation faster than humans. That is expected. The key insight is that architecture documentation living inside the repository is a force multiplier for every automated tool that reads it.
This applies whether your automated tools are AI-powered code reviewers, compliance scanners, onboarding assistants, or infrastructure planning agents. The better the documentation, the higher the signal quality of every tool operating on top of it.
Practical recommendations:
- Colocate documentation with code. A separate wiki that drifts out of sync is noise. An ARCHITECTURE.md in the service directory, updated in the same commit as the code, is signal.
- Establish a documentation standard. A consistent template (Intent, Principles, Interaction Diagram) makes documentation machine-readable, not just human-readable.
- Define a lifecycle status. Clearly mark deprecated or inactive services. Automated agents should not use legacy code as a reference for current standards.
- Use agents to generate the initial draft. The cognitive overhead of starting from a blank page is real. Agents are excellent at producing a structured first pass that engineers then validate and refine.
- Feed documentation to your CI pipeline. An AI quality reviewer with architectural context is a different class of tool than one without it (a minimal wiring sketch follows this list).
- Build your own Quality Gate - and make it yours. This is the key advantage that enterprise SaaS cannot match: flexibility. A custom Cloud Run service backed by Gemini and driven by your compliance rules, your architectural standards, and your team conventions means every developer can have a personal reviewer that understands the exact context of the project - not a generic ruleset designed for the average of all possible codebases.
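
To illustrate the last two recommendations, the CI side of such a gate can stay very small: send the merge request diff to your own review service and fail the pipeline when blocking findings come back. This is a sketch under the assumption that the gate exposes a /review endpoint and returns a JSON report with a blockers list; the URL and report shape are placeholders consistent with the gate described earlier:

```python
# Sketch of the CI-side check (endpoint URL and report shape are assumptions).
import json
import subprocess
import sys
import urllib.request

GATE_URL = "https://quality-gate-xxxxx.a.run.app/review"  # hypothetical Cloud Run URL


def main() -> int:
    # Diff of the merge request against the target branch (simplified).
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout

    request = urllib.request.Request(
        GATE_URL,
        data=json.dumps({"diff": diff}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        report = json.load(response)

    for finding in report.get("blockers", []):
        print(f"BLOCKER: {finding}")

    # Fail the pipeline only when the blockers bucket is non-empty.
    return 1 if report.get("blockers") else 0


if __name__ == "__main__":
    sys.exit(main())
```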
6. Conclusion
Architecture documentation has historically been treated as optional overhead - valuable in theory, deprioritized in practice. This experiment demonstrates that when documentation is colocated with source code, follows a consistent machine-readable standard, and is kept current with the help of autonomous agents, it becomes a critical infrastructure component.
It enables automated systems to reason at the level of platform design, not just code syntax. It transforms AI-powered quality gates from expensive linters into genuine architectural advisors. And it can be generated - for an entire platform - while you are doing something else entirely.
The $10,000 ARCHITECTURE.md is not a metaphor. It is the estimated cost differential between finding a critical architectural flaw in a 5-minute CI review versus discovering it during a production incident, a compliance audit, or a cloud storage invoice that nobody expected.
Keep your architecture documented. Keep it in the repository. Let agents maintain it.
Stay standardized. Stay secure.


Top comments (30)
Landed on almost the same thesis from the requirements side rather than architecture. The thing that unlocked it for me was making the docs both hierarchical and machine-readable - one file per entity (goal / feature / requirement / AC) with YAML frontmatter holding IDs, status, and trace links, committed alongside the code. Colocating is only half the value; the other half is that the AI pipeline can ingest it as structured context instead of free-form prose.
I ended up building a VS Code extension around this called SPECLAN (disclosure: I'm the creator). What makes it click is the MCP server - the spec tree is exposed as tools, so the AI quality gate queries architecture and requirements the same way it queries tests. That's the piece that turned "docs in git" into "docs the gate actually uses."
Your Vertex AI gateway sounds like the natural place this lands - did the agents start quoting architecture doc IDs back in quality findings, or do they still frame concerns in natural language? That's been the hardest part of the loop for me.
Hello @thlandgraf! It looks like what you see in the screenshot. But since it is Gemini's answer, it can be formatted as needed. I'm using a critical findings section (blockers) and overall recommendations (non-blockers). This is a slightly old screenshot, because now it also writes additional info about OWASP. Next week I will share more info about the gate, as I'm planning to run a GDG workshop about it :)
Also, the screenshot is from the developer's IDE. I decided to give myself (the developer) the ability to run it from the IDE before creating an MR, to save time :D
Thanks @alexandertyutin - the blockers / non-blockers split is exactly the piece I was missing. Severity-bucketed findings map cleanly to CI logic ("block MR if blockers bucket is non-empty") in a way strict doc-ID quoting never really does, and it's the bit a human reviewer scans first anyway. That's a much nicer forcing function than what I was imagining.
The IDE-before-MR placement is interesting too - do you run it in both places (IDE + pre-merge CI), or did you drop the CI gate once the IDE version was fast enough? I keep hitting the same tension between "fast feedback" and "actually enforced."
Would love to catch the GDG Workshop writeup if you publish slides afterwards. Which region / date are you running it in?
@thlandgraf Thanks for your interest and questions! They really make me think deeply about the practical side :)
This gate was born during indie development of a niche product intended to reach both adults and children. So from the very beginning I understood that I was facing a huge conflict of interest between the CEO, CTO, and CISO in my head :) I also have a teammate, and I understood that at some point additional people may be added to the process.
So my first intent was to provide another version of myself for any stage of the SDLC. When I'm implementing a new feature, I want my code and overall approach to be reviewed by a security person. And when I'm the security person, I need a developer/architect counterparty. This approach pulled me into the habit of finding trade-offs.
On the other hand, I was interested in testing this approach to process automation and documentation. I discovered an interesting side effect: a practically proven way of bringing the required competence into the IDE and CI/CD.
At first I added the check to CI and was happy as the CISO. But then, as a developer, I realized that waiting several minutes for the "MR Failed" response is too expensive :) So I cloned the CI step into a bash script and started using the script as a developer (before the final push, or just when I feel I need fresh eyes). But I haven't removed the CI check, both because of the human aspect (I can forget to check locally) and for the possible case of sharing the development process.
And sometimes this cross-check brings value :) I've seen cases where I forgot to check locally, and cases where the CI check provided an additional view. It's a kind of real process modelling, where different approvers may provide different details :)
It also pushed me to use documented security and architecture exceptions with approved compensatory measures and due dates :) So I can state that it has definitely improved my development and deployment discipline, and at the same time provided the scalable part of a change management process.
We will run it in Russian on the 25th of April (this Saturday). Link here. But I plan to prepare a repo with code samples and an explanatory video. Based on the threads here, I now understand that I should compile the threads and questions into some kind of supporting demo video and process description. At least for myself :) I think I will do that and publish it here as well :) And maybe additional workshops will then be run in English...
The CEO/CTO/CISO-in-one-head framing really resonates - I've spent the last 7 years in Head-of-Digitization roles where the same conflict plays out, except spread across actual humans rather than one person. The gate ends up doing the same job in both cases: forcing a structured trade-off conversation that would otherwise happen as a vibes-based argument in a meeting. The artifact becomes the place where the conflict resolves instead of where it starts.
Watching for the English follow-up. Happy to be a second pair of eyes on the workshop materials before the English run if it would help.
Thanks a lot @thlandgraf :)
auto-generated docs are a snapshot, not a contract. feeding stale docs into a quality pipeline doesn't catch systemic failures - it just fails silently until someone notices the docs are 3 sprints behind
That's it! But the quality gate may also be configured to block the MR in case of significant changes without additional documentation.
For example, my quality gate didn't pass my MR with the new docs, which revealed my own problems :) It pushed me into documenting security/arch exceptions and the dates by which they should be fixed :)
quality gates that block on undocumented security exceptions - that's the forcing function most teams skip. pain upfront > surprise in prod.
Yeah, that's why I've been working in security for 15+ years and am still not afraid of being unemployed, even in the agentic AI era :)
15 years of security intuition is exactly what AI can't replicate - you know why gates exist, not just that they should. Honestly the agentic shift probably makes your pattern recognition more valuable, not less. What's your biggest concern right now: trust boundaries or supply chain?
trust boundaries, tbh. supply chain at least has tooling - you can scan, audit, block. trust fails silently. an agent confidently doing the wrong thing because someone shaped its context - that one keeps me up at night.
That collapse into provenance is the real problem. Supply chain trust assumes humans at the root - someone signed off on each layer. Once AI is generating the audit tooling too, the root of trust gets fuzzy. Most orgs are a few incidents away from having to think seriously about that recursion.
The finding about the trace header stripping hit me. Not because it's a dramatic bug, but because it's the exact kind of decision that looks correct in isolation and becomes obviously wrong only when you zoom out. The engineer who wrote that middleware probably felt responsible. "I'm protecting our internal traces from external tampering." Good instinct. Wrong layer.
What's interesting is that this class of error is almost impossible to catch with traditional tooling. A linter sees a function that modifies headers. Fine. A security scanner might even flag it as a good practice - sanitizing inputs at the boundary. You need the intent of the system to recognize that this particular header isn't a threat vector, it's a load-bearing piece of observability infrastructure.
The documentation didn't just help the AI find the bug. It gave the AI permission to reason about what the system was supposed to do. Without that, it's just pattern-matching against a corpus of code. With it, it's evaluating whether the implementation honors the design.
Makes me wonder about the inverse failure mode. If the documentation is wrong - if it describes an intent that never made it into the code, or that rotted over time - does the Quality Gate become an engine for confidently flagging "violations" of a fictional standard? An AI that trusts stale docs might be worse than no AI at all. How are you handling the drift problem? Is the agent also responsible for detecting when the implementation has moved on and the ARCHITECTURE.md needs a refresh?
The drift problem is the one that keeps me up at night too. My current mental model splits it in two: mechanical drift (the doc references a symbol, endpoint or table that no longer exists) and semantic drift (the artifact is still there, but its behavior moved on). Mechanical drift you can catch with plain structural checks - each doc entity carries a pointer to a code symbol, and CI fails when the target is missing. Semantic drift is the hard one. An agent can flag "this function's behavior diverges from the description," but it's often just the agent re-reading the code and convincing itself of whichever story is more polished. I haven't found a purely automated answer. Best I've landed on is scheduled re-reviews of docs older than N weeks, with the agent surfacing "sections most likely to have drifted" to shorten the reviewer's path - which is kind of an admission that the problem isn't solved. Your "engine for confidently flagging violations of a fictional standard" line nails the failure mode I worry about most.
Exactly! That engineer was me :) And I had precisely the same thoughts you've described :) I understood it especially well when, several days later, I got a tricky bug and realized I couldn't trace it from the client (also mine, but in another GCP project). And while on one hand I'm still thinking about trace header security, I now understand that some kind of transparent traceability should exist not only inside the core service mesh but also between the platform and its client. I will dig into it a bit later, because it was a trade-off between MVP speed and quality level (just as mentioned in other comments). But the exception is documented for the quality gate, and the due date is also defined :)
Yes, good point! There are a lot of things to think about and experiments to run.
Another good point for experiments!
Thank you for such a deep dive and such a meaningful comment! :)
Treating architecture documentation as a first-class engineering asset is long overdue. When documentation lives alongside code and follows the same workflows, it naturally stays relevant and actionable.
Appreciate the emphasis on keeping it lightweight, continuously updated, and developer-friendly - that's what makes it actually usable rather than just existing for compliance.
Well articulated - this is the kind of discipline that truly scales engineering teams.
Yeah! Exactly :)
The finding about distributed tracing headers being stripped is a perfect example of something no linter will ever catch. I've seen the same class of problem with security groups and VPC endpoint policies - the code-level decision looks reasonable in isolation, but violates a system-level invariant that only exists in someone's head (or, if you're lucky, in an architecture doc).
The practical insight that resonates most: colocating documentation with code in the same commit. The moment architecture docs live in a wiki, they're fiction within two sprints. An ARCHITECTURE.md next to the Dockerfile, updated in the same PR that changes the service - that's the only pattern I've seen survive past month three. The agent-generated first draft approach is smart too. The blank page problem is real, and a structured template (Intent, Principles, Interaction Diagram) gives the agent enough constraints to produce something worth editing rather than something worth deleting.
I'm using an approach where I create a task doc in the new branch initially. Before the MR, I append (with agents, of course) a "what was done" section. It was helpful during the quality gate check, because the gate looks not only through the code but also through a supporting doc. But now I realize that not only "what was done" should be added, but also "why it was done" :)
Or just scan with vouch-secure and be sure 🤷‍♂️
Subject for research :)
Treating architecture docs as a first-class asset is one of those things every team agrees with in principle and almost no one does in practice. The trick that worked for my last team: making ADRs a required part of every PR that touches a system boundary (new service, new external dep, schema change). Not 'should write one' - the PR template literally has an adr-link field that fails CI if empty for those changes. Suddenly the docs stay current because they're a precondition for shipping, not an afterthought. Curious whether you've found a forcing function that works without becoming bureaucratic.
The ADR-as-PR-precondition is honestly the cleanest version of that pattern I've seen - works because the cost of writing one is small if the change deserves it and large if it doesn't, which is exactly the right signal.
The variant I've landed on moves the gate one level up: requirements can't transition from review to approved without sign-off, and a check fails if code references something still in review. Same shape as your ADR-link field, just on the spec entity rather than the PR. Bureaucratic-ness depends almost entirely on how granular you make the unit - too fine and every comma needs documentation, too coarse and you're back to free-form prose. Haven't found a clean answer to that beyond tuning per team.
Agree with the core move here - ARCHITECTURE.md as context turns AI review from linting into reasoning. The Vertex finding (the two architectural violations) maps cleanly to what we've seen.
The honest limitation of architecture-aware review: the AI still reasons from the map the team drew. If the team didn't think to worry about a specific user journey, the architecture doc doesn't mention it, and the review doesn't catch it either. An internal pentest catches what the company already knows to worry about. The value of an outside bug bounty is the adversarial ignorance the team doesn't have.
That's roughly where we've been spending time - behavioral testing that starts from observable user intent rather than from our own ARCHITECTURE.md. Not instead of your approach; the pair is stronger than either alone.
Interested if you've seen Vertex bring in observable-behavior context yet or if it's still pure static-structure input.
Loved the technical depth of the article. Hello from Almaty!
Thanks, Askar! :)