I was working with a global auto industry leader on their post-sales software platform. The platform had recently launched with seven modules. Each module was built as a microservice.
Communication across services happened primarily through a message streaming broker.
The data flow between services was non-trivial — upstream and downstream dependencies, bidirectional communication patterns, and conditional routing based on context.
The Operational Reality
When an end user raised a complaint, the support team had to perform initial root cause analysis before escalating to engineering.
The system had too many moving parts for quick, intuition-based debugging. And more importantly, the mental model of "how everything connects" was concentrated in one person on the support team.
This wasn't a tooling problem.
It was a knowledge distribution problem.
The question became:
Can we codify the debugging intuition of the most experienced support engineer — and make it usable by anyone?
That's where the idea of an operations support AI agent emerged.
But we were careful about one thing:
The goal wasn't to make an agent that "knows everything."
The goal was to make an agent grounded in the actual architecture of the system.
Designing the Agent Backwards from Reality
The complexity wasn't just the number of services.
It was:
- Inter-service communication patterns
- Conditional flows
- Bidirectional dependencies
- And multiple layers of state verification (UI, logs, database)
So instead of jumping straight into prompt engineering, we started with context engineering.
We asked:
What does a strong human support engineer actually do when debugging?
The answer was structured, even if it wasn't documented.
And that structure became the foundation of the agent.
Step 1: Reconstruct the System's Big-Picture Flow
The services were distributed across multiple repositories (polyrepo structure). To understand interactions, we first had to bring everything into one workspace.
What we did
- Checked out all service repositories together.
- For each service, we prompted AI to generate upstream and downstream dependency diagrams based on message broker configurations found in the codebase.
- We generated these per service to avoid overloading the model.
- Once individual service documents were created, we asked AI to compile them into a single system-wide data flow diagram using Mermaid (a text-based diagramming syntax).
The result was a consolidated "big-picture" document.
This became foundational context for the agent - not a theoretical architecture diagram, but something derived from the actual codebase configurations.
It allowed the agent to reason about interaction points instead of guessing.
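To make the compilation step concrete, here is a minimal sketch of how per-service dependency docs can be merged into one Mermaid flowchart. The service and topic names are illustrative, not the client's, and this assumes each per-service document was first distilled into a simple producer/topic/consumer edge list.

```python
"""Sketch: compile per-service broker dependencies into one Mermaid diagram."""

# Hypothetical edges extracted from each service's message-broker config:
# (producer service, topic, consumer service).
edges = [
    ("orders", "order-events", "billing"),
    ("orders", "order-events", "notifications"),
    ("billing", "invoice-events", "orders"),  # bidirectional dependency
]

def to_mermaid(edges):
    """Render an edge list as a Mermaid flowchart, one arrow per topic hop."""
    lines = ["flowchart LR"]
    for producer, topic, consumer in edges:
        lines.append(f"    {producer} -->|{topic}| {consumer}")
    return "\n".join(lines)

print(to_mermaid(edges))
```

Because Mermaid is plain text, the compiled diagram can live in the repo next to the per-service docs and be regenerated whenever a broker config changes.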
Step 2: Model How Humans Debug via the UI
One of the most interesting observations was this:
The most effective early debugging didn't start with logs.
It started with the application UI.
Support engineers used UI screens to:
- Search for domain entities
- Inspect state
- Check timestamps
- Identify where a transaction stopped progressing
So we needed the agent to replicate that behavior.
What we did
- Prompted AI to extract the list of UI screens from the micro-frontend system.
- Prompted separately for each UI module to maintain output quality.
- Generated a structured document listing:
- UI screens
- Available search filters
- Displayed columns/data points
Then we created an index/router document that allowed the agent to:
- Identify which screens correspond to a domain entity
- Suggest navigation paths
- Recommend filters to apply
This transformed the agent from a generic reasoning engine into something application-aware.
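A minimal sketch of what that index/router document can look like in structured form. The entity names, screen paths, and filters here are invented for illustration; the real index was generated per micro-frontend module.

```python
"""Sketch: an index/router mapping domain entities to UI screens."""

# Hypothetical index; one entry per domain entity, one dict per screen.
UI_INDEX = {
    "order": [
        {
            "screen": "Order Search",
            "path": "/orders/search",
            "filters": ["order_id", "vin", "status"],
            "columns": ["order_id", "vin", "status", "last_updated"],
        },
        {
            "screen": "Order Timeline",
            "path": "/orders/{order_id}/timeline",
            "filters": ["date_range", "event_type"],
            "columns": ["event", "timestamp", "actor"],
        },
    ],
}

def route(entity, issue_keywords=()):
    """Suggest screens for an entity, prioritizing filters the issue mentions."""
    suggestions = []
    for screen in UI_INDEX.get(entity, []):
        matched = [f for f in screen["filters"] if f in issue_keywords]
        suggestions.append({
            "screen": screen["screen"],
            "path": screen["path"],
            "suggested_filters": matched or screen["filters"],
        })
    return suggestions
```

For example, `route("order", issue_keywords=("vin",))` surfaces both screens but narrows the Order Search suggestion to the `vin` filter the issue mentioned.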
Step 3: Enable Database-Level Reasoning with ER Context
When UI-level validation wasn't enough, the fallback was querying the database.
But meaningful DB debugging requires:
- Understanding entity relationships
- Knowing which fields exist
- Writing contextually valid queries
So we:
- Generated ER diagrams for each backend service
- Built a routing index so the agent could load the appropriate ER diagram based on the issue's domain context
Again, the pattern was the same:
Keep context modular.
Allow conditional loading.
Avoid overwhelming the model.
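The routing pattern for ER context can be sketched in a few lines. The domain keywords and document paths are hypothetical; the point is that only the matching ER diagrams get loaded into the prompt, keeping the context small.

```python
"""Sketch: conditionally load only the ER docs relevant to an issue."""

# Hypothetical routing index: domain keyword -> ER document path.
ER_INDEX = {
    "order": "er/orders-service.md",
    "invoice": "er/billing-service.md",
    "notification": "er/notifications-service.md",
}

def select_er_docs(issue_text, index=ER_INDEX):
    """Return the ER docs whose domain keywords appear in the issue text."""
    text = issue_text.lower()
    return sorted({path for keyword, path in index.items() if keyword in text})

docs = select_er_docs("Invoice not generated after order completion")
# docs -> ['er/billing-service.md', 'er/orders-service.md']
```

A production version would likely use the model itself (or embeddings) rather than keyword matching to classify the domain, but the modular load-on-demand structure stays the same.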
Step 4: Designing the Agent Interaction Model
Only after building the context layer did we design the agent itself.
We structured it intentionally.
Role: The agent acts as an Expert Operations Support Engineer.
Task:
For every issue:
- Extract domain entity information from the problem description.
- Generate a checklist in a strict order:
- Verify entity state via UI screens
- Verify entity flow in message broker logs (with topic names)
- Verify entity integrity via database queries
This sequence mirrors how experienced support engineers approach triage.
Context References:
The agent explicitly refers to:
- big-picture.md
- ui-screens-index.md
- er-diagrams-index.md
Output Format
Every response includes:
- Problem Understanding
- Overall Impact
- Checklist to Follow
- Interaction Points Identified
- Short Summary
The output was designed to be actionable - a checklist an engineer could execute step by step.
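The strict verification order can be captured in a small sketch. The entity, screen, topic, and query values below are placeholders; what matters is that the UI → broker → database sequence is fixed in code rather than left to the model.

```python
"""Sketch: render a triage checklist in the fixed UI -> broker -> DB order."""

CHECK_ORDER = ["ui", "broker", "database"]  # the strict triage sequence

def build_checklist(entity, ui_steps, broker_topics, db_queries):
    """Number the verification steps, phase by phase, in CHECK_ORDER."""
    steps = {
        "ui": [f"Verify {entity} state on screen: {s}" for s in ui_steps],
        "broker": [f"Check {entity} events on topic: {t}" for t in broker_topics],
        "database": [f"Run integrity check: {q}" for q in db_queries],
    }
    lines, n = [], 1
    for phase in CHECK_ORDER:
        for step in steps[phase]:
            lines.append(f"{n}. {step}")
            n += 1
    return "\n".join(lines)
```

Keeping the sequence in code means the agent fills in the *content* of each step from its context documents, while the *order* - the part experienced engineers agreed on - never drifts.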
What We Achieved
This resulted in a PoC custom agent capable of generating structured, domain-aware debugging checklists.
More importantly:
We converted implicit operational knowledge into explicit, structured artifacts.
The support workflow was no longer dependent on a single individual's system intuition.
Human Evaluation: A Necessary Constraint
We did not treat the agent as authoritative.
Every generated checklist was reviewed by the existing support engineers who already performed these tasks manually.
The next phase was clear:
Test it with individuals who had minimal prior knowledge of the system.
Iteration was always part of the plan.
An operations agent like this should be challenged continuously.
Its usefulness depends entirely on how rigorously it is refined.
Where This Can Go
Once the checklist is grounded in real architecture, each phase becomes automatable.
Future possibilities we identified:
- Automatically updating UI and ER documents on PR merges
- Garbage collection of outdated context
- Triggering the agent via MCP when a P1 ticket is raised
- Attaching generated checklists directly to support tickets
- Providing read-only search capabilities for domain entities
- Integrating log keyword searches and adapting based on results
- Even raising bug tickets automatically if conditions are met
The autonomy doesn't need to jump to full resolution.
It can increase incrementally, phase by phase, based on trust and accuracy.
Reflection
The hardest part of AI in operations isn't reasoning.
It's grounding.
Once the system's architecture, UI workflows, and data relationships were codified into structured context, the agent's job became deterministic.