I was working with a global auto industry leader on their post-sales software platform. The platform had recently launched with seven modules. Each module was built as a microservice.
Communication across services happened primarily through a message streaming broker.
The data flow between services was non-trivial — upstream and downstream dependencies, bidirectional communication patterns, and conditional routing based on context.
The Operational Reality
When an end user raised a complaint, the support team had to perform initial root cause analysis before escalating to engineering.
The system had too many moving parts for quick, intuition-based debugging. And more importantly, the mental model of "how everything connects" was concentrated in one person on the support team.
This wasn't a tooling problem.
It was a knowledge distribution problem.
The question became:
Can we codify the debugging intuition of the most experienced support engineer — and make it usable by anyone?
That's where the idea of an operations support AI agent emerged.
But we were careful about one thing:
The goal wasn't to make an agent that "knows everything."
The goal was to make an agent grounded in the actual architecture of the system.
Designing the Agent Backwards from Reality
The complexity wasn't just the number of services.
It was:
- Inter-service communication patterns
- Conditional flows
- Bidirectional dependencies
- And multiple layers of state verification (UI, logs, database)
So instead of jumping straight into prompt engineering, we started with context engineering.
We asked:
What does a strong human support engineer actually do when debugging?
The answer was structured, even if it wasn't documented.
And that structure became the foundation of the agent.
Step 1: Reconstruct the System's Big-Picture Flow
The services were distributed across multiple repositories (polyrepo structure). To understand interactions, we first had to bring everything into one workspace.
What we did
- Checked out all service repositories together.
- For each service, we prompted AI to generate upstream and downstream dependency diagrams based on message broker configurations found in the codebase.
- We generated these per service to avoid overloading the model.
- Once individual service documents were created, we asked AI to compile them into a single system-wide data flow diagram using Mermaid (a text-based diagramming syntax).
The result was a consolidated "big-picture" document.
This became foundational context for the agent - not a theoretical architecture diagram, but something derived from the actual codebase configurations.
It allowed the agent to reason about interaction points instead of guessing.
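To make the compilation step concrete, here is a minimal sketch of how per-service dependency docs can be merged into one Mermaid flowchart. The service and topic names are illustrative, not the client's, and this assumes each per-service document was first distilled into a simple producer/topic/consumer edge list.

```python
"""Sketch: compile per-service broker dependencies into one Mermaid diagram."""

# Hypothetical edges extracted from each service's message-broker config:
# (producer service, topic, consumer service).
edges = [
    ("orders", "order-events", "billing"),
    ("orders", "order-events", "notifications"),
    ("billing", "invoice-events", "orders"),  # bidirectional dependency
]

def to_mermaid(edges):
    """Render an edge list as a Mermaid flowchart, one arrow per topic hop."""
    lines = ["flowchart LR"]
    for producer, topic, consumer in edges:
        lines.append(f"    {producer} -->|{topic}| {consumer}")
    return "\n".join(lines)

print(to_mermaid(edges))
```

Because Mermaid is plain text, the compiled diagram can live in the repo next to the per-service docs and be regenerated whenever a broker config changes.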
Step 2: Model How Humans Debug via the UI
One of the most interesting observations was this:
The most effective early debugging didn't start with logs.
It started with the application UI.
Support engineers used UI screens to:
- Search for domain entities
- Inspect state
- Check timestamps
- Identify where a transaction stopped progressing
So we needed the agent to replicate that behavior.
What we did
- Prompted AI to extract the list of UI screens from the micro-frontend system.
- Prompted separately for each UI module to maintain output quality.
- Generated a structured document listing:
- UI screens
- Available search filters
- Displayed columns/data points
Then we created an index/router document that allowed the agent to:
- Identify which screens correspond to a domain entity
- Suggest navigation paths
- Recommend filters to apply
This transformed the agent from a generic reasoning engine into something application-aware.
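A minimal sketch of what that index/router document can look like in structured form. The entity names, screen paths, and filters here are invented for illustration; the real index was generated per micro-frontend module.

```python
"""Sketch: an index/router mapping domain entities to UI screens."""

# Hypothetical index; one entry per domain entity, one dict per screen.
UI_INDEX = {
    "order": [
        {
            "screen": "Order Search",
            "path": "/orders/search",
            "filters": ["order_id", "vin", "status"],
            "columns": ["order_id", "vin", "status", "last_updated"],
        },
        {
            "screen": "Order Timeline",
            "path": "/orders/{order_id}/timeline",
            "filters": ["date_range", "event_type"],
            "columns": ["event", "timestamp", "actor"],
        },
    ],
}

def route(entity, issue_keywords=()):
    """Suggest screens for an entity, prioritizing filters the issue mentions."""
    suggestions = []
    for screen in UI_INDEX.get(entity, []):
        matched = [f for f in screen["filters"] if f in issue_keywords]
        suggestions.append({
            "screen": screen["screen"],
            "path": screen["path"],
            "suggested_filters": matched or screen["filters"],
        })
    return suggestions
```

For example, `route("order", issue_keywords=("vin",))` surfaces both screens but narrows the Order Search suggestion to the `vin` filter the issue mentioned.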
Step 3: Enable Database-Level Reasoning with ER Context
When UI-level validation wasn't enough, the fallback was querying the database.
But meaningful DB debugging requires:
- Understanding entity relationships
- Knowing which fields exist
- Writing contextually valid queries
So we:
- Generated ER diagrams for each backend service
- Built a routing index so the agent could load the appropriate ER diagram based on the issue's domain context
Again, the pattern was the same:
Keep context modular.
Allow conditional loading.
Avoid overwhelming the model.
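The routing pattern for ER context can be sketched in a few lines. The domain keywords and document paths are hypothetical; the point is that only the matching ER diagrams get loaded into the prompt, keeping the context small.

```python
"""Sketch: conditionally load only the ER docs relevant to an issue."""

# Hypothetical routing index: domain keyword -> ER document path.
ER_INDEX = {
    "order": "er/orders-service.md",
    "invoice": "er/billing-service.md",
    "notification": "er/notifications-service.md",
}

def select_er_docs(issue_text, index=ER_INDEX):
    """Return the ER docs whose domain keywords appear in the issue text."""
    text = issue_text.lower()
    return sorted({path for keyword, path in index.items() if keyword in text})

docs = select_er_docs("Invoice not generated after order completion")
# docs -> ['er/billing-service.md', 'er/orders-service.md']
```

A production version would likely use the model itself (or embeddings) rather than keyword matching to classify the domain, but the modular load-on-demand structure stays the same.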
Step 4: Designing the Agent Interaction Model
Only after building the context layer did we design the agent itself.
We structured it intentionally.
Role: The agent acts as an Expert Operations Support Engineer.
Task:
For every issue:
- Extract domain entity information from the problem description.
- Generate a checklist in a strict order:
- Verify entity state via UI screens
- Verify entity flow in message broker logs (with topic names)
- Verify entity integrity via database queries
This sequence mirrors how experienced support engineers approach triage.
Context References:
The agent explicitly refers to:
- big-picture.md
- ui-screens-index.md
- er-diagrams-index.md
Output Format
Every response includes:
- Problem Understanding
- Overall Impact
- Checklist to Follow
- Interaction Points Identified
- Short Summary
The output was designed to be actionable - a checklist an engineer could execute step by step.
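The strict verification order can be captured in a small sketch. The entity, screen, topic, and query values below are placeholders; what matters is that the UI → broker → database sequence is fixed in code rather than left to the model.

```python
"""Sketch: render a triage checklist in the fixed UI -> broker -> DB order."""

CHECK_ORDER = ["ui", "broker", "database"]  # the strict triage sequence

def build_checklist(entity, ui_steps, broker_topics, db_queries):
    """Number the verification steps, phase by phase, in CHECK_ORDER."""
    steps = {
        "ui": [f"Verify {entity} state on screen: {s}" for s in ui_steps],
        "broker": [f"Check {entity} events on topic: {t}" for t in broker_topics],
        "database": [f"Run integrity check: {q}" for q in db_queries],
    }
    lines, n = [], 1
    for phase in CHECK_ORDER:
        for step in steps[phase]:
            lines.append(f"{n}. {step}")
            n += 1
    return "\n".join(lines)
```

Keeping the sequence in code means the agent fills in the *content* of each step from its context documents, while the *order* - the part experienced engineers agreed on - never drifts.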
What We Achieved
This resulted in a PoC custom agent capable of generating structured, domain-aware debugging checklists.
More importantly:
We converted implicit operational knowledge into explicit, structured artifacts.
The support workflow was no longer dependent on a single individual's system intuition.
Human Evaluation: A Necessary Constraint
We did not treat the agent as authoritative.
Every generated checklist was reviewed by the existing support engineers who already performed these tasks manually.
The next phase was clear:
Test it with individuals who had minimal prior knowledge of the system.
Iteration was always part of the plan.
An operations agent like this should be challenged continuously.
Its usefulness depends entirely on how rigorously it is refined.
Where This Can Go
Once the checklist is grounded in real architecture, each phase becomes automatable.
Future possibilities we identified:
- Automatically updating UI and ER documents on PR merges
- Garbage collection of outdated context
- Triggering the agent via MCP when a P1 ticket is raised
- Attaching generated checklists directly to support tickets
- Providing read-only search capabilities for domain entities
- Integrating log keyword searches and adapting based on results
- Even raising bug tickets automatically if conditions are met
The autonomy doesn't need to jump to full resolution.
It can increase incrementally, phase by phase, based on trust and accuracy.
Reflection
The hardest part of AI in operations isn't reasoning.
It's grounding.
Once the system's architecture, UI workflows, and data relationships were codified into structured context, the agent's job became deterministic.