Using AI Agents to Debug Distributed Systems Faster
At my company, we have a feature that allows customers to export large volumes of data to cloud providers.
Under the hood, this export process is split into multiple tasks, where each task is responsible for exporting a subset of objects. These tasks are executed by pods in a multi-tenant Kubernetes environment.
From time to time, we receive alerts indicating that some tasks are taking too long to start and remain in the queue for an extended period.
When that happens, an investigation begins.
The challenge is that this analysis is usually slow, manual, and repetitive.
A typical investigation involves:
- Checking the status of each task and validating key attributes
- Reviewing tenant configurations to identify values that may cause issues
- Inspecting overall cluster health
- Analyzing how many tasks each tenant has created
- Cross-checking configuration in Bitbucket
- Making multiple API calls across services
This process can easily take several minutes, and sometimes much longer, especially during active incidents.
The Idea: Automate the Investigation with an AI Agent
We decided to speed things up by building an AI-powered agent.
The goal was simple:
Automatically gather the relevant data, analyze it, and provide a probable root cause in seconds.
Architecture
To achieve this, we built two main components.
1. MCP Server
We created an MCP server that exposes a set of tools wrapping our internal APIs.
These tools allow the agent to:
- Query task status
- Fetch tenant configurations
- Inspect system limits such as max replicas
- Retrieve cluster-level information
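As a rough illustration of what "tools wrapping internal APIs" can look like, the sketch below registers plain Python functions in a dictionary that an agent runtime could call by name. It deliberately avoids any real MCP SDK. The tool names, the `FAKE_API` data, and the return shapes are all hypothetical stand-ins for the internal services, not the actual implementation.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory stand-in for the internal APIs (mock data only).
FAKE_API: Dict[str, Any] = {
    "tasks": {"task-1": {"tenant": "tenant1", "status": "SCHEDULED"}},
    "tenant_configs": {
        "tenant1": {"maxPeriodicTasksPerTenant": 35, "maxTaskPartsCount": 13}
    },
    "limits": {"test": {"maxReplicas": 84}},
}

# Registry of tool name -> callable.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so it can be looked up later."""
    TOOLS[func.__name__] = func
    return func


@tool
def get_task_status(task_id: str) -> str:
    """Query task status."""
    return FAKE_API["tasks"][task_id]["status"]


@tool
def get_tenant_config(tenant_id: str) -> dict:
    """Fetch a tenant's configuration."""
    return FAKE_API["tenant_configs"][tenant_id]


@tool
def get_max_replicas(environment: str) -> int:
    """Inspect a system limit (max replicas) for one environment."""
    return FAKE_API["limits"][environment]["maxReplicas"]


def call_tool(name: str, **kwargs: Any) -> Any:
    """Dispatch a tool call by name, raising KeyError for unknown tools."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

Keeping each tool small and read-only makes it easier to reason about what the agent is allowed to do during an incident.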
2. AI Agent
On top of that, we built an AI agent that:
- Uses the MCP tools
- Analyzes the collected data
- Produces a structured diagnostic report
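One way to picture the "structured diagnostic report" is as a small data structure that the analysis step fills in and a renderer turns into text. The sketch below is an assumption about the shape of such a report; the class names `Finding` and `DiagnosticReport` and the rendered layout are hypothetical, chosen only to mirror the headings used in the example reports in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    area: str    # e.g. "Cluster Saturation"
    detail: str  # human-readable explanation


@dataclass
class DiagnosticReport:
    environment: str
    findings: List[Finding] = field(default_factory=list)

    def add(self, area: str, detail: str) -> None:
        """Record one finding under a problem area."""
        self.findings.append(Finding(area, detail))

    def render(self) -> str:
        """Render the report as plain text, one bullet per finding."""
        lines = [f"Report for {self.environment} environment", "Problem Areas"]
        if not self.findings:
            lines.append("- No problems detected")
        for finding in self.findings:
            lines.append(f"- {finding.area}: {finding.detail}")
        return "\n".join(lines)
```

Separating the findings from their rendering means the same data could later feed a UI or an alerting integration without changing the analysis code.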
Input: From Alert Logs to Insight
At the time, we did not have a direct integration with our alerting system.
Because of that, we designed the agent to interpret the log lines included in the alert and generate a report from them.
Example Input
Generate report for test environment
Long waiting in queue tasks amount:13 ; Affected tenants: tenant3, tenant2, tenant5, tenant4, tenant1,tenant6
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: 528f72d7-4f5d-4559-a825-b05014114dc7 - taskId: 528f72d7-4f5d-4559-a825-b05014114dc7 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: e7b2449f-0ded-4627-a6b6-eb305e074503 - taskId: e7b2449f-0ded-4627-a6b6-eb305e074503 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant2 - groupId: b481240d-d166-45cd-92dd-9f45d99c17f0 - taskId: b481240d-d166-45cd-92dd-9f45d99c17f0 - 1h 33m 6s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant3 - groupId: e1514831-b1b3-444f-a209-118c82718fbe - taskId: e1514831-b1b3-444f-a209-118c82718fbe - 1h 32m 10s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - taskId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - 1h 32m 0s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: c875921a-03d7-498b-aba4-848703e398d7 - taskId: c875921a-03d7-498b-aba4-848703e398d7 - 1h 24m 40s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - taskId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - 1h 24m 26s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 29b656a1-e77b-41b9-bcdc-4d0ee9edc282 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 2fe2d3a1-0ac1-42d2-996e-0edc5a075fe7 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant6 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 45e198f9-e629-47c0-ac5f-cf8eabbdcaf6 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 9ba638b7-ed61-4977-991b-2aa691a38d73 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: cfc26a65-30eb-4264-88e0-307b878e4d3c - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: d3748c08-baf5-43e4-98e7-7dd75d1ce43d - 1h 21m 43s - com.export.tasks.local.ExportTask
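In the project the agent interprets these lines itself, but the same fields can also be extracted deterministically. The following is a minimal sketch of such a parser, assuming every task line follows the exact layout of the sample above; the names `QueuedTask`, `parse_alert`, and `wait_to_seconds` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches the task lines from the alert; the summary line ("tasks amount:...")
# does not contain "task: " and is therefore skipped.
LINE_RE = re.compile(
    r"Long waiting in queue task: (?P<tenant>\S+)"
    r" - groupId: (?P<group_id>[0-9a-f-]+)"
    r" - taskId: (?P<task_id>[0-9a-f-]+)"
    r" - (?P<wait>\d+[hms](?: \d+[hms])*)"
    r" - (?P<task_class>\S+)"
)


@dataclass
class QueuedTask:
    tenant: str
    group_id: str
    task_id: str
    wait_seconds: int
    task_class: str


def wait_to_seconds(wait: str) -> int:
    """Convert a duration such as '1h 33m 43s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", wait))


def parse_alert(text: str) -> List[QueuedTask]:
    """Extract one QueuedTask per matching log line, ignoring everything else."""
    tasks = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            tasks.append(
                QueuedTask(
                    tenant=match["tenant"],
                    group_id=match["group_id"],
                    task_id=match["task_id"],
                    wait_seconds=wait_to_seconds(match["wait"]),
                    task_class=match["task_class"],
                )
            )
    return tasks
```

For the first task line in the sample, this yields tenant `tenant1` with a wait of 5623 seconds (1h 33m 43s). Pre-parsing like this could also reduce how much raw log text the model has to read.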
Example Response
Below is an example of the report generated by the agent.
Tenant Summary
- Tenant ID: tenant11
- Tenant Name: Unknown
- Total Tasks: 58
Tasks Status
- Group ID: e4361de7-280e-4906-9cc5-7a56c056a959
- Status: SCHEDULED
Tasks in Queue
- Task ID: 7b67dd23-fdca-4acb-ab28-88bf8597c7a0
- Group ID: 951b184b-e639-43cc-b837-0768b1f447ab
- parallelExecution: false
Tenant Config
- maxPeriodicTasksPerTenant: 35
- maxTaskPartsCount: 13
Credit Balance
- priorityCredits: 23408
- standardSyncCredits: 2796
- standardAsyncCredits: -18368
- internalCredits: -89636
Problem Areas
Tasks with parallelExecution=false
- Tenant: tenant1, Group ID: ec39f0e0-2093-43c0-93ed-73cc2675337f, Task ID: 5b9280cd-9a10-4786-9ae0-000a26e6b0ce
- Tenant: tenant3, Group ID: 46588442-a969-4568-9a81-a7356727180f, Task ID: 56193050-578e-4803-bca1-5c059e64fe3e
- Tenant: tenant2, Group ID: 951b184b-e639-43cc-b837-0768b1f447ab, Task ID: 7b67dd23-fdca-4acb-ab28-88bf8597c7a0
Tenants with maxTaskPartsCount or maxPeriodicTasksPerTenant equal to or higher than 30% of the max replicas
Max Replicas: 84
- Tenant: tenant6, maxTaskPartsCount: 30, Usage: 35.7% of max replicas
- Tenant: tenant5, maxPeriodicTasksPerTenant: 52, Usage: 61.9% of max replicas
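The 30% rule in the report above is simple arithmetic: divide each tenant limit by the environment's max replicas and flag anything at or above the threshold. Here is a minimal sketch of that check, assuming tenant configs arrive as plain dictionaries; the function names and the `tenantX` entry are hypothetical, while the other numbers come from the report.

```python
from typing import Dict, List, Tuple

LIMIT_KEYS = ("maxTaskPartsCount", "maxPeriodicTasksPerTenant")


def limit_usage(limit: int, max_replicas: int) -> float:
    """Return a tenant limit as a percentage of max replicas (one decimal)."""
    if max_replicas <= 0:
        raise ValueError("max_replicas must be positive")
    return round(limit / max_replicas * 100, 1)


def flag_high_limits(
    configs: Dict[str, Dict[str, int]],
    max_replicas: int,
    threshold_pct: float = 30.0,
) -> List[Tuple[str, str, int, float]]:
    """Return (tenant, setting, value, usage %) for limits at or above the threshold."""
    if max_replicas <= 0:
        raise ValueError("max_replicas must be positive")
    flagged = []
    for tenant, config in configs.items():
        for key in LIMIT_KEYS:
            value = config.get(key)
            if value is not None and value / max_replicas * 100 >= threshold_pct:
                flagged.append((tenant, key, value, limit_usage(value, max_replicas)))
    return flagged


configs = {
    "tenant6": {"maxTaskPartsCount": 30},
    "tenant5": {"maxPeriodicTasksPerTenant": 52},
    "tenantX": {"maxTaskPartsCount": 10},  # hypothetical tenant below the threshold
}
high = flag_high_limits(configs, max_replicas=84)
```

With 84 max replicas, `tenant6` (30) comes out at 35.7% and `tenant5` (52) at 61.9%, matching the report, while the hypothetical `tenantX` is not flagged.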
Another Example of Agent Output
In another test, the agent produced the following findings:
Problem Areas
Tasks with parallelExecution=false
- Tenant ID: tenant6, Group ID: ab92df1b-9c82-4940-8e25-dfa77f275ebb
- Tenant ID: tenant3, Group ID: e8143f64-11b1-48a7-960f-b005e5805871
- Tenant ID: tenant5, Group ID: d5812c89-3961-4f4d-bf1b-b08c36833a06
- Tenant ID: tenant4, Group ID: c875921a-03d7-498b-aba4-848703e398d7
- Tenant ID: tenant1, Group ID: 8ed4824d-3333-4f9b-b442-123446b2006d
- Tenant ID: tenant2, Group ID: 63801bb0-b5f7-44a9-bd94-84ccb9cc845e
Cluster Saturation
- Total active tasks: 86
- Max replicas: 38
This means the total number of active tasks was already higher than the available capacity for that environment.
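The saturation finding boils down to comparing two numbers. The sketch below shows that comparison, assuming each active task needs one replica (an assumption made for illustration; the article does not state the exact scheduling model). The function name and the returned keys are hypothetical.

```python
def saturation(active_tasks: int, max_replicas: int) -> dict:
    """Compare active tasks against max replicas and summarize the overload."""
    if max_replicas <= 0:
        raise ValueError("max_replicas must be positive")
    return {
        "saturated": active_tasks > max_replicas,
        # Tasks that cannot get a replica under the one-task-per-replica assumption.
        "excess_tasks": max(active_tasks - max_replicas, 0),
        "load_ratio": round(active_tasks / max_replicas, 2),
    }


result = saturation(active_tasks=86, max_replicas=38)
```

For the values in this report, 86 active tasks against 38 replicas leaves 48 tasks without capacity, a load ratio of about 2.26.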
Tasks with Lower Throughput
- Tenant ID: tenant1, Group ID: 528f72d7-4f5d-4559-a825-b05014114dc7
Task Statuses Seen in the Analysis
- CANCELED
- COMPLETED
- SCHEDULED
- PROCESSING
- PAUSED
- SCHEDULED_POLL
Tenants with High Limits Relative to Cluster Capacity
- Tenant: tenant1, maxPeriodicTasksPerTenant: 86. This is equal to or higher than 30% of the max replicas for that environment.
What the Agent Detects
From reports like the ones above, the agent is able to highlight:
- Tasks that cannot run in parallel because parallelExecution=false
- Cluster saturation issues, where active tasks exceed available replicas
- Misconfigured tenants exceeding safe limits
- Credit imbalances that may impact execution
- Low-throughput or blocked task groups
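Two of the detections listed above are plain filters over the collected data. The sketch below shows one possible form, assuming tasks and credit balances arrive as dictionaries and treating any negative credit bucket as an imbalance (that interpretation is an assumption; the article does not define the rule). The function names are hypothetical, and the sample values are taken from the example report.

```python
from typing import Dict, List


def non_parallel_tasks(tasks: List[dict]) -> List[dict]:
    """Return tasks explicitly configured with parallelExecution=false."""
    return [task for task in tasks if task.get("parallelExecution") is False]


def negative_credits(balance: Dict[str, int]) -> Dict[str, int]:
    """Return only the credit buckets whose balance is below zero."""
    return {name: value for name, value in balance.items() if value < 0}


tasks = [
    {
        "tenant": "tenant2",
        "taskId": "7b67dd23-fdca-4acb-ab28-88bf8597c7a0",
        "parallelExecution": False,
    },
    # Hypothetical task that is allowed to run in parallel.
    {"tenant": "tenant1", "taskId": "example-task", "parallelExecution": True},
]
balance = {
    "priorityCredits": 23408,
    "standardSyncCredits": 2796,
    "standardAsyncCredits": -18368,
    "internalCredits": -89636,
}

blocked = non_parallel_tasks(tasks)
overdrawn = negative_credits(balance)
```

Keeping checks like these deterministic leaves the model free to focus on correlating the findings rather than recomputing them.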
Impact
This approach reduced investigation time from manual analysis that could take several minutes or longer to automated analysis that usually takes less than one minute.
It can take a bit longer when the agent needs to inspect more than 20 different tasks, but it is still significantly faster than the manual process.
Tech Stack
I created a Java project using Spring AI and the following modules:
- Mockoon for generating mock API data
- java-project for the MCP tools
- agent-ui for the user interface
- agent for the orchestration and reasoning layer
This agent was originally built in Python, but I thought it would be interesting to rebuild it as a small Spring AI project and learn more about the framework.
Key Takeaways
A lot of operational analysis follows a repeatable pattern.
That makes it a good fit for an AI agent that can:
- Gather data from multiple sources
- Correlate information across APIs and configuration
- Suggest a likely root cause
- Produce a report that engineers can use immediately
The real value comes from combining tooling and reasoning.
What's Next
The next step is to rebuild the project using an agent + agent.md design pattern.
The goal is to make the solution:
- More modular
- Easier to maintain
- Easier to evolve as new investigation paths are added
Final Thoughts
This project started from a very practical problem: incident investigations were taking too much time.
By turning the investigation flow into a set of tools plus an AI reasoning layer, we were able to automate a large part of the process and dramatically reduce the time to get useful answers.
Instead of manually checking dashboards, configs, and APIs, we now have an agent that can:
- Read alert context
- Investigate the system
- Summarize the findings
- Suggest likely root causes
All in under a minute.
⚠️ Disclosure
At my company, the solution was originally designed in Python, as it was the primary language used for working with MCP (Model Context Protocol) and AI agents at the time.
However, I decided to build this demo in Java to gain hands-on experience with Spring AI while applying it to a real-world problem. You can find the demo code here.
All values and data shown in this article were generated using Mockoon and do not represent real production data. They are simplified examples meant to reflect the kind of information handled by the actual application.
Feedback
I would love to hear your thoughts.
- Would you trust an AI agent to help debug production issues?
- How would you design this differently?