tomas maiorino

Posted on Apr 1

Using AI Agents to Debug Distributed Systems in Under a Minute

#ai #java #springai #troubleshooting

Using AI Agents to Debug Distributed Systems Faster

At my company, we have a feature that allows customers to export large volumes of data to cloud providers.

Under the hood, this export process is split into multiple tasks, where each task is responsible for exporting a subset of objects. These tasks are executed by pods in a multi-tenant Kubernetes environment.

From time to time, we receive alerts indicating that some tasks are taking too long to start and remain in the queue for an extended period.

When that happens, an investigation begins.

The challenge is that this analysis is usually slow, manual, and repetitive.

A typical investigation involves:

Checking the status of each task and validating key attributes
Reviewing tenant configurations to identify values that may cause issues
Inspecting overall cluster health
Analyzing how many tasks each tenant has created
Cross-checking configuration in Bitbucket
Making multiple API calls across services

This process can easily take several minutes, and sometimes much longer, especially during active incidents.

The Idea: Automate the Investigation with an AI Agent

We decided to speed things up by building an AI-powered agent.

The goal was simple:

Automatically gather the relevant data, analyze it, and provide a probable root cause in seconds.

Architecture

To achieve this, we built two main components.

1. MCP Server

We created an MCP server that exposes a set of tools wrapping our internal APIs.

These tools allow the agent to:

Query task status
Fetch tenant configurations
Inspect system limits such as max replicas
Retrieve cluster-level information
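
As a rough illustration of what such a tool surface could look like, here is a dependency-free Java sketch with one method per capability listed above. The interface name, method names, parameters, and return types are all assumptions made for this example, and the in-memory implementation returns placeholder values loosely based on the mock reports later in this article. In a Spring AI project, methods like these would typically be exposed through the framework's tool support; that wiring is left out here so the sketch stays self-contained.

```java
import java.util.Map;

// Hypothetical tool surface: one method per capability the agent needs.
// Names, parameters, and return types are assumptions for illustration.
interface ExportDiagnosticsTools {
    String taskStatus(String tenantId, String taskId);
    Map<String, Integer> tenantConfig(String tenantId);
    int maxReplicas(String environment);
    Map<String, Integer> clusterInfo(String environment);
}

// In-memory stand-in so the sketch runs without any internal API.
// All values are placeholders, not real data.
final class InMemoryExportDiagnosticsTools implements ExportDiagnosticsTools {
    private final Map<String, Map<String, Integer>> configs = Map.of(
        "tenant5", Map.of("maxPeriodicTasksPerTenant", 52),
        "tenant6", Map.of("maxTaskPartsCount", 30));

    @Override
    public String taskStatus(String tenantId, String taskId) {
        return "SCHEDULED"; // a real tool would call the task API here
    }

    @Override
    public Map<String, Integer> tenantConfig(String tenantId) {
        // Unknown tenants yield an empty config instead of null.
        return configs.getOrDefault(tenantId, Map.of());
    }

    @Override
    public int maxReplicas(String environment) {
        return 84;
    }

    @Override
    public Map<String, Integer> clusterInfo(String environment) {
        return Map.of("maxReplicas", maxReplicas(environment));
    }
}
```

Keeping each tool small and side-effect free makes it easier for the agent to call them in any order while it gathers evidence.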

2. AI Agent

On top of that, we built an AI agent that:

Uses the MCP tools
Analyzes the collected data
Produces a structured diagnostic report

Input: From Alert Logs to Insight

At the time, we did not have a direct integration with our alerting system.

Because of that, we designed the agent to interpret the log lines included in the alert and generate a report from them.

Example Input

Generate report for test environment
Long waiting in queue tasks amount:13 ; Affected tenants: tenant3, tenant2, tenant5, tenant4, tenant1,tenant6

Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: 528f72d7-4f5d-4559-a825-b05014114dc7 - taskId: 528f72d7-4f5d-4559-a825-b05014114dc7 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: e7b2449f-0ded-4627-a6b6-eb305e074503 - taskId: e7b2449f-0ded-4627-a6b6-eb305e074503 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant2 - groupId: b481240d-d166-45cd-92dd-9f45d99c17f0 - taskId: b481240d-d166-45cd-92dd-9f45d99c17f0 - 1h 33m 6s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant3 - groupId: e1514831-b1b3-444f-a209-118c82718fbe - taskId: e1514831-b1b3-444f-a209-118c82718fbe - 1h 32m 10s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - taskId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - 1h 32m 0s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: c875921a-03d7-498b-aba4-848703e398d7 - taskId: c875921a-03d7-498b-aba4-848703e398d7 - 1h 24m 40s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - taskId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - 1h 24m 26s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 29b656a1-e77b-41b9-bcdc-4d0ee9edc282 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 2fe2d3a1-0ac1-42d2-996e-0edc5a075fe7 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant6 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 45e198f9-e629-47c0-ac5f-cf8eabbdcaf6 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 9ba638b7-ed61-4977-991b-2aa691a38d73 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: cfc26a65-30eb-4264-88e0-307b878e4d3c - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: d3748c08-baf5-43e4-98e7-7dd75d1ce43d - 1h 21m 43s - com.export.tasks.local.ExportTask
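
In the project described here, the agent interprets these lines itself. For comparison, a deterministic pre-parsing step could look like the sketch below, which extracts the tenant, group ID, task ID, and waiting time from one log line using a regular expression. The class name, record name, and pattern are assumptions derived only from the sample lines above, so other log formats would need a different pattern.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the "Long waiting in queue task" lines shown above.
final class QueueAlertParser {
    record QueuedTask(String tenantId, String groupId, String taskId,
                      Duration waiting, String taskClass) {}

    // Hours and minutes are optional; seconds are always expected.
    private static final Pattern LINE = Pattern.compile(
        "Long waiting in queue task: (\\S+) - groupId: (\\S+) - taskId: (\\S+) - "
        + "(?:(\\d+)h )?(?:(\\d+)m )?(\\d+)s - (\\S+)");

    // Returns an empty Optional for lines that do not describe a queued task,
    // such as the summary line with the affected tenants.
    static Optional<QueuedTask> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        Duration waiting = Duration.ofHours(number(m.group(4)))
            .plusMinutes(number(m.group(5)))
            .plusSeconds(number(m.group(6)));
        return Optional.of(new QueuedTask(
            m.group(1), m.group(2), m.group(3), waiting, m.group(7)));
    }

    // A missing optional group counts as zero.
    private static long number(String value) {
        return value == null ? 0L : Long.parseLong(value);
    }
}
```

Parsing up front would give the agent exact task IDs to pass to its tools, at the cost of coupling the input to one log format.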

Example Response

Below is an example of the report generated by the agent.

Tenant Summary

Tenant ID: tenant11
Tenant Name: Unknown
Total Tasks: 58

Tasks Status

Group ID: e4361de7-280e-4906-9cc5-7a56c056a959
Status: SCHEDULED

Tasks in Queue

Task ID: 7b67dd23-fdca-4acb-ab28-88bf8597c7a0
- Group ID: 951b184b-e639-43cc-b837-0768b1f447ab
- parallelExecution: false

Tenant Config

maxPeriodicTasksPerTenant: 35
maxTaskPartsCount: 13

Credit Balance

priorityCredits: 23408
standardSyncCredits: 2796
standardAsyncCredits: -18368
internalCredits: -89636

Problem Areas

Tasks with `parallelExecution=false`

Tenant: tenant1
- Group ID: ec39f0e0-2093-43c0-93ed-73cc2675337f
- Task ID: 5b9280cd-9a10-4786-9ae0-000a26e6b0ce
Tenant: tenant3
- Group ID: 46588442-a969-4568-9a81-a7356727180f
- Task ID: 56193050-578e-4803-bca1-5c059e64fe3e
Tenant: tenant2
- Group ID: 951b184b-e639-43cc-b837-0768b1f447ab
- Task ID: 7b67dd23-fdca-4acb-ab28-88bf8597c7a0

Tenants with `maxTaskPartsCount` or `maxPeriodicTasksPerTenant` equal to or higher than 30% of the max replicas

Max Replicas: 84
Tenant: tenant6
- maxTaskPartsCount: 30
- Usage: 35.7% of max replicas
Tenant: tenant5
- maxPeriodicTasksPerTenant: 52
- Usage: 61.9% of max replicas
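
The usage figures in this report are a tenant limit divided by the max replicas, for example 30 / 84 ≈ 35.7%. A minimal sketch of that check, assuming the 30% threshold named in the report heading; the class and method names are hypothetical:

```java
import java.util.Locale;

// Hypothetical helper for the "30% of max replicas" rule in the report.
final class ReplicaShareCheck {
    static final int THRESHOLD_PERCENT = 30;

    // Integer arithmetic avoids floating-point surprises at exactly 30%.
    static boolean meetsThreshold(int limit, int maxReplicas) {
        requirePositive(maxReplicas);
        return (long) limit * 100 >= (long) maxReplicas * THRESHOLD_PERCENT;
    }

    // Formats the share with one decimal place, as in the report.
    static String usagePercent(int limit, int maxReplicas) {
        requirePositive(maxReplicas);
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * limit / maxReplicas);
    }

    private static void requirePositive(int maxReplicas) {
        if (maxReplicas <= 0) {
            throw new IllegalArgumentException("maxReplicas must be positive");
        }
    }
}
```

With the mock values above, a limit of 30 or 52 against 84 replicas is flagged, while a limit of 13 is not.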

Another Example of Agent Output

In another test, the agent produced the following findings:

Problem Areas

Tasks with `parallelExecution=false`

Tenant ID: tenant6, Group ID: ab92df1b-9c82-4940-8e25-dfa77f275ebb
Tenant ID: tenant3, Group ID: e8143f64-11b1-48a7-960f-b005e5805871
Tenant ID: tenant5, Group ID: d5812c89-3961-4f4d-bf1b-b08c36833a06
Tenant ID: tenant4, Group ID: c875921a-03d7-498b-aba4-848703e398d7
Tenant ID: tenant1, Group ID: 8ed4824d-3333-4f9b-b442-123446b2006d
Tenant ID: tenant2, Group ID: 63801bb0-b5f7-44a9-bd94-84ccb9cc845e

Cluster Saturation

Total active tasks: 86
Max replicas: 38

This means the total number of active tasks was already higher than the available capacity for that environment.

Tasks with Lower Throughput

Tenant ID: tenant1
Group ID: 528f72d7-4f5d-4559-a825-b05014114dc7

Task Statuses Seen in the Analysis

CANCELED
COMPLETED
SCHEDULED
PROCESSING
PAUSED
SCHEDULED_POLL

Tenants with High Limits Relative to Cluster Capacity

Tenant: tenant1
- maxPeriodicTasksPerTenant: 86
- This is equal to or higher than 30% of the max replicas for that environment.
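
The Cluster Saturation finding in this report comes down to comparing two numbers: 86 active tasks against 38 max replicas. The sketch below shows that comparison under the simplifying assumption that each active task needs one replica; the class and method names are hypothetical.

```java
// Hypothetical saturation check, assuming one replica per active task.
final class ClusterSaturation {
    // Saturated when there are more active tasks than replicas to run them.
    static boolean isSaturated(int activeTasks, int maxReplicas) {
        return activeTasks > maxReplicas;
    }

    // Number of tasks that cannot get a replica right now (never negative).
    static int tasksOverCapacity(int activeTasks, int maxReplicas) {
        return Math.max(0, activeTasks - maxReplicas);
    }
}
```

For the mock values in this report, 86 - 38 leaves 48 tasks waiting for capacity, which is consistent with tasks sitting in the queue for a long time.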

What the Agent Detects

From reports like the ones above, the agent is able to highlight:

Tasks that cannot run in parallel because parallelExecution=false
Cluster saturation issues, where active tasks exceed available replicas
Misconfigured tenants exceeding safe limits
Credit imbalances that may impact execution
Low-throughput or blocked task groups
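
As one example of how such a finding could be computed deterministically, the sketch below groups tasks that have `parallelExecution=false` by tenant. The `Task` record and its fields are assumptions modeled on the report fields shown earlier, not the actual data model.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical finder for tasks that cannot run in parallel.
final class SerialTaskFinder {
    record Task(String tenantId, String groupId, String taskId,
                boolean parallelExecution) {}

    // Maps each tenant to the IDs of its tasks with parallelExecution=false.
    // A TreeMap keeps tenants sorted so the report order is stable.
    static Map<String, List<String>> serialTaskIdsByTenant(List<Task> tasks) {
        return tasks.stream()
            .filter(task -> !task.parallelExecution())
            .collect(Collectors.groupingBy(
                Task::tenantId,
                TreeMap::new,
                Collectors.mapping(Task::taskId, Collectors.toList())));
    }
}
```

Tenants whose tasks all allow parallel execution do not appear in the result, so an empty map means this rule found nothing to report.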

Impact

This approach reduced investigation time from manual analysis that could take several minutes or longer to automated analysis that usually takes less than one minute.

It can take a bit longer when the agent needs to inspect more than 20 different tasks, but it is still significantly faster than the manual process.

Tech Stack

I built the demo as a Java project using Spring AI, made up of the following pieces:

Mockoon for generating mock API data
java-project for the MCP tools
agent-ui for the user interface
agent for the orchestration and reasoning layer

The agent was originally built in Python, but I thought it would be interesting to build a small project with Spring AI and learn more about the framework.

Key Takeaways

A lot of operational analysis follows a repeatable pattern.

That makes it a good fit for an AI agent that can:

Gather data from multiple sources
Correlate information across APIs and configuration
Suggest a likely root cause
Produce a report that engineers can use immediately

The real value comes from combining tooling and reasoning.

What's Next

The next step is to rebuild the project using an agent + agent.md design pattern.

The goal is to make the solution:

More modular
Easier to maintain
Easier to evolve as new investigation paths are added

Final Thoughts

This project started from a very practical problem: incident investigations were taking too much time.

By turning the investigation flow into a set of tools plus an AI reasoning layer, we were able to automate a large part of the process and dramatically reduce the time to get useful answers.

Instead of manually checking dashboards, configs, and APIs, we now have an agent that can:

Read alert context
Investigate the system
Summarize the findings
Suggest likely root causes

All in under a minute.

⚠️ Disclosure

At my company, the solution was originally designed in Python, as it was the primary language used for working with MCP (Model Context Protocol) and AI agents at the time.

However, I decided to build this demo in Java to gain hands-on experience with Spring AI while applying it to a real-world problem. You can find the demo code here.

All values and data shown in this article were generated using Mockoon and do not represent real production data. They are simplified examples meant to reflect the kind of information handled by the actual application.

Feedback

I would love to hear your thoughts.

Would you trust an AI agent to help debug production issues?
How would you design this differently?
