tomas maiorino

Using AI Agents to Debug Distributed Systems in Under a Minute

At my company, we have a feature that allows customers to export large volumes of data to cloud providers.

Under the hood, this export process is split into multiple tasks, where each task is responsible for exporting a subset of objects. These tasks are executed by pods in a multi-tenant Kubernetes environment.

From time to time, we receive alerts indicating that some tasks are taking too long to start and remain in the queue for an extended period.

When that happens, an investigation begins.

The challenge is that this analysis is usually slow, manual, and repetitive.

A typical investigation involves:

  • Checking the status of each task and validating key attributes
  • Reviewing tenant configurations to identify values that may cause issues
  • Inspecting overall cluster health
  • Analyzing how many tasks each tenant has created
  • Cross-checking configuration in Bitbucket
  • Making multiple API calls across services

This process can easily take several minutes, and sometimes much longer, especially during active incidents.

The Idea: Automate the Investigation with an AI Agent

We decided to speed things up by building an AI-powered agent.

The goal was simple:

Automatically gather the relevant data, analyze it, and provide a probable root cause in seconds.

Architecture

To achieve this, we built two main components.

1. MCP Server

We created an MCP server that exposes a set of tools wrapping our internal APIs.

These tools allow the agent to:

  • Query task status
  • Fetch tenant configurations
  • Inspect system limits such as max replicas
  • Retrieve cluster-level information
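As a rough illustration of the tool surface described above, here is a minimal plain-Java sketch. The interface name, method names, and return shapes are assumptions made for this example and are not taken from the demo project; the MCP registration through Spring AI is left out, and the stub returns placeholder values borrowed from the mock reports shown later in this post.

```java
import java.util.Map;

// Hypothetical tool surface: names and return types are assumptions for illustration only.
interface ExportDiagnosticsTools {
    String taskStatus(String tenantId, String groupId);   // query task status, e.g. "SCHEDULED"
    Map<String, Integer> tenantConfig(String tenantId);   // fetch tenant configuration limits
    int maxReplicas(String environment);                  // inspect system limits
    int activeTaskCount(String environment);              // retrieve cluster-level information
}

// In-memory stand-in for the internal APIs, using placeholder mock values.
final class StubExportDiagnosticsTools implements ExportDiagnosticsTools {
    @Override
    public String taskStatus(String tenantId, String groupId) {
        return "SCHEDULED";
    }

    @Override
    public Map<String, Integer> tenantConfig(String tenantId) {
        return Map.of("maxPeriodicTasksPerTenant", 35, "maxTaskPartsCount", 13);
    }

    @Override
    public int maxReplicas(String environment) {
        return 38;
    }

    @Override
    public int activeTaskCount(String environment) {
        return 86;
    }
}
```

Keeping the tools behind one small interface like this would make it easy to swap the stub for real API clients without touching the agent.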

2. AI Agent

On top of that, we built an AI agent that:

  • Uses the MCP tools
  • Analyzes the collected data
  • Produces a structured diagnostic report
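To make "structured diagnostic report" a bit more concrete, the sketch below collects findings into named sections and renders them as headings with bullet lists. `DiagnosticReport` is a hypothetical class written for this post; in practice the report text may come straight from the model rather than from a builder like this.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical report builder: sections keep insertion order so the output is stable.
final class DiagnosticReport {
    private final Map<String, List<String>> sections = new LinkedHashMap<>();

    // Adds (or replaces) a section and returns this report for chaining.
    DiagnosticReport section(String title, List<String> findings) {
        sections.put(title, List.copyOf(findings));
        return this;
    }

    // Renders each section as a title, a blank line, and one bullet per finding.
    String render() {
        StringBuilder out = new StringBuilder();
        sections.forEach((title, findings) -> {
            out.append(title).append("\n\n");
            findings.forEach(f -> out.append("  \u2022 ").append(f).append('\n'));
            out.append('\n');
        });
        return out.toString().stripTrailing();
    }
}
```

For example, `new DiagnosticReport().section("Cluster Saturation", List.of("Total active tasks: 86", "Max replicas: 38")).render()` produces a "Cluster Saturation" heading followed by two bullets.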

Input: From Alert Logs to Insight

At the time, we did not have a direct integration with our alerting system.

Because of that, we designed the agent to interpret the log lines included in the alert and generate a report from them.

Example Input

Generate report for test environment
Long waiting in queue tasks amount:13 ; Affected tenants: tenant3, tenant2, tenant5, tenant4, tenant1,tenant6

Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: 528f72d7-4f5d-4559-a825-b05014114dc7 - taskId: 528f72d7-4f5d-4559-a825-b05014114dc7 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: e7b2449f-0ded-4627-a6b6-eb305e074503 - taskId: e7b2449f-0ded-4627-a6b6-eb305e074503 - 1h 33m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant2 - groupId: b481240d-d166-45cd-92dd-9f45d99c17f0 - taskId: b481240d-d166-45cd-92dd-9f45d99c17f0 - 1h 33m 6s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant3 - groupId: e1514831-b1b3-444f-a209-118c82718fbe - taskId: e1514831-b1b3-444f-a209-118c82718fbe - 1h 32m 10s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant1 - groupId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - taskId: bad9fb64-ec1a-46e8-b3cd-91ded32cd551 - 1h 32m 0s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: c875921a-03d7-498b-aba4-848703e398d7 - taskId: c875921a-03d7-498b-aba4-848703e398d7 - 1h 24m 40s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant4 - groupId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - taskId: b2eb4f12-439f-4027-9b07-be8d38c18d91 - 1h 24m 26s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 29b656a1-e77b-41b9-bcdc-4d0ee9edc282 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 2fe2d3a1-0ac1-42d2-996e-0edc5a075fe7 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant6 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 45e198f9-e629-47c0-ac5f-cf8eabbdcaf6 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: 9ba638b7-ed61-4977-991b-2aa691a38d73 - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: cfc26a65-30eb-4264-88e0-307b878e4d3c - 1h 21m 43s - com.export.tasks.local.ExportTask
Mar 17 21:34:06 export-pod-1 export WARN test | Long waiting in queue task: tenant5 - groupId: e4361de7-280e-4906-9cc5-7a56c056a959 - taskId: d3748c08-baf5-43e4-98e7-7dd75d1ce43d - 1h 21m 43s - com.export.tasks.local.ExportTask
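
Before any tool is called, alert lines like the ones above need to become structured data. The following is a minimal sketch of that step, assuming the line format is exactly what the sample shows; the regular expression is inferred from these lines only, and `QueuedTask` and `AlertLogParser` are hypothetical names rather than code from the demo.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One long-waiting task extracted from an alert log line.
record QueuedTask(String tenant, String groupId, String taskId, Duration waiting, String taskClass) {}

final class AlertLogParser {
    // Pattern inferred from the sample lines: tenant, two UUIDs, "1h 33m 43s", task class.
    private static final Pattern LINE = Pattern.compile(
        "Long waiting in queue task: (\\S+) - groupId: ([0-9a-f-]{36}) - taskId: ([0-9a-f-]{36})"
        + " - (?:(\\d+)h )?(?:(\\d+)m )?(\\d+)s - (\\S+)");

    // Returns the parsed task, or empty when the line is not a per-task alert line.
    static Optional<QueuedTask> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        long hours = m.group(4) == null ? 0 : Long.parseLong(m.group(4));
        long minutes = m.group(5) == null ? 0 : Long.parseLong(m.group(5));
        long seconds = Long.parseLong(m.group(6));
        Duration waiting = Duration.ofHours(hours).plusMinutes(minutes).plusSeconds(seconds);
        return Optional.of(new QueuedTask(m.group(1), m.group(2), m.group(3), waiting, m.group(7)));
    }
}
```

Applied to the first sample line, `parse` yields tenant `tenant1` with a waiting time of 1h 33m 43s. The summary line at the top of the alert does not match the pattern, so it comes back as an empty `Optional`.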

Example Response

Below is an example of the report generated by the agent.

Tenant Summary

  • Tenant ID: tenant11
  • Tenant Name: Unknown
  • Total Tasks: 58

Tasks Status

  • Group ID: e4361de7-280e-4906-9cc5-7a56c056a959
  • Status: SCHEDULED

Tasks in Queue

  • Task ID: 7b67dd23-fdca-4acb-ab28-88bf8597c7a0
    • Group ID: 951b184b-e639-43cc-b837-0768b1f447ab
    • parallelExecution: false

Tenant Config

  • maxPeriodicTasksPerTenant: 35
  • maxTaskPartsCount: 13

Credit Balance

  • priorityCredits: 23408
  • standardSyncCredits: 2796
  • standardAsyncCredits: -18368
  • internalCredits: -89636

Problem Areas

Tasks with parallelExecution=false

  • Tenant: tenant1

    • Group ID: ec39f0e0-2093-43c0-93ed-73cc2675337f
    • Task ID: 5b9280cd-9a10-4786-9ae0-000a26e6b0ce
  • Tenant: tenant3

    • Group ID: 46588442-a969-4568-9a81-a7356727180f
    • Task ID: 56193050-578e-4803-bca1-5c059e64fe3e
  • Tenant: tenant2

    • Group ID: 951b184b-e639-43cc-b837-0768b1f447ab
    • Task ID: 7b67dd23-fdca-4acb-ab28-88bf8597c7a0

Tenants with maxTaskPartsCount or maxPeriodicTasksPerTenant equal to or higher than 30% of the max replicas

  • Max Replicas: 84

  • Tenant: tenant6

    • maxTaskPartsCount: 30
    • Usage: 35.7% of max replicas
  • Tenant: tenant5

    • maxPeriodicTasksPerTenant: 52
    • Usage: 61.9% of max replicas
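
The "Usage" figures in this report are simply a tenant limit divided by the max replicas, compared against the 30% threshold. A minimal sketch of that arithmetic follows; `ReplicaShare` is a hypothetical helper written for this post, not code from the demo.

```java
import java.util.Locale;

final class ReplicaShare {
    // Threshold from the report heading: 30% of max replicas.
    static final double THRESHOLD = 0.30;

    // Fraction of the cluster's max replicas that a tenant limit could occupy.
    static double share(int limit, int maxReplicas) {
        if (maxReplicas <= 0) {
            throw new IllegalArgumentException("maxReplicas must be positive");
        }
        return (double) limit / maxReplicas;
    }

    // True when the limit is equal to or higher than the threshold.
    static boolean exceedsThreshold(int limit, int maxReplicas) {
        return share(limit, maxReplicas) >= THRESHOLD;
    }

    // Formats the share the same way the report does, with one decimal place.
    static String describe(int limit, int maxReplicas) {
        return String.format(Locale.ROOT, "%.1f%% of max replicas", share(limit, maxReplicas) * 100);
    }
}
```

With max replicas at 84, `describe(30, 84)` gives "35.7% of max replicas" and `describe(52, 84)` gives "61.9% of max replicas", matching the two flagged tenants. A limit of 13 works out to roughly 15.5% and would not be flagged.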

Another Example of Agent Output

In another test, the agent produced the following findings:

Problem Areas

Tasks with parallelExecution=false

  • Tenant ID: tenant6, Group ID: ab92df1b-9c82-4940-8e25-dfa77f275ebb
  • Tenant ID: tenant3, Group ID: e8143f64-11b1-48a7-960f-b005e5805871
  • Tenant ID: tenant5, Group ID: d5812c89-3961-4f4d-bf1b-b08c36833a06
  • Tenant ID: tenant4, Group ID: c875921a-03d7-498b-aba4-848703e398d7
  • Tenant ID: tenant1, Group ID: 8ed4824d-3333-4f9b-b442-123446b2006d
  • Tenant ID: tenant2, Group ID: 63801bb0-b5f7-44a9-bd94-84ccb9cc845e

Cluster Saturation

  • Total active tasks: 86
  • Max replicas: 38

With 86 active tasks against a maximum of 38 replicas, demand was already more than double the available capacity for that environment.

Tasks with Lower Throughput

  • Tenant ID: tenant1
  • Group ID: 528f72d7-4f5d-4559-a825-b05014114dc7

Task Statuses Seen in the Analysis

  • CANCELED
  • COMPLETED
  • SCHEDULED
  • PROCESSING
  • PAUSED
  • SCHEDULED_POLL

Tenants with High Limits Relative to Cluster Capacity

  • Tenant: tenant1
    • maxPeriodicTasksPerTenant: 86
    • This is equal to or higher than 30% of the max replicas for that environment.

What the Agent Detects

From reports like the ones above, the agent is able to highlight:

  • Tasks that cannot run in parallel because parallelExecution=false
  • Cluster saturation issues, where active tasks exceed available replicas
  • Misconfigured tenants exceeding safe limits
  • Credit imbalances that may impact execution
  • Low-throughput or blocked task groups
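
The cluster saturation finding in the list above comes down to a simple comparison. Here is a minimal sketch, assuming each replica runs one task at a time (this post does not state that explicitly); `ClusterSaturation` is a hypothetical name.

```java
final class ClusterSaturation {
    // Saturated when there are more active tasks than replicas available to run them.
    static boolean isSaturated(int activeTasks, int maxReplicas) {
        return activeTasks > maxReplicas;
    }

    // Number of active tasks that cannot get a replica, never negative.
    static int tasksWithoutCapacity(int activeTasks, int maxReplicas) {
        return Math.max(0, activeTasks - maxReplicas);
    }
}
```

With the values from the second report (86 active tasks, 38 max replicas), `isSaturated` returns true and `tasksWithoutCapacity` returns 48.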

Impact

This approach reduced investigation time from a manual analysis that could take several minutes or longer to an automated analysis that usually takes less than one minute.

It can take a bit longer when the agent needs to inspect more than 20 different tasks, but it is still significantly faster than the manual process.

Tech Stack

The demo is a Java project built with Spring AI, made up of the following parts:

  • Mockoon for generating mock API data
  • java-project for the MCP tools
  • agent-ui for the user interface
  • agent for the orchestration and reasoning layer

The agent was originally built in Python, but I wanted to rebuild it as a small Spring AI project to learn more about the framework.

Key Takeaways

A lot of operational analysis follows a repeatable pattern.

That makes it a good fit for an AI agent that can:

  • Gather data from multiple sources
  • Correlate information across APIs and configuration
  • Suggest a likely root cause
  • Produce a report that engineers can use immediately

The real value comes from combining tooling and reasoning.

What's Next

The next step is to rebuild the project using an agent + agent.md design pattern.

The goal is to make the solution:

  • More modular
  • Easier to maintain
  • Easier to evolve as new investigation paths are added

Final Thoughts

This project started from a very practical problem: incident investigations were taking too much time.

By turning the investigation flow into a set of tools plus an AI reasoning layer, we were able to automate a large part of the process and dramatically reduce the time to get useful answers.

Instead of manually checking dashboards, configs, and APIs, we now have an agent that can:

  • Read alert context
  • Investigate the system
  • Summarize the findings
  • Suggest likely root causes

Usually in under a minute.

⚠️ Disclosure

At my company, the solution was originally designed in Python, as it was the primary language used for working with MCP (Model Context Protocol) and AI agents at the time.

However, I decided to build this demo in Java to gain hands-on experience with Spring AI while applying it to a real-world problem. You can find the demo code here.

All values and data shown in this article were generated using Mockoon and do not represent real production data. They are simplified examples meant to reflect the kind of information handled by the actual application.

Feedback

I would love to hear your thoughts.

  • Would you trust an AI agent to help debug production issues?
  • How would you design this differently?
