Sergey Byvshev

AI Alert Assistant: How n8n + LLM Replace Routine Diagnostics

Anyone who has dealt with keeping services running knows how exhausting and unpredictably time-consuming incident diagnostics and resolution can be.

Over the years, I've watched the evolution of incident response processes — from "whoever spots the problem first owns it" to strictly defined 24/7 on-call rotations, SLA-driven response times, runbook adherence, and separation of responsibility across platforms.

One thing has remained constant:

  1. Gathering data from multiple sources:
     • Metrics
     • Logs
     • Traces
     • Release and maintenance timelines
  2. Analysis based on personal knowledge and experience
  3. Formulating possible solutions

If you have a documented procedure for every situation, that simplifies things somewhat — but it doesn't teach the investigative mindset needed for real troubleshooting.

Writing and maintaining a runbook for every alert is tedious work, and even then an experienced engineer will always outperform a library of hundreds of runbooks.

But what if an engineer's function could be performed even when no engineer is physically present?

Designing the Assistant: What to Define Upfront

Before writing any code, four questions need to be answered:

  • What events, and in what format, should be provided to the agent?
  • What data sources might it need?
  • How should it manipulate that data to identify the root cause?
  • What should the diagnostic report look like in terms of form and content?

Let's break down each one. A link to the workflow itself can be found below.

Events and Format

Typically, what's sufficient to kick off diagnostics is an event containing:

  • Alertname
  • Description
  • Labels:
     • job_name
     • namespace
     • pod
     • env
     • region
  • Grafana dashboard link
  • Runbook URL
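For reference, these fields arrive inside Alertmanager's standard webhook payload. Here is a trimmed sketch of that payload as a Python dict; the field names follow the Alertmanager webhook format (version 4), while all concrete values are hypothetical:

```python
# Trimmed example of an Alertmanager webhook payload (version 4 format).
# All concrete values are hypothetical.
event = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighErrorRate",
                "job_name": "checkout-api",
                "namespace": "prod",
                "pod": "checkout-api-6f7d9c-x2v1k",
                "env": "production",
                "region": "eu-west-1",
            },
            "annotations": {
                "description": "5xx rate above 5% for 10 minutes",
                "dashboard": "https://grafana.example.com/d/abc123",
                "runbook_url": "https://wiki.example.com/runbooks/high-error-rate",
            },
        }
    ],
}

# The agent prompt only needs a handful of these fields:
labels = event["alerts"][0]["labels"]
print(labels["alertname"], labels["namespace"])
```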

Data Sources

Most frequently, we turn to:

  • Metrics that have breached acceptable thresholds
  • Resource consumption and load metrics
  • Error logs
  • The platform — Kubernetes or a standalone server
  • Related CI/CD releases
  • Alert definitions and firing conditions

Data Analysis

There is arguably no canonical sequence of steps for analysis. The diagnostic process is inherently variable — which is why no one has yet managed to write a single script that covers every possible scenario. But we'll give it a shot.

First, let's consider how we ourselves approach incident diagnosis:

  • Examine what's happening with the metric that triggered the alert: determine the nature of the anomaly — a spike, monotonic growth, or a persistently critical value
  • Determine whether this is a software-level failure or caused by issues at a lower layer
  • Check infrastructure metrics: resources, networking, system limits
  • Inspect logs at the point where the problem is occurring
  • Determine how recently the affected components were updated and what changed
  • Attempt to interact with the components directly — through the orchestrator or a Linux shell
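As a toy illustration of the first step, classifying the shape of the anomaly can be sketched as a function over a window of recent metric samples. The function name and thresholds here are my own invention for the example, not tuned values:

```python
def classify_anomaly(samples: list[float], critical: float) -> str:
    """Rough classification of a metric window that breached `critical`.

    Returns one of: "spike", "monotonic growth", "persistently critical".
    Thresholds and logic are illustrative only.
    """
    above = [s >= critical for s in samples]
    if all(above):
        return "persistently critical"
    # A non-decreasing series suggests steady growth toward the threshold.
    if all(b >= a for a, b in zip(samples, samples[1:])):
        return "monotonic growth"
    return "spike"

print(classify_anomaly([0.1, 0.1, 9.0, 0.1], critical=5.0))   # spike
print(classify_anomaly([1.0, 2.0, 3.0, 6.0], critical=5.0))   # monotonic growth
print(classify_anomaly([7.0, 8.0, 7.5, 9.0], critical=5.0))   # persistently critical
```

A real agent would get this information from a PromQL range query rather than a hardcoded list, but the decision it has to make is the same.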

Report Format

In most cases, this kind of report is meant to be read by humans, so it should be written in plain, natural language. Concise — just the discovered facts, a list of hypotheses, and possible remediation steps. The most convenient place for such a report is a thread under the corresponding alert in the team chat.
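A minimal formatter for that three-part structure might look like this (the section names are my own choice, not a fixed convention):

```python
def format_report(facts: list[str], hypotheses: list[str], remediation: list[str]) -> str:
    """Assemble the three report sections into a plain-text thread reply."""
    lines = ["*Facts*"]
    lines += [f"- {f}" for f in facts]
    lines += ["", "*Hypotheses*"]
    lines += [f"- {h}" for h in hypotheses]
    lines += ["", "*Possible remediation*"]
    lines += [f"- {r}" for r in remediation]
    return "\n".join(lines)

print(format_report(
    facts=["5xx rate at 8% since 14:02", "Pod restarted twice in 10 min"],
    hypotheses=["OOM kill after the 14:00 release"],
    remediation=["Roll back release v2.3.1", "Raise the memory limit"],
))
```

In practice the LLM produces this text itself; a template like this only belongs in the system prompt as the required output shape.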

Solution Architecture
Here's the desired flow: when an alert fires, the event is sent to a webhook that extracts the relevant data and assembles a clear, well-structured prompt for the AI agent.

The AI agent, guided by its system prompt and the available MCP tools, performs diagnostics and generates a report in a predefined format.

The report is then posted to the team chat.

Implementation

If you're in a hurry, you can view the finished workflow below.

As the execution environment for workflows like this, I chose n8n because it:

  • Lets you build easily readable automations fairly quickly
  • Makes it simple to share your work
  • Separates logic from secrets and other hardcoded values
  • Has a free self-hosted version
  • Has an enormous community

Personally, it reminds me of Jenkins about ten years ago — and Jenkins was great.
You can install n8n using any of the methods described in the documentation, for example with docker-compose.
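A minimal docker-compose service for a self-hosted instance might look like this (the image name and port follow the official docs; the volume name is my own choice):

```yaml
services:
  n8n:
    image: docker.n8n.io/n8nio/n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    volumes:
      - n8n_data:/home/node/.n8n

volumes:
  n8n_data:
```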

From here, the implementation will depend on the systems you use. In my case:

Preprocessing Incoming Events

Alertmanager can send alerts to a custom webhook. In n8n, all you need to do is create a Webhook trigger node; you can also specify authentication parameters there.


Then add the created webhook's URL as a new receiver (e.g. named n8n) in Alertmanager.
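The Alertmanager side of that might look like this (the URL and credentials are placeholders; `http_config` is only needed if you enabled authentication on the webhook node):

```yaml
receivers:
  - name: n8n
    webhook_configs:
      - url: "https://n8n.example.com/webhook/alert-assistant"
        send_resolved: false
        http_config:
          basic_auth:
            username: alertmanager
            password: <secret>

route:
  receiver: n8n
```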
After this, we'll be able to send alerts from Alertmanager to our workflow. However, the received messages contain unnecessary data, and the format is not ideal: it makes it harder for the LLM to understand what's being asked of it and drives up token consumption. Therefore, we'll make a small modification using a Code node.
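A sketch of that preprocessing (n8n Code nodes run JavaScript by default, but the logic is the same in any language; the field names assume the standard Alertmanager webhook payload):

```python
def simplify_alert(payload: dict) -> dict:
    """Strip an Alertmanager webhook payload down to what the agent needs."""
    alert = payload["alerts"][0]
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return {
        "alertname": labels.get("alertname", "unknown"),
        "description": annotations.get("description", ""),
        "namespace": labels.get("namespace", ""),
        "pod": labels.get("pod", ""),
        "env": labels.get("env", ""),
        "region": labels.get("region", ""),
        "runbook_url": annotations.get("runbook_url", ""),
    }

# Hypothetical incoming payload, trimmed to the relevant parts:
raw = {
    "alerts": [{
        "labels": {"alertname": "HighErrorRate", "namespace": "prod", "pod": "api-1"},
        "annotations": {"description": "5xx rate above 5%"},
    }]
}
print(simplify_alert(raw))
```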

You'll likely want to store certain values as variables — for instance, the UID of your Prometheus datasource in Grafana.

AI Agent

The AI Agent node requires a connected LLM to operate. Almost any model can be plugged in, but in my experience, Codex and Opus perform best.

We don't use Memory here, since each alert is an independent event unrelated to others.
One of the key aspects is writing the system prompt. What should it include?

  • Agent purpose — what it's supposed to do
  • Brief description of your infrastructure and the type of service you provide
  • Description of each MCP tool — e.g., use the Kubernetes MCP to get pod status, related events, etc.
  • Important rules to follow and pitfalls to avoid — e.g., never ask questions, write the response in a specific language, never make any changes to the infrastructure
  • Diagnostic guidelines — essentially what we discussed in the Data Analysis section above
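Putting those pieces together, a condensed system prompt might read like this (the infrastructure details and tool names are placeholders for your own setup):

```
You are an SRE assistant. When given an alert, diagnose its most likely
root cause. Our stack: Kubernetes, Prometheus/Grafana, GitLab CI.

Tools:
- Grafana MCP: query metrics and dashboards.
- Kubernetes MCP: pod status, events, recent rollouts.
- Qdrant knowledge base: past incidents and runbook notes.

Rules:
- Never ask the user questions.
- Never modify the infrastructure; read-only access only.
- Answer in English as a short report: facts, hypotheses, remediation.
```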

MCP — The Agent's Eyes and Ears

MCP tools serve as the agent's eyes and ears, giving it the ability to interact with the subject of diagnosis. The specific list may vary depending on your infrastructure, but the core categories of data sources (which we outlined earlier) remain the same. In my case, the list looks like this:


When running your MCP servers, make sure they run in remote HTTP streaming mode.

Vector Store

The knowledge base deserves separate attention. It allows you to store large volumes of information and perform fast lookups. This saves tokens and reduces the time spent on external system queries. I use Qdrant as this knowledge base. I strongly recommend setting a service API token for authentication.

Next, you need to create a collection where your knowledge will be stored. You can do this through the web interface at http://:6333/dashboard.


Create QdrantApi credentials in n8n and use them to connect.


Once the database is connected to the agent as a tool, it's time to load it with knowledge. I use a separate workflow for this.

Simply run this workflow and upload your knowledge file(s) through the form that appears — they'll be saved to the database.

Posting to Chat

After the AI agent completes its work, we need to send the results to the chat where engineers will see them. The delivery chain consists of three nodes:

  • Search for recent messages in the alerts channel. Unfortunately, not all group chats support keyword search via API, so the last 10 messages are retrieved instead.
  • Find the message that corresponds to our alert.
  • Post the diagnostic results as a thread reply to that message.

For Slack integration, you'll need to set up authentication following the official Slack API documentation.

Testing and Examples

Here's what the final workflow looks like.


I've tested minor variations of this workflow across several projects, and here are the results.

On average, analyzing an alert takes 30 seconds. In that time, the agent manages to inspect metrics, review logs, assess the state of the K8s cluster, and deliver a verdict.

Conclusion

What we end up with is an assistant that gets to work the instant an alert fires. The analysis time is minimal, so by the time an engineer sees the alert, the initial diagnostics will usually already be complete.

This is just one of the directions where AI can meaningfully simplify life for infrastructure teams — and for development teams who are forced to handle their own support. The agent doesn't replace the engineer, but it takes on the first-response diagnostics and shortens the gap between "alert fired" and "we understand what's going on." And at night — when the on-call engineer is asleep — that can be invaluable.


A base version of the workflow is available here: https://github.com/javdet/automagicops-workflows/tree/main/workflows/AlertAssistant
Want to quickly implement a similar flow for yourself? Read the full Patreon guide with detailed examples and practical tips.
Author of the article: https://linkedin.com/in/sergeybyvshev
