Sergey Byvshev

Posted on Jun 9

Building an AI First-Line for DevOps Support on n8n for $250/Month - Part 2

#ai #devops #programming #automation

In Part 1, I walked through how chat requests in Slack land in n8n, get classified by category, and are routed into the right processing branch. Under the hood of the classifier sits an LLM with access to Slack over MCP — it reads the thread context and decides what the new request is about. Around it run a few helper sub-workflows:

attachmentsAnalyzer — parses screenshots and text logs;
httpProbeTool — performs endpoint availability checks correctly, without taking the agent chain down with it;
errorReporter — covers us when the workflow itself fails so the requester isn't left in the dark.

Using the CI/CD assistant as an example, I showed how those building blocks compose into a processing branch. Today — the three remaining branches. Each handles a different class of requests, but architecturally they follow the same template. Thanks to that, adding a new branch takes a couple of hours rather than reinventing the wheel from scratch.

In this part:

Incident investigator — the most temperamental category, with the lowest rate of fully autonomous resolution and the most interesting edge cases;
Task manager — handling infrastructure modification requests with automatic ticket creation in Jira;
Infrastructure knowledge assistant — answers to "where is X configured?" with a small trick involving auto-generated READMEs in IaC repos.

All workflows and system prompts are published separately — link at the end.

Incident Investigator

This is probably the most "temperamental" branch of them all. The incident category catches pretty much anything that means "something just broke": a hung Postgres, 502s on the ingress controller, sudden OOMs in some consumer, mysterious "all my requests are slow but my neighbor's are fine." The spectrum is huge, there's no universal recipe — so the rate of fully autonomous resolution here is the lowest across all branches.

But even when the automation doesn't close the problem end-to-end, the dossier it puts together saves the on-call engineer 10–15 minutes at the start: firing alerts are already pulled, metrics and logs are already eyeballed, hypotheses are already framed. When you get dragged into an on-call rotation on a Saturday morning, those 10 minutes are sometimes the difference between "I had time to wake up" and "I'm already answering in chat with coffee in one hand."

Input data

The sub-workflow expects the same input structure as the CI/CD assistant from Part 1. To save you from jumping between articles — a short description of the fields:

{
  "message": "Chat request text",
  "post_id": "Message ID — needed to reply into that exact thread",
  "channel_id": "Slack channel ID",
  "channel_name": "Channel name, passed into the prompt for context",
  "user_name": "Author of the request, mentioned in the final reply",
  "user_id": "User ID",
  "file_ids": ["Attachment IDs, if any"],
  "category": "incident",
  "confidence": 0.95,
  "summary": "Short summary from the classifier",
  "is_thread": true,
  "thread_root_id": "ID of the thread's root message",
  "on_call_user": "On-call engineer name — comes in handy for escalation"
}

First, attachmentsAnalyzer2 runs — the same sub-workflow familiar from Part 1 that parses screenshots and logs. Triggering it only requires passing file_ids. If there are no attachments, the empty-attachments branch goes through a Merge node, and the pipeline doesn't crash on missing data.

Collecting variables in SetVars

Next, the SetVars node assembles everything that'll be needed both in the agent's system prompt and when posting the reply back to Slack. This part is worth pausing on, because these variables effectively parameterize the system prompt without forcing you to edit it by hand:

K8S_CLUSTERS — list of available contexts (we have two: dev and prod, both in DigitalOcean Frankfurt);
K8S_NAMESPACE — the main namespace with production workload;
GITHUB_ORG — GitHub organization name, so the agent doesn't try to search code across the entire internet;
prometheus_uid and loki_uid — Grafana datasource UIDs; without them, the agent has no idea where to knock for metrics and logs;
reply_root_id — ID of the message the agent will reply into (either the thread's root post or the original request post itself).

These values then get substituted into the system prompt via {{ $('SetVars').first().json.* }}. When a new cluster comes up or the namespace changes, you just edit values in a single node — instead of crawling through the big block of prompt text.
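To make the idea concrete, here is a minimal sketch in plain Python of what this step does conceptually. In the real workflow the substitution happens through n8n expressions, not Python. The prompt excerpt and all values (`example-namespace`, `example-org`, the datasource UIDs) are placeholders invented for illustration, and the function names are hypothetical.

```python
from string import Template


def build_vars(request: dict) -> dict:
    """Collect everything the system prompt and the Slack reply step need."""
    return {
        # Placeholder values: substitute your own clusters, namespace, org, UIDs.
        "K8S_CLUSTERS": "dev, prod",
        "K8S_NAMESPACE": "example-namespace",
        "GITHUB_ORG": "example-org",
        "prometheus_uid": "example-prometheus-uid",
        "loki_uid": "example-loki-uid",
        # Reply into the thread root if the request came from a thread,
        # otherwise into the original request post.
        "reply_root_id": (
            request["thread_root_id"] if request.get("is_thread") else request["post_id"]
        ),
    }


# Hypothetical excerpt of a system prompt; the real one is much longer.
SYSTEM_PROMPT = Template(
    "Available clusters: $K8S_CLUSTERS. Main namespace: $K8S_NAMESPACE.\n"
    "Search code only in the GitHub organization $GITHUB_ORG.\n"
    "Grafana datasources: Prometheus uid=$prometheus_uid, Loki uid=$loki_uid."
)


def render_system_prompt(variables: dict) -> str:
    # substitute() raises KeyError when a variable is missing, so a typo in
    # the variable set fails loudly instead of leaking "$NAME" into the prompt.
    return SYSTEM_PROMPT.substitute(variables)
```

The point of the design is the same as in the n8n version: the prompt text stays fixed, and environment changes touch only the variable set.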

Querying the agent

The user prompt is built with the same template as the CI/CD assistant:

Investigate incident from {{ $json.user_name }}
in channel {{ $json.channel_name }}{{ $json.is_thread ? ' (message in thread, thread_ts=' + $json.thread_root_id + ' — read the history through Slack conversations.replies first)' : '' }}

{{ $json.message }}{{ $json.attachments_context && $json.attachments_context.trim().length > 0 ? '\n\nAdditional information from attachments:\n' + $json.attachments_context : '' }}

Three blocks inside it: the original message, an explicit instruction to read the thread history (if the request came from a thread), and context from attachments if any were present.
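For readers who want to unit-test this logic outside n8n, the same three-block structure can be sketched in Python. This is an illustrative equivalent, not part of the published workflow: `build_user_prompt` is a hypothetical name, and the wording of the thread hint is lightly simplified.

```python
def build_user_prompt(item: dict) -> str:
    """Assemble the prompt: header, optional thread hint, message, optional attachments."""
    thread_hint = ""
    if item.get("is_thread"):
        # Only requests that came from a thread get the "read history first" hint.
        thread_hint = (
            " (message in thread, thread_ts=" + str(item["thread_root_id"])
            + "; read the history through Slack conversations.replies first)"
        )

    prompt = (
        f"Investigate incident from {item['user_name']}\n"
        f"in channel {item['channel_name']}{thread_hint}\n\n"
        f"{item['message']}"
    )

    # Attachment context is appended only when it contains non-whitespace text.
    attachments = item.get("attachments_context") or ""
    if attachments.strip():
        prompt += "\n\nAdditional information from attachments:\n" + attachments
    return prompt
```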

I use GPT-5.5 from OpenAI as the model. On this task, Sonnet and Gemini show comparable quality — the choice is more out of habit. What actually moves the needle isn't the model but the completeness of the system prompt and the toolset.

Tools the agent has access to

For the agent to make sense of an incident, it has to see the infrastructure through an engineer's eyes. When our on-call goes through a breakdown, the usual path is: "what do the alerts say → what's in the service logs → what's in the metrics → what releases went out → what's in the IaC." A corresponding MCP tool is wired up for each step:

Kubernetes MCP — look at pods, events, read container logs. The system prompt explicitly says: don't call pods_log for the same pod more than twice. Without that constraint, the agent loves to loop trying to "double-check just one more time."
Grafana MCP — queries into Prometheus (query_prometheus) and Loki (query_loki_logs). Same tool also covers dashboard search, in case the agent wants to drop a link to a ready-made panel into the reply.
DigitalOcean MCP — needed when the incident touches the infra layer: App Platform, droplets, the DOKS cluster, load balancers.
GitHub MCP — look at the latest commits, Actions runs, open PRs. Especially useful for the "everything broke right after deploy" scenario — and these scenarios, like for everyone else, aren't rare.
Slack MCP — read-only. Mostly used for conversations.replies at the very start of the investigation. We don't trust the agent with sending the reply itself — that's done by a separate node after the output is received. If something goes wrong at the posting step, the thread just stays without a final reply, but the execution doesn't crash.
Qdrant Vector Store — knowledge base on our infrastructure: component descriptions, service relationships, naming conventions, useful labels. Used when the agent runs into the name of an unfamiliar service and wants to figure out where it lives and what it talks to.

And as a separate line item — prometheusAlertSearch. This is a custom sub-workflow that the agent reaches for almost always within the first 2–3 steps of an investigation. It's worth describing in more detail.

Alert tool: prometheusAlertSearch

The idea is simple: most developer complaints essentially mirror an alert that is already firing in Prometheus. "Postgres is kinda slow" usually arrives at the exact moment when PostgresHighLatency has already been firing for ten minutes. It makes sense to first check whether the cause is sitting right on the surface, and only then go digging through logs.

The sub-workflow accepts keywords and a match mode:

{
  "keywords": ["postgres", "pgbouncer"],
  "match_mode": "any"
}

Inside, it queries Prometheus's /api/v1/alerts and, among the firing alerts, picks the ones where the keywords appear in the name, labels, or annotations. match_mode: "any" (default) is an OR across the keywords; "all" is AND.

It connects to the agent through the toolWorkflow node and shows up as a regular tool next to the MCP ones. The description is critically important, because without it the agent doesn't understand when and how to use the tool:

Search currently firing alerts in Prometheus by keywords. Use early in
investigation to find correlated active alerts across the platform.

Input:

keywords (array of strings, required): lowercase substrings matched against alert names, all label keys/values, and all annotation values. Examples: ["postgres"], ["http","5xx"], ["kafka","redpanda"].
match_mode (string, optional): "any" (default, OR) or "all" (AND).

Returns: { ok, total_firing, matched_count, returned_count, truncated,
alerts:[...] }
Each alert has alertname, severity, labels, summary, description,
activeAt, value. Output capped at 25 alerts — if truncated=true,
refine keywords.

Without an explicit hint to call it early, within the first 2–3 steps, the agent likes to first wander into logs, then into cluster events, then somewhere else, and only at the end remember about alerts. A hard hint in the tool description visibly changes the behavior.
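The filtering logic described in this section can be sketched in Python as follows. This is an assumption-laden illustration, not the published sub-workflow: the function name `search_firing_alerts` is hypothetical, and the input is assumed to be the already-fetched JSON body of Prometheus's `/api/v1/alerts` endpoint, where alerts sit under `data.alerts` with `labels`, `annotations`, `state`, `activeAt`, and `value`.

```python
from __future__ import annotations

MAX_ALERTS = 25  # output cap mentioned in the tool description


def search_firing_alerts(payload: dict, keywords: list[str], match_mode: str = "any") -> dict:
    """Filter a Prometheus /api/v1/alerts response body by keywords."""
    needles = [k.lower() for k in keywords if k.strip()]
    if not needles or match_mode not in ("any", "all"):
        return {"ok": False, "error": "keywords must be non-empty; match_mode is 'any' or 'all'"}

    alerts = payload.get("data", {}).get("alerts", [])
    firing = [a for a in alerts if a.get("state") == "firing"]
    combine = any if match_mode == "any" else all  # OR vs AND across keywords

    matched = []
    for alert in firing:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Haystack: label keys and values (alertname is a label) plus annotation values.
        haystack = " ".join([*labels.keys(), *labels.values(), *annotations.values()]).lower()
        if combine(needle in haystack for needle in needles):
            matched.append({
                "alertname": labels.get("alertname"),
                "severity": labels.get("severity"),
                "labels": labels,
                "summary": annotations.get("summary"),
                "description": annotations.get("description"),
                "activeAt": alert.get("activeAt"),
                "value": alert.get("value"),
            })

    returned = matched[:MAX_ALERTS]
    return {
        "ok": True,
        "total_firing": len(firing),
        "matched_count": len(matched),
        "returned_count": len(returned),
        "truncated": len(matched) > len(returned),
        "alerts": returned,
    }
```

Returning both `matched_count` and `truncated` lets the agent tell "nothing matched" apart from "too much matched, refine the keywords".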

Response format

The system prompt defines a strict output format:

What happened

Short description of the incident

Timeline of events

Events from 10 minutes before the problem appeared

Likely causes

At most two hypotheses

What to try

Step-by-step actions for each hypothesis

The structure mirrors the familiar incident-review format, so it is easy to lift into a postmortem without rewriting if the incident turns out to be significant.

Average processing time is under 2 minutes per request end-to-end, and token consumption is 40–50K tokens depending on how deep the agent has to go. Cost-wise, that is peanuts compared to engineer time, especially considering that some of these requests used to require a call rather than just a chat exchange.

Task Manager

The "infrastructure modification" category covers requests like "roll us out a new service," "give us access to Grafana," "add a bucket for analytics." Full automation here is off the table: nearly every such request needs approval, estimation, and just plain human attention. But what definitely can be automated is turning free-form text into a properly formatted ticket.

Previously, the typical scenario looked like this: a developer writes in chat, the on-call reads it, asks clarifying questions, and writes all of it into Jira in their own words. Now the request lands in Jira straight away, and the on-call gets a notification with a ready-made link.

The input is the same structure with fields from the classifier. Then — the familiar sequence: attachment parsing through attachmentsAnalyzer2, variable collection in SetVars, handoff to the LLM agent.

Output format

The distinctive part here is the strictly defined response structure. The agent must return JSON:

{
  "summary": "<area>: <what needs to be done>",
  "description": "<1-3 sentences with details>",
  "label": "<one of the predefined directions>"
}

label is picked from a constrained dictionary: kubernetes, monitoring, network, access, database, ci-cd, and so on. This simplifies further task routing by engineer competence and gathering analytics on "who's working on what and how much of it."

For the agent to write decent summary and description, it has access to:

Qdrant Vector Store — to see how the area the request belongs to is generally structured at our place. This is needed so the agent doesn't invent a new name where an existing one already exists.
Slack MCP — if the request came in a thread, read the backstory. People often clarify details exactly in the thread, while the first message is a single line.
HTTP Request — in case the request contains a link to someone's PR, a Confluence doc, or an external spec.

Validation and task creation

After receiving the JSON, validation runs: we check for required fields and a permitted label value. If something's off, a fallback message goes out ("couldn't formulate the task automatically, please take a look"), and the on-call handles it manually. If everything's fine, the task is filed in Jira through n8n's built-in node.

A short message comes back to Slack with the task title and a link to it. The on-call then sees a ready ticket with the right label and decides: take it on, reassign, or ask for clarifications.
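The validation step could look roughly like the Python sketch below. It is an illustration under assumptions: `validate_task` and the shape of its return value are hypothetical, and `ALLOWED_LABELS` contains only the labels named in this article, while the real dictionary is longer.

```python
import json

REQUIRED_FIELDS = ("summary", "description", "label")
# Only the labels named in the text; the real dictionary has more entries.
ALLOWED_LABELS = {"kubernetes", "monitoring", "network", "access", "database", "ci-cd"}
FALLBACK_MESSAGE = "couldn't formulate the task automatically, please take a look"


def _reject(reason: str) -> dict:
    """Build the result that triggers the fallback message and manual handling."""
    return {"valid": False, "reason": reason, "fallback": FALLBACK_MESSAGE}


def validate_task(raw_output) -> dict:
    """Check the agent's output before anything is created in Jira."""
    try:
        task = json.loads(raw_output)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass
        return _reject("output is not valid JSON")
    if not isinstance(task, dict):
        return _reject("output is not a JSON object")

    # A field counts as missing if it is absent, not a string, or blank.
    missing = [
        field for field in REQUIRED_FIELDS
        if not isinstance(task.get(field), str) or not task[field].strip()
    ]
    if missing:
        return _reject("missing fields: " + ", ".join(missing))
    if task["label"] not in ALLOWED_LABELS:
        return _reject("label not allowed: " + task["label"])

    # Pass on only the known fields so stray keys never reach the ticket.
    return {"valid": True, "task": {field: task[field] for field in REQUIRED_FIELDS}}
```

Checking the label against a closed set matters more than it looks: a model that invents a near-miss label such as `k8s` would otherwise quietly break routing and analytics.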

An important point: automatic task creation doesn't cancel manual validation. Sometimes the classifier gets it wrong and reads an incident as a modification (especially when the author writes in the "we need X to work" style instead of "X is not working"). That's why the system prompt explicitly carries a rule: if it's unclear from the text whether new work is actually needed or whether this is about something existing — ask "could you clarify what exactly needs to be done?" as a separate message, and don't create the task. A cheap measure that noticeably cuts down on junk tickets.

Infrastructure Knowledge Assistant

The last workflow for today is the "calmest" one. The question category catches requests like "where is the connection limit configured in pgbouncer?", "where do alloy logs from droplets go?", "what region is the assets-prod bucket in?". Sometimes they come from new joiners on the team, sometimes from the same DevOps engineers who forgot where things live. (It happens, I won't lie.)

The structure is almost identical to the incident investigator: the same input JSON structure, the same chain with attachment parsing, the same SetVars. Toolset: GitHub MCP, Slack MCP, Kubernetes MCP, Grafana MCP, DigitalOcean MCP, Qdrant Vector Store, and HTTP Request for pulling official component documentation when the required parameter isn't in the code and you have to look up vendor defaults.

I won't dwell on the same nodes — let me tell you about one trick without which the agent would hit a wall pretty quickly.

A skill that writes READMEs in IaC repositories

The initial hypothesis was: give the agent GitHub MCP and a knowledge base in Qdrant — and it'll figure it out. In practice, it turned out that the structure in IaC repositories is almost always non-obvious: somewhere Terragrunt is mixed with Helmfile, somewhere Ansible playbooks live with inventories two subdirectories deep, somewhere Terraform modules are laid out under names that only make sense to us. The agent burned a ton of tokens and time just to figure out where to look in the first place.

The solution came from the Claude Code Skills format: I wrote a separate skill that runs locally in the IDE and generates/updates the README in every infrastructure repo. The skill reads the directory structure, identifies entry points, and describes them in a unified format. The output looks like this:

# infra

Main IaC repository: terragrunt code for cloud resources,
Ansible inventories and playbooks for VMs, Helm releases and
manifests for k8s.

## Directory structure

- `terraform/` — terragrunt code, organized by cloud and region:
  - `do/fra1/` — DigitalOcean Frankfurt resources (DOKS clusters,
    buckets, load balancers, droplets for DBs).
  - `yc/ru-central1/` — Yandex Cloud resources.
- `ansible/` — playbooks and inventories for Droplet management:
  - `playbooks/` — entry points (`postgres.yml`, `redpanda.yml`, ...).
  - `inventories/{stage,prod}/` — per-environment inventories with
    `group_vars/` next to them.
- `helm/` — Helmfile releases of infrastructure components in k8s
  (ingress-nginx, cert-manager, kube-prometheus-stack, etc.).
- `manifests/<cluster>/` — manifests applied on top of Helm releases:
  alerts, ServiceMonitors, standalone CRDs.

## Naming conventions

- VM: `<project>-<env>-<kind>-[<purpose>]-<index>.example.com`
- clusters: `<env>-<region>-01` (e.g. `prod-fra1-01`)
- buckets: `s3-<project>-<purpose>-<env>` (e.g. `s3-project-assets-prod`)

From the agent's perspective, this changes everything: the first thing it does is read README.md through get_file_contents, understand the structure, and then go into the right subdirectory for a specific file. The number of GitHub MCP calls dropped roughly 3×, and answer quality went up noticeably — especially on questions of the "where is X configured" type.

I'll publish the skill itself in the repo — it's simple and easy to adapt to someone else's structure.
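Until then, here is a rough Python sketch of just the deterministic part of the idea: walking a repository and emitting a directory outline in the same bullet format. It is not the author's skill, which runs as a Claude Code Skill and also writes the descriptions. The function name `outline_directories`, the depth limit, and the `TODO describe` placeholder are all assumptions made for illustration.

```python
from pathlib import Path


def outline_directories(root: Path, max_depth: int = 2) -> str:
    """Render the directory tree under root as nested Markdown bullets."""
    lines = []

    def walk(directory: Path, depth: int) -> None:
        if depth > max_depth:
            return
        # Sorted for stable output; hidden directories such as .git are skipped.
        children = sorted(
            p for p in directory.iterdir()
            if p.is_dir() and not p.name.startswith(".")
        )
        for child in children:
            indent = "  " * (depth - 1)
            # The description is a placeholder for a human or an LLM to fill in.
            lines.append(f"{indent}- `{child.name}/`: TODO describe")
            walk(child, depth + 1)

    walk(root, 1)
    return "## Directory structure\n\n" + "\n".join(lines)
```

Calling `outline_directories(Path("."))` from a repository root yields a skeleton that can be pasted into the README and then annotated.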

What we got in total

After all three branches rolled into production and lived through a couple of months of real traffic:

Incident investigator — closes about a quarter of requests fully; in the rest, the on-call gets a ready-made breakdown with hypotheses and saves 10–15 minutes at the start.
Task manager — practically every modification request makes it into Jira with a meaningful summary and the right label set; manual fixes are needed rarely, usually around the description wording.
Knowledge assistant — closes roughly half of the questions without an engineer; for the rest, the agent's answer is still useful as a starting point for the on-call.

Combined LLM spend at our request volume stays around $250/month. Considering the system works 24/7 and doesn't take vacations, that is laughably little money.

What hasn't worked out yet

To avoid creating the illusion that everything's smooth — a few rough edges we're still living with:

Objective quality metrics — for now, the only feedback channel is occasional comments along the lines of "no, the agent guessed wrong here." I want to wire up more structured feedback — for example, through emoji reactions in Slack with automatic stats collection.
Long incident threads — the agent reliably gets lost when a thread already has 30+ messages. Right now, I cap the context depth hard in the prompt, but it's a tradeoff: sometimes important context from the start gets dropped.
The "security" category — it doesn't exist in the classifier, but requests like "does this comply with GDPR?" come in periodically. For now, I shove them into question, but answer quality there is shaky.

Plans for the future

What's in the queue for the coming months:

Expand the agent toolset to "actions, not just reads" — carefully give access to restarting pods, applying pre-vetted manifests, restarting systemd services. Obviously, this is minefield territory, so it'll be done through explicit engineer confirmation in Slack — without confirmation, no changes get applied.
Add handling for maintenance announcements — for now, such messages just get tagged and ignored, but they could be pushed into a separate digest channel with an automatic "what's planned for this week" summary.
Wire up a dedicated branch for security questions — with its own knowledge base on our compliance docs and security policy.
Add analytics over processed requests: which categories are growing, what the autonomous resolution rate is in each, and how many tokens each category consumes. Without numbers, it's hard to understand what should actually be improved first.

Wrap-up

In short: AI agents driven by n8n with MCP tools turned out to be a very workable way to offload a significant chunk of routine work from the on-call engineer. Not a silver bullet — but something close to a modest, hardworking intern who works around the clock and costs about as much as a couple of lunches.

The key is not to try automating everything at once. It is better to ship one branch, live closely with it, and understand its weak spots, and only then extend the approach to the other categories. Going from my first working CI/CD assistant to the full set of branches took about three months, and I have no regrets about that pacing.

Link to the repo with workflows: https://github.com/javdet/automagicops-workflows
Which category of requests automates best in your setup? And have you ever had the situation where an AI agent confidently nailed the wrong diagnosis and led the engineer down the wrong path? Share in the comments — especially curious to compare where everyone’s stepping on rakes.
