Mychel Garzon
Stop Being the Human Incident Router: AI-Powered Alert Triage with n8n

Your monitoring sends 50 notifications a day.

Maybe 3 are actually urgent. The rest? Noise.

So you start ignoring them. Then something customer-facing breaks and you miss it because it was buried under 47 low-priority alerts.

Alert fatigue is killing your response time.

I built an n8n workflow that fixes this. Two AI agents read your runbooks, validate severity, and enforce SLAs automatically.


How It Works

Agent 1 (Analyzer) validates every alert:

  • Checks against your runbook database
  • Looks for customer impact signals
  • Assigns confidence-scored severity

Agent 2 (Response Planner) builds the action plan:

  • What to do first
  • Who to notify
  • When to escalate

Then SLA enforcement runs autonomously. P1 gets 15 minutes. P2 gets 60. Nobody responds? Auto-escalates to management.

No manual checking. No human bottleneck.
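The two-agent split can be sketched in plain Python. The functions below are deterministic stand-ins for the LLM agents (the real workflow sends prompts to Gemini, with Groq as fallback), and the runbook lookup is simplified to a dict:

```python
# Hypothetical sketch of the two-agent pipeline; analyze() and plan()
# stand in for the n8n LLM nodes described above.

def analyze(alert: dict, runbook: dict) -> dict:
    """Agent 1: validate severity against the runbook and impact signals."""
    entry = runbook.get(alert["service"], {})
    customer_impact = "503" in alert.get("description", "")
    severity = entry.get("severity", alert["severity"]) if customer_impact else alert["severity"]
    return {"severity": severity, "customer_impact": customer_impact, "confidence": 0.87}

def plan(validated: dict) -> dict:
    """Agent 2: turn validated severity into an action plan with an SLA."""
    sla_minutes = {"P1": 15, "P2": 60, "P3": 240}[validated["severity"]]
    return {"notify": "#incidents", "sla_minutes": sla_minutes}

alert = {"service": "user-service", "severity": "P3",
         "description": "user-service reporting 503 errors"}
runbook = {"user-service": {"severity": "P2"}}
print(plan(analyze(alert, runbook)))  # {'notify': '#incidents', 'sla_minutes': 60}
```

Each function has one job, which is exactly why the two-agent design produces more consistent outputs than one agent doing everything.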


The Stack

  • Primary LLM: Gemini 2.0
  • Fallback: Groq (when Gemini fails)
  • Storage: Google Sheets
  • Alerts: Slack + Gmail

When Gemini goes down, the workflow automatically switches to Groq. Each agent gets 3 retry attempts with 5-second intervals.

In practice, at least one of the two providers is almost always available.


Real Example

Monitoring sends this:

```json
{
  "title": "DB Connection Pool Exhausted",
  "description": "user-service reporting 503 errors",
  "severity": "P3"
}
```

Agent 1 reasoning:

  • Finds runbook entry: "Connection pool exhaustion = P2 if customer-facing"
  • Detects "503 errors" = customer impact
  • Service check: user-service is customer-facing
  • Decision: Override P3 → P2 (confidence: 0.87)
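Agent 1's reasoning comes back as structured JSON. The exact schema below is illustrative, not the template's literal output:

```json
{
  "original_severity": "P3",
  "validated_severity": "P2",
  "customer_impact": true,
  "signals": ["503 errors", "user-service is customer-facing"],
  "confidence": 0.87
}
```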

Agent 2 action plan:

  • Check active DB connections
  • Restart service if pool >90%
  • Notify #incidents channel
  • Start 60-minute SLA timer

What happens next (automatically):

  1. Slack alert posts to #incidents
  2. Timer starts
  3. Workflow waits, then checks Google Sheets
  4. Still empty after 60 min? Escalates to #engineering-leads with "SLA BREACH"
  5. Everything logged to audit trail

Why This Works

Uses your runbooks, not generic templates
The workflow reads your Google Sheets runbook database. It knows your systems.

Stops false alarms
That "P1 URGENT" email from marketing? Gets downgraded automatically.

Multi-LLM fallback = reliability
Primary fails? Fallback takes over. No manual intervention.

SLAs enforce themselves
Timers run autonomously. Management gets paged if nobody responds.

Complete audit trail
Every decision logged. Perfect for post-mortems.


The Fallback Pattern

```
1. Try Gemini (primary)
2. Error? Wait 5 seconds
3. Retry Gemini (attempt 2)
4. Error? Wait 5 seconds
5. Retry Gemini (attempt 3)
6. Still failing? Switch to Groq
7. Groq gets same 3-retry pattern
```

Six total attempts across two independent providers: both Gemini and Groq would have to be down at the same time for an alert to go unprocessed.
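The same pattern in Python. `call_gemini` and `call_groq` are hypothetical callables that raise on failure; the retry count and delay mirror the n8n settings above:

```python
import time

def with_retries(fn, attempts=3, delay=5):
    """Call fn up to `attempts` times, sleeping `delay` seconds between tries."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            if i < attempts - 1:
                time.sleep(delay)
    raise last_err

def run_agent(prompt, call_gemini, call_groq, attempts=3, delay=5):
    """Exhaust retries on the primary provider before falling back."""
    try:
        return with_retries(lambda: call_gemini(prompt), attempts, delay)
    except Exception:
        return with_retries(lambda: call_groq(prompt), attempts, delay)
```

If both providers exhaust their retries, the final exception propagates, so a total outage still surfaces instead of failing silently.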


Two Agents vs One

Why split the work?

One agent doing everything (analyze + plan + format) = inconsistent outputs.

Two specialized agents = better at their specific jobs.

Agent 1: Incident Analyzer

```
You are an incident severity analyzer.
Given this alert and runbook, determine:
1. Is the reported severity accurate?
2. What signals indicate customer impact?
3. What's your confidence score?

Output JSON only.
```

Agent 2: Response Coordinator

```
You are an incident response planner.
Given validated severity, determine:
1. What immediate actions to take?
2. Who to notify?
3. What's the SLA target?

Output JSON only.
```

Clean separation. One job each.


Google Sheets Setup

The workflow needs three sheets:

Runbooks:
| Service | Known Issue | Severity | Impact | Contact |
|---------|-------------|----------|--------|---------|
| user-service | Connection pool exhausted | P2 | High | database-team |

Incidents:
| ID | Service | Severity | Acknowledged By | Status |
|----|---------|----------|----------------|--------|
| INC-001 | user-service | P2 | john@example.com | Met |

AI_Audit_Log:
| Timestamp | ID | Agent | Decision | Confidence |
|-----------|----|----|----------|-----------|
| 2026-03-26 14:30 | INC-001 | Analyzer | P3→P2 | 0.87 |


Setup

What you need:

  • Google Gemini API (free tier works)
  • Groq API (also free tier)
  • Google Sheets
  • Slack OAuth2
  • Gmail

Time: 30-45 minutes

Steps:

  1. Clone the n8n template
  2. Add API credentials
  3. Create Google Sheets structure
  4. Configure Slack channels
  5. Point webhook at your monitoring

SLA Enforcement Logic

```
1. Incident arrives → severity determined
2. SLA timer starts:
   P1: 15 min | P2: 60 min | P3: 4 hours
3. Workflow waits
4. Checks Google Sheets "Acknowledged By"
5. Empty? Escalate:
   P1 → page management + war room
   P2 → alert engineering leads
   P3 → remind team
6. Log SLA breach
7. Keep checking until acknowledged
```

No cron jobs. No external schedulers. Workflow handles timing internally.


Get the Template

Grab it here: https://n8n.partnerlinks.io/incident-triage-linkedin

Includes:

  • Two-agent setup
  • Multi-LLM fallback
  • SLA automation
  • Google Sheets logging
  • Slack integration

Deploy it. Point it at your runbooks. Stop being the human incident router.
