Mychel Garzon
Stop Being the Human Incident Router: AI-Powered Alert Triage with n8n

Your monitoring sends 50 notifications a day.

Maybe 3 are actually urgent. The rest? Noise.

So you start ignoring them. Then something customer-facing breaks and you miss it because it was buried under 47 low-priority alerts.

Alert fatigue is killing your response time.

I built an n8n workflow that fixes this. Two AI agents read your runbooks, validate severity, and enforce SLAs automatically.


How It Works

Agent 1 (Analyzer) validates every alert:

  • Checks against your runbook database
  • Looks for customer impact signals
  • Assigns confidence-scored severity

Agent 2 (Response Planner) builds the action plan:

  • What to do first
  • Who to notify
  • When to escalate

Then SLA enforcement runs autonomously. P1 gets 15 minutes. P2 gets 60. Nobody responds? Auto-escalates to management.

No manual checking. No human bottleneck.
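The two-agent split can be sketched in plain Python. The functions below are deterministic stand-ins for the LLM agents (the real workflow sends prompts to Gemini, with Groq as fallback), and the runbook lookup is simplified to a dict:

```python
# Hypothetical sketch of the two-agent pipeline; analyze() and plan()
# stand in for the n8n LLM nodes described above.

def analyze(alert: dict, runbook: dict) -> dict:
    """Agent 1: validate severity against the runbook and impact signals."""
    entry = runbook.get(alert["service"], {})
    customer_impact = "503" in alert.get("description", "")
    severity = entry.get("severity", alert["severity"]) if customer_impact else alert["severity"]
    return {"severity": severity, "customer_impact": customer_impact, "confidence": 0.87}

def plan(validated: dict) -> dict:
    """Agent 2: turn validated severity into an action plan with an SLA."""
    sla_minutes = {"P1": 15, "P2": 60, "P3": 240}[validated["severity"]]
    return {"notify": "#incidents", "sla_minutes": sla_minutes}

alert = {"service": "user-service", "severity": "P3",
         "description": "user-service reporting 503 errors"}
runbook = {"user-service": {"severity": "P2"}}
print(plan(analyze(alert, runbook)))  # {'notify': '#incidents', 'sla_minutes': 60}
```

Each function has one job, which is exactly why the two-agent design produces more consistent outputs than one agent doing everything.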


The Stack

  • Primary LLM: Gemini 2.0
  • Fallback: Groq (when Gemini fails)
  • Storage: Google Sheets
  • Alerts: Slack + Gmail

When Gemini goes down, the workflow automatically switches to Groq. Each agent gets 3 retry attempts with 5-second intervals.

In practice, at least one of the two providers is almost always available.


Real Example

Monitoring sends this:

```json
{
  "title": "DB Connection Pool Exhausted",
  "description": "user-service reporting 503 errors",
  "severity": "P3"
}
```

Agent 1 reasoning:

  • Finds runbook entry: "Connection pool exhaustion = P2 if customer-facing"
  • Detects "503 errors" = customer impact
  • Service check: user-service is customer-facing
  • Decision: Override P3 → P2 (confidence: 0.87)
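Agent 1's reasoning comes back as structured JSON. The exact schema below is illustrative, not the template's literal output:

```json
{
  "original_severity": "P3",
  "validated_severity": "P2",
  "customer_impact": true,
  "signals": ["503 errors", "user-service is customer-facing"],
  "confidence": 0.87
}
```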

Agent 2 action plan:

  • Check active DB connections
  • Restart service if pool >90%
  • Notify #incidents channel
  • Start 60-minute SLA timer

What happens next (automatically):

  1. Slack alert posts to #incidents
  2. Timer starts
  3. Workflow waits, then checks Google Sheets
  4. Still empty after 60 min? Escalates to #engineering-leads with "SLA BREACH"
  5. Everything logged to audit trail

Why This Works

Uses your runbooks, not generic templates
The workflow reads your Google Sheets runbook database. It knows your systems.

Stops false alarms
That "P1 URGENT" email from marketing? Gets downgraded automatically.

Multi-LLM fallback = reliability
Primary fails? Fallback takes over. No manual intervention.

SLAs enforce themselves
Timers run autonomously. Management gets paged if nobody responds.

Complete audit trail
Every decision logged. Perfect for post-mortems.


The Fallback Pattern

```
1. Try Gemini (primary)
2. Error? Wait 5 seconds
3. Retry Gemini (attempt 2)
4. Error? Wait 5 seconds
5. Retry Gemini (attempt 3)
6. Still failing? Switch to Groq
7. Groq gets same 3-retry pattern
```

Six total attempts across two independent providers: both Gemini and Groq would have to be down at the same time for an alert to go unprocessed.
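The same pattern in Python. `call_gemini` and `call_groq` are hypothetical callables that raise on failure; the retry count and delay mirror the n8n settings above:

```python
import time

def with_retries(fn, attempts=3, delay=5):
    """Call fn up to `attempts` times, sleeping `delay` seconds between tries."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            if i < attempts - 1:
                time.sleep(delay)
    raise last_err

def run_agent(prompt, call_gemini, call_groq, attempts=3, delay=5):
    """Exhaust retries on the primary provider before falling back."""
    try:
        return with_retries(lambda: call_gemini(prompt), attempts, delay)
    except Exception:
        return with_retries(lambda: call_groq(prompt), attempts, delay)
```

If both providers exhaust their retries, the final exception propagates, so a total outage still surfaces instead of failing silently.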


Two Agents vs One

Why split the work?

One agent doing everything (analyze + plan + format) = inconsistent outputs.

Two specialized agents = better at their specific jobs.

Agent 1: Incident Analyzer

```
You are an incident severity analyzer.
Given this alert and runbook, determine:
1. Is the reported severity accurate?
2. What signals indicate customer impact?
3. What's your confidence score?

Output JSON only.
```

Agent 2: Response Coordinator

```
You are an incident response planner.
Given validated severity, determine:
1. What immediate actions to take?
2. Who to notify?
3. What's the SLA target?

Output JSON only.
```

Clean separation. One job each.


Google Sheets Setup

The workflow needs three sheets:

Runbooks:
| Service | Known Issue | Severity | Impact | Contact |
|---------|-------------|----------|--------|---------|
| user-service | Connection pool exhausted | P2 | High | database-team |

Incidents:
| ID | Service | Severity | Acknowledged By | Status |
|----|---------|----------|----------------|--------|
| INC-001 | user-service | P2 | john@example.com | Met |

AI_Audit_Log:
| Timestamp | ID | Agent | Decision | Confidence |
|-----------|----|----|----------|-----------|
| 2026-03-26 14:30 | INC-001 | Analyzer | P3→P2 | 0.87 |


Setup

What you need:

  • Google Gemini API (free tier works)
  • Groq API (also free tier)
  • Google Sheets
  • Slack OAuth2
  • Gmail

Time: 30-45 minutes

Steps:

  1. Clone the n8n template
  2. Add API credentials
  3. Create Google Sheets structure
  4. Configure Slack channels
  5. Point webhook at your monitoring

SLA Enforcement Logic

```
1. Incident arrives → severity determined
2. SLA timer starts:
   P1: 15 min | P2: 60 min | P3: 4 hours
3. Workflow waits
4. Checks Google Sheets "Acknowledged By"
5. Empty? Escalate:
   P1 → page management + war room
   P2 → alert engineering leads
   P3 → remind team
6. Log SLA breach
7. Keep checking until acknowledged
```

No cron jobs. No external schedulers. Workflow handles timing internally.


Get the Template

Grab it here: https://n8n.partnerlinks.io/incident-triage-linkedin

Includes:

  • Two-agent setup
  • Multi-LLM fallback
  • SLA automation
  • Google Sheets logging
  • Slack integration

Deploy it. Point it at your runbooks. Stop being the human incident router.
