Your monitoring sends 50 notifications a day.
Maybe 3 are actually urgent. The rest? Noise.
So you start ignoring them. Then something customer-facing breaks and you miss it because it was buried under 47 low-priority alerts.
Alert fatigue is killing your response time.
I built an n8n workflow that fixes this. Two AI agents read your runbooks, validate severity, and enforce SLAs automatically.
How It Works
Agent 1 (Analyzer) validates every alert:
- Checks against your runbook database
- Looks for customer impact signals
- Assigns confidence-scored severity
Agent 2 (Response Planner) builds the action plan:
- What to do first
- Who to notify
- When to escalate
Then SLA enforcement runs autonomously. P1 gets 15 minutes. P2 gets 60. Nobody responds? Auto-escalates to management.
No manual checking. No human bottleneck.
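The two-agent handoff can be sketched as plain functions. This is a pure-Python stand-in for the n8n AI Agent nodes; the field names and the "503" impact check are illustrative assumptions, not the template's actual schema:

```python
# Minimal sketch of the two-agent pipeline. All names are illustrative;
# in n8n each function would be a separate AI Agent node.

SLA_MINUTES = {"P1": 15, "P2": 60, "P3": 240}

def analyze(alert, runbooks):
    """Agent 1: validate reported severity against the runbook database."""
    entry = runbooks.get(alert["service"])
    customer_impact = "503" in alert["description"]
    severity = alert["severity"]
    if entry and customer_impact:
        severity = entry["severity"]  # runbook overrides the reported level
    return {"severity": severity, "customer_impact": customer_impact}

def plan(validated):
    """Agent 2: turn the validated severity into a response plan."""
    return {"notify": "#incidents", "sla_minutes": SLA_MINUTES[validated["severity"]]}
```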
The Stack
- Primary LLM: Gemini 2.0
- Fallback: Groq (when Gemini fails)
- Storage: Google Sheets
- Alerts: Slack + Gmail
When Gemini goes down, the workflow automatically switches to Groq. Each agent gets 3 retry attempts with 5-second intervals.
In practice it only stalls if both providers are down at the same time.
Real Example
Monitoring sends this:
```json
{
  "title": "DB Connection Pool Exhausted",
  "description": "user-service reporting 503 errors",
  "severity": "P3"
}
```
Agent 1 reasoning:
- Finds runbook entry: "Connection pool exhaustion = P2 if customer-facing"
- Detects "503 errors" = customer impact
- Service check: user-service is customer-facing
- Decision: Override P3 → P2 (confidence: 0.87)
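Agent 1's override decision boils down to a check like this. The keyword list and confidence scoring here are my assumptions, not the template's prompt logic:

```python
# Illustrative impact-signal check behind the P3 -> P2 override.
# Keywords and scoring formula are assumptions for this sketch.

IMPACT_SIGNALS = ("503", "timeout", "customer", "checkout")

def should_override(description, runbook_severity, reported_severity):
    """Return (severity, confidence) after checking for customer impact."""
    hits = [s for s in IMPACT_SIGNALS if s in description.lower()]
    # "P2" < "P3" holds lexicographically, so this compares severity levels
    if hits and runbook_severity < reported_severity:
        confidence = min(0.6 + 0.27 * len(hits), 0.95)
        return runbook_severity, round(confidence, 2)
    return reported_severity, 1.0
```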
Agent 2 action plan:
- Check active DB connections
- Restart service if pool >90%
- Notify #incidents channel
- Start 60-minute SLA timer
What happens next (automatically):
- Slack alert posts to #incidents
- Timer starts
- Workflow waits, then checks Google Sheets
- Still empty after 60 min? Escalates to #engineering-leads with "SLA BREACH"
- Everything logged to audit trail
Why This Works
Uses your runbooks, not generic templates
The workflow reads your Google Sheets runbook database. It knows your systems.
Stops false alarms
That "P1 URGENT" email from marketing? Gets downgraded automatically.
Multi-LLM fallback = reliability
Primary fails? Fallback takes over. No manual intervention.
SLAs enforce themselves
Timers run autonomously. Management gets paged if nobody responds.
Complete audit trail
Every decision logged. Perfect for post-mortems.
The Fallback Pattern
1. Try Gemini (primary)
2. Error? Wait 5 seconds
3. Retry Gemini (attempt 2)
4. Error? Wait 5 seconds
5. Retry Gemini (attempt 3)
6. Still failing? Switch to Groq
7. Groq gets same 3-retry pattern
Six total attempts across two independent providers: the triage step only fails if Gemini and Groq are down at the same time.
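The seven steps above reduce to a small retry-then-fallback loop. This is a generic sketch, not the n8n node configuration; `call_gemini` and `call_groq` are hypothetical provider wrappers you'd supply:

```python
import time

def call_with_fallback(providers, prompt, retries=3, delay=5):
    """Try each provider in order, giving each `retries` attempts.

    `providers` is an ordered list of callables, e.g. [call_gemini, call_groq]
    (both hypothetical here). Raises only after every attempt has failed.
    """
    last_err = None
    for call in providers:
        for _ in range(retries):
            try:
                return call(prompt)
            except Exception as err:
                last_err = err
                time.sleep(delay)  # back off before the next attempt
    raise RuntimeError("all providers failed") from last_err
```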
Two Agents vs One
Why split the work?
One agent doing everything (analyze + plan + format) = inconsistent outputs.
Two specialized agents = better at their specific jobs.
Agent 1: Incident Analyzer
```
You are an incident severity analyzer.
Given this alert and runbook, determine:
1. Is the reported severity accurate?
2. What signals indicate customer impact?
3. What's your confidence score?
Output JSON only.
```
Agent 2: Response Coordinator
```
You are an incident response planner.
Given validated severity, determine:
1. What immediate actions to take?
2. Who to notify?
3. What's the SLA target?
Output JSON only.
```
Clean separation. One job each.
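For reference, Agent 1's "JSON only" output might look like this. The field names are my assumption; the template defines its own schema:

```json
{
  "validated_severity": "P2",
  "original_severity": "P3",
  "customer_impact": true,
  "signals": ["503 errors", "customer-facing service"],
  "confidence": 0.87
}
```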
Google Sheets Setup
The workflow needs three sheets:
Runbooks:
| Service | Known Issue | Severity | Impact | Contact |
|---------|-------------|----------|--------|---------|
| user-service | Connection pool exhausted | P2 | High | database-team |
Incidents:
| ID | Service | Severity | Acknowledged By | Status |
|----|---------|----------|----------------|--------|
| INC-001 | user-service | P2 | john@example.com | Met |
AI_Audit_Log:
| Timestamp | ID | Agent | Decision | Confidence |
|-----------|----|----|----------|-----------|
| 2026-03-26 14:30 | INC-001 | Analyzer | P3→P2 | 0.87 |
Setup
What you need:
- Google Gemini API (free tier works)
- Groq API (also free tier)
- Google Sheets
- Slack OAuth2
- Gmail
Time: 30-45 minutes
Steps:
1. Clone the n8n template
2. Add API credentials
3. Create Google Sheets structure
4. Configure Slack channels
5. Point webhook at your monitoring
SLA Enforcement Logic
1. Incident arrives → severity determined
2. SLA timer starts:
P1: 15 min | P2: 60 min | P3: 4 hours
3. Workflow waits
4. Checks Google Sheets "Acknowledged By"
5. Empty? Escalate:
P1 → page management + war room
P2 → alert engineering leads
P3 → remind team
6. Log SLA breach
7. Keep checking until acknowledged
No cron jobs. No external schedulers. Workflow handles timing internally.
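One pass of that enforcement loop can be sketched like this. In n8n it's a Wait node plus a Google Sheets lookup; here the sheet lookup is reduced to an `acknowledged_by` value, and the re-run-until-acknowledged part is left to the workflow:

```python
# One polling pass of the SLA enforcement loop. The real template
# re-runs this via an n8n Wait node until someone acknowledges.

SLA_MINUTES = {"P1": 15, "P2": 60, "P3": 240}
ESCALATION = {
    "P1": "page management + war room",
    "P2": "alert engineering leads",
    "P3": "remind team",
}

def sla_check(severity, elapsed_minutes, acknowledged_by):
    """Decide what to do on this pass: stop, escalate, or keep waiting."""
    if acknowledged_by:          # "Acknowledged By" cell is filled in
        return "done"
    if elapsed_minutes >= SLA_MINUTES[severity]:
        return f"SLA BREACH: {ESCALATION[severity]}"
    return "wait"
```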
Get the Template
Grab it here: https://n8n.partnerlinks.io/incident-triage-linkedin
Includes:
- Two-agent setup
- Multi-LLM fallback
- SLA automation
- Google Sheets logging
- Slack integration
Deploy it. Point it at your runbooks. Stop being the human incident router.