Automated Incident Response Workflows with n8n and Monitoring Tools

Most teams face the same challenge: alerts either go to everyone (causing fatigue) or get missed entirely (causing outages). The solution? Intelligent routing based on severity, business hours, and context.

Building a Smart Incident Response Workflow

We'll create an n8n workflow that connects Prometheus alerts to intelligent response actions. Here's what our system will do:

Analyze alert severity and timing
Route critical after-hours alerts to PagerDuty
Send routine alerts to Discord/Slack
Attempt automated resolution for common issues
Document everything for post-incident analysis

Step 1: Set Up Your Monitoring Stack

First, configure Prometheus and Alertmanager to send alerts to n8n:

# Alertmanager configuration (alertmanager.yml)
route:
  receiver: 'n8n-webhook'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 30m

receivers:
- name: 'n8n-webhook'
  webhook_configs:
  - url: 'http://your-n8n-instance:5678/webhook/prometheus'
    send_resolved: true
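
Before real alerts fire, it helps to test the webhook with a payload shaped like the one Alertmanager posts. The sketch below builds such a payload in the version 4 webhook format, limited to the fields this workflow reads plus the grouping metadata. `buildTestPayload` is a hypothetical helper, and the alert name, instance, description, and example URLs are placeholders, not values from a real system.

```javascript
// Hypothetical helper: builds a payload shaped like Alertmanager's webhook format (version "4").
// All label, annotation, and URL values are placeholders for testing.
function buildTestPayload(overrides = {}) {
  const alert = {
    status: 'firing', // 'firing' or 'resolved'
    labels: { alertname: 'HighCPUUsage', severity: 'critical', instance: 'web-01:9100' },
    annotations: { description: 'CPU usage above 90% for 5 minutes' },
    startsAt: new Date().toISOString(),
    generatorURL: 'http://prometheus.example:9090/graph',
    ...overrides,
  };

  return {
    version: '4',
    status: alert.status,
    receiver: 'n8n-webhook', // receiver name from the config above
    // group_by in the config above is ['alertname', 'instance']
    groupLabels: { alertname: alert.labels.alertname, instance: alert.labels.instance },
    commonLabels: alert.labels,
    commonAnnotations: alert.annotations,
    externalURL: 'http://alertmanager.example:9093',
    alerts: [alert],
  };
}
```

Posting `JSON.stringify(buildTestPayload())` to the webhook URL above with a `Content-Type: application/json` header should exercise the workflow end to end. Passing `{ status: 'resolved' }` simulates the resolved notifications that `send_resolved: true` enables.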

Step 2: Build the n8n Workflow

The workflow consists of several key nodes:

Webhook Node - Receives alerts from Alertmanager
Function Node (named "Code" in this workflow; recent n8n versions call it the Code node) - Classifies each alert by severity and timing:

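// Alertmanager batches alerts; the Webhook node exposes the POST payload under `body`.
const alerts = items[0].json.body.alerts || [];

// Emit one n8n item per alert so downstream nodes handle each alert separately.
return alerts.map(alert => {
  const labels = alert.labels || {};
  const annotations = alert.annotations || {};
  const startsAt = new Date(alert.startsAt);

  // Business hours are 09:00-17:00 UTC here; weekends are not excluded.
  const hour = startsAt.getUTCHours();
  const isBusinessHours = hour >= 9 && hour < 17;

  // Minutes elapsed since the alert started firing.
  const durationMinutes = (Date.now() - startsAt.getTime()) / 1000 / 60;

  return {
    json: {
      alertname: labels.alertname,
      severity: labels.severity,
      instance: labels.instance,
      description: annotations.description,
      status: alert.status, // 'firing' or 'resolved' (send_resolved is enabled)
      isBusinessHours,
      durationMinutes
    }
  };
});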
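
The classification code measures business hours in UTC and treats weekends like weekdays, which may not match when your team is actually at work. A minimal sketch of a time-zone-aware check follows. The function name `isBusinessHours`, the default zone `America/New_York`, and the Monday to Friday 09:00-17:00 window are all assumptions to adjust.

```javascript
// Sketch: business-hours check in a named time zone, Monday to Friday, 09:00-17:00.
// The default zone and the 9-to-5 window are assumptions; adjust for your team.
function isBusinessHours(date, timeZone = 'America/New_York') {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    weekday: 'short',
    hour: 'numeric',
    hourCycle: 'h23', // hours run 0-23, so midnight is 0 rather than 24
  }).formatToParts(date);

  const weekday = parts.find(p => p.type === 'weekday').value; // e.g. "Mon"
  const hour = Number(parts.find(p => p.type === 'hour').value);

  const isWeekday = !['Sat', 'Sun'].includes(weekday);
  return isWeekday && hour >= 9 && hour < 17;
}
```

Inside the node, `isBusinessHours(startsAt)` could replace the `getUTCHours()` comparison. Using `Intl.DateTimeFormat` keeps daylight saving transitions correct without hardcoding a UTC offset.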

Switch Node - Routes based on criticality and business hours:

Critical + After Hours → PagerDuty (immediate response)
Critical + Business Hours → Discord (urgent channel)
Non-critical → Discord (general alerts)
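
The same three routing rules can be sketched as a small function, which is handy for checking the logic before wiring up Switch conditions. The route names (`pagerduty`, `discord-urgent`, `discord-general`) are hypothetical labels chosen for illustration, not n8n identifiers.

```javascript
// Sketch of the routing rules above. Route names are hypothetical labels
// that a Switch node could match on; they are not n8n identifiers.
function routeAlert({ severity, isBusinessHours }) {
  const isCritical = String(severity || '').toLowerCase() === 'critical';

  if (isCritical && !isBusinessHours) return 'pagerduty'; // page the on-call engineer
  if (isCritical) return 'discord-urgent';                // team is online, use the urgent channel
  return 'discord-general';                               // routine alerts
}
```

A missing or unrecognized severity label falls through to the general channel, so nobody gets paged by an alert that forgot to set `severity`. If you would rather fail loudly, route unknown severities to the urgent channel instead.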

Step 3: Add Intelligent Auto-Resolution

For common issues like high CPU usage, add an AI-powered decision node:

// AI prompt for auto-resolution decision
Analyze this alert to determine if auto-resolution should occur:
- Alert: {{ $node["Code"].json["alertname"] }}
- Severity: {{ $node["Code"].json["severity"] }}
- Description: {{ $node["Code"].json["description"] }}
- Duration: {{ $node["Code"].json["durationMinutes"] }} minutes
- Business Hours: {{ $node["Code"].json["isBusinessHours"] }}

Auto-resolve if:
1. CPU > 80% AND outside business hours
2. CPU > 90% AND duration < 5 minutes
3. Critical severity AND outside business hours

Return JSON: {"shouldAutoResolve": boolean, "reason": "explanation"}

This approach lets you automatically restart services or scale resources for known issues while escalating complex problems to humans.
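
Because a language model can return malformed or unexpected output, it is worth validating the reply before it triggers a restart. The sketch below assumes the reply is either a JSON string or an already parsed object in the `{"shouldAutoResolve": ..., "reason": ...}` shape requested by the prompt. `parseAutoResolveDecision` is a hypothetical helper name.

```javascript
// Sketch: parse and validate the model's reply before acting on it.
// Anything malformed falls back to "do not auto-resolve", so a bad reply escalates to a human.
function parseAutoResolveDecision(rawReply) {
  const fallback = { shouldAutoResolve: false, reason: 'Invalid or missing AI response' };

  let parsed;
  try {
    parsed = typeof rawReply === 'string' ? JSON.parse(rawReply) : rawReply;
  } catch (err) {
    return fallback; // not valid JSON
  }

  // Require a real boolean; strings like "yes" or "true" are rejected.
  if (!parsed || typeof parsed.shouldAutoResolve !== 'boolean') return fallback;

  return {
    shouldAutoResolve: parsed.shouldAutoResolve,
    reason: typeof parsed.reason === 'string' ? parsed.reason : 'No reason given',
  };
}
```

Defaulting to `shouldAutoResolve: false` keeps the failure mode conservative: an unparseable answer results in a human notification, not an automated action.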

Step 4: Document Everything

Add a Notion or database node to log all incidents:

Timestamp and duration
Severity and affected services
Resolution method (auto vs manual)
Follow-up actions needed
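
One way to capture these fields is to build a single flat record per incident just before the logging node. The sketch below assumes the classified alert fields from Step 2 and a decision object with `shouldAutoResolve` and `reason`. The record's field names are hypothetical and would need mapping to your Notion properties or table columns.

```javascript
// Sketch: build one log record per incident from the classified alert and the auto-resolution decision.
// Field names are hypothetical; map them to your Notion properties or table columns.
function buildIncidentRecord(alert, decision, now = new Date()) {
  const autoResolved = Boolean(decision && decision.shouldAutoResolve);

  return {
    timestamp: now.toISOString(),
    durationMinutes: Math.round(alert.durationMinutes * 10) / 10, // one decimal place
    severity: alert.severity,
    affectedService: alert.instance,
    alertname: alert.alertname,
    resolutionMethod: autoResolved ? 'auto' : 'manual',
    resolutionReason: decision && decision.reason ? decision.reason : '',
    followUpActions: [], // filled in by a human during post-incident review
  };
}
```

Keeping the record flat makes it easy to filter later, for example by `resolutionMethod` to see how often automation handled an incident.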

Sample Workflow Structure

Webhook (Prometheus)
  → Function (Classification)
    → Switch (Routing)
      ├── PagerDuty (Critical + After Hours)
      ├── Discord (Critical + Business Hours, Non-critical)
      └── AI Analysis (Auto-Resolution)
        ├── Lambda (Restart Service)
        └── Notion (Document Incident)