DEV Community

Tom

Posted on • Originally published at bubobot.com

Automated Incident Response Workflows with n8n and Monitoring Tools

Most teams face the same challenge: alerts either go to everyone (causing fatigue) or get missed entirely (causing outages). The solution? Intelligent routing based on severity, business hours, and context.

Building a Smart Incident Response Workflow

We'll create an n8n workflow that connects Prometheus alerts to intelligent response actions. Here's what our system will do:

  • Analyze alert severity and timing

  • Route critical after-hours alerts to PagerDuty

  • Send routine alerts to Discord/Slack

  • Attempt automated resolution for common issues

  • Document everything for post-incident analysis

Step 1: Set Up Your Monitoring Stack

First, you need Prometheus and AlertManager configured to send alerts to n8n:

# AlertManager configuration
route:
  receiver: 'n8n-webhook'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 30m

receivers:
- name: 'n8n-webhook'
  webhook_configs:
  - url: 'http://your-n8n-instance:5678/webhook/prometheus'
    send_resolved: true
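Because `send_resolved: true` is set, the webhook receives resolved alerts as well as firing ones, so it helps to separate the two before routing. Here is a minimal sketch of that step. The payload follows the shape of Alertmanager's webhook format, but every label and annotation value is invented for illustration, and `splitByStatus` is a hypothetical helper, not part of n8n or Alertmanager.

```javascript
// Example payload in the shape Alertmanager's webhook receiver sends.
// All values below are made up for illustration.
const samplePayload = {
  version: "4",
  status: "firing",
  receiver: "n8n-webhook",
  groupLabels: { alertname: "HighCPU", instance: "web-1:9100" },
  alerts: [
    {
      status: "firing",
      labels: { alertname: "HighCPU", severity: "critical", instance: "web-1:9100", job: "node" },
      annotations: { description: "CPU above 90% for 5 minutes" },
      startsAt: "2024-01-15T22:30:00Z",
      endsAt: "0001-01-01T00:00:00Z",
    },
    {
      status: "resolved",
      labels: { alertname: "HighCPU", severity: "warning", instance: "web-1:9100", job: "batch" },
      annotations: { description: "CPU above 80% for 5 minutes" },
      startsAt: "2024-01-15T21:00:00Z",
      endsAt: "2024-01-15T21:20:00Z",
    },
  ],
};

// Hypothetical helper: separate firing alerts from resolved ones.
// A missing or malformed payload yields two empty lists.
function splitByStatus(payload) {
  const alerts = Array.isArray(payload?.alerts) ? payload.alerts : [];
  return {
    firing: alerts.filter((a) => a.status === "firing"),
    resolved: alerts.filter((a) => a.status === "resolved"),
  };
}

const { firing, resolved } = splitByStatus(samplePayload);
console.log(`firing: ${firing.length}, resolved: ${resolved.length}`);
```

In a workflow, the resolved list would typically close or update an existing incident instead of paging anyone again.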

Step 2: Build the n8n Workflow

The workflow consists of several key nodes:

  1. Webhook Node - Receives alerts from AlertManager

  2. Function Node (named "Code" in this workflow, which is how later expressions reference it) - Classifies each alert by severity and timing:

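// The Webhook node exposes the AlertManager request body under json.body.
const alerts = items[0].json.body.alerts || [];
return alerts.map(alert => {
  const startsAt = new Date(alert.startsAt);
  // Business hours are checked in UTC (09:00 to 17:00) and ignore weekends;
  // adjust the offset and add a weekday check to match your team's schedule.
  const hour = startsAt.getUTCHours();
  const isBusinessHours = hour >= 9 && hour < 17;
  // Minutes elapsed since the alert started firing.
  const durationMinutes = (Date.now() - startsAt.getTime()) / 1000 / 60;

  // Emit one n8n item per alert so downstream nodes handle them individually.
  return {
    json: {
      alertname: alert.labels.alertname,
      severity: alert.labels.severity,
      instance: alert.labels.instance,
      description: alert.annotations.description,
      isBusinessHours: isBusinessHours,
      durationMinutes: durationMinutes
    }
  };
});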

  3. Switch Node - Routes based on criticality and business hours:
  • Critical + After Hours → PagerDuty (immediate response)

  • Critical + Business Hours → Discord (urgent channel)

  • Non-critical → Discord (general alerts)
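The three routing rules above can also be written as a small pure function, which makes them easy to test before wiring up the Switch node. This is only a sketch: the route names in `ROUTES` and the `routeAlert` function are hypothetical, and it assumes each item carries the `severity` and `isBusinessHours` fields produced by the classification step.

```javascript
// Hypothetical route names; in n8n these would correspond to Switch node outputs.
const ROUTES = {
  PAGERDUTY: "pagerduty",
  DISCORD_URGENT: "discord-urgent",
  DISCORD_GENERAL: "discord-general",
};

// Decide where a classified alert should go.
// Anything that is not explicitly "critical" is treated as non-critical.
function routeAlert({ severity, isBusinessHours }) {
  const isCritical = String(severity ?? "").toLowerCase() === "critical";
  if (!isCritical) return ROUTES.DISCORD_GENERAL;
  return isBusinessHours ? ROUTES.DISCORD_URGENT : ROUTES.PAGERDUTY;
}

console.log(routeAlert({ severity: "critical", isBusinessHours: false }));
console.log(routeAlert({ severity: "critical", isBusinessHours: true }));
console.log(routeAlert({ severity: "warning", isBusinessHours: false }));
```

Treating a missing severity label as non-critical is a design choice in this sketch. If unlabeled alerts should page someone, flip that default.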

Step 3: Add Intelligent Auto-Resolution

For common issues like high CPU usage, add an AI-powered decision node:

// AI prompt for auto-resolution decision
Analyze this alert to determine if auto-resolution should occur:
- Alert: {{ $node["Code"].json["alertname"] }}
- Severity: {{ $node["Code"].json["severity"] }}
- Description: {{ $node["Code"].json["description"] }}
- Duration: {{ $node["Code"].json.durationMinutes }} minutes
- Business Hours: {{ $node["Code"].json["isBusinessHours"] }}

Auto-resolve if:
1. CPU > 80% AND outside business hours
2. CPU > 90% AND duration < 5 minutes
3. Critical severity AND outside business hours

Return JSON: {"shouldAutoResolve": boolean, "reason": "explanation"}


This approach lets you automatically restart services or scale resources for known issues while escalating complex problems to humans.
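Since an AI node can return malformed or unexpected output, it is worth validating the reply before acting on it. Here is a minimal sketch of such a guard, assuming the reply arrives as a string and should match the `{"shouldAutoResolve": boolean, "reason": "explanation"}` shape from the prompt. `parseAutoResolveDecision` is a hypothetical helper, and falling back to human escalation is an assumption about what your team would want.

```javascript
// Hypothetical guard around the model's reply. Any reply that cannot be
// validated results in "do not auto-resolve", so a human gets involved.
function parseAutoResolveDecision(rawReply) {
  const fallback = {
    shouldAutoResolve: false,
    reason: "Unparseable AI response; escalating to a human",
  };
  if (typeof rawReply !== "string") return fallback;

  // Models sometimes wrap JSON in extra text, so take the outermost {...} span.
  const match = rawReply.match(/\{[\s\S]*\}/);
  if (!match) return fallback;

  try {
    const parsed = JSON.parse(match[0]);
    // Require a real boolean; strings like "yes" are rejected.
    if (typeof parsed.shouldAutoResolve !== "boolean") return fallback;
    return {
      shouldAutoResolve: parsed.shouldAutoResolve,
      reason: typeof parsed.reason === "string" ? parsed.reason : "No reason given",
    };
  } catch {
    return fallback;
  }
}

console.log(parseAutoResolveDecision('{"shouldAutoResolve": true, "reason": "Short CPU spike"}'));
console.log(parseAutoResolveDecision("I think you should restart it."));
```

Defaulting to escalation keeps a bad model response from triggering a restart or scaling action on its own.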

Step 4: Document Everything

Add a Notion or database node to log all incidents:

  • Timestamp and duration

  • Severity and affected services

  • Resolution method (auto vs manual)

  • Follow-up actions needed
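A log entry covering the fields above could be assembled in a Code node before it reaches the Notion or database node. The sketch below is one possible shape, not a required schema: `buildIncidentRecord` and all of its property names are hypothetical, and it assumes the classified alert and the auto-resolution decision are both available at that point in the workflow.

```javascript
// Hypothetical record builder for the incident log.
// `alert` is a classified alert; `decision` is the auto-resolution outcome.
function buildIncidentRecord(alert, decision, followUpActions = [], now = new Date()) {
  return {
    timestamp: now.toISOString(),
    // Round to one decimal place so the log stays readable.
    durationMinutes: Math.round(alert.durationMinutes * 10) / 10,
    severity: alert.severity,
    affectedService: alert.instance,
    alertname: alert.alertname,
    resolutionMethod: decision.shouldAutoResolve ? "auto" : "manual",
    resolutionReason: decision.reason,
    followUpActions,
  };
}

const record = buildIncidentRecord(
  { alertname: "HighCPU", severity: "critical", instance: "web-1:9100", durationMinutes: 12.34 },
  { shouldAutoResolve: false, reason: "Sustained load; needs investigation" },
  ["Review autoscaling thresholds"],
  new Date("2024-01-15T22:45:00Z")
);
console.log(record);
```

Passing `now` as a parameter keeps the function deterministic, which makes it simple to test outside n8n.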

Sample Workflow Structure

Webhook (Prometheus)
   Function (Classification)
     Switch (Routing)
      ├── PagerDuty (Critical + After Hours)
      ├── Discord (Business Hours)
      └── AI Analysis (Auto-Resolution)
        ├── Lambda (Restart Service)
        └── Notion (Document Incident)


Getting Started

  1. Set up basic monitoring with Prometheus and AlertManager

  2. Create the n8n workflow with webhook and classification nodes

  3. Add routing logic based on your team's needs

  4. Implement auto-resolution for common, well-understood issues

  5. Document and iterate based on real-world usage

Beyond Basic Automation

While this n8n workflow provides powerful automation capabilities, teams with critical uptime requirements might need more advanced features like:

  • Sub-minute monitoring intervals

  • AI-powered anomaly detection

  • Advanced correlation and grouping

  • Enterprise-grade reliability features

The key is starting with intelligent routing and basic automation, then expanding based on your team's specific needs and operational maturity.


For the complete n8n workflow JSON and deployment scripts, check out our full implementation guide.

Read more at https://bubobot.com/blog/automated-incident-response-workflows-with-n8n-and-monitoring-tools
