When an incident fires at 3am, the last thing your on-call engineer needs is ambiguity. They need to know what broke, how bad it is, and what to do about it. Most PagerDuty alerts give you a service name, a check name, and a timestamp. That is not enough context to act on quickly.
This tutorial builds a Swrly workflow that intercepts PagerDuty alerts, uses an AI agent to analyze the incident and suggest response steps, and routes the enriched alert to the right Slack channel based on severity. Five nodes, zero manual triage.
What You Will Build
A 5-node workflow that:
- Receives PagerDuty incident webhooks via an event trigger
- Runs an AI agent that analyzes the alert, pulls context, and suggests runbook steps
- Evaluates severity with a condition node
- Routes critical incidents to `#incidents` in Slack and acknowledges the incident in PagerDuty
- Routes non-critical incidents to `#ops-alerts` in Slack
The on-call engineer wakes up to a Slack message that already explains the probable cause and the first three things to check. Response time drops. Escalation accuracy improves.
Prerequisites
- A Swrly account (Pro plan or higher for event triggers)
- A PagerDuty account with at least one service configured
- A Slack workspace with `#incidents` and `#ops-alerts` channels
- Your Claude Code session token (Settings > API Keys)
- PagerDuty and Slack integrations connected in Settings > Integrations
Step 1: Set Up the Event Trigger
Create a new swirl and name it "Incident Response Bot."
Drag a Trigger node onto the canvas. Set the trigger type to Event and configure it to listen for PagerDuty webhook events.
In PagerDuty, go to Services > Your Service > Integrations > Add Integration. Select Generic Webhooks (v3). Set the webhook URL to your Swrly event endpoint:
https://swrly.com/api/v1/webhooks/trigger/wh_abc123def456
Configure PagerDuty to send webhooks for incident.triggered events. When an incident fires on this service, PagerDuty will POST a payload to Swrly that includes the incident ID, title, service name, severity, and the triggering alert details.
A typical PagerDuty webhook payload includes:
```json
{
  "event": {
    "event_type": "incident.triggered",
    "data": {
      "id": "P8BXYZ1",
      "title": "High CPU on api-prod-03",
      "service": {
        "name": "API Production",
        "id": "PABC123"
      },
      "urgency": "high",
      "created_at": "2026-04-03T03:14:00Z",
      "body": {
        "details": "CPU usage exceeded 95% for 5 minutes on api-prod-03. Threshold: 90%."
      }
    }
  }
}
```
Your trigger node receives this payload and passes it downstream.
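The `{{trigger.event.data.service.name}}`-style template variables used by the Slack nodes are just dotted paths into this payload: the leading `trigger` names the node that produced the data, and the rest walks the JSON. Swrly's actual template engine is not shown here; this is a hypothetical resolver, in Python, purely to illustrate the correspondence:

```python
from functools import reduce

# Mirrors the example PagerDuty payload above (trimmed for brevity).
payload = {
    "event": {
        "event_type": "incident.triggered",
        "data": {
            "id": "P8BXYZ1",
            "title": "High CPU on api-prod-03",
            "service": {"name": "API Production", "id": "PABC123"},
            "urgency": "high",
        },
    }
}

def resolve(path: str, root: dict):
    """Walk a dotted path like 'event.data.service.name' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), root)

# {{trigger.event.data.service.name}} in a Slack template resolves to:
print(resolve("event.data.service.name", payload))  # API Production
```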
Step 2: Add the Incident Analyzer Agent
Drag an Agent node onto the canvas and connect it to the trigger. Name it Incident Analyzer.
This is the most valuable node in the workflow. The agent reads the alert, reasons about what might be wrong, and suggests concrete response steps. Configure the system prompt:
You are a senior SRE incident responder. You receive PagerDuty
incident alerts and provide rapid incident analysis.
When you receive an alert:
1. Identify the affected service and component from the alert data
2. Analyze the alert details (metrics, thresholds, error messages)
3. Determine the likely root cause based on common failure patterns:
- High CPU: runaway process, traffic spike, memory leak causing
GC thrashing, crypto mining
- High memory: memory leak, cache overflow, connection pool
exhaustion
- 5xx errors: upstream dependency failure, deployment regression,
database connection limits
- Latency spike: slow queries, cold cache, network saturation,
DNS resolution issues
4. Suggest 3-5 specific runbook steps the on-call engineer should
take, in order of priority
Format your response as:
**Incident:** [service name] — [short description]
**Probable cause:** [1-2 sentences]
**Severity assessment:** CRITICAL or NON-CRITICAL
**Recommended steps:**
1. [First thing to check/do]
2. [Second thing]
3. [Third thing]
Mark as CRITICAL if: the service is customer-facing AND the issue
indicates complete service degradation or data loss risk.
Mark as NON-CRITICAL for: partial degradation, internal services,
warning-level thresholds, or non-customer-facing components.
Be specific. "Check the logs" is not helpful. "Run kubectl logs
-f deployment/api-server -n production --since=10m and look for
OOMKilled or connection refused errors" is helpful.
Under Agent Settings, configure:
- Max Turns: 5 -- the agent may want to use tools to pull additional context.
- Accumulate Context: enabled
- Tools: If you have internal service catalog or runbook integrations, enable them here. Even without tools, the agent produces useful analysis from the alert data alone.
For the example CPU alert above, the agent would produce something like:
**Incident:** API Production — High CPU on api-prod-03
**Probable cause:** CPU at 95% on a single node suggests either a
runaway process or uneven load balancer distribution. If other nodes
are healthy, this is likely process-level, not a traffic spike.
**Severity assessment:** CRITICAL
**Recommended steps:**
1. SSH into api-prod-03 and run `top -c` to identify the process
consuming CPU. Look for unexpected PIDs or processes outside
the normal service set.
2. Check the load balancer dashboard for request distribution
across nodes. If api-prod-03 is receiving disproportionate
traffic, drain it from the pool.
3. Review the last deployment timestamp against the incident start
time. If they correlate, roll back the most recent release.
4. Check `dmesg` and `/var/log/syslog` for OOM killer activity —
high CPU can be a symptom of memory pressure causing swap thrash.
That is substantially more useful than the raw "High CPU on api-prod-03" alert.
Step 3: Add the Severity Condition
Drag a Condition node onto the canvas and connect it to the Incident Analyzer's output. Configure the rule:
- Field: `output`
- Operator: `contains`
- Value: `**Severity assessment:** CRITICAL`

If the agent assessed the incident as critical, the workflow follows the true branch. Non-critical incidents follow the false branch. One caveat: matching on the bare word CRITICAL would misroute everything, because NON-CRITICAL contains CRITICAL as a substring. Matching the full severity line -- which reads either `**Severity assessment:** CRITICAL` or `**Severity assessment:** NON-CRITICAL` -- is what makes this condition reliable, and the agent's structured output format guarantees that line is always present.
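A few lines of Python (illustration only, not how Swrly evaluates conditions) show why the exact match string matters: CRITICAL is a substring of NON-CRITICAL, so checking the full severity line is the safe choice.

```python
critical = "**Severity assessment:** CRITICAL"
non_critical = "**Severity assessment:** NON-CRITICAL"

# Naive keyword match: CRITICAL is a substring of NON-CRITICAL,
# so this would route *every* incident down the critical branch.
assert "CRITICAL" in non_critical

# Matching the full severity line disambiguates the two outcomes.
def is_critical(agent_output: str) -> bool:
    return "**Severity assessment:** CRITICAL" in agent_output

assert is_critical(critical)
assert not is_critical(non_critical)
```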
Step 4: Route to Slack Channels
Critical branch (true): Drag a Slack Integration node and connect it to the condition's true output. Configure it:
- Integration: Slack
- Action: `slack_send_message`
- Channel: `#incidents`
- Message template:
:red_circle: *CRITICAL INCIDENT*
*Service:* {{trigger.event.data.service.name}}
*Alert:* {{trigger.event.data.title}}
*PagerDuty ID:* {{trigger.event.data.id}}
*Triggered at:* {{trigger.event.data.created_at}}
*AI Analysis:*
{{incident_analyzer.output}}
_Respond immediately. This incident was auto-triaged by Swrly._
For critical incidents, you may also want to add a second integration node in parallel that acknowledges the incident in PagerDuty using the `pagerduty_manage_incident` action with status set to `acknowledged`. This signals to PagerDuty that the alert has been seen and is being worked, suppressing repeated notifications. Connect this node to the same true branch as the Slack node.
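Under the hood, acknowledging an incident is a single call to PagerDuty's REST API (`PUT /incidents/{id}` with a `From` header identifying the requester). If you ever need to do it outside Swrly, here is a minimal sketch using only the standard library -- the incident ID, API token, and email are placeholders:

```python
import json
import urllib.request

PAGERDUTY_API = "https://api.pagerduty.com/incidents/{incident_id}"

def build_ack_request(incident_id: str, api_token: str, from_email: str) -> urllib.request.Request:
    """Build (but do not send) a PagerDuty v2 'acknowledge incident' request."""
    body = {"incident": {"type": "incident_reference", "status": "acknowledged"}}
    return urllib.request.Request(
        PAGERDUTY_API.format(incident_id=incident_id),
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
            "From": from_email,  # PagerDuty requires the requester's email here
        },
        method="PUT",
    )

# req = build_ack_request("P8BXYZ1", "YOUR_API_TOKEN", "oncall@example.com")
# urllib.request.urlopen(req)  # uncomment to actually send the acknowledgement
```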
Non-critical branch (false): Drag another Slack Integration node and connect it to the condition's false output:
- Integration: Slack
- Action: `slack_send_message`
- Channel: `#ops-alerts`
- Message template:
:large_yellow_circle: *Ops Alert*
*Service:* {{trigger.event.data.service.name}}
*Alert:* {{trigger.event.data.title}}
*AI Analysis:*
{{incident_analyzer.output}}
_Non-critical. Review during business hours._
Step 5: Save and Test
Save the workflow. To test without triggering a real incident, click Save and Run and paste a sample PagerDuty webhook payload. Verify that:
- The agent produces a structured analysis with a severity assessment
- The condition routes correctly based on the CRITICAL/NON-CRITICAL keyword
- The right Slack channel receives the message
- Template variables resolve with the correct data from the payload
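If you prefer to drive the same test from a script instead of the Save and Run dialog, you can POST the sample payload to the webhook endpoint yourself. A sketch, using the example endpoint from Step 1 -- substitute your real webhook URL, and note your endpoint may enforce auth requirements not shown here:

```python
import json
import urllib.request

# Example endpoint from Step 1 -- replace with your own webhook URL.
WEBHOOK_URL = "https://swrly.com/api/v1/webhooks/trigger/wh_abc123def456"

sample_payload = {
    "event": {
        "event_type": "incident.triggered",
        "data": {
            "id": "TEST123",
            "title": "High CPU on api-prod-03",
            "service": {"name": "API Production", "id": "PABC123"},
            "urgency": "high",
            "created_at": "2026-04-03T03:14:00Z",
            "body": {"details": "CPU usage exceeded 95% for 5 minutes."},
        },
    }
}

req = urllib.request.Request(
    WEBHOOK_URL,
    data=json.dumps(sample_payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually deliver the test event
```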
Then trigger a real test incident in PagerDuty (most services have a "Create Test Incident" button). Confirm the end-to-end flow works, from PagerDuty webhook to Slack message, in under 30 seconds.
Why This Matters
Incident response is one of the highest-leverage places to put AI. The difference between a 2-minute and a 20-minute response time can be hundreds of thousands of dollars for a production outage. But the bottleneck is rarely the fix itself -- it is the triage. Reading the alert, understanding what service is affected, figuring out where to look first.
This workflow compresses that triage step from minutes to seconds. The on-call engineer gets a structured analysis and a prioritized list of steps before they have even opened their laptop. They skip the "what does this alert mean" phase and go straight to investigating the probable cause.
The AI does not replace your incident process. It front-loads the thinking that a senior SRE would do, so that a junior engineer on-call at 3am has the same starting point a Staff SRE would. That levels up your entire team's response capability.
And because every run is logged with full input and output, you build up a library of incident analyses over time. That library becomes a training resource, a pattern database, and an audit trail all at once.