When an incident fires at 3am, the last thing your on-call engineer needs is ambiguity. They need to know what broke, how bad it is, and what to do about it. Most PagerDuty alerts give you a service name, a check name, and a timestamp. That is not enough context to act on quickly.
This tutorial builds a Swrly workflow that intercepts PagerDuty alerts, uses an AI agent to analyze the incident and suggest response steps, and routes the enriched alert to the right Slack channel based on severity. Five nodes, zero manual triage.
What You Will Build
A 5-node workflow that:
- Receives PagerDuty incident webhooks via an event trigger
- Runs an AI agent that analyzes the alert, pulls context, and suggests runbook steps
- Evaluates severity with a condition node
- Routes critical incidents to `#incidents` in Slack and acknowledges the incident in PagerDuty
- Routes non-critical incidents to `#ops-alerts` in Slack
The on-call engineer wakes up to a Slack message that already explains the probable cause and the first three things to check. Response time drops. Escalation accuracy improves.
Prerequisites
- A Swrly account (Pro plan or higher for event triggers)
- A PagerDuty account with at least one service configured
- A Slack workspace with `#incidents` and `#ops-alerts` channels
- Your Claude Code session token (Settings > API Keys)
- PagerDuty and Slack integrations connected in Settings > Integrations
Step 1: Set Up the Event Trigger
Create a new swirl and name it "Incident Response Bot."
Drag a Trigger node onto the canvas. Set the trigger type to Event and configure it to listen for PagerDuty webhook events.
In PagerDuty, go to Services > Your Service > Integrations > Add Integration. Select Generic Webhooks (v3). Set the webhook URL to your Swrly event endpoint:
https://swrly.com/api/v1/webhooks/trigger/wh_abc123def456
Configure PagerDuty to send webhooks for incident.triggered events. When an incident fires on this service, PagerDuty will POST a payload to Swrly that includes the incident ID, title, service name, severity, and the triggering alert details.
A typical PagerDuty webhook payload includes:
```json
{
  "event": {
    "event_type": "incident.triggered",
    "data": {
      "id": "P8BXYZ1",
      "title": "High CPU on api-prod-03",
      "service": {
        "name": "API Production",
        "id": "PABC123"
      },
      "urgency": "high",
      "created_at": "2026-04-03T03:14:00Z",
      "body": {
        "details": "CPU usage exceeded 95% for 5 minutes on api-prod-03. Threshold: 90%."
      }
    }
  }
}
```
Your trigger node receives this payload and passes it downstream.
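The `{{trigger.event.data.service.name}}`-style template variables used by the Slack nodes are just dotted paths into this payload: the leading `trigger` names the node that produced the data, and the rest walks the JSON. Swrly's actual template engine is not shown here; this is a hypothetical resolver, in Python, purely to illustrate the correspondence:

```python
from functools import reduce

# Mirrors the example PagerDuty payload above (trimmed for brevity).
payload = {
    "event": {
        "event_type": "incident.triggered",
        "data": {
            "id": "P8BXYZ1",
            "title": "High CPU on api-prod-03",
            "service": {"name": "API Production", "id": "PABC123"},
            "urgency": "high",
        },
    }
}

def resolve(path: str, root: dict):
    """Walk a dotted path like 'event.data.service.name' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), root)

# {{trigger.event.data.service.name}} in a Slack template resolves to:
print(resolve("event.data.service.name", payload))  # API Production
```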
Step 2: Add the Incident Analyzer Agent
Drag an Agent node onto the canvas and connect it to the trigger. Name it Incident Analyzer.
This is the most valuable node in the workflow. The agent reads the alert, reasons about what might be wrong, and suggests concrete response steps. Configure the system prompt:
You are a senior SRE incident responder. You receive PagerDuty
incident alerts and provide rapid incident analysis.
When you receive an alert:
1. Identify the affected service and component from the alert data
2. Analyze the alert details (metrics, thresholds, error messages)
3. Determine the likely root cause based on common failure patterns:
- High CPU: runaway process, traffic spike, memory leak causing
GC thrashing, crypto mining
- High memory: memory leak, cache overflow, connection pool
exhaustion
- 5xx errors: upstream dependency failure, deployment regression,
database connection limits
- Latency spike: slow queries, cold cache, network saturation,
DNS resolution issues
4. Suggest 3-5 specific runbook steps the on-call engineer should
take, in order of priority
Format your response as:
**Incident:** [service name] — [short description]
**Probable cause:** [1-2 sentences]
**Severity assessment:** CRITICAL or NON-CRITICAL
**Recommended steps:**
1. [First thing to check/do]
2. [Second thing]
3. [Third thing]
Mark as CRITICAL if: the service is customer-facing AND the issue
indicates complete service degradation or data loss risk.
Mark as NON-CRITICAL for: partial degradation, internal services,
warning-level thresholds, or non-customer-facing components.
Be specific. "Check the logs" is not helpful. "Run kubectl logs
-f deployment/api-server -n production --since=10m and look for
OOMKilled or connection refused errors" is helpful.
Under Agent Settings, configure:
- Max Turns: 5 -- the agent may want to use tools to pull additional context.
- Accumulate Context: enabled
- Tools: If you have internal service catalog or runbook integrations, enable them here. Even without tools, the agent produces useful analysis from the alert data alone.
For the example CPU alert above, the agent would produce something like:
**Incident:** API Production — High CPU on api-prod-03
**Probable cause:** CPU at 95% on a single node suggests either a
runaway process or uneven load balancer distribution. If other nodes
are healthy, this is likely process-level, not a traffic spike.
**Severity assessment:** CRITICAL
**Recommended steps:**
1. SSH into api-prod-03 and run `top -c` to identify the process
consuming CPU. Look for unexpected PIDs or processes outside
the normal service set.
2. Check the load balancer dashboard for request distribution
across nodes. If api-prod-03 is receiving disproportionate
traffic, drain it from the pool.
3. Review the last deployment timestamp against the incident start
time. If they correlate, roll back the most recent release.
4. Check `dmesg` and `/var/log/syslog` for OOM killer activity —
high CPU can be a symptom of memory pressure causing swap thrash.
That is substantially more useful than the raw "High CPU on api-prod-03" alert.
Step 3: Add the Severity Condition
Drag a Condition node onto the canvas and connect it to the Incident Analyzer's output. Configure the rule:
- Field: `output`
- Operator: `contains`
- Value: `**Severity assessment:** CRITICAL`

If the agent assessed the incident as critical, the workflow follows the true branch. Non-critical incidents follow the false branch. One caveat: matching on the bare word CRITICAL would misroute everything, because NON-CRITICAL contains CRITICAL as a substring. Matching the full severity line -- which reads either `**Severity assessment:** CRITICAL` or `**Severity assessment:** NON-CRITICAL` -- is what makes this condition reliable, and the agent's structured output format guarantees that line is always present.
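A few lines of Python (illustration only, not how Swrly evaluates conditions) show why the exact match string matters: CRITICAL is a substring of NON-CRITICAL, so checking the full severity line is the safe choice.

```python
critical = "**Severity assessment:** CRITICAL"
non_critical = "**Severity assessment:** NON-CRITICAL"

# Naive keyword match: CRITICAL is a substring of NON-CRITICAL,
# so this would route *every* incident down the critical branch.
assert "CRITICAL" in non_critical

# Matching the full severity line disambiguates the two outcomes.
def is_critical(agent_output: str) -> bool:
    return "**Severity assessment:** CRITICAL" in agent_output

assert is_critical(critical)
assert not is_critical(non_critical)
```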
Step 4: Route to Slack Channels
Critical branch (true): Drag a Slack Integration node and connect it to the condition's true output. Configure it:
- Integration: Slack
- Action: `slack_send_message`
- Channel: `#incidents`
- Message template:
:red_circle: *CRITICAL INCIDENT*
*Service:* {{trigger.event.data.service.name}}
*Alert:* {{trigger.event.data.title}}
*PagerDuty ID:* {{trigger.event.data.id}}
*Triggered at:* {{trigger.event.data.created_at}}
*AI Analysis:*
{{incident_analyzer.output}}
_Respond immediately. This incident was auto-triaged by Swrly._
For critical incidents, you may also want to add a second integration node in parallel that acknowledges the incident in PagerDuty using the `pagerduty_manage_incident` action with status set to `acknowledged`. This signals to PagerDuty that the alert has been seen and is being worked, suppressing repeated notifications. Connect this node to the same true branch as the Slack node.
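Under the hood, acknowledging an incident is a single call to PagerDuty's REST API (`PUT /incidents/{id}` with a `From` header identifying the requester). If you ever need to do it outside Swrly, here is a minimal sketch using only the standard library -- the incident ID, API token, and email are placeholders:

```python
import json
import urllib.request

PAGERDUTY_API = "https://api.pagerduty.com/incidents/{incident_id}"

def build_ack_request(incident_id: str, api_token: str, from_email: str) -> urllib.request.Request:
    """Build (but do not send) a PagerDuty v2 'acknowledge incident' request."""
    body = {"incident": {"type": "incident_reference", "status": "acknowledged"}}
    return urllib.request.Request(
        PAGERDUTY_API.format(incident_id=incident_id),
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
            "From": from_email,  # PagerDuty requires the requester's email here
        },
        method="PUT",
    )

# req = build_ack_request("P8BXYZ1", "YOUR_API_TOKEN", "oncall@example.com")
# urllib.request.urlopen(req)  # uncomment to actually send the acknowledgement
```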
Non-critical branch (false): Drag another Slack Integration node and connect it to the condition's false output:
- Integration: Slack
- Action: `slack_send_message`
- Channel: `#ops-alerts`
- Message template:
:large_yellow_circle: *Ops Alert*
*Service:* {{trigger.event.data.service.name}}
*Alert:* {{trigger.event.data.title}}
*AI Analysis:*
{{incident_analyzer.output}}
_Non-critical. Review during business hours._
Step 5: Save and Test
Save the workflow. To test without triggering a real incident, click Save and Run and paste a sample PagerDuty webhook payload. Verify that:
- The agent produces a structured analysis with a severity assessment
- The condition routes correctly based on the CRITICAL/NON-CRITICAL keyword
- The right Slack channel receives the message
- Template variables resolve with the correct data from the payload
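If you prefer to drive the same test from a script instead of the Save and Run dialog, you can POST the sample payload to the webhook endpoint yourself. A sketch, using the example endpoint from Step 1 -- substitute your real webhook URL, and note your endpoint may enforce auth requirements not shown here:

```python
import json
import urllib.request

# Example endpoint from Step 1 -- replace with your own webhook URL.
WEBHOOK_URL = "https://swrly.com/api/v1/webhooks/trigger/wh_abc123def456"

sample_payload = {
    "event": {
        "event_type": "incident.triggered",
        "data": {
            "id": "TEST123",
            "title": "High CPU on api-prod-03",
            "service": {"name": "API Production", "id": "PABC123"},
            "urgency": "high",
            "created_at": "2026-04-03T03:14:00Z",
            "body": {"details": "CPU usage exceeded 95% for 5 minutes."},
        },
    }
}

req = urllib.request.Request(
    WEBHOOK_URL,
    data=json.dumps(sample_payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually deliver the test event
```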
Then trigger a real test incident in PagerDuty (most services have a "Create Test Incident" button). Confirm the end-to-end flow works, from PagerDuty webhook to Slack message, in under 30 seconds.
Why This Matters
Incident response is one of the highest-leverage places to put AI. The difference between a 2-minute and a 20-minute response time can be hundreds of thousands of dollars for a production outage. But the bottleneck is rarely the fix itself -- it is the triage. Reading the alert, understanding what service is affected, figuring out where to look first.
This workflow compresses that triage step from minutes to seconds. The on-call engineer gets a structured analysis and a prioritized list of steps before they have even opened their laptop. They skip the "what does this alert mean" phase and go straight to investigating the probable cause.
The AI does not replace your incident process. It front-loads the thinking that a senior SRE would do, so that a junior engineer on-call at 3am has the same starting point a Staff SRE would. That levels up your entire team's response capability.
And because every run is logged with full input and output, you build up a library of incident analyses over time. That library becomes a training resource, a pattern database, and an audit trail all at once.