
Creating Custom Incident Response Workflows with n8n

Tanay Pant ・ Originally published at Medium ・ 15 min read

I’ve been involved in the DevOps world for a while and yet, I finished reading The Phoenix Project only recently. This piqued my interest in how teams execute their incident response playbooks. It’s enlightening to see the different approaches teams take, to hone what works best for them.

Incident response management in Mattermost powered by n8n and PagerDuty

I wanted to see what automating a minimalist incident response playbook would look like, and I decided to test it out with three of my favorite tools: n8n, PagerDuty, and Mattermost. Here’s a quick introduction to the three tools, in case you aren’t familiar with them:

  1. n8n is a fair-code licensed tool that helps you automate tasks, sync data between various sources, and react to events all via a visual workflow editor.
  2. PagerDuty is a SaaS incident response platform for IT departments in companies.
  3. Mattermost is a flexible and open-source messaging alternative to Slack.

To avoid panic during an incident, a lot of companies have an incident response playbook. I created a minimalist six-step playbook for this tutorial. Whenever a service goes down or something unexpected happens, the on-call team would follow this high-level protocol:

  1. Triage issue in Jira
  2. Create auxiliary channel
  3. Invite the on-call team to the channel
  4. Acknowledge the issue
  5. Fix the issue
  6. Resolve the ticket

We will automate this playbook with three workflows in n8n, and this is what the result will look like once we are done.

Incident response workflow in play

Workflow 1 — Make sure everyone knows what happened

Our first workflow will cover the first three steps of the playbook. Whenever a service goes down and creates an incident report on PagerDuty, we want the workflow to automate the following tasks for us:

  1. A webhook gets triggered and informs a general incidents channel on Mattermost that something is wrong.
  2. Create an auxiliary channel for the specific incident, invite the on-call team to it and share its link for those interested in the incident.
  3. Triage an issue on Jira.
  4. Share the links of the auxiliary channel, PagerDuty incident and the Jira issue in the Incidents channel, and the auxiliary channel.
  5. Share action buttons in the auxiliary channel to acknowledge and resolve the incident.

Let’s get started with the nodes of the first workflow. I have also submitted Workflow 1 on n8n.io, in case you’d like to skim through this workflow. Please note that you’ll still need to configure a couple of things like your credentials, the channels on Mattermost, and the settings of the nodes. You can find information on how to set up n8n in the documentation.

1. Webhook node: Get data from PagerDuty

First of all, we need to pull in the new incident reports from PagerDuty. To do that, start n8n with the tunnel parameter:

n8n start --tunnel

Note: Make sure that you don’t forget to add the --tunnel parameter.

Add a new node by clicking on the + button on the top right of the Editor UI. Select the Webhook node under the Triggers section.

In the Node Editor view, set the HTTP method to POST. For the Path, I have entered webhook but feel free to add something else here according to your preferred convention. Now, you’ll need to save the workflow. I named it ‘Incident Response Workflow’. Once the workflow is saved, click on Webhook URLs, select Test, and then click on the URL to copy it to the clipboard.

Note: Don’t forget to save the workflow first before copying the Webhook URLs.

Here’s a GIF of me following the steps mentioned above.

Creating a Webhook node to get data from PagerDuty

Now that we have our Webhook node ready on n8n, we’ll need to configure the settings on PagerDuty, so that it sends the new incident reports to the webhook.

Unless your team already uses PagerDuty, you can create a free trial account on PagerDuty. If you are creating a new account, you’ll also have to create a service that PagerDuty will be monitoring.

PagerDuty has integrations with a lot of services, to monitor them, in case something goes wrong. Once you have created your service, let’s configure the webhooks for the service.

To do that, select the Configuration menu on the top and click on Services. For the service that you want to configure the webhook for, click on the More button on the right side and select View Integrations from the menu. Now, under the section called Extensions, click on the New Extension button and select ‘Generic V2 Webhook’ as the Extension Type. I entered n8n as the name and pasted the URL that I copied from the Webhook node. Click on the Save button and we are done!

Here’s a GIF of me following the steps mentioned above.

Adding the webhook as an extension on PagerDuty

Now, click on the Execute Workflow button to register the webhook. Once you’ve done that, you can create a new incident on PagerDuty. Your Webhook node will receive all the details. Keep in mind that the Test webhooks are only valid for 120 seconds. The output should look something like the following image.

Response received by the Webhook node upon creation of an incident on PagerDuty

At times, when you are sending too many requests from PagerDuty, it will disable the webhook. You’ll have to re-enable it by going to the list of extensions and clicking on the Re-enable button.

Re-enabling the webhook on PagerDuty

2. Mattermost node: Create an auxiliary channel

Now, we need to create a Mattermost node that will create an auxiliary channel so that the on-call team can coordinate on a fix for the incident.

To do that, click on the + button and click on the Mattermost node. In the Node Editor, enter your Mattermost credentials. Here’s some detailed information on how to create an access token for the credentials. I have used an access token from a bot account, but you can also use the access token from your account.

Note: Throughout the tutorial, please make sure that the nodes are connected properly before you start the configuration in the Node Editor. If you don’t do this, the variables mentioned in the tutorials might not be visible to you.

Once you are all sorted out with the credentials, select ‘Channel’ as the Resource in the Node Editor. Now select your team as the Team ID (in case you are unable to acquire that, please check with your system admin). We now need to enter a Display Name for the channel. Since this would be a dynamic piece of information, click on the gears icon next to the field and select Add Expression. Select the following in the Variable Selector:

Nodes > Webhook > Output Data > JSON > body > messages > [Item: 0] > log_entries > [Item: 0] > incident > summary

Quite some nesting, I know! This makes sure that the display name of the channel is the same as the incident summary on PagerDuty, which keeps things coherent. Now you need to enter a Name. This needs to be a unique value, so we’ll select the id from the incident report. Click on Add Expression and select the following in the Variable Selector:

Nodes > Webhook > Output Data > JSON > body > messages > [Item: 0] > id

Perfect, now click on Execute Node and this will create an auxiliary channel on Mattermost. Here’s a GIF of me following the steps mentioned above.

Creating a Mattermost node to create an auxiliary channel
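If the Variable Selector paths feel abstract, here is a minimal Python sketch of what they resolve to. It assumes a heavily trimmed, made-up payload (SAMPLE_BODY) that only mirrors the keys used in this step; a real PagerDuty webhook body carries many more fields. The helper names dig and channel_fields are my own and not part of n8n.

```python
from typing import Any

# Hypothetical, trimmed-down webhook body. Only the keys used in this
# step are included; the values are invented for illustration.
SAMPLE_BODY = {
    "messages": [
        {
            "id": "msg-0001",
            "log_entries": [
                {"incident": {"id": "PABC123", "summary": "Database is down"}}
            ],
        }
    ]
}


def dig(data: Any, *path: Any) -> Any:
    """Follow dict keys and list indexes one step at a time."""
    for step in path:
        data = data[step]
    return data


def channel_fields(body: dict) -> dict:
    """Pick the two values the channel-creating Mattermost node needs."""
    return {
        # body > messages > [Item: 0] > log_entries > [Item: 0] > incident > summary
        "display_name": dig(body, "messages", 0, "log_entries", 0, "incident", "summary"),
        # body > messages > [Item: 0] > id
        "name": dig(body, "messages", 0, "id"),
    }


print(channel_fields(SAMPLE_BODY))
```

Each `>` in the Variable Selector corresponds to one step in the path, and `[Item: 0]` is simply the first element of a list.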

3. Mattermost node: Add on-call team to auxiliary channel

Once the auxiliary channel has been created, we need to make sure that all the on-call team members have been added to the channel. However, right now we’ll add a single user to the channel.

To do that, create another Mattermost node. Select the credentials that you entered earlier. Select ‘Channel’ as the Resource and click on ‘Add User’ for Operation. Now we have to specify the Channel ID where the user should be added. Since this is another dynamic piece of information, click on Add Expression and in the Variable Selector, select the following:

Nodes > Mattermost > Output Data > JSON > id

Now we will specify a user by selecting ourselves from the dropdown list for User ID. Click on the Execute Node button and you will notice that you have been added to the channel. This node ensures that the specified user is always added to the auxiliary channel created by the workflow.

Here’s a GIF of me following the steps mentioned above.

Creating a Mattermost node to add ourselves to the channel

As an exercise, try using the PagerDuty API to pull a list of the email IDs of the people who are on-call and add them to the auxiliary channel in Mattermost. Feel free to pick this up once you are finished with the tutorial.
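As a starting point for this exercise, here is a rough Python sketch. It assumes the PagerDuty REST API’s on-call listing endpoint (GET /oncalls) with token authentication, and it assumes that requesting include[]=users makes each entry’s user object carry an email field; please verify both against the PagerDuty API reference before relying on them. The function names and the sample response are made up for illustration, and no request is actually sent.

```python
import urllib.parse
import urllib.request
from typing import List

API_BASE = "https://api.pagerduty.com"


def build_oncalls_request(api_token: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that lists the current on-calls."""
    query = urllib.parse.urlencode({"include[]": "users"})
    return urllib.request.Request(
        f"{API_BASE}/oncalls?{query}",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )


def oncall_emails(response: dict) -> List[str]:
    """Collect unique email addresses from a parsed on-calls response.

    Assumes the shape {"oncalls": [{"user": {"email": ...}}, ...]}.
    """
    emails: List[str] = []
    for entry in response.get("oncalls", []):
        email = entry.get("user", {}).get("email")
        if email and email not in emails:
            emails.append(email)
    return emails


# Made-up response, shaped the way this sketch assumes.
sample = {
    "oncalls": [
        {"user": {"email": "ada@example.com"}},
        {"user": {"email": "ada@example.com"}},
        {"user": {"email": "linus@example.com"}},
    ]
}
print(oncall_emails(sample))
```

You would still need to map each email to a Mattermost user before adding them to the channel.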

4. Jira Software node: Triage the issue in Jira

Since the playbook specifies that the issue should also be triaged in Jira, we’ll need to add a node that creates a ticket in Jira. To do that, create a Jira node by clicking on the + button on the top right.

In the Node Editor, enter the Credentials for Jira. Here’s detailed information on how you can create a new API Token for the credentials.

Once you are sorted out with the Credentials, select the Project where the tickets should be created. I selected a test project that I created specifically for this tutorial. In the Issue Type, I selected ‘Story’ but feel free to select ‘Bug’ or something else. Summary is a dynamic piece of information, so select Add Expression and pick the summary variable, just like you did for the Display Name while configuring the Mattermost node that creates the channel.

Click on Execute Node and this will create a Jira ticket for you. Here’s a GIF of me following the steps mentioned above.

Creating a Jira Software node to triage the issue in Jira

5. Mattermost node: Post details in the Incidents channel

The next thing that needs to be done is to post the details of the incident in the Incidents channel. We will need to share the following information in the channel:

  1. Summary of the incident
  2. Link to the Auxiliary channel
  3. Link to the PagerDuty incident
  4. Link to the Jira ticket

Sharing these pieces of information will ensure that if someone outside of the on-call team is interested in checking out what is going on, they can get this information from the Incidents channel.

To do this, create a new Mattermost node. In the Node Editor, select your Credentials. Now we need to enter the Channel ID. Since this is not a dynamic piece of information (the Incidents channel would always be there and hence, the ID will remain the same), we need to grab its Channel ID.

If you don’t already have a channel like this for the tutorial, you can manually create a new channel on Mattermost. To get its ID, click on the down arrow next to the channel name and click on the View Info option. This will reveal the ID of the channel. You can then copy and paste that in the Channel ID field in the node. In the message section, I entered the following expression to include the information that we mentioned in the list above.

🚨 New incident: {{$node["Webhook"].json["body"]["messages"][0]["incident"]["summary"]}}

Auxiliary Channel -> https://mattermost.internal.n8n.io/test/channels/{{$node["Mattermost"].json["name"]}}

PagerDuty Incident -> {{$node["Webhook"].json["body"]["messages"][0]["incident"]["html_url"]}}

Jira Issue -> https://n8n.atlassian.net/browse/{{$node["Jira Software"].json["key"]}}

Finally, click on the Execute Node button to send this information to your Incidents channel. Here’s a GIF of me following the steps mentioned above.

Creating a Mattermost node to post details in the Incidents channel
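To make the structure of this message explicit, here is a small Python sketch that assembles the same four lines from plain values. The function name and the example URLs are hypothetical; in n8n, the expressions in the Message field do this substitution for you.

```python
def incident_message(summary: str, channel_url: str, pagerduty_url: str, jira_url: str) -> str:
    """Build the Incidents channel message, one blank line between entries."""
    lines = [
        f"🚨 New incident: {summary}",
        f"Auxiliary Channel -> {channel_url}",
        f"PagerDuty Incident -> {pagerduty_url}",
        f"Jira Issue -> {jira_url}",
    ]
    return "\n\n".join(lines)


# All values below are placeholders for illustration.
message = incident_message(
    summary="Database is down",
    channel_url="https://mattermost.example.com/test/channels/msg-0001",
    pagerduty_url="https://example.pagerduty.com/incidents/PABC123",
    jira_url="https://example.atlassian.net/browse/TEST-1",
)
print(message)
```

Keeping the base URLs of your own Mattermost and Jira instances in one place like this makes it easier to spot when a link in the message still points at someone else’s domain.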

6. Mattermost node: Post details and action buttons in the auxiliary channel

As a last step of this workflow, we need to provide the information that we talked about in the previous node to the auxiliary channel as well. Moreover, we will need to provide the following two buttons in the channel:

  1. Acknowledge: Clicking this button will change the status of the incident on PagerDuty from ‘Triggered’ to ‘Acknowledged’.
  2. Resolve: Clicking this button will change the status of the incident on PagerDuty from ‘Acknowledged’ to ‘Resolved’ and mark the ticket in Jira to ‘Done’.

To do this, create a new Mattermost node and connect it to the Jira node. This will ensure that this and the previous Mattermost node can run in parallel. In the Node Editor, select your Credentials. Next, you’ll need to enter the Channel ID of the auxiliary channel. You can follow the steps mentioned in Workflow 1, Step 3 to do that. In the Message section, I entered the following expression (this is quite similar to the Message from the previous node):

⚠️ {{$node["Webhook"].json["body"]["messages"][0]["log_entries"][0]["incident"]["summary"]}}

PagerDuty incident: {{$node["Webhook"].json["body"]["messages"][0]["log_entries"][0]["incident"]["html_url"]}}

Jira issue: https://n8n.atlassian.net/browse/{{$node["Jira Software"].json["key"]}}

Now, we need to create the buttons which will trigger the actions that we talked about. To do that, under Attachments, click on the Add attachment button, click on Add attachment item, and select Actions. Then click on the Add Actions button and name it Acknowledge.

Now click on the Add Integration button. This will allow us to give the URL of the webhook this button will trigger on being clicked. We’ll leave this empty for now.

We’ll also need to send details about the PagerDuty incident to the next workflow, so that it knows which incident to mark as acknowledged when the button is clicked. To do that, click on the Add Context to Integration button under the Context section. We’ll enter pagerduty_incident as the Property Name. Since the Property Value is a dynamic piece of information, click on Add Expression. In the Variable Selector, select the following:

Nodes > Webhook > Output Data > JSON > body > messages > [Item: 0] > incident > id

Now, add another button called Resolve and follow the same steps mentioned above. For this button, we’ll need to add the context of both the PagerDuty incident and the Jira ticket key. I’ll leave this as an exercise for you. For the sake of uniformity, you can use jira_key as the Property Name for the Jira ticket key.

In case you were wondering, it is important to send the context with the buttons as there might be multiple auxiliary channels at any given time and multiple people clicking on different Acknowledge and Resolve buttons. We need the correct context so that we don’t close up the wrong PagerDuty incidents and Jira tickets by mistake.

Click on the Execute Node button to send all this information to the auxiliary channel. Here’s a GIF of me following the steps mentioned above.

Creating a Mattermost node to post details and action buttons in the auxiliary channel
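Under the hood, these buttons end up as an attachment with actions on the Mattermost post. Here is a minimal Python sketch of that structure, assuming Mattermost’s interactive message button format (an action with an id, a name, and an integration holding a url and a context). The helper names, webhook URLs, and IDs are hypothetical, and the exact payload that the n8n node sends may differ.

```python
def action_button(name: str, webhook_url: str, context: dict) -> dict:
    """Describe one button: its label, the webhook it calls, and the context it sends."""
    return {
        "id": name.lower(),
        "name": name,
        "integration": {"url": webhook_url, "context": context},
    }


def auxiliary_attachment(ack_url: str, resolve_url: str, incident_id: str, jira_key: str) -> dict:
    """Bundle the Acknowledge and Resolve buttons into a single attachment."""
    return {
        "actions": [
            action_button("Acknowledge", ack_url, {"pagerduty_incident": incident_id}),
            action_button(
                "Resolve",
                resolve_url,
                {"pagerduty_incident": incident_id, "jira_key": jira_key},
            ),
        ]
    }


# Placeholder values for illustration only.
attachment = auxiliary_attachment(
    ack_url="https://n8n.example.com/webhook-test/acknowledge",
    resolve_url="https://n8n.example.com/webhook-test/resolve",
    incident_id="PABC123",
    jira_key="TEST-1",
)
print(attachment)
```

Because each button carries its own context, a click in one auxiliary channel can never be confused with a click in another.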

Workflow 2 — Make sure that the incident is acknowledged

Our second workflow will cover the fourth step of the playbook. Once all the people responsible get notified that an incident has occurred, we need to make sure that there is a quick and easy way to acknowledge the incident so that it is clear that someone in the on-call team has got it.

Let’s get started with the nodes of the second workflow. I have also submitted Workflow 2 on n8n.io, in case you’d like to skim through this workflow. Please note that you’ll still need to configure a couple of things like your credentials as well as the settings of the nodes.

1. Webhook node: Get data from the Acknowledge button

We now need to set up a Webhook node that listens to the event when somebody clicks on the Acknowledge button in the auxiliary channel.

Create a Webhook node the same way you did in Workflow 1, Step 1. Now copy the link of the Test webhook from this Webhook node, go to the node from Workflow 1, Step 6 and paste it in the URL field in the Integration section of the Acknowledge button under Actions.

Once you are done with that, click on the Execute Node button to register the webhook and test it by clicking on the Acknowledge button in the auxiliary channel. Here’s a GIF of me following the steps mentioned above.

Creating a Webhook node to get data from the Acknowledge button

2. PagerDuty node: Acknowledge the incident on PagerDuty

Now we need to get the ID of the incident from the webhook node to know which incident to mark as acknowledged. We get this information from the context that we added to the Integration of the button.

Add a PagerDuty node by clicking on the + button on the right side. In the Node Editor view, first of all, you’ll have to enter the Credentials for PagerDuty. Here’s detailed information on how you can create a new API Token for the credentials. Once you are done with that, select ‘Update’ as the Operation. Since the Incident ID is a dynamic piece of information, click on Add Expression and select the following in the Variable Selector:

Nodes > Webhook > Output Data > JSON > body > context > pagerduty_incident

In the Email field, I have just entered my email. In the Update Fields section, click on the Add Field button and select Status. From the dropdown list in the Status field, select ‘Acknowledged’. Now, click on the Execute Workflow button. Go to the auxiliary channel and click on the Acknowledge button. This will change the status of your incident report from ‘Triggered’ to ‘Acknowledged’. Here’s a GIF of me following the steps mentioned above.

Creating a PagerDuty node to acknowledge an incident on PagerDuty
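For the curious, here is roughly what this step amounts to outside of n8n, as a hedged Python sketch. It assumes the PagerDuty REST API’s incident update call (PUT /incidents/{id} with a From header holding a valid user email, which is presumably why the node asks for an Email) and assumes that the button click arrives with the context we configured under body > context. The function names are mine, the values are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def incident_id_from_click(body: dict) -> str:
    """Read the incident ID from the context sent along with the button click."""
    return body["context"]["pagerduty_incident"]


def build_status_update(incident_id: str, status: str, api_token: str, from_email: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that changes an incident's status."""
    payload = {"incident": {"type": "incident_reference", "status": status}}
    return urllib.request.Request(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email,
        },
    )


# Placeholder click payload and credentials for illustration.
click_body = {"context": {"pagerduty_incident": "PABC123"}}
request = build_status_update(
    incident_id_from_click(click_body), "acknowledged", "secret", "me@example.com"
)
print(request.get_method(), request.full_url)
```

Passing "resolved" instead of "acknowledged" gives the request that the Resolve button would need.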

3. Mattermost node: Confirm the acknowledgment

Now we just need to confirm the change of status of the PagerDuty incident by sending a message to the auxiliary channel. I’ll leave this as an exercise for you. In case you run into any trouble, here’s a GIF of me creating this node.

Creating a Mattermost node to confirm the acknowledgment of the incident on PagerDuty

Workflow 3 — Make sure that everything is marked resolved after the fix

Our third workflow will cover the sixth step of the playbook. Once the issue has been fixed, we need to make sure that the incident on PagerDuty has been marked as ‘Resolved’ and the ticket on Jira has been marked as ‘Done’. We also need to ensure that everyone in the Incidents and the auxiliary channel is aware of the resolution as well.

Let’s get started with the nodes of the third workflow. The nodes of this workflow have been left as an exercise for you. I have added GIFs for the nodes and have also submitted Workflow 3 on n8n.io, in case you run into any trouble. Please note that you’ll still need to configure a couple of things like your credentials as well as the settings of the nodes.

1. Webhook node: Get details from the Resolve button

Just like in the last workflow, we need a Webhook node that listens to the event when somebody clicks on the Resolve button in the auxiliary channel. Here’s a GIF of me creating this node.

Creating a Webhook node to get data from the Resolve button

2. PagerDuty node: Resolve the incident on PagerDuty

Now we need to change the status of the PagerDuty incident from ‘Acknowledged’ to ‘Resolved’. This is very similar to Workflow 2, Step 2. Here’s a GIF of me creating this node.

Creating a PagerDuty node to resolve an incident on PagerDuty

3. Jira Software node: Resolve the incident on Jira

Now we need to update the status of the Jira ticket to ‘Done’. Here’s a GIF of me creating this node.

Creating a Jira Software node to mark the ticket as done on Jira
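If you are wondering what marking a ticket as ‘Done’ means at the API level, here is a minimal Python sketch. It assumes the Jira REST API’s transitions endpoint (POST /rest/api/2/issue/{key}/transitions) and a hypothetical transition ID of "31"; transition IDs depend on your project’s workflow, so look yours up before using it. The function name and base URL are placeholders, and authentication is left out on purpose.

```python
import json
from typing import Tuple


def build_jira_transition(base_url: str, issue_key: str, transition_id: str) -> Tuple[str, bytes]:
    """Return the URL and JSON body for moving an issue through a transition."""
    url = f"{base_url.rstrip('/')}/rest/api/2/issue/{issue_key}/transitions"
    body = json.dumps({"transition": {"id": transition_id}}).encode("utf-8")
    return url, body


# "31" is a made-up transition ID; the issue key would come from the
# jira_key context sent by the Resolve button.
url, body = build_jira_transition("https://example.atlassian.net/", "TEST-1", "31")
print(url)
print(body.decode("utf-8"))
```

The issue key is the same jira_key value that travels with the Resolve button, which is what ties the click back to the right ticket.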

4. Mattermost nodes: Announce the resolution in the auxiliary and Incidents channel

Lastly, we need to create two Mattermost nodes:

  1. To acknowledge in the auxiliary channel that the incident report on PagerDuty and the ticket on Jira have been resolved.
  2. To announce in the Incidents channel that the incident has been resolved.

Here’s a GIF of me creating these nodes.

Creating two Mattermost nodes to announce the resolution in the auxiliary and Incidents channel

Congratulations, you successfully built an automated incident response workflow using n8n, PagerDuty and Mattermost 🎉

Let’s run the whole system end to end. First of all, you’ll have to click on the Execute Workflow button on all three workflows to register the Webhook nodes. Go ahead and get started by creating a new incident on PagerDuty.

Incident response workflow in play

Now, to make sure that the workflow runs permanently without you having to press the Execute Workflow button on all three workflows before each incident creation, we’ll need to use the Production webhook.

To do that, you’ll just need to get the Production webhook URL from the different Webhook nodes, update the URLs on PagerDuty and the Mattermost node from Workflow 1, Step 6, save the workflows and finally activate the workflows. This will make your workflows ready to use.

Note: When working with a Production webhook, please ensure that you have saved and activated the workflow. Don’t forget that the data flowing through the webhook won’t be visible in the Editor UI with the Production webhook.
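If you have several webhook URLs to switch over, a tiny helper can save some copy-pasting. This Python sketch assumes n8n’s default URL scheme, where Test URLs contain /webhook-test/ and Production URLs contain /webhook/ instead; if your instance is configured with custom endpoints, copy the Production URL from the Webhook node rather than deriving it. The function name and host are hypothetical.

```python
def to_production_url(test_url: str) -> str:
    """Turn a Test webhook URL into its Production counterpart (default n8n paths)."""
    marker = "/webhook-test/"
    if marker not in test_url:
        raise ValueError(f"not a Test webhook URL: {test_url}")
    # Replace only the first occurrence so the rest of the path stays untouched.
    return test_url.replace(marker, "/webhook/", 1)


# Placeholder host; "webhook" is the Path chosen in Workflow 1, Step 1.
print(to_production_url("https://n8n.example.com/webhook-test/webhook"))
```

Remember that the derived URL only works once the workflow has been saved and activated.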

Conclusion

Today we created an automated incident response workflow using a variety of n8n nodes. The first-class support for webhooks and APIs allows n8n to integrate a very wide array of services and products, and to create powerful workflows in a simplified way. This was an example of automating a minimalist incident response playbook. Which other services are you using for managing incidents in your organization? In case you have created other workflows with n8n that use different nodes, I’d love to check them out. Please consider sharing those workflows with the community.

In case you’ve run into an issue while following the tutorial, feel free to reach out to me on Twitter or ask for help on our forum 💙

