When evaluating tools to optimize some or all of your cloud incident remediation workflows and runbooks, there are several factors to consider. This article breaks down the capabilities needed for effective optimization and directly compares the companies leading this new and expanding area.
Disclosure: I work at Fylamynt, one of the offerings, but will keep the comparison to facts that can be known based on company web sites and information from customers.
This article will compare several offerings in the market: Fylamynt, FireHydrant, Blameless, Rundeck and Transposit.
One of the first and most important factors to consider is how many third-party integrations the platform has, and how easy they are to use. SREs have a plethora of options at their fingertips for monitoring, data collection, incident tracking, and more.
First, let’s take a raw look at what integrations the platforms support.
Y — Only available on Advanced and Enterprise plans.
Community — Not supported as an enterprise integration by Rundeck.
Be careful not to treat supported integrations as a simple checkbox. For example, both Fylamynt and Rundeck list Datadog logos on their integrations pages, but not all integrations are created equal. Let’s break each offering down.
FireHydrant: FireHydrant is an incident management system that helps engineers manage incidents by creating tickets and setting up Slack and Zoom channels as required. However, it lacks incident response features that help engineers remediate the problems.
Blameless: Blameless is a post-mortem tool that’s typically used after an incident has been resolved to understand what happened during the resolution. The tool shows a timeline view of events, but doesn’t help in resolving the problem itself.
Rundeck: Rundeck (acquired by PagerDuty) was founded in 2010 and was originally targeted at running multiple scripts (e.g. bash, Python) together in a single pipeline. It’s a tool that’s typically run by an engineer using a CLI. Rundeck lacks integrations with cloud-native SaaS services and the API-driven, event-triggered automation that’s more common in today’s incident response workflows.
Transposit: Transposit has changed their message over time from being a ServiceNow kind of platform for IT to helping SREs resolve incidents. It’s unclear exactly which features their tool supports. They claim to help with responding to alerts, but it’s unclear whether that goes beyond showing a timeline view to actually resolving the incident.
Fylamynt: Fylamynt provides a no code / low code drag and drop editor for all of their supported integrations. Within minutes you can drag your favorite tool or service into the editor, wire it to another, and make small configuration changes in the GUI. Fylamynt has simplified the use of each product and service API, allowing engineers to wire them up in no time.
In the past, creating workflows/runbooks has typically meant writing code against your tools’ APIs and creating your own branches and customizations. This works in many cases and can even get quite complex, but there are several potential issues.
- Error prone — humans make mistakes (just look at the Facebook outage)
- Time consuming — looking up all the APIs, writing the code, testing all of the connections
- Maintenance — APIs and capabilities change, as well as how you want your runbooks to behave
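To make these issues concrete, here is a minimal sketch of what a hand-coded runbook tends to look like. It is only an illustration: `Runbook`, `StepResult`, `fetch_cpu_metric` and `restart_if_overloaded` are hypothetical names, and the steps use placeholder values where real code would call your monitoring and cloud provider APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str = ""


@dataclass
class Runbook:
    """Runs steps in order and stops at the first failure."""
    steps: List[Callable[[dict], StepResult]] = field(default_factory=list)

    def run(self, context: dict) -> List[StepResult]:
        results = []
        for step in self.steps:
            try:
                result = step(context)
            except Exception as exc:
                # e.g. an API call failed or its response changed shape
                result = StepResult(step.__name__, False, str(exc))
            results.append(result)
            if not result.ok:
                break
        return results


# Hypothetical steps: real ones would call your monitoring and cloud APIs.
def fetch_cpu_metric(context: dict) -> StepResult:
    context["cpu_percent"] = context.get("fake_cpu_percent", 0.0)
    return StepResult("fetch_cpu_metric", True, f"cpu={context['cpu_percent']}")


def restart_if_overloaded(context: dict) -> StepResult:
    # Conditional branch: only remediate above the threshold.
    if context["cpu_percent"] > context.get("threshold", 90.0):
        context["restarted"] = True
        return StepResult("restart_if_overloaded", True, "restart requested")
    context["restarted"] = False
    return StepResult("restart_if_overloaded", True, "no action needed")
```

Even this small example needs its own error handling, branching and state passing, and every real API call added to it is another thing to look up, test and keep current.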
Considering the above issues, a quick and error-free way to build your workflows/runbooks is ideal. Fylamynt has a no code / low code drag and drop builder that includes all of the integrations they support. Engineers can drag nodes into the editor, wire them up, and with minor config changes in the UI they’re ready to go. You can also add more complex elements like conditional branches, custom code and input/output transformation.
When looking at solution limitations, it’s helpful to consider a few fundamental pillars needed to effectively reduce cloud incident remediation time.
Rarely is it solely up to a single individual to respond to, solve and report on incidents. When incidents are serious and need to be escalated, or require a subject matter expert with knowledge beyond that of the on-call engineer, collaboration needs to happen. Being able to quickly spin up a Slack channel or a Zoom call and drop it right in front of the pre-defined set of people who need to be there is a huge time saver. You can see in the feature chart above that many of the platforms do not have this ability.
Any adequate solution must be able to automate portions of the remediation process. At a minimum, it must assemble the relevant data and put it in front of an SRE. Otherwise, countless hours will be wasted.
Being able to quickly and easily build and modify workflows/runbooks, with the ability to easily integrate your tools is critical. The difference between dragging and dropping your steps together vs. writing custom code that could have errors, not scale, and not restrict permissions is quite large.
Being able to see your incidents in a dashboard, along with the steps that have been taken, their success or failure, and how long each took, is very important. In Fylamynt we call these tasks, and within the dashboard you can see all of the workflow executions, their state, and the time each took. Having this real-time view into the state of incidents is critical.
Fully automating your workflow/runbook might sound amazing. Imagine never being woken up when something breaks in the middle of the night. However, many people are wary of full automation, especially for actions like taking down services or VMs. To address this, all of the solutions allow you to put a human in the loop, which means the runbook will pause and wait for a human decision. Much of the data gathering can be automated ahead of time so everything the SRE needs to make that decision is at their fingertips.
With Fylamynt you can add human approval into your workflow with a Slack message or email.
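Independent of any one product, the human-in-the-loop pattern can be sketched in a few lines. This is a generic illustration, not Fylamynt's API or any vendor's: `run_with_approval` and its three callbacks are hypothetical, and in practice `request_decision` would post to Slack or email and block until someone answers or a timeout expires.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMED_OUT = "timed_out"


def run_with_approval(
    gather_context: Callable[[], Dict[str, str]],
    request_decision: Callable[[Dict[str, str]], Optional[bool]],
    remediate: Callable[[], str],
) -> Dict[str, str]:
    """Gather data automatically, then wait for a human before acting.

    request_decision returns True (approve), False (reject),
    or None when nobody answered in time.
    """
    # Automated part: collect everything the SRE needs to decide.
    context = gather_context()

    # Human part: the workflow pauses here until a decision comes back.
    answer = request_decision(context)

    if answer is None:
        return {"decision": Decision.TIMED_OUT.value, "action": "none"}
    if not answer:
        return {"decision": Decision.REJECTED.value, "action": "none"}

    # Only an explicit approval triggers the disruptive action.
    return {"decision": Decision.APPROVED.value, "action": remediate()}
```

The design choice worth noting is that a timeout and a rejection both result in no action, so a disruptive step such as restarting a VM runs only after an explicit approval.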
Fylamynt has created the world’s first enterprise ready low code platform for building, running and analyzing SRE cloud workflows. With Fylamynt an SRE can automate the parts of the runbook that are the most time consuming, allowing them to make decisions where their expertise is needed.