DEV Community: Scott Lasica

Fylamynt and Squadcast Team Up To Handle Cloud Incident Response, Management, and Remediation

Scott Lasica — Tue, 30 Nov 2021 16:47:24 +0000

As much as every company dreams of cloud operations running perfectly all the time, even junior operations people know the reality: there are issues, things break, and they have to be dealt with constantly. Savvy operations teams prepare for this eventuality and, together with industry-leading incident management, incident response and incident remediation tools, are able to minimize user-facing issues and especially dreaded downtime.

This is where a modern incident management & response platform like Squadcast comes to the rescue, helping organizations in their journey to deliver super-reliable services. Organizations can quickly and easily adopt Site Reliability Engineering (SRE) practices to improve their incident resolution metrics and ultimately, the reliability of their systems.

The first step towards doing better incident management is adding enough context to incidents as they get detected. With Squadcast, teams can discover everything they need to take action and achieve best-in-class MTTD (Mean Time To Detect) with highly configurable features like [alert deduplication and tagging](https://www.squadcast.com/effective-on-call-and-incident-response), helping on-call teams streamline high-priority alerts and stay productive. Teams can also collaborate in real time with virtual incident war rooms on Squadcast to get the right responders virtually in one place, making operations transparent.

Obviously the story doesn’t end once an incident has been created, routed and enriched. The incident still needs to be remediated. This is where Fylamynt steps in as the perfect complement to Squadcast. Fylamynt provides a no-code, drag and drop interface for building workflows (runbooks) that can be triggered in a number of ways, including by a Squadcast incident.

Fylamynt integrates with over 40 commonly used tools for dealing with cloud operation incidents and handles all the API calls. The end result is a fully or partially automated workflow that will run in a consistent manner every time.

Workflow as Code

We call this “workflow as code” because our user interface gets out of your way and lets you switch seamlessly between drag-and-drop and coding scripts in Python and JSON, without loss of information.
You can select from a comprehensive library of connectors and automated actions to connect any part of your cloud. You can then combine those actions into a workflow that solves a specific business task, such as fixing an incident that caused the website to be down.

By automating the parts of the workflow that are the most tedious and time consuming, SRE teams can focus their expertise where it’s needed to make those critical decisions. We call this “human in the loop”: the workflow pauses and can send a message through Slack or another channel. The SRE can then click a link and have all the needed information at their fingertips, allowing them to quickly make the decision on what to do next (for example, transferring traffic to a new instance or destroying an instance that was spiking CPU).

Another added benefit of defining and automating your workflows is that less experienced support engineers can handle more issues, freeing time for the more senior staff as well as repairing issues more quickly.

Fylamynt also provides a dashboard that shows all executed and currently executing workflows, with tons of detail about every step that ran, what the inputs and outputs were and what branches and actions were taken.

At this point you can pop back into Squadcast to handle your incident postmortem — the next logical step after any incident is to dissect and analyze the why, how and the what of the incident. Squadcast’s incident postmortem feature helps build an insightful timeline in a matter of minutes. This is especially useful as automation ensures that you can quickly have a system-generated postmortem for pretty much any incident.

One of the core principles of SRE is Transparency and Squadcast’s Status Page helps you communicate to customers and stakeholders with real-time updates. By configuring your public-facing services and their dependent components, you can show their status in real-time directly within Squadcast.

Squadcast’s native mobile application also helps in triggering remediations from anywhere. Teams can also connect via APIs to enhance incident response by bringing their entire toolchain into one platform.

Together Squadcast and Fylamynt provide the end-to-end solution for handling cloud operations incidents, helping your end users to experience a consistently delightful application experience. Teams can practice site reliability engineering through better Incident Management to proactively respond, resolve, and learn from every incident.

Try Squadcast Free →

Try Fylamynt Free →

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using virtual incident war rooms, and automate repetitive tasks to eliminate toil. Organisations can quickly and easily adopt Site Reliability Engineering practices to improve their incident resolution metrics and ultimately, the reliability of their systems.

Fylamynt has created the world’s first low code incident response and remediation platform for building, running and analyzing SRE cloud workflows. With Fylamynt an SRE can automate the parts of the runbook that are the most time consuming, allowing them to make decisions where their expertise is needed.

Incident Remediation With Jenkins and Terraform

Scott Lasica — Wed, 10 Nov 2021 18:38:24 +0000

Experienced DevOps personnel are very familiar with tools like Jenkins to create workflows and Terraform to automate orchestration. But are these the best tools to use when firefighting production cloud incidents?

What is Jenkins?

Jenkins is an open source automation server for DevOps. Jenkins has ~1800 plugins that support many of the tools used in build and deployment scenarios. The plugins cover build management, source code management, administration, platforms and UI. Jenkins was designed specifically for CI/CD (continuous integration / continuous delivery) environments as well as automating other routine development tasks.

Jenkins still requires scripts to be written for the steps, but gives a framework for integrating the entire chain of build / test / deploy. These "pipeline scripts" are stored in a file called Jenkinsfile, which is stored in your repo.

What is Terraform?

Terraform is an open source infrastructure as code (IaC) software tool. Terraform allows you to write code in a higher level language to manage operations in the cloud. Terraform supports ~100 cloud providers, and gives you the ability to create new resources, manage existing ones and destroy those that are no longer needed.

Terraform has a concept called modules. Terraform modules are like functions in programming languages. They provide a standard interface (input/output) for creating resources. Essentially, modules allow for consistent (and debugged) common actions - again just like you'd create a function that encapsulates many actions to perform a higher level action.

Are Jenkins and Terraform suitable for incident remediation?

To answer this question, we can look at the tools used to respond to and resolve cloud incidents. First, a monitoring tool needs to detect the issue. Popular products in this space include Datadog and New Relic. When inspecting the Datadog provider for Terraform, you quickly learn that Terraform is simply configuring and deploying Datadog resources. When you get to the next step in resolution, you typically use an incident management tool like PagerDuty or Opsgenie. Inspecting the Terraform providers for those tools reveals the same situation. Terraform is designed primarily for creation, configuration and destruction of cloud resources due to its declarative nature.

Could Terraform be used to automate portions of a cloud incident runbook/workflow? Absolutely, but since this wasn't the intended use case, a lot of custom code will need to be written to tie the tools together. That code not only requires ongoing maintenance but also opens the door to edge cases and bugs. Facebook's outage in late 2021 is a classic example of this problem. They stated they had written code to check for errors in deployment scripts but that code had a bug in it, and allowed the error to propagate across the entire Facebook/Instagram/WhatsApp footprint, cutting it off from the Internet.

Now take a look at Jenkins. Again, incident response and remediation was never the intended use case for Jenkins. It excels at CI/CD automation, making the lives of developers and DevOps personnel much easier. However, using it for remediation is even more of a square-peg-in-a-round-hole approach. The pipelines do operate like workflows, but have none of the logic or connections built in for the remediation steps required. You would essentially be writing most of the code required to make this work, and at that point you might as well just ditch Jenkins and wire everything together by hand.

Fylamynt has created the world's first enterprise ready low code platform for building, running and analyzing SRE cloud workflows. With Fylamynt an SRE can automate the parts of the runbook that are the most time consuming, allowing them to make decisions where their expertise is needed.

Try Fylamynt for free ->

IR - Incident Response, Repair, Resolution or Remediation?

Scott Lasica — Wed, 03 Nov 2021 21:26:48 +0000

What does IR stand for?

Many people consider it to be Incident Response. In this case an incident refers to something not operating correctly in a cloud environment. The issue could be small - say a performance impact that doesn’t really affect successful operations. Alternatively, the impact could be massive - an outage that has ceased all operations with data loss, revenue loss, and damaged reputation (consider the recent Facebook outage or Roblox outage).

Incident Response

When an incident occurs, organizations will typically be alerted from a monitoring tool. These tools have parameters and ranges of acceptable use (think too many 500s, or too long to respond). The tool will likely then trigger an alert, which could come from another tool. The alerting tool could be simple (page everyone) or there could be sophisticated rules to page just the on-call team, or the subject matter experts for the type of incident. Next comes the much harder part, fixing the problem.
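The threshold-and-routing logic described above can be sketched in a few lines of Python. This is only an illustration and not any monitoring product's API: the `Metrics` fields, the threshold values, and the team names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Metrics:
    """A hypothetical snapshot of service health over one time window."""
    error_5xx_count: int
    p95_latency_ms: float


# Hypothetical acceptable ranges; real values depend on the service.
MAX_5XX_PER_WINDOW = 50
MAX_P95_LATENCY_MS = 800.0

# Hypothetical routing rules: which group gets paged for which breach.
ROUTING = {
    "too_many_500s": "backend-on-call",
    "slow_responses": "performance-experts",
}


def detect_incident(m: Metrics) -> list[str]:
    """Return the names of the thresholds that were breached (empty if healthy)."""
    breaches = []
    if m.error_5xx_count > MAX_5XX_PER_WINDOW:
        breaches.append("too_many_500s")
    if m.p95_latency_ms > MAX_P95_LATENCY_MS:
        breaches.append("slow_responses")
    return breaches


def route_alerts(breaches: list[str]) -> set[str]:
    """Map each breach to a team to page; unknown breaches go to the on-call team."""
    return {ROUTING.get(b, "backend-on-call") for b in breaches}
```

Calling `route_alerts(detect_incident(Metrics(120, 950.0)))` would page both hypothetical teams, while a healthy snapshot pages nobody.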

Incident Repair, Incident Resolution, Incident Remediation

These three terms all refer to the same basic result: the incident is fixed and operations are back to normal. Depending on the incident, the fix could be very simple, or it could take hours or days. The amount of time to repair the incident is called MTTR. Obviously, you want your MTTR to be as low as possible, and you want to consider more advanced tools and methodologies to achieve that. Reducing MTTR is one of the key objectives of a site reliability engineer (SRE).

How are incidents resolved?

Savvy organizations start by creating runbooks. These runbooks are basically an instruction manual on what to do, in what order, to remediate the incident. Simple incidents could be handled by level 1 support personnel, while multi day outages will be all hands on deck.

Runbooks can have many steps in them, but a typical set of high level steps are as follows:

Type of incident, what services are affected
How to collect the data and logs to verify the incident
What to do to correct the incident (this could be pages)
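One way to picture those three high-level parts is as a small data structure. The sketch below is an assumption made for illustration: the class, its field names, and the example incident are hypothetical and do not describe how Fylamynt or any other tool stores runbooks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Runbook:
    """A runbook holding the three high-level parts listed above (names are illustrative)."""
    incident_type: str
    affected_services: list[str]
    verification_steps: list[str] = field(default_factory=list)
    remediation_steps: list[str] = field(default_factory=list)

    def checklist(self) -> list[str]:
        """Flatten the runbook into one ordered list: verify first, then fix."""
        steps = [f"VERIFY: {s}" for s in self.verification_steps]
        steps += [f"FIX: {s}" for s in self.remediation_steps]
        return steps


# Hypothetical example runbook.
disk_full = Runbook(
    incident_type="database out of storage",
    affected_services=["orders-db"],
    verification_steps=["check free storage metric", "inspect recent error logs"],
    remediation_steps=["increase allocated storage", "confirm writes succeed"],
)
```

Keeping verification ahead of remediation in `checklist()` mirrors the order of the list above: confirm the incident is real before changing anything.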

At Fylamynt, we call runbooks a workflow. Fylamynt has built the world’s first enterprise ready low code platform for building, running and analyzing SRE cloud workflows. With Fylamynt an SRE can automate the parts of the runbook that are the most time consuming, allowing them to make decisions where their expertise is needed, removing mistakes and simple errors.

Try Fylamynt for free ->

From Ad-hoc Scripting to Workflow as Code: The Evolution of Runbooks

Scott Lasica — Mon, 01 Nov 2021 23:00:11 +0000

Unfortunately the word workflow has been used for many years to represent some very specific things in the business world (the most common being BPMN, Business Process Model and Notation). However, at a general level it’s simply describing a set of steps done in a specific order to achieve the desired end result.

Workflow as code simply means that we’re using code to orchestrate and execute a workflow, very likely in a distributed environment. In the site reliability engineering (SRE) or cloud engineering space, these workflows tend to deal with things like cost savings and incident resolution.

In the early days of SRE (when it was still called DevOps), the ability to chain together specified actions with code was a much more daunting task. Let’s take what seems like a simple example: a database instance is out of storage. Assuming the engineer had the appropriate monitoring in place, they would be alerted. At that point they need to verify that it’s not a false alarm, spin up a new larger instance, copy the data over into the new one, verify the data integrity, redirect all the services using the old db to the new one, verify services are operating normally then destroy the old db instance. Engineers realized situations like this will happen often enough that they can automate some of these steps, writing code between them to at least do things like verification steps automatically.

Moving forward to modern day, there are tools that can help with many of these steps. As an example, you could have PagerDuty collect data from CloudWatch and generate an incident, then use code to modify the database instance storage capacity. With things like AWS RDS, the steps of create, copy, destroy aren’t needed as they can resize on the fly. Still, the code you write to connect these services together will be custom, will need to be maintained and could contain bugs. Using another tool to build the workflows, connecting the services together for you and handling the orchestrated execution once put into production, is ideal.

Fylamynt has created the world’s first enterprise ready low code platform for building, running and analyzing SRE cloud workflows. With Fylamynt an SRE can automate the parts of the runbook that are the most time consuming, allowing them to make decisions where their expertise is needed. With over 40 prebuilt integrations and more than 60 sample workflows to cover common SRE workflow needs, getting up and running takes no time at all.

Try Fylamynt for free ->

What is MTTR (Mean Time To Repair)?

Scott Lasica — Mon, 01 Nov 2021 22:52:06 +0000

When using cloud-native services, you will undoubtedly have cloud incidents that disrupt the normal operation of your systems. No SRE team believes they can achieve 100% uptime. Instead, they plan ahead, trying to anticipate what could go wrong (or has in the past) and create runbooks (sometimes called pipelines or workflows) to get things back to normal as quickly as possible.

MTTR is a metric used by SRE teams to better understand how often incidents occur, and how quickly they are repaired. The first three letters are always read as Mean Time To, but the R is interchanged between Repair, Respond, Resolve, Remediate, and Recover. MTTR can also sometimes be used in customer contracts, with consequences when exceeded. Keep in mind that MTTR represents a typical repair time, not a guaranteed one, so when reviewing a vendor’s MTTR know that some incidents will resolve more quickly, and others will take longer.

MTTR is calculated differently by many organizations. The key is consistency within the organization, and that it’s a meaningful metric that can be used to help the SRE team improve their results (and reduce the MTTR). When you hear someone talk about MTTR, it’s always a good idea to get clarification to ensure you’re on the same page and your discussion makes sense.

However calculated, a low MTTR is obviously a good thing and indicates either a robust and resilient set of services, a sharp and quick to respond team or both. MTTR can be measured in whatever units make sense (minutes, hours, days).
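Since organizations calculate MTTR differently, the following is just one possible definition, assumed here for illustration: the arithmetic mean of (resolved time minus detected time) across incidents. The function name and the sample incidents are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the (resolved - detected) duration over a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical (detected, resolved) pairs.
incidents = [
    (datetime(2021, 11, 1, 9, 0), datetime(2021, 11, 1, 9, 30)),   # 30 minutes
    (datetime(2021, 11, 3, 14, 0), datetime(2021, 11, 3, 16, 0)),  # 120 minutes
    (datetime(2021, 11, 7, 2, 15), datetime(2021, 11, 7, 2, 45)),  # 30 minutes
]
mttr = mean_time_to_repair(incidents)
mttr_minutes = mttr.total_seconds() / 60  # (30 + 120 + 30) / 3 = 60.0
```

Note how the single two-hour incident pulls the mean well above the two 30-minute ones, which is why a vendor's MTTR should be read as typical rather than guaranteed. A team that starts the clock at acknowledgement instead of detection would get a different number from the same data.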

What contributes to MTTR?

A typical SRE workflow after an outage or disruption in the service is detected involves multiple steps that contribute to the end-to-end response time.

The first step is to troubleshoot to determine the root cause of the problem. ZK Research found that 90% of the time spent in MTTR is spent identifying the source of the problem.

When an incident occurs, the responder often has to first acknowledge the alert (unless you don’t have monitoring and alerts and you learn of outages from your users), then gather the appropriate system information, then find the runbook that should be used to start the repair. At this point you hope that the person following the runbook is using the right one, for the right environment, and has the right permissions to run it.
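Rather than hoping the responder picks the right runbook, that lookup can itself be encoded. The sketch below is a hypothetical registry keyed by alert type and environment, with a simple role check; none of the names come from a real tool.

```python
from __future__ import annotations

# Hypothetical registry: (alert type, environment) -> runbook name and required role.
RUNBOOKS = {
    ("db_storage_full", "production"): {"name": "resize-prod-db", "required_role": "sre"},
    ("db_storage_full", "staging"): {"name": "resize-staging-db", "required_role": "developer"},
}


def select_runbook(alert_type: str, environment: str, user_roles: set[str]) -> str:
    """Return the runbook name for this alert and environment, or fail loudly."""
    key = (alert_type, environment)
    if key not in RUNBOOKS:
        raise LookupError(f"no runbook for {alert_type} in {environment}")
    entry = RUNBOOKS[key]
    if entry["required_role"] not in user_roles:
        raise PermissionError(f"{entry['name']} requires role {entry['required_role']}")
    return entry["name"]
```

Failing with an explicit error when no runbook matches, or when the responder lacks the role, surfaces the problem at the start of the response instead of halfway through a repair.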

How do you reduce your MTTR?

As you can see above, there are multiple steps involved in responding to an event that requires the SRE to interact with multiple services. SREs should continually look for repeatable processes they can automate with code. By doing so they reduce human error, have a consistent approach to incident remediation regardless of who is handling it, and can in many cases greatly speed up the time to resolve, thereby reducing MTTR. By tying together monitoring, alerting, and data collection, the SRE can have everything they need at their fingertips to make the call on next remediation steps. They can get even more advanced by having a Slack channel spun up or a Zoom meeting created, adding the right people for the severity and type of issue that occurs.

Try Fylamynt for free ->

A Comparison of SRE Workflow Tools

Scott Lasica — Mon, 01 Nov 2021 22:48:34 +0000

When considering tools to help optimize parts or all of your cloud incident remediation workflows and runbooks, there are several factors to take into consideration. This article will break down several capabilities to successfully achieve good optimization, and will do a direct comparison between the companies leading the charge into this new and expanding area.

Disclosure: I work at Fylamynt, one of the offerings, but will keep the comparison to facts that can be known based on company web sites and information from customers.

This article will compare several offerings in the market: Fylamynt, FireHydrant, Blameless, Rundeck and Transposit.

Integrations

One of the first and most important factors to consider is how many third party integrations the platform has, and how easy they are to use. SREs have a plethora of options at their fingertips to handle monitoring, data collection, incident tracking, and many others.

First, let’s take a raw look at what integrations the platforms support.

Y — Only available on Advanced and Enterprise plans.
Community — Not supported as an enterprise integration by Rundeck.

Be careful not to simply check the box on supported integrations. For example, both Fylamynt and Rundeck list Datadog logos on their integrations page, but not all integrations are created equal. Let’s break each offering down.

FireHydrant: FireHydrant is an incident management system that helps engineers manage incidents by creating tickets and setting up Slack and Zoom channels as required. However, it lacks incident response features that help engineers remediate the problems.

Blameless: Blameless is a post-mortem tool that’s typically used after an incident has happened and been resolved, to understand what happened during the incident resolution. The tool shows a timeline view of what happened, but doesn’t help in resolving the problem itself.

Rundeck: Rundeck (acquired by PagerDuty) was a tool that was founded in 2010, and was originally targeted towards running multiple scripts (e.g. bash, Python) together in a single pipeline. It’s a tool that’s typically run by an engineer using a CLI. Rundeck lacks integrations with cloud-native SaaS services and API-driven, event-triggered automation that’s more common in today’s incident response workflows.

Transposit: Transposit has changed their message over time from being a ServiceNow kind of platform for IT to helping SREs resolve incidents. It’s unclear what exact features they support in their tool. They claim to help with responding to alerts, but it is unclear whether that goes beyond showing a timeline view to resolving the incident itself.

Fylamynt: Fylamynt provides a no code / low code drag and drop editor for all of their supported integrations. Within minutes you can drag your favorite tool or service into the editor, wire it to another and in the GUI make small configuration changes. Fylamynt has simplified the use of all the product and service APIs allowing engineers to wire them up in no time.

Building The Workflows

Creating your workflows/runbooks in the past has typically meant writing code against your tools' APIs, creating your own branches and customizations. While in many cases this works and in fact can even get quite complex, there are several potential issues.

Error prone — humans make mistakes (just look at the Facebook outage)
Time consuming — looking up all the APIs, writing the code, testing all of the connections
Maintenance — APIs and capabilities change, as well as how you want your runbooks to behave

Considering the above issues, providing a quick and error free way to build your workflows/runbooks is ideal. Fylamynt has a no code / low code drag and drop builder that includes all of the integrations they support. Engineers can drag nodes to the editor, wire them up and with minor config changes in the UI they’re ready to go. You can add more complex things like conditional branches, custom code and input/output transformation.

Limitations

When looking at solution limitations, it’s helpful to consider a few fundamental pillars needed to effectively reduce cloud incident remediation time.

Collaboration

Rarely is it solely up to a single individual to respond to, solve and report on incidents. When incidents are serious and need to be escalated, or require the expertise of a subject matter expert with knowledge beyond that of the on call engineer, collaboration needs to happen. Having the ability to spin up a Slack channel or a Zoom meeting quickly, dropped right in front of the pre-defined set of people who need to be there, is a huge time saver. You can see in the feature chart above that many of the platforms do not have this ability.

Automation

Any adequate solution must be able to automate portions of the remediation process. At a minimum assembling the relevant data to put in front of an SRE is required. If not, countless time will be wasted.

Orchestration

Being able to quickly and easily build and modify workflows/runbooks, with the ability to easily integrate your tools is critical. The difference between dragging and dropping your steps together vs. writing custom code that could have errors, not scale, and not restrict permissions is quite large.

Case Management

Having the ability to see your incidents in a dashboard, with the steps that have been taken, success/failure and the time things have taken is very important. In Fylamynt we call these tasks, and within the dashboard you can see all of the workflow executions, state and time each took. Having this realtime view into the state of incidents is critical.

Human In The Loop

Fully automating your workflow/runbook might sound amazing. Imagine never being woken up when something breaks in the middle of the night. However, many people get wary of full automation, especially when you have actions like taking down services or VMs. To help with this, all of the solutions allow you to put a human in the loop, which means the runbook will pause and wait for a human decision. Much of the data gathering can be automated ahead of time so everything the SRE needs to make that decision is at their fingertips.

With Fylamynt you can add human approval into your workflow with a Slack message or email.

Try Fylamynt for free ->

Can I Automate Away SRE Roles?

Scott Lasica — Mon, 01 Nov 2021 22:40:53 +0000

The word automation brings some strong emotions to the surface for many. It could elicit joy from automating mundane tasks, but it can also create fear and mistrust. There has been extensive history and research on automation being brought into many industries.

I’ll start in 1811 England. There was a new invention called a loom, allowing lower skilled laborers to operate and produce lower quality products that ruined the artisans’ reputation for quality. The name Luddites was coined, and this group of people went on to physically smash looms eventually causing Parliament to make frame-breaking a hanging offense. The industrial revolution continued in spite of the Luddites and a whole new role was born: the factory worker, which exploded in numbers creating many more jobs than those that were displaced.

Let’s jump ahead to more modern automation. Computers brought amazing automation to just about every industry on the planet. Think of the efficiencies brought to accounting, manufacturing, media and many, many more. Taking a look at the US Bureau of Labor Statistics data (dating back to 1980), employment levels tracked very closely with major events, dipping for things like the housing crisis and climbing during times like the dot-com bubble. They do not seem to correlate with large shifts in technology. In fact, we are currently experiencing a worldwide labor shortage as new advancements in AI and other technologies “threaten” to take over jobs. The Economic Policy Institute posted in 2017 that there is no evidence that automation leads to joblessness or inequality.

Every industry, every role faces some type of automation “intrusion” at some point. The role of a Site Reliability Engineer (SRE) is no different. Just as an SRE wouldn’t consider customer complaints the best way to learn of an outage, an SRE also wouldn’t want to shy away from automation where it makes sense. Part of an SRE’s duties is to create automation for cloud-native systems in order to reduce MTTR and create organization wide optimization.

So can we automate all the SRE duties and eliminate the role? Far from it. Google posted a great article on SRE automation, how they thought they were automating themselves out of jobs when in fact it turned out to free up time to focus on things that could help the business instead of constant tedious tasks or firefighting.

Don’t be a Luddite.

Try Fylamynt for free ->

What is SRE (Site Reliability Engineering)?

Scott Lasica — Mon, 01 Nov 2021 22:37:32 +0000

Introduction

Site reliability engineering (SRE) is a software engineering (developer) approach to IT operations (ops). SRE teams manage systems, handle scale, firefight incidents/problems and automate some operational tasks.

SRE was coined by the Google engineering team, when they realized that the duties and responsibilities required had deviated significantly from traditional IT/DevOps. One of the key differences is the use of code to help solve problems within cloud-native systems and infrastructure.

Any system that requires high availability and/or scalability needs SRE as a dedicated practice.

SRE can also stand for site reliability engineer, the individual who handles site reliability engineering. SREs perform many tasks and are focused on the production cloud environment. Some of the common tasks an SRE will perform are:

Scaling the system
Optimizing cloud spend
Remediating incidents (when things break)
Automation
Standardization
Patching and upgrades

SREs will often write custom code (software) to link systems together, and will create workflows (often called runbooks) to help automate parts or all of the cloud system needs.

What does an SRE do?

At a high level an SRE is responsible for ensuring the systems run 24/7 and can handle scale as needed. To achieve this requires a lot of tools and expertise, not to mention often times having to “carry the pager” and handle incidents any time of the day or night.

Historically SREs came from the software development or sysadmin worlds and became a bit of a hybrid of the two. There are several areas that SREs are responsible for.

Deployment — How code is deployed into the production environment.
Monitoring — Using systems to monitor proper operations.
Alerting — Using tools to alert the appropriate people when systems aren’t functioning properly (or are at risk of not functioning properly).
Configuration — Configuring systems appropriately for optimal performance or cost reduction.
Performance — Keeping latency of systems within acceptable limits.
Change management — Keeping track of changes in systems both as a historical record but also in many cases to comply with industry standards and certifications
Emergency response — Quickly reacting to and mitigating cloud incidents as they happen
Optimization — Optimizing systems, often with automation, to reduce MTTR (Mean Time To Recovery/Repair/Resolution) — when things break, fix them as quickly as possible.

One of the primary outputs of an SRE is the runbook, or workflow. There are many situations that happen repeatedly, so it of course makes sense to create a repeatable process to handle these situations. Tying steps together in an automated way is how SREs optimize their processes. Common workflows will deal with things like cost optimization or incident remediation. For example, an SRE might create a workflow that runs on a daily basis for cost optimization (autoscaling). A simplified workflow for this could have the following steps:

Check instance utilization
If usage has remained under 50% for the last 24 hours reduce instance size
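The two steps above can be expressed as a small decision function. This is a minimal sketch under stated assumptions: utilization is sampled once per hour as a percentage, and the size ladder in `SIZES` is hypothetical rather than any cloud provider's real instance types.

```python
from __future__ import annotations

# Hypothetical size ladder, smallest to largest.
SIZES = ["small", "medium", "large", "xlarge"]


def should_downsize(
    hourly_utilization: list[float],
    threshold: float = 50.0,
    window_hours: int = 24,
) -> bool:
    """True only if a full window of samples exists and every one is under the threshold."""
    recent = hourly_utilization[-window_hours:]
    return len(recent) == window_hours and all(u < threshold for u in recent)


def next_size_down(current: str) -> str:
    """Return the next smaller size, or the same size if already at the smallest."""
    i = SIZES.index(current)
    return SIZES[max(i - 1, 0)]
```

Requiring a full 24 samples before acting is a deliberate choice in this sketch: a newly launched instance with only a few quiet hours of data is left alone.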

Conversely, an SRE might create a workflow for replacing a bad EC2 instance.

Alert from AWS Health
Spin up new instance
Reroute traffic
Kill old instance

These very simplified workflows will have several steps in them, with conditional branches and could even have what’s being called a “human in the loop”, which is a defined pause point in the workflow to allow a human to verify the situation and authorize appropriate actions.
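As an illustration of a conditional branch with a human in the loop, the instance replacement workflow above could be sketched as follows. Every action is a placeholder that only records what would happen; the function, the alert shape, and the `approve` callback are hypothetical, and a real workflow would call the cloud provider's API at each step.

```python
from __future__ import annotations

from typing import Callable


def replace_bad_instance(alert: dict, approve: Callable[[str], bool]) -> list[str]:
    """Run the simplified replacement workflow and return a log of the steps taken."""
    old_id = alert["instance_id"]
    new_id = old_id + "-replacement"  # placeholder id for the new instance

    log = [f"alert received for {old_id}"]
    log.append(f"spun up {new_id}")
    log.append(f"rerouted traffic to {new_id}")

    # Human in the loop: pause and ask before the destructive step.
    if approve(f"Terminate {old_id}?"):
        log.append(f"terminated {old_id}")
    else:
        log.append(f"kept {old_id} for investigation")
    return log


# In production `approve` might post to a chat channel and wait for a reply;
# here it is a stand-in that always says yes.
steps = replace_bad_instance({"instance_id": "i-example"}, approve=lambda question: True)
```

Passing the approval step in as a function keeps the workflow logic the same whether the decision comes from a person, a test stub, or a fully automated policy.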

SREs look for repeatable processes and then try to automate as much of those as they can to both simplify their job, but also to maintain as high availability as possible. No SRE team expects systems to have 100% uptime, but they plan for incidents and create processes to address them quickly.

SRE Tools

There are many categories of tools that SREs use to effectively maintain cloud operations. The tools span monitoring, logging, alerting, incident management, orchestration, and workflow automation and execution.

Fylamynt has created the world’s first enterprise ready low code platform for building, running and analyzing SRE cloud workflows.

Try Fylamynt for free ->

Incident Response vs. Incident Management

Scott Lasica — Mon, 01 Nov 2021 20:59:45 +0000

If you found your way to this post, it’s likely because you’re trying to determine what the difference is between incident response and incident management. You may be a new SRE, or switched companies and things aren’t being treated in the same way. The good news is you’ve come to the right place. The bad news is you won’t be leaving with a definitive answer.

Incident response and incident management are defined differently by different organizations around the world. A Google search for incident response vs incident management brings up an article from the UK NCSC. In this article, they state:

Incident Management (IM) sits within and across any response process, ensuring all stages are handled. IM deals with any communications, media handling, escalations and any reporting issues, pulling the whole response together, coherently and holistically.

Incident Response (IR) This includes triage, in-depth analysis, technical recovery actions and more.

The above implies that IM is at a higher level, spanning the organization and defining the overall process for handling incidents, while IR defines the actual technical steps done to contain and resolve the issue.

On the same first page of Google results, I found another definition from the US CISA. This definition states:

This process of identifying, analyzing, and determining an organizational response to computer security incidents is called incident management.

Unfortunately this reads as the opposite of the prior definition, stating that IM encompasses the technical steps of identifying and analyzing the incident, as well as the “response” which implies the repair/remediation.

Just another couple Google results down the page finds a post from Educause. Here, they say they are the same thing:

Information security incident management programs (sometimes also called information security incident response programs)…

Irrespective of your definition, it’s important to define a clear incident response process with repeatable consistent steps to be followed in the case of an outage.

Fylamynt can help with the world’s first enterprise ready low code platform for building, running and analyzing SRE cloud workflows. With Fylamynt an SRE can automate the parts of the runbook that are the most time consuming, allowing them to make decisions where their expertise is needed.

For good practices around IR and IM, take a look at our article What’s a Runbook?

Try Fylamynt for free ->