
Alex Mair


Self-Healing Systems in DevOps

DevOps teams face the challenge of keeping increasingly complex systems operational. Because these systems are business-critical, they need to be continuously operational, a goal that is more and more difficult to achieve manually. Self-healing systems offer a potential solution by giving systems the capacity to respond to adverse conditions without the assistance of an on-call engineer.

The trouble with DevOps currently

DevOps has advanced rapidly in the past decade. An explosion of new monitoring tools, some of which incorporate AI, has given DevOps engineers greater insights into their systems than ever before. Low code DevOps tools are making observability and monitoring accessible to a wider range of IT professionals.

But something is missing. Currently, DevOps functions as an open-loop control system. The observability and monitoring infrastructure acts as the DevOps team’s eyes and ears, but human engineers must still make all the decisions.

When a metric like CPU usage rises above its normal parameters, a DevOps engineer must be alerted to the problem so they can solve it manually. If this happens in the middle of the night and there are no on-call engineers doing night shifts, the problem may not be fixed for several hours. Services may be interrupted for extended periods.

This has the potential to create serious problems. In 2019, businesses estimated that every hour of infrastructure downtime cost them an average of $301,000 to $400,000. At that rate, a five-hour outage would cost a business roughly $1.5 million to $2 million.
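To make the arithmetic concrete, here is a minimal Python sketch that multiplies the hourly range quoted above by the length of an outage. The function name and constants are illustrative, and the hourly figures are simply the survey averages cited in this article, not a model of any particular business.

```python
# Hourly cost range quoted in the article (USD per hour of downtime).
HOURLY_COST_LOW = 301_000
HOURLY_COST_HIGH = 400_000


def downtime_cost_range(hours: float) -> tuple:
    """Return the (low, high) estimated cost of an outage lasting `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * HOURLY_COST_LOW, hours * HOURLY_COST_HIGH


low, high = downtime_cost_range(5)
print(f"5-hour outage: ${low:,.0f} to ${high:,.0f}")
# prints "5-hour outage: $1,505,000 to $2,000,000"
```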

Self-healing systems have come at just the right time. They build on increasingly sophisticated monitoring toolchains to turn DevOps systems into closed-loop control systems. These systems can actively respond to failure when no on-call engineer is around.

Self-Healing Systems explained

The idea that DevOps systems should be “self-healing” can be traced back to the Reactive Manifesto published in 2014. It advocated the creation of systems that were responsive, resilient and elastic. These systems would have a degree of autonomy in their operation and would be able to handle unexpected occurrences (such as a spike in user traffic) without “asking for help” from a human engineer.

Self-healing systems take inspiration from biological organisms. For example, the human body has the ability to modify heart rate automatically. When someone is running, their heart rate increases in response to increased oxygen demand. When they stop, their body brings it back to normal.

Self-healing aims to bring this capability to DevOps systems. According to Viktor Farcic, self-healing comes in a variety of flavors.

Self-Healing Levels

System-level self-healing is most relevant to DevOps teams. There are two ways to implement it. The first is Time To Live (TTL), which requires a service to confirm at regular intervals that it’s operational. The second is pinging, where an external system pings the service at regular intervals.

Both TTL and pinging require excellent cloud monitoring, and there is a range of tools for this, including low code solutions.
Self-healing can also be implemented at the application and hardware levels.
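The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a production health checker: `TTLRegistry` and `ping_all` are hypothetical names, and the clock is injectable so the behavior can be checked without waiting. With TTL the service reports in and is flagged if it goes quiet; with pinging the monitor does the asking.

```python
import time
from typing import Callable, Dict, List


class TTLRegistry:
    """TTL: each service must send a heartbeat before its time to live runs out."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, service: str) -> None:
        # Called by the service itself to confirm it is still operational.
        self._last_seen[service] = self._clock()

    def expired(self) -> List[str]:
        # Services whose last heartbeat is older than the TTL.
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.ttl_seconds
        )


def ping_all(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Pinging: an external monitor calls each check and returns the failures."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # An unreachable service counts as a failed ping.
        if not ok:
            failed.append(name)
    return failed
```

In practice each check passed to `ping_all` would wrap something like an HTTP request to a health endpoint, and anything returned by `expired()` or `ping_all()` would be handed to a remediation step.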

Self-Healing Types

Farcic lists two types of self-healing, reactive and preventative. Reactive self-healing is when a system is configured to respond to failure by performing some restorative action. Reactive self-healing can be quite easy to implement, especially with low code tools.
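As a rough illustration of the reactive type, the Python sketch below checks a service and performs a restorative action when the check fails. The `is_healthy` and `restart` callables are placeholders for whatever your own tooling provides, and capping the number of attempts is an assumption added here so that a service that cannot be revived gets escalated to a human rather than restarted forever.

```python
from typing import Callable


def heal_if_unhealthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Restart the service until it is healthy or max_attempts is used up.

    Returns True if the service is healthy when the function returns.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        restart()  # The restorative action, e.g. restarting a container.
    return is_healthy()
```

A `False` result is the signal to page an engineer, which keeps the human in the loop only for the failures the system cannot fix on its own.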

Preventative self-healing involves analyzing system data to predict potential failures and stop them before they happen. For example, a steady increase in memory usage could indicate a memory leak, which an intelligent system could notify developers about in advance.
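The memory leak example can be sketched with a simple trend line. The Python below fits a least-squares line to recent memory samples and estimates how long until usage would reach a limit. It is only a sketch under the assumption that growth is roughly linear; the function name and parameters are hypothetical, and real systems usually need something more robust to noise.

```python
from typing import Optional, Sequence


def seconds_until_limit(
    times: Sequence[float],
    usage_mb: Sequence[float],
    limit_mb: float,
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches limit_mb.

    Returns None when usage is flat or falling, i.e. no leak is suggested.
    """
    n = len(times)
    if n < 2 or n != len(usage_mb):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_u = sum(usage_mb) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    # Least-squares slope (MB per second) and intercept of the trend line.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage_mb)) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_limit = (limit_mb - intercept) / slope
    return max(0.0, t_limit - times[-1])
```

For example, samples of 500, 510, 520 and 530 MB taken a minute apart, against a 1000 MB limit, give an estimate of about 47 minutes, which is enough warning to notify developers before the failure happens.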

Self-Healing Systems in Action

Self-healing systems might seem like sci-fi. However, they are readily achievable with existing DevOps tooling. In the following sections, we’ll see how Databricks used self-healing in an auto-remediation system called Healer, and how your DevOps team can build its own.

Andrew Nitu’s Healer

Databricks is a company specializing in data and AI. Their Cloud Infrastructure team is responsible for maintaining thousands of VMs and database instances. The team has on-call engineers to deal with problems. Unfortunately, these engineers aren’t available 24/7.

This gap in coverage is made worse by the fact that Databricks operates worldwide. If a failure happened in Amsterdam and the engineers with the relevant knowledge were in Toronto, time zone differences would mean the issue wouldn’t be fixed for hours.

There was a gap in Databricks’ operations. Cloud intern Andrew Nitu built Healer to plug it. Healer is an event-driven auto-remediation system. In essence, Healer does exactly what a human on-call engineer would do. It waits for an alert, looks at what type of alert it is and performs a task to remediate the problem. For example, Healer might start Jenkins.

Nitu designed Healer to plug into Databricks’ existing alerting systems, Prometheus and Alertmanager. Healer sits at the back end, listening for HTTP requests. When it gets an alert, it looks at the alert’s metadata.
This includes the alert type and label, which Healer uses to select an appropriate remediation. Once the remediation has been chosen, it is put into a pool where it waits for its turn to run. Healer keeps tabs on the remediation until it has successfully completed. Healer then sends a message through JIRA and Slack.
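The flow described above (receive an alert, pick a remediation from its labels, queue it, run it, report the outcome) can be sketched in Python. This is not Healer's actual code, which is not shown in this article. The `AutoRemediator` class and its methods are hypothetical, the payload shape follows Alertmanager's webhook format (an `alerts` list whose items carry `status` and `labels`), and the `notify` callable stands in for the JIRA and Slack messages.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Alert = dict
Remediation = Callable[[Alert], bool]


class AutoRemediator:
    """Maps alert names to remediations and runs them in arrival order."""

    def __init__(self, notify: Callable[[str], None]):
        self._registry: Dict[str, Remediation] = {}
        self._pending: Deque[Tuple[Remediation, Alert]] = deque()
        self._notify = notify

    def register(self, alertname: str, remediation: Remediation) -> None:
        self._registry[alertname] = remediation

    def handle_webhook(self, payload: dict) -> int:
        """Queue a remediation for each firing alert we know how to fix."""
        queued = 0
        for alert in payload.get("alerts", []):
            if alert.get("status") != "firing":
                continue  # Resolved alerts need no action.
            name = alert.get("labels", {}).get("alertname")
            remediation = self._registry.get(name)
            if remediation is None:
                continue  # No automated fix; leave it to a human.
            self._pending.append((remediation, alert))
            queued += 1
        return queued

    def run_pending(self) -> None:
        """Run queued remediations one at a time and report each outcome."""
        while self._pending:
            remediation, alert = self._pending.popleft()
            name = alert["labels"]["alertname"]
            try:
                ok = remediation(alert)
            except Exception:
                ok = False
            self._notify(f"{name}: {'remediated' if ok else 'remediation failed'}")
```

Keeping the registry separate from the queue means a new remediation can be added with one `register` call, without touching the code that receives alerts.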

To illustrate Healer’s power, Nitu explains how it can be used to remediate low disk space on Kubernetes nodes. Databricks engineers typically handled this by watching for an alert called NodeDiskPressure.

They would remediate the problem by connecting to the appropriate node and running the command 'docker image prune'. Healer automates this entire process with a remediation configured to NodeDiskPressure’s parameters. When it receives the alert, it triggers a Jenkins job that connects to a Kubernetes node and runs 'docker image prune' automatically.
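A remediation like this one is little more than a scripted command. The Python sketch below shows one possible shape for it. Reaching the node over SSH is an assumption made for illustration (the article only says a Jenkins job connects to the node), the function names are hypothetical, and `--force` is the Docker flag that skips the confirmation prompt so the command can run unattended. The function defaults to a dry run that only returns the command it would execute.

```python
import subprocess
from typing import Callable, List


def build_prune_command(node: str) -> List[str]:
    """Build the command that prunes dangling images on `node` over SSH."""
    if not node or node.startswith("-"):
        # Refuse values that SSH could interpret as an option.
        raise ValueError(f"invalid node name: {node!r}")
    return ["ssh", node, "docker", "image", "prune", "--force"]


def prune_node_images(
    node: str,
    run: Callable[..., object] = subprocess.run,
    dry_run: bool = True,
) -> List[str]:
    """Return the prune command, executing it only when dry_run is False."""
    command = build_prune_command(node)
    if not dry_run:
        run(command, check=True)  # Raises if the remote command fails.
    return command


# Dry run: nothing is executed, the command is only displayed.
print(" ".join(prune_node_images("node-1.example.internal")))
```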

Doing it yourself

It’s worth pointing out that no part of Healer required cutting-edge programming skills. It plugged into the existing alert system and simply ran, automatically, the scripts that human engineers would have run manually. In future, even scripting may be unnecessary with the increasing availability of low code DevOps tools.

Vladimir Fedak outlines five steps to make a self-healing infrastructure. Step 1 involves utilizing Immutable Infrastructure as Code (IIaC). This involves using containerization solutions and cloud-based services to take the burden of provisioning servers off your hands. It’s much easier to automate response to failure when malfunctioning instances can be replaced in moments.

Step 2 is keeping the codebase efficient with automated testing. In recent years, automated testing has become much easier to do with the development of intuitive testing tools that don’t require in-depth knowledge.

Step 3 requires the team to go all out on logging and monitoring. Self-healing systems are premised on the concept of being responsive to internal/external conditions. As Moe Long explains, logging data is critical to analyzing DevOps performance.

However, it’s also increasingly difficult to capture and analyze log data in a cloud-based infrastructure that makes extensive use of virtualization. Conventional DevOps is starting to suffer from what Long calls Franken-monitoring, a situation where DevOps teams have a menagerie of monitoring tools and no real idea how to use them to analyze data.

A self-healing system can capitalize on large amounts of log data by extracting trends and automatically taking action in real time. Having good logging data increases the observability of your system, allowing for better problem-solving.

Step 4 is creating smart alerts and using prescriptive analytics to monitor the system. High-quality alerts can supercharge your DevOps practice by providing summaries of critical data. For example, an alert might perform data aggregation on a metric such as page loading time.

This can allow engineers to see whether the average page load time is above or below a certain threshold. Features like this can be very powerful when combined with self-healing. We saw with Healer that alert metadata provides rich pickings for a self-healing system to accurately identify problems.

Smarter alerts can increase your system’s visibility, which in turn means self-healing systems can be even more adaptive and responsive.
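To show what an aggregating alert might look like, here is a minimal Python sketch that keeps a rolling window of page load times and fires when the average crosses a threshold. The class name, window size and threshold are illustrative assumptions, not the API of any particular monitoring tool.

```python
from collections import deque
from statistics import fmean


class PageLoadAlert:
    """Fires when the rolling average page load time exceeds a threshold."""

    def __init__(self, threshold_ms: float, window: int = 5):
        self.threshold_ms = threshold_ms
        self._samples = deque(maxlen=window)  # Oldest samples drop off.

    def average(self) -> float:
        return fmean(self._samples) if self._samples else 0.0

    def observe(self, load_time_ms: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self._samples.append(load_time_ms)
        return self.average() > self.threshold_ms
```

Averaging over a window means a single slow page load does not necessarily trigger a remediation, while a sustained slowdown does. A self-healing system could subscribe to the `True` results and respond, for example by adding capacity.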

Step 5 is for teams that wish to leverage machine learning in their self-healing architecture. It involves using machine learning algorithms to optimize the performance of the system so that it learns to solve problems better over time. Adam Frank argues that the best self-healing systems combine observability and monitoring with artificial intelligence to create robust and resilient systems.

Conclusion

Self-healing systems have come at just the right time. DevOps teams are currently caught between two forces that are in danger of coming to a head. On the one hand, the increasingly critical importance of IT infrastructure to modern businesses is putting pressure on them to keep systems continuously operational.

On the other hand, the increasing sophistication of cloud-based infrastructure means that DevOps engineers need to keep track of an ever-increasing number of variables. A DevOps team is like a juggler trying to keep loads of balls in the air while the crowd keeps throwing her more!

Luckily, there is a potential savior. Toolchains have become increasingly sophisticated and smart, reducing the requirement for arcane coding skills. Self-healing systems leverage that sophistication to create system infrastructure that is reactive, responding to changing conditions. This means engineers no longer have to be constantly on call, as they can trust the system to look after itself.

Top comments (4)

Andreas Wilke

I love the idea of self-healing and have also used the low code platform kaholo to implement self-healing for compliance events detected by lacework: github.com/automatecloud/lacework-.... It's the right time to start using self-healing!

liortal1

Amazing work Andreas!

Hui (Henry) Chen

Well written! Thanks for sharing this kind of DevOps system!

AdamKaholo

Really in depth, thanks for sharing this!