DEV Community

ricklatham


Don’t forget about roots: root cause analysis in data monitoring

Hi folks, this is Rick again. In the previous article, I talked about how auto-discovery and auto-actions in AIOps systems help IT specialists delegate routine tasks to machines, freeing up time for work that only a human brain can handle. But even for that kind of work, AI should not be discounted. When we are dealing with complex big data analysis, machines cannot think for us, but they can simplify the analysis significantly and cut the time we spend thinking. Today I want to talk about an approach that is essential in monitoring, root cause analysis, and show how it works using Acure as an example. So meet the sequel to the story of my adventures in the world of monitoring automation.

What is root cause analysis and why is it so important?

I already compared an IT system to a living organism. And like any living organism, it can get sick. But in order to cure the disease, it is not enough to eliminate the symptoms. It is important to find out the cause of this disease and eliminate it. For this, we need root cause analysis.

I came across the following picture on the Internet, which, in my opinion, clearly demonstrates how important it is to understand the cause of the problem in order to overcome it.

The tree analogy is very apt. Root cause analysis (RCA) is a method for identifying hidden causes that lets you determine why a particular problem occurred. An RCA can therefore be drawn as a tree-like hierarchy of dependencies between a problem and its causes.

Root cause analysis answers three questions:

  • What’s the problem?

  • What’s the reason?

  • What should be done to prevent it in the future?

The search for answers to these questions leads us to a chain of three simple steps: Define-Analyze-Solve.

RCA helps not only to detect a problem but also, once its cause is known, to prevent it from recurring.

It is worth noting that many people who use this approach mistakenly believe that there can be only one root cause, although in reality a problem can have several. That is why it is so important to keep the connections between the analyzed objects in mind.

Of course, in other areas where RCA is used, everything can be simpler, but definitely not in data monitoring.

What about RCA in data monitoring?

When monitoring data, we work with incidents that are almost impossible to resolve if you do not know why they occurred. But event notifications often do not contain enough information about root causes. The more complex the IT infrastructure, the harder it is to find the root cause. Even if an IT specialist discovers a cause on their own, it may be just one of several.

To streamline finding and preventing problems, specialists need to understand the original cause as quickly as possible. And you can only do this if you have:

  • a visual representation of the entire infrastructure as a whole

  • a clear understanding of the relationships and dependencies of its objects

Now I’ll show you how I found all this in Acure.

A cure not only for symptoms

Let me remind you that to get a complete picture of the entire IT landscape, I set up data flows and built a resource-service model from configuration items (CIs) and their connections. I will not go into these processes again; they are described in detail here. As a result, I got a visual topology in the form of a tree that shows the health of the IT infrastructure and the impact of one element on another.

On the card of each configuration item, you can see its health as well as its dependencies on other elements. The health of each object is calculated from the health of the objects that affect it and from the monitoring events associated with it. Two parameters are used in the calculation:

  1. the weight of the connection — used in assessing the “equivalent” effect;

  2. a critical factor — the direct inheritance of health, suitable for critical nodes.

To show how the calculation works, the guys from Acure give a simple example in the documentation, which I want to share for clarity:

For example, a cluster contains 5 objects. The first object is the master: if it fails, it does not matter what happens to the rest, the cluster is broken. The remaining objects are additional “nodes”. All five objects have a weight of 1, but the critical factor is set for the master. According to the model, if the master fails or degrades, the state of the cluster will be no better than that of the master. If one of the nodes fails, the cluster health will be 80%. Thus, the model allows a quick assessment of the state of the entire IT environment.
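To make the arithmetic in this example concrete, here is a minimal Python sketch. It is my own reading of the example, not Acure's actual formula: I assume the health is a weighted average of the dependencies' health, capped by the health of any dependency that has the critical factor set. The names `Dependency` and `cluster_health` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    health: float          # 0..100, health of the affecting object
    weight: float = 1.0    # weight of the connection
    critical: bool = False  # critical factor: parent inherits this health directly


def cluster_health(deps: list[Dependency]) -> float:
    """Assumed model: weighted average, capped by the worst critical dependency."""
    total_weight = sum(d.weight for d in deps)
    if total_weight == 0:
        return 100.0  # nothing affects the object, treat it as healthy
    weighted = sum(d.health * d.weight for d in deps) / total_weight
    # The parent can never be healthier than any of its critical dependencies.
    critical_cap = min((d.health for d in deps if d.critical), default=100.0)
    return min(weighted, critical_cap)


# One master (critical) and four ordinary nodes, all with weight 1.
cluster = [Dependency("master", 100.0, critical=True)] + [
    Dependency(f"node{i}", 100.0) for i in range(1, 5)
]

cluster[1].health = 0.0          # one ordinary node fails
print(cluster_health(cluster))   # 80.0

cluster[0].health = 0.0          # the master fails as well
print(cluster_health(cluster))   # 0.0
```

Under these assumptions, a failed node costs the cluster exactly its share of the total weight (one fifth), while a failed master pulls the cluster down to zero no matter how the other nodes are doing.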

After any change in the topology, the health of the system is instantly recalculated, and the entire tree is colored accordingly. If the health of the root CI starts to turn a treacherous red, you can see in detail which factors affect the object most negatively and follow the branches until you reach the element that dragged down the health of the entire system. Easy!
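Walking the branches like this can also be sketched in a few lines of Python. This is only an illustration of the idea, not how Acure implements it: the `CI` class, the `find_root_causes` function and the sample topology are all hypothetical. The sketch starts at an unhealthy CI, follows only the unhealthy dependencies, and reports the CIs where the trail ends.

```python
from dataclasses import dataclass, field


@dataclass
class CI:
    name: str
    health: float = 100.0
    depends_on: list["CI"] = field(default_factory=list)


def find_root_causes(root: CI, threshold: float = 100.0) -> list[str]:
    """Return the unhealthy CIs that have no unhealthy dependencies of their own."""
    causes: list[str] = []
    seen: set[str] = set()

    def walk(node: CI) -> None:
        if node.name in seen:   # a shared dependency is visited only once
            return
        seen.add(node.name)
        unhealthy = [d for d in node.depends_on if d.health < threshold]
        if not unhealthy:       # the trail ends here: this is a root cause
            causes.append(node.name)
            return
        for dep in unhealthy:
            walk(dep)

    if root.health < threshold:
        walk(root)
    return causes


# Hypothetical topology: a service depends on a cluster and a database.
master = CI("master", health=0.0)
node1 = CI("node1")
cluster = CI("cluster", health=0.0, depends_on=[master, node1])
disk = CI("disk", health=10.0)
db = CI("db", health=40.0, depends_on=[disk])
service = CI("service", health=30.0, depends_on=[cluster, db])

print(find_root_causes(service))  # ['master', 'disk']
```

Note that the walk returns two elements here, which matches the earlier point that an incident can have more than one root cause.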

Congratulations! You have just learned root cause analysis.
