<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ricklatham</title>
    <description>The latest articles on DEV Community by ricklatham (@ricklatham).</description>
    <link>https://dev.to/ricklatham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F859739%2F4f3c4bca-3913-4f93-9011-3dad273f24ba.jpg</url>
      <title>DEV Community: ricklatham</title>
      <link>https://dev.to/ricklatham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ricklatham"/>
    <language>en</language>
    <item>
      <title>Don’t forget about roots: root cause analysis in data monitoring</title>
      <dc:creator>ricklatham</dc:creator>
      <pubDate>Sat, 04 Jun 2022 02:21:03 +0000</pubDate>
      <link>https://dev.to/ricklatham/dont-forget-about-roots-root-cause-analysis-in-data-monitoring-3ade</link>
      <guid>https://dev.to/ricklatham/dont-forget-about-roots-root-cause-analysis-in-data-monitoring-3ade</guid>
      <description>&lt;p&gt;Hi folks, this is Rick again. In the previous article, I talked about how the auto-discovering and auto-actions functionality in AIOps systems helps IT specialists delegate routine tasks to machines, freeing up time for work that only a human brain can handle. But even in this case, AI cannot be discounted. If we are dealing with complex big data analysis, machines will not be able to think for us, but they will be able to significantly simplify this analysis, reducing the time needed for thinking. Today I want to talk about an approach that is necessary in monitoring, called &lt;strong&gt;root cause analysis&lt;/strong&gt;, and show how it works using Acure as an example. So meet the sequel to the story of my adventures in the world of monitoring automation.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;What is root cause analysis and why is it so important?&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;I already compared an IT system to a living organism. And like any living organism, it can get sick. But in order to cure the disease, it is not enough to eliminate the symptoms. It is important to find out the cause of this disease and eliminate it. For this, we need root cause analysis.&lt;/p&gt;

&lt;p&gt;I came across the following picture on the Internet, which, in my opinion, clearly demonstrates how important it is to understand the cause of the problem in order to overcome it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0qqLuBhk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2400/0%2AtcJwTXntbCcg7zuF" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0qqLuBhk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2400/0%2AtcJwTXntbCcg7zuF" alt="" width="880" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tree analogy is very accurate. Root Cause Analysis is a method for identifying hidden causes that allows you to determine why a particular problem occurred. Thus RCA is a tree-like hierarchical structure of the dependencies between a problem and its causes.&lt;/p&gt;

&lt;p&gt;Root cause analysis answers three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What’s the problem?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What’s the reason?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What should be done to prevent it in the future?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The search for answers to these questions leads us to a chain of three simple steps: &lt;strong&gt;Define-Analyze-Solve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RCA helps not only to detect a problem but also, by knowing its cause, to prevent its occurrence in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fI0z7f2q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ApVfrtgRAGqUcz0yH" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fI0z7f2q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ApVfrtgRAGqUcz0yH" alt="" width="825" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is worth noting that many who use this approach in analytics mistakenly believe that there can be only one root problem, although in reality, everything can be much more complicated. That is why it is so important to keep in mind the connections between the analyzed objects.&lt;/p&gt;

&lt;p&gt;Of course, in other areas where RCA is used, everything can be simpler, but definitely not in data monitoring.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;What about RCA in data monitoring?&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;When monitoring data, we work with incidents that are almost impossible to solve if you do not know the reason for their occurrence. But event notifications often do not contain enough information about root causes. The more complex the IT infrastructure, the more difficult it is to find the root problem. Even if the IT specialist discovers the cause on his own, it may be just one of several.&lt;/p&gt;

&lt;p&gt;In order to make the process of searching and preventing the problem more streamlined, it is important for professionals to understand as quickly as possible what the original cause is. And you can only do this if you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a visual representation of the entire infrastructure as a whole&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;a clear understanding of the relationships and dependencies of its objects&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now I’ll show you how I found all this in Acure.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;A cure not only for symptoms&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Let me remind you that before getting a complete picture of the entire IT complex, I set up data flows and built a resource-service model using CIs and their connections. I will not delve into these processes again, which are described in detail here. During all these manipulations, I was presented with a visual topology in the form of a tree, showing the health of the IT infrastructure and the impact of one element on another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dWlW9S-X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AvOY3B0eYsTV06ZxN" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dWlW9S-X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AvOY3B0eYsTV06ZxN" alt="" width="880" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the card of each configuration item, you can see its health, as well as dependencies with other elements. The health of each object is calculated based on the health of the affecting objects, as well as the monitoring events associated with it. The following are used as metrics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;the weight of the connection — used in assessing the “equivalent” effect;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;a critical factor — the direct inheritance of health, suitable for critical nodes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BKjR9x9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2100/0%2AFl4G-pzq3fJmcuVI" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BKjR9x9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2100/0%2AFl4G-pzq3fJmcuVI" alt="" width="880" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to understand how the calculation takes place, the guys from Acure give a simple example in the documentation, which I also want to share for clarity:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;For example, the cluster contains 5 objects. The first object is a master, and if it fails, it does not matter what happens to the rest, the cluster will be broken. The remaining objects are additional “nodes”. All five objects weigh equal to 1, but the critical factor is put for the master. According to the model, if the master fails or degrades on it, the state of the cluster will not be better than that of the master. If one of the nodes fails, the cluster health will be 80%. Thus, the model allows quick assessment of the state of the entire IT environment.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thus, after any changes in the topology, the health of the system is instantly recalculated, coloring the entire tree in the appropriate colors. If the health of the root CI starts to turn traitorous red, you will see in detail which factors most negatively affect the object, and go through the branches in order to eventually come to the element that affected the health of the entire system. Easy!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yI3rc5fy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AInqZtIvVJQj82VlOv7obQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yI3rc5fy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AInqZtIvVJQj82VlOv7obQQ.png" alt="" width="880" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You have just learned root cause analysis.&lt;/p&gt;

</description>
      <category>rootcause</category>
      <category>bigdata</category>
      <category>monitoring</category>
      <category>topology</category>
    </item>
    <item>
      <title>Auto discovering and auto actions in data monitoring or How to drink coffee instead of routine tasks</title>
      <dc:creator>ricklatham</dc:creator>
      <pubDate>Mon, 09 May 2022 12:36:42 +0000</pubDate>
      <link>https://dev.to/ricklatham/auto-discovering-and-auto-actions-in-data-monitoring-or-how-to-drink-coffee-instead-of-routine-tasks-2bhj</link>
      <guid>https://dev.to/ricklatham/auto-discovering-and-auto-actions-in-data-monitoring-or-how-to-drink-coffee-instead-of-routine-tasks-2bhj</guid>
      <description>&lt;p&gt;&lt;em&gt;Hi! I am Rick Latham. I am an IT engineer responsible for data monitoring in an American telecom company. Until today, I was just a humble reader and shy content consumer but after spending years of my life on routine manual activities and finally discovering a solution, I felt compelled to share. This is my first article here and hope that you will find it interesting and useful. I look forward to your feedback, but for now, make yourself comfortable, pour some coffee and enjoy!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Btw, about coffee…&lt;/h2&gt;

&lt;p&gt;Drinking coffee with your legs on the table and looking at the green lights near all elements of the system is the dream of every IT engineer doing IT monitoring. But in my experience that is usually all it is, a dream. In my work, I support more than 100 different systems, services, and servers and this infrastructure expands, changes, acquires new items and connections constantly. I work with Zabbix and although it meets performance requirements, I need to monitor not only the IT infrastructure but also myself. If I don’t put new elements on monitoring, I will not receive the alerts when one of these elements doesn’t answer. Until, of course, angry users simulate a DDoS attack on the external interface of my department when one of the services is down.&lt;/p&gt;

&lt;p&gt;To avoid such incidents and ensure I don’t wake up in a cold sweat trying to remember whether I put a new configuration item on monitoring, I decided to find out whether there was a solution to automate these processes and assign this routine work to machines. This is how I found Acure. No, it is not a typo, it’s the name of an AIOps platform — the protagonist of this story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2poR0qVB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AwSKPUqRFtcy0nXKT" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2poR0qVB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AwSKPUqRFtcy0nXKT" alt="" width="741" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What is AIOps?&lt;/h2&gt;

&lt;p&gt;I heard the term “Artificial Intelligence for IT operations”, or “AIOps”, a few years ago thanks to Gartner (who else?) This is the definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In short there are two aspects to AIOps — big data and machine learning. And now you should see a clever picture.&lt;/p&gt;

&lt;p&gt;Ah, here it is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z0FZG5ud--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ADOw4DpmiONDanpu2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z0FZG5ud--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ADOw4DpmiONDanpu2" alt="*The source: Gartner, Inc*" width="600" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The source: Gartner, Inc&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It was clear to see that for my case I would benefit from an AIOps platform but I needed to pick the right one. The problem I was running into was a serious cost barrier, since I didn’t have an Elon Musk sized budget. So I started looking into free solutions and found &lt;a href="https://acure.io/"&gt;Acure.io.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;A cure?&lt;/h2&gt;

&lt;p&gt;A little background on Acure. Acure is a new incident control and automation platform created by a company out of Latvia and boy did they pack in a lot of features. It caught my attention since they market it as being made for &lt;strong&gt;engineers by engineers&lt;/strong&gt; which was huge for me since there is no one who understands all the pain points and insides of IT monitoring better than a fellow engineer. It boasts flexible and open architecture, root cause and impact analysis, topology models, a single screen for the state of the entire IT system, integrations with popular monitoring systems (including Zabbix of course), and a low-code engine. It all sounded great, especially the low-code because even though I’m an IT wizard, I am not a master at coding so I can make all the configurations and changes by myself.&lt;/p&gt;

&lt;p&gt;Soon Acure will release the SaaS version, but they also have an on-premise license for enterprises. I contacted the team, explained my demands and they kindly provided me an on-premise version so I could test it. I got to configure the data streams, automate the process of configuration item discovering, put it on monitoring and even start auto-healing scripts. All of these processes are fully described in the informative Acure User Guides.&lt;/p&gt;

&lt;p&gt;I only wrote about my case and showed my interface, settings, buttons, and scripts, but don’t worry about the amount of configuration. Once configured, Acure will make your future work processes way easier.&lt;/p&gt;

&lt;p&gt;As I mentioned, Acure is a SaaS solution and on-premise versions are requested individually. Therefore, I will not focus on the installation process (in the SaaS version it will just mean creating space) and get straight to my points.&lt;/p&gt;
&lt;h2&gt;Aggregation and analysis of events from external systems&lt;/h2&gt;

&lt;p&gt;First, I set up a &lt;strong&gt;Data stream&lt;/strong&gt; to receive data from the other system. In Acure, this process is very easy and intuitive. In the section &lt;em&gt;Data Collection — Data Streams&lt;/em&gt; just press the &lt;em&gt;+Add stream&lt;/em&gt; button and fill in the following fields.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rU1k9FUL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ACKHMkuXw7HHi0lEcTJFl1A.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rU1k9FUL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ACKHMkuXw7HHi0lEcTJFl1A.jpeg" alt="" width="757" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a great feature that Acure allows you to choose a template of a popular monitoring system with preconfigured tasks and handlers. You don’t need to puzzle over a complex configuration since the system will do everything for you.&lt;/p&gt;

&lt;p&gt;As I’ve already said I am working with Zabbix so I’ve chosen its template.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TymF83KU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AeVaQlrmnR5ZKXnNV5Alt6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TymF83KU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AeVaQlrmnR5ZKXnNV5Alt6w.png" alt="" width="760" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I connected Zabbix and Acure using the connection URL and logging into Zabbix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_qcvP2ML--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AP6djVw_ycl5lgtK1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_qcvP2ML--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AP6djVw_ycl5lgtK1" alt="" width="880" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The configuration template already contains tasks for binding Zabbix to Acure, but in my case I needed to add one more custom task to take data from Zabbix for further topology creation. I wrote the script in YAML and made a request to the Zabbix API:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nBsjoQIv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AEgOJVir6OnKxLypg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nBsjoQIv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AEgOJVir6OnKxLypg" alt="" width="880" height="403"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  - steps:
      - run: 'curl -H \"Content-Type: application/json\" -s --request POST --data-raw \"${zabbixGetHostsJson}\" ${zabbixUri}'
        env:
          zabbixUri: https://&amp;lt;zabbix-domain&amp;gt;/api_jsonrpc.php
          zabbixGetHostsJson: &amp;gt;
            {
            "jsonrpc": "2.0",
            "method": "host.get",
            "params": {
                "output": "extend",
                "selectGroups": "extend"
            },
            "id": 1,
            "auth": "&amp;lt;zabbix-token&amp;gt;"
            }
        outputs:
          data: $._outputs.shell
    artifacts:
      - data: '{{ outputs.data }}'
        send-to:
          api:
            uri: https://&amp;lt;acure-domain&amp;gt;/api/public/cl/v1/stream-data
            headers:
              x-smon-stream-key: $.vars.stream.key
              x-smon-userspace-id: $.userspaceId
            media-type: application/json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script can easily be reused as a template: insert it manually and change the URL of the primary system, the data source and the token to match your own data.&lt;/p&gt;

&lt;p&gt;The API token was taken from Zabbix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--03qb1GQ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2544/0%2AwJTwHj_cWVu7tXbI" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--03qb1GQ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2544/0%2AwJTwHj_cWVu7tXbI" alt="" width="880" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then saved the changes and voila! The stream was added into Data Streams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4tSmYThF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A6jRxmF1CZDQ82Nn2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4tSmYThF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A6jRxmF1CZDQ82Nn2" alt="" width="517" height="226"&gt;&lt;/a&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tRB_59At--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjmsXo-xHm8ijslDi" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tRB_59At--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjmsXo-xHm8ijslDi" alt="" width="498" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The magic doesn’t stop here though. The topology data collection task cannot be run on the internal agent because of Security Policies. I had to connect the external agent before running it. That may sound a little bit complicated but it really isn’t that bad. Once this step is taken care of, you won’t need to worry about it again.&lt;/p&gt;

&lt;p&gt;I then added a New Coordinator and set up the custom agent inside it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FcTmNC-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/1%2A3kNfLtxi5Uog_ArchImpKg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FcTmNC-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/1%2A3kNfLtxi5Uog_ArchImpKg.jpeg" alt="" width="880" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hriyfd8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2616/0%2AD9KgUAcbRD2Rcelj" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hriyfd8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2616/0%2AD9KgUAcbRD2Rcelj" alt="" width="880" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up, I returned to Data Stream and changed the agent in the manually added custom task to the newly created one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L1oaCJL7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AUExJNvGMUWvWJGau" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L1oaCJL7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AUExJNvGMUWvWJGau" alt="" width="624" height="784"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, I started the stream. In &lt;em&gt;Events and Logs&lt;/em&gt; I could check the events to make sure that the data was being collected by the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lWmR8Wmt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8dau8tqzn336fyjbzv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lWmR8Wmt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8dau8tqzn336fyjbzv2.png" alt="" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having “fallen” into one of the events reporting a problem, I saw not only its table view with information about all the elements contained in Zabbix but also the standard JSON structure for Zabbix, for example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uLnY2ZE9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zojxemfl81kvgsibjms2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uLnY2ZE9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zojxemfl81kvgsibjms2.jpg" alt="*Info about the Zabbix hosts*" width="880" height="479"&gt;&lt;/a&gt; &lt;em&gt;Info about the Zabbix hosts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TPD4AGyH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2964/0%2ALN0-VQTfu4skay5C" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TPD4AGyH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2964/0%2ALN0-VQTfu4skay5C" alt="*JSON structure for one of the Zabbix events*" width="880" height="465"&gt;&lt;/a&gt;&lt;em&gt;JSON structure for one of the Zabbix events&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Based on the events coming from the primary monitoring system, I built the Resource Service Model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building of Resource Service Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The resource service model (RSM) is a list of configuration items (CI) from the system and the connections between them. In Acure, the RSM is based on topology but I won’t get ahead of myself. Let’s get back to the JSON structure. The following parameters are what the RSM is built on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qqj-8jTU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AeLEBFyA4RT2Y4t3q" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qqj-8jTU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AeLEBFyA4RT2Y4t3q" alt="" width="856" height="1232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For further automatic creation of configuration items I determined that:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CI’s name == host.name (Zabbix host name)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parent CI’s name == host.groups[0].name&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Related object =&amp;gt; Zabbix node&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the Automation section I added a new scenario for my Working Group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Itdbqum--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2854/0%2AmYZSNvge8-g1AEvh" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Itdbqum--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2854/0%2AmYZSNvge8-g1AEvh" alt="" width="880" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process of scenario creation is clearly described in the documentation. After reading it and looking at a couple of examples, anyone can create a scenario of any complexity without any hard coding. The low-code engine makes this process very easy for non-programmers like me. Just add blocks, bind them, and create the script.&lt;/p&gt;

&lt;p&gt;Each scenario starts from the “OnLogEvent” block where you receive the topology events. Mine was no exception.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YKk7GM20--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A5916S6UvLzBZJzKn" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YKk7GM20--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A5916S6UvLzBZJzKn" alt="" width="196" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then checked the streams: my stream name matched the value in the script (in my case, Zabbix Sync), so the script continued executing. If the names do not match, the script will not be executed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BpEXa-iV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2Ayuzn3dKXmhPDKeGr" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BpEXa-iV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2Ayuzn3dKXmhPDKeGr" alt="" width="796" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To grow the topology “bush” from one information system, I manually created a root CI through the CI creation option in the Service Model Graph tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lj8yS4U4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AMtsatWUHrUc1YAUb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lj8yS4U4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AMtsatWUHrUc1YAUb" alt="" width="880" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having created a CI, I took its ID from the link and pasted it into the following block in the script.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wL66J8Hp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AdhVPa9sHoUzzCbGf" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wL66J8Hp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AdhVPa9sHoUzzCbGf" alt="" width="445" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mpJcxkfd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AHgMB9cwCGd4fbZGF" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mpJcxkfd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AHgMB9cwCGd4fbZGF" alt="" width="360" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When creating a scenario, all groups were linked to this root CI, which displays the health of the entire system on the topology.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_P6LrQQg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2860/0%2A_lKlpSMrS2M-S3f2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_P6LrQQg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2860/0%2A_lKlpSMrS2M-S3f2" alt="" width="880" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I don’t want to spoil the topology just yet, so let’s return to the creation of a single script, which can be conditionally divided into two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Creation of configuration items (CI).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Binding the nodes of primary monitoring systems (for further binding of triggers).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I improved my script, made various checks, added the automatic creation of CIs and configured bindings between them, compiled and ran the scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--etjTO4_g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2854/0%2ANxR2yS2jwyj8_umZ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--etjTO4_g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2854/0%2ANxR2yS2jwyj8_umZ" alt="" width="880" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The execution of this script showed me the Service Model graph, which displays the statuses of all components and the health of the system in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--603mvn4d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AAg4D8c6FCbrcUzOc" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--603mvn4d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AAg4D8c6FCbrcUzOc" alt="" width="880" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I added new CIs to test auto discovering and immediately saw them on the map. To be sure that the system didn’t lie to me, I checked the addition of the new items in the Event Log. And wow! All the new CIs were added automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EcIrGIPm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjzVCuee01paZlAqY" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EcIrGIPm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjzVCuee01paZlAqY" alt="" width="711" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But I would not be me if I limited myself to auto discovering.&lt;/p&gt;

&lt;h2&gt;Auto rules and auto actions&lt;/h2&gt;

&lt;p&gt;As you have already understood, Acure takes over some tasks and automates routine processes that were previously performed manually. I decided to check whether the system itself would be able to carry out an algorithm of certain actions and rules triggered by certain changes in the CI.&lt;/p&gt;

&lt;p&gt;In the “Rules and Actions” section I added a New Rule for all CIs and events with a priority of 2 or higher.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--51KzoefW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2858/0%2A0vIO1uOJk1_is0v5" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--51KzoefW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2858/0%2A0vIO1uOJk1_is0v5" alt="" width="880" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I configured an automatic action that will be performed if this rule is triggered. In my case, these were two notifications by e-mail with an interval of 2 and 30 minutes and the execution of an auto-repair script if the incident has not been fixed during this time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DIi4UdKT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2854/0%2AQIXqnDE29t379vFn" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DIi4UdKT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2854/0%2AQIXqnDE29t379vFn" alt="" width="880" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alert templates are also configurable inside Acure. I used the preconfigured template but was also able to add any text and attach the necessary files using Markdown and HTML for markup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JcLuXzpO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjQ6yfLGoka_21yil" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JcLuXzpO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AjQ6yfLGoka_21yil" alt="" width="693" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, when an event with a high priority is detected, this auto-action script should be triggered: the necessary notifications get sent and the auto-healing will run. Automatically! While you’re comfortably drinking coffee with your legs on the table.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this article, using the Acure platform as an example, I showed why there is no reason to be nervous when it comes to delegating routine work to machines and how it can free up your time for more interesting tasks.&lt;/p&gt;

&lt;p&gt;The system has a convenient and intuitive interface and several cool features: the ability to integrate with other monitoring systems, topology, low-code, the status of the entire infrastructure on one screen, and full automation of processes ranging from setting configuration items for monitoring to running automatic scripts. When an event enters the system, it changes the Service Model itself or its state. Acure automatically allows the tracking of all these changes and takes actions related to them. Objects are automatically added, primary local monitoring system triggers are automatically bound, escalation policies are automatically propagated, system health is automatically calculated, and automation scripts are run.&lt;/p&gt;

&lt;p&gt;It’s great how such a simple solution can be so powerful. This is the result when monitoring systems are made by those who use these systems themselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IJebFJwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AHoHgGDT4Ns9aIEZv" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IJebFJwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AHoHgGDT4Ns9aIEZv" alt="" width="682" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also liked the development team — the guys were responsive, quickly answered questions and provided all the necessary information and licenses. Even though Acure is a young platform and it has something to strive for, I see huge potential.&lt;/p&gt;

&lt;p&gt;Soon the guys should release the new SaaS version, which I’ll be looking forward to while drinking coffee, of course ;)&lt;/p&gt;

&lt;p&gt;But for now, share your thoughts and feedback, and see you in the comments.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>machinelearning</category>
      <category>bigdata</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
