
Auto discovering and auto actions in data monitoring or How to drink coffee instead of routine tasks

Hi! I am Rick Latham, an IT engineer responsible for data monitoring at an American telecom company. Until today, I was just a humble reader and shy content consumer, but after spending years of my life on routine manual activities and finally discovering a solution, I felt compelled to share. This is my first article here, and I hope you will find it interesting and useful. I look forward to your feedback, but for now, make yourself comfortable, pour some coffee and enjoy!

Btw, about coffee…

Drinking coffee with your feet on the table while every element of the system glows green is the dream of every engineer doing IT monitoring. But in my experience, that is usually all it is: a dream. In my work, I support more than 100 different systems, services, and servers, and this infrastructure constantly expands, changes, and acquires new items and connections. I work with Zabbix, and although it meets my performance requirements, I need to monitor not only the IT infrastructure but also myself: if I don't put new elements under monitoring, I will not receive alerts when one of them stops answering. At least not until angry users simulate a DDoS attack on my department's external interface because one of the services is down.

To avoid such incidents, and to stop waking up in a cold sweat trying to remember whether I put a new configuration item under monitoring, I decided to find out whether these processes could be automated and the routine work handed over to machines. This is how I found Acure. No, it is not a typo, it's the name of an AIOps platform — the protagonist of this story.

What is AIOps?

I first heard the term “Artificial Intelligence for IT operations”, or “AIOps”, a few years ago thanks to Gartner (who else?). Here is their definition:

AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.

In short, there are two aspects to AIOps — big data and machine learning. And at this point you would normally see a clever diagram.

Ah, here it is.

*The source: Gartner, Inc*

It was clear that my case would benefit from an AIOps platform, but I needed to pick the right one. The problem I ran into was a serious cost barrier, since I don't have an Elon Musk-sized budget. So I started looking into free solutions and found Acure.io.

A cure?

A little background on Acure. Acure is a new incident control and automation platform created by a company out of Latvia, and boy did they pack in a lot of features. It caught my attention because they market it as being made for engineers by engineers, which was huge for me: no one understands the pain points and the insides of IT monitoring better than a fellow engineer. It boasts a flexible and open architecture, root cause and impact analysis, topology models, a single screen for the state of the entire IT system, integrations with popular monitoring systems (including Zabbix, of course), and a low-code engine. It all sounded great, especially the low-code part: even though I'm an IT wizard, I am no master coder, so being able to make all the configurations and changes by myself was a big plus.

Acure will soon release a SaaS version, but they also have an on-premise license for enterprises. I contacted the team, explained my needs, and they kindly provided an on-premise version so I could test it. I got to configure data streams, automate the discovery of configuration items, put them under monitoring, and even run auto-healing scripts. All of these processes are fully described in the informative Acure User Guides.

Below I describe only my own case and show my interface, settings, buttons, and scripts, so don't be put off by the amount of configuration. Once configured, Acure will make your future work much easier.

As I mentioned, Acure is becoming a SaaS solution, with on-premise versions provided on request. Therefore, I will not focus on the installation process (in the SaaS version it will simply mean creating a space) and will get straight to my points.

Aggregation and analysis of events from external systems

First, I set up a Data Stream to receive data from the external system. In Acure, this process is easy and intuitive: in the Data Collection — Data Streams section, just press the +Add stream button and fill in the fields.

A great feature is that Acure lets you choose a template for a popular monitoring system, with preconfigured tasks and handlers. You don't need to puzzle over a complex configuration; the system does everything for you.

As I've already said, I work with Zabbix, so I chose its template.

Then I connected Zabbix and Acure by entering the connection URL and logging into Zabbix.

The configuration template already contains tasks for binding Zabbix to Acure, but in my case I needed to add one more custom task to pull data from Zabbix for building the topology later. I wrote the script in YAML, making a request to the Zabbix API:


```yaml
jobs:
  - steps:
      - run: 'curl -H "Content-Type: application/json" -s --request POST --data-raw "${zabbixGetHostsJson}" ${zabbixUri}'
        env:
          zabbixUri: https://<zabbix-domain>/api_jsonrpc.php
          zabbixGetHostsJson: >
            {
            "jsonrpc": "2.0",
            "method": "host.get",
            "params": {
                "output": "extend",
                "selectGroups": "extend"
            },
            "id": 1,
            "auth": "<zabbix-token>"
            }
        outputs:
          data: $._outputs.shell
    artifacts:
      - data: '{{ outputs.data }}'
        send-to:
          api:
            uri: https://<acure-domain>/api/public/cl/v1/stream-data
            headers:
              x-smon-stream-key: $.vars.stream.key
              x-smon-userspace-id: $.userspaceId
            media-type: application/json
```

This script can easily be reused as a template: insert it manually and change the URL of the primary system, the data source, and the token to match your environment.

The API token itself was taken from Zabbix.
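If you prefer the command line, here is a minimal sketch of fetching a token via the Zabbix API. This is just an illustration, not Acure documentation: it assumes Zabbix 5.4 or later (older versions use the parameter `user` instead of `username`), and `<zabbix-domain>` and `<password>` are placeholders.

```bash
# Minimal sketch: log in via the Zabbix API to obtain an auth token.
# Assumes Zabbix 5.4+; the "result" field of the JSON response is the
# token to paste into the "auth" field of the task above.
curl -s -H "Content-Type: application/json" \
  --request POST \
  --data '{"jsonrpc": "2.0", "method": "user.login", "params": {"username": "Admin", "password": "<password>"}, "id": 1}' \
  https://<zabbix-domain>/api_jsonrpc.php
```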

I then saved the changes and voila! The stream was added to Data Streams.

The magic doesn't stop there, though. The topology data collection task cannot run on the internal agent because of Security Policies, so I had to connect an external agent before running it. That might sound a little complicated, but it really isn't. Once this step is taken care of, you won't need to worry about it again.

I then added a New Coordinator and set up the custom agent inside it.

After setting it up, I returned to the Data Stream and switched the agent in the manually added custom task to the newly created one.

Finally, I started the stream. In Events and Logs I could check the events to make sure that the data was being collected by the system.

Drilling down into one of the events reporting a problem, I saw not only its table view, with information about all the elements contained in Zabbix, but also the standard Zabbix JSON structure, for example:

*Info about the Zabbix hosts*

*JSON structure for one of the Zabbix events*
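For reference, an abbreviated host entry from the host.get response looks roughly like this. The values here are hypothetical, and the exact field set varies by Zabbix version:

```json
{
  "hostid": "10084",
  "host": "web-server-01",
  "name": "Web server 01",
  "status": "0",
  "groups": [
    { "groupid": "2", "name": "Linux servers" }
  ]
}
```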

Based on the events coming from the primary monitoring system, I built the Resource Service Model.

Building the Resource Service Model

The resource service model (RSM) is a list of configuration items (CIs) from the system and the connections between them. In Acure, the RSM is based on topology, but I won't get ahead of myself. Let's get back to the JSON structure. The following parameters are what the RSM is built on.

For the further automatic creation of configuration items, I determined the following mapping (see the sketch after this list):

CI’s name == host.name (Zabbix host name)

Parent CI’s name == host.groups[0].name

Related object => Zabbix node
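As a quick sanity check of that mapping (purely an illustration, not part of Acure), you can pipe the same host.get payload through jq and preview the parent-child pairs that the scenario will create. The environment variables are the ones defined in the data stream task above:

```bash
# Illustrative only: preview the "parent CI -> CI" pairs the scenario
# will create, reusing zabbixUri and zabbixGetHostsJson from the task above.
curl -s -H "Content-Type: application/json" \
  --request POST \
  --data-raw "${zabbixGetHostsJson}" "${zabbixUri}" \
  | jq -r '.result[] | "\(.groups[0].name) -> \(.name)"'
```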

In the Automation section I added a new scenario for my Working Group.

The process of scenario creation is clearly described in the documentation. After reading it and looking at a couple of examples, anyone can create a scenario of any complexity without writing any serious code. The low-code engine makes this process very easy for non-programmers like me: just add blocks, bind them, and create the script.

Each scenario starts with the “OnLogEvent” block, which receives the topology events. Mine was no exception.

The scenario then checks the stream: if the stream name matches the value in the script (in my case, Zabbix Sync), execution continues; if not, the script stops there.

To grow the topology “bush” from one information system, I manually created a root CI through the CI creation option in the Service Model Graph tab.

Having created the CI, I took its ID from the link and pasted it into the next block of the script.

When the scenario runs, all groups are linked to this root CI, which displays the health of the entire system on the topology.

I don't want to spoil the topology just yet, so let's return to the creation of a single script, which can be roughly divided into two parts:

  1. Creation of configuration items (CI).

  2. Binding the nodes of primary monitoring systems (for further binding of triggers).

I improved my script, added various checks, set up the automatic creation of CIs, configured the bindings between them, and then compiled and ran the scenario.

The execution of this script showed me the Service Model graph, which displays the statuses of all components and the health of the system in real time.

I added new CIs to test the auto discovery and immediately saw them on the map. To be sure the system wasn't lying to me, I checked the Event Log for the newly added items. And wow! All the new CIs had been added automatically.

But I wouldn't be me if I stopped at auto discovery.

Auto rules and auto actions

As you have probably gathered by now, Acure takes over tasks and automates routine processes that were previously performed manually. I decided to check whether the system could carry out a set of actions and rules triggered by specific changes in a CI.

In the “Rules and Actions” section I added a New Rule for all CIs and events with a priority of 2 or higher.

I configured an automatic action to be performed whenever this rule is triggered. In my case, that was two e-mail notifications, sent 2 and 30 minutes after the event, plus the execution of an auto-repair script if the incident had not been fixed by then.
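In plain shell terms, the escalation logic amounts to something like the sketch below. This is only an illustration of the timing, not Acure's engine; `send_mail`, `incident_open`, and `auto_repair.sh` are hypothetical placeholders for your own notification, status-check, and repair tooling.

```bash
# Hedged sketch of the escalation timing, not Acure's implementation.
# send_mail, incident_open, and auto_repair.sh are hypothetical placeholders.
sleep 120    # 2 minutes after the event
incident_open "$INCIDENT_ID" && send_mail "Incident $INCIDENT_ID still open (2 min)"
sleep 1680   # up to the 30-minute mark
incident_open "$INCIDENT_ID" && send_mail "Incident $INCIDENT_ID still open (30 min)"
incident_open "$INCIDENT_ID" && ./auto_repair.sh "$INCIDENT_ID"
```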

Alert templates are also configurable inside Acure. I used the preconfigured template, but I was also able to add my own text and attach the necessary files, using Markdown and HTML for markup.
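For instance, a notification body might look something like this. The `{{...}}` variable names here are made up for illustration; Acure's actual template variables are listed in its documentation:

```markdown
<!-- Hypothetical alert body; the {{...}} names are invented for illustration -->
**Incident:** {{incident_name}}
**Priority:** {{priority}}
**Affected CI:** {{ci_name}}

See the runbook: [auto-repair steps](https://wiki.example.com/runbooks/auto-repair)
```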

As a result, when a high-priority event is detected, this auto-action is triggered: the necessary notifications get sent and the auto-healing script runs. Automatically! While you're comfortably drinking coffee with your feet on the table.

Conclusion

In this article, using the Acure platform as an example, I showed why there is no reason to be nervous about delegating routine work to machines, and how doing so can free up your time for more interesting tasks.

The system has a convenient, intuitive interface and several cool features: integrations with other monitoring systems, topology, low-code, the status of the entire infrastructure on one screen, and full automation of processes, from putting configuration items under monitoring to running automatic scripts. When an event enters the system, it changes the Service Model itself or its state. Acure automatically tracks all these changes and takes the related actions: objects are added, triggers from the primary monitoring systems are bound, escalation policies are propagated, system health is calculated, and automation scripts are run.

It's great how such a simple solution can be so powerful. This is what happens when monitoring systems are made by the people who use them.

I also liked the development team: the guys were responsive, quickly answered questions, and provided all the necessary information and licenses. Even though Acure is a young platform and still has room to grow, I see huge potential.

Soon the guys should release the new SaaS version, which I’ll be looking forward to while drinking coffee, of course ;)

But for now, share your thoughts and feedback. See you in the comments!
