
gdcohen


This Slack App Speeds Up Incident Resolution Using ML

If your team (like many others) uses Slack to collaborate during incident management and triage, this new Slack app will be a big time-saver.

Let's say a monitoring tool in your environment has triggered an incident, and you're now working with colleagues to figure out what went wrong and how to resolve it. You start digging into logs and metrics, looking for anything unusual, with only the incident symptoms and a hunch to guide your searches. Incident response is a big and growing pain point for teams managing modern applications, contributing to lost productivity, stress and a significant economic cost overall.

Zebrium's machine learning platform consumes a feed of raw logs and metrics and automatically learns their data structures, event types and normal patterns. It detects anomalies and uses a virtual tracing technique to identify correlations between the anomalies that characterize real incidents, which lets Zebrium detect incidents on its own. But for the moment, let's continue with the scenario described above: you already have an incident identified by alerts from some other tool. Zebrium can also consume an external incident signal to inform and correlate its incident detection and root cause analysis. To do this, just enable the Slack app in your workspace, configure the Zebrium data collectors (one helm command in k8s), and type one command in your Slack channel (or get your bot to do it) to ask Zebrium for help.
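To make the idea of correlating anomalies into an incident a little more concrete, here is a toy sketch in Python. It is not Zebrium's algorithm: the `Anomaly` type, the `group_anomalies` function, the 60-second window and the "at least two services" rule are all assumptions chosen purely for illustration. It simply groups anomalies that occur close together in time and keeps only the groups that span more than one service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical anomaly record; not Zebrium's data model."""
    timestamp: float  # seconds since epoch
    service: str
    message: str


def group_anomalies(anomalies, window_seconds=60.0, min_services=2):
    """Cluster anomalies separated by gaps of at most window_seconds,
    then keep clusters that touch at least min_services distinct services."""
    clusters = []
    current = []
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        # A gap larger than the window closes the current cluster.
        if current and anomaly.timestamp - current[-1].timestamp > window_seconds:
            clusters.append(current)
            current = []
        current.append(anomaly)
    if current:
        clusters.append(current)
    return [c for c in clusters if len({a.service for a in c}) >= min_services]


# Made-up sample data: three related anomalies and one isolated one.
EXAMPLE = [
    Anomaly(1000.0, "db", "connection pool exhausted"),
    Anomaly(1020.0, "api", "upstream timeout"),
    Anomaly(1045.0, "web", "HTTP 502 spike"),
    Anomaly(5000.0, "cron", "rare log line"),
]

incidents = group_anomalies(EXAMPLE)
```

With this sample data, the three anomalies in `db`, `api` and `web` land in one candidate incident, while the isolated `cron` anomaly is dropped because it involves a single service. A real system would need far more than time proximity, but the sketch shows why correlated anomalies are a stronger signal than any single unusual log line.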

Zebrium will pull together an incident report describing the sequence of anomalous events, the related anomalous metrics, the services and nodes participating in the incident, and the worst symptom of the incident. Combined, these form a "virtual trace". Typically, the report gives you enough detail and context to identify the root cause. When needed, it also enables near-instant drill-down to examine the surrounding events and metrics. Because the "virtual trace" has already narrowed down the context and time range, you get to resolution much faster than with a hunch and blind searching.
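As a rough mental model of what such a report contains, the sketch below represents it as a small Python data structure. The `Event` and `IncidentReport` names, their fields and the numeric `severity` scale are hypothetical and are not Zebrium's actual schema or API. The point is only to show how a report of this shape yields the participating services and nodes, the worst symptom and a narrowed time range.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical anomalous event; fields are assumptions."""
    timestamp: float  # seconds since epoch
    service: str
    node: str
    message: str
    severity: int  # higher is worse; made-up scale


@dataclass
class IncidentReport:
    """Hypothetical report shape, loosely following the description above."""
    events: list
    anomalous_metrics: list = field(default_factory=list)

    def services(self):
        return sorted({e.service for e in self.events})

    def nodes(self):
        return sorted({e.node for e in self.events})

    def worst_symptom(self):
        # Assumption: "worst" means the highest severity value.
        return max(self.events, key=lambda e: e.severity)

    def time_range(self):
        stamps = [e.timestamp for e in self.events]
        return (min(stamps), max(stamps))


# Made-up sample report.
report = IncidentReport(
    events=[
        Event(1000.0, "db", "node-1", "connection pool exhausted", 2),
        Event(1020.0, "api", "node-2", "upstream timeout", 3),
        Event(1045.0, "web", "node-2", "HTTP 502 spike", 5),
    ],
    anomalous_metrics=["db.connections.active", "api.latency.p99"],
)
```

Here `report.time_range()` gives a 45-second window to drill into, which is the practical benefit of having the context narrowed before anyone starts searching.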

Here's how it works

1 - Set up the Zebrium data collectors. This takes 2 minutes and a single helm command for k8s environments, or a couple of steps for other environments.

2 - Install the Slack app for your workspace. You can do this from Zebrium settings.

Zebrium Slack App

3 - Now let's say you set up a Slack channel for your Virtual War Room. You get the team together, and while looking at data from the APM tool you see some troubling stat.

Slack with Zebrium to augment incidents with root cause

4 - Type "/zebrium incident analyze" (with the option to specify the incident time) to call on Zebrium for analysis and root cause.

PagerDuty with Zebrium to augment incidents with root cause 2

5 - Simply click to see the full incident details.
Zebrium Incident showing root cause and correlated metrics anomalies

Let Zebrium take care of MTTR

Using the Zebrium Slack App, you can now leverage Zebrium ML to augment incidents that have been detected by any tool. Our virtual tracing will automatically identify the impacted services and nodes, and pull in the sequence of anomalous events and metrics that best describe the incident root cause. And you'll get this without all the hunting, scrambling and adrenaline normally associated with a war room. Now that's MTTR!

You can learn more or get started for free by visiting https://www.zebrium.com.

Posted with permission of the author: Ajay Singh
