lbonanomi
Tin Ear: Reducing alert email noise with python

Or maybe you shouldn't bring me every little piece of trash you happen to pick up.
-Fight Club

Established IT shops always seem to have a lumpy collection of shell scripts that monitor at least part of the infrastructure and email the admins. Email scripts can be great because they're quick to set up and tweak, but there's usually no throttling of alerts and no indication that the problem has cleared, and too-frequent alarms just become noise to ops.

Let's use whisper time-series databases to wrap /bin/mail and cut our alert noise.

Alerting on confirmed issues

Let's agree on a few rules for our mail wrapper:

  • An error condition must appear 3 times before we alert, so we don't keep crying wolf.
  • After the initial alert, we should send follow-ups much less frequently to reduce alarm fatigue.
  • If the error condition ceases and we get three good results, we will send a stand-down message.
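
The three rules above can be sketched as one small pure function before any storage is involved. This is only an illustration of the rules as written: the name decide and its in_followup flag are hypothetical, and the slot values follow the 0/1 convention used in the rest of this post.

```python
def decide(slots, in_followup):
    """Return 'alert', 'all_clear', or None for the three most recent check states.

    slots: list of three values, 1 for a failed check, 0 for an okay check,
           None for a window that has no result yet.
    in_followup: True once an initial alert has already been sent.
    """
    if slots.count(1) >= 3:
        # Three failures on record: send the initial alert or a follow-up.
        return 'alert'
    if in_followup and slots.count(0) >= 3:
        # Three good results after an alert: tell everyone to stand down.
        return 'all_clear'
    return None
```

Keeping the decision separate from the database makes the rules easy to check by hand: decide([1, 0, 1], False) stays quiet, and three okay results only trigger a stand-down message when an alert was sent earlier.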

We'll start our mail wrapper project by creating a whisper database named after the subject of the mail that would be sent:

import base64
import os
import sys
import whisper

RETAINER = [(300, 3)]                       # [(seconds_in_period, slots_in_period)]

whisper_db_dir = '/var/tmp/whisperDB/'
# URL-safe base64 keeps '/' out of the file name; encode()/decode() handle Python 3 strings
subject_key = base64.urlsafe_b64encode(sys.argv[1].encode()).decode()
whisper_db_name = whisper_db_dir + subject_key + '.wsp'

if not os.path.exists(whisper_db_dir):
    os.mkdir(whisper_db_dir)

if not os.path.exists(whisper_db_name):
    whisper.create(whisper_db_name, RETAINER, aggregationMethod='last')

This chunk creates the directory /var/tmp/whisperDB and an empty whisper database for each alert email subject. Each database holds three 5-minute time "windows" of check states, which we can update with simple function calls: whisper.update(whisper_db_name, 0) for okay states and whisper.update(whisper_db_name, 1) for failures.
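
To get a feel for how three fixed windows behave, here is a toy model in plain Python. It is not the whisper API, and the SlotBuffer class is hypothetical. It only imitates two behaviours this post relies on, as I understand them: a timestamp is rounded down to the start of its window, and a later write to the same window replaces the earlier one.

```python
class SlotBuffer:
    """Toy stand-in for a single-archive whisper file (illustration only)."""

    def __init__(self, seconds_per_point, points):
        self.seconds_per_point = seconds_per_point
        self.points = points
        self.slots = {}                      # window start time -> last value written

    def update(self, value, timestamp):
        # Round the timestamp down to the start of its window.
        window = timestamp - (timestamp % self.seconds_per_point)
        # A second write to the same window overwrites the first one.
        self.slots[window] = float(value)

    def fetch(self, now):
        # Return the newest `points` windows, oldest first; None means "no data".
        newest = now - (now % self.seconds_per_point)
        starts = [newest - i * self.seconds_per_point
                  for i in reversed(range(self.points))]
        return [self.slots.get(start) for start in starts]


buf = SlotBuffer(300, 3)
buf.update(1, 1000)      # lands in the window starting at 900
buf.update(0, 1100)      # same window, so the failure is replaced by an okay
buf.update(1, 1300)      # lands in the window starting at 1200
recent = buf.fetch(1600)
```

In this model, two checks inside the same 5-minute window count once, and results older than three windows simply fall out of view.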

Next, let's determine whether the check has failed often enough to alert on. If the same check has failed 3 times (once in each window), we send an initial email and then swap our 5-minute status windows for 15-minute ones so admins don't start tuning out repetitive mails:

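# db_name is the path of the whisper file; FOLLOWUP = [(900, 3)] is defined in the full script
# 315550800 is a from-time far in the past, so every stored slot is returned
(times, fail_buffer) = whisper.fetch(db_name, 315550800)

if fail_buffer.count(1) > 2:
    # Build a replacement database with 15-minute windows and swap it into place
    new_whisper_db_name = db_name + '.wsp2'
    whisper.create(new_whisper_db_name, FOLLOWUP, aggregationMethod='last')
    whisper.update(new_whisper_db_name, 1)
    os.rename(new_whisper_db_name, db_name)

    for admin in sys.argv[2:]:
        os.system('mail -s "' + sys.argv[1] + '" ' + admin + ' </dev/null')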
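
Building a shell command by string concatenation breaks as soon as a subject contains a double quote. As a hedged alternative (not what the script in this post uses), the same mail call can be made without a shell through Python 3's subprocess module. The helper names build_mail_command and send_alert are hypothetical.

```python
import subprocess


def build_mail_command(subject, recipient):
    """Return the argument list for mail; no shell parses it, so quotes are harmless."""
    return ['mail', '-s', subject, recipient]


def send_alert(subject, recipients):
    """Send one empty-bodied message per recipient."""
    for recipient in recipients:
        # stdin=DEVNULL plays the role of '</dev/null' in the shell version.
        subprocess.run(build_mail_command(subject, recipient),
                       stdin=subprocess.DEVNULL, check=False)
```

With this in place, the loop over sys.argv[2:] could become a single call such as send_alert(sys.argv[1], sys.argv[2:]).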

While we're here, let's also be ready to send an all-clear message if the database is currently using the tripled (15-minute) windows and the last 3 checks have returned an okay status:

(times, fail_buffer) = whisper.fetch(db_name, 315550800)

if fail_buffer.count(1) == 0:
    if whisper.info(db_name)['archives'][0]['secondsPerPoint'] == FOLLOWUP[0][0]:
        new_whisper_db_name = db_name + '.wsp2'
        whisper.create(new_whisper_db_name, RETAINER, aggregationMethod='last')
        whisper.update(new_whisper_db_name, 0)
        os.rename(new_whisper_db_name, db_name)

        for admin in sys.argv[2:]:
            os.system('mail -s "' + sys.argv[1] + '" ' + admin + ' </dev/null')

I couldn't find a way to capture $? outside of a shell, so we'll hack around this with some symlinks. Let's link the base script tinear to a name that indicates a failed check (tinear.nok) and a name that indicates a successful check (tinear.ok), so we can update our whisper database according to the name of the symlink that gets called:

if os.path.basename(sys.argv[0]) == "tinear.ok":
    whisper.update(whisper_db_name, 0)

if os.path.basename(sys.argv[0]) == "tinear.nok":
    whisper.update(whisper_db_name, 1)
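
The two name checks above can also be written as one lookup, which makes it explicit what happens when the script is called under any other name. This is a small sketch; STATE_BY_NAME and state_from_argv0 are hypothetical names, not part of the script below.

```python
import os

# Link name -> value to record: 0 for an okay check, 1 for a failed one.
STATE_BY_NAME = {'tinear.ok': 0, 'tinear.nok': 1}


def state_from_argv0(argv0):
    """Return 0 or 1 for a known link name, or None for any other name."""
    return STATE_BY_NAME.get(os.path.basename(argv0))
```

A caller would only run whisper.update when the result is not None. The links themselves need to be created once, for example with ln -s tinear tinear.ok and ln -s tinear tinear.nok in the directory that holds the script.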

Now we can alter our existing shell scripts to use bash's && and || operators as a makeshift ternary, calling tinear the way we would make a normal call to mail -s:

some_check && tinear.ok "some_check is recovered" admin@example.com || tinear.nok "some_check has failed" admin@example.com

Let's put it all together:

#!/bin/python
"""alert concentrating mail handler"""

import base64
import os
import whisper
import sys

RETAINER = [(300, 3)]                       # [(seconds_in_period, slots_in_period)]
FOLLOWUP = [(900, 3)]

def waterlevel(db_name):
    """Reduce alert frequency after initial alert, reset on all-clear"""

    (times, fail_buffer) = whisper.fetch(db_name, 315550800)

    if fail_buffer.count(1) > 2:
        new_whisper_db_name = db_name + '.wsp2'
        whisper.create(new_whisper_db_name, FOLLOWUP, aggregationMethod='last')
        whisper.update(new_whisper_db_name, 1)
        os.rename(new_whisper_db_name, db_name)

        for admin in sys.argv[2:]:
            os.system('mail -s "' + sys.argv[1] + '" ' + admin + '</dev/null')

    if fail_buffer.count(1) == 0:
        if whisper.info(db_name)['archives'][0]['secondsPerPoint'] == FOLLOWUP[0][0]:
            new_whisper_db_name = db_name + '.wsp2'
            whisper.create(new_whisper_db_name, RETAINER, aggregationMethod='last')
            whisper.update(new_whisper_db_name, 0)
            os.rename(new_whisper_db_name, db_name)

            for admin in sys.argv[2:]:
                os.system('mail -s "' + sys.argv[1] + '" ' + admin + '</dev/null')

    return 0

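whisper_db_dir = '/var/tmp/whisperDB/'
# URL-safe base64 keeps '/' out of the file name; encode()/decode() handle Python 3 strings
subject_key = base64.urlsafe_b64encode(sys.argv[1].encode()).decode()
whisper_db_name = whisper_db_dir + subject_key + '.wsp'

if not os.path.exists(whisper_db_dir):
    os.mkdir(whisper_db_dir)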

if not os.path.exists(whisper_db_name):
    whisper.create(whisper_db_name, RETAINER, aggregationMethod='last')

if os.path.basename(sys.argv[0]) == "tinear.ok":
    whisper.update(whisper_db_name, 0)

if os.path.basename(sys.argv[0]) == "tinear.nok":
    whisper.update(whisper_db_name, 1)

waterlevel(whisper_db_name)
