Kix Panganiban
Get Started with Uptime Monitoring using Bantay

One of the key metrics in DevOps is availability: the share of a given period during which your service or app is accessible. Availability is often paired with scalability, a measure of how well your service performs as the number of users grows. Both feed into observability, a term borrowed from control theory for the practice of inferring the internal state of a system from external observations. We'll get back to observability in a later post; in this one, we'll focus on just availability and how to get started with it.

The most straightforward way of measuring availability is by measuring service uptime. Often, DevOps engineers and SREs aim to achieve the five-nines of availability, which means that a service is available 99.999% of the time.
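To put five-nines in perspective, here is a small sketch that converts an availability target into a downtime budget. The function name downtime_budget is made up for this illustration, and the period is assumed to be a 365-day year.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return how long a service may be down and still meet the target."""
    return period * (1 - availability_pct / 100)


year = timedelta(days=365)
for target in (99.0, 99.9, 99.99, 99.999):
    # Each extra nine divides the allowed downtime by ten.
    print(f"{target}% -> {downtime_budget(target, year)} per year")
```

At 99.999%, the budget works out to roughly 5 minutes and 15 seconds of downtime per year.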

Let's define a few goals:

  1. We can see if a service is "up" by performing an HTTP GET request on a known endpoint
  2. We get notified whenever a service "goes down" or "comes back up" (i.e. its state of availability changes)
  3. And finally, we can log all of these somewhere for posterity

Introducing Bantay

Some time back, I needed to achieve pretty much those same three goals under a couple of constraints: one, the solution had to be cheap (or free), and two, I wanted total control over my data and over how I perform my monitoring. While solutions such as Pingdom, Rollbar, New Relic, and Statuspage exist, none of them are completely free and none of them offer complete control over my data. Hence, I built my own: Bantay.

Bantay on GitHub

Bantay aims to be a lightweight, extensible uptime monitor with support for alerts and notifications.

It's very easy to get started. First, we write a configuration file called checks.yml:

---
server:
  poll_interval: 10
checks:
  - name: Dev.to
    url: https://dev.to/
    valid_status: 200
    body_match: dev
  - name: Local Server
    url: http://localhost:5555/
    valid_status: 200
reporters:
  - type: log

Let's go through the YAML file section by section:

server:
  poll_interval: 10

Here we define a server section, and we give it a poll_interval of 10. When we run Bantay in server mode later, this is the interval at which it will perform uptime checks.

checks:
  - name: Dev.to
    url: https://dev.to/
    valid_status: 200
    body_match: dev
  - name: Local Server
    url: http://localhost:5555/
    valid_status: 200

Next we define a checks section with a couple of entries: Dev.to and Local Server. The fields are pretty self-explanatory: url is the endpoint on which Bantay will perform an HTTP GET to check uptime, valid_status is the HTTP status code we expect to get, and body_match is an optional string we expect to see in the response body.
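To make those three fields concrete, here is a rough Python sketch of how such a check could be evaluated. This is not Bantay's actual code; the Check class and the is_up and run_check functions are hypothetical names used only for illustration, and treating every network error as "down" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
from urllib import error, request


@dataclass
class Check:
    # Mirrors the fields of one entry under `checks:` in checks.yml.
    name: str
    url: str
    valid_status: int = 200
    body_match: Optional[str] = None


def is_up(check: Check, status: int, body: str) -> bool:
    """Status must equal valid_status; body_match, if set, must appear in the body."""
    if status != check.valid_status:
        return False
    return check.body_match is None or check.body_match in body


def run_check(check: Check, timeout: float = 5.0) -> bool:
    """Perform the HTTP GET and evaluate it; network failures count as down."""
    try:
        with request.urlopen(check.url, timeout=timeout) as resp:
            return is_up(check, resp.status, resp.read().decode("utf-8", "replace"))
    except error.HTTPError as exc:
        # urlopen raises on 4xx/5xx, so evaluate the error response as well.
        return is_up(check, exc.code, exc.read().decode("utf-8", "replace"))
    except (error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False
```

With this sketch, run_check(Check("Local Server", "http://localhost:5555/")) would return False whenever nothing is listening on port 5555.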

reporters:
  - type: log

In the reporters section, we put one object with the type log. This will log the check results to stderr/stdout.

Before we actually start Bantay, let's quickly start a Python HTTP server listening on port 5555 locally (for our Local Server check):

# on Py2
$ python -m SimpleHTTPServer 5555
# on Py3
$ python3 -m http.server 5555

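For macOS users: Docker for Mac runs containers inside a virtual machine, so localhost inside the container does not point to your Mac. Modify checks.yml to use http://docker.for.mac.host.internal:5555/ instead of http://localhost:5555/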

Finally, we pull the latest Bantay Docker image, and run a check:

$ docker run -v "$(pwd)/checks.yml":/opt/bantay/bin/checks.yml --net=host fipanganiban/bantay:latest bantay check

We should get something similar to:

[Screenshot: your first Bantay check]

Looks good!

If we kill the running Python server and run the Bantay check again, we should get:

[Screenshot: a failed Bantay check]

Bantay Server

A one-off check does little to help us measure availability. Most of the time, we want to perform these checks regularly and get notified whenever something goes down after a check. For that, we run Bantay in server mode:

# start the local Python HTTP server again
$ python3 -m http.server 5555
# and start Bantay in server mode
$ docker run -v "$(pwd)/checks.yml":/opt/bantay/bin/checks.yml --net=host --name bantay fipanganiban/bantay:latest bantay server

We can also add a Slack reporter to let us know when a service goes down. Add the following to the bottom of your checks.yml file (replacing YOUR-SLACK-CHANNEL-HERE and YOUR-SLACK-TOKEN-HERE):

  - type: slack
    options:
      slack_channel: YOUR-SLACK-CHANNEL-HERE
      slack_token: YOUR-SLACK-TOKEN-HERE
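For a sense of what a Slack reporter needs those two values for, here is a sketch that builds a request to Slack's chat.postMessage Web API method. How Bantay itself talks to Slack is not shown in this post, so treat this as an assumption; build_slack_request is a hypothetical helper, and the token is assumed to be one that is allowed to post to the channel.

```python
import json
from urllib import request

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token: str, channel: str, text: str) -> request.Request:
    """Build (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return request.Request(
        SLACK_POST_URL,
        data=payload,
        headers={
            # The token is sent as a Bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Sending it is then a matter of passing the request to request.urlopen with a real token and channel.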

Now, when we kill the Python server again, Bantay should detect that it went down and we get a handy notification through Slack:

[Screenshot: Slack down alert]

And if we start the Python server again, Bantay should detect that as well:

[Screenshot: Slack up alert]
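Notice that we get alerted only when the state changes, not on every poll. A minimal sketch of that idea follows. StateTracker is a made-up name for illustration, not something from Bantay, and alerting on a first observation that is already down is a design choice assumed here.

```python
from typing import Dict, Optional


class StateTracker:
    """Remembers the last result per check and reports only transitions."""

    def __init__(self) -> None:
        self._last: Dict[str, bool] = {}

    def update(self, name: str, up: bool) -> Optional[str]:
        """Record the latest result; return "down" or "up" on a change, else None."""
        previous = self._last.get(name)
        self._last[name] = up
        if previous is None:
            # First observation: alert only if the service starts out down.
            return None if up else "down"
        if previous == up:
            return None
        return "up" if up else "down"


tracker = StateTracker()
# Simulated poll results: up, up, down, down, up.
for result in (True, True, False, False, True):
    event = tracker.update("Local Server", result)
    if event:
        print(f"Local Server is {event}")
```

Run as is, this prints one "down" line and one "up" line, even though five results were observed.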

Final notes

And that's it! You should now be able to set up basic uptime checks with Bantay in just a few lines of YAML. At the time of writing, Bantay also supports notifying via email (using Mailgun) and sending metrics to InfluxDB (for graphing and storing history). Learn more about all its current features, and how to build Bantay as a binary, in its GitHub repo: https://github.com/kixpanganiban/bantay
