137Foundry

Posted on Jun 26

How to Build a Simple Aggregator for SaaS Vendor Status Pages Your Team Relies On

#tutorial #productivity #sre

By the time a medium-sized engineering organization has been in operation for two years, it depends on between twenty and forty SaaS vendors: source control, CI, observability, error tracking, feature flags, identity, email, payment processing, customer support, analytics, data warehouse. When one of them is down, the symptom often shows up as an unrelated bug somewhere else in your stack, and an engineer spends thirty minutes debugging code that is actually working before someone thinks to check the vendor's status page.

A small aggregator that polls every vendor status page on a schedule and surfaces any degraded service in a single dashboard saves that thirty minutes per incident. Over a year, for a typical team, that is dozens of debugging sessions avoided.

This is a walkthrough of how to build a basic version of that aggregator, with caveats about what to do once you outgrow it.

What you are building

The minimum viable aggregator does three things:

Reads a configured list of vendor status page URLs
Polls each URL every few minutes and parses the current overall status
Surfaces degraded or down statuses in one place (a Slack channel, a small web dashboard, or both)

The whole thing fits in 200 lines of Python or Node, runs on a $5/month VPS, and pays for itself in the first incident it catches.

Step 1: pick the status page format

The good news: most modern SaaS vendors use one of three status page providers, and each of the three exposes machine-readable status data at predictable endpoints.

Statuspage.io / Atlassian Statuspage (used by GitHub, Stripe, Twilio, hundreds of others): exposes a JSON API at <status-url>/api/v2/status.json and /api/v2/incidents.json. Documentation at statuspage.io.
Status.io (used by some smaller vendors): exposes a JSON API at <status-url>/api/v2/status.
Better Stack (formerly Better Uptime): exposes status via a public API or RSS feed.

A handful of vendors run custom status pages that require HTML scraping. For those, the simplest approach is a regex against the page that extracts a status string. It is brittle but it works as a fallback.
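The regex fallback described above could be sketched roughly as follows. The phrase list and the mapping onto Statuspage-style indicator values are assumptions for illustration, not a standard; each custom page will need its own pattern, and a page that lists past incidents may match the wrong phrase.

```python
import re

# Hypothetical pattern: phrases that custom status pages commonly display.
# Adjust per vendor; nothing here is standardized.
STATUS_PATTERN = re.compile(
    r"(all systems operational|degraded performance|partial outage|major outage)",
    re.IGNORECASE,
)

# Assumed mapping from the matched phrase to Statuspage-style indicators.
PHRASE_TO_INDICATOR = {
    "all systems operational": "none",
    "degraded performance": "minor",
    "partial outage": "major",
    "major outage": "critical",
}

def parse_custom_status(html):
    """Return (indicator, description) for the first phrase found in raw HTML.

    Returns ("unknown", "") when no known phrase is present.
    """
    match = STATUS_PATTERN.search(html)
    if match is None:
        return "unknown", ""
    phrase = match.group(1).lower()
    return PHRASE_TO_INDICATOR[phrase], phrase
```

Returning "unknown" rather than raising lets the caller decide whether an unparseable page deserves an alert of its own.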

Step 2: build the config

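A YAML config that lists each vendor with its status URL and the provider type. Treat the entries below as illustrative and confirm that each vendor's endpoint actually responds before relying on it: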

vendors:
  - name: GitHub
    url: https://www.githubstatus.com
    provider: statuspage
  - name: Stripe
    url: https://status.stripe.com
    provider: statuspage
  - name: Datadog
    url: https://status.datadoghq.com
    provider: statuspage
  - name: Vercel
    url: https://www.vercel-status.com
    provider: statuspage
  - name: Okta
    url: https://status.okta.com
    provider: statuspage

Add an owner field if you want to route alerts to specific teams when a vendor goes down. Add a critical boolean to distinguish vendors whose downtime should page on-call from vendors that just need to be noted.
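One way to handle the optional fields is to normalize each entry after loading the YAML. The sketch below works on plain dicts (the shape yaml.safe_load returns); the function name, the "unassigned" default, and the trailing-slash cleanup are assumptions, not part of any library.

```python
REQUIRED_FIELDS = ("name", "url", "provider")

def normalize_vendor(entry):
    """Validate one vendor entry and fill in defaults for optional fields."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"vendor entry {entry!r} is missing: {', '.join(missing)}")
    return {
        "name": entry["name"],
        # Strip a trailing slash so f"{url}/api/..." does not produce "//".
        "url": entry["url"].rstrip("/"),
        "provider": entry["provider"],
        # Hypothetical defaults: no owner, and not paging on-call.
        "owner": entry.get("owner", "unassigned"),
        "critical": bool(entry.get("critical", False)),
    }

example = normalize_vendor(
    {
        "name": "GitHub",
        "url": "https://www.githubstatus.com/",
        "provider": "statuspage",
        "critical": True,
    }
)
```

Failing loudly at startup on a malformed entry is usually preferable to discovering a KeyError in the middle of a polling cycle.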

Step 3: the polling loop

A simple Python loop:

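import time

import requests
import yaml

# Load the vendor list once at startup.
with open("vendors.yaml") as f:
    CONFIG = yaml.safe_load(f)

POLL_INTERVAL = 120  # seconds

def fetch_statuspage(url):
    """Return (indicator, description) from a Statuspage status endpoint."""
    r = requests.get(f"{url}/api/v2/status.json", timeout=10)
    r.raise_for_status()
    data = r.json()
    return data["status"]["indicator"], data["status"]["description"]

PROVIDERS = {
    "statuspage": fetch_statuspage,
    # add status_io, better_stack, etc. as needed
}

def notify(message):
    """Placeholder output; swap in a Slack webhook, PagerDuty, or email."""
    print(message)

state = {}  # vendor name -> last known indicator

while True:
    for v in CONFIG["vendors"]:
        try:
            ind, desc = PROVIDERS[v["provider"]](v["url"])
            prev = state.get(v["name"], "none")
            # Alert only when a vendor enters or changes a non-operational state.
            if ind != "none" and ind != prev:
                notify(f"{v['name']}: {desc} ({ind})")
            state[v["name"]] = ind
        except Exception as e:
            # A failed check must not stop the loop or overwrite the last known state.
            print(f"check failed for {v['name']}: {e}")
    time.sleep(POLL_INTERVAL)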

The notify function sends to whatever output you want: Slack webhook, PagerDuty, email, a log file. For a small team a Slack webhook is enough.
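A Slack version of notify might look like the sketch below. It uses only the standard library and posts the {"text": ...} JSON body that Slack incoming webhooks accept. The SLACK_WEBHOOK_URL environment variable name and the fallback to print are assumptions made for this example.

```python
import json
import os
import urllib.request

# Assumption: the webhook URL is supplied through this environment variable.
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")

def build_slack_request(message, webhook_url):
    """Build the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def notify(message, webhook_url=SLACK_WEBHOOK_URL):
    """Send the message to Slack, or print it when no webhook is configured."""
    if not webhook_url:
        print(message)
        return
    request = build_slack_request(message, webhook_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

Keeping request construction separate from sending makes the payload easy to check without touching the network.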

Step 4: respect the vendor's status page rate limits

Status page providers do not generally publish strict rate limits, but polling every 30 seconds across 30 vendors is rude. Two minutes per vendor is plenty for almost any use case. A vendor updates its status page only after its own monitoring and on-call staff have confirmed an incident, so shaving seconds off your polling interval gains little; you do not need to be the first to know.

For Atlassian Statuspage specifically, the Statuspage API documentation recommends polling no more than once per minute.
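To make the interval guidance concrete, a small guard can clamp whatever interval is configured to a per-provider floor. This is a sketch: the 60-second Statuspage floor simply encodes the recommendation cited above, the 120-second default is this article's own suggestion, and both function names are hypothetical.

```python
# Floors in seconds. The Statuspage value encodes the once-per-minute
# recommendation cited above; other providers fall back to two minutes.
MIN_INTERVAL = {"statuspage": 60}
DEFAULT_MIN_INTERVAL = 120

def effective_interval(requested, provider):
    """Never poll faster than the floor for this provider."""
    return max(requested, MIN_INTERVAL.get(provider, DEFAULT_MIN_INTERVAL))

def requests_per_hour(interval, vendor_count):
    """Total outbound status requests per hour at a given interval."""
    return vendor_count * 3600 // interval

aggressive = requests_per_hour(30, 30)   # 30-second polling, 30 vendors
polite = requests_per_hour(120, 30)      # two-minute polling, 30 vendors
```

With 30 vendors, moving from a 30-second to a two-minute interval cuts total traffic from 3600 to 900 requests per hour.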

Step 5: add the historical layer

Once the basic alerting works, the next upgrade is a simple historical log. Every time the status of any vendor changes, write a row to a CSV or a SQLite database:

timestamp, vendor, old_status, new_status, description

After a month of running, this gives you a per-vendor incident frequency and duration log. This is the data the vendor's own status page would have given you if you knew where to look, but having it locally aggregated across all your vendors makes patterns visible. The 137Foundry engineering team has used this to flag vendors whose incident frequency was rising in the lead-up to renewal negotiations.
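A minimal SQLite version of that log could look like the sketch below. The column names follow the row layout above; the table name, file name, and helper functions are assumptions. Note that it expects to be called on every change, including recoveries back to "none", which the basic polling loop does not notify on.

```python
import sqlite3
from datetime import datetime, timezone

def open_history(path="vendor_history.db"):
    """Open (or create) the history database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS status_changes (
               timestamp   TEXT NOT NULL,
               vendor      TEXT NOT NULL,
               old_status  TEXT NOT NULL,
               new_status  TEXT NOT NULL,
               description TEXT NOT NULL
           )"""
    )
    return conn

def record_change(conn, vendor, old_status, new_status, description, now=None):
    """Append one status transition, timestamped in UTC."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        "INSERT INTO status_changes VALUES (?, ?, ?, ?, ?)",
        (ts, vendor, old_status, new_status, description),
    )
    conn.commit()

def incident_counts(conn):
    """Count incident starts (transitions out of 'none') per vendor."""
    rows = conn.execute(
        "SELECT vendor, COUNT(*) FROM status_changes "
        "WHERE old_status = 'none' AND new_status != 'none' "
        "GROUP BY vendor ORDER BY vendor"
    )
    return dict(rows.fetchall())
```

Incident duration can be derived later by pairing each transition out of "none" with the next transition back to it for the same vendor.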

Step 6: dashboard

A simple web page that reads the current state dict and renders a colored tile per vendor:

Green: operational
Yellow: degraded performance or partial outage
Red: major outage

Static HTML with a meta refresh tag works. Pin it on a TV in the office or open it as a tab when the team starts the day. Better dashboards (Grafana fed from your historical SQLite) are a natural upgrade later.
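The static page can be generated straight from the state dict used by the polling loop. In this sketch the mapping from indicator values to tile colors is an assumption (it treats "minor" and "major" as yellow and "critical" as red); adjust it to match how your vendors use those levels.

```python
import html

# Assumed mapping from Statuspage-style indicators to tile colors.
COLORS = {"none": "green", "minor": "gold", "major": "gold", "critical": "red"}

def render_dashboard(state, refresh_seconds=60):
    """Render one colored tile per vendor as a self-refreshing HTML page."""
    tiles = "\n".join(
        f'<div style="background:{COLORS.get(ind, "gray")};padding:1em;margin:4px">'
        f"{html.escape(name)}: {html.escape(ind)}</div>"
        for name, ind in sorted(state.items())
    )
    return (
        "<!doctype html>\n<html><head>"
        f'<meta http-equiv="refresh" content="{refresh_seconds}">'
        "<title>Vendor status</title></head>\n"
        f"<body>\n{tiles}\n</body></html>\n"
    )

page = render_dashboard({"Stripe": "critical", "GitHub": "none"})
```

Writing the returned string to an index.html file after each polling cycle is enough for any static file server to host it. Unknown indicators fall back to a gray tile rather than failing.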

What this aggregator does not do

A few honest limitations of the minimum-viable version:

It does not catch silent degradations. Some vendors only update their status page hours after the actual incident starts. For those, your own application's error rates are the better signal.
It does not correlate with your own incidents. A more sophisticated version cross-references vendor outages against your own monitoring (Sentry, Datadog, custom metrics) to flag when a vendor problem is affecting your users.
It does not predict outages. It reports them as they happen. Predictive vendor reliability is a research problem and not solvable in 200 lines of Python.

Prediction aside, for thinking about vendor risk over longer time horizons, the Cloud Security Alliance Cloud Controls Matrix (CCM) is a reasonable starting point.
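The correlation idea from the limitations above can be approximated with a simple time-window check: given one of your own incident timestamps, list the vendors that were in an outage around that time. The function name, the tuple layout, and the 15-minute padding are all assumptions for this sketch.

```python
from datetime import datetime, timedelta

def vendors_active_at(incident_time, outages, slack=timedelta(minutes=15)):
    """Return vendors whose outage window contains incident_time.

    outages is a list of (vendor, start, end) tuples; end may be None for an
    outage that is still ongoing. The start is padded backwards by `slack`
    because status pages often lag the real start of an incident.
    """
    hits = []
    for vendor, start, end in outages:
        started = start - slack <= incident_time
        not_ended = end is None or incident_time <= end
        if started and not_ended:
            hits.append(vendor)
    return hits
```

A non-empty result is only a hint that a vendor may be involved, not proof; it tells the on-call engineer which status pages to open first.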

When to outgrow the aggregator

Commercial products exist that do this and more: AppEnsure, StatusGator, Better Stack's vendor monitoring tier, several others. For larger organizations (more than 100 engineers, more than 50 vendors), one of these is usually worth paying for. The signs you have outgrown the homegrown version:

You want SLO compliance tracking against vendor SLA commitments
You want alert routing by service area, not just by vendor
You want correlation with your own observability stack
You want vendor postmortem aggregation for renewal negotiations

For a team of 20 to 100 engineers, the 200-line homegrown version is usually enough. For larger teams, the commercial options pay back in features that take months to build internally.

How this connects to vendor procurement

A status aggregator is operationally useful, but it also generates data that feeds back into procurement decisions. A vendor whose status page shows 12 incidents in a quarter, when you previously evaluated them against a 99.95 percent SLA, gives you something concrete to raise at renewal. A vendor with low historical incident frequency makes the renewal-pricing conversation easier.
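To put numbers on that comparison, the SLA percentage can be converted into a downtime budget and set against what the log recorded. This is a sketch: the 90-day quarter and the 20-minute average incident length are hypothetical inputs, and real SLAs often exclude partial degradations or scheduled maintenance.

```python
def allowed_downtime_minutes(sla_percent, days):
    """Downtime budget implied by an availability SLA over a period."""
    return (1 - sla_percent / 100) * days * 24 * 60

def observed_availability(downtime_minutes, days):
    """Availability percentage implied by total observed downtime."""
    return 100 * (1 - downtime_minutes / (days * 24 * 60))

budget = allowed_downtime_minutes(99.95, 90)   # about 64.8 minutes per quarter
# Hypothetical: 12 incidents averaging 20 minutes each.
observed = observed_availability(12 * 20, 90)  # roughly 99.81 percent
```

Under those assumptions the vendor used 240 minutes against a budget of about 65, which is the kind of gap worth bringing to a renewal conversation.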

For broader context on how vendor operational health rolls up into procurement evaluation, the longer guide at How to Build a Vendor Risk Scorecard for SaaS Procurement Decisions walks through the reliability and SLA dimension in detail.

The aggregator is one weekend of work. After that, every minute it saves debugging the wrong layer is a small dividend you do not have to think about.
