Building a Multi-Region Health-Check Aggregator for Video CDN Edges

#php #go #sqlite #monitoring

A viewer in São Paulo who hits a dead edge node doesn't file a bug report. They close the tab and go watch something else. That single closed tab is invisible to a normal uptime monitor, because the monitor is almost certainly probing from one location, and from that one location everything looks green. On DailyWatch we run a free video discovery platform for an English-speaking audience scattered across every timezone, with an origin behind Cloudflare and LiteSpeed handling the actual requests. For months our status page proudly said "100% uptime" while real users in specific regions were eating timeouts. The single global health check was lying to us. This is how we replaced it with a multi-region aggregator that tells the truth.

The lie of a single uptime check

The fundamental flaw of one probe is sampling bias. If your monitor lives in us-east-1 and your us-east-1 edge is fine, you report green — even if your Singapore and São Paulo edges are dropping connections. Video traffic makes this worse than it sounds, because the things that break regionally are rarely the things a basic check looks at:

A misbehaving BGP route that only affects one continent's transit.
A Cloudflare colo evicting your assets from cache, so one region falls back to a slow origin.
An edge node where the disk filled up and TLS handshakes started timing out.
DNS that resolves correctly everywhere except the one resolver your real users hit.

A GET / returning 200 from a single vantage point tells you almost nothing about any of these. What we actually needed was a system that answers a sharper question: "From the places our users actually are, is each edge serving fast and correct responses right now?" That requires probing from multiple regions, storing every sample, and aggregating with enough nuance that one flaky probe doesn't page someone at 3am.

Defining healthy for a video edge

Before writing any code, we wrote down what "healthy" means for our specific workload. "Returns 200" is necessary but nowhere near sufficient. An edge can return 200 while being useless. Our definition has three parts:

Reachability: the TCP connection and TLS handshake complete inside a tight budget (we use 800ms connect, 2s total).
Latency: time-to-first-byte (TTFB) is the metric that correlates with people bouncing. We track p95 TTFB per edge, not averages, because averages hide the tail that actually annoys users.
Correctness: the body contains an expected marker. A 200 that returns a Cloudflare error page or an empty body is a failure dressed up as success.

That last point matters more than people expect. When an origin behind a CDN goes sideways, you often get a perfectly valid HTTP 200 wrapping a generic error page. Checking only the status code means you never notice.
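To make the definition concrete, here is a minimal Python sketch of how a single probe sample could be classified against the reachability and correctness rules. The `Sample` fields, the `classify` function, and the state names are hypothetical and are not taken from our production code; latency is left out on purpose, because p95 TTFB is judged across many samples rather than per probe.

```python
from dataclasses import dataclass
from typing import Optional

CONNECT_BUDGET_MS = 800          # connect budget from the definition above
TOTAL_BUDGET_MS = 2000           # total budget from the definition above
EXPECTED_MARKER = '"ok":true'    # the marker the probers look for in the body


@dataclass
class Sample:
    connect_ms: Optional[float]  # None if the connection never completed
    total_ms: Optional[float]    # None if the request never finished
    status: int
    body: str


def classify(s: Sample) -> str:
    """Return a coarse state for one probe sample (hypothetical state names)."""
    if s.connect_ms is None or s.connect_ms > CONNECT_BUDGET_MS:
        return "unreachable"
    if s.total_ms is None or s.total_ms > TOTAL_BUDGET_MS:
        return "unreachable"
    if s.status != 200:
        return "bad_status"
    if EXPECTED_MARKER not in s.body:
        # A 200 wrapping an error page or an empty body lands here.
        return "wrong_body"
    return "healthy"
```

The ordering matters: a sample is only checked for correctness once it has proven reachable, so a timeout is never misreported as a wrong body.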

Probing from inside each region

The naive first version, which I still keep around for local debugging, fans out from a single box using PHP's curl_multi. It is genuinely useful because it is concurrent and runs anywhere PHP runs, but it shares the original sin: one vantage point.

<?php
declare(strict_types=1);

// region-probe.php — concurrent fan-out from ONE box (debug only)
const REGIONS = [
    'iad' => 'https://iad.edge.dailywatch.video/healthz',
    'fra' => 'https://fra.edge.dailywatch.video/healthz',
    'sin' => 'https://sin.edge.dailywatch.video/healthz',
    'gru' => 'https://gru.edge.dailywatch.video/healthz',
];

function probeAll(array $regions, int $timeoutMs = 2000): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($regions as $code => $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER    => true,
            CURLOPT_TIMEOUT_MS        => $timeoutMs,
            CURLOPT_CONNECTTIMEOUT_MS => 800,
            CURLOPT_NOSIGNAL          => true,
            CURLOPT_HTTPHEADER        => ['User-Agent: dw-healthcheck/1.0'],
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$code] = $ch;
    }

    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 0.5);
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $code => $ch) {
        $body = (string) curl_multi_getcontent($ch);
        $http = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        $results[$code] = [
            'http'    => $http,
            'ttfb_ms' => round(curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME) * 1000, 1),
            // Healthy means a 200 AND the expected marker in the body.
            'ok'      => $http === 200 && str_contains($body, '"ok":true'),
            'error'   => curl_error($ch) ?: null,
        ];
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

print_r(probeAll(REGIONS));

The production version flips the topology. Instead of one box reaching out to every edge, we deploy a tiny prober binary into several cheap regional locations — a $4 VPS per region works fine, and so does fly.io or a small box near each user cluster. Each prober checks the edges it cares about and POSTs results to a central aggregator. Go is a great fit here: a single static binary, no runtime to install, and httptrace gives us real TTFB instead of a coarse total time.

package main

import (
    "bytes"
    "context"
    "encoding/json"
    "io"
    "net/http"
    "net/http/httptrace"
    "os"
    "strings"
    "sync"
    "time"
)

type Result struct {
    Vantage   string  `json:"vantage"`
    Target    string  `json:"target"`
    HTTP      int     `json:"http"`
    TTFBms    float64 `json:"ttfb_ms"`
    OK        bool    `json:"ok"`
    Err       string  `json:"err,omitempty"`
    CheckedAt int64   `json:"checked_at"`
}

func probe(ctx context.Context, vantage, target string) Result {
    r := Result{Vantage: vantage, Target: target, CheckedAt: time.Now().Unix()}

    start := time.Now()
    var ttfb time.Duration
    trace := &httptrace.ClientTrace{
        GotFirstResponseByte: func() { ttfb = time.Since(start) },
    }

    // Attach the trace to the context first so the request carries both the
    // deadline and the TTFB hook.
    req, err := http.NewRequestWithContext(
        httptrace.WithClientTrace(ctx, trace), http.MethodGet, target, nil)
    if err != nil { // malformed target URL: report it instead of panicking
        r.Err = err.Error()
        return r
    }
    req.Header.Set("User-Agent", "dw-healthcheck/1.0")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        r.Err = err.Error()
        return r
    }
    defer resp.Body.Close()

    // A single Read may return fewer bytes than the body holds; ReadFull keeps
    // reading until buf is full or the body ends.
    buf := make([]byte, 512)
    n, _ := io.ReadFull(resp.Body, buf)
    r.HTTP = resp.StatusCode
    r.TTFBms = float64(ttfb.Microseconds()) / 1000.0
    r.OK = resp.StatusCode == 200 && strings.Contains(string(buf[:n]), "\"ok\":true")
    return r
}

func main() {
    vantage := os.Getenv("VANTAGE") // e.g. "sin", "gru"
    targets := strings.Split(os.Getenv("TARGETS"), ",")
    sink := os.Getenv("AGGREGATOR_URL")

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    var wg sync.WaitGroup
    results := make([]Result, len(targets))
    for i, t := range targets {
        wg.Add(1)
        go func(i int, t string) {
            defer wg.Done()
            results[i] = probe(ctx, vantage, strings.TrimSpace(t))
        }(i, t)
    }
    wg.Wait()

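    body, err := json.Marshal(results)
    if err != nil {
        os.Stderr.WriteString(err.Error() + "\n")
        os.Exit(1)
    }

    // Bound the upload so a stuck aggregator cannot pile up prober runs.
    // The 5s value is an arbitrary choice, not a tuned number.
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Post(sink, "application/json", bytes.NewReader(body))
    if err != nil {
        // Exit non-zero so cron or the supervising loop can see the failure.
        os.Stderr.WriteString(err.Error() + "\n")
        os.Exit(1)
    }
    resp.Body.Close()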
}

A cron entry (or a 30-second loop) runs this in each region. The probers are stateless and disposable; if one dies, we simply stop receiving its votes, which the aggregator handles gracefully. The key property is that the failure detection now happens from the same network paths real users traverse, not from a privileged datacenter that has clean transit to everything.

Storing samples in SQLite

The aggregator's storage is deliberately boring: SQLite. Our whole stack already leans on it (SQLite with FTS5 powers the on-site search), the operational cost is zero, and a health-check workload is tiny: a few hundred writes a minute at most. WAL mode lets the status endpoint keep reading while prober POSTs are being written, because readers and the single writer do not block each other. STRICT tables (available from SQLite 3.37) are a separate safeguard: they reject a value of the wrong type instead of silently storing it.

PRAGMA journal_mode = WAL;

CREATE TABLE IF NOT EXISTS health_samples (
    id         INTEGER PRIMARY KEY,
    vantage    TEXT    NOT NULL,   -- where the probe ran FROM
    target     TEXT    NOT NULL,   -- which edge was probed
    http       INTEGER NOT NULL,
    ttfb_ms    REAL,
    ok         INTEGER NOT NULL,   -- 0 or 1
    err        TEXT,
    checked_at INTEGER NOT NULL    -- unix seconds
) STRICT;

-- The only query pattern that matters: recent samples per target.
CREATE INDEX IF NOT EXISTS idx_samples_recent
    ON health_samples (target, checked_at DESC);

I store raw samples rather than pre-aggregated state on purpose. Raw samples let me change the scoring logic later without losing history, recompute p95 over any window, and debug "why did this page fire?" by reading the exact votes that triggered it. A nightly job trims anything older than a few days so the database stays small enough to fit in page cache.
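The write path and the nightly trim are not shown above, so here is a minimal sketch of what they could look like in Python. The function names `store_batch` and `trim_old` are hypothetical, the three-day retention is my reading of "a few days", and the schema is repeated without STRICT only so the sketch also runs on SQLite builds older than 3.37. The dictionary keys mirror the JSON field names the Go prober emits.

```python
import sqlite3
import time

# Same columns as the schema above; STRICT omitted for portability of the sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS health_samples (
    id         INTEGER PRIMARY KEY,
    vantage    TEXT    NOT NULL,
    target     TEXT    NOT NULL,
    http       INTEGER NOT NULL,
    ttfb_ms    REAL,
    ok         INTEGER NOT NULL,
    err        TEXT,
    checked_at INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_samples_recent
    ON health_samples (target, checked_at DESC);
"""

RETENTION_SECONDS = 3 * 24 * 3600  # assumption: "a few days" taken as three


def store_batch(con: sqlite3.Connection, results: list) -> int:
    """Insert one prober POST (a list of dicts) in a single transaction."""
    rows = [
        (
            r["vantage"],
            r["target"],
            int(r.get("http", 0)),
            r.get("ttfb_ms"),
            1 if r.get("ok") else 0,
            r.get("err"),          # omitted by the prober when there was no error
            int(r["checked_at"]),
        )
        for r in results
    ]
    with con:  # commits on success, rolls back if any row fails
        con.executemany(
            "INSERT INTO health_samples "
            "(vantage, target, http, ttfb_ms, ok, err, checked_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def trim_old(con: sqlite3.Connection, now=None) -> int:
    """Delete samples older than the retention window; returns rows removed."""
    cutoff = (now if now is not None else int(time.time())) - RETENTION_SECONDS
    with con:
        cur = con.execute(
            "DELETE FROM health_samples WHERE checked_at < ?", (cutoff,)
        )
    return cur.rowcount
```

Writing each POST as one transaction keeps the write lock short, which is what lets a single SQLite file absorb several probers at once.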

Scoring with quorum in Python

Here is where most homegrown monitors go wrong: they treat a single failed probe as an outage and page someone. Networks are noisy. A single dropped packet between your São Paulo prober and your São Paulo edge is not an outage — it is Tuesday. The fix is quorum: only declare an edge down when multiple vantage points agree, within a recent window. The Python rollup below is what our aggregator runs every time the status endpoint is regenerated.

import sqlite3
import time
import json
from collections import defaultdict

WINDOW = 300   # look back 5 minutes
QUORUM = 2     # this many vantages must see failure before we call it down

def health_rollup(db_path: str) -> dict:
    now = int(time.time())
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    rows = con.execute(
        """
        SELECT target, vantage, ok, ttfb_ms
        FROM health_samples
        WHERE checked_at >= ?
        """,
        (now - WINDOW,),
    ).fetchall()
    con.close()

    by_target = defaultdict(list)
    for r in rows:
        by_target[r["target"]].append(r)

    report = {}
    for target, samples in by_target.items():
        vantages = defaultdict(list)
        for s in samples:
            vantages[s["vantage"]].append(s)

        down_votes = 0
        latencies = []
        for vantage, vs in vantages.items():
            ok_ratio = sum(s["ok"] for s in vs) / len(vs)
            if ok_ratio < 0.5:           # this vantage mostly sees failure
                down_votes += 1
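            # Skip NULL and the 0.0 TTFB the Go prober sends when a request
            # failed before the first byte arrived.
            latencies += [s["ttfb_ms"] for s in vs if s["ttfb_ms"]]

        latencies.sort()
        # Nearest-rank p95: the value at 1-indexed rank ceil(0.95 * n),
        # computed with integer arithmetic to avoid float rounding.
        p95 = latencies[(95 * len(latencies) + 99) // 100 - 1] if latencies else None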

        report[target] = {
            "state": "down" if down_votes >= QUORUM else "up",
            "down_vantages": down_votes,
            "total_vantages": len(vantages),
            "p95_ttfb_ms": p95,
            "samples": len(samples),
        }
    return report

if __name__ == "__main__":
    print(json.dumps(health_rollup("health.db"), indent=2))

Notice that scoring happens per vantage first, then across vantages. A single vantage that flaps gets one vote at most, no matter how many noisy samples it produces, because we collapse its samples into an ok_ratio before counting. Only when QUORUM distinct regions independently agree do we flip an edge to down. This one design choice killed roughly 90% of our false pages. The p95 latency travels alongside the up/down state, so we can also alert on "healthy but slow," which for a video site is its own kind of outage.
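The effect of collapsing per vantage before counting is easiest to see on toy data. The sketch below reuses the same 0.5 ratio and QUORUM of 2 as the rollup, but the helper names `down_votes` and `state` and the sample counts are invented for illustration.

```python
from collections import defaultdict

QUORUM = 2  # same threshold as the rollup above


def down_votes(samples):
    """samples: iterable of (vantage, ok) pairs for ONE target."""
    per_vantage = defaultdict(list)
    for vantage, ok in samples:
        per_vantage[vantage].append(1 if ok else 0)
    # Each vantage contributes at most one vote, however many samples it has.
    return sum(1 for oks in per_vantage.values() if sum(oks) / len(oks) < 0.5)


def state(samples):
    return "down" if down_votes(samples) >= QUORUM else "up"


# One noisy vantage: gru fails 8 of 10 probes, the other two are clean.
flapping = [("gru", False)] * 8 + [("gru", True)] * 2
flapping += [("iad", True)] * 10 + [("fra", True)] * 10

# A failure seen from two places: gru and sin both mostly fail.
regional = [("gru", False)] * 9 + [("gru", True)]
regional += [("sin", False)] * 7 + [("sin", True)] * 3
regional += [("iad", True)] * 10

print(state(flapping))  # up: eight failed samples, but only one vote
print(state(regional))  # down: two distinct vantages agree
```

Counting raw failed samples instead would have flagged the first case, since 8 failures out of 30 looks alarming even though only one vantage saw them.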

Serving status behind LiteSpeed and Cloudflare

The last piece is exposing the aggregated state. Our public status endpoint is plain PHP — same runtime as the rest of the site — reading the SQLite rollup. The interesting part is caching. The endpoint can get hammered (status pages attract anxious refreshing during incidents), so we lean on LiteSpeed's cache and Cloudflare in front of it, with a short TTL and stale-while-revalidate so a thundering herd never reaches PHP more than once every few seconds.

<?php
declare(strict_types=1);

// status.php — public aggregated edge health, cached at the edge
$db = new SQLite3(__DIR__ . '/health.db', SQLITE3_OPEN_READONLY);
$db->busyTimeout(2000);

$window = 300;
$since  = time() - $window;
$quorum = 2;

$stmt = $db->prepare('
    SELECT target,
           COUNT(DISTINCT vantage)                 AS vantages,
           SUM(CASE WHEN ok = 0 THEN 1 ELSE 0 END) AS failures,
           COUNT(*)                                AS samples,
           ROUND(AVG(ttfb_ms), 1)                  AS avg_ttfb
    FROM health_samples
    WHERE checked_at >= :since
    GROUP BY target
');
$stmt->bindValue(':since', $since, SQLITE3_INTEGER);
$res = $stmt->execute();

$out = [];
while ($row = $res->fetchArray(SQLITE3_ASSOC)) {
    $row['state'] = $row['failures'] > ($row['samples'] / 2) ? 'degraded' : 'healthy';
    $out[$row['target']] = $row;
}

header('Content-Type: application/json');
// Short edge cache; serve stale while one request refreshes in the background.
header('Cache-Control: public, max-age=15, stale-while-revalidate=30');
header('X-LiteSpeed-Cache-Control: public, max-age=15');

echo json_encode([
    'generated_at' => time(),
    'window_sec'   => $window,
    'targets'      => $out,
], JSON_PRETTY_PRINT);

The X-LiteSpeed-Cache-Control header lets LiteSpeed cache the response on the origin box, while the standard Cache-Control with stale-while-revalidate lets Cloudflare absorb the public traffic. During an actual incident this matters: you do not want your status page to fall over precisely when everyone is checking it. One important subtlety — the SQL GROUP BY here is a simplified per-target view for human readers; the real up/down decision still comes from the Python quorum logic that respects distinct vantages. Mixing those two would reintroduce the flapping problem, so we keep the strict quorum decision as the source of truth and treat this endpoint's state as a coarse summary.

Lessons from running this in production

A few things only became obvious after this ran for real:

Probe the thing users hit, not a synthetic path. Our first /healthz returned a static string and stayed green while the actual video listing pages were broken. Now the health endpoint exercises the same code path a real page does, including a SQLite read.
Tail latency is the real signal. Averages told us everything was fine while p95 quietly doubled. Track p95 per edge and alert on it independently of up/down.
Disposable probers beat one fancy monitor. Cheap, stateless Go binaries in several regions gave better coverage than any single hosted monitoring service we trialed, for a fraction of the cost.
Quorum is non-negotiable. Without it you will train your team to ignore the pager, which is worse than having no monitoring at all.
Keep storage boring. SQLite handled this without a second thought and never became the thing we had to babysit.
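The tail-latency lesson is worth a quick worked example. The numbers below are invented, not measurements from our edges: 90% of requests stay at 80 ms while the slowest 10% go from 160 ms to 320 ms. The `p95` helper is a plain nearest-rank percentile written for this sketch.

```python
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile: the value at 1-indexed rank ceil(0.95 * n)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


# Invented TTFB samples in milliseconds, 100 requests each.
before = [80.0] * 90 + [160.0] * 10
after = [80.0] * 90 + [320.0] * 10

print(mean(before), p95(before))  # 88.0 160.0
print(mean(after), p95(after))    # 104.0 320.0
```

The mean moves from 88 ms to 104 ms, which is easy to wave away on a dashboard, while p95 doubles. An alert on the average alone would plausibly have stayed quiet here.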

Conclusion

A single global uptime check is comforting precisely because it rarely tells you anything is wrong — it is structurally biased toward green. For anything serving a geographically spread audience, and especially for latency-sensitive video, you need probes from inside the regions your users live in, raw samples you can re-score later, and a quorum rule that refuses to panic over a single noisy vantage. The whole thing fits in a Go binary, a SQLite file, a Python rollup, and a cached PHP endpoint — no heavyweight observability platform required. It is now the first place we look when something feels off, and far more importantly, it catches the regional failures that used to only reveal themselves as a quietly closed tab in São Paulo.