DEV Community

ahmet gedik

Building a Multi-Region Health-Check Aggregator for Our Video CDN Fleet

It was 03:14 on a Saturday when our pager went off. Watch time across our Tokyo PoP had dropped 38% in 15 minutes, but every monitoring tool we had was green. Cloudflare's dashboard said "all good." UptimeRobot was pinging the apex domain successfully. Our origin LiteSpeed boxes were serving requests in under 80ms. And yet thousands of Japanese viewers were silently failing to load the HLS manifest for popular videos.

The problem turned out to be a single PoP in Osaka where Cloudflare was caching a stale master.m3u8 with a now-deleted variant URL. From any vantage point outside that specific edge, the file looked fine. From inside Japan on a JP residential ISP, it was broken. Our existing health checks — the kind that ping /healthz from a single US-east AWS region — could not see this. We were blind.

At TopVideoHub we aggregate trending video across nine regions in APAC, and "the manifest serves correctly in Tokyo" is just as important to us as "the database accepts writes." So we built a multi-region health-check aggregator that probes real CDN paths from real network locations, rolls the results up into a single SQLite store, and alerts on per-region anomalies rather than global outages. This post walks through what we measure, how the probes are structured, how the aggregator stores and exposes the data, and what surprised us when we put it in production.

What we actually need to measure

The first mistake we made was treating "is the site up?" as a single boolean. With a video site behind Cloudflare, there are at least four independent failure modes per region:

  • Origin reachability. Can LiteSpeed actually be reached from the edge? Usually yes, but a Cloudflare misroute can break this for a specific ASN combination.
  • HTML cache correctness. Is the cached /category/anime page returning the right vary set and a 200, not a stale 500 from an earlier deploy?
  • HLS manifest integrity. Does master.m3u8 parse, list variants, and do those variants resolve?
  • Search relevance. Is the Japanese/Korean/Chinese query returning results, or has the FTS5 tokenizer been silently broken by a schema migration?

A green ping doesn't help with any of those. We needed probes that exercise real product surface area, from real geographies, with assertions specific to each surface.
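The manifest check in the list above can be sketched as a small parser. This is a minimal illustration rather than our production code: the function name `variant_urls` and the example URL are hypothetical, and it only handles the `#EXT-X-STREAM-INF` tag followed by a URI line, which is how HLS master playlists list their variants.

```python
from urllib.parse import urljoin


def variant_urls(master_text: str, master_url: str) -> list[str]:
    """Return absolute variant playlist URLs listed in an HLS master playlist."""
    lines = [ln.strip() for ln in master_text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an M3U8 playlist: missing #EXTM3U header")

    urls = []
    expect_uri = False
    for ln in lines[1:]:
        if ln.startswith("#EXT-X-STREAM-INF:"):
            expect_uri = True   # the next non-tag line is the variant URI
        elif ln.startswith("#"):
            continue            # other tags and comments
        elif expect_uri:
            # Variant URIs may be relative to the master playlist URL.
            urls.append(urljoin(master_url, ln))
            expect_uri = False
    if not urls:
        raise ValueError("master playlist lists no variants")
    return urls


if __name__ == "__main__":
    sample = "\n".join([
        "#EXTM3U",
        "#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360",
        "360p/index.m3u8",
        "#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720",
        "720p/index.m3u8",
    ])
    for url in variant_urls(sample, "https://example.com/v/123/master.m3u8"):
        print(url)
```

A probe would then request each returned URL and record an assertion failure for any variant that does not come back with a 2xx status, which is the check a stale manifest pointing at a deleted variant fails.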

The metrics we ended up collecting per probe:

  • latency_ms: time to first byte (TTFB), measured when the response headers arrive
  • total_ms: time until the full body has been received
  • http_status
  • cf_cache_status from the cf-cache-status header
  • cf_pop from cf-ray (last three letters identify the PoP)
  • body_sha256: content fingerprint so we can detect divergence between regions
  • assertion_failures: a JSON blob of which content checks failed

That last column is the one that made the system actually useful. A 200 OK with a 4 KB body is "healthy" by classical monitoring. A 200 OK with a 4 KB body when the page should be 180 KB is a silent disaster, and assertions catch it.

Architecture in one paragraph

Probes run as small Go binaries on cheap VPS instances we already keep around in nine regions for IP diversity. Each probe hits a list of URLs supplied by the central aggregator, runs HTTP requests with strict timeouts, performs content assertions, and POSTs a JSON batch back to the aggregator every 60 seconds. The aggregator is a PHP 8.4 endpoint that validates the payload and writes it to a SQLite database: raw rows go into probe_results, active alerts into probe_alerts, and a small ingestion worker rolls hourly aggregates into a third table, probe_hourly, which plays the role of a materialized view. The whole thing serves a /ops/healthboard page from LiteSpeed with the standard page cache disabled, because we want fresh numbers.

No Kubernetes, no Prometheus, no Grafana. The total runtime cost is about $24/month including the probe VPS fleet.

The probe worker

Each probe is a single Go binary because we wanted accurate timing with as little runtime overhead as possible masking real latency, and because we wanted to ship one static file via scp to whatever VPS we had handy. The structure is intentionally boring.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "os"
    "regexp"
    "strings"
    "time"
)

type Target struct {
    URL         string   `json:"url"`
    Kind        string   `json:"kind"` // "page", "manifest", "search"
    MustContain []string `json:"must_contain"`
    MinBytes    int      `json:"min_bytes"`
}

type Result struct {
    Region    string   `json:"region"`
    URL       string   `json:"url"`
    Kind      string   `json:"kind"`
    Status    int      `json:"status"`
    LatencyMs int64    `json:"latency_ms"`
    TotalMs   int64    `json:"total_ms"`
    CFPop     string   `json:"cf_pop"`
    CFCache   string   `json:"cf_cache_status"`
    BodySHA   string   `json:"body_sha256"`
    BodyBytes int      `json:"body_bytes"`
    Failures  []string `json:"assertion_failures"`
    Timestamp int64    `json:"ts"`
}

var rayRe = regexp.MustCompile(`-([A-Z]{3})$`)

func probe(region string, t Target, client *http.Client) Result {
    r := Result{Region: region, URL: t.URL, Kind: t.Kind, Timestamp: time.Now().Unix()}
    start := time.Now()
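    req, err := http.NewRequest("GET", t.URL, nil)
    if err != nil {
        // A malformed target URL is reported as a failure instead of crashing the probe.
        r.Failures = append(r.Failures, "request: "+err.Error())
        return r
    }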
    req.Header.Set("User-Agent", "TVH-Probe/1.0")
    req.Header.Set("Accept-Language", regionLang(region))

    resp, err := client.Do(req)
    if err != nil {
        r.Failures = append(r.Failures, "transport: "+err.Error())
        return r
    }
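    defer resp.Body.Close()
    // Headers have arrived at this point, so this approximates TTFB.
    r.LatencyMs = time.Since(start).Milliseconds()
    // Record the status before reading the body so a failed read still reports it.
    r.Status = resp.StatusCode

    // Cap the read so a runaway response cannot exhaust the probe's memory.
    body, err := io.ReadAll(io.LimitReader(resp.Body, 8<<20))
    if err != nil {
        r.Failures = append(r.Failures, "read: "+err.Error())
        return r
    }
    r.TotalMs = time.Since(start).Milliseconds()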
    r.BodyBytes = len(body)
    sum := sha256.Sum256(body)
    r.BodySHA = hex.EncodeToString(sum[:])
    r.CFCache = resp.Header.Get("cf-cache-status")

    if ray := resp.Header.Get("cf-ray"); ray != "" {
        if m := rayRe.FindStringSubmatch(ray); len(m) == 2 {
            r.CFPop = m[1]
        }
    }

    if r.BodyBytes < t.MinBytes {
        r.Failures = append(r.Failures, fmt.Sprintf("size %d<min %d", r.BodyBytes, t.MinBytes))
    }
    bodyStr := string(body)
    for _, needle := range t.MustContain {
        if !strings.Contains(bodyStr, needle) {
            r.Failures = append(r.Failures, "missing: "+needle)
        }
    }
    return r
}

func regionLang(region string) string {
    switch region {
    case "JP":
        return "ja,en;q=0.8"
    case "KR":
        return "ko,en;q=0.8"
    case "TW", "HK":
        return "zh-TW,zh;q=0.8,en;q=0.6"
    case "VN":
        return "vi,en;q=0.8"
    case "TH":
        return "th,en;q=0.8"
    }
    return "en"
}

func main() {
    region := os.Getenv("PROBE_REGION")
    if region == "" {
        fmt.Fprintln(os.Stderr, "PROBE_REGION required")
        os.Exit(1)
    }
    client := &http.Client{Timeout: 15 * time.Second}

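    // Use the client with a timeout; http.Get would wait indefinitely on a hung aggregator.
    resp, err := client.Get(os.Getenv("PROBE_TARGETS_URL"))
    if err != nil {
        fmt.Fprintln(os.Stderr, "fetch targets:", err)
        os.Exit(1)
    }
    var targets []Target
    err = json.NewDecoder(resp.Body).Decode(&targets)
    resp.Body.Close()
    if err != nil {
        // An undecodable target list must not look like a clean run with zero probes.
        fmt.Fprintln(os.Stderr, "decode targets:", err)
        os.Exit(1)
    }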

    var results []Result
    for _, t := range targets {
        results = append(results, probe(region, t, client))
    }

    payload, _ := json.Marshal(map[string]any{
        "region":  region,
        "results": results,
    })
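    post, err := http.NewRequest("POST", os.Getenv("PROBE_SINK"), strings.NewReader(string(payload)))
    if err != nil {
        fmt.Fprintln(os.Stderr, "build post:", err)
        os.Exit(1)
    }
    post.Header.Set("Content-Type", "application/json")
    post.Header.Set("Authorization", "Bearer "+os.Getenv("PROBE_TOKEN"))
    sinkResp, err := client.Do(post)
    if err != nil {
        fmt.Fprintln(os.Stderr, "post results:", err)
        os.Exit(1)
    }
    sinkResp.Body.Close()
    // A rejected batch (401, 400, 5xx) should fail the cron run visibly.
    if sinkResp.StatusCode >= 300 {
        fmt.Fprintln(os.Stderr, "sink returned", sinkResp.Status)
        os.Exit(1)
    }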
}

Three things to call out about this probe:

  • The Accept-Language header changes per region. Cloudflare and our LiteSpeed origin both vary on it in subtle ways, and the Japanese homepage of an Asia-Pacific aggregator looks different from the English one. Probing without the right language is testing a code path no real user takes.
  • We read the body with io.LimitReader capped at 8 MiB. If something on the origin starts streaming a 2 GB file because of a bad migration, we don't want every probe VPS to OOM.
  • The cf-ray header's trailing three characters identify the Cloudflare PoP. Capturing this lets us answer "which edge are we actually hitting from this region?" — the answer is sometimes surprising (a probe in Bangkok served by SIN, etc.).

This binary is driven by cron every 60 seconds on each VPS. We deliberately do not run it as a long-lived daemon; one-shot cron means a crash on Tuesday doesn't go unnoticed until Friday.
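With one-shot runs, a dead probe shows up as a gap in the data rather than as an error, so something has to look for the gap. Below is a minimal sketch of such a staleness check, assuming only the region and ts columns of the probe_results table; the function name `stale_regions` and the 180-second threshold (three missed runs) are hypothetical choices for illustration.

```python
import sqlite3
import time


def stale_regions(conn, expected, max_age_s=180, now=None):
    """Return the expected regions whose newest probe row is older than max_age_s."""
    now = int(time.time()) if now is None else now
    rows = conn.execute(
        "SELECT region, MAX(ts) FROM probe_results GROUP BY region"
    ).fetchall()
    last_seen = dict(rows)
    # A region with no rows at all counts as stale (last seen at epoch 0).
    return sorted(r for r in expected if now - last_seen.get(r, 0) > max_age_s)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE probe_results (region TEXT, ts INTEGER)")
    conn.execute("INSERT INTO probe_results VALUES ('JP', ?)", (int(time.time()),))
    print(stale_regions(conn, ["JP", "KR"]))
```

The `now` parameter exists so the check can be tested against fixed timestamps instead of the wall clock.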

The aggregator

The receiver is plain PHP 8.4. We already run LiteSpeed in front of our PHP for the main site, so adding one more endpoint costs us nothing. The whole receiver is under 150 lines.

<?php
declare(strict_types=1);

final class ProbeAggregator
{
    public function __construct(private \PDO $pdo) {}

    public function ingest(string $rawBody, string $authHeader): array
    {
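        $expected = $_ENV['PROBE_TOKEN'] ?? '';
        // Reject everything when no token is configured; otherwise a bare
        // "Bearer " header would match an empty token.
        if ($expected === '' || !hash_equals('Bearer ' . $expected, $authHeader)) {
            http_response_code(401);
            return ['error' => 'unauthorized'];
        }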

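        try {
            $payload = json_decode($rawBody, true, 8, JSON_THROW_ON_ERROR);
        } catch (\JsonException $e) {
            // Malformed JSON is a client error, not an uncaught 500.
            http_response_code(400);
            return ['error' => 'invalid json'];
        }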
        if (!is_array($payload['results'] ?? null)) {
            http_response_code(400);
            return ['error' => 'no results'];
        }

        $region = (string)($payload['region'] ?? 'XX');
        $stmt = $this->pdo->prepare(<<<SQL
            INSERT INTO probe_results
              (region, url, kind, status, latency_ms, total_ms,
               cf_pop, cf_cache, body_sha, body_bytes, failures, ts)
            VALUES (:region, :url, :kind, :status, :latency_ms, :total_ms,
                    :cf_pop, :cf_cache, :body_sha, :body_bytes, :failures, :ts)
        SQL);

        $this->pdo->beginTransaction();
        $inserted = 0;
        try {
            foreach ($payload['results'] as $r) {
                $stmt->execute([
                    ':region'     => $region,
                    ':url'        => (string)($r['url'] ?? ''),
                    ':kind'       => (string)($r['kind'] ?? ''),
                    ':status'     => (int)($r['status'] ?? 0),
                    ':latency_ms' => (int)($r['latency_ms'] ?? 0),
                    ':total_ms'   => (int)($r['total_ms'] ?? 0),
                    ':cf_pop'     => (string)($r['cf_pop'] ?? ''),
                    ':cf_cache'   => (string)($r['cf_cache_status'] ?? ''),
                    ':body_sha'   => (string)($r['body_sha256'] ?? ''),
                    ':body_bytes' => (int)($r['body_bytes'] ?? 0),
                    ':failures'   => json_encode($r['assertion_failures'] ?? []),
                    ':ts'         => (int)($r['ts'] ?? time()),
                ]);
                $inserted++;
            }
            $this->pdo->commit();
        } catch (\Throwable $e) {
            // Roll back so a bad row never leaves a half-written batch or an open transaction.
            $this->pdo->rollBack();
            throw $e;
        }

        $this->detectAnomalies($region);

        return ['ok' => true, 'inserted' => $inserted];
    }

    private function detectAnomalies(string $region): void
    {
        $stmt = $this->pdo->prepare(<<<SQL
            SELECT url,
                   AVG(CASE WHEN status >= 200 AND status < 400 THEN 1 ELSE 0 END) AS ok_rate,
                   COUNT(*) AS samples
            FROM probe_results
            WHERE region = :region
              AND ts > strftime('%s','now') - 600
            GROUP BY url
            HAVING samples >= 5 AND ok_rate < 0.6
        SQL);
        $stmt->execute([':region' => $region]);

        $ins = $this->pdo->prepare(<<<SQL
            INSERT INTO probe_alerts (region, url, ok_rate, ts)
            VALUES (:region, :url, :ok_rate, :ts)
            ON CONFLICT(region, url) DO UPDATE SET
              ok_rate = excluded.ok_rate, ts = excluded.ts
        SQL);
        foreach ($stmt as $row) {
            $ins->execute([
                ':region'  => $region,
                ':url'     => $row['url'],
                ':ok_rate' => (float)$row['ok_rate'],
                ':ts'      => time(),
            ]);
        }
    }
}

The anomaly detector is intentionally dumb: any URL with five or more samples in the last ten minutes whose success rate drops below 60% gets upserted into probe_alerts. A separate cron reads that table and posts to a Discord webhook. Smarter detection (EWMA, change-point detection) is on the backlog, but the dumb threshold has caught every real outage we've had so far. Premature sophistication in alerting almost always means false positives at 4 AM, and false positives are how teams learn to mute pages.
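The same threshold rule is small enough to express as a pure function, which is handy for replaying recorded samples when tuning the numbers. This is a sketch that mirrors the SQL above under the same defaults (ten-minute window, at least five samples, success rate below 60%); the function name `failing_urls` and the tuple layout are assumptions for illustration, not part of the receiver.

```python
from collections import defaultdict


def failing_urls(samples, now, window_s=600, min_samples=5, min_ok_rate=0.6):
    """samples: iterable of (url, http_status, ts). Returns {url: ok_rate} for alerting URLs."""
    buckets = defaultdict(list)
    for url, status, ts in samples:
        if ts > now - window_s:
            # Same success definition as the SQL: 2xx and 3xx count as OK.
            buckets[url].append(200 <= status < 400)

    alerts = {}
    for url, oks in buckets.items():
        ok_rate = sum(oks) / len(oks)
        if len(oks) >= min_samples and ok_rate < min_ok_rate:
            alerts[url] = ok_rate
    return alerts


if __name__ == "__main__":
    recorded = [("/master.m3u8", 404, t) for t in range(100, 106)]
    print(failing_urls(recorded, now=200))
```

Note that, like the SQL, this looks only at HTTP status; a URL that returns 200 with failed content assertions would not trip it.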

SQLite schema and queries

We use SQLite for the same reason we use it everywhere else on the site: it's one file, it backs up via rsync, and it scales further than people expect. At ~7.5 million probe rows per month across nine regions, the database file is about 480 MB. Query latency for the dashboard is single-digit milliseconds.

PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA temp_store = MEMORY;
PRAGMA mmap_size = 268435456;

CREATE TABLE probe_results (
    id          INTEGER PRIMARY KEY,
    region      TEXT NOT NULL,
    url         TEXT NOT NULL,
    kind        TEXT NOT NULL,
    status      INTEGER NOT NULL,
    latency_ms  INTEGER NOT NULL,
    total_ms    INTEGER NOT NULL,
    cf_pop      TEXT,
    cf_cache    TEXT,
    body_sha    TEXT,
    body_bytes  INTEGER,
    failures    TEXT,
    ts          INTEGER NOT NULL
) STRICT;

CREATE INDEX idx_probe_region_ts ON probe_results(region, ts DESC);
CREATE INDEX idx_probe_url_ts    ON probe_results(url, ts DESC);

CREATE TABLE probe_alerts (
    region   TEXT NOT NULL,
    url      TEXT NOT NULL,
    ok_rate  REAL NOT NULL,
    ts       INTEGER NOT NULL,
    PRIMARY KEY (region, url)
) STRICT;

CREATE TABLE probe_hourly (
    region        TEXT NOT NULL,
    url           TEXT NOT NULL,
    hour_ts       INTEGER NOT NULL,
    samples       INTEGER NOT NULL,
    ok_count      INTEGER NOT NULL,
    p50_latency   INTEGER NOT NULL,
    p95_latency   INTEGER NOT NULL,
    PRIMARY KEY (region, url, hour_ts)
) STRICT;
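Filling probe_hourly needs percentiles, which are easiest to compute outside SQL if the SQLite build in use has no percentile aggregate. The following is a minimal sketch of such a rollup, not the actual ingestion worker: `rollup_hour` and `percentile` are hypothetical names, the nearest-rank percentile definition is an assumption, and only the columns shown in the schema above are used.

```python
import math
import sqlite3


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def rollup_hour(conn, hour_ts):
    """Aggregate one hour of probe_results into probe_hourly; returns groups written."""
    rows = conn.execute(
        "SELECT region, url, status, latency_ms FROM probe_results "
        "WHERE ts >= ? AND ts < ? ORDER BY region, url, latency_ms",
        (hour_ts, hour_ts + 3600),
    ).fetchall()

    groups = {}
    for region, url, status, latency in rows:
        groups.setdefault((region, url), []).append((status, latency))

    for (region, url), items in groups.items():
        latencies = [lat for _, lat in items]  # already sorted by the ORDER BY
        ok_count = sum(1 for status, _ in items if 200 <= status < 400)
        # INSERT OR REPLACE makes the rollup safe to re-run for the same hour.
        conn.execute(
            "INSERT OR REPLACE INTO probe_hourly "
            "(region, url, hour_ts, samples, ok_count, p50_latency, p95_latency) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (region, url, hour_ts, len(items), ok_count,
             percentile(latencies, 50), percentile(latencies, 95)),
        )
    conn.commit()
    return len(groups)
```

Because the primary key is (region, url, hour_ts), running the rollup twice for the same hour overwrites the earlier row instead of duplicating it.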

The interesting query is the one that powers the dashboard. We want, per region and per URL, the rolling 5-minute success rate plus a flag for "divergence" — the case where this region sees a different body_sha than the global mode.

WITH recent AS (
    SELECT region, url, status, total_ms, body_sha
    FROM probe_results
    WHERE ts > strftime('%s','now') - 300
),
sha_counts AS (
    SELECT url, body_sha, COUNT(*) AS n
    FROM recent
    GROUP BY url, body_sha
),
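majority_sha AS (
    -- ROW_NUMBER picks exactly one hash per URL, so a tie between two
    -- hashes cannot duplicate rows in the join below.
    SELECT url, body_sha
    FROM (
        SELECT url, body_sha,
               ROW_NUMBER() OVER (PARTITION BY url ORDER BY n DESC, body_sha) AS rn
        FROM sha_counts
    )
    WHERE rn = 1
)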
SELECT r.region,
       r.url,
       COUNT(*) AS samples,
       AVG(CASE WHEN r.status BETWEEN 200 AND 399 THEN 1.0 ELSE 0.0 END) AS ok_rate,
       SUM(CASE WHEN r.body_sha != m.body_sha THEN 1 ELSE 0 END) AS divergent
FROM recent r
JOIN majority_sha m USING (url)
GROUP BY r.region, r.url
ORDER BY r.region, r.url;

The body_sha divergence column is what would have caught the Osaka incident. When eight regions see SHA abc... for the manifest and JP sees SHA def..., that's not a flap, that's a cache poisoning event at one edge. We render the dashboard cell red regardless of HTTP status.
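The divergence rule itself fits in a few lines. Here is a sketch of the idea applied to the latest hash per region for a single URL; the function name `divergent_regions` and the tie-break (lexicographically smallest hash wins) are assumptions made for illustration.

```python
from collections import Counter


def divergent_regions(sha_by_region):
    """sha_by_region: {region: body_sha256}. Returns (majority_sha, divergent regions)."""
    counts = Counter(sha_by_region.values())
    # Most common hash wins; ties are broken deterministically by hash value.
    majority = min(counts, key=lambda sha: (-counts[sha], sha))
    divergent = sorted(r for r, sha in sha_by_region.items() if sha != majority)
    return majority, divergent


if __name__ == "__main__":
    observed = {r: "abc" for r in ["US", "GB", "KR", "TW", "SG", "VN", "TH", "HK"]}
    observed["JP"] = "def"
    print(divergent_regions(observed))
```

This only makes sense for URLs whose body is expected to be identical everywhere; pages that legitimately vary by Accept-Language will always look divergent and need to be compared within a language group instead.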

For the FTS5 search probe specifically — testing that "アニメ" still returns results from our CJK-tokenized index — we needed an assertion stronger than HTTP 200. The probe target manifest is generated by a small Python script that lives next to our content vault:

import json
from pathlib import Path
from urllib.parse import quote

REGIONS = ["US", "GB", "JP", "KR", "TW", "SG", "VN", "TH", "HK"]

SEARCH_QUERIES = {
    "JP": ["アニメ", "音楽", "ゲーム実況"],
    "KR": ["케이팝", "먹방", "드라마"],
    "TW": ["遊戲", "美食", "音樂"],
    "HK": ["遊戲", "美食", "音樂"],
    "VN": ["bóng đá", "âm nhạc", "phim"],
    "TH": ["เพลง", "เกม", "ละคร"],
}

BASE = "https://topvideohub.com"

def build_targets(region: str) -> list[dict]:
    targets = [
        {
            "url": f"{BASE}/",
            "kind": "page",
            "must_contain": ["TopVideoHub", '<main id="content"'],
            "min_bytes": 40_000,
        },
        {
            "url": f"{BASE}/category/music",
            "kind": "page",
            "must_contain": ["category-music", "video-card"],
            "min_bytes": 30_000,
        },
    ]
    # Regions without a localized list fall back to a single English query.
    for q in SEARCH_QUERIES.get(region, ["music"]):
        targets.append({
            # Percent-encode the query: raw spaces ("bóng đá") and non-ASCII
            # characters are not valid in a request URL.
            "url": f"{BASE}/search?q={quote(q)}",
            "kind": "search",
            # The assertion still looks for the unencoded query in the page body.
            "must_contain": ["search-results", q],
            "min_bytes": 8_000,
        })
    return targets

if __name__ == "__main__":
    out = Path("targets")
    out.mkdir(exist_ok=True)
    for r in REGIONS:
        (out / f"{r}.json").write_text(
            json.dumps(build_targets(r), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
    print(f"wrote {len(REGIONS)} target files")

Putting the query string itself into must_contain is a deliberate test of two things at once: that FTS5 returned something (we render the query back into the results page), and that the response wasn't a cached page for a different query. URL-keyed cache poisoning is real, particularly after a deploy that changes how query parameters affect cache keys.

Surfacing it through LiteSpeed and Cloudflare

The aggregator endpoint at /ops/probe-ingest and the dashboard at /ops/healthboard both live on the main LiteSpeed origin. We did three small but important things to keep them from breaking the main site:

  • CacheDisable yes for any URL starting with /ops/. The whole point of a health dashboard is fresh data, and the LiteSpeed page cache would happily serve a 2-minute-old snapshot otherwise.
  • A Cloudflare page rule that disables Cloudflare's CDN cache for /ops/* and forces the WAF into "high" mode. The ingest endpoint is rate-limited at the WAF layer to 30 requests/minute per probe ASN, which leaves ample headroom over our once-a-minute cron.
  • Basic auth on /ops/healthboard enforced by the PHP app rather than Cloudflare Access. We want the page to load even if Cloudflare itself is the thing that broke — which has happened.

That last point bit us early. The first version of the dashboard required Cloudflare Access to view, and on the day Cloudflare's APAC dashboard had its own incident, we couldn't see our own monitoring because the SSO challenge couldn't load. Now the dashboard is reachable on a long random path with HTTP basic auth, and works even with Cloudflare in Under Attack Mode.

What broke first

The list of things that surprised us in the first month, roughly in order:

  • The JP probe VPS was on the same backbone as our Tokyo Cloudflare PoP. Latency from there looked great, success rate was 100% even when real users were complaining. We moved it to a different provider on a different transit and immediately started seeing the real picture. The lesson generalizes: a probe colocated with the thing it probes is barely a probe at all.
  • Our must_contain assertions tripped on cookie banner A/B tests. Whenever marketing flipped the consent banner copy, a third of probes started "failing." We moved to asserting on structural HTML (<main id="content">) instead of marketing copy. Anything a non-engineer can change without telling you should not be an assertion target.
  • The Korean probe started returning a translated page because our content negotiation hit a code path that, for Accept-Language: ko, applied an experimental Korean UI. The probe was working perfectly; the assertion was wrong. We added per-region must_contain lists.
  • SQLite VACUUM during ingest caused a 12-second stall. We moved VACUUM to a weekly cron at 03:00 UTC and added PRAGMA wal_autocheckpoint=1000 so checkpoint pauses stay small.
  • Discord webhook 429s during a real incident meant we got fewer alerts during the worst moment. We now batch alerts every 30 seconds and include a "this is alert #N for this URL in the last hour" counter.
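
The alert batching in the last item can be sketched as follows. This is an illustration of the idea only: the class name `AlertBatcher`, the message format, and the injected `now` timestamps are assumptions, and the actual webhook POST is left to the caller.

```python
from collections import defaultdict, deque


class AlertBatcher:
    """Collects alerts and releases them as one message per flush interval."""

    def __init__(self, flush_every_s=30, counter_window_s=3600):
        self.flush_every_s = flush_every_s
        self.counter_window_s = counter_window_s
        self.pending = []
        self.history = defaultdict(deque)  # url -> timestamps of recent alerts
        self.last_flush = None

    def add(self, url, region, now):
        seen = self.history[url]
        # Forget alerts that have aged out of the counter window.
        while seen and seen[0] <= now - self.counter_window_s:
            seen.popleft()
        seen.append(now)
        self.pending.append(
            f"[{region}] {url} (alert #{len(seen)} for this URL in the last hour)"
        )

    def flush(self, now):
        """Return one combined message if the interval has elapsed, else None."""
        if not self.pending:
            return None
        if self.last_flush is not None and now - self.last_flush < self.flush_every_s:
            return None
        message = "\n".join(self.pending)
        self.pending.clear()
        self.last_flush = now
        return message
```

The caller invokes `flush` on a timer and sends a single webhook request whenever it returns a message, so a burst of alerts during an incident costs one request per interval instead of one per alert.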

None of these are exotic. They're the ordinary lessons of running synthetic monitoring against a real system. The point of the exercise is to surface them in your own architecture, not to copy ours.

Conclusion

The version of this system we run today does about 1.3 million probes a month across nine regions, has caught three Cloudflare cache anomalies that ordinary uptime monitoring missed, and costs less than a single APM seat at any commercial vendor. The trick was not the technology — Go binary, PHP receiver, SQLite store, LiteSpeed in front — but the discipline of writing assertions about what real users see, from places real users live, in the languages they actually speak. A monitoring system that only knows English HTTP 200s is not monitoring an Asia-Pacific video site; it's monitoring a US health-check endpoint. Build for the surface area you actually serve, store the raw data so future-you can ask new questions, and prefer dumb thresholds you trust to clever detection you don't.
