It was 03:14 on a Saturday when our pager went off. Watch time across our Tokyo PoP had dropped 38% in 15 minutes, but every monitoring tool we had was green. Cloudflare's dashboard said "all good." UptimeRobot was pinging the apex domain successfully. Our origin LiteSpeed boxes were serving requests in under 80ms. And yet thousands of Japanese viewers were silently failing to load the HLS manifest for popular videos.
The problem turned out to be a single PoP in Osaka where Cloudflare was caching a stale master.m3u8 with a now-deleted variant URL. From any vantage point outside that specific edge, the file looked fine. From inside Japan on a JP residential ISP, it was broken. Our existing health checks — the kind that ping /healthz from a single US-east AWS region — could not see this. We were blind.
At TopVideoHub we aggregate trending video across nine regions in APAC, and "the manifest serves correctly in Tokyo" is just as important to us as "the database accepts writes." So we built a multi-region health-check aggregator that probes real CDN paths from real network locations, rolls the results up into a single SQLite store, and alerts on per-region anomalies rather than global outages. This post walks through what we measure, how the probes are structured, how the aggregator stores and exposes the data, and what surprised us when we put it in production.
What we actually need to measure
The first mistake we made was treating "is the site up?" as a single boolean. With a video site behind Cloudflare, there are at least four independent failure modes per region:
- Origin reachability. Can LiteSpeed actually be reached from the edge? Usually yes, but a Cloudflare misroute can break this for a specific ASN combination.
- HTML cache correctness. Is the cached /category/anime page returning the right vary set and a 200, not a stale 500 from an earlier deploy?
- HLS manifest integrity. Does master.m3u8 parse, list variants, and do those variants resolve?
- Search relevance. Is the Japanese/Korean/Chinese query returning results, or has the FTS5 tokenizer been silently broken by a schema migration?
A green ping doesn't help with any of those. We needed probes that exercise real product surface area, from real geographies, with assertions specific to each surface.
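To make the manifest-integrity check concrete, here is a minimal Python sketch of the "parse and list variants" step. It is an illustration rather than our probe code: the function name `list_variants`, the sample playlist, and the example.com URL are all made up for this example. A real probe would go on to request each returned URL to confirm it resolves.

```python
from urllib.parse import urljoin


def list_variants(master_text: str, master_url: str) -> list[str]:
    """Return absolute variant URLs named in an HLS master playlist.

    Raises ValueError when the text is not a playlist or names no variants.
    """
    lines = [ln.strip() for ln in master_text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an M3U8 playlist")
    variants = []
    expect_uri = False
    for ln in lines[1:]:
        if ln.startswith("#EXT-X-STREAM-INF"):
            expect_uri = True  # the next non-tag line is the variant URI
        elif not ln.startswith("#") and expect_uri:
            # Variant URIs may be relative to the master playlist's URL.
            variants.append(urljoin(master_url, ln))
            expect_uri = False
    if not variants:
        raise ValueError("master playlist lists no variants")
    return variants


# Hypothetical sample playlist, for illustration only.
SAMPLE = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/index.m3u8
"""

variants = list_variants(SAMPLE, "https://example.com/v/123/master.m3u8")
print(variants)
```

A stale manifest that still parses but points at a deleted variant only shows up in the follow-up requests, which is why parsing alone is not enough.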
The metrics we ended up collecting per probe:
- latency_ms: time to first byte (TTFB)
- total_ms: time until the full body is received
- http_status
- cf_cache_status, from the cf-cache-status header
- cf_pop, from cf-ray (the last three letters identify the PoP)
- body_sha256: a content fingerprint so we can detect divergence between regions
- assertion_failures: a JSON blob of which content checks failed
That last column is the one that made the system actually useful. A 200 OK with a 4 KB body is "healthy" by classical monitoring. A 200 OK with a 4 KB body when the page should be 180 KB is a silent disaster, and assertions catch it.
Architecture in one paragraph
Probes run as small Go binaries on cheap VPS instances we already keep around in nine regions for IP diversity. Each probe hits a list of URLs supplied by the central aggregator, runs HTTP requests with strict timeouts, performs content assertions, and POSTs a JSON batch back to the aggregator every 60 seconds. The aggregator is a PHP 8.4 endpoint that validates the payload and writes it to a SQLite database with two tables (probe_results, probe_alerts). A small ingestion worker rolls hourly aggregates into a third table, probe_hourly, which we maintain by hand as a materialized view, since SQLite has no native one. The whole thing serves a /ops/healthboard page from LiteSpeed with the standard page cache disabled, because we want fresh numbers.
No Kubernetes, no Prometheus, no Grafana. The total runtime cost is about $24/month including the probe VPS fleet.
The probe worker
Each probe is a single Go binary because we wanted accurate timing (Go is garbage-collected, but in a program this small its pauses are short enough that they do not mask real latency), and because we wanted to ship one static file via scp to whatever VPS we had handy. The structure is intentionally boring.
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
	"strings"
	"time"
)

// Target is one URL to probe plus the content assertions to run against it.
type Target struct {
	URL         string   `json:"url"`
	Kind        string   `json:"kind"` // "page", "manifest", "search"
	MustContain []string `json:"must_contain"`
	MinBytes    int      `json:"min_bytes"`
}

// Result is what the probe reports to the aggregator for one Target.
type Result struct {
	Region    string   `json:"region"`
	URL       string   `json:"url"`
	Kind      string   `json:"kind"`
	Status    int      `json:"status"`
	LatencyMs int64    `json:"latency_ms"`
	TotalMs   int64    `json:"total_ms"`
	CFPop     string   `json:"cf_pop"`
	CFCache   string   `json:"cf_cache_status"`
	BodySHA   string   `json:"body_sha256"`
	BodyBytes int      `json:"body_bytes"`
	Failures  []string `json:"assertion_failures"`
	Timestamp int64    `json:"ts"`
}

// rayRe captures the three-letter PoP code at the end of the cf-ray header.
var rayRe = regexp.MustCompile(`-([A-Z]{3})$`)

func probe(region string, t Target, client *http.Client) Result {
	r := Result{Region: region, URL: t.URL, Kind: t.Kind, Timestamp: time.Now().Unix()}
	req, err := http.NewRequest("GET", t.URL, nil)
	if err != nil {
		// A malformed URL in the target list must not crash the whole run.
		r.Failures = append(r.Failures, "request: "+err.Error())
		return r
	}
	req.Header.Set("User-Agent", "TVH-Probe/1.0")
	req.Header.Set("Accept-Language", regionLang(region))

	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		r.Failures = append(r.Failures, "transport: "+err.Error())
		return r
	}
	defer resp.Body.Close()
	// client.Do returns once the response headers arrive, so this approximates TTFB.
	r.LatencyMs = time.Since(start).Milliseconds()

	// Cap the read at 8 MiB so a runaway response cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 8<<20))
	if err != nil {
		// Status stays 0 here, so the aggregator counts this sample as a failure.
		r.Failures = append(r.Failures, "read: "+err.Error())
		return r
	}
	r.TotalMs = time.Since(start).Milliseconds()
	r.Status = resp.StatusCode
	r.BodyBytes = len(body)
	sum := sha256.Sum256(body)
	r.BodySHA = hex.EncodeToString(sum[:])
	r.CFCache = resp.Header.Get("cf-cache-status")
	if ray := resp.Header.Get("cf-ray"); ray != "" {
		if m := rayRe.FindStringSubmatch(ray); len(m) == 2 {
			r.CFPop = m[1]
		}
	}

	// Content assertions: minimum size, then required substrings.
	if r.BodyBytes < t.MinBytes {
		r.Failures = append(r.Failures, fmt.Sprintf("size %d<min %d", r.BodyBytes, t.MinBytes))
	}
	bodyStr := string(body)
	for _, needle := range t.MustContain {
		if !strings.Contains(bodyStr, needle) {
			r.Failures = append(r.Failures, "missing: "+needle)
		}
	}
	return r
}

// regionLang returns the Accept-Language value a real user in that region would send.
func regionLang(region string) string {
	switch region {
	case "JP":
		return "ja,en;q=0.8"
	case "KR":
		return "ko,en;q=0.8"
	case "TW", "HK":
		return "zh-TW,zh;q=0.8,en;q=0.6"
	case "VN":
		return "vi,en;q=0.8"
	case "TH":
		return "th,en;q=0.8"
	}
	return "en"
}

func main() {
	region := os.Getenv("PROBE_REGION")
	if region == "" {
		fmt.Fprintln(os.Stderr, "PROBE_REGION required")
		os.Exit(1)
	}
	client := &http.Client{Timeout: 15 * time.Second}

	// Fetch the target list with the same timeout-bound client as the probes;
	// http.Get uses the default client, which has no timeout at all.
	resp, err := client.Get(os.Getenv("PROBE_TARGETS_URL"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch targets:", err)
		os.Exit(1)
	}
	var targets []Target
	err = json.NewDecoder(resp.Body).Decode(&targets)
	resp.Body.Close()
	if err != nil {
		fmt.Fprintln(os.Stderr, "decode targets:", err)
		os.Exit(1)
	}

	var results []Result
	for _, t := range targets {
		results = append(results, probe(region, t, client))
	}

	payload, err := json.Marshal(map[string]any{
		"region":  region,
		"results": results,
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "encode results:", err)
		os.Exit(1)
	}
	post, err := http.NewRequest("POST", os.Getenv("PROBE_SINK"), bytes.NewReader(payload))
	if err != nil {
		fmt.Fprintln(os.Stderr, "build post:", err)
		os.Exit(1)
	}
	post.Header.Set("Content-Type", "application/json")
	post.Header.Set("Authorization", "Bearer "+os.Getenv("PROBE_TOKEN"))
	sinkResp, err := client.Do(post)
	if err != nil {
		// Exit non-zero so cron's own failure reporting picks it up.
		fmt.Fprintln(os.Stderr, "post results:", err)
		os.Exit(1)
	}
	sinkResp.Body.Close()
}
Three things to call out about this probe:
- The Accept-Language header changes per region. Cloudflare and our LiteSpeed origin both vary on it in subtle ways, and the Japanese homepage of an Asia-Pacific aggregator looks different from the English one. Probing without the right language is testing a code path no real user takes.
- We read the body with io.LimitReader capped at 8 MiB. If something on the origin starts streaming a 2 GB file because of a bad migration, we don't want every probe VPS to OOM.
- The cf-ray header's trailing three characters identify the Cloudflare PoP. Capturing this lets us answer "which edge are we actually hitting from this region?" The answer is sometimes surprising (a probe in Bangkok served by SIN, etc.).
This binary is driven by cron every 60 seconds on each VPS. We deliberately do not run it as a long-lived daemon; one-shot cron means a crash on Tuesday doesn't go unnoticed until Friday.
The aggregator
The receiver is plain PHP 8.4. We already run LiteSpeed in front of our PHP for the main site, so adding one more endpoint costs us nothing. The whole receiver is under 150 lines.
<?php
declare(strict_types=1);

final class ProbeAggregator
{
    public function __construct(private \PDO $pdo) {}

    public function ingest(string $rawBody, string $authHeader): array
    {
        // Refuse everything when no token is configured; otherwise an empty
        // PROBE_TOKEN would let a bare "Bearer " header pass the check.
        $token = (string)($_ENV['PROBE_TOKEN'] ?? '');
        if ($token === '' || !hash_equals('Bearer ' . $token, $authHeader)) {
            http_response_code(401);
            return ['error' => 'unauthorized'];
        }

        // Malformed JSON is a client error, not a 500.
        try {
            $payload = json_decode($rawBody, true, 8, JSON_THROW_ON_ERROR);
        } catch (\JsonException $e) {
            http_response_code(400);
            return ['error' => 'invalid json'];
        }
        if (!is_array($payload) || !is_array($payload['results'] ?? null)) {
            http_response_code(400);
            return ['error' => 'no results'];
        }

        $region = (string)($payload['region'] ?? 'XX');
        $stmt = $this->pdo->prepare(<<<SQL
            INSERT INTO probe_results
                (region, url, kind, status, latency_ms, total_ms,
                 cf_pop, cf_cache, body_sha, body_bytes, failures, ts)
            VALUES (:region, :url, :kind, :status, :latency_ms, :total_ms,
                    :cf_pop, :cf_cache, :body_sha, :body_bytes, :failures, :ts)
        SQL);

        // One transaction per batch; roll back if any row fails to insert.
        $inserted = 0;
        $this->pdo->beginTransaction();
        try {
            foreach ($payload['results'] as $r) {
                $stmt->execute([
                    ':region'     => $region,
                    ':url'        => (string)$r['url'],
                    ':kind'       => (string)$r['kind'],
                    ':status'     => (int)$r['status'],
                    ':latency_ms' => (int)$r['latency_ms'],
                    ':total_ms'   => (int)$r['total_ms'],
                    ':cf_pop'     => (string)($r['cf_pop'] ?? ''),
                    ':cf_cache'   => (string)($r['cf_cache_status'] ?? ''),
                    ':body_sha'   => (string)($r['body_sha256'] ?? ''),
                    ':body_bytes' => (int)$r['body_bytes'],
                    ':failures'   => json_encode($r['assertion_failures'] ?? []),
                    ':ts'         => (int)$r['ts'],
                ]);
                $inserted++;
            }
            $this->pdo->commit();
        } catch (\Throwable $e) {
            $this->pdo->rollBack();
            throw $e;
        }

        $this->detectAnomalies($region);
        return ['ok' => true, 'inserted' => $inserted];
    }

    private function detectAnomalies(string $region): void
    {
        // URLs with >= 5 samples in the last 10 minutes and < 60% success.
        $stmt = $this->pdo->prepare(<<<SQL
            SELECT url,
                   AVG(CASE WHEN status >= 200 AND status < 400 THEN 1 ELSE 0 END) AS ok_rate,
                   COUNT(*) AS samples
            FROM probe_results
            WHERE region = :region
              AND ts > strftime('%s','now') - 600
            GROUP BY url
            HAVING samples >= 5 AND ok_rate < 0.6
        SQL);
        $stmt->execute([':region' => $region]);

        $ins = $this->pdo->prepare(<<<SQL
            INSERT INTO probe_alerts (region, url, ok_rate, ts)
            VALUES (:region, :url, :ok_rate, :ts)
            ON CONFLICT(region, url) DO UPDATE SET
                ok_rate = excluded.ok_rate, ts = excluded.ts
        SQL);
        foreach ($stmt as $row) {
            $ins->execute([
                ':region'  => $region,
                ':url'     => $row['url'],
                ':ok_rate' => (float)$row['ok_rate'],
                ':ts'      => time(),
            ]);
        }
    }
}
The anomaly detector is intentionally dumb: any URL with five or more samples in the last ten minutes whose success rate drops below 60% gets upserted into probe_alerts. A separate cron reads that table and posts to a Discord webhook. Smarter detection (EWMA, change-point detection) is on the backlog, but the dumb threshold has caught every real outage we've had so far. Premature sophistication in alerting almost always means false positives at 4 AM, and false positives are how teams learn to mute pages.
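The SQL in the receiver counts a sample as OK on HTTP status alone. As a hedged illustration of how the same dumb rule could also honour the content assertions, here is a small Python sketch. It keeps the window and thresholds from the receiver (ten minutes, five samples, 60%), but treating a sample with assertion failures as not OK is an assumption of this sketch, not what the receiver does, and `urls_to_alert` is a name invented for it.

```python
def urls_to_alert(rows, now, window_s=600, min_samples=5, min_ok_rate=0.6):
    """Apply the threshold rule to rows of (url, status, failures, ts).

    A sample is OK only if the status is 2xx/3xx AND no assertion failed.
    Returns {url: ok_rate} for every URL that breaches the rule.
    """
    stats = {}  # url -> (ok_count, total_count)
    for url, status, failures, ts in rows:
        if ts <= now - window_s:
            continue  # outside the rolling window
        ok, total = stats.get(url, (0, 0))
        sample_ok = 200 <= status < 400 and not failures
        stats[url] = (ok + int(sample_ok), total + 1)
    return {
        url: ok / total
        for url, (ok, total) in stats.items()
        if total >= min_samples and ok / total < min_ok_rate
    }


now = 10_000
rows = (
    # Five 200s, but three of them failed a size assertion: ok_rate 0.4.
    [("/master.m3u8", 200, ["size 4096<min 180000"], now - 60 * i) for i in range(3)]
    + [("/master.m3u8", 200, [], now - 60 * i) for i in range(3, 5)]
    # Only four samples, so the rule stays quiet even though all failed.
    + [("/search", 500, [], now - 60 * i) for i in range(4)]
)
alerts = urls_to_alert(rows, now)
print(alerts)
```

With status-only counting, the first URL above would look 100% healthy, which is exactly the "200 OK with a 4 KB body" case.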
SQLite schema and queries
We use SQLite for the same reason we use it everywhere else on the site: it's one file, it backs up via rsync, and it scales further than people expect. At ~7.5 million probe rows per month across nine regions, the database file is about 480 MB. Query latency for the dashboard is single-digit milliseconds.
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA temp_store = MEMORY;
PRAGMA mmap_size = 268435456;

CREATE TABLE probe_results (
    id         INTEGER PRIMARY KEY,
    region     TEXT NOT NULL,
    url        TEXT NOT NULL,
    kind       TEXT NOT NULL,
    status     INTEGER NOT NULL,
    latency_ms INTEGER NOT NULL,
    total_ms   INTEGER NOT NULL,
    cf_pop     TEXT,
    cf_cache   TEXT,
    body_sha   TEXT,
    body_bytes INTEGER,
    failures   TEXT,
    ts         INTEGER NOT NULL
) STRICT;

CREATE INDEX idx_probe_region_ts ON probe_results(region, ts DESC);
CREATE INDEX idx_probe_url_ts ON probe_results(url, ts DESC);

CREATE TABLE probe_alerts (
    region  TEXT NOT NULL,
    url     TEXT NOT NULL,
    ok_rate REAL NOT NULL,
    ts      INTEGER NOT NULL,
    PRIMARY KEY (region, url)
) STRICT;

CREATE TABLE probe_hourly (
    region      TEXT NOT NULL,
    url         TEXT NOT NULL,
    hour_ts     INTEGER NOT NULL,
    samples     INTEGER NOT NULL,
    ok_count    INTEGER NOT NULL,
    p50_latency INTEGER NOT NULL,
    p95_latency INTEGER NOT NULL,
    PRIMARY KEY (region, url, hour_ts)
) STRICT;
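The probe_hourly table is filled by the rollup worker, which is not shown in this post. As a rough sketch of the computation it has to do, the Python below buckets raw samples by hour and derives the same columns. Everything here is an assumption for illustration: the function names, the choice of latency_ms (rather than total_ms) as the percentile input, and the nearest-rank percentile method.

```python
from collections import defaultdict


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an ascending list; p is an integer 1..100."""
    rank = (p * len(sorted_vals) + 99) // 100  # integer ceil(p/100 * n)
    return sorted_vals[max(rank, 1) - 1]


def hourly_rollup(rows):
    """Roll rows of (region, url, status, latency_ms, ts) into hourly buckets.

    Returns {(region, url, hour_ts): dict} shaped like one probe_hourly row.
    """
    buckets = defaultdict(list)
    for region, url, status, latency_ms, ts in rows:
        hour_ts = ts - ts % 3600  # floor to the start of the hour
        buckets[(region, url, hour_ts)].append((status, latency_ms))
    out = {}
    for key, samples in buckets.items():
        latencies = sorted(lat for _, lat in samples)
        out[key] = {
            "samples": len(samples),
            "ok_count": sum(1 for status, _ in samples if 200 <= status < 400),
            "p50_latency": percentile(latencies, 50),
            "p95_latency": percentile(latencies, 95),
        }
    return out


# Ten samples in the hour starting at ts=3600, latencies 10..100 ms.
rows = [("JP", "/", 200 if i < 8 else 503, 10 * (i + 1), 3600 + 60 * i) for i in range(10)]
rollup = hourly_rollup(rows)
print(rollup[("JP", "/", 3600)])
```

Keeping the raw rows and rolling them up separately means the percentile method can change later without losing history.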
The interesting query is the one that powers the dashboard. We want, per region and per URL, the rolling 5-minute success rate plus a flag for "divergence" — the case where this region sees a different body_sha than the global mode.
WITH recent AS (
    SELECT region, url, status, total_ms, body_sha
    FROM probe_results
    WHERE ts > strftime('%s','now') - 300
),
sha_counts AS (
    SELECT url, body_sha, COUNT(*) AS n
    FROM recent
    GROUP BY url, body_sha
),
-- Exactly one majority hash per URL. Ties break on the hash value, so the
-- join below never duplicates rows when two hashes are equally common.
majority_sha AS (
    SELECT url, body_sha
    FROM (
        SELECT url, body_sha,
               ROW_NUMBER() OVER (PARTITION BY url ORDER BY n DESC, body_sha) AS rn
        FROM sha_counts
    )
    WHERE rn = 1
)
SELECT r.region,
       r.url,
       COUNT(*) AS samples,
       AVG(CASE WHEN r.status BETWEEN 200 AND 399 THEN 1.0 ELSE 0.0 END) AS ok_rate,
       SUM(CASE WHEN r.body_sha != m.body_sha THEN 1 ELSE 0 END) AS divergent
FROM recent r
JOIN majority_sha m USING (url)
GROUP BY r.region, r.url
ORDER BY r.region, r.url;
The body_sha divergence column is what would have caught the Osaka incident. When eight regions see SHA abc... for the manifest and JP sees SHA def..., that's not a flap, that's a cache poisoning event at one edge. We render the dashboard cell red regardless of HTTP status.
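The same idea can be shown outside SQL. The following Python sketch is illustrative only (the function name and the sample hashes are invented): it picks the most common body hash per URL and reports which regions saw something else.

```python
from collections import Counter, defaultdict


def divergent_regions(samples):
    """Find regions whose body hash differs from the per-URL majority.

    samples: list of (region, url, body_sha) tuples.
    Returns {url: sorted list of divergent regions}; URLs with none are omitted.
    """
    counts = defaultdict(Counter)
    for _, url, sha in samples:
        counts[url][sha] += 1
    out = {}
    for url, counter in counts.items():
        # Most common hash wins; ties break on the hash so the result is stable.
        majority = min(counter, key=lambda s: (-counter[s], s))
        bad = sorted({r for r, u, sha in samples if u == url and sha != majority})
        if bad:
            out[url] = bad
    return out


# Eight regions agree on the manifest; JP is served something else.
samples = [(r, "/master.m3u8", "abc") for r in ["US", "GB", "KR", "TW", "SG", "VN", "TH", "HK"]]
samples.append(("JP", "/master.m3u8", "def"))
samples += [(r, "/healthz", "123") for r in ["US", "JP"]]
print(divergent_regions(samples))
```

Note that this only makes sense for URLs whose body should be byte-identical everywhere; a page that is localized or personalized per region will always look divergent.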
For the FTS5 search probe specifically — testing that "アニメ" still returns results from our CJK-tokenized index — we needed an assertion stronger than HTTP 200. The probe target manifest is generated by a small Python script that lives next to our content vault:
import json
from pathlib import Path
from urllib.parse import quote

REGIONS = ["US", "GB", "JP", "KR", "TW", "SG", "VN", "TH", "HK"]

SEARCH_QUERIES = {
    "JP": ["アニメ", "音楽", "ゲーム実況"],
    "KR": ["케이팝", "먹방", "드라마"],
    "TW": ["遊戲", "美食", "音樂"],
    "HK": ["遊戲", "美食", "音樂"],
    "VN": ["bóng đá", "âm nhạc", "phim"],
    "TH": ["เพลง", "เกม", "ละคร"],
}

BASE = "https://topvideohub.com"


def build_targets(region: str) -> list[dict]:
    """Build the probe target list for one region."""
    targets = [
        {
            "url": f"{BASE}/",
            "kind": "page",
            "must_contain": ["TopVideoHub", '<main id="content"'],
            "min_bytes": 40_000,
        },
        {
            "url": f"{BASE}/category/music",
            "kind": "page",
            "must_contain": ["category-music", "video-card"],
            "min_bytes": 30_000,
        },
    ]
    # Regions without their own query list fall back to an English query.
    for q in SEARCH_QUERIES.get(region, ["music"]):
        targets.append({
            # Percent-encode the query: raw non-ASCII text or a space
            # (as in "bóng đá") does not make a valid URL.
            "url": f"{BASE}/search?q={quote(q)}",
            "kind": "search",
            "must_contain": ["search-results", q],
            "min_bytes": 8_000,
        })
    return targets


if __name__ == "__main__":
    out = Path("targets")
    out.mkdir(exist_ok=True)
    for r in REGIONS:
        (out / f"{r}.json").write_text(
            json.dumps(build_targets(r), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
    print(f"wrote {len(REGIONS)} target files")
Putting the query string itself into must_contain is a deliberate test of two things at once: that FTS5 returned something (we render the query back into the results page), and that the response wasn't a cached page for a different query. URL-keyed cache poisoning is real, particularly after a deploy that changes how query parameters affect cache keys.
Surfacing it through LiteSpeed and Cloudflare
The aggregator endpoint at /ops/probe-ingest and the dashboard at /ops/healthboard both live on the main LiteSpeed origin. We did three small but important things to keep them from breaking the main site:
- CacheDisable yes for any URL starting with /ops/. The whole point of a health dashboard is fresh data, and the LiteSpeed page cache would happily serve a 2-minute-old snapshot otherwise.
- A Cloudflare page rule that disables Cloudflare's CDN cache for /ops/* and forces the WAF into "high" mode. The ingest endpoint is rate-limited at the WAF layer to 30 requests/minute per probe ASN, which more than covers our once-a-minute cron with headroom.
- Basic auth on /ops/healthboard enforced by the PHP app rather than Cloudflare Access. We want the page to load even if Cloudflare itself is the thing that broke, which has happened.
That last point bit us early. The first version of the dashboard required Cloudflare Access to view, and on the day Cloudflare's APAC dashboard had its own incident, we couldn't see our own monitoring because the SSO challenge couldn't load. Now the dashboard is reachable on a long random path with HTTP basic auth, and works even with Cloudflare in Under Attack Mode.
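For readers who have not implemented app-level basic auth before, the check itself is small. The sketch below is in Python purely for illustration (our dashboard is PHP, and this is not its code); the function name and the credentials in the usage lines are hypothetical.

```python
import base64
import hmac


def basic_auth_ok(header, user, password):
    """True when an Authorization header carries the expected Basic credentials."""
    scheme, _, encoded = (header or "").partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return False
    got_user, sep, got_pass = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons, evaluated for both fields before combining.
    user_ok = hmac.compare_digest(got_user.encode(), user.encode())
    pass_ok = hmac.compare_digest(got_pass.encode(), password.encode())
    return user_ok and pass_ok


# Hypothetical credentials, for illustration only.
good = "Basic " + base64.b64encode(b"ops:s3cret").decode("ascii")
print(basic_auth_ok(good, "ops", "s3cret"))
```

The property that matters is that nothing in this path depends on a third party being up.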
What broke first
The list of things that surprised us in the first month, roughly in order:
- The JP probe VPS was on the same backbone as our Tokyo Cloudflare PoP. Latency from there looked great, success rate was 100% even when real users were complaining. We moved it to a different provider on a different transit and immediately started seeing the real picture. The lesson generalizes: a probe colocated with the thing it probes is barely a probe at all.
- Our must_contain assertions tripped on cookie banner A/B tests. Whenever marketing flipped the consent banner copy, a third of probes started "failing." We moved to asserting on structural HTML (<main id="content">) instead of marketing copy. Anything a non-engineer can change without telling you should not be an assertion target.
- The Korean probe started returning a translated page because our content negotiation hit a code path that, for Accept-Language: ko, applied an experimental Korean UI. The probe was working perfectly; the assertion was wrong. We added per-region must_contain lists.
- SQLite VACUUM during ingest caused a 12-second stall. We moved VACUUM to a weekly cron at 03:00 UTC and added PRAGMA wal_autocheckpoint=1000 so checkpoint pauses stay small.
- Discord webhook 429s during a real incident meant we got fewer alerts during the worst moment. We now batch alerts every 30 seconds and include a "this is alert #N for this URL in the last hour" counter.
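The alert batching in the last item is easy to get subtly wrong, so here is a minimal Python sketch of the idea. It is not our production code: the class name, method names, and message format are invented for this example, and the actual webhook POST is left to the caller.

```python
from collections import defaultdict, deque


class AlertBatcher:
    """Collect alerts and release them as one message per flush interval."""

    def __init__(self, flush_every_s=30, counter_window_s=3600):
        self.flush_every_s = flush_every_s
        self.counter_window_s = counter_window_s
        self.pending = []                   # messages waiting for the next flush
        self.history = defaultdict(deque)   # url -> timestamps of recent alerts
        self.last_flush = None

    def add(self, region, url, now):
        seen = self.history[url]
        # Drop alerts that have aged out of the one-hour counter window.
        while seen and seen[0] <= now - self.counter_window_s:
            seen.popleft()
        seen.append(now)
        self.pending.append(
            f"[{region}] {url} (alert #{len(seen)} for this URL in the last hour)"
        )

    def flush(self, now):
        """Return one combined message if the interval has elapsed, else None."""
        if not self.pending:
            return None
        if self.last_flush is not None and now - self.last_flush < self.flush_every_s:
            return None
        message = "\n".join(self.pending)
        self.pending.clear()
        self.last_flush = now
        return message


batcher = AlertBatcher()
batcher.add("JP", "/master.m3u8", now=0)
batcher.add("JP", "/master.m3u8", now=10)
first = batcher.flush(now=10)   # nothing flushed yet, so this goes out at once
batcher.add("KR", "/master.m3u8", now=20)
held = batcher.flush(now=25)    # only 15 s since the last flush: held back
second = batcher.flush(now=40)  # 30 s elapsed: released
print(first, held, second, sep="\n---\n")
```

One combined message per interval keeps the webhook call rate fixed no matter how many URLs are failing, which is the property that matters when the rate limiter is the bottleneck.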
None of these are exotic. They're the ordinary lessons of running synthetic monitoring against a real system. The point of the exercise is to surface them in your own architecture, not to copy ours.
Conclusion
The version of this system we run today does about 1.3 million probes a month across nine regions, has caught three Cloudflare cache anomalies that ordinary uptime monitoring missed, and costs less than a single APM seat at any commercial vendor. The trick was not the technology — Go binary, PHP receiver, SQLite store, LiteSpeed in front — but the discipline of writing assertions about what real users see, from places real users live, in the languages they actually speak. A monitoring system that only knows English HTTP 200s is not monitoring an Asia-Pacific video site; it's monitoring a US health-check endpoint. Build for the surface area you actually serve, store the raw data so future-you can ask new questions, and prefer dumb thresholds you trust to clever detection you don't.