Last Tuesday at 03:14 UTC, our Frankfurt edge started returning 200 OK on /healthz while quietly serving 12-second TTFB on actual video manifest requests. The TCP socket was alive. Nginx was answering. The upstream YouTube Data API proxy was timing out behind a stuck DNS resolver, and our naive ping-based monitoring saw none of it. Eight regions, four sites, and roughly 40,000 daily manifest requests later, I rewrote the whole health-check layer from scratch.
This is the story of that rewrite: what I learned running TrendVidStream, an 8-region streaming-platform discovery service deployed via FTP to LiteSpeed shared hosts, and why a 200-line PHP probe layer plus a small Go aggregator beat the Prometheus + Blackbox stack I tried first.
The Problem With GET /healthz
The canonical health check is a lie. A handler that returns {"ok": true} proves exactly one thing: PHP-FPM accepted a connection and executed a closure. It does not prove:
- The SQLite database file is readable and not locked by a stuck cron
- The YouTube Data API key for this region still has quota
- The FTS5 search index isn't corrupted from a half-finished migration
- The LiteSpeed page cache directory is writable
- DNS resolution to googleapis.com actually works from this host
- The 6-hour cron that refreshes regional video pools last ran successfully
I've seen every one of those fail independently while /healthz returned 200. A real health check has to exercise the dependency graph, not just the entry point.
The second problem is aggregation. With 8 regions (US, GB, DE, FR, IN, BR, AU, CA on our largest site) across 4 domains, that's 32 region × site combinations. Polling each one from a single monitor location gives you exactly one observer's view of the network. A monitor in Frankfurt that can reach Frankfurt happily reports green while a user in São Paulo sees 504s because the BR edge can't talk to the origin.
Designing the Probe Contract
First rule: the probe endpoint must do real work, but it must be cheap enough to call every 60 seconds without distorting metrics. I settled on three tiers:
- Liveness (/health/live): process is up, can return a response. ~1ms.
- Readiness (/health/ready): all hard dependencies respond within budget. ~50-200ms.
- Deep (/health/deep?token=...): exercises the full stack including external APIs. 1-3 seconds. Token-gated, called every 5 minutes max.
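The deep tier's two guards (the token and the 5-minute ceiling) can be sketched in a few lines. This is a hedged Python illustration, not the PHP handler from the deployment; `DeepProbeGate`, its parameters, and the injected clock are hypothetical names introduced only for this example.

```python
import hmac
import time
from typing import Callable, Optional, Tuple


class DeepProbeGate:
    """Hypothetical helper: token check plus a minimum gap between deep probes."""

    def __init__(self, token: str, min_interval_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._token = token.encode()
        self._min_interval_s = min_interval_s  # "every 5 minutes max"
        self._clock = clock
        self._last_run: Optional[float] = None

    def allow(self, presented: str) -> Tuple[bool, str]:
        # Constant-time comparison, so response timing does not leak the token.
        if not hmac.compare_digest(presented.encode(), self._token):
            return False, "forbidden"
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._min_interval_s:
            return False, "too_soon"
        self._last_run = now
        return True, "ok"
```

A rejected token never updates the last-run time, so a caller probing with bad tokens cannot starve the legitimate deep check.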
Here's the readiness probe in PHP 8.4. It runs each dependency check against a shared time budget and returns structured JSON with per-check latency:
<?php
declare(strict_types=1);

final class HealthProbe
{
    // Total wall-clock budget for all checks combined, in milliseconds.
    private const BUDGET_MS = 250;

    public function __construct(
        private readonly \PDO $db,
        private readonly string $cacheDir,
        private readonly string $region,
    ) {}

    public function ready(): array
    {
        $start = hrtime(true);
        $checks = [
            'sqlite_read'   => fn() => $this->checkSqliteRead(),
            'sqlite_write'  => fn() => $this->checkSqliteWrite(),
            'fts5_index'    => fn() => $this->checkFtsIndex(),
            'cache_writable'=> fn() => $this->checkCacheDir(),
            'cron_recency'  => fn() => $this->checkCronRecency(),
            'dns_resolve'   => fn() => $this->checkDns('youtube.googleapis.com'),
        ];
        $results = [];
        $overall = 'pass';
        foreach ($checks as $name => $fn) {
            $t0 = hrtime(true);
            try {
                $detail = $fn();
                $status = 'pass';
            } catch (\Throwable $e) {
                $detail = ['error' => $e->getMessage()];
                $status = 'fail';
                $overall = 'fail';
            }
            $results[$name] = [
                'status' => $status,
                'latency_ms' => (int)((hrtime(true) - $t0) / 1_000_000),
                'detail' => $detail,
            ];
            // Budget blown: stop running checks. A hard failure already
            // recorded must not be downgraded to "degraded".
            if (((hrtime(true) - $start) / 1_000_000) > self::BUDGET_MS) {
                if ($overall === 'pass') {
                    $overall = 'degraded';
                }
                break;
            }
        }
        return [
            'status' => $overall,
            'region' => $this->region,
            'timestamp' => gmdate('c'),
            'total_ms' => (int)((hrtime(true) - $start) / 1_000_000),
            'checks' => $results,
        ];
    }

    private function checkSqliteRead(): array
    {
        $row = $this->db->query('SELECT COUNT(*) AS n FROM videos LIMIT 1')->fetch();
        return ['rows' => (int)$row['n']];
    }

    private function checkSqliteWrite(): array
    {
        // WITHOUT ROWID requires a PRIMARY KEY, so use a plain single-row
        // heartbeat table keyed on a fixed id instead.
        $this->db->exec('CREATE TABLE IF NOT EXISTS _health (id INTEGER PRIMARY KEY CHECK (id = 1), ts INTEGER NOT NULL)');
        $this->db->exec('INSERT OR REPLACE INTO _health(id, ts) VALUES (1, ' . time() . ')');
        return ['wrote' => true];
    }

    private function checkFtsIndex(): array
    {
        $stmt = $this->db->query("SELECT COUNT(*) AS n FROM videos_fts WHERE videos_fts MATCH 'test'");
        return ['matches' => (int)$stmt->fetch()['n']];
    }

    private function checkCacheDir(): array
    {
        if (!is_writable($this->cacheDir)) {
            throw new \RuntimeException("cache dir not writable: {$this->cacheDir}");
        }
        return ['path' => $this->cacheDir];
    }

    private function checkCronRecency(): array
    {
        $row = $this->db->query(
            "SELECT MAX(last_run) AS last FROM cron_log WHERE job='fetch_videos'"
        )->fetch();
        // A NULL last_run (job never ran) casts to 0 and is reported as stale.
        $age = time() - (int)$row['last'];
        $maxAge = 8 * 3600;
        if ($age > $maxAge) {
            throw new \RuntimeException("cron stale: {$age}s > {$maxAge}s");
        }
        return ['age_seconds' => $age];
    }

    private function checkDns(string $host): array
    {
        // gethostbyname() returns the unmodified hostname on failure.
        $ip = gethostbyname($host);
        if ($ip === $host) {
            throw new \RuntimeException("DNS resolution failed for {$host}");
        }
        return ['host' => $host, 'ip' => $ip];
    }
}
A few things I want to call out. The budget pattern (BUDGET_MS = 250) is important on shared hosting: if SQLite is under lock contention and the read takes 800ms, we don't pile up more requests waiting for the remaining checks. We return degraded and ship what we have. The cron_recency check is the one that would have caught my Frankfurt outage — when the regional fetch job hangs, the DB stops getting fresh rows, and /health/ready goes red within the SLO window.
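The budget pattern is easier to reason about in isolation. Below is a minimal Python sketch of the same idea, not a port of the PHP class: `run_with_budget`, the injected `clock`, and the extra `skipped` list are assumptions added for illustration. The sketch keeps a `fail` verdict even when the budget is also exceeded, and lists which checks never ran.

```python
import time
from typing import Callable, Dict, List


def run_with_budget(checks: Dict[str, Callable[[], dict]], budget_ms: float,
                    clock: Callable[[], float] = time.monotonic) -> dict:
    """Run checks in order until the shared budget (milliseconds) is spent."""
    start = clock()
    overall = "pass"
    results: Dict[str, dict] = {}
    for name, check in checks.items():
        t0 = clock()
        try:
            detail, status = check(), "pass"
        except Exception as exc:
            detail, status = {"error": str(exc)}, "fail"
            overall = "fail"
        results[name] = {
            "status": status,
            "latency_ms": int((clock() - t0) * 1000),
            "detail": detail,
        }
        if (clock() - start) * 1000 > budget_ms:
            # Over budget: stop early, but never mask a hard failure.
            if overall == "pass":
                overall = "degraded"
            break
    skipped: List[str] = [n for n in checks if n not in results]
    return {"status": overall, "checks": results, "skipped": skipped}
```

Reporting the skipped checks makes a `degraded` response self-explanatory: the reader can see that, say, `dns_resolve` was never attempted rather than assuming it passed.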
The Aggregator: Polling 32 Endpoints in Parallel
The naive aggregator loops over each endpoint sequentially. With 32 endpoints and a 250ms budget each, you're looking at 8 seconds of wall-clock per poll cycle. Unacceptable. We need true concurrent fan-out, and on the host I'm running the aggregator on (a $5 VPS), I don't want to spin up an event loop framework just for this.
Here's the Go version I shipped. Go's sync.WaitGroup + http.Client with a sane transport handles 32 concurrent probes in under 400ms:
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

type Endpoint struct {
	Site   string `json:"site"`
	Region string `json:"region"`
	URL    string `json:"url"`
}

type ProbeResult struct {
	Endpoint   Endpoint        `json:"endpoint"`
	Status     string          `json:"status"`
	LatencyMs  int64           `json:"latency_ms"`
	HTTPCode   int             `json:"http_code"`
	Body       json.RawMessage `json:"body,omitempty"`
	Error      string          `json:"error,omitempty"`
	ObservedAt time.Time       `json:"observed_at"`
}

// Shared client so connections are reused across poll cycles.
var client = &http.Client{
	Timeout: 3 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        64,
		MaxIdleConnsPerHost: 4,
		IdleConnTimeout:     30 * time.Second,
		DisableCompression:  false,
	},
}

func probe(ctx context.Context, ep Endpoint) ProbeResult {
	start := time.Now()
	res := ProbeResult{Endpoint: ep, ObservedAt: start.UTC()}
	req, err := http.NewRequestWithContext(ctx, "GET", ep.URL, nil)
	if err != nil {
		res.Status = "error"
		res.Error = err.Error()
		return res
	}
	req.Header.Set("User-Agent", "tvs-aggregator/1.0")
	resp, err := client.Do(req)
	res.LatencyMs = time.Since(start).Milliseconds()
	if err != nil {
		res.Status = "unreachable"
		res.Error = err.Error()
		return res
	}
	defer resp.Body.Close()
	res.HTTPCode = resp.StatusCode
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 16*1024))
	// Keep the body only if it is valid JSON: an HTML error page (or a body
	// truncated at 16 KiB) inside a json.RawMessage would make json.Marshal
	// fail for the whole result.
	if json.Valid(body) {
		res.Body = json.RawMessage(body)
	}
	// Classification is by HTTP status only, so this relies on the probe
	// endpoint returning a non-200 code when its JSON status is not "pass".
	switch {
	case resp.StatusCode == 200:
		res.Status = "healthy"
	case resp.StatusCode >= 500:
		res.Status = "failing"
	default:
		res.Status = "degraded"
	}
	return res
}

func PollAll(endpoints []Endpoint) []ProbeResult {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	results := make([]ProbeResult, len(endpoints))
	var wg sync.WaitGroup
	wg.Add(len(endpoints))
	// Semaphore: at most 16 probes in flight at once.
	sem := make(chan struct{}, 16)
	for i, ep := range endpoints {
		go func(i int, ep Endpoint) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			// Each goroutine writes only its own slot, so no mutex is needed.
			results[i] = probe(ctx, ep)
		}(i, ep)
	}
	wg.Wait()
	return results
}

func main() {
	endpoints := []Endpoint{
		{"trendvidstream.com", "US", "https://us.trendvidstream.com/health/ready"},
		{"trendvidstream.com", "GB", "https://gb.trendvidstream.com/health/ready"},
		{"trendvidstream.com", "DE", "https://de.trendvidstream.com/health/ready"},
		// ... 29 more
	}
	t0 := time.Now()
	results := PollAll(endpoints)
	fmt.Printf("polled %d endpoints in %s\n", len(results), time.Since(t0))
	for _, r := range results {
		fmt.Printf("%-25s %-3s %-10s %4dms %d\n",
			r.Endpoint.Site, r.Endpoint.Region, r.Status, r.LatencyMs, r.HTTPCode)
	}
}
The semaphore (sem with capacity 16) is the unsung hero. Without it, 32 simultaneous outbound connections from a small VPS can saturate the local conntrack table and start dropping packets, which makes the aggregator itself look unhealthy. Capping concurrency at 16 keeps the kernel happy and still gets us full coverage in under 500ms.
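For comparison, the same bounded fan-out can be sketched with Python's asyncio. This is an illustration of the concurrency cap only, under the assumption that the caller supplies the HTTP call: `poll_all`, `probe`, and `max_concurrency` are hypothetical names, and no real network client is shown.

```python
import asyncio
from typing import Awaitable, Callable, List


async def poll_all(urls: List[str],
                   probe: Callable[[str], Awaitable[dict]],
                   max_concurrency: int = 16) -> List[dict]:
    """Probe every URL, with at most max_concurrency probes in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        async with sem:  # waits here while the cap is reached
            return await probe(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_probe(url: str) -> dict:
    # Stand-in for a real HTTP request, so the sketch runs offline.
    await asyncio.sleep(0.01)
    return {"url": url, "status": "healthy"}


if __name__ == "__main__":
    demo_urls = [f"https://example.invalid/{i}/health/ready" for i in range(32)]
    print(len(asyncio.run(poll_all(demo_urls, fake_probe))))
```

As in the Go version, all 32 tasks are created up front and the semaphore decides how many are actually talking to the network at any moment.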
Multi-Vantage-Point Polling
One aggregator location is not enough. A check from Frankfurt that reports us.trendvidstream.com as healthy proves nothing about what a real user in Texas sees. The fix is cheap: run the aggregator from 3 vantage points (I use VPSes in Virginia, Frankfurt, and Singapore) and require quorum before flipping a region to degraded.
Quorum logic in Python — this is what the alerting layer actually consumes:
from dataclasses import dataclass
from collections import defaultdict
from typing import Iterable


@dataclass(frozen=True)
class Observation:
    vantage: str
    site: str
    region: str
    status: str
    latency_ms: int
    observed_at: float


def aggregate(observations: Iterable[Observation], quorum: int = 2) -> dict:
    by_target: dict[tuple[str, str], list[Observation]] = defaultdict(list)
    for obs in observations:
        by_target[(obs.site, obs.region)].append(obs)

    verdict = {}
    for (site, region), obs_list in by_target.items():
        status_counts: dict[str, int] = defaultdict(int)
        for o in obs_list:
            status_counts[o.status] += 1

        bad = status_counts.get('failing', 0) + status_counts.get('unreachable', 0)
        if bad >= quorum:
            final = 'failing'
        elif status_counts.get('degraded', 0) >= quorum:
            final = 'degraded'
        elif status_counts.get('healthy', 0) >= quorum:
            final = 'healthy'
        else:
            final = 'inconclusive'

        latencies = [o.latency_ms for o in obs_list]
        verdict[f"{site}/{region}"] = {
            'status': final,
            'vantages_reporting': len(obs_list),
            'p50_latency_ms': sorted(latencies)[len(latencies) // 2],
            'max_latency_ms': max(latencies),
            'breakdown': dict(status_counts),
        }
    return verdict
The quorum=2 default means a single flaky vantage point can't trigger a page. This eliminated something like 80% of the false positives I had with single-source monitoring. The remaining inconclusive state matters: if only one vantage reports and it says healthy, you don't actually know — log it as a data quality issue and don't alert on it.
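A few worked cases make the voting order concrete. The sketch below restates only the decision ladder from the quorum logic as a standalone function (`quorum_status` is a name introduced here for illustration) and runs it on hand-picked vantage reports.

```python
from collections import Counter
from typing import List


def quorum_status(statuses: List[str], quorum: int = 2) -> str:
    """Same precedence as the verdict logic: failing, degraded, healthy, else inconclusive."""
    counts = Counter(statuses)
    if counts["failing"] + counts["unreachable"] >= quorum:
        return "failing"
    if counts["degraded"] >= quorum:
        return "degraded"
    if counts["healthy"] >= quorum:
        return "healthy"
    return "inconclusive"


CASES = {
    "one flaky vantage": ["healthy", "healthy", "unreachable"],
    "two bad reports": ["failing", "unreachable", "healthy"],
    "single reporter": ["healthy"],
    "three-way split": ["healthy", "degraded", "failing"],
}

if __name__ == "__main__":
    for label, statuses in CASES.items():
        print(f"{label:18s} -> {quorum_status(statuses)}")
```

The three-way split is worth noticing: with one vantage each reporting healthy, degraded, and failing, no status reaches quorum, so the target lands in `inconclusive` rather than paging anyone.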
Wiring It Into the FTP Deploy
Our deploy is FTP-based to LiteSpeed shared hosts (you work with what you've got — these hosts cost $40/year and serve 8M requests/month happily). The health endpoints get pushed with the rest of the PHP code. The aggregator config is generated at deploy time from deploy_hosts.conf:
#!/bin/bash
# generate-endpoints.sh — run after each FTP deploy
set -euo pipefail

CONFIG_OUT="aggregator/endpoints.json"
REGIONS=(US GB DE FR IN BR AU CA)

echo '[' > "$CONFIG_OUT"
first=true
while IFS= read -r host; do
    [[ "$host" =~ ^# ]] && continue
    [[ -z "$host" ]] && continue
    for region in "${REGIONS[@]}"; do
        $first || echo ',' >> "$CONFIG_OUT"
        first=false
        cat >> "$CONFIG_OUT" <<EOF
{"site": "$host", "region": "$region", "url": "https://$host/health/ready?region=$region"}
EOF
    done
done < deploy_hosts.conf
echo ']' >> "$CONFIG_OUT"

scp "$CONFIG_OUT" aggregator-vps:/opt/health-aggregator/endpoints.json
ssh aggregator-vps 'systemctl reload health-aggregator'
The key lesson from FTP-based deploys: line-ending hygiene matters more than you'd think. deploy_hosts.conf with Windows \r\n line endings will silently corrupt the URLs in endpoints.json and the aggregator will spend an entire night reporting that every site is unreachable because it's trying to GET https://trendvidstream.com\r/health/ready. A sed -i 's/\r$//' deploy_hosts.conf in the pre-deploy hook catches it.
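One way to make the generator itself immune to line endings is to parse the host list with a tool that treats `\r\n` as a line break and to let a JSON library do the quoting. The following is a hedged Python sketch of that idea, not the script used in the deploy; `build_endpoints` is a hypothetical name, and the URL shape is copied from the shell script.

```python
import json
from typing import Dict, List, Sequence

REGIONS = ("US", "GB", "DE", "FR", "IN", "BR", "AU", "CA")


def build_endpoints(conf_text: str, regions: Sequence[str] = REGIONS) -> List[Dict[str, str]]:
    """Turn the contents of a hosts file into endpoint records."""
    endpoints = []
    # splitlines() breaks on \n, \r\n and bare \r; strip() drops stray whitespace.
    for raw in conf_text.splitlines():
        host = raw.strip()
        if not host or host.startswith("#"):
            continue
        for region in regions:
            endpoints.append({
                "site": host,
                "region": region,
                "url": f"https://{host}/health/ready?region={region}",
            })
    return endpoints


if __name__ == "__main__":
    sample = "# hosts\r\ntrendvidstream.com\r\n"
    print(json.dumps(build_endpoints(sample), indent=2))
```

Emitting the file through `json.dumps` also removes the manual comma bookkeeping, which is the other place a hand-built JSON array tends to break.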
What Counts as a Real Failure
Not every red probe is a page-worthy event. I learned this the hard way after waking up at 2am for the third time in a week to a transient DNS hiccup that resolved itself in 30 seconds. The rules I now follow:
- Soft failures (single probe red, recovers within 2 cycles): log, don't alert
- Sustained failures (3+ consecutive cycles red from quorum vantages): page on-call
- Pattern failures (one region red across all 4 sites): page immediately — it's almost certainly infrastructure, not application
- Catastrophic failures (all 8 regions red on one site): page immediately — origin is gone
- Slow failures (latency p95 > 2× rolling 24h baseline for 10 minutes): ticket, don't page
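The first two rules boil down to counting consecutive red cycles. Here is a minimal Python sketch of that classification, assuming the input is the per-cycle quorum verdict for one target, oldest first, and that only `failing` counts as red; `action_for`, `trailing_red`, and the `RED` set are names invented for this example.

```python
from typing import List

RED = {"failing"}  # assumption: only quorum-confirmed "failing" counts as red


def trailing_red(history: List[str]) -> int:
    """Number of consecutive red verdicts at the end of the history."""
    streak = 0
    for status in reversed(history):
        if status not in RED:
            break
        streak += 1
    return streak


def action_for(history: List[str], page_after: int = 3) -> str:
    """Map a verdict history to 'page', 'log', or 'none'."""
    streak = trailing_red(history)
    if streak >= page_after:
        return "page"  # sustained failure: 3+ consecutive red cycles
    if streak > 0:
        return "log"   # soft failure so far: record it, don't alert
    return "none"
```

For example, `action_for(["healthy", "failing", "failing"])` is `"log"`, and one more red cycle turns it into `"page"`; a single healthy cycle resets the streak.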
The pattern failure case is the one most monitoring stacks get wrong. If de.trendvidstream.com, de.dailywatch.video, and de.viralvidvault.com all go red within 60 seconds of each other, that's not an application bug — that's the German upstream DNS resolver dying, or a Cloudflare regional issue, or the upstream YouTube API rate-limiting from that geo. Treating it as a per-site application incident wastes 20 minutes of investigation time.
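Here's how I encode the pattern detection in the aggregator's verdict step. Note that the code is slightly more lenient than the rules above: it flags a regional outage at 3 of 4 sites and a site outage at 6 of 8 regions.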
from collections import defaultdict


def detect_patterns(verdicts: dict, threshold: int = 3) -> list[dict]:
    by_region: dict[str, list[str]] = defaultdict(list)
    by_site: dict[str, list[str]] = defaultdict(list)
    for target, v in verdicts.items():
        if v['status'] not in ('failing', 'degraded'):
            continue
        # Verdict keys have the form "site/region".
        site, region = target.split('/')
        by_region[region].append(site)
        by_site[site].append(region)

    patterns = []
    # Same region red on several sites: likely infrastructure, not application.
    for region, sites in by_region.items():
        if len(sites) >= threshold:
            patterns.append({
                'type': 'regional_outage',
                'region': region,
                'affected_sites': sites,
                'severity': 'page',
            })
    # Most regions red on one site: the origin is probably down.
    for site, regions in by_site.items():
        if len(regions) >= 6:
            patterns.append({
                'type': 'site_outage',
                'site': site,
                'affected_regions': regions,
                'severity': 'page',
            })
    return patterns
Storage and Retention
The aggregator writes every probe result to a local SQLite database with WAL mode enabled. At ~32 endpoints × 3 vantages × 1 poll/minute, that's roughly 138,000 rows/day. SQLite handles this without complaint, and I keep 30 days hot and roll older data into compressed monthly archives.
The schema is dead simple:
CREATE TABLE IF NOT EXISTS probe_results (
    id INTEGER PRIMARY KEY,
    vantage TEXT NOT NULL,
    site TEXT NOT NULL,
    region TEXT NOT NULL,
    status TEXT NOT NULL,
    http_code INTEGER,
    latency_ms INTEGER NOT NULL,
    observed_at INTEGER NOT NULL
) STRICT;

-- IF NOT EXISTS keeps the schema script safe to re-run, like the tables.
CREATE INDEX IF NOT EXISTS idx_probe_target_time ON probe_results(site, region, observed_at);
CREATE INDEX IF NOT EXISTS idx_probe_time ON probe_results(observed_at);

CREATE TABLE IF NOT EXISTS verdicts (
    id INTEGER PRIMARY KEY,
    site TEXT NOT NULL,
    region TEXT NOT NULL,
    status TEXT NOT NULL,
    decided_at INTEGER NOT NULL,
    evidence TEXT NOT NULL
) STRICT;
Using STRICT tables (available since SQLite 3.37) catches type bugs I'd otherwise miss until Grafana started rendering nonsense. The evidence column is JSON containing the raw vantage breakdown that produced the verdict, which is invaluable for post-mortems when someone asks "why did you page me?"
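The write path and the 30-day retention window can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not the aggregator's actual storage code: `open_store`, `record`, and `prune` are hypothetical helpers, only the `probe_results` table is shown, and the archive export is assumed to have run before rows are pruned.

```python
import sqlite3
import time
from typing import Optional

RETENTION_S = 30 * 24 * 3600  # "30 days hot"

# STRICT tables need SQLite 3.37+; fall back to a plain table on older builds.
_STRICT = " STRICT" if sqlite3.sqlite_version_info >= (3, 37, 0) else ""

SCHEMA = f"""
CREATE TABLE IF NOT EXISTS probe_results (
    id INTEGER PRIMARY KEY,
    vantage TEXT NOT NULL,
    site TEXT NOT NULL,
    region TEXT NOT NULL,
    status TEXT NOT NULL,
    http_code INTEGER,
    latency_ms INTEGER NOT NULL,
    observed_at INTEGER NOT NULL
){_STRICT};
CREATE INDEX IF NOT EXISTS idx_probe_time ON probe_results(observed_at);
"""


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
    conn.executescript(SCHEMA)
    return conn


def record(conn: sqlite3.Connection, vantage: str, site: str, region: str,
           status: str, http_code: Optional[int], latency_ms: int,
           observed_at: int) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO probe_results "
            "(vantage, site, region, status, http_code, latency_ms, observed_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (vantage, site, region, status, http_code, latency_ms, observed_at),
        )


def prune(conn: sqlite3.Connection, now: Optional[int] = None) -> int:
    """Delete rows older than the retention window; returns rows removed."""
    cutoff = int(now if now is not None else time.time()) - RETENTION_S
    with conn:
        cur = conn.execute("DELETE FROM probe_results WHERE observed_at < ?", (cutoff,))
    return cur.rowcount
```

The row estimate in the text checks out: 32 endpoints × 3 vantages × 1,440 minutes per day is 138,240 rows, and the `observed_at` index keeps the prune from scanning the whole table.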
What I'd Do Differently
If I were starting over:
- I'd skip the Go aggregator and go straight to a single-process Python asyncio implementation. The Go version is fast but operationally heavier than I needed for 32 endpoints.
- I'd put the verdict storage in PostgreSQL from day one. SQLite is fine for a single aggregator, but the moment I wanted to read from a dashboard on a different host, I had to add a sync layer.
- I'd separate the probe payload from the probe transport. Right now my deep probe returns latency and runs the checks. Splitting them would let me reuse the same probe contract for synthetic monitoring from real browsers.
- I'd add per-endpoint adaptive intervals. The US edge would get 30s polls because it serves 60% of traffic. The CA edge would get 120s polls because nobody pages over a 3-minute CA outage.
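Per-endpoint intervals need little more than a priority queue of next-due times. The sketch below is one possible shape for that scheduler, assuming a fixed interval per endpoint; `schedule` and the interval table are illustrative, with the 30s and 120s values taken from the last item above and 60s assumed as the default.

```python
import heapq
from typing import Dict, List, Tuple


def schedule(intervals: Dict[str, int], horizon_s: int) -> List[Tuple[int, str]]:
    """Return (time_s, endpoint) poll events before horizon_s, in time order."""
    # Every endpoint is due at t=0; ties are broken by endpoint name.
    heap = [(0, name) for name in intervals]
    heapq.heapify(heap)
    events = []
    while heap and heap[0][0] < horizon_s:
        due, name = heapq.heappop(heap)
        events.append((due, name))
        # Re-queue the endpoint at its own cadence.
        heapq.heappush(heap, (due + intervals[name], name))
    return events


if __name__ == "__main__":
    for due, name in schedule({"US": 30, "DE": 60, "CA": 120}, horizon_s=120):
        print(f"t={due:3d}s poll {name}")
```

In a two-minute window this polls US four times, DE twice, and CA once, so the busiest edge gets the tightest detection latency without raising the total probe volume much.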
Conclusion
A health check is a contract about what "working" means, and the default 200 OK contract is wrong for anything more complex than a static site. Once you have a probe that exercises the real dependency graph, you need vantage-point diversity to avoid being fooled by the network, and pattern detection to tell application bugs apart from infrastructure events.
The whole system — probe handler, Go aggregator, Python verdict logic, SQLite storage — is under 800 lines of code. It catches every real outage I've seen in the last six months and has paged me exactly twice for false positives, both during the first week before I tuned the quorum threshold. That's a much better trade than the Prometheus + Blackbox + Alertmanager stack I tried first, which was 4x the code and missed the Frankfurt outage that started this whole project.