A region went dark and we found out from Twitter
It was a Sunday morning when our Japan feed stopped returning fresh video results. Nothing crashed. No 500s in the logs. The multi-region cron that pulls streaming-platform metadata kept running, kept exiting 0, kept writing rows into SQLite. The problem was that every one of those rows pointed at a manifest URL that now answered with a polite, cacheable 403. We run TrendVidStream across eight regions, discovering and indexing streaming sources, and the only reason I learned the JP manifests were dead was a user email three hours later.
That is the failure mode this article is about: services that are up but not healthy. A process can pass every liveness check ever written and still be quietly serving garbage. For a video discovery platform, "garbage" means a manifest that returns a 200 with an empty body, an #EXTM3U that points at expired segments, or a CDN that suddenly geo-blocks the exact region you built the feed for.
This post walks through the probe we built to catch that: a small Go binary that fetches manifests across regions, decides whether each one is actually healthy, and exports the verdict as Prometheus metrics. I will show the probe core, the custom collector, the entrypoint, and how it bridges into a PHP 8.4 + SQLite + FTP-deploy stack that was never designed with Go in mind.
Liveness is not health, and uptime lies
Most monitoring you inherit answers one question: is the process accepting connections? That is liveness, and it is necessary but almost useless on its own. The incidents that actually hurt us all live in the gap between liveness and health:
- A manifest URL returns 200 OK but the body is empty or truncated.
- TLS still handshakes, but the upstream CDN started returning 403 for our region's egress IP.
- The manifest parses fine, but every segment it references is stale by six hours.
- Latency crept from 120ms to 9 seconds and the player times out before the first frame.
- One region of eight is fully dead while the aggregate dashboard still reads "99% up."
A health probe is a deliberate, opinionated test that asserts the thing does its job, not merely that it answers the phone. For us the job is: a given streaming source, fetched from a given region's vantage point, returns a parseable manifest within a deadline. Everything below is in service of encoding that sentence into metrics.
What the probe actually measures
Before writing code, I wrote down the signals worth emitting. Keep this list short — every metric you publish is a metric someone has to reason about at 3am.
- Reachability: did we get a response at all, or a DNS/TLS/connection error?
- Status class: was it a 2xx, or did we get redirected into a login wall / blocked?
- Manifest validity: does the body actually look like HLS (#EXTM3U) or DASH (<MPD)?
- Latency: how long did the fetch take, in milliseconds?
- Region: every signal is labeled by the vantage point it was measured from.
Three gauges cover all of it: video_manifest_up (the composite verdict), video_manifest_latency_ms, and video_manifest_http_status. Resist the urge to add a histogram per source until you actually need percentiles — cardinality is the silent killer of Prometheus deployments, and with eight regions times a few hundred sources you are already multiplying labels fast.
A minimal probe in Go
The heart of the system is one pure-ish function: given an HTTP client and a target, fetch the manifest and return a verdict. It does not touch metrics, it does not log, it does not know about regions beyond carrying the label through. That isolation is what makes it testable.
package probe

import (
	"context"
	"io"
	"net/http"
	"strings"
	"time"
)

// Result captures everything we later turn into a metric.
type Result struct {
	Target     string
	Region     string
	StatusCode int
	OK         bool
	LatencyMS  float64
	Err        string
}

// ProbeManifest fetches an HLS/DASH manifest and judges its health.
// "Healthy" means: a 2xx, within the deadline, with a body that
// actually smells like a manifest.
func ProbeManifest(ctx context.Context, client *http.Client, target, region string) Result {
	start := time.Now()
	res := Result{Target: target, Region: region}

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		res.Err = err.Error()
		return res
	}
	req.Header.Set("User-Agent", "tvs-health-probe/1.0")

	resp, err := client.Do(req)
	res.LatencyMS = float64(time.Since(start).Milliseconds())
	if err != nil {
		res.Err = err.Error()
		return res
	}
	defer resp.Body.Close()
	res.StatusCode = resp.StatusCode

	head, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
	body := string(head)
	res.OK = resp.StatusCode >= 200 && resp.StatusCode < 300 &&
		(strings.Contains(body, "#EXTM3U") || strings.Contains(body, "<MPD"))
	if !res.OK && res.Err == "" {
		res.Err = "unhealthy: status or body is not a manifest"
	}
	return res
}
A few decisions worth defending:
- io.LimitReader(resp.Body, 512): we only need the first few hundred bytes to recognize a manifest. Reading the whole thing wastes bandwidth across eight regions on every scrape and invites memory pressure when a misconfigured source streams megabytes.
- The deadline comes from context, not from the client alone. The caller owns the timeout, which means the collector can enforce a budget that is shorter than Prometheus's scrape timeout.
- Errors are strings on the result, not returned values. A probe "failing" is normal, expected data; it is not an exception. Modeling it as a value keeps the fan-out loop trivial.
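The verdict rule inside ProbeManifest is small enough to pull out and exercise on its own. The sketch below is not part of the original probe: isHealthy is a hypothetical helper that restates the same status-and-body check, and the sample bodies are invented to match the failure modes listed earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// isHealthy is a hypothetical helper that mirrors the rule in
// ProbeManifest: a 2xx status AND a body prefix that contains an
// HLS (#EXTM3U) or DASH (<MPD) marker.
func isHealthy(status int, head string) bool {
	is2xx := status >= 200 && status < 300
	looksLikeManifest := strings.Contains(head, "#EXTM3U") || strings.Contains(head, "<MPD")
	return is2xx && looksLikeManifest
}

func main() {
	// Made-up inputs, one per failure mode.
	cases := []struct {
		name   string
		status int
		head   string
	}{
		{"valid HLS", 200, "#EXTM3U\n#EXT-X-VERSION:3\n"},
		{"valid DASH", 200, `<?xml version="1.0"?><MPD xmlns="urn:mpeg:dash:schema:mpd:2011">`},
		{"200 with empty body", 200, ""},
		{"geo-blocked", 403, "#EXTM3U\n"},
		{"login wall", 200, "<html><body>Sign in</body></html>"},
	}
	for _, c := range cases {
		fmt.Printf("%-20s healthy=%v\n", c.name, isHealthy(c.status, c.head))
	}
}
```

Only the first two cases should come out healthy. Note what the rule does not catch: a manifest that parses but points at stale segments still passes, because the check never follows segment URLs.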
Turning probe results into Prometheus metrics
The official client_golang library gives you two ways to publish metrics. The easy way — prometheus.NewGauge and friends — keeps state in process and you .Set() it from a background loop. That works, but it has a nasty edge: if a target disappears from your config, its last value sits there forever, and your dashboard shows a manifest as "up" hours after you stopped watching it.
The better fit for a probe is a custom collector. You implement Describe and Collect, and the metrics are generated fresh on every scrape from whatever the current target list is. No stale series, no manual cleanup.
package main

import (
	"context"
	"net/http"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"

	"yourmodule/probe"
)

type Target struct {
	URL    string `json:"url"`
	Region string `json:"region"`
}

type ProbeCollector struct {
	targets  func() []Target
	client   *http.Client
	up       *prometheus.Desc
	latency  *prometheus.Desc
	httpCode *prometheus.Desc
}

func NewProbeCollector(targets func() []Target) *ProbeCollector {
	return &ProbeCollector{
		targets: targets,
		client:  &http.Client{Timeout: 8 * time.Second},
		up: prometheus.NewDesc(
			"video_manifest_up",
			"1 if the manifest is reachable and parseable, else 0.",
			[]string{"target", "region"}, nil,
		),
		latency: prometheus.NewDesc(
			"video_manifest_latency_ms",
			"Time to fetch the manifest head, in milliseconds.",
			[]string{"target", "region"}, nil,
		),
		httpCode: prometheus.NewDesc(
			"video_manifest_http_status",
			"Last HTTP status code returned by the manifest.",
			[]string{"target", "region"}, nil,
		),
	}
}

func (c *ProbeCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.up
	ch <- c.latency
	ch <- c.httpCode
}

func (c *ProbeCollector) Collect(ch chan<- prometheus.Metric) {
	targets := c.targets()
	results := make([]probe.Result, len(targets))

	var wg sync.WaitGroup
	for i, t := range targets {
		wg.Add(1)
		go func(i int, t Target) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 8*time.Second)
			defer cancel()
			results[i] = probe.ProbeManifest(ctx, c.client, t.URL, t.Region)
		}(i, t)
	}
	wg.Wait()

	for _, r := range results {
		up := 0.0
		if r.OK {
			up = 1.0
		}
		ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, up, r.Target, r.Region)
		ch <- prometheus.MustNewConstMetric(c.latency, prometheus.GaugeValue, r.LatencyMS, r.Target, r.Region)
		ch <- prometheus.MustNewConstMetric(c.httpCode, prometheus.GaugeValue, float64(r.StatusCode), r.Target, r.Region)
	}
}
The collector fans every target out to its own goroutine and waits for all of them, so the wall-clock cost of a scrape is roughly the slowest single probe, not the sum. The 8-second deadline covers each request and its 512-byte read; it does not cover reading targets.json or goroutine scheduling, so the whole scrape lands slightly above 8 seconds in the worst case. That still leaves headroom under a Prometheus scrape_timeout of 10s.
One honest caveat: probing synchronously on every scrape only scales to a few hundred targets. Once you cross that, move the probing into a background ticker that writes into a cached slice protected by a mutex, and have Collect read the cache. The shape of the metrics stays identical; only the timing of when the work happens changes. We did not need that until we passed ~400 sources.
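A minimal sketch of that background-ticker shape, using only the standard library. The names Cache, Store, Snapshot, and Run are hypothetical, Result is a trimmed stand-in for probe.Result, and probeAll stands for whatever function fans out the real probes; none of this is the code we run in production.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Result is a trimmed stand-in for probe.Result.
type Result struct {
	Target string
	Region string
	OK     bool
}

// Cache holds the most recent batch of probe results behind a mutex.
type Cache struct {
	mu      sync.RWMutex
	results []Result
}

// Store replaces the cached batch with a fresh one.
func (c *Cache) Store(rs []Result) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.results = rs
}

// Snapshot returns a copy, so a scrape never shares the backing
// array with the refresh loop.
func (c *Cache) Snapshot() []Result {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]Result, len(c.results))
	copy(out, c.results)
	return out
}

// Run primes the cache once, then refreshes it on every tick until
// stop is closed. probeAll is an assumed callback that runs the probes.
func (c *Cache) Run(interval time.Duration, probeAll func() []Result, stop <-chan struct{}) {
	c.Store(probeAll())
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			c.Store(probeAll())
		case <-stop:
			return
		}
	}
}

func main() {
	cache := &Cache{}
	stop := make(chan struct{})
	fake := func() []Result { // placeholder for the real fan-out
		return []Result{{Target: "https://example.invalid/a.m3u8", Region: "jp", OK: true}}
	}
	go cache.Run(50*time.Millisecond, fake, stop)
	time.Sleep(120 * time.Millisecond)
	close(stop)
	fmt.Println(len(cache.Snapshot()), "cached result(s)")
}
```

With this in place, Collect would range over cache.Snapshot() instead of launching probes. The trade-off is that a scrape now reports results up to one interval old, so the refresh interval should stay below the scrape interval.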
Probing across eight regions
The exporter itself is stateless — it loads its target list from a JSON file on disk and serves /metrics. Keeping it stateless is deliberate: it means I can deploy the same binary to a node in every region and the only thing that differs is which targets.json got dropped next to it.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func loadTargets(path string) func() []Target {
	return func() []Target {
		f, err := os.Open(path)
		if err != nil {
			log.Printf("targets: %v", err)
			return nil
		}
		defer f.Close()

		var t []Target
		if err := json.NewDecoder(f).Decode(&t); err != nil {
			log.Printf("decode targets: %v", err)
			return nil
		}
		return t
	}
}

func main() {
	path := os.Getenv("PROBE_TARGETS")
	if path == "" {
		path = "/var/lib/tvs/targets.json"
	}

	reg := prometheus.NewRegistry()
	reg.MustRegister(NewProbeCollector(loadTargets(path)))

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		EnableOpenMetrics: true,
	}))
	log.Println("probe exporter listening on :9180")
	log.Fatal(http.ListenAndServe(":9180", nil))
}
Notice loadTargets re-reads the file on every scrape because it returns a closure that Collect calls each time. That means the PHP side can rewrite targets.json whenever sources change and the exporter picks it up on the next scrape — no restart, no reload signal, no coupling. Using a fresh prometheus.NewRegistry() instead of the default global registry keeps the /metrics output clean of the Go runtime metrics you usually do not want from a tiny probe (register collectors.NewGoCollector() explicitly if you do).
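For reference, the file that closure decodes is just a JSON array of url/region objects. The sketch below decodes an invented sample (the example.invalid URLs are placeholders) and drops entries with an empty url or region. That filtering, and the parseTargets name, are assumptions for illustration; the loader shown above does no validation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Target matches the struct the collector uses.
type Target struct {
	URL    string `json:"url"`
	Region string `json:"region"`
}

// parseTargets is a hypothetical helper: it decodes the JSON array and
// keeps only entries that have both a url and a region.
func parseTargets(raw string) ([]Target, error) {
	var all []Target
	if err := json.NewDecoder(strings.NewReader(raw)).Decode(&all); err != nil {
		return nil, err
	}
	valid := all[:0] // filter in place, reusing the backing array
	for _, t := range all {
		if t.URL != "" && t.Region != "" {
			valid = append(valid, t)
		}
	}
	return valid, nil
}

func main() {
	// Invented sample of what targets.json might contain.
	sample := `[
	  {"url": "https://example.invalid/live/a.m3u8", "region": "jp"},
	  {"url": "https://example.invalid/vod/b.mpd",   "region": "us"},
	  {"url": "",                                    "region": "de"}
	]`
	targets, err := parseTargets(sample)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	for _, t := range targets {
		fmt.Printf("%s -> %s\n", t.Region, t.URL)
	}
}
```

Skipping malformed entries rather than failing the whole load keeps one bad row from blanking a region's metrics, at the cost of hiding the bad row unless you also log it.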
Wiring it into a PHP, SQLite, and FTP stack
Here is where reality bites. Our application is PHP 8.4 with SQLite FTS5 for the search index, and it deploys over FTP — there is no fancy container pipeline, no Kubernetes, no service mesh. Dropping a Go binary into that world has to be done without disturbing it.
The contract is one file: targets.json. The PHP cron that already runs after every multi-region fetch gets one extra step — export the set of active manifests the probe should watch. The Go exporter never touches the database; it only reads that JSON. This keeps the two worlds decoupled and means the Go side cannot corrupt or lock the SQLite file that the live site depends on.
<?php
declare(strict_types=1);
// cron/export_probe_targets.php
// Runs after the multi-region fetch cron. Writes the set of live
// streaming-source manifests the Go exporter should watch.
const TARGETS_PATH = '/var/lib/tvs/targets.json';
$db = new PDO('sqlite:' . __DIR__ . '/../data/tvs.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$rows = $db->query(
"SELECT manifest_url AS url, region
FROM sources
WHERE active = 1
AND manifest_url LIKE 'http%'
ORDER BY region"
)->fetchAll(PDO::FETCH_ASSOC);
$json = json_encode($rows, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
// Atomic write: rename is atomic on the same filesystem, so the
// exporter never reads a half-written file mid-scrape.
$tmp = TARGETS_PATH . '.tmp';
file_put_contents($tmp, $json, LOCK_EX);
rename($tmp, TARGETS_PATH);
fwrite(STDERR, sprintf("wrote %d probe targets\n", count($rows)));
The atomic write via rename is the detail people skip and regret. The Go side reads the file on every scrape; if PHP wrote in place, a scrape landing mid-write would see truncated JSON and the whole region's metrics would vanish for one interval, tripping a false alert. Write to a temp file, rename over the target — rename(2) is atomic on the same filesystem.
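For deployment, the PHP app still goes over FTP exactly as before. The Go binary does not: you build it once (CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build), scp the static binary to each region node, and run it under systemd. The CGO_ENABLED=0 matters when the build machine is itself linux/amd64, because a cgo-enabled build can link the net package's resolver against libc and stop being static. A static Go binary has no runtime to install, so the FTP-shaped hosting world it lives next to never has to know it exists.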
Alerting that respects multi-region reality
The single most important lesson from the JP incident: do not alert per manifest. Individual streaming sources blip constantly — a CDN hiccup, a transient 503, a slow region. If you page on every video_manifest_up == 0 you will train yourself to ignore the pager within a week.
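What actually matters is a region losing coverage. So the alert gate asks: across all manifests in a region, what fraction is healthy? Only when that drops below a quorum do we escalate. This Python script runs from cron, queries Prometheus, and exits non-zero when any region is degraded so a wrapper can escalate. If you lean on cron's MAILTO instead, remember that cron mails whenever a job produces output, whatever the exit code, so redirect stdout to /dev/null and let only the stderr alert lines through.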
#!/usr/bin/env python3
"""Region health gate for the multi-region video probe.

Queries Prometheus and only escalates when a *region* is degraded, not
when a single manifest blips. Run from cron every few minutes.
"""
import sys

import requests

PROM = "http://127.0.0.1:9090"
MIN_HEALTHY_RATIO = 0.6  # region is "down" if <60% of its manifests are up


def region_health() -> dict[str, float]:
    query = "avg by (region) (video_manifest_up)"
    resp = requests.get(
        f"{PROM}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    out: dict[str, float] = {}
    for series in resp.json()["data"]["result"]:
        region = series["metric"]["region"]
        out[region] = float(series["value"][1])
    return out


def main() -> int:
    health = region_health()
    degraded = {r: v for r, v in health.items() if v < MIN_HEALTHY_RATIO}
    if not degraded:
        print("all regions healthy:", health)
        return 0
    for region, ratio in degraded.items():
        print(f"ALERT region={region} healthy_ratio={ratio:.0%}", file=sys.stderr)
    return 1


if __name__ == "__main__":
    raise SystemExit(main())
The avg by (region) (video_manifest_up) query collapses every manifest's 0/1 gauge into a single ratio per region. A region at 0.95 is fine — a couple of sources are flaky, that is normal. A region at 0.30 means the vantage point itself is compromised: blocked egress, a regional CDN outage, or a config that shipped wrong. That is the signal worth a human's attention, and it is exactly the one our old liveness checks could never produce.
If you prefer this in Prometheus's own alerting, the same logic is avg by (region) (video_manifest_up) < 0.6 with a for: 5m clause to ride out transients. I keep both — the PromQL alert for real-time paging and the Python gate as a belt-and-suspenders cron that can also write to our internal status table.
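To make the arithmetic behind that query concrete, here is a small Go sketch that computes the same per-region ratio from raw 0/1 samples and applies the 0.6 quorum. Sample, regionRatios, and degraded are hypothetical names, and the sample data is invented.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one video_manifest_up reading: 1 for healthy, 0 for not.
type Sample struct {
	Region string
	Up     float64
}

// regionRatios mirrors avg by (region) (video_manifest_up): the mean
// of the 0/1 values within each region.
func regionRatios(samples []Sample) map[string]float64 {
	sum := map[string]float64{}
	n := map[string]int{}
	for _, s := range samples {
		sum[s.Region] += s.Up
		n[s.Region]++
	}
	out := make(map[string]float64, len(sum))
	for region, total := range sum {
		out[region] = total / float64(n[region])
	}
	return out
}

// degraded returns the regions whose ratio is below minRatio, sorted
// so the output is stable.
func degraded(ratios map[string]float64, minRatio float64) []string {
	var out []string
	for region, ratio := range ratios {
		if ratio < minRatio {
			out = append(out, region)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	samples := []Sample{
		{"jp", 0}, {"jp", 0}, {"jp", 1}, // 1 of 3 up
		{"us", 1}, {"us", 1}, {"us", 0}, {"us", 1}, // 3 of 4 up
	}
	fmt.Println(degraded(regionRatios(samples), 0.6)) // prints [jp]
}
```

One manifest down in us leaves the region at 0.75 and quiet, while jp at one in three trips the gate. A region with no samples at all never appears in the map, so a dead exporter needs its own absence check.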
What this bought us
Since shipping the probe, the class of incident that opened this post — a region serving 200s over dead manifests — gets caught in minutes by us instead of hours later by a user. Concretely:
- Mean time to detection for a regional manifest failure dropped from "whenever someone emails" to under five minutes.
- We caught a CDN that started geo-blocking one region's egress IP before it affected discovery rankings.
- The video_manifest_latency_ms gauge surfaced a slow origin that was technically "up" but pushing first-frame time past the player timeout.
Conclusion
The trap is believing that a green liveness check means your system works. For anything that proxies, fetches, or indexes external content — and a multi-region video discovery platform is nothing but that — health is a separate, stronger claim that you have to assert deliberately. A Go probe makes that cheap: one pure function that judges a manifest, a custom collector that generates fresh metrics on every scrape, a stateless binary fed by a JSON file, and an alert that pages on regions rather than individual blips.
The whole thing is a few hundred lines and one static binary that sits quietly next to a PHP-and-FTP stack without disturbing it. If you operate anything where "up" and "working" can drift apart, build the probe before the next Sunday morning teaches you the difference the hard way.