Driving Video Region Router Config With etcd Watches and Leases

#etcd #go #php #distributedsystems

We route viewers to the nearest video edge based on a region map: a table that says "traffic from eu-west goes to edge pool 3, traffic from ap-south goes to edge pool 7, and if pool 7 is degraded, fail over to pool 2." For a long time that map lived in a PHP config file that we deployed with the rest of the app. The day a CDN partner had a regional outage in Singapore, we learned exactly how bad that design was: changing one routing rule meant a full deploy across every node, and during the ~90 seconds it took to roll out, half our Southeast Asia traffic kept hammering a dead pool. At DailyWatch that is the difference between a video starting in 400ms and a spinner that makes someone close the tab.

The fix was to pull the region map out of the deploy artifact entirely and put it in etcd, where it can be changed atomically, watched in real time, and survive a node restart without anyone touching a config file. This post walks through how we model video region routing in etcd, how we propagate changes to a fleet of routers in under a second, and the specific traps (lease expiry, watch reconnection, thundering-herd reloads) that bit us along the way.

Why etcd and not just a database row

We already run SQLite with FTS5 for our video catalog search, and the obvious lazy move is to add a region_routes table and poll it. We tried that first. Polling a row every few seconds gives you latency you can measure in seconds and a load pattern that scales linearly with your node count. etcd buys you three things a polled row does not:

Watches. Instead of asking "did anything change?" on a timer, every router holds an open watch and gets pushed the new value the instant it is committed. Propagation is a network round-trip, not a poll interval.
Leases. A key can be tied to a lease with a TTL. If the process that wrote it dies and stops renewing, the key disappears automatically. This is perfect for "this edge pool is healthy" markers that should vanish when the health-checker crashes.
Atomic multi-key transactions. You can swap an entire region map in one transaction so a router never observes a half-updated state where eu-west points at a pool that the same update was about to delete.

etcd is a Raft-backed consistent store. It is not a cache and not a high-throughput data store, so we keep it for control-plane data only: routing rules, pool health, feature flags. The video metadata stays in SQLite, served from disk behind LiteSpeed, and the hot path never touches etcd directly. That separation matters and I'll come back to it.

Modeling the region map as keys

etcd is a flat, sorted key-value space, so the design work is in the key layout. We namespace everything under a versioned prefix so we can run a v1 and v2 schema side by side during a migration:

/dw/v1/routes/eu-west      -> {"pool":"edge-3","fallback":"edge-2","weight":100}
/dw/v1/routes/ap-south     -> {"pool":"edge-7","fallback":"edge-2","weight":100}
/dw/v1/routes/us-east      -> {"pool":"edge-1","fallback":"edge-5","weight":100}
/dw/v1/pools/edge-3/health -> "ok"   (lease-bound, 10s TTL)
/dw/v1/pools/edge-7/health -> "ok"   (lease-bound, 10s TTL)

The routes/ subtree is the intent: where operators want traffic to go. The pools/.../health subtree is reality: which pools are actually answering. A router computes the effective target by combining the two — use pool if its health key exists, otherwise drop to fallback. Because the health keys are lease-bound, a crashed pool's health marker evaporates within the TTL window without any external system having to delete it.
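To make the intent-versus-reality rule concrete, here is a minimal Python sketch that resolves a region against a plain dict standing in for the etcd keyspace. The function name effective_target and the dict-based store are assumptions for illustration only; the real routers do this in Go against a watched in-memory table.

```python
import json


def effective_target(kv, region, prefix="/dw/v1"):
    """Return the pool a region should use, or None if the region has no route.

    kv is a dict mapping etcd-style keys to string values (a stand-in for
    the real keyspace, used here only for illustration).
    """
    raw = kv.get(f"{prefix}/routes/{region}")
    if raw is None:
        return None
    route = json.loads(raw)
    # A pool counts as healthy only while its lease-bound health key exists.
    if f"{prefix}/pools/{route['pool']}/health" in kv:
        return route["pool"]
    return route["fallback"]


example_kv = {
    "/dw/v1/routes/eu-west": '{"pool":"edge-3","fallback":"edge-2","weight":100}',
    "/dw/v1/routes/ap-south": '{"pool":"edge-7","fallback":"edge-2","weight":100}',
    "/dw/v1/pools/edge-3/health": "ok",
    # edge-7 has no health key: its lease expired, so ap-south falls back.
}
```

With this data, eu-west resolves to edge-3 and ap-south resolves to edge-2. Note that the sketch, like the rule it illustrates, does not check whether the fallback pool is itself healthy.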

The prefix structure also makes watches cheap. A router watches the single prefix /dw/v1/ and receives every route and health change in one stream. That is one open watch per router regardless of how many regions or pools you have.

Writing config atomically from the control plane

Operators change routing through a small control tool, not by editing etcd keys by hand. The important property is atomicity: when we shift a region to a new pool, the read side must never catch a torn write. etcd transactions (Txn) give us compare-and-swap and multiple operations in one commit. Here is the Python writer we use, built on the python-etcd3 client (the etcd3 package on PyPI, a community-maintained library rather than an official etcd project client):

import json
import etcd3

client = etcd3.client(host="10.0.0.10", port=2379)

def shift_region(region: str, new_pool: str, fallback: str, weight: int = 100):
    """Atomically repoint a region to a new edge pool."""
    key = f"/dw/v1/routes/{region}"
    value = json.dumps({
        "pool": new_pool,
        "fallback": fallback,
        "weight": weight,
    })

    # Read the current revision so we can detect a concurrent writer.
    current, meta = client.get(key)
    if current is None:
        # Key does not exist yet: only create if still absent.
        success, _ = client.transaction(
            compare=[client.transactions.version(key) == 0],
            success=[client.transactions.put(key, value)],
            failure=[],
        )
    else:
        # Key exists: only write if nobody changed it since we read.
        success, _ = client.transaction(
            compare=[client.transactions.mod(key) == meta.mod_revision],
            success=[client.transactions.put(key, value)],
            failure=[],
        )

    if not success:
        raise RuntimeError(f"concurrent update to {region}, retry")
    print(f"shifted {region} -> {new_pool} (fallback {fallback})")

if __name__ == "__main__":
    # Singapore CDN is down: move ap-south off edge-7 onto edge-2.
    shift_region("ap-south", new_pool="edge-2", fallback="edge-1")

The compare-and-swap on mod_revision is what makes concurrent operators safe. If two engineers both try to repoint ap-south at the same time, one transaction succeeds and the other fails the compare and raises, instead of silently clobbering. For a full map swap you would put several keys into a single success=[...] list so they all land in one Raft commit.
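A full map swap could look roughly like the sketch below. It reuses only the python-etcd3 calls already shown in shift_region (get, transactions.version, transactions.mod, transactions.put, transaction); the function name swap_region_map and the idea of passing the client in as a parameter are assumptions made for illustration, not our actual control tool.

```python
import json


def swap_region_map(client, new_routes, prefix="/dw/v1/routes/"):
    """Write several route keys in one etcd transaction (all or nothing).

    client is expected to behave like an etcd3 client object; new_routes maps
    a region name to a dict with "pool", "fallback" and "weight".
    Returns True if the transaction committed, False if any key changed
    between our read and the commit.
    """
    compares = []
    puts = []
    for region, route in sorted(new_routes.items()):
        key = prefix + region
        _, meta = client.get(key)
        if meta is None:
            # Key is absent: the swap is only valid if it is still absent.
            compares.append(client.transactions.version(key) == 0)
        else:
            # Key exists: the swap is only valid if nobody modified it since.
            compares.append(client.transactions.mod(key) == meta.mod_revision)
        puts.append(client.transactions.put(key, json.dumps(route)))

    # Every compare must hold for any put to land; the puts share one commit.
    succeeded, _ = client.transaction(compare=compares, success=puts, failure=[])
    return succeeded
```

If any single compare fails, none of the puts are applied, so a router never sees half of the new map. The caller would re-read and retry on a False return, as in the single-key case.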

Health markers with leases

The health-checker is a separate process running next to each edge pool. It writes a lease-bound health key and keeps the lease alive. If the checker dies, the lease is not renewed, etcd expires it, and every watching router learns the pool is gone — all without a central coordinator deciding the pool is dead. This is the piece that turned a manual incident response into a self-healing one.

import time
import etcd3

client = etcd3.client(host="10.0.0.10", port=2379)

def run_health_marker(pool: str, ttl: int = 10):
    lease = client.lease(ttl)
    key = f"/dw/v1/pools/{pool}/health"
    client.put(key, "ok", lease=lease)
    print(f"registered {pool} with {ttl}s lease")

    while True:
        if pool_is_serving(pool):
            # Renew before expiry; refresh at one third of the TTL.
            lease.refresh()
        else:
            # Stop refreshing; key expires within ttl and routers fail over.
            print(f"{pool} unhealthy, letting lease lapse")
            return
        time.sleep(ttl / 3)

def pool_is_serving(pool: str) -> bool:
    # Real check: HTTP probe a known manifest on the edge, verify 200 + body.
    # Returning True here for illustration.
    return True

if __name__ == "__main__":
    run_health_marker("edge-7")

The critical detail is refreshing at one third of the TTL, not at the TTL boundary. A single dropped renewal packet near the deadline would otherwise expire a perfectly healthy pool and cause a spurious failover. With a 10-second TTL and a ~3.3-second refresh interval, one missed renewal is absorbed comfortably; after two consecutive misses the third attempt lands right at the 10-second deadline, so treat the second miss as marginal and refresh a little more often if you need to ride out two. Tune the TTL to your tolerance: shorter means faster detection of real failures but less slack for network blips.
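The slack arithmetic is easy to check with a few lines of Python. This is a back-of-the-envelope model, not something etcd exposes: the function name and the margin_ms parameter (time a renewal needs to reach etcd before the deadline) are assumptions for illustration.

```python
def missed_renewals_tolerated(ttl_ms, interval_ms, margin_ms=0):
    """Consecutive lost renewals a lease survives after one successful renewal.

    Attempt k (k = 1, 2, ...) is sent k * interval_ms after the last success.
    It only helps if it reaches etcd before the TTL runs out, i.e. if
    k * interval_ms + margin_ms < ttl_ms.
    """
    usable_attempts = 0
    k = 1
    while k * interval_ms + margin_ms < ttl_ms:
        usable_attempts += 1
        k += 1
    # At least one usable attempt has to succeed, so the rest may be lost.
    return max(usable_attempts - 1, 0)


print(missed_renewals_tolerated(10_000, 3_333))        # 2 (zero network margin)
print(missed_renewals_tolerated(10_000, 3_333, 100))   # 1 (100 ms margin)
print(missed_renewals_tolerated(10_000, 2_500, 100))   # 2 (refresh at ttl / 4)
```

In this model, refreshing at one third of a 10-second TTL only survives two lost renewals if the third attempt arrives with essentially no delay; any realistic margin reduces that to one.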

Consuming the map on the router

Our routers are written in Go because the watch loop wants real concurrency and the official go.etcd.io/etcd/client/v3 library is the reference implementation of the protocol. Each router loads the full prefix once at startup, then holds a watch from the revision it loaded at, so it never misses an update between the initial read and the start of the watch.

package main

import (
    "context"
    "encoding/json"
    "log"
    "sync"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

type Route struct {
    Pool     string `json:"pool"`
    Fallback string `json:"fallback"`
    Weight   int    `json:"weight"`
}

type RouteTable struct {
    mu     sync.RWMutex
    routes map[string]Route
    health map[string]bool
}

func (t *RouteTable) Target(region string) string {
    t.mu.RLock()
    defer t.mu.RUnlock()
    r, ok := t.routes[region]
    if !ok {
        return ""
    }
    if t.health[r.Pool] {
        return r.Pool
    }
    return r.Fallback // primary unhealthy, use fallback
}

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"10.0.0.10:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    table := &RouteTable{routes: map[string]Route{}, health: map[string]bool{}}

    // 1. Initial snapshot of the whole prefix.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    resp, err := cli.Get(ctx, "/dw/v1/", clientv3.WithPrefix())
    cancel()
    if err != nil {
        log.Fatal(err)
    }
    for _, kv := range resp.Kvs {
        table.apply(string(kv.Key), kv.Value, false)
    }
    startRev := resp.Header.Revision + 1

    // 2. Watch from the revision right after the snapshot. No gap, no miss.
    watchChan := cli.Watch(context.Background(), "/dw/v1/",
        clientv3.WithPrefix(), clientv3.WithRev(startRev))

    log.Printf("router ready, watching from rev %d", startRev)
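    for wr := range watchChan {
        if err := wr.Err(); err != nil {
            // A canceled watch (for example ErrCompacted) closes the channel,
            // so the loop ends after this response. This example does not
            // rewatch; a production router must resnapshot and watch again.
            log.Printf("watch error: %v", err)
            continue
        }
        for _, ev := range wr.Events {
            deleted := ev.Type == clientv3.EventTypeDelete
            table.apply(string(ev.Kv.Key), ev.Kv.Value, deleted)
        }
    }
    log.Print("watch channel closed, router is no longer receiving updates")
}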

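func (t *RouteTable) apply(key string, val []byte, deleted bool) {
    const (
        routesPrefix = "/dw/v1/routes/"
        poolsPrefix  = "/dw/v1/pools/"
        healthSuffix = "/health"
    )
    t.mu.Lock()
    defer t.mu.Unlock()
    switch {
    case len(key) > len(routesPrefix) && key[:len(routesPrefix)] == routesPrefix:
        region := key[len(routesPrefix):]
        if deleted {
            delete(t.routes, region)
            return
        }
        var r Route
        if err := json.Unmarshal(val, &r); err != nil {
            // Keep the previous route rather than silently dropping the update.
            log.Printf("ignoring malformed route %s: %v", key, err)
            return
        }
        t.routes[region] = r
    case len(key) > len(poolsPrefix) && key[:len(poolsPrefix)] == poolsPrefix:
        // Expect .../pools/<pool>/health. Skip any other key under pools/
        // so the slice below cannot go out of range.
        rest := key[len(poolsPrefix):]
        if len(rest) <= len(healthSuffix) || rest[len(rest)-len(healthSuffix):] != healthSuffix {
            return
        }
        pool := rest[:len(rest)-len(healthSuffix)]
        t.health[pool] = !deleted // delete (lease expiry) => unhealthy
    }
}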

The snapshot-then-watch-from-revision pattern is the single most important thing to get right. If you take the snapshot and then start a watch "from now," any write that lands in between is never delivered, and your router silently runs stale config until that key happens to change again. Reading resp.Header.Revision and watching from +1 closes that gap because the etcd revision is a cluster-wide, monotonically increasing counter that every write bumps.
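The gap is easier to see in a toy model. The sketch below is not etcd; RevisionedLog is a hypothetical in-memory stand-in that numbers every write with a revision, which is enough to show why the starting revision of the watch matters.

```python
class RevisionedLog:
    """Toy revisioned store: the write at list index i has revision i + 1."""

    def __init__(self):
        self.events = []

    @property
    def revision(self):
        return len(self.events)

    def put(self, key, value):
        self.events.append((key, value))
        return self.revision

    def snapshot(self):
        """Return the current state plus the revision it reflects."""
        state = {}
        for key, value in self.events:
            state[key] = value
        return state, self.revision

    def events_from(self, rev):
        """Return every write with revision >= rev, like a watch with a start revision."""
        return self.events[rev - 1:]


store = RevisionedLog()
store.put("/dw/v1/routes/ap-south", "edge-7")
state, snap_rev = store.snapshot()

# A write lands after the snapshot but before the watch is opened.
store.put("/dw/v1/routes/ap-south", "edge-2")

missed = store.events_from(store.revision + 1)  # watch "from now"
caught = store.events_from(snap_rev + 1)        # watch from snapshot revision + 1
```

Here missed is empty while caught contains the edge-2 write, which is exactly the update a router would otherwise never see.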

Note also that the in-memory table is guarded by an RWMutex. The hot request path calls Target() with a read lock, which lets thousands of concurrent lookups proceed in parallel; only the watch goroutine takes the write lock, and only when something actually changes. The routing decision itself never makes a network call to etcd — it reads local memory.

Bridging etcd into the PHP application layer

Most of our request handling is PHP 8.4 behind LiteSpeed, and PHP's per-request execution model is a poor fit for holding a long-lived etcd watch. We do not connect PHP to etcd directly. Instead the Go router writes a flat snapshot of the effective route table to a local file whenever it changes, and PHP loads that file with an opcache-friendly require. This keeps the hot path zero-network and lets opcache serve the compiled file from shared memory instead of re-parsing it on every request.

<?php
declare(strict_types=1);

final class RegionRouter
{
    private const SNAPSHOT = '/dev/shm/dw_routes.php';

    /** @var array<string, array{pool:string, fallback:string}> */
    private array $table;

    public function __construct()
    {
        // The Go watcher writes this file atomically (write temp, rename).
        // On /dev/shm it is a RAM read; opcache caches the parsed array.
        $this->table = is_file(self::SNAPSHOT)
            ? (require self::SNAPSHOT)
            : [];
    }

    public function edgeFor(string $region): string
    {
        $route = $this->table[$region] ?? $this->table['default'] ?? null;
        if ($route === null) {
            return 'edge-1'; // hard-coded safety net
        }
        return $route['pool'];
    }
}

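// Usage inside a request handler:
// CF-IPCountry is a two-letter country code (e.g. "SG"), not a region name,
// so this assumes the snapshot is keyed by lowercase country code or has a
// 'default' entry; otherwise map country to region before the lookup.
$country = $_SERVER['HTTP_CF_IPCOUNTRY'] ?? 'US'; // Cloudflare geo header
$router  = new RegionRouter();
$edge    = $router->edgeFor(strtolower($country));
header('X-DW-Edge: ' . $edge);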

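The Go side writes that file with a temp-write-then-rename so PHP never reads a half-written snapshot; rename is atomic on the same filesystem. Putting the file on /dev/shm means the read is from RAM, and because the content is a return [...] PHP array, opcache holds the compiled form between requests. One caveat: opcache only notices the replaced file when opcache.validate_timestamps is enabled, and then only after opcache.revalidate_freq seconds (2 by default), so that setting bounds how quickly PHP picks up a change. We feed the lookup from Cloudflare's CF-IPCountry header, which it injects at the edge, so we get geo classification for free without a GeoIP database in the app. The header carries a two-letter country code rather than a region name such as ap-south, so the key has to be translated to a region, or the snapshot keyed by country, somewhere along the way. The result is that a routing decision in PHP is an array lookup against an opcache-resident structure, while the actual source of truth still lives in etcd and reaches the Go router within a second of any change.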
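The Go writer is not shown in this post, but the temp-write-then-rename technique itself is language-neutral. Here is a minimal Python sketch of it, assuming the snapshot is a PHP file that returns an array of string fields; the function names, the temp-file naming, and the 0644 mode are illustrative choices, not a description of our Go code.

```python
import os
import tempfile


def php_quote(text):
    """Render a PHP single-quoted string literal (only \\ and ' need escaping)."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def render_php_snapshot(table):
    """Render {region: {field: value}} as a PHP file that returns an array."""
    lines = ["<?php", "return ["]
    for region, route in sorted(table.items()):
        fields = ", ".join(
            f"{php_quote(name)} => {php_quote(str(value))}"
            for name, value in sorted(route.items())
        )
        lines.append(f"    {php_quote(region)} => [{fields}],")
    lines.append("];")
    return "\n".join(lines) + "\n"


def write_snapshot_atomically(path, table):
    """Write the snapshot to a temp file in the same directory, then rename it."""
    directory = os.path.dirname(path) or "."
    # Same directory means same filesystem, which is what makes the rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".dw_routes.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(render_php_snapshot(table))
            handle.flush()
            os.fsync(handle.fileno())
        # mkstemp creates the file as 0600; widen it so the PHP workers can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens the path sees either the old file or the new one, never a partial write, because the only operation that touches the final name is the rename.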

The traps that bit us

A few hard-won lessons, in case you build something similar:

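Watch channels close. etcd cancels a watch when the revision it asks for has already been compacted, and a watch opened with WithRequireLeader is canceled when its member loses contact with the leader. The Go example only logs the error, and its loop ends once the channel closes. A production router must recreate the watch from the revision after the last one it processed, and handle ErrCompacted by taking a fresh snapshot. Treat the watch as something that will break, not something that might.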
Compaction eats history. etcd compacts old revisions to reclaim space. If your router was disconnected longer than your compaction window, watching from the old revision fails. Always have the resnapshot fallback.
Leases need a connection. A lease is renewed over a client connection. If the health-checker cannot reach etcd for longer than the TTL, the lease lapses even if the pool is fine. We mitigate by running etcd as a 3- or 5-node cluster and pointing clients at all members, so a single member failure does not orphan leases.
Don't put the hot path in etcd. etcd is a control plane. Tens of thousands of route lookups per second belong in process memory, refreshed by a watch. We learned to keep etcd for the thing that changes rarely and is read by watch, never by per-request Get.
Version your key prefix. /dw/v1/ cost nothing to add up front and saved us during a schema change, because v1 and v2 readers coexisted while we migrated writers.
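The first two traps boil down to one control loop: rewatch from the last processed revision, and fall back to a fresh snapshot when that revision has been compacted. The sketch below shows the shape of that loop in Python with the etcd calls injected as plain callables. Everything named here (Compacted, follow, take_snapshot, watch_from, apply_event, max_sessions) is hypothetical and exists only for illustration; a real router would loop forever with backoff and use the client library's own compaction error.

```python
class Compacted(Exception):
    """Hypothetical stand-in for 'the requested revision has been compacted'."""


def follow(take_snapshot, watch_from, apply_event, max_sessions):
    """Keep a local table current across watch failures.

    take_snapshot() reloads the full state and returns its revision.
    watch_from(rev) yields (revision, event) pairs starting at rev; it may
    simply end (channel closed) or raise Compacted.
    max_sessions bounds the loop so the sketch terminates.
    """
    last_rev = None  # None means "no usable position, take a fresh snapshot"
    for _ in range(max_sessions):
        if last_rev is None:
            last_rev = take_snapshot()
        try:
            for rev, event in watch_from(last_rev + 1):
                apply_event(event)
                last_rev = rev
        except Compacted:
            # History we need is gone: discard our position and resnapshot.
            last_rev = None
        # Otherwise the stream just ended: loop and rewatch from last_rev + 1.
    return last_rev
```

The key property is that last_rev only advances after an event has been applied, so a rewatch never skips a revision, and a compaction error always leads to a full reload rather than a silent gap.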

Conclusion

Moving the video region map from a deployed config file into etcd changed how we handle CDN incidents. What used to be a code change and a fleet-wide deploy is now a single atomic transaction that propagates to every router in under a second, with health markers that self-expire when a pool dies and fail traffic over without a human in the loop. The architecture that made it work is boring on purpose: etcd holds only the small, slowly-changing control data; watches push changes instead of polls pulling them; leases encode liveness; and the hot request path — whether Go or PHP behind LiteSpeed — always reads from local memory, never from etcd directly. If you run any kind of geo-aware routing and you are still shipping that map in your deploy artifact, the watch-plus-lease pattern is worth the afternoon it takes to wire up.