DEV Community

ahmet gedik

Coordinating Video Region Routers With etcd Distributed Configuration

The 2 AM failover we couldn't fix without a deploy

Last March our Singapore origin started serving stale trending lists to the Japanese and Korean edges. The fix was trivial — reroute jp and kr traffic to our Tokyo pull node and drop the weight on Singapore to zero. The problem was how. Our region routing table lived in a PHP config array baked into the deploy artifact. Changing one weight meant a git commit, a CI run, an FTP push to four LiteSpeed nodes, and a Cloudflare cache purge. Twenty minutes of latency on a routing decision that should take two seconds.

At TopVideoHub we aggregate trending video across thirteen Asia-Pacific markets, each with its own language edge — Japanese, Korean, Traditional and Simplified Chinese, Thai, Vietnamese, and more. A "region router" decides which origin node pulls trending data for a given market, how heavily to weight each candidate origin, and which fallback to use when an origin degrades. That decision changes constantly: an origin gets rate-limited, a CJK tokenizer index rebuild slows a node, a holiday traffic spike shifts the optimal topology. Configuration that can only change at deploy time is configuration that's always slightly wrong.

This is the story of moving that routing table into etcd, and the concrete patterns — watches, leases, transactions, and a polyglot read path — that made it safe to change live.

Why etcd and not a database flag table

The obvious alternative was a routing_config table in SQLite or a shared MySQL instance. We already run SQLite with an FTS5 CJK tokenizer for search, so adding a config table costs nothing. But config-in-a-database has two failure modes that bit us before:

  • Polling latency and load. Every router node has to poll the table on an interval. Poll too often and you hammer the DB; poll too rarely and a routing change takes a minute to propagate. There's no clean push.
  • No coordination primitives. When two operators (or two automated controllers) write at once, last-write-wins silently clobbers one change. There's no atomic compare-and-swap across the cluster, no lease that expires when a node dies.

etcd is built for exactly this. It gives us:

  • Watches — a long-lived stream that pushes the new value the instant a key changes, so propagation is sub-second instead of poll-interval-bound.
  • Leases — a TTL attached to a key, so an origin node that goes silent automatically drops out of the routing table when its lease expires.
  • Transactions — atomic compare-and-swap on a key revision, so concurrent writers can't clobber each other.
  • A linearizable, Raft-backed store — every node reads a consistent view; there is no split-brain on the config itself.

We were not trying to replace our data plane. etcd holds control-plane data: a few kilobytes of routing weights and origin health that change often and must be correct everywhere at once. The video payloads stay in SQLite and on the origins.

Modeling the region routing config

We namespace everything under /topvideohub/routing/. The layout is deliberately flat so a single prefix watch covers the whole config:

  • /topvideohub/routing/edge/<market> — the routing policy for one market edge, a JSON document: ordered candidate origins with weights and a fallback.
  • /topvideohub/routing/origin/<id>/health — a lease-bound key each origin node writes to prove it is alive.
  • /topvideohub/routing/version — a monotonic integer bumped on every coherent config change, used as a sanity check by readers.
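One way to keep this layout honest is to build and parse keys in a single place, so that no caller concatenates strings by hand. The sketch below is illustrative only: the helper names (edge_key, health_key, classify) are made up for this example and are not part of our tooling.

```python
from typing import Optional, Tuple

ROOT = "/topvideohub/routing/"
EDGE_PREFIX = ROOT + "edge/"
ORIGIN_PREFIX = ROOT + "origin/"
VERSION_KEY = ROOT + "version"
HEALTH_SUFFIX = "/health"


def edge_key(market: str) -> str:
    """Key holding the routing policy for one market edge."""
    return EDGE_PREFIX + market


def health_key(origin_id: str) -> str:
    """Lease-bound liveness key written by one origin node."""
    return ORIGIN_PREFIX + origin_id + HEALTH_SUFFIX


def classify(key: str) -> Tuple[str, Optional[str]]:
    """Map a key seen on the prefix watch to (kind, identifier)."""
    if key == VERSION_KEY:
        return ("version", None)
    if key.startswith(EDGE_PREFIX):
        return ("edge", key[len(EDGE_PREFIX):])
    if key.startswith(ORIGIN_PREFIX) and key.endswith(HEALTH_SUFFIX):
        return ("health", key[len(ORIGIN_PREFIX):-len(HEALTH_SUFFIX)])
    return ("unknown", None)
```

A watcher on the whole /topvideohub/routing/ prefix can then dispatch on the first element of classify(key) and ignore anything it reports as unknown.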

An edge document looks like this:

{
  "market": "jp",
  "language": "ja",
  "candidates": [
    {"origin": "tokyo-1", "weight": 70},
    {"origin": "singapore-1", "weight": 30}
  ],
  "fallback": "singapore-1",
  "min_healthy_weight": 50
}

The min_healthy_weight field is a guard: if the live, healthy candidates don't sum to at least that weight, the router refuses to trust the document and uses the fallback. That single field has saved us from routing all of Japan to a node that had silently lost its lease.
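To make the guard concrete, here is a minimal Python sketch of the check applied to the jp document above. The function names are hypothetical, and the healthy set is passed in by hand rather than derived from etcd.

```python
EDGE_JP = {
    "market": "jp",
    "language": "ja",
    "candidates": [
        {"origin": "tokyo-1", "weight": 70},
        {"origin": "singapore-1", "weight": 30},
    ],
    "fallback": "singapore-1",
    "min_healthy_weight": 50,
}


def healthy_weight(edge: dict, healthy: set) -> int:
    """Sum the weights of candidates whose origin is currently healthy."""
    return sum(c["weight"] for c in edge["candidates"] if c["origin"] in healthy)


def guard_trips(edge: dict, healthy: set) -> bool:
    """True when live capacity is below the document's threshold."""
    return healthy_weight(edge, healthy) < edge["min_healthy_weight"]
```

With only singapore-1 healthy, the live weight is 30, which is below the threshold of 50, so the guard trips and the router uses the fallback. With only tokyo-1 healthy, the live weight is 70 and the document is still trusted.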

Seeding and validating config with Python

We never write raw JSON into etcd by hand. Every change goes through a small Python tool that validates the document against a schema before the write and uses an atomic transaction so a concurrent edit can't be lost. We use the etcd3 client.

import json
import sys
import etcd3

SCHEMA_KEYS = {"market", "language", "candidates", "fallback", "min_healthy_weight"}
VALID_ORIGINS = {"tokyo-1", "tokyo-2", "singapore-1", "seoul-1", "taipei-1"}


def validate(doc: dict) -> None:
    missing = SCHEMA_KEYS - doc.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not doc["candidates"]:
        raise ValueError("candidates must be non-empty")
    total = 0
    for c in doc["candidates"]:
        if c["origin"] not in VALID_ORIGINS:
            raise ValueError(f"unknown origin {c['origin']}")
        if not 0 <= c["weight"] <= 100:
            raise ValueError(f"weight out of range for {c['origin']}")
        total += c["weight"]
    if total != 100:
        raise ValueError(f"weights must sum to 100, got {total}")
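    if doc["fallback"] not in VALID_ORIGINS:
        raise ValueError("fallback is not a valid origin")
    # Weights sum to 100, so a threshold outside 0..100 would either never
    # trip or always trip the guard.
    if not 0 <= doc["min_healthy_weight"] <= 100:
        raise ValueError("min_healthy_weight must be between 0 and 100")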


def put_edge(client: etcd3.Etcd3Client, market: str, doc: dict) -> None:
    validate(doc)
    key = f"/topvideohub/routing/edge/{market}"
    payload = json.dumps(doc, ensure_ascii=False, separators=(",", ":"))

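    # Atomic compare-and-swap against what we just read, so a concurrent
    # operator edit between our read and write is not silently lost.
    # etcd fails a value comparison on a key that does not exist, so the
    # first write for a market compares version == 0 ("key absent") instead.
    current, _ = client.get(key)
    if current is None:
        guard = client.transactions.version(key) == 0
    else:
        guard = client.transactions.value(key) == current
    success, _ = client.transaction(
        compare=[guard],
        success=[client.transactions.put(key, payload)],
        failure=[],
    )
    if not success:
        raise RuntimeError("key changed under us; re-read and retry")

    # Note: this read-then-put bump is not atomic. Two concurrent writers
    # to different markets can both write the same version number.
    client.put("/topvideohub/routing/version", str(int(meta_rev(client)) + 1))
    print(f"wrote {key}")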


def meta_rev(client: etcd3.Etcd3Client) -> str:
    v, _ = client.get("/topvideohub/routing/version")
    return (v or b"0").decode()


if __name__ == "__main__":
    # Close the input file promptly instead of relying on garbage collection.
    with open(sys.argv[1], encoding="utf-8") as f:
        doc = json.load(f)
    cli = etcd3.client(host="10.0.0.10", port=2379)
    put_edge(cli, doc["market"], doc)

Two things matter here. First, validation runs before the write, so a typo'd origin or weights that don't sum to 100 never reach the cluster. Second, the transaction compares against what we just read, which makes it an atomic compare-and-swap: if another operator wrote between our read and our write, the transaction fails and the tool raises, so the operator re-reads and retries rather than clobbering. ensure_ascii=False keeps any CJK text in a document readable in etcdctl get output instead of turning it into \uXXXX escapes, which matters when you're debugging at 2 AM.
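The re-read-and-retry loop is easier to reason about without a cluster in the way. The sketch below uses FakeKV, a hypothetical in-memory stand-in with per-key revisions; it is not the etcd3 client API, and put_if_unchanged is an invented name for "write only if the revision I read is still current".

```python
import threading


class FakeKV:
    """In-memory stand-in for a store with per-key modification revisions.

    Hypothetical test double for illustration, not the etcd3 client.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}     # key -> (value, mod_revision)
        self._revision = 0  # store-wide revision counter

    def get(self, key):
        """Return (value, mod_revision); a missing key is (None, 0)."""
        with self._lock:
            return self._data.get(key, (None, 0))

    def put_if_unchanged(self, key, value, expected_rev):
        """Write only if the key's mod_revision still equals expected_rev."""
        with self._lock:
            _, current_rev = self._data.get(key, (None, 0))
            if current_rev != expected_rev:
                return False
            self._revision += 1
            self._data[key] = (value, self._revision)
            return True


def update_with_retry(kv, key, mutate, attempts=5):
    """Read, apply mutate(old_value), then CAS; re-read and retry on conflict."""
    for _ in range(attempts):
        old, rev = kv.get(key)
        if kv.put_if_unchanged(key, mutate(old), rev):
            return True
    return False
```

The important property is that mutate runs again on the freshly read value after every conflict, so a retry never re-applies a decision made against stale state.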

A Go region router that watches for changes

The router itself is a small Go service that runs next to each LiteSpeed node. It loads the full routing prefix once on startup, then opens a watch and applies deltas in-memory. Reads of the routing table are served from an atomic snapshot, so request-path lookups never touch etcd directly and never block on the network.

package router

import (
    "context"
    "encoding/json"
    "log"
    "strings"
    "sync/atomic"

    clientv3 "go.etcd.io/etcd/client/v3"
)

const prefix = "/topvideohub/routing/edge/"

type Candidate struct {
    Origin string `json:"origin"`
    Weight int    `json:"weight"`
}

type Edge struct {
    Market           string      `json:"market"`
    Candidates       []Candidate `json:"candidates"`
    Fallback         string      `json:"fallback"`
    MinHealthyWeight int         `json:"min_healthy_weight"`
}

type Router struct {
    table atomic.Pointer[map[string]Edge]
}

func New(ctx context.Context, cli *clientv3.Client) (*Router, error) {
    r := &Router{}
    m := map[string]Edge{}

    // 1. Load the current state and capture the revision to watch from.
    resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
    if err != nil {
        return nil, err
    }
    for _, kv := range resp.Kvs {
        r.applyInto(m, string(kv.Key), kv.Value)
    }
    r.table.Store(&m)

    // 2. Watch from resp.Header.Revision+1 so we miss nothing in the gap.
    go r.watch(ctx, cli, resp.Header.Revision+1)
    return r, nil
}

func (r *Router) watch(ctx context.Context, cli *clientv3.Client, rev int64) {
    wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
    for wr := range wch {
        if wr.Err() != nil {
            log.Printf("watch error, will reconnect: %v", wr.Err())
            return
        }
        // Copy-on-write: clone the current table, mutate, swap atomically.
        old := *r.table.Load()
        next := make(map[string]Edge, len(old))
        for k, v := range old {
            next[k] = v
        }
        for _, ev := range wr.Events {
            key := string(ev.Kv.Key)
            if ev.Type == clientv3.EventTypeDelete {
                delete(next, strings.TrimPrefix(key, prefix))
                continue
            }
            r.applyInto(next, key, ev.Kv.Value)
        }
        r.table.Store(&next)
    }
}

func (r *Router) applyInto(m map[string]Edge, key string, val []byte) {
    var e Edge
    if err := json.Unmarshal(val, &e); err != nil {
        log.Printf("skip bad edge doc %s: %v", key, err)
        return
    }
    m[strings.TrimPrefix(key, prefix)] = e
}

// Resolve picks an origin for a market given the set of healthy origins.
func (r *Router) Resolve(market string, healthy map[string]bool) string {
    m := *r.table.Load()
    e, ok := m[market]
    if !ok {
        return ""
    }
    healthyWeight := 0
    for _, c := range e.Candidates {
        if healthy[c.Origin] {
            healthyWeight += c.Weight
        }
    }
    if healthyWeight < e.MinHealthyWeight {
        return e.Fallback
    }
    // Highest-weight healthy candidate wins; deterministic and cheap.
    best, bestW := e.Fallback, -1
    for _, c := range e.Candidates {
        if healthy[c.Origin] && c.Weight > bestW {
            best, bestW = c.Origin, c.Weight
        }
    }
    return best
}

The important details:

  • Load-then-watch from the captured revision. The initial Get reports the store revision it was served at in resp.Header.Revision, and the watch starts at that revision plus one. This closes the race window where a change between the load and the watch would otherwise be missed.
  • Copy-on-write with atomic.Pointer. The request path calls Resolve, which reads an immutable snapshot via a single atomic load — no mutex contention on hot reads. The watch goroutine builds a new map and swaps the pointer.
  • Bad documents are skipped, not fatal. A malformed JSON value logs and is ignored rather than crashing the router. Combined with the Python-side validation, a bad write essentially can't happen, but defense in depth is cheap.
  • The min_healthy_weight guard falls back the moment live capacity drops below threshold.
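The copy-on-write step is language-independent, so here is the same idea as a small Python sketch. The (kind, key, value) event tuples are an assumption made for this example; they are not the etcd wire format or the shape of clientv3 events.

```python
EDGE_PREFIX = "/topvideohub/routing/edge/"


def apply_events(table: dict, events: list) -> dict:
    """Return a new table with events applied; the input is never mutated.

    Each event is a (kind, key, value) tuple where kind is "put" or "delete".
    """
    # Shallow copy: readers still holding the old dict see no change.
    nxt = dict(table)
    for kind, key, value in events:
        market = key[len(EDGE_PREFIX):] if key.startswith(EDGE_PREFIX) else key
        if kind == "delete":
            nxt.pop(market, None)
        else:
            nxt[market] = value
    return nxt
```

Because the old table is left untouched, a request that loaded it just before the swap finishes against a consistent snapshot, and the next request picks up the new one.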

Origin health with leases

Health is the other half. Each origin node holds an etcd lease and writes its health key under that lease. If the node dies, the lease expires and the key vanishes — no separate liveness check, no stale "healthy" flag. The router subscribes to the health prefix the same way it watches edges and feeds the live set into the healthy map passed to Resolve.

func keepAlive(ctx context.Context, cli *clientv3.Client, originID string) error {
    // 10s TTL: if the node stops renewing, it drops out within ~10s.
    lease, err := cli.Grant(ctx, 10)
    if err != nil {
        return err
    }
    key := "/topvideohub/routing/origin/" + originID + "/health"
    if _, err := cli.Put(ctx, key, "ok", clientv3.WithLease(lease.ID)); err != nil {
        return err
    }
    ch, err := cli.KeepAlive(ctx, lease.ID)
    if err != nil {
        return err
    }
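    for range ch { // drain keepalive acks
    }
    // The channel closes when ctx is cancelled or when the client can no
    // longer renew the lease; only the first case is a clean shutdown.
    if err := ctx.Err(); err != nil {
        return err
    }
    // Requires "errors" in the import block.
    return errors.New("lease keepalive channel closed; lease may have expired")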
}

With a 10-second TTL, an origin that crashes or is partitioned drops out of every router's healthy set roughly ten seconds after its last successful renewal, cluster-wide, with zero manual intervention. That is the property a database flag table simply cannot give you cheaply.
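The timing is easier to see in a toy model. The Python sketch below tracks expiry with an explicit clock passed in by the caller; LeaseTable is a hypothetical illustration only, since etcd tracks leases server-side and none of these names come from its API.

```python
class LeaseTable:
    """Toy model of TTL leases driven by a caller-supplied clock."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._expires_at = {}  # origin id -> absolute expiry time

    def keep_alive(self, origin_id: str, now: float) -> None:
        """Grant or renew the lease, pushing expiry to now + ttl."""
        self._expires_at[origin_id] = now + self.ttl

    def healthy(self, now: float) -> set:
        """Origins whose lease has not yet expired at time now."""
        return {o for o, t in self._expires_at.items() if t > now}
```

If tokyo-1 renews at t=0 and then goes silent while singapore-1 renews again at t=5, tokyo-1 leaves the healthy set at t=10 and singapore-1 stays until t=15. The set returned by healthy() plays the role of the healthy map passed to Resolve.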

Reading the same config from PHP 8.4

Not everything is Go. Our admin panel and a handful of cron jobs run on PHP 8.4 under LiteSpeed, and they sometimes need to read the effective routing config, for example to show operators which origin a market is currently pinned to. We do not run a watch from PHP; that's the router's job. PHP reads on demand through etcd's gRPC-gateway HTTP/JSON API, which is served on the client port (2379 by default). The /v3/ path prefix used below applies to etcd 3.4 and later; older releases exposed the gateway under /v3alpha/ or /v3beta/. Keys and values are base64-encoded on the wire.

<?php
declare(strict_types=1);

final class EtcdReader
{
    public function __construct(
        private readonly string $endpoint = 'http://10.0.0.10:2379',
    ) {}

    /** Range-read a prefix and return decoded key => value pairs. */
    public function getPrefix(string $prefix): array
    {
        // rangeEnd is the prefix with its last byte incremented — etcd's
        // convention for "all keys under this prefix".
        $rangeEnd = substr($prefix, 0, -1)
            . chr(ord(substr($prefix, -1)) + 1);

        $body = json_encode([
            'key'       => base64_encode($prefix),
            'range_end' => base64_encode($rangeEnd),
        ], JSON_THROW_ON_ERROR);

        $ch = curl_init($this->endpoint . '/v3/kv/range');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $body,
            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 2,
        ]);
        $raw = curl_exec($ch);
        if ($raw === false) {
            throw new RuntimeException('etcd unreachable: ' . curl_error($ch));
        }
        curl_close($ch);

        $resp = json_decode($raw, true, 512, JSON_THROW_ON_ERROR);
        $out  = [];
        foreach ($resp['kvs'] ?? [] as $kv) {
            $key = base64_decode($kv['key']);
            $out[$key] = base64_decode($kv['value']);
        }
        return $out;
    }

    /** Effective origin for a market, mirroring the Go resolver. */
    public function resolveMarket(string $market, array $healthy): ?string
    {
        $pairs = $this->getPrefix('/topvideohub/routing/edge/');
        $key   = "/topvideohub/routing/edge/{$market}";
        if (!isset($pairs[$key])) {
            return null;
        }
        $edge = json_decode($pairs[$key], true, 512, JSON_THROW_ON_ERROR);

        $healthyWeight = 0;
        foreach ($edge['candidates'] as $c) {
            if ($healthy[$c['origin']] ?? false) {
                $healthyWeight += $c['weight'];
            }
        }
        if ($healthyWeight < $edge['min_healthy_weight']) {
            return $edge['fallback'];
        }

        $best = $edge['fallback'];
        $bestW = -1;
        foreach ($edge['candidates'] as $c) {
            if (($healthy[$c['origin']] ?? false) && $c['weight'] > $bestW) {
                $best = $c['origin'];
                $bestW = $c['weight'];
            }
        }
        return $best;
    }
}

A few PHP-specific notes from production:

  • Keep the curl timeout short (2s) and cache the result. A LiteSpeed worker should never block a page render on etcd. We cache the decoded prefix in APCu for a few seconds; the admin panel doesn't need sub-second freshness, and the router — which does — uses watches instead.
  • The resolveMarket logic is intentionally a line-for-line mirror of the Go Resolve. When two languages implement the same decision, drift is the enemy. We have a contract test that feeds both implementations the same fixtures and asserts identical output.
  • Always base64_decode both key and value. This trips up everyone the first time they hit the etcd HTTP gateway.
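For anyone hitting the gateway from a third language, the two fiddly parts are the range_end computation and the base64 round trip. Here is a hedged Python sketch of both; the function names are invented for this example, and it handles the trailing-0xff edge case that the PHP one-liner above does not need for our ASCII prefixes.

```python
import base64
import json


def prefix_range_end(prefix: bytes) -> bytes:
    """Smallest key greater than every key that starts with prefix.

    Trailing 0xff bytes cannot be incremented, so they are dropped first.
    An empty or all-0xff prefix maps to a single zero byte, which etcd
    treats as "through the end of the keyspace".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return bytes([0])
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def range_request_body(prefix: str) -> str:
    """JSON body for POST /v3/kv/range covering every key under prefix."""
    raw = prefix.encode("utf-8")
    return json.dumps({
        "key": base64.b64encode(raw).decode("ascii"),
        "range_end": base64.b64encode(prefix_range_end(raw)).decode("ascii"),
    })


def decode_kvs(response: dict) -> dict:
    """Decode the base64 keys and values of a parsed range response."""
    return {
        base64.b64decode(kv["key"]).decode("utf-8"):
            base64.b64decode(kv.get("value", "")).decode("utf-8")
        for kv in response.get("kvs", [])
    }
```

For the edge prefix, the last byte "/" becomes "0", so the range covers /topvideohub/routing/edge/ up to but not including /topvideohub/routing/edge0.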

Guarding against split-brain and bad writes

etcd's Raft consensus means the store never splits its brain — a minority partition can't accept writes. But your application still can, so we layered on a few rules:

  • All writes go through the validating Python tool or an equivalent server endpoint. No etcdctl put of raw JSON in production. Validation before write is non-negotiable.
  • Compare-and-swap on every edge write, so concurrent operator edits fail loudly instead of clobbering.
  • Readers tolerate missing or malformed keys and fall back rather than crash. The router skips bad docs; PHP returns null and the caller shows "unknown."
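  • Run an odd-sized etcd cluster (we run three nodes across two availability zones). Three members tolerate one failure; a fourth would raise the quorum to three without tolerating any more failures, and a two-node cluster has no fault tolerance at all: lose one and you lose quorum. With only two zones, one zone necessarily holds two of the three members, so losing that zone also loses quorum.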
  • Watch reconnection is mandatory. Our watch goroutine returns on error; a supervising loop re-creates the client, reloads the full prefix, and starts a new watch from that reload's revision. A router that silently stops receiving updates is worse than one that crashes.
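The supervising loop itself is small. The sketch below shows one possible shape in Python, with capped exponential backoff; supervise, run_once, and should_stop are hypothetical names, and the backoff parameters are placeholders rather than values we run in production.

```python
import time


def supervise(run_once, should_stop, sleep=time.sleep, base=0.5, cap=30.0):
    """Re-run run_once whenever it returns or raises, with capped backoff.

    run_once is expected to do a full reload and then block on the watch.
    A run that returns True counts as progress and resets the delay.
    """
    delay = base
    while not should_stop():
        try:
            progressed = bool(run_once())
        except Exception:
            # Any failure is treated as "no progress"; real code should log it.
            progressed = False
        if should_stop():
            break
        if progressed:
            delay = base
        sleep(delay)
        delay = min(delay * 2, cap)
```

Injecting sleep and should_stop keeps the loop testable without real timers, and the cap stops a long outage from pushing the retry interval out indefinitely.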

Caching is a separate concern. The routing decision happens behind the Cloudflare and LiteSpeed caches, at origin-pull time, not per-request. Changing a routing weight does not require a cache purge, because cached pages are language-edge content, not routing state. Decoupling those two was part of why this migration was worth doing.

What we learned

Moving region routing from a baked-in PHP array to etcd changed our operational tempo more than any single performance optimization we've shipped. A routing change that used to be a twenty-minute deploy is now a two-second validated write that propagates to every node before you've switched terminal tabs.

The principles that generalized beyond our specific stack:

  • Put control-plane data in etcd, keep the data plane where it is. etcd is for kilobytes of fast-changing, must-be-consistent config — not for video payloads or search indexes. SQLite with the CJK tokenizer still owns the data plane.
  • Watches beat polling for anything that must propagate fast across many nodes, and leases give you liveness for free.
  • Validate before you write, and tolerate bad reads. Defense on both ends makes live config changes boring, which is exactly what you want at 2 AM.
  • When two languages read the same config, test that they agree. A contract test against shared fixtures kept our Go and PHP resolvers from drifting.

If you're running a multi-region service and your routing topology still lives in a deploy artifact, the next incident will remind you why it shouldn't. etcd is a small, sharp tool for exactly this problem.
