ahmet gedik

Posted on Jun 10

Building a Chaos Testing Harness for Multi-Region Video API Endpoints

#testing #php #reliability #go

A truncated 200 took down discovery in three regions

At TrendVidStream we aggregate streaming-platform metadata across eight regions, and most of our reliability work used to assume failures announce themselves. A provider goes down, you get a 500 or a connection refused, your retry logic kicks in, life goes on. Then one night an upstream returned HTTP 200 OK with a body truncated at exactly 8 KB. Our ingest cron parsed the partial JSON without complaint, wrote half-populated rows into SQLite, rebuilt the FTS5 index, and the discovery endpoint started serving near-empty result sets to US, GB, and DE viewers for forty minutes before a user emailed us.

Nothing crashed. No alert fired. Every health check was green. The failure was a lie, and our test suite had no concept of a lying dependency. That incident is why we built a chaos testing harness specifically for the read paths that feed our video discovery API — not a generic Chaos Monkey clone, but a small, scriptable rig that injects the dishonest failures that actually happen in production: slow trickles, truncated bodies, wrong content lengths, garbage that still parses, and clock skew between regions.

This post walks through the harness we run today: a fault-injecting proxy in Python, a defensive PHP 8.4 client that survives it, and a Go driver that measures blast radius. It is deliberately low-tech — it has to deploy over the same FTP automation as everything else we ship.

What chaos testing means for a read-heavy video API

Most chaos engineering literature is written for microservice fleets with service meshes and sidecars. We have none of that. Our stack is PHP 8.4 behind LiteSpeed, SQLite with FTS5 for search, region-aware cron jobs that pull upstream metadata, and FTP-based deploys. Injecting a sidecar is not on the table.

So the goal is narrower and more useful: prove that a single bad upstream response cannot corrupt the cache or degrade more than one region. The properties we care about are:

Containment — a fault in the DE ingest path must not poison rows served to JP.
Honesty — if we cannot get a complete, valid response, we serve stale-but-correct data, never partial-fresh data.
Bounded latency — a slow upstream must trip a deadline well under the LiteSpeed 180s timeout, not ride it to the edge.
Idempotent recovery — after the fault clears, the next cron run fully heals the affected region with no manual step.

Everything in the harness exists to falsify one of those four claims.

The fault taxonomy we inject

Real upstreams fail in more interesting ways than timeout and 500. After a year of incident reviews, our taxonomy settled into seven fault types, each with a probability and a target region:

slowloris — send the body one byte every few hundred ms, never closing.
truncate — send a valid 200 with Content-Length for the full body but only flush the first N bytes.
lie_length — send a body shorter or longer than the advertised Content-Length.
valid_garbage — return well-formed JSON with the right shape but nonsense values (empty arrays, null IDs).
flap — alternate 200 and 503 on each request to defeat naive retry-once logic.
clock_skew — stamp Date and Last-Modified headers hours into the future.
dribble_then_reset — send headers, dribble half the body, then RST the connection.

The key design decision: faults are targeted by region, because containment is the property we most need to verify. We inject only into the DE path and assert that JP, US, and the other five stay clean.

A reverse proxy that lies

The heart of the harness is an aiohttp proxy that sits between our ingest cron and the real upstream. In normal mode it forwards transparently; with a fault profile loaded it corrupts responses according to the taxonomy. It runs locally on a staging box, never near production.

# chaos_proxy.py  —  run: python chaos_proxy.py --listen 8099 --upstream https://api.example.com
import argparse, asyncio, json, random
from aiohttp import web, ClientSession, ClientTimeout

FAULTS = {}  # loaded from --profile, keyed by region

def pick_fault(region: str):
    spec = FAULTS.get(region)
    if not spec:
        return None
    if random.random() > spec["probability"]:
        return None
    return random.choice(spec["types"])

async def handle(request: web.Request) -> web.StreamResponse:
    region = request.headers.get("X-Region", "US")
    fault = pick_fault(region)
    upstream = request.app["upstream"] + request.rel_url.path_qs

    timeout = ClientTimeout(total=20)
    async with ClientSession(timeout=timeout) as session:
        async with session.get(upstream, headers={"Accept": "application/json"}) as up:
            body = await up.read()

    if fault == "valid_garbage":
        body = json.dumps({"items": [], "region": region, "injected": True}).encode()
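    elif fault == "clock_skew":
        resp = web.Response(body=body, status=200, content_type="application/json")
        # stamp both headers far into the future, as the taxonomy describes
        resp.headers["Date"] = "Wed, 01 Jan 2031 00:00:00 GMT"
        resp.headers["Last-Modified"] = "Wed, 01 Jan 2031 00:00:00 GMT"
        return resp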
    elif fault == "flap":
        request.app["flap"] = not request.app.get("flap", False)
        if request.app["flap"]:
            return web.Response(status=503, text="flap")

    # streaming faults need a manually driven response
    if fault in ("slowloris", "truncate", "lie_length", "dribble_then_reset"):
        resp = web.StreamResponse(status=200)
        if fault in ("truncate", "lie_length"):
            resp.headers["Content-Length"] = str(len(body))  # promise the full size
        await resp.prepare(request)
        if fault == "slowloris":
            for b in body:
                await resp.write(bytes([b]))
                await asyncio.sleep(0.3)
        elif fault == "truncate":
            await resp.write(body[: len(body) // 2])  # send half, then stop
        elif fault == "lie_length":
            await resp.write(body[:-32])  # 32 bytes short of the promise
        elif fault == "dribble_then_reset":
            await resp.write(body[: len(body) // 2])
            request.transport.abort()  # RST mid-stream
            return resp  # the socket is gone; calling write_eof() now would raise
        await resp.write_eof()
        return resp

    return web.Response(body=body, status=200, content_type="application/json")

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--listen", type=int, default=8099)
    ap.add_argument("--upstream", required=True)
    ap.add_argument("--profile", help="JSON file of region->fault spec")
    args = ap.parse_args()
    if args.profile:
        with open(args.profile) as fh:  # close the file once the profile is loaded
            FAULTS.update(json.load(fh))
    app = web.Application()
    app["upstream"] = args.upstream.rstrip("/")
    app.router.add_route("*", "/{tail:.*}", handle)
    web.run_app(app, port=args.listen)

if __name__ == "__main__":
    main()

A profile looks like this — we only ever target one region at a time so we can assert containment on the rest:

{
  "DE": { "probability": 0.7, "types": ["truncate", "lie_length", "slowloris", "valid_garbage"] }
}
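
A typo in a profile fails quietly: in the proxy above, an unknown fault name matches no branch and the response is forwarded clean, while an empty types list makes random.choice raise. The sketch below is a hypothetical pre-flight check (validate_profile and KNOWN_FAULTS are names invented here, not part of chaos_proxy.py) that could run before the proxy starts. It assumes the profile shape shown above and the one-region-at-a-time rule.

```python
# Hypothetical pre-flight validator for a chaos profile (not part of chaos_proxy.py).
KNOWN_FAULTS = {
    "slowloris", "truncate", "lie_length", "valid_garbage",
    "flap", "clock_skew", "dribble_then_reset",
}

def validate_profile(profile: dict) -> list:
    """Return human-readable problems; an empty list means the profile is usable."""
    problems = []
    # Containment can only be asserted when exactly one region is under fault.
    if len(profile) != 1:
        problems.append(f"expected exactly one target region, got {len(profile)}")
    for region, spec in profile.items():
        prob = spec.get("probability")
        if isinstance(prob, bool) or not isinstance(prob, (int, float)) or not 0 <= prob <= 1:
            problems.append(f"{region}: probability must be a number in [0, 1]")
        types = spec.get("types") or []
        if not types:
            problems.append(f"{region}: no fault types listed")
        unknown = sorted(set(types) - KNOWN_FAULTS)
        if unknown:
            problems.append(f"{region}: unknown fault types {unknown}")
    return problems

if __name__ == "__main__":
    profile = {"DE": {"probability": 0.7,
                      "types": ["truncate", "lie_length", "slowloris", "valid_garbage"]}}
    print(validate_profile(profile))  # prints []
```

Returning a list rather than raising on the first problem keeps the nightly log readable when a profile has several mistakes at once.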

The short-body cases (truncate and lie_length) reproduce what bit us originally: a Content-Length header that promises more bytes than arrive. A naive HTTP client either hangs waiting for the rest or, worse, hands you the short body as if it were complete.
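
For comparison, here is a minimal sketch of how a stricter client surfaces the same fault. Python's standard http.client raises IncompleteRead when the connection closes before the advertised Content-Length has arrived. The throwaway socket server and the helper names (serve_short_body, fetch_checked, demo) are illustrative assumptions, not part of the harness.

```python
import http.client
import socket
import threading

def serve_short_body(listener: socket.socket, promised: int, sent: int) -> None:
    """Accept one connection, promise `promised` bytes, send only `sent`, then close."""
    conn, _ = listener.accept()
    with conn:
        request = b""
        while b"\r\n\r\n" not in request:  # drain the request headers first
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        head = ("HTTP/1.1 200 OK\r\n"
                "Content-Type: application/json\r\n"
                f"Content-Length: {promised}\r\n\r\n")
        conn.sendall(head.encode("ascii") + b"x" * sent)
    # leaving the with-block closes the socket before the promise is met

def fetch_checked(host: str, port: int):
    """Return (ok, detail); a short body is reported instead of being trusted."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/")
        body = conn.getresponse().read()
    except http.client.IncompleteRead as exc:
        return False, f"length_mismatch: got {len(exc.partial)} bytes"
    finally:
        conn.close()
    return True, f"complete: {len(body)} bytes"

def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_short_body, args=(listener, 100, 68), daemon=True)
    server.start()
    result = fetch_checked("127.0.0.1", port)
    server.join()
    listener.close()
    return result

if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the except branch: a short read is treated as a failed fetch, never as a smaller but valid response.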

Making the PHP endpoint survive the proxy

The defense lives in the ingest client. The old code did json_decode(file_get_contents($url)) and trusted whatever came back. The hardened version enforces a deadline, verifies the byte count against the advertised length, validates the decoded shape, and — critically — refuses to overwrite good cache rows with anything it cannot fully trust. It uses a tiny circuit breaker stored in SQLite so a flapping region trips fast and recovers on its own.

<?php
declare(strict_types=1);

// IngestClient.php  —  PHP 8.4
final class IngestClient
{
    private const DEADLINE_MS = 8000;       // well under LiteSpeed's 180s
    private const MIN_ITEMS   = 5;          // a real region never returns fewer
    private const TRIP_AFTER  = 3;          // consecutive failures before open

    public function __construct(private readonly PDO $db) {}

    public function fetchRegion(string $region, string $url): array
    {
        if ($this->breakerOpen($region)) {
            return ['ok' => false, 'reason' => 'breaker_open', 'region' => $region];
        }

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT_MS     => self::DEADLINE_MS,
            CURLOPT_HTTPHEADER     => ["X-Region: {$region}", 'Accept: application/json'],
            CURLOPT_FAILONERROR    => false,
        ]);
        $body   = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        $dlSize = (int) curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
        $claim  = (int) curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
        $errno  = curl_errno($ch);
        curl_close($ch);

        $fail = fn(string $why) => $this->trip($region, $why);

        if ($errno !== 0 || $status !== 200 || $body === false) {
            return $fail("transport:{$errno}:{$status}");
        }
        // catch lie_length / truncate: advertised length must match what we read
        if ($claim > 0 && $claim !== $dlSize) {
            return $fail("length_mismatch:{$claim}!={$dlSize}");
        }
        $data = json_decode($body, true);
        if (!is_array($data) || !isset($data['items']) || !is_array($data['items'])) {
            return $fail('shape_invalid');
        }
        // catch valid_garbage: structurally fine but empty
        if (count($data['items']) < self::MIN_ITEMS) {
            return $fail('too_few_items');
        }

        $this->reset($region);
        return ['ok' => true, 'items' => $data['items'], 'region' => $region];
    }

    private function breakerOpen(string $region): bool
    {
        $st = $this->db->prepare(
            'SELECT fails, opened_until FROM breaker WHERE region = ?'
        );
        $st->execute([$region]);
        $row = $st->fetch(PDO::FETCH_ASSOC);
        return $row && (int) $row['opened_until'] > time();
    }

    private function trip(string $region, string $why): array
    {
        $this->db->prepare(
            'INSERT INTO breaker (region, fails, opened_until) VALUES (?, 1, 0)
             ON CONFLICT(region) DO UPDATE SET fails = fails + 1,
               opened_until = CASE WHEN fails + 1 >= ' . self::TRIP_AFTER .
            ' THEN ? ELSE opened_until END'
        )->execute([$region, time() + 300]);
        error_log("chaos-defense region={$region} tripped why={$why}");
        return ['ok' => false, 'reason' => $why, 'region' => $region];
    }

    private function reset(string $region): void
    {
        $this->db->prepare(
            'INSERT INTO breaker (region, fails, opened_until) VALUES (?, 0, 0)
             ON CONFLICT(region) DO UPDATE SET fails = 0, opened_until = 0'
        )->execute([$region]);
    }
}

The rule that makes containment work is what isn't here: there is no code path that writes $data['items'] to the cache unless ok is true. A failed region returns early, the existing rows stay untouched, and the discovery API keeps serving the last known-good snapshot. Stale beats wrong on a content discovery site — a video from six hours ago is fine; an empty grid is not.
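
The write side of that rule can be sketched in a few lines. This is an illustration under assumptions, not our ingest code: the videos table and its region column appear elsewhere in this post, but the video_id and title columns, the id and title item fields, and the apply_ingest name are hypothetical.

```python
import sqlite3

def apply_ingest(db: sqlite3.Connection, result: dict) -> bool:
    """Swap in a region's fresh rows only when the fetch result is fully trusted."""
    if not result.get("ok"):
        return False  # untrusted fetch: leave the last known-good rows untouched
    region = result["region"]
    # Build the rows before touching the table so a malformed item cannot
    # leave the region half-written.
    rows = [(region, item["id"], item["title"]) for item in result["items"]]
    with db:  # one transaction: both statements commit together or roll back
        db.execute("DELETE FROM videos WHERE region = ?", (region,))
        db.executemany(
            "INSERT INTO videos (region, video_id, title) VALUES (?, ?, ?)", rows
        )
    return True

def demo():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE videos (region TEXT, video_id TEXT, title TEXT)")
    db.execute("INSERT INTO videos VALUES ('DE', 'old-1', 'six hours old')")
    db.commit()
    query = "SELECT video_id FROM videos WHERE region = 'DE'"

    apply_ingest(db, {"ok": False, "reason": "too_few_items", "region": "DE"})
    after_fail = db.execute(query).fetchall()  # stale row survives the failed fetch

    apply_ingest(db, {"ok": True, "region": "DE",
                      "items": [{"id": "new-1", "title": "fresh"}]})
    after_ok = db.execute(query).fetchall()  # trusted fetch replaces the region
    return after_fail, after_ok

if __name__ == "__main__":
    print(demo())
```

Scoping the DELETE to a single region inside one transaction is what keeps a bad DE fetch from ever touching JP rows.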

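The length check ($claim !== $dlSize) is the single line that would have prevented our original incident. curl exposes both the promised and actual download sizes, and comparing them is free. When the connection closes early, libcurl usually reports the short transfer itself as CURLE_PARTIAL_FILE (error 18), which the transport check above catches first; the explicit comparison is a cheap second line of defense.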

Driving load and measuring blast radius

A fault that nobody hits proves nothing. The driver is a small Go program that hammers all eight region endpoints concurrently while a fault profile targets one of them, then reports per-region success rate and latency. We use Go here purely because goroutines make honest concurrent load trivial without a runtime to babysit.

// driver.go  —  go run driver.go -base http://staging.local -faulty DE -n 400
package main

import (
    "flag"
    "fmt"
    "net/http"
    "sort"
    "sync"
    "time"
)

var regions = []string{"US", "GB", "DE", "FR", "JP", "KR", "BR", "AU"}

type stat struct {
    ok, fail int
    lat      []time.Duration
}

func main() {
    base := flag.String("base", "http://localhost:8080", "discovery API base")
    n := flag.Int("n", 200, "requests per region")
    faulty := flag.String("faulty", "DE", "region under fault injection")
    flag.Parse()

    client := &http.Client{Timeout: 15 * time.Second}
    results := make(map[string]*stat, len(regions))
    var mu sync.Mutex
    var wg sync.WaitGroup

    for _, r := range regions {
        results[r] = &stat{}
        for i := 0; i < *n; i++ {
            wg.Add(1)
            go func(region string) {
                defer wg.Done()
                start := time.Now()
                req, _ := http.NewRequest("GET", *base+"/discover?region="+region, nil)
                resp, err := client.Do(req)
                d := time.Since(start)
                if err == nil {
                    defer resp.Body.Close() // close on every path, including non-200
                }
                mu.Lock()
                defer mu.Unlock()
                s := results[region]
                if err != nil || resp.StatusCode != 200 {
                    s.fail++
                    return
                }
                s.ok++
                s.lat = append(s.lat, d)
            }(r)
        }
    }
    wg.Wait()

    fmt.Printf("fault target: %s\n%-6s %-8s %-8s %-10s\n", *faulty, "region", "ok", "fail", "p95")
    for _, r := range regions {
        s := results[r]
        fmt.Printf("%-6s %-8d %-8d %-10s\n", r, s.ok, s.fail, p95(s.lat))
    }
}

func p95(d []time.Duration) time.Duration {
    if len(d) == 0 {
        return 0
    }
    sort.Slice(d, func(i, j int) bool { return d[i] < d[j] })
    return d[int(float64(len(d))*0.95)]
}

The output table is the whole point. A passing run looks like this: the faulty region shows degraded success (it is serving stale data, sometimes from an open breaker) but its p95 stays bounded, while every other region sits at 100% with normal latency. The moment another region's numbers move, containment is broken and we have a bug to fix before that code ships over FTP.
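
The pass criteria can also be written down as a check rather than read off the table by eye. The sketch below is one possible encoding, not the driver's own logic: RegionStat and containment_violations are hypothetical names, the 99.5% floor is the threshold our nightly job uses, and the 8,000 ms p95 ceiling is an assumption borrowed from the client deadline.

```python
from dataclasses import dataclass

@dataclass
class RegionStat:
    ok: int
    fail: int
    p95_ms: float

def containment_violations(stats: dict, target: str,
                           min_success: float = 0.995,
                           max_p95_ms: float = 8000.0) -> list:
    """Return violations; the faulty region is exempt from the success floor only."""
    violations = []
    for region, s in stats.items():
        total = s.ok + s.fail
        rate = s.ok / total if total else 0.0
        # Clean regions must be unaffected by the fault.
        if region != target and rate < min_success:
            violations.append(f"{region}: success rate {rate:.3f} below {min_success}")
        # Every region, including the faulty one, must keep bounded latency.
        if s.p95_ms > max_p95_ms:
            violations.append(f"{region}: p95 {s.p95_ms:.0f} ms above {max_p95_ms:.0f} ms")
    return violations

if __name__ == "__main__":
    run = {
        "US": RegionStat(ok=400, fail=0, p95_ms=120.0),
        "DE": RegionStat(ok=310, fail=90, p95_ms=900.0),  # fault target
        "JP": RegionStat(ok=396, fail=4, p95_ms=130.0),   # 99.0%: a containment leak
    }
    for line in containment_violations(run, target="DE"):
        print(line)
```

In this made-up run only JP is flagged: DE is allowed to degrade because it is the target, but its latency still has to stay under the ceiling.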

Wiring it into multi-region cron

The harness is not a one-off. Our region cron already loops over the region list to pull metadata, so the chaos run reuses the same loop on staging. A nightly job starts the proxy with a randomly chosen single-region profile, runs the driver, and fails the build if any non-target region drops below 99.5% success or if the cached row count for any clean region changes during the fault window.

#!/usr/bin/env bash
set -euo pipefail
REGIONS=(US GB DE FR JP KR BR AU)
TARGET=${REGIONS[$(( RANDOM % ${#REGIONS[@]} ))]}

# snapshot clean-region cache row counts before the storm
sqlite3 staging.db "SELECT region, count(*) FROM videos GROUP BY region" > /tmp/before.txt

jq -n --arg r "$TARGET" '{($r): {probability: 0.8,
  types: ["truncate","lie_length","slowloris","valid_garbage","flap"]}}' > /tmp/profile.json

python chaos_proxy.py --listen 8099 --upstream "$UPSTREAM" --profile /tmp/profile.json &
PROXY=$!
trap 'kill $PROXY' EXIT
sleep 1

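go run driver.go -base http://staging.local -faulty "$TARGET" -n 400 | tee /tmp/run.txt

# assert availability: every non-target region must stay at or above 99.5% success
# (rows after the two header lines are: region ok fail p95)
awk -v t="$TARGET" 'NR > 2 && $1 != t && ($2 + $3 == 0 || $2 / ($2 + $3) < 0.995) {
  print "AVAILABILITY FAILURE: " $0 > "/dev/stderr"; bad = 1 } END { exit bad }' /tmp/run.txt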

# assert containment: no clean region's row count moved
sqlite3 staging.db "SELECT region, count(*) FROM videos GROUP BY region" > /tmp/after.txt
if ! diff <(grep -v "^$TARGET|" /tmp/before.txt) <(grep -v "^$TARGET|" /tmp/after.txt); then
  echo "CONTAINMENT FAILURE: a clean region's cache changed during fault on $TARGET" >&2
  exit 1
fi
echo "chaos run clean — fault target $TARGET contained"

Randomizing the target each night means every region gets exercised sooner or later (with eight regions, a uniform random pick needs about 22 nights on average to hit them all), and a containment regression in any one of them surfaces without us having to remember to test it.
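
How long random selection takes to cover all eight regions is a coupon-collector calculation. The sketch below works it out and, for contrast, shows a date-based rotation that guarantees full coverage every eight nights. Both function names are hypothetical, and the rotation is an alternative shown for illustration, not what the script above does.

```python
from datetime import date, timedelta
from fractions import Fraction

REGIONS = ["US", "GB", "DE", "FR", "JP", "KR", "BR", "AU"]

def expected_nights_random(n: int) -> float:
    """Expected nights until a uniform random pick has hit all n regions: n * H(n)."""
    return float(n * sum(Fraction(1, k) for k in range(1, n + 1)))

def rotating_target(day: date, regions=REGIONS) -> str:
    """Deterministic alternative: step through the region list by calendar day."""
    return regions[day.toordinal() % len(regions)]

if __name__ == "__main__":
    print(round(expected_nights_random(len(REGIONS)), 2))  # prints 21.74
    start = date(2024, 6, 10)
    print([rotating_target(start + timedelta(days=i)) for i in range(8)])
```

Random picks avoid any fixed schedule an upstream pattern could line up with, while rotation gives a hard coverage guarantee. Which matters more depends on how quickly a regression needs to surface.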

What we found

Running this against our own code was humbling. The truncation defense worked on the first try, but three other things did not:

The flap fault broke our retry logic. We retried once on 503, and the alternating pattern meant the retry always hit the 200, masking that the upstream was unhealthy half the time. We now require two consecutive successes before clearing the breaker.
Clock skew poisoned conditional requests. A Last-Modified stamped in 2031 made our If-Modified-Since logic skip refreshes for that region indefinitely. We now clamp upstream timestamps to min(upstream, now).
The 8s deadline was too generous under slowloris. A trickle that delivered a valid body 7,900 ms into an 8,000 ms deadline passed validation but starved a cron worker. We added a minimum-throughput check.
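
The three fixes are small enough to sketch together. This is an illustrative Python rendering, not our PHP code: the names, the 1,024 bytes-per-second floor, and the in-memory breaker are assumptions chosen for the example.

```python
import time
from typing import Optional

def clamp_timestamp(upstream_ts: float, now: Optional[float] = None) -> float:
    """Never trust an upstream timestamp from the future: min(upstream, now)."""
    now = time.time() if now is None else now
    return min(upstream_ts, now)

def throughput_ok(bytes_received: int, elapsed_s: float,
                  min_bytes_per_s: float = 1024.0) -> bool:
    """Reject transfers that beat the deadline but trickled in too slowly."""
    if elapsed_s <= 0:
        return True  # nothing measurable to judge
    return bytes_received / elapsed_s >= min_bytes_per_s

class Breaker:
    """Opens after `trip_after` failures; clears only after `clear_after`
    consecutive successes, so an alternating 200/503 pattern cannot reset it."""

    def __init__(self, trip_after: int = 3, clear_after: int = 2) -> None:
        self.trip_after = trip_after
        self.clear_after = clear_after
        self.fails = 0
        self.successes = 0  # length of the current success streak
        self.open = False

    def record(self, success: bool) -> None:
        if success:
            self.successes += 1
            if self.successes >= self.clear_after:
                self.fails = 0
                self.open = False
        else:
            self.successes = 0  # any failure breaks the streak
            self.fails += 1
            if self.fails >= self.trip_after:
                self.open = True

if __name__ == "__main__":
    b = Breaker()
    for outcome in [False, True, False, True, False]:  # a flapping upstream
        b.record(outcome)
    print(b.open)                    # True: single successes never cleared the count
    print(throughput_ok(8000, 7.9))  # False: about 1,013 bytes per second
```

The detail that matters is that a lone success no longer resets the failure count, which is exactly the hole the flap fault exploited.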

None of these would have shown up in a test that only mocked clean responses or hard failures. They lived in the gap between "working" and "broken" — exactly where lying dependencies operate.

Conclusion

Chaos testing for a read-heavy video API is not about randomly killing servers. It is about reproducing the dishonest failures real upstreams produce — truncated bodies, mismatched lengths, structurally-valid garbage, and skewed clocks — and proving that not one of them can escape a single region or corrupt your cache. The harness that does this is small: a lying proxy, a paranoid client that treats every response as guilty until verified, a concurrent driver that measures blast radius, and a cron wrapper that asserts containment and fails the build when it breaks.

The whole thing fits in a few hundred lines and deploys over the same boring FTP pipeline as the rest of our code, which is the only reason it actually runs every night instead of rotting in a branch. If your video API trusts its upstreams to fail honestly, build the proxy that lies to it first — you will find the bugs that wake you at 2am while you are still awake to fix them.
