ahmet gedik

Posted on Jun 2

Building a Chaos Testing Harness for Video API Endpoints Across 8 Regions

#php #python #testing #chaos

Last Tuesday at 03:14 UTC, our Tokyo region's YouTube Data API quota silently degraded. Not a hard failure — the responses came back with HTTP 200, but nextPageToken started returning null after the first page on roughly 12% of requests. Our cron in Japan kept marching, our SQLite cache absorbed the partial results, and for about six hours every Asia-region visitor to TrendVidStream saw the same 50 trending videos instead of fresh discovery. No alert fired. No log line screamed. The site just got quietly boring in one hemisphere.

That incident is the reason I sat down and built a chaos testing harness specifically for our video discovery API layer. Not a generic fault injector — those exist and they're great for stateless microservices — but a domain-aware harness that understands what "broken" means when you're orchestrating eight regional crons against a third-party quota-limited API and writing into a SQLite FTS5 index that gets shipped over FTP to four production hosts. This post walks through how it's built, what it actually catches, and the surprisingly cheap PHP and Python it took to get there.

Why generic chaos tooling didn't fit

I tried toxiproxy first. Beautiful tool. Drops packets, adds latency, slices bandwidth. The problem is that our failure modes aren't network-shaped — they're semantic. The Tokyo incident wasn't a network failure; it was a partial-correctness failure where the upstream API returned syntactically valid but semantically wrong data. Toxiproxy can't simulate "the response body is well-formed JSON but the items array contains a region's stale cached results from yesterday."

The failure modes that actually hurt us, ranked by how often they've caused incidents:

Silent quota throttling — API returns 200s but with truncated pagination or stale data
Regional DNS poisoning — one region's edge resolves youtube.googleapis.com to a Cloudflare error page
Clock skew between cron hosts — two regions both think they're the "primary" for a category fetch window
Partial SQLite write during FTP deploy — data/videos.db gets shipped mid-transaction
FTS5 index corruption — full-text shadow tables desync from main rows after an interrupted backfill
Encoding drift — Korean and Vietnamese titles arrive as UTF-8 from the API but get re-encoded as Latin-1 by an intermediate proxy

A general-purpose chaos tool catches maybe one of these (the network-layer one). The other five need a harness that understands our actual data model.
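To make the encoding-drift item concrete, here is a minimal Python sketch of one common variant, where UTF-8 bytes get misread as Latin-1 somewhere in transit. The helper names `simulate_latin1_drift` and `looks_like_mojibake` are hypothetical, and the heuristic is an assumption for illustration, not necessarily what a production detector should use; a proxy that mangles text through a different code page would need a different check.

```python
def simulate_latin1_drift(text: str) -> str:
    """Misread UTF-8 bytes as Latin-1, the way a misconfigured proxy might."""
    return text.encode("utf-8").decode("latin-1")


def looks_like_mojibake(text: str) -> bool:
    """Heuristic: the text round-trips through Latin-1 into different, valid UTF-8."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside Latin-1 (healthy Korean or
        # Vietnamese), or its Latin-1 bytes are not valid UTF-8 (healthy "café").
        return False
    # Pure ASCII survives the round trip unchanged, so it is not flagged.
    return repaired != text
```

A healthy Vietnamese or Korean title is not flagged because it cannot be encoded as Latin-1 at all, while the drifted version of the same title is flagged because its bytes still decode as UTF-8.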

The architecture in one diagram's worth of words

The harness sits as a thin wrapper around our HTTP client and our SQLite writer. Every outbound request to a video API goes through a ChaosClient that consults a JSON scenario file. Every SQLite write goes through a ChaosWriter that can inject partial-commit failures, FTS5 desync, or corrupted UTF-8 sequences. Both produce a structured event log that gets diffed against a baseline run.

The key design choice: chaos is opt-in per environment and per scenario, never globally toggled. We have a CHAOS_PROFILE env var that names the active scenario file. In production it's always unset. In CI it cycles through a matrix. On my laptop I run individual scenarios by name when I'm debugging a specific class of incident.
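The opt-in rule can be sketched in a few lines of Python. This is a minimal illustration that mirrors the PHP constructor shown in the next section, assuming scenario files are named `{profile}.json`; the function name `load_scenario` and the injectable `env` parameter are hypothetical and exist only to make the behaviour easy to test.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Mapping, Optional


def load_scenario(
    scenario_dir: Path, env: Optional[Mapping[str, str]] = None
) -> Optional[Dict[str, Any]]:
    """Return the active scenario, or None when chaos is switched off."""
    env = os.environ if env is None else env
    profile = env.get("CHAOS_PROFILE", "")
    if not profile:
        # Unset or empty means no chaos at all: the production default.
        return None
    path = scenario_dir / f"{profile}.json"
    if not path.is_file():
        # A typo in the profile name should fail loudly, not silently disable chaos.
        raise FileNotFoundError(f"Chaos profile not found: {profile}")
    return json.loads(path.read_text(encoding="utf-8"))
```

Failing hard on an unknown profile matters more than it looks: a misspelled name in the CI matrix would otherwise produce a green build that injected nothing.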

The PHP side — wrapping the video API client

Our main fetcher is cron/fetch_videos.php. It already had a centralized HTTP helper, which made the wrapper a drop-in change. Here's the relevant piece of the chaos wrapper, slimmed for readability (realFetch(), logFault(), and delayAndFetch() are omitted) but functionally what we run:

<?php
declare(strict_types=1);

final class ChaosClient
{
    private ?array $scenario = null;
    private int $requestCount = 0;

    public function __construct(private readonly \PDO $db)
    {
        $profile = getenv('CHAOS_PROFILE');
        if ($profile === false || $profile === '') {
            return;
        }
        $path = __DIR__ . "/scenarios/{$profile}.json";
        if (!is_file($path)) {
            throw new \RuntimeException("Chaos profile not found: {$profile}");
        }
        $this->scenario = json_decode(file_get_contents($path), true, flags: JSON_THROW_ON_ERROR);
    }

    public function fetch(string $url, array $params, string $region): array
    {
        $this->requestCount++;
        $fault = $this->selectFault($region);

        if ($fault !== null) {
            $this->logFault($fault, $region, $url);
            return $this->applyFault($fault, $url, $params, $region);
        }

        return $this->realFetch($url, $params);
    }

    private function selectFault(string $region): ?array
    {
        if ($this->scenario === null) {
            return null;
        }
        foreach ($this->scenario['faults'] as $fault) {
            if (!in_array($region, $fault['regions'] ?? [$region], true)) {
                continue;
            }
            $trigger = $fault['trigger'] ?? [];
            if (isset($trigger['after_requests']) && $this->requestCount < $trigger['after_requests']) {
                continue;
            }
            if (isset($trigger['probability']) && mt_rand() / mt_getrandmax() > $trigger['probability']) {
                continue;
            }
            return $fault;
        }
        return null;
    }

    private function applyFault(array $fault, string $url, array $params, string $region): array
    {
        return match ($fault['type']) {
            'silent_truncation' => $this->silentlyTruncate($this->realFetch($url, $params)),
            'stale_response'    => $this->returnStale($region, $params['playlistId'] ?? 'global'),
            'encoding_drift'    => $this->corruptEncoding($this->realFetch($url, $params)),
            'http_200_empty'    => ['items' => [], 'nextPageToken' => null],
            'latency_spike'     => $this->delayAndFetch($url, $params, $fault['delay_ms']),
            default             => throw new \RuntimeException("Unknown fault: {$fault['type']}"),
        };
    }

    private function silentlyTruncate(array $response): array
    {
        $response['items'] = array_slice($response['items'] ?? [], 0, 3);
        unset($response['nextPageToken']);
        return $response;
    }

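    private function returnStale(string $region, string $key): array
    {
        // Replay the newest baseline that is at least 24 hours old.
        $stmt = $this->db->prepare(
            'SELECT response FROM chaos_baselines WHERE region = ? AND key = ? AND captured_at < ?
             ORDER BY captured_at DESC LIMIT 1'
        );
        $stmt->execute([$region, $key, time() - 86400]);
        $row = $stmt->fetch(\PDO::FETCH_ASSOC);
        // Fall back to an empty page if no baseline exists or the stored JSON is unreadable.
        return $row ? (json_decode($row['response'], true) ?? ['items' => []]) : ['items' => []];
    }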

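    private function corruptEncoding(array $response): array
    {
        // Iterate by key: a by-reference foreach over `$response['items'] ?? []`
        // walks a temporary copy, so the corrupted titles would never reach $response.
        foreach (array_keys($response['items'] ?? []) as $i) {
            if (isset($response['items'][$i]['snippet']['title'])) {
                // Characters with no Latin-1 equivalent (Korean, Vietnamese) degrade to '?'.
                $response['items'][$i]['snippet']['title'] = mb_convert_encoding(
                    $response['items'][$i]['snippet']['title'], 'ISO-8859-1', 'UTF-8'
                );
            }
        }
        return $response;
    }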
}

The chaos_baselines table is populated by a separate "capture" mode that runs against the real API for one clean cycle and stores every response keyed by region and playlist. Stale-response injection then replays yesterday's data with today's timestamps — which is exactly what the Tokyo incident looked like.
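Capture mode itself is not shown in this post, so the following is only a sketch of what it might look like, written in Python for brevity. The column names come from the query in returnStale(); the column types, the `capture` and `load_stale` function names, and storing `captured_at` as Unix seconds are all assumptions.

```python
import json
import sqlite3
import time
from typing import Any, Dict, Optional

# Assumed schema: only the column names are taken from the PHP query.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chaos_baselines (
    region      TEXT    NOT NULL,
    "key"       TEXT    NOT NULL,
    response    TEXT    NOT NULL,
    captured_at INTEGER NOT NULL
)
"""


def capture(conn: sqlite3.Connection, region: str, key: str,
            response: Dict[str, Any], now: Optional[int] = None) -> None:
    """Store one clean API response, keyed by region and playlist."""
    conn.execute(SCHEMA)
    conn.execute(
        'INSERT INTO chaos_baselines (region, "key", response, captured_at) '
        "VALUES (?, ?, ?, ?)",
        (region, key, json.dumps(response), int(time.time()) if now is None else now),
    )
    conn.commit()


def load_stale(conn: sqlite3.Connection, region: str, key: str,
               now: Optional[int] = None, min_age_s: int = 86400) -> Dict[str, Any]:
    """Return the newest captured response that is at least min_age_s old."""
    conn.execute(SCHEMA)
    cutoff = (int(time.time()) if now is None else now) - min_age_s
    row = conn.execute(
        'SELECT response FROM chaos_baselines WHERE region = ? AND "key" = ? '
        "AND captured_at < ? ORDER BY captured_at DESC LIMIT 1",
        (region, key, cutoff),
    ).fetchone()
    return json.loads(row[0]) if row else {"items": []}
```

Passing `now` explicitly keeps the age cutoff deterministic in tests, which is useful when the whole point of the table is to replay data that is deliberately out of date.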

Scenario files — keeping the chaos declarative

Scenarios live in cron/scenarios/*.json. Keeping them as plain data (not code) means our QA contractor can write new ones without touching PHP, and they diff cleanly in git when we tweak failure rates. A real one from our suite:

{
  "name": "tokyo_silent_degradation_2026_03",
  "description": "Reproduces 2026-03-14 incident: JP region returns 200 with truncated pagination after 40 requests",
  "faults": [
    {
      "type": "silent_truncation",
      "regions": ["JP", "KR", "TW"],
      "trigger": { "after_requests": 40, "probability": 0.12 }
    },
    {
      "type": "latency_spike",
      "regions": ["SG", "HK"],
      "trigger": { "probability": 0.05 },
      "delay_ms": 8000
    },
    {
      "type": "encoding_drift",
      "regions": ["VN", "TH"],
      "trigger": { "probability": 0.20 }
    }
  ],
  "assertions": {
    "max_lost_videos_percent": 5,
    "max_duplicate_inserts": 10,
    "required_alerts": ["truncation_detected", "encoding_anomaly"]
  }
}

The assertions block is what turns chaos into a test. After the harness runs a scenario, a verifier checks that the system behaved within those bounds — and crucially, that the right alerts fired. The Tokyo incident was bad not because the API misbehaved (APIs misbehave; that's normal) but because we had no alert that would fire on silent truncation. The assertion required_alerts: [truncation_detected] is what forced us to actually write that detector.
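The detector itself is not shown here, so this is a hedged sketch of one way a truncation check could work, in Python. It assumes each page carries the YouTube Data API's `pageInfo.totalResults`, `items`, and `nextPageToken` fields, and that `totalResults` is close enough to the real count to compare against, which is not guaranteed for every endpoint. The function names, the 0.9 ratio, and the timestamp in the first log field are hypothetical; only the tab-separated layout with the alert name in the second field matches what the verifier reads.

```python
from typing import Any, Dict, List


def detect_truncation(pages: List[Dict[str, Any]], min_ratio: float = 0.9) -> bool:
    """Flag a fetch whose pagination ended well short of the advertised total."""
    if not pages:
        return False
    if pages[-1].get("nextPageToken"):
        # Pagination has not finished yet, so there is nothing to judge.
        return False
    advertised = pages[0].get("pageInfo", {}).get("totalResults")
    if not advertised:
        return False
    fetched = sum(len(page.get("items", [])) for page in pages)
    return fetched < advertised * min_ratio


def alert_line(name: str, detail: str, timestamp: str) -> str:
    """Format one alert as a tab-separated line with the name in field two."""
    return f"{timestamp}\t{name}\t{detail}"
```

A cron that deliberately stops after a fixed number of pages would need to compare against that cap instead of `totalResults`, otherwise every healthy run would trip the alert.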

The Python verifier — diffing against baseline

The PHP side runs the cron under fault injection and writes its normal output. The verifier is a separate Python script that compares the post-chaos SQLite state against a clean-run baseline and against the scenario's assertions. Python because sqlite3 and plain set arithmetic make this kind of diff trivial; the version below needs nothing outside the standard library.

from __future__ import annotations
import json
import sqlite3
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VerificationResult:
    scenario: str
    passed: bool
    failures: list[str]
    metrics: dict[str, float]


def verify(scenario_path: Path, baseline_db: Path, chaos_db: Path, alerts_log: Path) -> VerificationResult:
    scenario = json.loads(scenario_path.read_text())
    assertions = scenario.get("assertions", {})
    failures: list[str] = []

    baseline_videos = _load_videos(baseline_db)
    chaos_videos = _load_videos(chaos_db)

    lost = baseline_videos - chaos_videos
    lost_pct = (len(lost) / max(len(baseline_videos), 1)) * 100

    if lost_pct > assertions.get("max_lost_videos_percent", 100):
        failures.append(
            f"Lost {lost_pct:.1f}% of videos (limit {assertions['max_lost_videos_percent']}%)"
        )

    duplicates = _count_duplicates(chaos_db)
    if duplicates > assertions.get("max_duplicate_inserts", 0):
        failures.append(f"Found {duplicates} duplicate inserts")

    fired_alerts = _read_alerts(alerts_log)
    for required in assertions.get("required_alerts", []):
        if required not in fired_alerts:
            failures.append(f"Required alert never fired: {required}")

    fts_integrity = _check_fts_integrity(chaos_db)
    if not fts_integrity:
        failures.append("FTS5 shadow tables desynced from main rows")

    return VerificationResult(
        scenario=scenario["name"],
        passed=len(failures) == 0,
        failures=failures,
        metrics={
            "lost_videos_percent": round(lost_pct, 2),
            "duplicate_inserts": duplicates,
            "alerts_fired": len(fired_alerts),
        },
    )


def _load_videos(db_path: Path) -> set[str]:
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT video_id FROM videos WHERE region IS NOT NULL").fetchall()
        return {row[0] for row in rows}
    finally:
        conn.close()


def _count_duplicates(db_path: Path) -> int:
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT COUNT(*) FROM (SELECT video_id, region, COUNT(*) c FROM videos GROUP BY video_id, region HAVING c > 1)"
        ).fetchone()[0]
    finally:
        conn.close()


def _check_fts_integrity(db_path: Path) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        main_count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
        fts_count = conn.execute("SELECT COUNT(*) FROM videos_fts").fetchone()[0]
        return abs(main_count - fts_count) < 5
    finally:
        conn.close()


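def _read_alerts(log_path: Path) -> set[str]:
    # The alert log is tab-separated; the alert name is the second field.
    if not log_path.exists():
        return set()
    alerts: set[str] = set()
    for line in log_path.read_text().splitlines():
        fields = line.split("\t")
        # Skip lines with no second field instead of raising IndexError.
        if len(fields) > 1 and fields[1].strip():
            alerts.add(fields[1].strip())
    return alerts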


if __name__ == "__main__":
    result = verify(
        Path(sys.argv[1]), Path(sys.argv[2]), Path(sys.argv[3]), Path(sys.argv[4])
    )
    print(json.dumps(result.__dict__, indent=2, default=list))
    sys.exit(0 if result.passed else 1)

This runs in CI on every PR that touches cron/ or app/Indexing.php. The matrix iterates over every scenario in cron/scenarios/, fresh SQLite each time, and the job fails if any scenario's assertions don't hold.
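Because every file in the scenario directory becomes a CI job, it can help to lint the files before running them. The following Python sketch is an assumption about what such a check might cover, not part of the harness described here: the fault types are the ones handled by applyFault(), `delay_ms` is required because the latency fault reads it, and the name `lint_scenario` is hypothetical.

```python
from typing import Any, Dict, List

# Fault types handled by the match expression in ChaosClient::applyFault().
KNOWN_FAULT_TYPES = {
    "silent_truncation",
    "stale_response",
    "encoding_drift",
    "http_200_empty",
    "latency_spike",
}


def lint_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the scenario looks usable."""
    problems: List[str] = []
    if not scenario.get("name"):
        problems.append("missing name")
    faults = scenario.get("faults") or []
    if not faults:
        problems.append("no faults defined")
    for i, fault in enumerate(faults):
        fault_type = fault.get("type")
        if fault_type not in KNOWN_FAULT_TYPES:
            problems.append(f"faults[{i}]: unknown type {fault_type!r}")
        probability = fault.get("trigger", {}).get("probability")
        if probability is not None and not 0 <= probability <= 1:
            problems.append(f"faults[{i}]: probability {probability} outside [0, 1]")
        if fault_type == "latency_spike" and "delay_ms" not in fault:
            problems.append(f"faults[{i}]: latency_spike needs delay_ms")
    if not scenario.get("assertions"):
        # Without assertions the run can never fail, so it is not a test.
        problems.append("no assertions defined")
    return problems
```

Running this over each scenario file before the matrix starts turns a typo in a fault type into an immediate, readable failure rather than a RuntimeException halfway through a cron run.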

The FTP deploy chaos — the one nobody talks about

Most chaos engineering writing assumes Kubernetes. We deploy over FTP with lftp mirror -R from ops.sh. That gives us a failure surface most teams have forgotten exists: partial file transfer. If lftp dies mid-mirror, half the PHP files are new and half are old. If the connection drops while data/videos.db is being shipped, the remote SQLite is genuinely corrupt — not just stale.

The harness simulates this by running deploy into a staging directory and using truncate to randomly cut files mid-transfer:

#!/usr/bin/env bash
set -euo pipefail

STAGING="/tmp/chaos_deploy_$$"
mkdir -p "$STAGING"
# Remove the staging copy even when the check below exits non-zero.
trap 'rm -rf "$STAGING"' EXIT

rsync -a --exclude='.git' --exclude='data/*.log' ./ "$STAGING/"

VICTIMS=(
  "$STAGING/data/videos.db"
  "$STAGING/public/index.php"
  "$STAGING/app/Database.php"
  "$STAGING/cron/fetch_videos.php"
)

for file in "${VICTIMS[@]}"; do
  if [ -f "$file" ] && [ $((RANDOM % 4)) -eq 0 ]; then
    original_size=$(stat -c%s "$file")
    new_size=$((original_size * (RANDOM % 90 + 5) / 100))
    truncate -s "$new_size" "$file"
    echo "chaos: truncated $file from $original_size to $new_size bytes"
  fi
done

php -d display_errors=1 "$STAGING/public/index.php" > /tmp/chaos_response.txt 2>&1 || true

if grep -qE 'Fatal error|Parse error|Uncaught' /tmp/chaos_response.txt; then
  echo "FAIL: corrupted deploy produced PHP error visible to users"
  cat /tmp/chaos_response.txt
  exit 1
fi

rm -rf "$STAGING"
echo "PASS: corrupted deploy degraded gracefully"

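The assertion here is weaker than the API harness: we're not checking correctness, just that a partial deploy doesn't render a white-screen-of-death to users. The fix it forced was wrapping our entry point in a top-level set_error_handler that serves a cached HTML snapshot if the bootstrap fails. Note that set_error_handler alone does not receive fatal or parse errors, so a wrapper like this also needs a try/catch on \Throwable around the includes or a register_shutdown_function hook to cover a truncated file.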
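The script above only exercises the PHP entry point. For the shipped database, a complementary check could look like the Python sketch below, which opens the file read-only and runs SQLite's `PRAGMA integrity_check`. The function name `is_sqlite_intact` is hypothetical, and requiring a `videos` table is an assumption added because a file truncated to zero bytes is still a valid, empty SQLite database.

```python
import sqlite3
from pathlib import Path


def is_sqlite_intact(db_path: Path, required_table: str = "videos") -> bool:
    """Return True only if the file opens, passes integrity_check, and has the table."""
    try:
        # mode=ro avoids creating an empty database when the file is missing.
        conn = sqlite3.connect(db_path.resolve().as_uri() + "?mode=ro", uri=True)
    except sqlite3.Error:
        return False
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        if rows != [("ok",)]:
            return False
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (required_table,),
        ).fetchone()
        return found is not None
    except sqlite3.DatabaseError:
        # Truncated or malformed files usually surface here rather than at connect().
        return False
    finally:
        conn.close()
```

Run against the staging copy after the truncation loop, this would let the deploy test distinguish a genuinely corrupt data/videos.db from one that is merely stale.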

Catching the FTS5 desync class of bugs

SQLite FTS5 stores shadow tables (_data, _idx, _docsize, _config, _content) alongside the main table. If you INSERT INTO videos without going through the FTS5-aware path, the search index silently misses rows. Our backfill scripts have hit this twice. The chaos harness now injects this deliberately:

final class ChaosWriter
{
    public function __construct(
        private readonly \PDO $db,
        private readonly ?string $profile = null
    ) {}

    public function insertVideo(array $video): void
    {
        if ($this->profile === 'fts_desync' && mt_rand(1, 100) <= 15) {
            $stmt = $this->db->prepare(
                'INSERT INTO videos (video_id, title, region, channel) VALUES (?, ?, ?, ?)'
            );
            $stmt->execute([$video['id'], $video['title'], $video['region'], $video['channel']]);
            return;
        }

        $this->db->beginTransaction();
        try {
            $stmt = $this->db->prepare(
                'INSERT INTO videos (video_id, title, region, channel) VALUES (?, ?, ?, ?)'
            );
            $stmt->execute([$video['id'], $video['title'], $video['region'], $video['channel']]);
            $rowid = (int) $this->db->lastInsertId();
            $fts = $this->db->prepare('INSERT INTO videos_fts (rowid, title, channel) VALUES (?, ?, ?)');
            $fts->execute([$rowid, $video['title'], $video['channel']]);
            $this->db->commit();
        } catch (\Throwable $e) {
            $this->db->rollBack();
            throw $e;
        }
    }
}

When the fts_desync profile runs, ~15% of inserts skip the FTS shadow table. The verifier's _check_fts_integrity then catches the drift. This caught a real bug last month where our admin's manual video-add form bypassed the wrapper entirely.
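Comparing row counts tells you that drift exists but not which rows are affected. A stricter variant, sketched here in Python under the assumption that videos_fts rows share rowids with videos (as the writer above arranges), lists the offending rowids directly. The function name `find_fts_drift` and its return shape are hypothetical.

```python
import sqlite3
from typing import Dict, List


def find_fts_drift(conn: sqlite3.Connection, limit: int = 20) -> Dict[str, List[int]]:
    """List rowids missing from the FTS index and FTS rows with no main row."""
    missing = conn.execute(
        "SELECT rowid FROM videos "
        "WHERE rowid NOT IN (SELECT rowid FROM videos_fts) "
        "ORDER BY rowid LIMIT ?",
        (limit,),
    ).fetchall()
    orphaned = conn.execute(
        "SELECT rowid FROM videos_fts "
        "WHERE rowid NOT IN (SELECT rowid FROM videos) "
        "ORDER BY rowid LIMIT ?",
        (limit,),
    ).fetchall()
    return {
        "missing_from_fts": [row[0] for row in missing],
        "orphaned_in_fts": [row[0] for row in orphaned],
    }
```

Unlike a count comparison with a tolerance, this fails on a single unindexed row and reports which one, which makes a bypassed write path much quicker to trace.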

What it actually catches in practice

After six weeks running this in CI on every cron-touching PR, here's the honest scorecard:

3 real bugs caught pre-merge — one FTS desync (the admin form), one race condition between two regions writing the same trending video, one missing UTF-8 normalization on Vietnamese titles
1 alert gap closed — silent truncation detector now exists and fires correctly
0 false positives in CI — every red build was a real issue
Roughly 90 seconds added per PR — the full scenario matrix runs in parallel; the slowest scenario is the latency-spike one which deliberately stalls

The cost was a weekend of building it and maybe two hours per month of maintaining scenarios. The benefit is that I now sleep through Asia-region cron windows without anxiety.

What I'd do differently

If I were starting over: I'd build the baseline-capture mode first, before any fault injection. Half the early scenarios I wrote were wrong because I was guessing what a healthy response looked like. Capture mode took a few hours to add later but made every subsequent scenario writable in minutes instead of hours.

I'd also be stricter about scenario hygiene from day one. We accumulated some scenarios that were really just "random failures everywhere" — fun to watch, useless as tests. The good scenarios all reproduce a real past incident or a specific worry, and have a tight assertion that would have caught the incident if it had been in place. A scenario without a falsifiable assertion is just a stress test with extra steps.

Chaos engineering for a small ops team isn't about resilience platforms or service meshes. It's about taking the specific ways your specific system has burned you, encoding them as scenarios, and running them on every change. The 200-line PHP wrapper and 70-line Python verifier above have prevented more incidents than any monitoring dashboard I've ever built. Start with your last three postmortems — that's your scenario backlog.
