The trending spike that hammered a partner's ingest
A few weeks ago a single skateboarding clip we track on ViralVidVault jumped from roughly 4,000 to 1.2 million views in under an hour. Our trend detector did exactly what it was built to do: it emitted a stream of video.trending, video.velocity_changed, and video.rank_updated events. The problem was downstream. We fan those events out to a dozen webhook subscribers — partner dashboards, a Slack relay, two analytics sinks, and a Cloudflare Worker that warms our edge cache. Within ninety seconds one partner's ingest endpoint started returning 429 Too Many Requests, then 503, and then stopped answering entirely. We had effectively DoS'd a partner with our own good news.
This is the pipeline we shipped to fix it: durable, rate-limited webhook delivery built on PHP 8.4 and SQLite in WAL mode, with a token-bucket limiter per subscriber, exponential backoff with jitter, HMAC request signing, and a small Go worker for the high-throughput path. Every piece below runs in production on LiteSpeed today.
Why naive fan-out breaks
The first version was the obvious one: when an event fires, loop over subscribers and POST synchronously. That works fine at ten events an hour. It falls apart the moment a video goes viral, for reasons that compound:
- No backpressure. A burst of 5,000 events becomes 60,000 outbound POSTs (5,000 × 12 subscribers) with no governor. Slow subscribers block the very request that emitted the event.
- No isolation. One dead subscriber stalls delivery for every other subscriber, because they share a single synchronous loop.
- No durability. If the PHP worker is recycled mid-burst — and LiteSpeed will recycle it — in-flight events vanish and subscribers silently miss data.
-
No retry discipline. A transient
502means the event is lost, or it gets retried in a tight loop that turns a partner's blip into a full outage.
The fix has three parts: decouple emitting an event from delivering it, put a durable queue in between, and make the deliverer respect a per-subscriber rate limit.
The queue and subscriber schema
Everything hangs off two tables. SQLite in WAL mode is a good fit here: writers don't block readers, and a single video site rarely exceeds a few hundred queued deliveries per second — comfortably inside what one SQLite file does on an NVMe-backed LiteSpeed box.
CREATE TABLE webhook_subscriber (
id INTEGER PRIMARY KEY,
url TEXT NOT NULL,
secret TEXT NOT NULL,
event_types TEXT NOT NULL DEFAULT '*',
rate_per_sec REAL NOT NULL DEFAULT 5.0,
burst INTEGER NOT NULL DEFAULT 10,
tokens REAL NOT NULL DEFAULT 10.0,
tokens_updated_at REAL NOT NULL DEFAULT 0,
active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE webhook_delivery (
id INTEGER PRIMARY KEY,
subscriber_id INTEGER NOT NULL REFERENCES webhook_subscriber(id),
event_type TEXT NOT NULL,
payload TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
attempts INTEGER NOT NULL DEFAULT 0,
next_attempt_at REAL NOT NULL DEFAULT 0,
locked_until REAL NOT NULL DEFAULT 0,
created_at REAL NOT NULL
);
CREATE INDEX idx_delivery_due
ON webhook_delivery(status, next_attempt_at)
WHERE status IN ('pending', 'retry');
Three deliberate choices:
-
next_attempt_atis the scheduling primitive. A row is "due" whennext_attempt_at <= now. Retries simply push this into the future. -
locked_untillets multiple worker processes cooperate without a separate lock table. A worker claims rows by stamping a short lease on them. - Token state (
tokens,tokens_updated_at) lives on the subscriber row, so the rate limiter survives process restarts. An in-memory limiter forgets everything the moment LiteSpeed recycles the process.
A persistent token bucket and a cheap emit path
The token bucket is the heart of the rate limit. Each subscriber refills at rate_per_sec up to a burst ceiling. Because the state is in SQLite, refill is computed lazily from elapsed time rather than from a background ticker — no cron job needed to "top up" buckets. Emitting an event is then cheap: it writes one durable row per matching subscriber and returns, doing no outbound HTTP on the request path.
<?php
declare(strict_types=1);
/** Per-subscriber token bucket. State lives in SQLite so it survives restarts. */
final class TokenBucket
{
public function __construct(private PDO $db) {}
public function tryConsume(int $subscriberId, float $now): bool
{
$this->db->beginTransaction();
try {
$stmt = $this->db->prepare(
'SELECT rate_per_sec, burst, tokens, tokens_updated_at
FROM webhook_subscriber WHERE id = :id AND active = 1'
);
$stmt->execute([':id' => $subscriberId]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);
if ($row === false) {
$this->db->commit();
return false;
}
$elapsed = max(0.0, $now - (float) $row['tokens_updated_at']);
$tokens = min(
(float) $row['burst'],
(float) $row['tokens'] + $elapsed * (float) $row['rate_per_sec']
);
$granted = $tokens >= 1.0;
$upd = $this->db->prepare(
'UPDATE webhook_subscriber
SET tokens = :t, tokens_updated_at = :now WHERE id = :id'
);
$upd->execute([
':t' => $granted ? $tokens - 1.0 : $tokens,
':now' => $now,
':id' => $subscriberId,
]);
$this->db->commit();
return $granted;
} catch (\Throwable $e) {
$this->db->rollBack();
throw $e;
}
}
}
/** Fans an event into the durable queue: one row per matching subscriber. */
final class WebhookEmitter
{
public function __construct(private PDO $db) {}
public function emit(string $eventType, array $data, float $now): int
{
$payload = json_encode([
'event' => $eventType,
'id' => bin2hex(random_bytes(16)),
'occurred_at' => gmdate('c', (int) $now),
'data' => $data, // video IDs + aggregate counts only — no PII
], JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR);
$subs = $this->db
->query('SELECT id, event_types FROM webhook_subscriber WHERE active = 1')
->fetchAll(PDO::FETCH_ASSOC);
$ins = $this->db->prepare(
'INSERT INTO webhook_delivery
(subscriber_id, event_type, payload, next_attempt_at, created_at)
VALUES (:sid, :etype, :payload, :now, :now)'
);
$queued = 0;
foreach ($subs as $s) {
if (!$this->matches($s['event_types'], $eventType)) {
continue;
}
$ins->execute([
':sid' => (int) $s['id'],
':etype' => $eventType,
':payload' => $payload,
':now' => $now,
]);
$queued++;
}
return $queued;
}
private function matches(string $patterns, string $eventType): bool
{
foreach (explode(',', $patterns) as $p) {
$p = trim($p);
if ($p === '*' || $p === $eventType) {
return true;
}
if (str_ends_with($p, '.*')
&& str_starts_with($eventType, substr($p, 0, -1))) {
return true;
}
}
return false;
}
}
The lazy refill (elapsed * rate_per_sec, capped at burst) means a subscriber that has been idle for a minute can absorb a short burst, then settles back to its steady rate. That matches how viral traffic actually arrives: spiky, then sustained. Two more things worth calling out:
-
Event matching supports
*, exact types, andvideo.*prefixes, so a subscriber opts into only the events it cares about. Fewer rows queued means less rate-limit pressure downstream. - GDPR by construction. The payload carries a video ID, an event type, and aggregate counts — never an end-user identifier, IP, or session. For a European-market product that is not a nice-to-have: it means the webhook payload is not personal data, so a subscriber outage cannot leak anything sensitive.
The delivery worker: claim, rate-limit, sign, retry
The worker is a loop. Each tick claims a batch of due rows with a short lease, then for each row it checks the token bucket, signs the payload, sends it, and reschedules on failure with jittered exponential backoff.
<?php
declare(strict_types=1);
final class DeliveryWorker
{
private const MAX_ATTEMPTS = 8;
public function __construct(
private PDO $db,
private TokenBucket $bucket,
) {}
public function tick(float $now, int $batch = 100): int
{
// Claim due rows with a 30s lease so parallel workers don't collide.
$lease = $now + 30.0;
$claim = $this->db->prepare(
"UPDATE webhook_delivery SET locked_until = :lease
WHERE id IN (
SELECT id FROM webhook_delivery
WHERE status IN ('pending', 'retry')
AND next_attempt_at <= :now
AND locked_until <= :now
ORDER BY next_attempt_at LIMIT " . (int) $batch . "
)"
);
$claim->execute([':lease' => $lease, ':now' => $now]);
$rows = $this->db->prepare(
"SELECT d.id, d.attempts, d.payload, s.id AS sid, s.url, s.secret
FROM webhook_delivery d
JOIN webhook_subscriber s ON s.id = d.subscriber_id
WHERE d.locked_until = :lease
ORDER BY d.next_attempt_at"
);
$rows->execute([':lease' => $lease]);
$sent = 0;
foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $d) {
if (!$this->bucket->tryConsume((int) $d['sid'], $now)) {
// Over budget right now: defer ~1s and release the lease.
$this->reschedule((int) $d['id'], $now + 1.0, (int) $d['attempts']);
continue;
}
[$ok, $code] = $this->send($d['url'], $d['secret'], $d['payload']);
$attempts = (int) $d['attempts'] + 1;
if ($ok) {
$this->finish((int) $d['id'], 'done', $attempts);
$sent++;
} elseif ($attempts >= self::MAX_ATTEMPTS) {
$this->finish((int) $d['id'], 'dead', $attempts);
} else {
$this->reschedule((int) $d['id'], $now + $this->backoff($attempts), $attempts);
}
}
return $sent;
}
private function send(string $url, string $secret, string $payload): array
{
$ts = (string) time();
$sig = hash_hmac('sha256', $ts . '.' . $payload, $secret);
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $payload,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_TIMEOUT => 8,
CURLOPT_CONNECTTIMEOUT => 4,
CURLOPT_HTTPHEADER => [
'Content-Type: application/json',
'X-VVV-Timestamp: ' . $ts,
'X-VVV-Signature: sha256=' . $sig,
'User-Agent: ViralVidVault-Webhooks/1.0',
],
]);
curl_exec($ch);
$code = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);
return [$code >= 200 && $code < 300, $code];
}
private function backoff(int $attempt): float
{
// Exponential base with full jitter, capped at one hour.
$base = min(3600.0, 2 ** $attempt);
return $base / 2 + (mt_rand() / mt_getrandmax()) * ($base / 2);
}
private function reschedule(int $id, float $at, int $attempts): void
{
$this->db->prepare(
"UPDATE webhook_delivery
SET status = 'retry', attempts = :a, next_attempt_at = :at, locked_until = 0
WHERE id = :id"
)->execute([':a' => $attempts, ':at' => $at, ':id' => $id]);
}
private function finish(int $id, string $status, int $attempts): void
{
$this->db->prepare(
"UPDATE webhook_delivery
SET status = :st, attempts = :a, locked_until = 0 WHERE id = :id"
)->execute([':st' => $status, ':a' => $attempts, ':id' => $id]);
}
}
The signing scheme is deliberately boring and standard:
- We sign
timestamp.payloadwith HMAC-SHA256 and send bothX-VVV-TimestampandX-VVV-Signature. - Subscribers reject signatures computed over a stale timestamp, which blunts replay attacks.
-
Full jitter on backoff (
base/2 + random(0, base/2)) spreads retries out. Without jitter, every failed delivery for a downed partner retries at the same instant and stampedes them again the moment they recover.
After MAX_ATTEMPTS — eight here, spanning several minutes of total backoff — a delivery is marked dead and surfaced on an ops dashboard rather than retried forever.
Running the worker on LiteSpeed
There is no daemon manager on shared LiteSpeed hosting, so we run the worker as a self-terminating CLI loop kicked off every minute by cron: * * * * * php /home/vvv/app/worker.php. The script runs tick() in a loop for about 55 seconds, then exits cleanly before the next cron fires. Overlap is harmless because of the locked_until lease — a second process simply cannot claim rows the first one still holds.
A Go drain worker for the spikes
PHP-per-minute is fine for steady state. During a genuine viral spike — tens of thousands of queued deliveries — we switch the hot path to a single long-running Go process. It uses golang.org/x/time/rate for in-process per-subscriber limiting and keeps one SQLite connection open in WAL mode.
package main
import (
"crypto/hmac"
"crypto/sha256"
"database/sql"
"encoding/hex"
"fmt"
"net/http"
"strings"
"sync"
"time"
_ "github.com/mattn/go-sqlite3"
"golang.org/x/time/rate"
)
type limiters struct {
mu sync.Mutex
m map[int64]*rate.Limiter
}
func (l *limiters) get(id int64, rps float64, burst int) *rate.Limiter {
l.mu.Lock()
defer l.mu.Unlock()
lim, ok := l.m[id]
if !ok {
lim = rate.NewLimiter(rate.Limit(rps), burst)
l.m[id] = lim
}
return lim
}
func sign(secret, ts, payload string) string {
mac := hmac.New(sha256.New, []byte(secret))
mac.Write([]byte(ts + "." + payload))
return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}
func main() {
db, err := sql.Open("sqlite3", "file:webhooks.db?_journal=WAL&_busy_timeout=5000")
if err != nil {
panic(err)
}
db.SetMaxOpenConns(1) // SQLite is a single writer
lims := &limiters{m: map[int64]*rate.Limiter{}}
client := &http.Client{Timeout: 8 * time.Second}
for {
now := float64(time.Now().UnixNano()) / 1e9
rows, _ := db.Query(`
SELECT d.id, d.payload, d.attempts, s.id, s.url, s.secret, s.rate_per_sec, s.burst
FROM webhook_delivery d
JOIN webhook_subscriber s ON s.id = d.subscriber_id
WHERE d.status IN ('pending', 'retry') AND d.next_attempt_at <= ?
ORDER BY d.next_attempt_at LIMIT 200`, now)
type job struct {
id, sid int64
payload, url, secret string
attempts int
rps float64
burst int
}
var jobs []job
for rows.Next() {
var j job
rows.Scan(&j.id, &j.payload, &j.attempts, &j.sid, &j.url, &j.secret, &j.rps, &j.burst)
jobs = append(jobs, j)
}
rows.Close()
for _, j := range jobs {
if !lims.get(j.sid, j.rps, j.burst).Allow() {
db.Exec(`UPDATE webhook_delivery SET next_attempt_at = ? WHERE id = ?`, now+1, j.id)
continue
}
ts := fmt.Sprintf("%d", time.Now().Unix())
req, _ := http.NewRequest("POST", j.url, strings.NewReader(j.payload))
req.Header.Set("Content-Type", "application/json")
req.Header.Set("X-VVV-Timestamp", ts)
req.Header.Set("X-VVV-Signature", sign(j.secret, ts, j.payload))
resp, err := client.Do(req)
ok := err == nil && resp.StatusCode >= 200 && resp.StatusCode < 300
if resp != nil {
resp.Body.Close()
}
if ok {
db.Exec(`UPDATE webhook_delivery SET status = 'done', attempts = attempts + 1 WHERE id = ?`, j.id)
} else {
backoff := float64(int64(1) << uint(j.attempts+1))
db.Exec(`UPDATE webhook_delivery SET status = 'retry', attempts = attempts + 1, next_attempt_at = ? WHERE id = ?`,
now+backoff, j.id)
}
}
time.Sleep(200 * time.Millisecond)
}
}
The tradeoff is explicit: the Go limiter is in-memory, so it is faster but forgets its bucket on restart. We accept that on the spike path because the process runs for hours at a time during an event. The SQLite-backed PHP limiter remains the source of truth for steady-state delivery, where restarts are frequent. Same queue, two drainers — but never both pointed at the same subscriber set at once.
What subscribers must do
Rate limiting only works if the other side cooperates. Our subscriber docs require three things, and here is the reference receiver we ship:
import hashlib
import hmac
import time
from flask import Flask, request, abort
app = Flask(__name__)
SECRET = b"the-shared-subscriber-secret"
TOLERANCE = 300 # seconds
@app.post("/webhooks/vvv")
def receive():
ts = request.headers.get("X-VVV-Timestamp", "")
sig = request.headers.get("X-VVV-Signature", "")
body = request.get_data() # raw bytes, before any parsing
# 1. Reject stale timestamps to blunt replay attacks.
try:
if abs(time.time() - int(ts)) > TOLERANCE:
abort(401)
except ValueError:
abort(400)
# 2. Recompute the signature over "timestamp.payload".
signed = ts.encode() + b"." + body
expected = "sha256=" + hmac.new(SECRET, signed, hashlib.sha256).hexdigest()
if not hmac.compare_digest(expected, sig):
abort(401)
# 3. Only now parse and act. Return 2xx fast; do real work async.
event = request.get_json()
enqueue_locally(event) # your job queue, not inline work
return "", 204
def enqueue_locally(event):
# Dedupe on event["id"] here — delivery is at-least-once.
...
- Verify the signature over the raw body, before JSON parsing. Parsing first and re-serializing changes bytes and breaks the HMAC.
- Respond 2xx fast. Anything slow eats the subscriber's own latency budget and triggers our retries. Push real work onto a local queue.
-
Be idempotent on the event ID. Retries and at-least-once delivery mean the same event can arrive twice; dedupe on the
idfield.
Coalescing redundant events at the edge
One more lever specific to our stack: several events for the same video within a few seconds are usually redundant for cache-warming subscribers. A Cloudflare Worker sits in front of the cache-warm endpoint and collapses repeated video.rank_updated events for the same ID inside a short window, using the Cache API as a dedupe key. That cuts warm-path traffic during a spike by more than half without touching the core queue — the queue still records everything; only the edge consumer thins it.
Results
Since shipping this:
- Zero partner
429/503cascades across the last three viral spikes, including one clip that crossed two million views. - p95 delivery latency in steady state is under two seconds; during spikes it degrades gracefully to tens of seconds rather than dropping events.
- The dead-letter rate sits near zero, and the few rows that land there are genuinely dead endpoints, not transient failures.
Conclusion
The core idea is small: never deliver on the request that emits the event. Put a durable SQLite queue in the middle, govern each subscriber with a token bucket whose state outlives the process, sign every request, and back off with jitter. PHP 8.4 and SQLite WAL handle the steady state on cheap LiteSpeed hosting, a single Go process absorbs the spikes, and a Cloudflare Worker thins redundant edge traffic. None of it is exotic — it is just the difference between celebrating a viral video and apologizing to a partner whose ingest you melted. If you run event fan-out at any real volume, build the queue and the limiter before you need them, not during your first big spike.
Top comments (0)