ahmet gedik

Posted on Jun 4

Building an Event Sourced Video Moderation Audit Log with PHP and SQLite

#php #sqlite #eventsourcing #moderation

The 3am call from a Japanese rightsholder

Last quarter a Japanese label sent a takedown notice for fourteen music videos we'd been aggregating across our trending feeds. Standard request — except the email landed at 03:14 JST and the legal contact wanted to know not just that we'd removed the videos, but who had hidden them previously, when they'd been re-enabled, which moderator approved the rehab, and what reason was attached to each transition. Our audit log at the time was a single moderation_log row per video with a last_action column and a timestamp. It told the rightsholder almost nothing.

That morning TopVideoHub shipped the first version of an event-sourced moderation log. Six months in, it has paid for itself four times over — DMCA responses, KCC inquiries from Korea, an internal investigation about a moderator who was force-hiding K-pop clips, and a rollback after we shipped a bad auto-moderation rule. The pattern is well known in finance and banking and curiously underused in content platforms. This post is the version I wish someone had written for me before that Tokyo email arrived.

Stack context, since it matters for what follows: PHP 8.4 on LiteSpeed, SQLite as the only persistent store (with FTS5 + the unicode61 tokenizer for CJK search), Cloudflare in front. Single-region origin, multi-region trending ingestion across nine APAC markets. No Kafka, no Postgres, no separate audit service.

Why the CRUD log failed

The old schema was the obvious one — and it broke in three specific ways.

-- The broken design (don't do this)
CREATE TABLE moderation_log (
    video_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,       -- 'visible' | 'hidden' | 'blocked'
    last_actor TEXT,
    last_reason TEXT,
    last_action_at INTEGER NOT NULL,
    region_block_mask INTEGER DEFAULT 0
);

First, history was destructive. Every UPDATE clobbered the previous state. When a video had been hidden, restored, hidden again, and finally region-blocked in JP/KR, the row showed only the last transition. We had no way to answer "was this video ever visible to Korean viewers between March 1 and April 3?" without trawling Cloudflare logs, which only retain seven days on the free tier.

Second, multi-actor scenarios got smeared. An auto-moderation rule and a human reviewer often touched the same video in the same hour. The CRUD row attributed the final state to whichever wrote last, hiding the fact that a human had already approved the video and an automated rule overrode the decision. That is exactly the kind of behavior you want to detect.

Third, region-specific decisions were unrepresentable. APAC moderation is not binary. A Taiwanese variety clip might be visible in TW/HK/SG but blocked in CN because of cross-strait sensitivities; a Japanese drama might be allowed only in JP/KR due to licensing. Encoding this as a bitmask kept failing because the reason per region varied. We needed to attach a justification to each regional transition, not just to the video as a whole.

Event sourcing solves all three because every change becomes an immutable, append-only fact with its own actor, reason, region scope, and timestamp.

Event schema in SQLite

The core insight is that the table is not the model — the stream is. Each row in moderation_events is a single observed fact. The current state of a video is a derived value, computed by folding events in order.

CREATE TABLE moderation_events (
    event_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    video_id      TEXT    NOT NULL,
    sequence      INTEGER NOT NULL,             -- per-video monotonic
    event_type    TEXT    NOT NULL,
    actor_type    TEXT    NOT NULL,             -- 'human' | 'rule' | 'system'
    actor_id      TEXT    NOT NULL,
    region_scope  TEXT    NOT NULL DEFAULT '*', -- '*' or CSV like 'JP,KR'
    reason_code   TEXT    NOT NULL,
    reason_text   TEXT,
    payload_json  TEXT,
    occurred_at   INTEGER NOT NULL,
    recorded_at   INTEGER NOT NULL,
    UNIQUE (video_id, sequence)
) STRICT;

CREATE INDEX idx_events_video_seq ON moderation_events(video_id, sequence);
CREATE INDEX idx_events_occurred  ON moderation_events(occurred_at);
CREATE INDEX idx_events_actor     ON moderation_events(actor_type, actor_id);

-- FTS5 over the human-readable reason, with unicode61 for CJK
CREATE VIRTUAL TABLE moderation_events_fts USING fts5(
    reason_text,
    content='moderation_events',
    content_rowid='event_id',
    tokenize = 'unicode61 remove_diacritics 2'
);

CREATE TRIGGER moderation_events_ai AFTER INSERT ON moderation_events BEGIN
    INSERT INTO moderation_events_fts(rowid, reason_text)
    VALUES (new.event_id, new.reason_text);
END;

The event types we settled on after three months of iteration:

video.published — video first became eligible for the feed
video.hidden — globally hidden, with reason
video.restored — un-hidden, partial or full
region.blocked — blocked in one or more regions
region.unblocked — inverse
flag.raised — a moderator or rule flagged it for review
flag.resolved — review decision recorded
metadata.amended — title/description corrected
legal.takedown_received — DMCA-style notice attached
legal.takedown_reversed — counter-notice or expiry

Keep the list small. Ten events cover every moderation scenario we have hit; tempting additions like video.scored belong in a separate analytics stream, not the audit log.

STRICT mode is worth calling out. SQLite is famously type-lax, which is fine for cache tables but disastrous for an audit log where you might one day join an event to a court filing. STRICT gives you actual column types in 3.37+, and LiteSpeed's bundled SQLite has supported it since late 2022.

Appending events safely

Two invariants matter: events must be append-only, and the per-video sequence must be gap-free and monotonic. The sequence is what lets a projection know whether it has seen all events for a video.

<?php
declare(strict_types=1);

namespace App\Moderation;

use PDO;
use RuntimeException;

final class EventStore
{
    public function __construct(private readonly PDO $db) {}

    public function append(
        string $videoId,
        string $eventType,
        string $actorType,
        string $actorId,
        string $reasonCode,
        ?string $reasonText = null,
        string $regionScope = '*',
        array $payload = [],
        ?int $occurredAt = null,
    ): int {
        $occurredAt ??= time();
        $recordedAt  = time();

        $this->db->beginTransaction();
        try {
            $nextStmt = $this->db->prepare(
                'SELECT COALESCE(MAX(sequence), 0) + 1
                   FROM moderation_events
                  WHERE video_id = :vid'
            );
            $nextStmt->execute([':vid' => $videoId]);
            $next = (int) $nextStmt->fetchColumn();

            $stmt = $this->db->prepare(
                'INSERT INTO moderation_events
                 (video_id, sequence, event_type, actor_type, actor_id,
                  region_scope, reason_code, reason_text, payload_json,
                  occurred_at, recorded_at)
                 VALUES (?,?,?,?,?,?,?,?,?,?,?)'
            );

            $stmt->execute([
                $videoId, $next, $eventType, $actorType, $actorId,
                $regionScope, $reasonCode, $reasonText,
                $payload ? json_encode($payload, JSON_UNESCAPED_UNICODE) : null,
                $occurredAt, $recordedAt,
            ]);

            $eventId = (int) $this->db->lastInsertId();
            $this->db->commit();
            return $eventId;
        } catch (\Throwable $e) {
            $this->db->rollBack();
            if (str_contains($e->getMessage(), 'UNIQUE')) {
                throw new RuntimeException(
                    "Concurrent moderation write for {$videoId}, retry",
                    409, $e
                );
            }
            throw $e;
        }
    }
}

Two things deserve explanation. The transaction wraps the MAX(sequence) + 1 read and the insert because SQLite's WAL mode gives you snapshot isolation but not gap-free sequence allocation. If two requests race to append events for the same video, one will hit the UNIQUE(video_id, sequence) constraint and must retry. We bubble that as HTTP 409 and the caller retries with exponential backoff — typically resolves in one retry because video-scoped concurrency is low.

We deliberately keep occurred_at and recorded_at separate. Backfilling events from older logs (we imported six months of legacy moderation_log rows on day one) sets occurred_at to the historical timestamp while recorded_at reflects the actual write. Any analysis that asks "what did we know on date X" uses recorded_at; anything modeling actual moderation timing uses occurred_at.

Projecting current state

Reading the current state of a video means folding its events:

<?php
declare(strict_types=1);

namespace App\Moderation;

use PDO;

final class VideoStateProjector
{
    public function __construct(private readonly PDO $db) {}

    public function currentState(string $videoId): VideoState
    {
        $rows = $this->db->prepare(
            'SELECT event_type, region_scope, reason_code, reason_text,
                    actor_type, actor_id, occurred_at
               FROM moderation_events
              WHERE video_id = :vid
              ORDER BY sequence ASC'
        );
        $rows->execute([':vid' => $videoId]);

        $state = new VideoState($videoId);

        foreach ($rows as $e) {
            match ($e['event_type']) {
                'video.published'         => $state->markPublished(),
                'video.hidden'            => $state->hideGlobally(
                    $e['reason_code'], $e['actor_id'], (int) $e['occurred_at']
                ),
                'video.restored'          => $state->restore(
                    $e['actor_id'], (int) $e['occurred_at']
                ),
                'region.blocked'          => $state->blockRegions(
                    explode(',', $e['region_scope']),
                    $e['reason_code'], $e['actor_id']
                ),
                'region.unblocked'        => $state->unblockRegions(
                    explode(',', $e['region_scope'])
                ),
                'flag.raised'             => $state->raiseFlag(
                    $e['reason_code'], $e['actor_id']
                ),
                'flag.resolved'           => $state->resolveFlag(
                    $e['reason_code'], $e['actor_id']
                ),
                'legal.takedown_received' => $state->attachTakedown(
                    $e['reason_text'], (int) $e['occurred_at']
                ),
                'legal.takedown_reversed' => $state->reverseTakedown(),
                default                   => null,
            };
        }
        return $state;
    }
}

The class VideoState is a plain value object — visible-in-region map, current flags, attached takedowns, the actor who last touched each region. Building it from events on every request would be wasteful for the hot trending feed, so we keep a denormalized projection table:

CREATE TABLE video_moderation_state (
    video_id        TEXT PRIMARY KEY,
    is_visible      INTEGER NOT NULL DEFAULT 1,
    blocked_regions TEXT    NOT NULL DEFAULT '',
    has_takedown    INTEGER NOT NULL DEFAULT 0,
    last_event_seq  INTEGER NOT NULL DEFAULT 0,
    updated_at      INTEGER NOT NULL
) STRICT;

After each append(), a single UPSERT updates the projection. Crash recovery is trivial — if the projection ever falls behind, replay all events for that video where sequence > last_event_seq. We run a nightly self-check that compares MAX(sequence) in moderation_events against last_event_seq in the projection for every video; mismatches replay automatically and page on-call only if replay fails twice. In six months we have had three mismatches, all caused by a SIGKILL during deploy, all auto-repaired.

The feed query reads only the projection, which makes it indistinguishable in latency from the old CRUD design. LiteSpeed's page cache sees the same response shape, so cache hit ratios did not move.

Searching reasons across CJK content

The most-used query in the system, by a wide margin, is "show me every event where a moderator mentioned a specific term." Korean and Japanese reasons are normal — "저작권 침해 신고" (copyright infringement report) or "日本国内のライセンス未取得" (no license inside Japan). FTS5's unicode61 tokenizer handles Latin scripts well but is famously weak for CJK because it falls back to per-codepoint tokens for languages without spaces. There are three workable approaches; we use a hybrid.

-- Pure FTS5 query — works for Korean (spaces) and Latin
SELECT e.video_id, e.occurred_at, e.actor_id, e.reason_text
  FROM moderation_events_fts f
  JOIN moderation_events e ON e.event_id = f.rowid
 WHERE moderation_events_fts MATCH :term
 ORDER BY rank
 LIMIT 50;

-- For Japanese/Chinese without spaces, fall back to LIKE with a
-- pre-filtered set so we never scan the whole table.
SELECT video_id, occurred_at, actor_id, reason_text
  FROM moderation_events
 WHERE actor_type IN ('human', 'rule')
   AND occurred_at >= :since
   AND reason_text LIKE :needle ESCAPE '\'
 ORDER BY occurred_at DESC
 LIMIT 50;

The Python sync job that imports moderation reasons from our queue normalizes everything to NFC and strips zero-width characters before insertion. That alone resolved most of our "search misses obvious matches" complaints — moderators were pasting reasons from Slack which used different Unicode normalization than our admin UI.

import sqlite3
import unicodedata
from typing import Iterable

ZW_CHARS = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_reason(text: str | None) -> str | None:
    if not text:
        return text
    cleaned = "".join(c for c in text if c not in ZW_CHARS)
    return unicodedata.normalize("NFC", cleaned).strip()

def import_pending(db_path: str, batch: Iterable[dict]) -> int:
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    written = 0
    for event in batch:
        reason = normalize_reason(event.get("reason_text"))
        conn.execute(
            """INSERT INTO moderation_events
               (video_id, sequence, event_type, actor_type, actor_id,
                region_scope, reason_code, reason_text, payload_json,
                occurred_at, recorded_at)
               VALUES (?, (SELECT COALESCE(MAX(sequence),0)+1
                             FROM moderation_events WHERE video_id=?),
                       ?, ?, ?, ?, ?, ?, ?, ?, strftime('%s','now'))""",
            (event["video_id"], event["video_id"],
             event["event_type"], event["actor_type"], event["actor_id"],
             event.get("region_scope", "*"),
             event["reason_code"], reason,
             event.get("payload"),
             event["occurred_at"]),
        )
        written += 1
    return written

The sub-select for sequence allocation works inside a single connection because SQLite serializes writers; the Python ingestion runs single-threaded by design. For PHP web traffic we use the explicit transaction shown earlier because requests are concurrent.

Replaying for compliance reports

Here is where event sourcing earns its keep. A Korean broadcaster asked us to prove that a specific clip was not visible to KR viewers between 14:00 and 18:00 on a specific date. With the CRUD log this is unanswerable. With events it is a fold over a time window — return the list of [start, end] Unix timestamps during which the video was visible in the requested region within the window. Empty list means "never visible during this period," which is exactly the answer the broadcaster needed, signed by us as a JSON document. Total query time for a typical video with thirty-odd events: under 2 ms on our origin box.

The code is straightforward: stream events for the video ordered by sequence, maintain two boolean flags (isGloballyVisible and regionBlocked), and emit a closed interval each time the derived visibility flips. Because every event is immutable, the answer is reproducible — running the query a month later against the same database yields the same intervals, which is the property a legal team actually wants when they sign off on a compliance response.

Operational notes after six months

A handful of things that were not obvious upfront and only emerged from running this in production.

Event volume is modest. We average around 9,000 events per day across roughly half a million tracked videos. The table is 12 MB at six months. SQLite handles this without any partitioning; we have not yet needed monthly rollover tables.
Cloudflare caching is unaffected. Moderation state lives in the projection table, which is read by exactly the same endpoints as before. Cache TTLs (max-age=21600 on watch pages, max-age=10800 on category pages) work identically. The event store is only read by admin and compliance routes which were already Cache-Control: private.
The hardest bug was clock skew. Two of our cron hosts had clocks drifting by 40+ seconds. Events with occurred_at ordering disagreed with sequence ordering, which broke time-window queries when events were close together. Fix: trust sequence for ordering, treat occurred_at purely as display metadata. NTP on all hosts is now part of the deploy checklist.
STRICT mode caught two type bugs in week one. A moderator tool was passing region_scope as a Python list serialized by str() instead of a CSV string, producing "['JP', 'KR']". STRICT did not catch that — both are TEXT — but a follow-up CHECK constraint regex did.
VACUUM the events table never. Append-only tables fragment less than you would think; PRAGMA optimize after the nightly cron is enough. A misguided VACUUM once locked the DB for 47 seconds and we got our first ever 503 wave from LiteSpeed.

Migration playbook for existing CRUD logs

If you have an existing moderation_log table, the cleanest path I have found is this. Run it in a maintenance window of about ten minutes for a million-row source table.

Create the new tables and projection with the schema above.
For each row in the legacy table, emit a synthetic video.published event at the video's creation time, then a single event matching its current state (video.hidden, region.blocked, etc.) at the legacy last_action_at. Set actor_id = 'legacy-import', actor_type = 'system'.
Run the projector over every video to populate video_moderation_state.
Switch the read path to query the projection. The schema is intentionally compatible with the old CRUD shape, so adapters are short.
Switch the write path to call EventStore::append() instead of UPDATE moderation_log. Keep the old table for a release cycle as a sanity check.
After two weeks of clean reconciliation, drop moderation_log.

The whole migration took us a single afternoon; the cleanup was a one-line DROP TABLE three weeks later.

Closing

Event sourcing has a reputation for being a heavyweight pattern that demands Kafka and CQRS and a team to maintain it. None of that is true. The implementation above is roughly 600 lines of PHP, two SQLite tables, one virtual FTS5 table, and a Python sync script — running on the same single-origin box that serves every page on the site. What it gives you is the ability to answer questions you did not anticipate when you designed the schema: who, when, why, in which region, under whose authority. That is the question every legal team eventually asks, and having a real answer instead of an apologetic email is what makes it worth the afternoon of plumbing.

DEV Community