Building a Video URL Canonicalization Pipeline for Deduplicated Ingest

#php #sqlite #go #webdev

The duplicate that wasn't a duplicate

Last quarter our ingest counters at ViralVidVault told a confusing story: the videos table grew by roughly 38% week-over-week, but unique watch sessions barely moved. When I pulled a sample, the cause was embarrassingly mundane. The same YouTube clip had landed in our database eleven times. Once as https://www.youtube.com/watch?v=dQw4w9WgXcQ, once as https://youtu.be/dQw4w9WgXcQ, once with ?feature=share, once with a UTM-stuffed referral query string from a partner feed, once as youtube.com/shorts/dQw4w9WgXcQ, once through m.youtube.com, and several more with trailing slashes, fragment timestamps, and mixed-case hosts.

Every one of those rows pointed at the same underlying asset. To a trend-tracking product that ranks videos by velocity, eleven phantom rows is not a cosmetic bug. It splits engagement signals across duplicates, poisons the "trending" ranking, wastes thumbnail-fetch quota, and inflates storage. For a GDPR-first European product where we keep ingest lean on purpose, it also meant we were storing more URLs than we needed to justify.

The fix is a canonicalization pipeline: a deterministic function that maps any incoming video URL to one stable identity, applied at the edge and again at write time, backed by a database constraint that makes duplicates structurally impossible. This post walks through how we built ours on PHP 8.4, SQLite in WAL mode, LiteSpeed, and a Cloudflare Worker at the edge.

What "canonical" actually means for a video URL

Canonicalization is not the same as URL normalization. Normalization (lowercasing the host, resolving . and .. path segments, percent-encoding consistently) is the generic, RFC 3986-flavored part. Canonicalization is the domain-specific part: knowing that for YouTube the only thing that identifies the asset is the 11-character video ID, and that everything else in the URL is noise we are free to discard.

So our pipeline has two layers:

Structural normalization that applies to every URL regardless of platform: lowercase scheme and host, strip the default port, drop the fragment, sort and filter query parameters, remove tracking junk.
Platform extraction that recognizes a known provider (YouTube, Vimeo, Dailymotion, TikTok, and a few European players like Rutube-style hosts) and reduces the URL to a provider:id tuple plus a single rebuilt canonical URL.

The output of the pipeline is two things we store on every row: a short opaque canonical_key (used as a unique index) and a clean canonical_url (used for display, embeds, and outbound links). The key is what prevents duplicates; the URL is what humans and crawlers see.

A few rules we settled on after getting burned:

The canonical key must be stable across reorderings. ?a=1&b=2 and ?b=2&a=1 are the same resource.
It must be case-correct, not case-blind. Hosts are case-insensitive, but YouTube video IDs are case-sensitive. dQw4w9WgXcQ is not dqw4w9wgxcq. Lowercasing the whole URL is a classic way to silently corrupt your identity layer.
It must fail safe. An unrecognized provider should still produce a deterministic key from the normalized URL, never throw, never collapse two genuinely different videos into one.

The structural normalizer in PHP 8.4

Here is the base normalizer. It does the provider-agnostic work and returns a structured value object. I am leaning on PHP 8.4 features: readonly properties, constructor promotion, and a typed enum for the provider.

<?php
declare(strict_types=1);

enum VideoProvider: string
{
    case YouTube     = 'youtube';
    case Vimeo       = 'vimeo';
    case Dailymotion = 'dailymotion';
    case TikTok      = 'tiktok';
    case Unknown     = 'unknown';
}

final readonly class CanonicalVideo
{
    public function __construct(
        public VideoProvider $provider,
        public string $videoId,
        public string $canonicalUrl,
        public string $canonicalKey,
    ) {}
}

final class UrlNormalizer
{
    /** Query params we always strip before hashing. */
    private const TRACKING_PREFIXES = ['utm_', 'fbclid', 'gclid', 'mc_', '_hs'];
    private const TRACKING_EXACT = [
        'feature', 'app', 'si', 'pp', 'ref', 'ref_src',
        'source', 'spm', 'igshid', 'context',
    ];

    /** Normalize host + path + query. Does NOT touch provider semantics. */
    public function normalize(string $raw): array
    {
        $raw = trim($raw);
        $parts = parse_url($raw);
        if ($parts === false || !isset($parts['host'])) {
            throw new InvalidArgumentException("unparseable url: {$raw}");
        }

        $scheme = strtolower($parts['scheme'] ?? 'https');
        $host   = strtolower($parts['host']);
        $host   = preg_replace('/^(www|m|mobile)\./', '', $host);

        // Drop default ports.
        $port = $parts['port'] ?? null;
        if (($scheme === 'https' && $port === 443) || ($scheme === 'http' && $port === 80)) {
            $port = null;
        }

        $path = $parts['path'] ?? '/';
        $path = rtrim($path, '/') ?: '/';

        $query = $this->cleanQuery($parts['query'] ?? '');

        return [
            'scheme' => $scheme,
            'host'   => $host,
            'port'   => $port,
            'path'   => $path,
            'query'  => $query, // associative, already filtered + sorted
        ];
    }

    private function cleanQuery(string $query): array
    {
        if ($query === '') {
            return [];
        }
        parse_str($query, $params);
        $kept = [];
        foreach ($params as $key => $value) {
            $lower = strtolower((string) $key);
            if (in_array($lower, self::TRACKING_EXACT, true)) {
                continue;
            }
            foreach (self::TRACKING_PREFIXES as $prefix) {
                if (str_starts_with($lower, $prefix)) {
                    continue 2;
                }
            }
            $kept[$key] = $value;
        }
        ksort($kept); // stable ordering -> stable hash
        return $kept;
    }
}

The important details are the small ones. We strip www, m, and mobile host prefixes because they are routing artifacts, not identity. We drop default ports so :443 and the absence of a port hash identically. We rtrim trailing slashes so /watch and /watch/ agree. And we ksort the surviving query parameters so the hash is independent of parameter order. Tracking parameters get dropped wholesale; that single step killed about 60% of our duplicates because partner feeds love appending utm_*.

Platform extraction: reducing to provider plus ID

The normalizer gives us a clean URL skeleton. The extractor turns it into identity. This is the part most people get wrong by trying to write one mega-regex. Instead, dispatch on host, then apply the smallest rule that matches.

<?php
declare(strict_types=1);

final class VideoCanonicalizer
{
    public function __construct(private UrlNormalizer $normalizer) {}

    public function canonicalize(string $raw): CanonicalVideo
    {
        $n = $this->normalizer->normalize($raw);
        [$provider, $id] = $this->extract($n['host'], $n['path'], $n['query']);

        if ($provider !== VideoProvider::Unknown && $id !== '') {
            $url = $this->rebuild($provider, $id);
            $key = $provider->value . ':' . $id;
            return new CanonicalVideo($provider, $id, $url, $key);
        }

        // Fail-safe path: deterministic key from the normalized skeleton.
        $url = $this->rebuildGeneric($n);
        $key = 'url:' . substr(hash('sha256', $url), 0, 24);
        return new CanonicalVideo(VideoProvider::Unknown, '', $url, $key);
    }

    /** @return array{0: VideoProvider, 1: string} */
    private function extract(string $host, string $path, array $query): array
    {
        return match (true) {
            str_ends_with($host, 'youtube.com') => $this->youtube($path, $query),
            $host === 'youtu.be'                => [VideoProvider::YouTube, ltrim($path, '/')],
            str_ends_with($host, 'vimeo.com')   => [VideoProvider::Vimeo, $this->lastSegment($path)],
            str_ends_with($host, 'dailymotion.com') => [VideoProvider::Dailymotion, $this->dmId($path)],
            $host === 'dai.ly'                  => [VideoProvider::Dailymotion, ltrim($path, '/')],
            str_ends_with($host, 'tiktok.com')  => [VideoProvider::TikTok, $this->tiktokId($path)],
            default                             => [VideoProvider::Unknown, ''],
        };
    }

    private function youtube(string $path, array $query): array
    {
        // /watch?v=ID, /shorts/ID, /embed/ID, /live/ID
        if ($path === '/watch' && isset($query['v'])) {
            return [VideoProvider::YouTube, (string) $query['v']];
        }
        if (preg_match('#^/(shorts|embed|live|v)/([A-Za-z0-9_-]{11})#', $path, $m)) {
            return [VideoProvider::YouTube, $m[2]];
        }
        return [VideoProvider::Unknown, ''];
    }

    private function dmId(string $path): string
    {
        // /video/x8abcde or /video/x8abcde_title-slug
        if (preg_match('#/video/([a-z0-9]+)#i', $path, $m)) {
            return $m[1];
        }
        return '';
    }

    private function tiktokId(string $path): string
    {
        return preg_match('#/video/(\d+)#', $path, $m) ? $m[1] : '';
    }

    private function lastSegment(string $path): string
    {
        $seg = substr(strrchr($path, '/') ?: '', 1);
        return ctype_digit($seg) ? $seg : '';
    }

    private function rebuild(VideoProvider $p, string $id): string
    {
        return match ($p) {
            VideoProvider::YouTube     => "https://www.youtube.com/watch?v={$id}",
            VideoProvider::Vimeo       => "https://vimeo.com/{$id}",
            VideoProvider::Dailymotion => "https://www.dailymotion.com/video/{$id}",
            VideoProvider::TikTok      => "https://www.tiktok.com/video/{$id}",
            VideoProvider::Unknown     => '',
        };
    }

    private function rebuildGeneric(array $n): string
    {
        $url = $n['scheme'] . '://' . $n['host'];
        if ($n['port'] !== null) {
            $url .= ':' . $n['port'];
        }
        $url .= $n['path'];
        if ($n['query'] !== []) {
            $url .= '?' . http_build_query($n['query']);
        }
        return $url;
    }
}

Notice we rebuild a fresh canonical URL from the ID rather than reusing the input. This matters: an attacker (or a sloppy partner) cannot smuggle an open-redirect or a malicious query string into our stored URL, because we throw the input away once we have the ID. The stored URL is always one we constructed from a hardcoded template. For the unknown-provider case we hash the normalized skeleton into a 24-character key, which is collision-resistant enough for an ingest table and short enough to index cheaply.

Making duplicates impossible at the storage layer

Application-level dedup is necessary but not sufficient. Two ingest workers can canonicalize the same URL concurrently, both check "does this key exist," both see no, and both insert. The only reliable guard is a database constraint. With SQLite in WAL mode this is cheap and the concurrency story is good: WAL lets readers continue while a writer commits, which suits our bursty fetch-then-write ingest pattern.

CREATE TABLE IF NOT EXISTS videos (
    id            INTEGER PRIMARY KEY,
    canonical_key TEXT NOT NULL,
    canonical_url TEXT NOT NULL,
    provider      TEXT NOT NULL,
    video_id      TEXT NOT NULL DEFAULT '',
    title         TEXT,
    first_seen    INTEGER NOT NULL,
    last_seen     INTEGER NOT NULL,
    view_velocity REAL NOT NULL DEFAULT 0
);

-- The structural guard: one row per canonical identity.
CREATE UNIQUE INDEX IF NOT EXISTS idx_videos_canonical
    ON videos(canonical_key);

The write path then becomes an idempotent upsert. We never "check then insert"; we let the unique index arbitrate and use ON CONFLICT to merge signal into the existing row instead of failing:

INSERT INTO videos (canonical_key, canonical_url, provider, video_id, title, first_seen, last_seen)
VALUES (:key, :url, :provider, :vid, :title, :now, :now)
ON CONFLICT(canonical_key) DO UPDATE SET
    last_seen     = excluded.last_seen,
    view_velocity = view_velocity + 1,
    title         = COALESCE(videos.title, excluded.title);

This is the whole game. Even if ten workers race on the same trending clip, the database collapses them into one row and accumulates velocity correctly. The first writer sets first_seen; every subsequent sighting bumps last_seen and velocity. We get accurate trend tracking for free, and the storage footprint stays honest.

Pushing canonicalization to the edge

Most of our duplicate URLs arrive from share links and partner referrals that hit the site directly. Cleaning them at write time works, but we also normalize at the edge in a Cloudflare Worker so that cache keys collapse too. If youtu.be/X and youtube.com/watch?v=X produce different cache keys, our LiteSpeed origin gets hit twice for the same logical page. A tiny edge normalizer fixes the cache-hit ratio before a request ever reaches PHP.

The edge logic is intentionally a thin subset of the full pipeline; it only needs to agree on identity, not rebuild display URLs. We keep the authoritative version in Go because we also run a batch backfill tool that re-canonicalizes historical rows, and sharing a single tested implementation between the backfiller and an edge sidecar avoids drift.

package canon

import (
    "crypto/sha256"
    "encoding/hex"
    "net/url"
    "regexp"
    "strings"
)

var ytShort = regexp.MustCompile(`^/(shorts|embed|live|v)/([A-Za-z0-9_-]{11})`)

// Key returns a stable canonical key for a video URL, or a hashed
// fallback for unknown providers. It never panics on bad input.
func Key(raw string) string {
    u, err := url.Parse(strings.TrimSpace(raw))
    if err != nil || u.Host == "" {
        return hashKey(strings.ToLower(raw))
    }
    host := strings.ToLower(u.Host)
    host = strings.TrimPrefix(host, "www.")
    host = strings.TrimPrefix(host, "m.")

    switch {
    case strings.HasSuffix(host, "youtube.com"):
        if u.Path == "/watch" {
            if v := u.Query().Get("v"); v != "" {
                return "youtube:" + v
            }
        }
        if m := ytShort.FindStringSubmatch(u.Path); m != nil {
            return "youtube:" + m[2]
        }
    case host == "youtu.be":
        if id := strings.Trim(u.Path, "/"); id != "" {
            return "youtube:" + id
        }
    }
    // Fallback: scheme-stripped, host-normalized, path only.
    return hashKey(host + strings.TrimRight(u.Path, "/"))
}

func hashKey(s string) string {
    sum := sha256.Sum256([]byte(s))
    return "url:" + hex.EncodeToString(sum[:])[:24]
}

The Go and PHP implementations share a contract enforced by a shared fixtures file: a JSON list of input -> expected_key pairs that both test suites load. If the edge and the origin ever disagree on a key, the build fails. That fixture file has saved us twice when someone "improved" a regex on one side only.

Verifying canonicalization with a property test

The scary failure mode is not a missed duplicate; it is a false merge, where two genuinely different videos collapse into one key. Unit tests with hand-picked URLs catch the obvious cases, but I sleep better with a property-based check. The invariant: distinct video IDs must always produce distinct keys, and the same ID under any URL shape must always produce the same key. Here is the batch validator we run nightly against a sample of the live table, in Python against the SQLite file directly.

import sqlite3
import itertools

def load_pairs(db_path: str):
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT canonical_key, provider, video_id, canonical_url FROM videos"
    ).fetchall()
    con.close()
    return rows

def check_invariants(rows) -> list[str]:
    errors = []
    # Invariant 1: one key never maps to two different (provider, video_id).
    by_key: dict[str, set] = {}
    for r in rows:
        ident = (r["provider"], r["video_id"])
        by_key.setdefault(r["canonical_key"], set()).add(ident)
    for key, idents in by_key.items():
        # ignore unknown-provider rows, whose id is empty by design
        real = {i for i in idents if i[1] != ""}
        if len(real) > 1:
            errors.append(f"FALSE MERGE on {key}: {sorted(real)}")

    # Invariant 2: a known (provider, video_id) maps to exactly one key.
    by_ident: dict[tuple, set] = {}
    for r in rows:
        if r["video_id"] == "":
            continue
        ident = (r["provider"], r["video_id"])
        by_ident.setdefault(ident, set()).add(r["canonical_key"])
    for ident, keys in by_ident.items():
        if len(keys) > 1:
            errors.append(f"SPLIT IDENTITY for {ident}: {sorted(keys)}")
    return errors

if __name__ == "__main__":
    problems = check_invariants(load_pairs("data/videos.db"))
    if problems:
        for p in problems:
            print(p)
        raise SystemExit(1)
    print("canonicalization invariants hold")

Invariant 1 catches false merges; invariant 2 catches the opposite, an identity that somehow ended up under two keys (which would happen if you forgot to backfill after changing a rule). Running both nightly turns a silent data-corruption bug into a loud CI failure.

A few hard-won lessons

Never lowercase the whole URL. Host is case-insensitive; the video ID and path often are not. Lowercase the host only.
Rebuild, do not reuse. Construct the stored canonical URL from a template plus the extracted ID. It sanitizes input and gives you one display format forever.
Strip tracking before hashing. A single utm_* filter eliminated the majority of our duplicates. Maintain the strip-list as data, not code.
Let the database arbitrate races. A unique index plus ON CONFLICT DO UPDATE is more correct than any application-level check-then-insert, and under SQLite WAL it is fast.
Share one implementation across tiers. Edge (Go/Worker) and origin (PHP) must agree byte-for-byte on keys. Enforce it with a shared fixtures file in both test suites.
Test for false merges explicitly. The dangerous bug is collapsing two different videos, not missing one duplicate. A property check is worth more than fifty hand-written cases.

The pipeline took about two days to build and a week to roll out behind a backfill. Our videos table shrank by roughly a third overnight, the trending ranking stabilized because engagement stopped fragmenting across duplicates, and our Cloudflare cache-hit ratio climbed several points because share-link variants now collapse to one cache key. For a lean, GDPR-conscious product, that is the kind of change that pays for itself in storage, quota, and signal quality all at once. Canonicalization is unglamorous plumbing, but getting your identity layer right is the difference between a trend tracker that measures reality and one that measures its own duplicates.