Building a Video URL Canonicalization Pipeline for a Discovery Platform

#php #sqlite #backend #webdev

A single YouTube video can reach our crawler under a dozen different URLs. https://www.youtube.com/watch?v=dQw4w9WgXcQ, https://youtu.be/dQw4w9WgXcQ?t=43, https://m.youtube.com/watch?v=dQw4w9WgXcQ&feature=share&utm_source=newsletter, the /shorts/ variant, the embed iframe src, and the consent-redirect wrapper that Google bounces EU traffic through. They all point at the same 3 minutes 33 seconds of video. If you treat those as distinct rows, you end up with six near-duplicate cards on a discovery page, six entries fighting each other in search ranking, and a deduplication job that gets slower every week.

I run DailyWatch, a free video discovery platform that ingests links from a lot of sources, and URL canonicalization turned out to be one of those unglamorous pieces of infrastructure that quietly decides whether the whole product feels coherent or broken. This post is the pipeline we settled on: how we normalize, extract a stable identity, deduplicate, and store it in a way that survives the messiness of real-world URLs. The stack is PHP 8.4 with SQLite (FTS5 for the search side), fronted by LiteSpeed and Cloudflare, but the ideas port cleanly to anything.

The problem is identity, not strings

The mistake I made early was treating this as a string-cleaning problem: strip tracking params, lowercase the host, done. That works until you realize the thing you actually care about is the video's identity, and the URL is just one lossy encoding of it. Two URLs are the same video if they resolve to the same (platform, video_id) pair, regardless of how the path, query string, or host is shaped.

So the pipeline has two distinct jobs that people tend to conflate:

Normalization — produce a canonical URL string that is safe to store, display, and link to. This is reversible-ish and human-facing.
Identity extraction — produce a canonical key like youtube:dQw4w9WgXcQ that is the deduplication primary key. This is not a URL; it never gets shown to a user.

Keeping these separate is the single most important design decision. The canonical URL can change (YouTube could rename a host) without breaking your dedup history, because dedup keys off the identity, not the string.
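
To make the distinction concrete, here is a minimal Python sketch (Python because the crawler side already uses it). The `identity_key` helper is a hypothetical toy that only understands three YouTube URL shapes; it is not the production extractor, just an illustration that distinct strings collapse to one key.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

YT_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def identity_key(url: str) -> Optional[str]:
    """Toy extractor (hypothetical): handles only three YouTube URL shapes."""
    parts = urlsplit(url)
    # urlsplit lowercases the hostname; strip the www./m. prefix for comparison.
    host = re.sub(r"^(www|m)\.", "", parts.hostname or "")
    if host == "youtu.be":
        candidate = parts.path.lstrip("/")
    elif host == "youtube.com" and parts.path == "/watch":
        candidate = parse_qs(parts.query).get("v", [""])[0]
    elif host == "youtube.com" and parts.path.startswith("/shorts/"):
        candidate = parts.path[len("/shorts/"):]
    else:
        return None
    return f"youtube:{candidate}" if YT_ID.fullmatch(candidate) else None


variants = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://youtu.be/dQw4w9WgXcQ?t=43",
    "https://m.youtube.com/watch?v=dQw4w9WgXcQ&feature=share&utm_source=newsletter",
    "https://www.youtube.com/shorts/dQw4w9WgXcQ",
]

# Four different strings, one identity.
keys = {identity_key(u) for u in variants}
```

As strings, the four variants are all different; as identities, they are one row.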

Step one: normalize the URL string

Normalization is a fixed sequence of transformations applied in order. Order matters — you cannot strip query params before you've parsed the URL, and you cannot decide which params to keep before you know the host. Here's the core normalizer we run on ingest:

<?php
declare(strict_types=1);

final class UrlNormalizer
{
    // Params that are pure tracking noise across every platform.
    private const GLOBAL_DROP = [
        'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
        'feature', 'fbclid', 'gclid', 'spm', 'share', 'si', 'app',
    ];

    // Params that carry meaning and must survive, keyed by host family.
    private const KEEP = [
        'youtube' => ['v', 't', 'list'],
        'vimeo'   => [],
        'dailymotion' => [],
    ];

    public function normalize(string $raw): ?string
    {
        $raw = trim($raw);
        if ($raw === '') {
            return null;
        }

        // Force a scheme so parse_url behaves predictably on bare hosts.
        if (!preg_match('#^https?://#i', $raw)) {
            $raw = 'https://' . $raw;
        }

        $parts = parse_url($raw);
        if ($parts === false || empty($parts['host'])) {
            return null;
        }

        $host = $this->normalizeHost($parts['host']);
        $family = $this->hostFamily($host);

        // Always https; any explicit port in the input is dropped.
        $scheme = 'https';
        $path = $this->normalizePath($parts['path'] ?? '/');
        $query = $this->filterQuery($parts['query'] ?? '', $family);

        $url = $scheme . '://' . $host . $path;
        if ($query !== '') {
            $url .= '?' . $query;
        }
        return $url;
    }

    private function normalizeHost(string $host): string
    {
        $host = strtolower(rtrim($host, '.'));
        // Strip the www / mobile prefix, then fold YouTube subdomains together.
        $host = preg_replace('#^(m|mobile|www)\.#', '', $host) ?? $host;
        // Compare whole domain labels: a substring test would also rewrite
        // look-alike hosts such as notyoutube.example to youtube.com.
        if ($host === 'youtube.com' || str_ends_with($host, '.youtube.com')
            || $host === 'youtube-nocookie.com') {
            return 'youtube.com';
        }
        return $host;
    }

    private function hostFamily(string $host): string
    {
        return match (true) {
            $host === 'youtube.com', $host === 'youtu.be'
                => 'youtube',
            $host === 'vimeo.com', str_ends_with($host, '.vimeo.com')
                => 'vimeo',
            $host === 'dailymotion.com', str_ends_with($host, '.dailymotion.com')
                => 'dailymotion',
            default => 'unknown',
        };
    }

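    private function normalizePath(string $path): string
    {
        // Decode exactly once, collapse duplicate slashes, then re-encode each
        // segment so decoded spaces, '?', '#' and '%' cannot corrupt the URL.
        $path = rawurldecode($path);
        $path = preg_replace('#/{2,}#', '/', $path) ?? $path;
        $path = implode('/', array_map(rawurlencode(...), explode('/', $path)));
        return $path === '' ? '/' : $path;
    }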

    private function filterQuery(string $query, string $family): string
    {
        if ($query === '') {
            return '';
        }
        parse_str($query, $params);

        // null means "unknown family"; an empty list means "known, keep nothing".
        $keep = self::KEEP[$family] ?? null;
        $out = [];
        foreach ($params as $k => $v) {
            $k = (string) $k;
            if (in_array($k, self::GLOBAL_DROP, true)) {
                continue;
            }
            // For known families, allowlist; for unknown, just drop tracking.
            if ($keep !== null && !in_array($k, $keep, true)) {
                continue;
            }
            $out[$k] = $v;
        }

        // Deterministic ordering so identical inputs always hash identically.
        ksort($out);
        return http_build_query($out);
    }
}

A few decisions in there are deliberate and worth calling out:

Allowlist for known platforms, blocklist for unknown ones. For YouTube we know exactly which params matter (v, t, list), so we drop everything else. For a host we've never seen, we don't dare strip params we don't understand — we only remove the universally-useless tracking ones.
ksort before rebuilding the query. Two URLs that differ only in param order must normalize to the same string. Without this, ?a=1&b=2 and ?b=2&a=1 produce different hashes and your dedup leaks.
Decode-then-encode the path exactly once. Double-decoding is a security and correctness hazard — %252F should not become /.
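
The last two points are easy to check in isolation. Below is a minimal Python sketch of the same two rules, with hypothetical helper names (`canonical_query`, `canonical_path`) and a shortened tracking list; it mirrors the idea, not the PHP class line for line.

```python
from urllib.parse import parse_qsl, quote, unquote, urlencode

# Shortened, illustrative tracking list (assumption, not the full GLOBAL_DROP).
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def canonical_query(query: str) -> str:
    """Drop tracking params, then sort so param order cannot change the result."""
    pairs = [
        (k, v)
        for k, v in parse_qsl(query, keep_blank_values=True)
        if k not in TRACKING
    ]
    return urlencode(sorted(pairs))


def canonical_path(path: str) -> str:
    """Decode exactly once, then re-encode every segment."""
    segments = unquote(path).split("/")
    return "/".join(quote(segment, safe="") for segment in segments)
```

With these, `?b=2&a=1` and `?a=1&utm_source=x&b=2` produce the same query string, and `/a%252Fb` comes back unchanged because the single decode yields `%2F` as literal text, which is then re-escaped rather than turned into a slash.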

Step two: extract a stable identity

The normalized URL is nice for display, but dedup runs on the identity key. The extractor is a set of small, ordered matchers — first one to claim the URL wins. I keep them ordered most-specific to least-specific so /shorts/ID doesn't get misread by a greedy watch?v= rule.

<?php
declare(strict_types=1);

final class VideoIdentity
{
    public function __construct(
        public readonly string $platform,
        public readonly string $videoId,
    ) {}

    public function key(): string
    {
        return $this->platform . ':' . $this->videoId;
    }
}

final class IdentityExtractor
{
    public function extract(string $normalizedUrl): ?VideoIdentity
    {
        $p = parse_url($normalizedUrl);
        if ($p === false) {
            return null;
        }
        $host = $p['host'] ?? '';
        $path = $p['path'] ?? '';
        parse_str($p['query'] ?? '', $q);

        // youtu.be/<id>
        if ($host === 'youtu.be' && preg_match('#^/([\w-]{11})/?$#D', $path, $m)) {
            return new VideoIdentity('youtube', $m[1]);
        }
        // youtube.com/shorts/<id>, /embed/<id>, /live/<id>, /v/<id>.
        // Path forms go first and are anchored, so a stray ?v= or trailing
        // junk after the ID cannot produce a wrong key.
        if ($host === 'youtube.com'
            && preg_match('#^/(?:shorts|embed|live|v)/([\w-]{11})/?$#D', $path, $m)) {
            return new VideoIdentity('youtube', $m[1]);
        }
        // youtube.com/watch?v=<id>; v[]=x parses to an array, so check the type.
        if ($host === 'youtube.com' && $path === '/watch'
            && is_string($q['v'] ?? null) && $this->isYtId($q['v'])) {
            return new VideoIdentity('youtube', $q['v']);
        }
        // vimeo.com/<numeric id>, also under /channels/... and player.vimeo.com/video/
        if ($this->onDomain($host, 'vimeo.com')
            && preg_match('#/(\d{6,})(?:/|$)#D', $path, $m)) {
            return new VideoIdentity('vimeo', $m[1]);
        }
        // dailymotion.com/video/<id>
        if ($this->onDomain($host, 'dailymotion.com')
            && preg_match('#/video/([a-z0-9]+)#i', $path, $m)) {
            return new VideoIdentity('dailymotion', strtolower($m[1]));
        }
        return null;
    }

    // True for the domain itself or any subdomain, never for look-alike hosts.
    private function onDomain(string $host, string $domain): bool
    {
        return $host === $domain || str_ends_with($host, '.' . $domain);
    }

    private function isYtId(string $v): bool
    {
        return (bool) preg_match('#^[\w-]{11}$#D', $v);
    }
}

The YouTube ID validation ([\w-]{11}) catches a surprising amount of garbage — truncated IDs from broken share buttons, IDs with a trailing punctuation mark glued on by a chat client, query strings where v is empty. If the ID doesn't match the exact shape, we reject it rather than store a poisoned key. A bad identity is worse than no identity, because it silently merges two unrelated videos.
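
The shape check lends itself to a small table of cases. This is a minimal Python sketch of the same rule; the `is_youtube_id` name and the case list are illustrative, and the character class is spelled out because Python's `\w` would also accept non-ASCII letters.

```python
import re

# Explicit ASCII class: exactly 11 characters from A-Z, a-z, 0-9, '_' and '-'.
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def is_youtube_id(candidate: str) -> bool:
    """Accept only a complete, exact-length ID; fullmatch rejects any extra text."""
    return YOUTUBE_ID.fullmatch(candidate) is not None


CASES = [
    ("dQw4w9WgXcQ", True),     # well-formed
    ("dQw4w9WgXc", False),     # truncated by a broken share button
    ("dQw4w9WgXcQ.", False),   # punctuation glued on by a chat client
    ("dQw4w9WgXcQ\n", False),  # trailing newline
    ("", False),               # empty v= parameter
]

failures = [(value, expected) for value, expected in CASES
            if is_youtube_id(value) != expected]
```

An empty `failures` list means every case behaved as labeled. New oddities from production logs become one more row in `CASES`.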

Step three: build the canonical URL from the identity

Here's the part people skip and regret. Don't store whatever messy URL came in, even after normalization. Once you have a clean identity, regenerate the canonical URL from it. This guarantees every YouTube video on the site uses the identical link format, which matters for SEO (one canonical per video) and for the rel=canonical tags we emit.

<?php
declare(strict_types=1);

final class CanonicalUrlBuilder
{
    public function build(VideoIdentity $id): string
    {
        return match ($id->platform) {
            'youtube'     => "https://www.youtube.com/watch?v={$id->videoId}",
            'vimeo'       => "https://vimeo.com/{$id->videoId}",
            'dailymotion' => "https://www.dailymotion.com/video/{$id->videoId}",
            default       => throw new \InvalidArgumentException("unknown platform {$id->platform}"),
        };
    }
}

Notice the canonical URL puts www. back on for YouTube and Dailymotion, even though the normalizer stripped it. The normalizer's host stripping is for comparison; the builder's job is to emit the form that the platform itself treats as canonical (and that won't trigger a redirect). Comparison form and display form are allowed to differ; that's the whole point of separating identity from string.
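
One invariant worth testing here is the round trip: a URL built from an identity should extract back to that same identity. The Python sketch below assumes the three canonical templates shown above. The `build` and `extract` helpers are hypothetical stand-ins that only recognize the canonical forms, not the full extractor.

```python
import re
from typing import Optional, Tuple

CANONICAL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={id}",
    "vimeo": "https://vimeo.com/{id}",
    "dailymotion": "https://www.dailymotion.com/video/{id}",
}

# Matchers for the canonical forms only; a real extractor handles many more shapes.
CANONICAL_PATTERNS = {
    "youtube": re.compile(r"https://www\.youtube\.com/watch\?v=([A-Za-z0-9_-]{11})"),
    "vimeo": re.compile(r"https://vimeo\.com/(\d+)"),
    "dailymotion": re.compile(r"https://www\.dailymotion\.com/video/([a-z0-9]+)"),
}


def build(platform: str, video_id: str) -> str:
    """Emit the display form; raises KeyError for an unknown platform."""
    return CANONICAL_TEMPLATES[platform].format(id=video_id)


def extract(url: str) -> Optional[Tuple[str, str]]:
    """Recover (platform, video_id) from a canonical URL, or None."""
    for platform, pattern in CANONICAL_PATTERNS.items():
        match = pattern.fullmatch(url)
        if match:
            return platform, match.group(1)
    return None


SAMPLES = [("youtube", "dQw4w9WgXcQ"), ("vimeo", "123456789"), ("dailymotion", "x2bsw7m")]
round_trips = all(extract(build(p, i)) == (p, i) for p, i in SAMPLES)
```

If the round trip ever fails, the builder is emitting a form the pipeline itself cannot read, and re-ingesting your own links would produce orphans.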

Step four: deduplicate at the database boundary

The identity key is a natural unique constraint. We let SQLite enforce it rather than checking-then-inserting in PHP, which would race under our cron-driven concurrent ingest. The schema:

CREATE TABLE IF NOT EXISTS videos (
    id            INTEGER PRIMARY KEY,
    identity_key  TEXT NOT NULL UNIQUE,   -- 'youtube:dQw4w9WgXcQ'
    platform      TEXT NOT NULL,
    video_id      TEXT NOT NULL,
    canonical_url TEXT NOT NULL,
    title         TEXT NOT NULL DEFAULT '',
    first_seen    INTEGER NOT NULL,
    last_seen     INTEGER NOT NULL
);

-- FTS5 mirror for search; kept in sync with videos via triggers (not shown).
CREATE VIRTUAL TABLE IF NOT EXISTS videos_fts USING fts5(
    title,
    content='videos',
    content_rowid='id'
);

Ingest becomes an idempotent upsert. If the identity already exists, we just bump last_seen (useful for trending) and never create a duplicate row:

<?php
declare(strict_types=1);

final class VideoIngestor
{
    public function __construct(
        private readonly \PDO $db,
        private readonly UrlNormalizer $normalizer,
        private readonly IdentityExtractor $extractor,
        private readonly CanonicalUrlBuilder $builder,
    ) {}

    public function ingest(string $rawUrl, int $now): ?int
    {
        $normalized = $this->normalizer->normalize($rawUrl);
        if ($normalized === null) {
            return null;
        }
        $identity = $this->extractor->extract($normalized);
        if ($identity === null) {
            return null;
        }
        $canonical = $this->builder->build($identity);

        // Distinct placeholders for the two timestamps: PDO only documents
        // reusing a named parameter as safe under emulated prepares.
        $stmt = $this->db->prepare(<<<SQL
            INSERT INTO videos
                (identity_key, platform, video_id, canonical_url, first_seen, last_seen)
            VALUES
                (:key, :platform, :vid, :url, :first_seen, :last_seen)
            ON CONFLICT(identity_key) DO UPDATE SET
                last_seen = excluded.last_seen
            RETURNING id
        SQL);

        $stmt->execute([
            ':key'        => $identity->key(),
            ':platform'   => $identity->platform,
            ':vid'        => $identity->videoId,
            ':url'        => $canonical,
            ':first_seen' => $now,
            ':last_seen'  => $now,
        ]);
        return (int) $stmt->fetchColumn();
    }
}

ON CONFLICT ... DO UPDATE with RETURNING id gives us insert-or-update in a single round trip (RETURNING needs SQLite 3.35 or newer), and the UNIQUE constraint makes the whole thing safe under concurrency without any application-level locking. SQLite allows only one writer at a time, so set a busy timeout and the second writer waits its turn instead of failing with SQLITE_BUSY. That is fine for our write volume: the moment two cron workers both try to ingest the same trending video, one wins the insert and the other falls through to the update.
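
The idempotency is easy to verify against an in-memory database. This is a minimal Python sketch using the standard sqlite3 module and a trimmed-down version of the schema. The `ingest` function here is a hypothetical stand-in for the PHP ingestor, and it uses a follow-up SELECT instead of RETURNING so it also runs on older SQLite builds.

```python
import sqlite3


def ingest(db: sqlite3.Connection, key: str, url: str, now: int) -> int:
    """Insert the identity, or bump last_seen if it already exists; return row id."""
    db.execute(
        """
        INSERT INTO videos (identity_key, canonical_url, first_seen, last_seen)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(identity_key) DO UPDATE SET last_seen = excluded.last_seen
        """,
        (key, url, now, now),
    )
    row = db.execute(
        "SELECT id FROM videos WHERE identity_key = ?", (key,)
    ).fetchone()
    return row[0]


db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE videos (
        id            INTEGER PRIMARY KEY,
        identity_key  TEXT NOT NULL UNIQUE,
        canonical_url TEXT NOT NULL,
        first_seen    INTEGER NOT NULL,
        last_seen     INTEGER NOT NULL
    )
    """
)

KEY = "youtube:dQw4w9WgXcQ"
URL = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
first_id = ingest(db, KEY, URL, now=100)
second_id = ingest(db, KEY, URL, now=200)  # same identity, later sighting
row_count = db.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
seen = db.execute("SELECT first_seen, last_seen FROM videos").fetchone()
```

Both calls return the same id, the table holds one row, and only last_seen moves.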

The redirect-resolution problem

Normalization handles syntactic variants, but some URLs give no syntactic hint of which video they lead to. Link shorteners (bit.ly, t.co), consent redirects, and similar wrappers hide the real URL behind an HTTP redirect. You cannot canonicalize what you cannot see.

We resolve these once, at ingest, with a tightly bounded HEAD request, and we cache the resolution so we never hit the network for a URL we've already resolved. Here's the resolver in Python (cache layer omitted), since our crawler frontend is Python while the API is PHP:

import httpx

SHORTENERS = {"bit.ly", "t.co", "tinyurl.com", "ow.ly", "buff.ly"}

def resolve(url: str, max_hops: int = 5, timeout: float = 4.0) -> str:
    """Follow redirects only for known shorteners; bounded and safe."""
    seen = set()
    current = url
    for _ in range(max_hops):
        try:
            host = httpx.URL(current).host
        except httpx.InvalidURL:
            return current  # unparseable input: leave it for the normalizer to reject
        if host not in SHORTENERS:
            return current  # already a real URL, stop early
        if current in seen:
            return current  # redirect loop guard
        seen.add(current)
        try:
            r = httpx.head(current, follow_redirects=False, timeout=timeout)
        except httpx.HTTPError:
            return current  # network failure: fall back to what we have
        loc = r.headers.get("location")
        if not loc or r.status_code not in (301, 302, 303, 307, 308):
            return current
        try:
            current = str(httpx.URL(current).join(loc))
        except httpx.InvalidURL:
            return current  # malformed Location header
    return current

The guardrails matter more than the happy path. We only follow redirects for hosts we've explicitly decided are shorteners — we do not follow redirects on arbitrary URLs, because that turns your ingest pipeline into a server-side request forgery vector. The hop limit, the loop guard, the timeout, and the fall-back-on-failure behavior are all there because a crawler that blocks on a slow redirect is a crawler that falls behind. Resolution failures degrade gracefully: we store the unresolved URL and try again on the next pass.
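
The caching and retry-on-failure behavior can be sketched separately from the network call. The version below is a minimal illustration, not our production code: `CachingResolver` and the injected `fetch_location` callable are hypothetical names, and the cache is a plain in-memory dict keyed by the exact input URL. Injecting the fetch function keeps the logic testable without touching the network.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urljoin, urlsplit

SHORTENERS = {"bit.ly", "t.co", "tinyurl.com", "ow.ly", "buff.ly"}

# fetch_location(url) returns the redirect target, or None if there is none
# (no Location header, non-redirect status, or a network failure).
FetchLocation = Callable[[str], Optional[str]]


def is_shortener(url: str) -> bool:
    return (urlsplit(url).hostname or "") in SHORTENERS


class CachingResolver:
    def __init__(self, fetch_location: FetchLocation, max_hops: int = 5) -> None:
        self._fetch = fetch_location
        self._max_hops = max_hops
        self._cache: Dict[str, str] = {}

    def resolve(self, url: str) -> str:
        if not is_shortener(url):
            return url  # nothing to resolve, nothing to cache
        if url in self._cache:
            return self._cache[url]
        current, seen = url, set()
        for _ in range(self._max_hops):
            if not is_shortener(current) or current in seen:
                break
            seen.add(current)
            location = self._fetch(current)
            if location is None:
                break
            current = urljoin(current, location)
        # Cache only complete resolutions, so failures are retried next pass.
        if not is_shortener(current):
            self._cache[url] = current
        return current
```

A failed or partial resolution is returned but never cached, which is what lets the next ingest pass try again.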

What this buys you operationally

Once identity extraction is the dedup key, a lot of downstream problems simply stop existing:

Search dedup is free. Because the FTS5 table mirrors the videos table one-to-one, and videos is already deduplicated, search results never show the same video twice. We don't need a post-query dedup pass.
Cache keys are stable. LiteSpeed and Cloudflare cache our video pages keyed off the canonical URL. Since every variant collapses to one canonical, the cache hit rate went up immediately — we stopped caching six copies of the same page.
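Trending math is correct. Every re-ingest of the same identity lands on one row and bumps its last_seen, so a video shared across ten sources updates a single row instead of splitting its signal across ten. Note that last_seen only records recency; counting sightings would need an extra counter column incremented in the same upsert.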
rel=canonical is trivial. Every page already knows its canonical URL because we built it deterministically from the identity.

Mistakes I'd warn you about

A few things cost me time:

Don't canonicalize the timestamp away if your product uses it. We leave ?t=43 out of the identity (it's the same video), but some platforms genuinely treat a clipped start time as a distinct shareable unit. Decide consciously per platform. We keep t in the normalized URL (the KEEP list preserves it) but not in the identity key, and the canonical URL rebuilt from the identity drops it too, so store the offset separately if you need it.
Validate ID shape ruthlessly. A single malformed identity that merges two videos is a data-corruption bug that's miserable to unwind after it's been live for a month.
Make the pipeline pure and testable. Every component above is a plain function from input to output with no I/O except the ingestor and resolver. That meant I could write a few hundred table-driven tests of real-world URLs — including the weird ones from production logs — and run them in milliseconds.
Backfill is a migration, not a script you run once. When you change a normalization rule, you have to re-derive identities for existing rows, which can create new conflicts. Plan for the merge.
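
For the backfill point, it helps to compute a merge plan before touching any rows. The Python sketch below is one way to do that, under stated assumptions: rows are plain dicts with id, canonical_url, first_seen, and last_seen. `plan_backfill` and `derive_key` are hypothetical names, and the survivor rule (earliest first_seen wins) is a choice, not the only reasonable one.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]  # assumed columns: id, canonical_url, first_seen, last_seen


def plan_backfill(
    rows: List[Row],
    derive_key: Callable[[str], Optional[str]],
) -> Dict[str, Any]:
    """Re-derive identity keys under the new rules and list the merges needed."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    unresolved: List[int] = []
    for row in rows:
        key = derive_key(row["canonical_url"])
        if key is None:
            unresolved.append(row["id"])  # new rules reject this row; review by hand
        else:
            groups[key].append(row)

    merges = []
    for key, members in groups.items():
        if len(members) < 2:
            continue  # no conflict under the new rules
        # Survivor: the earliest-seen row (ties broken by id).
        members = sorted(members, key=lambda r: (r["first_seen"], r["id"]))
        merges.append({
            "identity_key": key,
            "keep_id": members[0]["id"],
            "merge_ids": [r["id"] for r in members[1:]],
            "first_seen": members[0]["first_seen"],
            "last_seen": max(r["last_seen"] for r in members),
        })
    return {"merges": merges, "unresolved": unresolved}
```

Because the function only reads rows and returns a plan, it can run against a copy of production data and be reviewed before any UPDATE or DELETE is issued.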

URL canonicalization looks like a 20-line utility function until you actually run it against the open web. Splitting it into normalize → extract identity → build canonical → dedup-at-the-DB turned a constant source of duplicate-content bugs into a part of the system I no longer think about. The total code is small; the discipline of keeping identity separate from string is what makes it hold up.