Implementing Video Search Autocomplete with PostgreSQL Trigrams

#postgres #search #autocomplete #webdev

Why Fuzzy Autocomplete Matters for European Content

At ViralVidVault, we index video titles from 7 European regions. Users searching for a Swedish video might type "goteborg" instead of "Goteborg" or misspell a Dutch title. Standard LIKE queries fail hard on typos. PostgreSQL trigrams handle this gracefully.

Setting Up pg_trgm

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE videos (
    id SERIAL PRIMARY KEY,
    video_id VARCHAR(16) UNIQUE NOT NULL,
    title TEXT NOT NULL,
    channel_title TEXT,
    region VARCHAR(4) NOT NULL DEFAULT 'GB',
    view_count BIGINT DEFAULT 0,
    virality_score FLOAT DEFAULT 0,
    fetched_at TIMESTAMP DEFAULT NOW()
);

-- GIN trigram indexes for fast fuzzy matching
CREATE INDEX idx_videos_title_trgm ON videos USING GIN (title gin_trgm_ops);
CREATE INDEX idx_videos_channel_trgm ON videos USING GIN (channel_title gin_trgm_ops);

Region-Weighted Autocomplete

The unique challenge for a European video vault: a search should prefer results from the user's browsed region but still include cross-region matches. Here's the function:

CREATE OR REPLACE FUNCTION autocomplete_videos(
    query TEXT,
    user_region VARCHAR(4) DEFAULT 'GB',
    max_results INT DEFAULT 10
)
RETURNS TABLE (
    video_id VARCHAR,
    title TEXT,
    channel_title TEXT,
    region VARCHAR,
    view_count BIGINT,
    score FLOAT
) AS $$
BEGIN
    RETURN QUERY
    SELECT v.video_id, v.title, v.channel_title, v.region, v.view_count,
        (
            similarity(v.title, query) * 0.6
            + similarity(COALESCE(v.channel_title, ''), query) * 0.2
            -- Boost same-region results
            + CASE WHEN v.region = user_region THEN 0.15 ELSE 0 END
            -- Boost viral content
            + LEAST(v.virality_score / 100.0 * 0.05, 0.05)
        ) AS score
    FROM videos v
    WHERE v.title % query
       OR v.channel_title % query
       OR lower(v.title) LIKE lower(query) || '%'
    ORDER BY score DESC, v.view_count DESC
    LIMIT max_results;
END;
$$ LANGUAGE plpgsql STABLE;

Usage examples:

-- Dutch user searching with a typo
SELECT * FROM autocomplete_videos('muziek videoclp', 'NL', 8);
-- Matches "Muziek Videoclip" despite the missing 'i'

-- Swedish user, prefix match
SELECT * FROM autocomplete_videos('svensk', 'SE', 8);
-- Boosts Swedish results in the ranking

Python API Endpoint

Wire it to a FastAPI endpoint:

from fastapi import FastAPI, Query, Request
import asyncpg

app = FastAPI()
pool: asyncpg.Pool = None

@app.on_event("startup")
async def startup():
    global pool
    pool = await asyncpg.create_pool("postgresql://user:pass@localhost/viralvidvault")

@app.get("/api/autocomplete")
async def autocomplete(
    q: str = Query(min_length=2, max_length=100),
    region: str = Query(default="GB", max_length=4),
):
    rows = await pool.fetch(
        "SELECT * FROM autocomplete_videos($1, $2, $3)",
        q, region.upper(), 8,
    )
    return {
        "query": q,
        "region": region,
        "suggestions": [
            {
                "video_id": r["video_id"],
                "title": r["title"],
                "channel": r["channel_title"],
                "region": r["region"],
                "views": r["view_count"],
                "score": round(r["score"], 3),
            }
            for r in rows
        ],
    }

Frontend with Debouncing

const searchInput = document.getElementById('search-input');
let timer;

searchInput.addEventListener('input', (e) => {
  clearTimeout(timer);
  timer = setTimeout(async () => {
    const q = e.target.value.trim();
    if (q.length < 2) return;

    // Detect user's region from page context
    const region = document.documentElement.dataset.region || 'GB';
    const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(q)}&region=${region}`);
    const data = await res.json();

    const dropdown = document.getElementById('autocomplete-list');
    dropdown.innerHTML = data.suggestions.map(s => `
      <a href="/watch/${s.video_id}" class="suggestion">
        <span class="title">${s.title}</span>
        <span class="meta">${s.channel} · ${s.region}</span>
      </a>
    `).join('');
  }, 200);
});

Performance on European Content

On ViralVidVault's dataset with titles in English, Polish, Dutch, Swedish, and Norwegian:

Approach	Avg Latency	Fuzzy Matching
`ILIKE '%query%'`	38ms	No
`pg_trgm` GIN	2.5ms	Yes
`pg_trgm` + region boost	3.1ms	Yes + regional

Trigram similarity works across Latin-alphabet languages without any language-specific configuration. Polish characters like "ś", "ż", and "ó" are handled naturally because trigrams operate on raw character sequences.

Tuning the Similarity Threshold

The default threshold of 0.3 works for English but may need adjustment for shorter European words:

-- For Dutch and Swedish titles, lower threshold catches more matches
SET pg_trgm.similarity_threshold = 0.2;

-- Test threshold impact
SELECT title, similarity(title, 'zweed') as score
FROM videos
WHERE title % 'zweed'
ORDER BY score DESC;

At ViralVidVault, we settled on 0.22 as the sweet spot -- low enough to catch European language typos, high enough to avoid irrelevant noise. You can also make this dynamic per-region by adjusting the threshold in the application layer before each query.

The combination of trigram indexes and region-weighted scoring gives users a search experience that feels intelligent. Misspellings, partial words, and cross-language queries all return relevant results in under 5 milliseconds.

This article is part of the Building ViralVidVault series. Check out ViralVidVault to see these techniques in action.