Implementing Video Search Autocomplete with PostgreSQL Trigrams

#postgres #search #autocomplete #webdev

The Search-as-You-Type Problem

Type "bts" into a search box — a user in Korea probably wants BTS music videos. Type "けん" and they want results for any Japanese title containing those kana. Generic substring matching is too slow at scale; full-text search is too rigid for autocomplete. PostgreSQL's pg_trgm extension sits in the sweet spot.

TopVideoHub serves video search across 9 Asia-Pacific regions, where multi-script autocomplete is non-trivial. Here's how we built it.

What Is pg_trgm?

A trigram is a group of three consecutive characters. The word hello decomposes to h, he, hel, ell, llo, lo. PostgreSQL can index these trigrams and use them to compute string similarity in microseconds.

CREATE EXTENSION IF NOT EXISTS pg_trgm;

Index Strategy

CREATE INDEX idx_videos_title_trgm
    ON videos USING GIN (title gin_trgm_ops);

CREATE INDEX idx_videos_channel_trgm
    ON videos USING GIN (channel_title gin_trgm_ops);

CREATE INDEX idx_videos_region_title_trgm
    ON videos USING GIN (title gin_trgm_ops)
    WHERE region IN ('JP', 'KR', 'TW', 'HK', 'VN', 'TH', 'SG');

GIN indexes support set-containment lookups and are optimal for search. For autocomplete, GIN with a WHERE clause beats plain LIKE by 10-50x.

The Autocomplete Query

SELECT
    video_id,
    title,
    channel_title,
    thumbnail_url,
    region,
    similarity(title, $1) AS sim
FROM videos
WHERE
    region = $2
    AND title % $1
ORDER BY sim DESC, view_count DESC
LIMIT 8;

Tune the similarity threshold:

SET pg_trgm.similarity_threshold = 0.2;  -- more permissive
SET pg_trgm.similarity_threshold = 0.4;  -- stricter

Word Similarity for Partial Prefix Match

SELECT
    video_id,
    title,
    word_similarity($1, title) AS wsim
FROM videos
WHERE
    region = $2
    AND $1 <% title
ORDER BY wsim DESC, view_count DESC
LIMIT 8;

word_similarity(needle, haystack) checks if needle is similar to any word within haystack — ideal when users type partial channel or title words.

Python FastAPI Endpoint

from fastapi import FastAPI, Query
from pydantic import BaseModel
import asyncpg

app = FastAPI()
pool = None

class AutocompleteResult(BaseModel):
    video_id: str
    title: str
    channel: str
    thumbnail: str
    region: str

@app.on_event('startup')
async def startup():
    global pool
    pool = await asyncpg.create_pool(
        dsn='postgresql://user:pass@localhost/topvideohub',
        min_size=2, max_size=10,
    )

@app.get('/api/autocomplete', response_model=list[AutocompleteResult])
async def autocomplete(
    q: str = Query(min_length=2, max_length=80),
    region: str = Query(default='JP', regex='^[A-Z]{2}$'),
):
    query = """
        SELECT video_id, title, channel_title, thumbnail_url, region
        FROM videos
        WHERE region = $2
          AND ($1 <% title OR title ILIKE '%' || $1 || '%')
        ORDER BY word_similarity($1, title) DESC, view_count DESC
        LIMIT 8
    """
    async with pool.acquire() as conn:
        rows = await conn.fetch(query, q, region)
    return [
        AutocompleteResult(
            video_id=r['video_id'], title=r['title'],
            channel=r['channel_title'], thumbnail=r['thumbnail_url'], region=r['region'],
        )
        for r in rows
    ]

Frontend Debounce Integration

const autocomplete = (() => {
  let timer;
  const cache = new Map();

  async function suggest(query, region) {
    if (cache.has(query)) return cache.get(query);
    const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(query)}&region=${region}`);
    const data = await res.json();
    cache.set(query, data);
    return data;
  }

  return (input, region, onResults) => {
    clearTimeout(timer);
    timer = setTimeout(async () => {
      if (input.length < 2) return;
      const results = await suggest(input, region);
      onResults(results);
    }, 150);
  };
})();

CJK Trigram Behaviour

CJK characters are single-character "words", so trigrams look like "ア", "アニ", "ニメ" for "アニメ" (anime). The similarity threshold may need lowering for CJK to 0.15-0.2 since short CJK strings share fewer trigrams with long titles.

Query performance on TopVideoHub with 200,000 videos and a GIN trigram index: median autocomplete latency is under 8ms.

This article is part of the Building TopVideoHub series. Check out TopVideoHub to see these techniques in action.

DEV Community