The Search-as-You-Type Problem
Type "bts" into a search box — a user in Korea probably wants BTS music videos. Type "けん" and they want results for any Japanese title containing those kana. Generic substring matching is too slow at scale; full-text search is too rigid for autocomplete. PostgreSQL's pg_trgm extension sits in the sweet spot.
TopVideoHub serves video search across 9 Asia-Pacific regions, where multi-script autocomplete is non-trivial. Here's how we built it.
What Is pg_trgm?
A trigram is a group of three consecutive characters. The word hello decomposes to h, he, hel, ell, llo, lo. PostgreSQL can index these trigrams and use them to compute string similarity in microseconds.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
Index Strategy
CREATE INDEX idx_videos_title_trgm
ON videos USING GIN (title gin_trgm_ops);
CREATE INDEX idx_videos_channel_trgm
ON videos USING GIN (channel_title gin_trgm_ops);
CREATE INDEX idx_videos_region_title_trgm
ON videos USING GIN (title gin_trgm_ops)
WHERE region IN ('JP', 'KR', 'TW', 'HK', 'VN', 'TH', 'SG');
GIN indexes support set-containment lookups and are optimal for search. For autocomplete, GIN with a WHERE clause beats plain LIKE by 10-50x.
The Autocomplete Query
SELECT
video_id,
title,
channel_title,
thumbnail_url,
region,
similarity(title, $1) AS sim
FROM videos
WHERE
region = $2
AND title % $1
ORDER BY sim DESC, view_count DESC
LIMIT 8;
Tune the similarity threshold:
SET pg_trgm.similarity_threshold = 0.2; -- more permissive
SET pg_trgm.similarity_threshold = 0.4; -- stricter
Word Similarity for Partial Prefix Match
SELECT
video_id,
title,
word_similarity($1, title) AS wsim
FROM videos
WHERE
region = $2
AND $1 <% title
ORDER BY wsim DESC, view_count DESC
LIMIT 8;
word_similarity(needle, haystack) checks if needle is similar to any word within haystack — ideal when users type partial channel or title words.
Python FastAPI Endpoint
from fastapi import FastAPI, Query
from pydantic import BaseModel
import asyncpg
app = FastAPI()
pool = None
class AutocompleteResult(BaseModel):
video_id: str
title: str
channel: str
thumbnail: str
region: str
@app.on_event('startup')
async def startup():
global pool
pool = await asyncpg.create_pool(
dsn='postgresql://user:pass@localhost/topvideohub',
min_size=2, max_size=10,
)
@app.get('/api/autocomplete', response_model=list[AutocompleteResult])
async def autocomplete(
q: str = Query(min_length=2, max_length=80),
region: str = Query(default='JP', regex='^[A-Z]{2}$'),
):
query = """
SELECT video_id, title, channel_title, thumbnail_url, region
FROM videos
WHERE region = $2
AND ($1 <% title OR title ILIKE '%' || $1 || '%')
ORDER BY word_similarity($1, title) DESC, view_count DESC
LIMIT 8
"""
async with pool.acquire() as conn:
rows = await conn.fetch(query, q, region)
return [
AutocompleteResult(
video_id=r['video_id'], title=r['title'],
channel=r['channel_title'], thumbnail=r['thumbnail_url'], region=r['region'],
)
for r in rows
]
Frontend Debounce Integration
const autocomplete = (() => {
let timer;
const cache = new Map();
async function suggest(query, region) {
if (cache.has(query)) return cache.get(query);
const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(query)}®ion=${region}`);
const data = await res.json();
cache.set(query, data);
return data;
}
return (input, region, onResults) => {
clearTimeout(timer);
timer = setTimeout(async () => {
if (input.length < 2) return;
const results = await suggest(input, region);
onResults(results);
}, 150);
};
})();
CJK Trigram Behaviour
CJK characters are single-character "words", so trigrams look like "ア", "アニ", "ニメ" for "アニメ" (anime). The similarity threshold may need lowering for CJK to 0.15-0.2 since short CJK strings share fewer trigrams with long titles.
Query performance on TopVideoHub with 200,000 videos and a GIN trigram index: median autocomplete latency is under 8ms.
This article is part of the Building TopVideoHub series. Check out TopVideoHub to see these techniques in action.
Top comments (0)