PostgreSQL Full-Text Search for CJK Video Metadata
When I started thinking about migrating TopVideoHub to PostgreSQL, full-text search was the feature I was most curious about. Our video database spans Japanese, Korean, Traditional Chinese (Taiwan and Hong Kong), Vietnamese, and Thai — a real gauntlet for any search system.
Here's what I learned implementing PostgreSQL FTS for a multi-script video platform.
Why PostgreSQL FTS Over SQLite FTS5
SQLite FTS5 works, but PostgreSQL offers something more powerful: weighted tsvector columns with configurable text search configurations (TSC) per language. You can weight the title field more than the description, rank results by relevance, and even combine searches across multiple languages in a single query.
The downside: CJK text (Chinese, Japanese, Korean) doesn't tokenize well in PostgreSQL's built-in configurations. We'll address that head-on.
Setting Up the Schema
First, create the videos table with a dedicated tsvector column:
CREATE TABLE videos (
id BIGSERIAL PRIMARY KEY,
video_id TEXT UNIQUE NOT NULL,
title TEXT NOT NULL,
channel_title TEXT,
description TEXT,
region TEXT NOT NULL,
category_id INTEGER,
view_count BIGINT DEFAULT 0,
published_at TIMESTAMPTZ,
fetched_at TIMESTAMPTZ DEFAULT now(),
-- Weighted search vector: title gets weight A (highest), channel B, description C
search_vec TSVECTOR GENERATED ALWAYS AS (
setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
setweight(to_tsvector('simple', coalesce(channel_title, '')), 'B') ||
setweight(to_tsvector('simple', coalesce(description, '')), 'C')
) STORED
);
-- GIN index for fast FTS queries
CREATE INDEX idx_videos_search ON videos USING GIN(search_vec);
-- Index for region-filtered searches
CREATE INDEX idx_videos_region ON videos(region);
The GENERATED ALWAYS AS ... STORED syntax is PostgreSQL 12+. It automatically keeps search_vec in sync — no triggers needed.
Why 'simple' Configuration?
The simple text search configuration tokenizes by whitespace and lowercases text, but doesn't stem words. For CJK content on TopVideoHub, this is actually what we want: a Japanese title like "人気動画" should be searchable by its exact characters, not a stemmed form.
For pure English content you might use 'english' instead, which handles stemming (running → run, videos → video).
Querying with ts_rank
-- Search with relevance ranking
SELECT
video_id,
title,
channel_title,
region,
ts_rank(search_vec, query, 32) AS rank
FROM videos,
to_tsquery('simple', plainto_tsquery('simple', $1)::text) AS query
WHERE search_vec @@ query
ORDER BY rank DESC
LIMIT 20;
The ts_rank function scores results based on weight of matched tokens (A > B > C), number of matching terms, and document length normalization (flag 32 = divide by document length).
Handling CJK Text
CJK languages don't use spaces between words. "人気動画" (popular video in Japanese) produces a single token — no character-level breakdown. For Asian content searches, I use a character-splitting function:
-- Split CJK text into individual characters for better matching
CREATE OR REPLACE FUNCTION cjk_to_tokens(input TEXT)
RETURNS TEXT AS $$
DECLARE
result TEXT := '';
i INTEGER;
ch TEXT;
code INTEGER;
BEGIN
FOR i IN 1..char_length(input) LOOP
ch := substr(input, i, 1);
code := ascii(ch);
-- CJK Unified Ideographs: U+4E00 to U+9FFF
IF code >= 19968 AND code <= 40959 THEN
result := result || ' ' || ch;
-- Hiragana: U+3040-U+309F, Katakana: U+30A0-U+30FF
ELSIF code >= 12352 AND code <= 12543 THEN
result := result || ' ' || ch;
ELSE
result := result || ch;
END IF;
END LOOP;
RETURN trim(result);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- Expression index for CJK-aware search
CREATE INDEX idx_videos_cjk ON videos
USING GIN(to_tsvector('simple', cjk_to_tokens(title)));
This breaks "人気動画" into "人 気 動 画" — individual tokens that can be matched separately when a user searches for any of those characters.
Comparing with SQLite FTS5
| Feature | PostgreSQL FTS | SQLite FTS5 |
|---|---|---|
| Weighted fields | Native (A/B/C/D) | Via bm25() |
| CJK bigrams | Requires function | unicode61 built-in |
| Concurrent reads | Excellent | Good (WAL mode) |
| Hosted cost | $15-20/mo managed | $0 (embedded) |
| Maintenance | Requires server | Zero |
For a platform like TopVideoHub serving 9 Asia-Pacific regions, the CJK support tradeoff matters significantly. SQLite FTS5 with unicode61 handles CJK better out of the box, but PostgreSQL's weighted ranking gives better relevance for mixed-script searches.
Timezone Awareness for Asia-Pacific
One PostgreSQL advantage: timezone-aware timestamps. When calculating "trending in the last 24 hours," Asia-Pacific regions span UTC+7 to UTC+9. PostgreSQL handles this cleanly:
-- Find videos trending in Japan in the last 24 hours (JST)
SELECT video_id, title
FROM videos
WHERE region = 'JP'
AND fetched_at > now() AT TIME ZONE 'Asia/Tokyo' - interval '24 hours';
SQLite stores everything as text or Unix timestamps, requiring manual timezone math in application code. For a multi-region platform spanning nine different timezone offsets, this PostgreSQL feature pays for itself in reduced bugs.
This article is part of the Building TopVideoHub series. Check out TopVideoHub to see these techniques in action.
Top comments (0)