ahmet gedik

Posted on Mar 25

PostgreSQL Full-Text Search for CJK Video Metadata

#postgres #database #tutorial #webdev

PostgreSQL Full-Text Search for CJK Video Metadata

When I started thinking about migrating TopVideoHub to PostgreSQL, full-text search was the feature I was most curious about. Our video database spans Japanese, Korean, Traditional Chinese (Taiwan and Hong Kong), Vietnamese, and Thai — a real gauntlet for any search system.

Here's what I learned implementing PostgreSQL FTS for a multi-script video platform.

Why PostgreSQL FTS Over SQLite FTS5

SQLite FTS5 works, but PostgreSQL offers something more powerful: weighted tsvector columns with configurable text search configurations (TSC) per language. You can weight the title field more than the description, rank results by relevance, and even combine searches across multiple languages in a single query.

The downside: CJK text (Chinese, Japanese, Korean) doesn't tokenize well in PostgreSQL's built-in configurations. We'll address that head-on.

Setting Up the Schema

First, create the videos table with a dedicated tsvector column:

CREATE TABLE videos (
    id          BIGSERIAL PRIMARY KEY,
    video_id    TEXT UNIQUE NOT NULL,
    title       TEXT NOT NULL,
    channel_title TEXT,
    description TEXT,
    region      TEXT NOT NULL,
    category_id INTEGER,
    view_count  BIGINT DEFAULT 0,
    published_at TIMESTAMPTZ,
    fetched_at  TIMESTAMPTZ DEFAULT now(),
    -- Weighted search vector: title gets weight A (highest), channel B, description C
    search_vec  TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('simple', coalesce(channel_title, '')), 'B') ||
        setweight(to_tsvector('simple', coalesce(description, '')), 'C')
    ) STORED
);

-- GIN index for fast FTS queries
CREATE INDEX idx_videos_search ON videos USING GIN(search_vec);

-- Index for region-filtered searches
CREATE INDEX idx_videos_region ON videos(region);

The GENERATED ALWAYS AS ... STORED syntax is PostgreSQL 12+. It automatically keeps search_vec in sync — no triggers needed.

Why 'simple' Configuration?

The simple text search configuration tokenizes by whitespace and lowercases text, but doesn't stem words. For CJK content on TopVideoHub, this is actually what we want: a Japanese title like "人気動画" should be searchable by its exact characters, not a stemmed form.

For pure English content you might use 'english' instead, which handles stemming (running → run, videos → video).

Querying with ts_rank

-- Search with relevance ranking
SELECT
    video_id,
    title,
    channel_title,
    region,
    ts_rank(search_vec, query, 32) AS rank
FROM videos,
     to_tsquery('simple', plainto_tsquery('simple', $1)::text) AS query
WHERE search_vec @@ query
ORDER BY rank DESC
LIMIT 20;

The ts_rank function scores results based on weight of matched tokens (A > B > C), number of matching terms, and document length normalization (flag 32 = divide by document length).

Handling CJK Text

CJK languages don't use spaces between words. "人気動画" (popular video in Japanese) produces a single token — no character-level breakdown. For Asian content searches, I use a character-splitting function:

-- Split CJK text into individual characters for better matching
CREATE OR REPLACE FUNCTION cjk_to_tokens(input TEXT)
RETURNS TEXT AS $$
DECLARE
    result TEXT := '';
    i INTEGER;
    ch TEXT;
    code INTEGER;
BEGIN
    FOR i IN 1..char_length(input) LOOP
        ch := substr(input, i, 1);
        code := ascii(ch);
        -- CJK Unified Ideographs: U+4E00 to U+9FFF
        IF code >= 19968 AND code <= 40959 THEN
            result := result || ' ' || ch;
        -- Hiragana: U+3040-U+309F, Katakana: U+30A0-U+30FF
        ELSIF code >= 12352 AND code <= 12543 THEN
            result := result || ' ' || ch;
        ELSE
            result := result || ch;
        END IF;
    END LOOP;
    RETURN trim(result);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Expression index for CJK-aware search
CREATE INDEX idx_videos_cjk ON videos
    USING GIN(to_tsvector('simple', cjk_to_tokens(title)));

This breaks "人気動画" into "人気動画" — individual tokens that can be matched separately when a user searches for any of those characters.

Comparing with SQLite FTS5

Feature	PostgreSQL FTS	SQLite FTS5
Weighted fields	Native (A/B/C/D)	Via bm25()
CJK bigrams	Requires function	unicode61 built-in
Concurrent reads	Excellent	Good (WAL mode)
Hosted cost	$15-20/mo managed	$0 (embedded)
Maintenance	Requires server	Zero

For a platform like TopVideoHub serving 9 Asia-Pacific regions, the CJK support tradeoff matters significantly. SQLite FTS5 with unicode61 handles CJK better out of the box, but PostgreSQL's weighted ranking gives better relevance for mixed-script searches.

Timezone Awareness for Asia-Pacific

One PostgreSQL advantage: timezone-aware timestamps. When calculating "trending in the last 24 hours," Asia-Pacific regions span UTC+7 to UTC+9. PostgreSQL handles this cleanly:

-- Find videos trending in Japan in the last 24 hours (JST)
SELECT video_id, title
FROM videos
WHERE region = 'JP'
  AND fetched_at > now() AT TIME ZONE 'Asia/Tokyo' - interval '24 hours';

SQLite stores everything as text or Unix timestamps, requiring manual timezone math in application code. For a multi-region platform spanning nine different timezone offsets, this PostgreSQL feature pays for itself in reduced bugs.

This article is part of the Building TopVideoHub series. Check out TopVideoHub to see these techniques in action.

DEV Community

PostgreSQL Full-Text Search for CJK Video Metadata

PostgreSQL Full-Text Search for CJK Video Metadata

Why PostgreSQL FTS Over SQLite FTS5

Setting Up the Schema

Why 'simple' Configuration?

Querying with ts_rank

Handling CJK Text

Comparing with SQLite FTS5

Timezone Awareness for Asia-Pacific

Top comments (0)