ahmet gedik

Posted on Mar 27

PostgreSQL Full-Text Search for Multi-Script Video Metadata

#database #postgres #webdev #tutorial

PostgreSQL Full-Text Search for Multi-Script Video Metadata

TrendVidStream aggregates trending video content from 8 regions spanning the UAE (Arabic), Nordic countries (Danish, Finnish), and Central Europe (Czech, Belgian French/Dutch). Each region brings different script requirements to full-text search — and PostgreSQL handles most of them well.

Here's what I learned building FTS for a multi-region, multi-script video platform.

Setting Up Weighted tsvectors

For video metadata, titles matter more than descriptions. PostgreSQL's weight system (A through D) lets you encode this in the index:

CREATE TABLE videos (
    id          BIGSERIAL PRIMARY KEY,
    video_id    TEXT UNIQUE NOT NULL,
    title       TEXT NOT NULL,
    channel_title TEXT,
    description TEXT,
    region      TEXT NOT NULL,
    category_id INTEGER,
    view_count  BIGINT DEFAULT 0,
    language    TEXT,  -- 'ar', 'da', 'fi', 'cs', 'nl', 'fr', 'en'
    published_at TIMESTAMPTZ,
    fetched_at  TIMESTAMPTZ DEFAULT now(),
    -- Weighted vector: title (A) > channel (B) > description (C)
    search_vec  TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('simple', coalesce(channel_title, '')), 'B') ||
        setweight(to_tsvector('simple', coalesce(left(description, 500), '')), 'C')
    ) STORED
);

CREATE INDEX idx_videos_search ON videos USING GIN(search_vec);
CREATE INDEX idx_videos_region_cat ON videos(region, category_id);

RTL Arabic Text Considerations

UAE content on TrendVidStream includes Arabic titles. Arabic is RTL but PostgreSQL's FTS tokenizer treats it as UTF-8 text regardless of direction — the direction is a rendering concern, not a storage concern.

The simple configuration lowercases and splits on whitespace, which works reasonably for Arabic since Arabic words are space-separated unlike CJK languages:

-- Arabic search works with 'simple' config
SELECT video_id, title
FROM videos
WHERE search_vec @@ to_tsquery('simple', 'موسيقى')
  AND region = 'AE';

For better Arabic FTS, you'd want an Arabic dictionary with root-word stemming. But for a video discovery platform where users search for exact artist names and song titles, simple works well enough.

Language-Specific Configurations

For Western European languages in the TrendVidStream region set, PostgreSQL has built-in configurations:

-- Finnish has an excellent built-in stemmer
SELECT to_tsvector('finnish', 'musiikkivideo suosittu');
-- Stems to: 'musiikkivideo' + 'suosittu' (root forms)

-- Danish stems work well too
SELECT to_tsvector('danish', 'populær musik video');

-- For multi-language search across all regions, use 'simple'
-- to avoid language-specific stemming artifacts
SELECT to_tsvector('simple', 'popular music');

The tradeoff: language-specific configs give better search for single-language queries (searching "musik" in Danish finds "musikvideo"), but mix poorly across languages. I use simple for the cross-region search index.

GIN Index Strategies for Multi-Region Queries

-- Partial index for each major region (smaller, faster for region-filtered searches)
CREATE INDEX idx_videos_search_ae ON videos USING GIN(search_vec)
    WHERE region = 'AE';

CREATE INDEX idx_videos_search_fi ON videos USING GIN(search_vec)
    WHERE region = 'FI';

-- Full index for cross-region search
CREATE INDEX idx_videos_search_all ON videos USING GIN(search_vec);

Querying with ts_rank and Region Filtering

-- Search across Nordic regions (FI + DK) with relevance ranking
SELECT
    video_id,
    title,
    region,
    ts_rank(search_vec, query, 32) AS rank
FROM videos,
     to_tsquery('simple', plainto_tsquery('simple', $1)::text) AS query
WHERE search_vec @@ query
  AND region IN ('FI', 'DK')
ORDER BY rank DESC
LIMIT 20;

-- Multi-region trending for Central European markets
SELECT
    video_id,
    title,
    region,
    view_count
FROM videos
WHERE region IN ('CZ', 'BE', 'CH')
  AND fetched_at > now() - interval '24 hours'
ORDER BY view_count DESC
LIMIT 50;

Python Batch Indexing

import psycopg2
from psycopg2.extras import execute_values

def bulk_index_videos(conn, videos: list[dict]) -> None:
    """Bulk upsert; GENERATED ALWAYS AS updates search_vec automatically."""
    with conn.cursor() as cur:
        execute_values(cur, """
            INSERT INTO videos (video_id, title, channel_title, description,
                                region, category_id, view_count, language)
            VALUES %s
            ON CONFLICT (video_id) DO UPDATE SET
                title = EXCLUDED.title,
                view_count = EXCLUDED.view_count,
                fetched_at = now()
        """, [
            (v['video_id'], v['title'], v.get('channel_title', ''),
             v.get('description', '')[:500], v['region'],
             v.get('category_id', 0), v.get('view_count', 0),
             v.get('language', 'en'))
            for v in videos
        ])
    conn.commit()

The GENERATED ALWAYS AS STORED column means search_vec updates automatically on every upsert — no triggers, no manual index maintenance.

Regulatory Considerations for Multi-Region Content

One non-technical challenge unique to TrendVidStream's region set: content that's available in one region may be restricted in another. Finnish content regulations, UAE content standards, and Czech licensing rules are all different.

From a database perspective, this means storing region availability explicitly and filtering at query time, not assuming cross-region availability.

This article is part of the Building TrendVidStream series. Check out TrendVidStream to see these techniques in action.

DEV Community

PostgreSQL Full-Text Search for Multi-Script Video Metadata

PostgreSQL Full-Text Search for Multi-Script Video Metadata

Setting Up Weighted tsvectors

RTL Arabic Text Considerations

Language-Specific Configurations

GIN Index Strategies for Multi-Region Queries

Querying with ts_rank and Region Filtering

Python Batch Indexing

Regulatory Considerations for Multi-Region Content

Top comments (0)