PostgreSQL Full-Text Search for Multi-Script Video Metadata
TrendVidStream aggregates trending video content from 8 regions spanning the UAE (Arabic), Nordic countries (Danish, Finnish), and Central Europe (Czech, Belgian French/Dutch). Each region brings different script requirements to full-text search — and PostgreSQL handles most of them well.
Here's what I learned building FTS for a multi-region, multi-script video platform.
Setting Up Weighted tsvectors
For video metadata, titles matter more than descriptions. PostgreSQL's weight system (A through D) lets you encode this in the index:
CREATE TABLE videos (
id BIGSERIAL PRIMARY KEY,
video_id TEXT UNIQUE NOT NULL,
title TEXT NOT NULL,
channel_title TEXT,
description TEXT,
region TEXT NOT NULL,
category_id INTEGER,
view_count BIGINT DEFAULT 0,
language TEXT, -- 'ar', 'da', 'fi', 'cs', 'nl', 'fr', 'en'
published_at TIMESTAMPTZ,
fetched_at TIMESTAMPTZ DEFAULT now(),
-- Weighted vector: title (A) > channel (B) > description (C)
search_vec TSVECTOR GENERATED ALWAYS AS (
setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
setweight(to_tsvector('simple', coalesce(channel_title, '')), 'B') ||
setweight(to_tsvector('simple', coalesce(left(description, 500), '')), 'C')
) STORED
);
CREATE INDEX idx_videos_search ON videos USING GIN(search_vec);
CREATE INDEX idx_videos_region_cat ON videos(region, category_id);
RTL Arabic Text Considerations
UAE content on TrendVidStream includes Arabic titles. Arabic is RTL but PostgreSQL's FTS tokenizer treats it as UTF-8 text regardless of direction — the direction is a rendering concern, not a storage concern.
The simple configuration lowercases and splits on whitespace, which works reasonably for Arabic since Arabic words are space-separated unlike CJK languages:
-- Arabic search works with 'simple' config
SELECT video_id, title
FROM videos
WHERE search_vec @@ to_tsquery('simple', 'موسيقى')
AND region = 'AE';
For better Arabic FTS, you'd want an Arabic dictionary with root-word stemming. But for a video discovery platform where users search for exact artist names and song titles, simple works well enough.
Language-Specific Configurations
For Western European languages in the TrendVidStream region set, PostgreSQL has built-in configurations:
-- Finnish has an excellent built-in stemmer
SELECT to_tsvector('finnish', 'musiikkivideo suosittu');
-- Stems to: 'musiikkivideo' + 'suosittu' (root forms)
-- Danish stems work well too
SELECT to_tsvector('danish', 'populær musik video');
-- For multi-language search across all regions, use 'simple'
-- to avoid language-specific stemming artifacts
SELECT to_tsvector('simple', 'popular music');
The tradeoff: language-specific configs give better search for single-language queries (searching "musik" in Danish finds "musikvideo"), but mix poorly across languages. I use simple for the cross-region search index.
GIN Index Strategies for Multi-Region Queries
-- Partial index for each major region (smaller, faster for region-filtered searches)
CREATE INDEX idx_videos_search_ae ON videos USING GIN(search_vec)
WHERE region = 'AE';
CREATE INDEX idx_videos_search_fi ON videos USING GIN(search_vec)
WHERE region = 'FI';
-- Full index for cross-region search
CREATE INDEX idx_videos_search_all ON videos USING GIN(search_vec);
Querying with ts_rank and Region Filtering
-- Search across Nordic regions (FI + DK) with relevance ranking
SELECT
video_id,
title,
region,
ts_rank(search_vec, query, 32) AS rank
FROM videos,
to_tsquery('simple', plainto_tsquery('simple', $1)::text) AS query
WHERE search_vec @@ query
AND region IN ('FI', 'DK')
ORDER BY rank DESC
LIMIT 20;
-- Multi-region trending for Central European markets
SELECT
video_id,
title,
region,
view_count
FROM videos
WHERE region IN ('CZ', 'BE', 'CH')
AND fetched_at > now() - interval '24 hours'
ORDER BY view_count DESC
LIMIT 50;
Python Batch Indexing
import psycopg2
from psycopg2.extras import execute_values
def bulk_index_videos(conn, videos: list[dict]) -> None:
"""Bulk upsert; GENERATED ALWAYS AS updates search_vec automatically."""
with conn.cursor() as cur:
execute_values(cur, """
INSERT INTO videos (video_id, title, channel_title, description,
region, category_id, view_count, language)
VALUES %s
ON CONFLICT (video_id) DO UPDATE SET
title = EXCLUDED.title,
view_count = EXCLUDED.view_count,
fetched_at = now()
""", [
(v['video_id'], v['title'], v.get('channel_title', ''),
v.get('description', '')[:500], v['region'],
v.get('category_id', 0), v.get('view_count', 0),
v.get('language', 'en'))
for v in videos
])
conn.commit()
The GENERATED ALWAYS AS STORED column means search_vec updates automatically on every upsert — no triggers, no manual index maintenance.
Regulatory Considerations for Multi-Region Content
One non-technical challenge unique to TrendVidStream's region set: content that's available in one region may be restricted in another. Finnish content regulations, UAE content standards, and Czech licensing rules are all different.
From a database perspective, this means storing region availability explicitly and filtering at query time, not assuming cross-region availability.
This article is part of the Building TrendVidStream series. Check out TrendVidStream to see these techniques in action.
Top comments (0)